A deluge of experiments

The Atlantic reports on the data deluge and its value for innovation.1 I particularly liked how Erik Brynjolfsson and Andrew McAfee, who wrote the Atlantic piece, highlight the value of experimentation for addressing causal questions — and that many of the questions we care about are causal.2

In writing about experimentation, they report that Hal Varian, Google’s Chief Economist, estimates that Google runs “100-200 experiments on any given day”. This struck me as incredibly low! I would have guessed more like 10,000 or maybe more like 100,000.

The trick of course is how one individuates experiments. Say Google has an automatic procedure whereby each ad has a (small) random set of users who are prevented from seeing it and are shown the next best ad instead. Is this one giant experiment? Or one experiment for each ad?

This is a bit of a silly question.3

But when most people — even statisticians and scientists — think of an experiment in this context, they think of something like Google or Amazon making a particular button bigger. (Maybe somebody thought making that button bigger would improve a particular metric.) They likely don’t think of automatically generating an experiment for every button, such that a random sample see that particular button slightly bigger. It’s these latter kinds of procedures that lead to thinking about tens of thousands of experiments.

That’s the real deluge of experiments.

  1. I don’t know that I would call much of it ‘innovation’. There is some outright innovation, but a lot of that is in the general strategies for using the data. There is much more gained in minor tweaking and optimization of products and services. []
  2. Perhaps they even overstate the power of simple experiments. For example, they do not mention the fact that many times the results these kinds of experiments often change over time, so that what you learned 2 months ago is no longer true. []
  3. Note that two single-factor experiments over the same population with independent random assignment can be regarded as a single experiment with two factors. []

Ethical persuasion profiling?

Persuasion profiling — estimating the effects of available influence strategies on an individual and adaptively selecting the strategies to use based on these estimates — sounds a bit scary. For many, ‘persuasion’ is a dirty word and ‘profiling’ generally doesn’t have positive connotations; together they are even worse! So why do we use this label?

In fact, Maurits Kaptein and I use this term, coined by BJ Fogg, precisely because it sounds scary. We see the potential for quite negative consequences of persuasion profiling, so we try to alert our readers to this.1

On the other hand, we also think that, not only is persuasion profiling sometimes beneficial, but there are cases where choosing not to adapt to individual differences in this way might itself be unethical.2 If a company marketing a health intervention knows that there is substantial variety in how people respond to the strategies used in the intervention — such that while the intervention has positive effects on average, it has negative effects for some — it seems like they have two ethical options.

First, they can be honest about this in their marketing, reminding consumers that it doesn’t work for everyone or even trying to market it to people it is more likely to work for. Or they could make this interactive intervention adapt to individuals — by persuasion profiling.

Actually for the first option to really work, the company needs to at least model how these responses vary by observable and marketable-to characteristics (e.g., demographics). And it may be that this won’t be enough if there is too much heterogeneity: even within some demographic buckets, the intervention may have negative effects for a good number of would-be users. On the other hand, by implementing persuasion profiling, the intervention will help more people, and the company will be able to market it more widely — and more ethically.

A simplified example that is somewhat compelling to me at least, but certainly not airtight. In another post, I’ll describe how somewhat foreseeable, but unintended, consequences should also give one pause.

  1. You might say we tried to build in a warning for anyone discussing or promoting this work. []
  2. We argue this in the paper we presented at Persuasive Technology 2010. The text below reprises some of what we said about our “Example 4” in that paper.
    Kaptein, M. & Eckles, D. (2010). Selecting effective means to any end: Futures and ethics of persuasion profiling. Proceedings of Persuasive Technology 2010, Lecture Notes in Computer Science. Springer. []

Traits, adaptive systems & dimensionality reduction

Psychologists have posited numerous psychological traits and described causal roles they ought to play in determining human behavior. Most often, the canonical measure of a trait is a questionnaire. Investigators obtain this measure for some people and analyze how their scores predict some outcomes of interest. For example, many people have been interested in how psychological traits affect persuasion processes. Traits like need for cognition (NFC) have been posited and questionnaire items developed to measure them. Among other things, NFC affects how people respond to messages with arguments for varying quality.

How useful are these traits for explanation, prediction, and adaptive interaction? I can’t address all of this here, but I want to sketch an argument for their irrelevance to adaptive interaction — and then offer a tentative rejoinder.

Interactive technologies can tailor their messages to the tastes and susceptibilities of the people interacting with and through them. It might seem that these traits should figure in the statistical models used to make these adaptive selections. After all, some of the possible messages fit for, e.g., coaching a person to meet their exercise goals are more likely to be effective for low NFC people than high NFC people, and vice versa. However, the standard questionnaire measures of NFC cannot often be obtained for most users — certainly not in commerce settings, and even people signing up for a mobile coaching service likely don’t want to answer pages of questions. On the other hand, some Internet and mobile services have other abundant data available about their users, which could perhaps be used to construct an alternative measure of these traits. The trait-based-adaptation recipe is:

  1. obtain the questionnaire measure of the trait for a sample,
  2. predict this measure with data available for many individuals (e.g., log data),
  3. use this model to construct a measure for out-of-sample individuals.

This new measure could then be used to personalize the interactive experience based on this trait, such that if a version performs well (or poorly) for people with a particular score on the trait, then use (or don’t use) that version for people with similar scores.

But why involve the trait at all? Why not just personalize the interactive experience based on the responses of similar others? Since the new measure of the trait is just based on the available behavioral, demographic, and other logged data, one could simply predict responses based on those measure. Put in geometric terms, if the goal is to project the effects of different message onto available log data, why should one project the questionnaire measure of the trait onto the available log data and then project the effects onto this projection? This seems especially unappealing if one doesn’t fully trust the questionnaire measure to be accurate or one can’t be sure about which the set of all the traits that make a (substantial) difference.

I find this argument quite intuitively appealing, and it seems to resonate with others.1 But I think there are some reasons the recipe above could still be appealing.

One way to think about this recipe is as dimensionality reduction guided by theory about psychological traits. Available log data can often be used to construct countless predictors (or “features”, as the machine learning people call them). So one can very quickly get into a situation where the effective number of parameters for a full model predicting the effects of different messages is very large and will make for poor predictions. Nothing — no, not penalized regression, not even a support vector machine — makes this problem go away. Instead, one has to rely on the domain knowledge of the person constructing the predictors (i.e., doing the “feature engineering”) to pick some good ones.

So the tentative rejoinder is this: established psychological traits might often make good dimensions to predict effects of different version of a message, intervention, or experience with. And they may “come with” suggestions about what kinds of log data might serve as measures of them. They would be expected to be reusable across settings. Thus, I think this recipe is nonetheless deserves serious attention.

  1. I owe some clarity on this to some conversations with Mike Nowak and Maurits Kaptein. []

Will the desire for other perspectives trump the “friendly world syndrome”?

Some recent journalism at NPR and The New York Times has addressed some aspects of the “friendly world syndrome” created by personalized media. A theme common to both pieces is that people want to encounter different perspectives and will use available resources to do so. I’m a bit more skeptical.

Here’s Natasha Singer at The New York Times on cascades of memes, idioms, and links through online social networks (e.g., Twitter):

If we keep seeing the same links and catchphrases ricocheting around our social networks, it might mean we are being exposed only to what we want to hear, says Damon Centola, an assistant professor of economic sociology at the Massachusetts Institute of Technology.

“You might say to yourself: ‘I am in a group where I am not getting any views other than the ones I agree with. I’m curious to know what else is out there,’” Professor Centola says.

Consider a new hashtag: diversity.

This is how Singer ends this article in which the central example is “icantdateyou” leading Egypt-related idioms as a trending topic on Twitter. The suggestion here, by Centola and Singer, is that people will notice they are getting a biased perspective of how many people agree with them and what topics people care about — and then will take action to get other perspectives.

Why am I skeptical?

First, I doubt that we really realize the extent to which media — and personalized social media in particular — bias their perception of the frequency of beliefs and events. Even though people know that fiction TV programs (e.g., cop shows) don’t aim to represent reality, heavy TV watchers (on average) substantially overestimate the percent of adult men employed in law enforcement.1 That is, the processes that produce the “friendly world syndrome” function without conscious awareness and, perhaps, even despite it. So people can’t consciously choose to seek out diverse perspectives if they don’t know they are increasingly missing them.

Second, I doubt that people actually want diversity of perspectives all that much. Even if I realize divergent views are missing from my media experience, why would I seek them out? This might be desirable for some people (but not all), and even for those, the desire to encounter people who radically disagree has its limits.

Similar ideas pop up in a NPR All Things Considered segment by Laura Sydell. This short piece (audio, transcript) is part of NPR’s “Cultural Fragmentation” series.2 The segment begins with the worry that offline bubbles are replicated online and quotes me describing how attempts to filter for personal relevance also heighten the bias towards agreement in personalized media.

But much of the piece has actually focuses on how one person — Kyra Gaunt, a professor and musician — is using Twitter to connect and converse with new and different people. Gaunt describes her experience on Twitter as featuring debate, engagement, and “learning about black people even if you’ve never seen one before”. Sydell’s commentary identifies the public nature of Twitter as an important factor in facilitating experiencing diverse perspectives:

But, even though there is a lot of conversation going on among African Americans on Twitter, Professor Gaunt says it’s very different from the closed nature of Facebook because tweets are public.

I think this is true to some degree: much of the content produced by Facebook users is indeed public, but Facebook does not make it as easily searchable or discoverable (e.g., through trending topics). But more importantly, Facebook and Twitter differ in their affordances for conversation. Facebook ties responses to the original post, which means both that the original poster controls who can reply and that everyone who replies is part of the same conversation. Twitter supports replies through the @reply mechanism, so that anyone can reply but the conversation is fragmented, as repliers and consumers often do not see all replies. So, as I’ve described, even if you follow a few people you disagree with on Twitter, you’ll most likely see replies from the other people you follow, who — more often than not — you agree with.

Gaunt’s experience with Twitter is certainly not typical. She has over 3,300 followers and follows over 2,400, so many of her posts will generate replies from people she doesn’t know well but whose replies will appear in her main feed. And — if she looks beyond her main feed to the @Mentions page — she will see the replies from even those she does not follow herself. On the other hand, her followers will likely only see her posts and replies from others they follow.3

Nonetheless, Gaunt’s case is worth considering further, as does Sydell:

SYDELL: Gaunt says she’s made new friends through Twitter.

GAUNT: I’m meeting strangers. I met with two people I had engaged with through Twitter in the past 10 days who I’d never met in real time, in what we say in IRL, in real life. And I met them, and I felt like this is my tribe.

SYDELL: And Gaunt says they weren’t black. But the key word for some observers is tribe. Although there are people like Gaunt who are using social media to reach out, some observers are concerned that she is the exception to the rule, that most of us will be content to stay within our race, class, ethnicity, family or political party.

So Professor Gaunt is likely making connections with people she would not have otherwise. But — it is at least tempting to conclude from “this is my tribe” — they are not people with radically different beliefs and values, even if they have arrived at those beliefs and values from a membership in a different race or class.

  1. Gerbner, G., Gross, L., Morgan, M., & Signorielli, N. (1980). The “Mainstreaming” of America: Violence Profile No. 11. Journal of Communication, 30(3), 10-29. []
  2. I was also interviewed for the NPR segment. []
  3. One nice feature in “new Twitter” — the recently refresh of the Twitter user interface — is that clicking on a tweet will show some of the replies to it in the right column. This may offer an easier way for followers to discover diverse replies to the people they follow. But it is also not particularly usable, as it is often difficult to even trace what a reply is a reply to. []

Aardvark’s use of Wizard of Oz prototyping to design their social interfaces

The Wall Street Journal’s Venture Capital Dispatch reports on how Aardvark, the social question asking and answering service recently acquired by Google, used a Wizard of Oz prototype to learn about how their service concept would work without building all the tech before knowing if it was any good.

Aardvark employees would get the questions from beta test users and route them to users who were online and would have the answer to the question. This was done to test out the concept before the company spent the time and money to build it, said Damon Horowitz, co-founder of Aardvark, who spoke at Startup Lessons Learned, a conference in San Francisco on Friday.

“If people like this in super crappy form, then this is worth building, because they’ll like it even more,” Horowitz said of their initial idea.

At the same time it was testing a “fake” product powered by humans, the company started building the automated product to replace humans. While it used humans “behind the curtain,” it gained the benefit of learning from all the questions, including how to route the questions and the entire process with users.

This is a really good idea, as I’ve argued before on this blog and in a chapter for developers of mobile health interventions. What better way to (a) learn about how people will use and experience your service and (b) get training data for your machine learning system than to have humans-in-the-loop run the service?

My friend Chris Streeter wondered whether this was all done by Aardvark employees or whether workers on Amazon Mechanical Turk may have also been involved, especially in identifying the expertise of the early users of the service so that the employees could route the questions to the right place. I think this highlights how different parts of a service can draw on human and non-human intelligence in a variety of ways — via a micro-labor market, using skilled employees who will gain hands-on experience with customers, etc.

I also wonder what UIs the humans-in-the-loop used to accomplish this. It’d be great to get a peak. I’d expect that these were certainly rough around the edges, as was the Aardvark customer-facing UI.

Aardvark does a good job of being a quite sociable agent (e.g., when using it via instant messaging) that also gets out of the way of the human–human interaction between question askers and answers. I wonder how the language used by humans to coordinate and hand-off questions may have played into creating a positive para-social interaction with vark.