A deluge of experiments

The Atlantic reports on the data deluge and its value for innovation.1 I particularly liked how Erik Brynjolfsson and Andrew McAfee, who wrote the Atlantic piece, highlight the value of experimentation for addressing causal questions — and that many of the questions we care about are causal.2

In writing about experimentation, they report that Hal Varian, Google’s Chief Economist, estimates that Google runs “100-200 experiments on any given day”. This struck me as incredibly low! I would have guessed more like 10,000 or maybe more like 100,000.

The trick of course is how one individuates experiments. Say Google has an automatic procedure whereby each ad has a (small) random set of users who are prevented from seeing it and are shown the next best ad instead. Is this one giant experiment? Or one experiment for each ad?

This is a bit of a silly question.3

But when most people — even statisticians and scientists — think of an experiment in this context, they think of something like Google or Amazon making a particular button bigger. (Maybe somebody thought making that button bigger would improve a particular metric.) They likely don’t think of automatically generating an experiment for every button, such that a random sample see that particular button slightly bigger. It’s these latter kinds of procedures that lead to thinking about tens of thousands of experiments.

That’s the real deluge of experiments.

  1. I don’t know that I would call much of it ‘innovation’. There is some outright innovation, but a lot of that is in the general strategies for using the data. There is much more gained in minor tweaking and optimization of products and services. []
  2. Perhaps they even overstate the power of simple experiments. For example, they do not mention the fact that many times the results these kinds of experiments often change over time, so that what you learned 2 months ago is no longer true. []
  3. Note that two single-factor experiments over the same population with independent random assignment can be regarded as a single experiment with two factors. []

Frege’s judgment stroke

Are the conditions required to assert something conventions? Can they be formalized? Donald Davidson on whether convention is foundational to communication:

But Frege was surely right when he said, “There is no word or sign in language whose function is simply to assert something.” Frege, as we know, set out to rectify matters by inventing such a sign, the turnstile ⊢’ [sometimes called Frege’s ‘judgment stroke’ or ‘assertion sign’]. And here Frege was operating on the basis of a sound principle: if there is a conventional feature of language, it can be made manifest in the symbolism. However, before Frege invented the assertion sign he ought to have asked himself why no such sign existed before. Imagine this: the actor is acting a scene in which there is supposed to be a fire. (Albee’s Tiny Alice, for example.) It is his role to imitate as persuasively as he can a man who is trying to warn others of a fire. “Fire!” he screams. And perhaps he adds, at the behest of the author, “I mean it! Look at the smoke!” etc. And now a real fire breaks out, and the actor tries vainly to warn the real audience. “Fire!” he screams, “I mean it! Look at the smoke!” etc. If only he had Frege’s assertion sign.

It should be obvious that the assertion sign would do no good, for the actor would have used it in the first place, when he was only acting. Similar reasoning should convince us that it is no help to say that the stage, or the proscenium arch, creates a conventional setting that negates the convention of assertion. For if that were so, the acting convention could be put into symbols also; and of course no actor or director would use it. The plight of the actor is always with us. There is no known, agreed upon, publically recognizable convention for making assertions. Or, for that matter, giving orders, asking questions, or making promises. These are all things we do, often successfully, and our success depends in part on our having made public our intention to do them. But it was not thanks to a convention that we succeeded.1

  1. Davidson, Donald. (1984). Communication and convention. Synthese 59 (1), 3-17. []

Lossy better than lossless in online bootstrapping

Sometimes an approximate method is in some important sense better than the “exact” one — and not just because it is easier or faster.

In statistical inference, a standard example here is the Agresti-Coull confidence interval for a binomial proportion: the “exact” interval from inverting the binomial test is conservative — giving overly wide intervals with more than the advertised coverage — but the standard (approximate) Wald interval is too narrow.1 The Agresti-Coull confidence interval, which is a modification of the Wald interval that can be justified on Bayesian grounds, has better performance than either.2

Even knowing this, like many people I suspect, I am a sucker for the “exact” over the approximate. The rest of this post gives another example of “approximate is better than exact” that Art Owen and I recently encountered in our work on bootstrapping big data with multiple dependencies.

The bootstrap is a computational method for estimating the sampling uncertainty of a statistic.3 When using bootstrap resampling or bagging, one normally draws observations without replacement from the sample to form a bootstrap replicate. Each replicate then consists of zero or more copies of each observation. If one wants to bootstrap online — that is, one observation at a time — or generally without synchronization costs in a distributed processing setting, machine learning folks have used the Poisson approximation to the binomial. This approximate bootstrap works as follows: for each observation in the sample, take a Poisson(1) draw and include that many of that observation in this replicate.

Since this is a “lossy” approximation, investigators have sometimes considered and advocated “lossless” alternatives (Lee & Clyde, 2004). The Bayesian bootstrap, in which each replicate is a n-dimensional draw from the Dirichlet, can be done exactly online: for each observation, take a Exp(1) draw and use it as the weight for that observation for this replicate.4 Being a sucker for methods labeled “lossless” or “exact”, I implemented this method in Hive at Facebook, and used it instead of the already available Poisson method. I even chortled to others, “Now we have an exact version implemented to use instead!”

But is this the best of all possible distributions for bootstrap reweighting? Might there be some other, better distribution (with mean 1 and variance 1)? In particular, what distribution minimizes our uncertainty about the variance of the mean, given the same number of bootstrap replicates?

We examined this question (Owen and Eckles, 2011, section 3.3) and found that the Poisson(1) weights give a sharper estimate of the variance than the Exp(1) weights: the lossy approximation to the standard resampling bootstrap is better than the exact Bayesian reweighting bootstrap. Interestingly, both of these are beat by using “double-or-nothing” U{0, 2} weights — that is, something close to half-sampling.5 Furthermore, the Poisson(1) and U{0, 2} versions are more general, since they don’t require using weighting (observations can duplicated) and, when using them as weights, they don’t require using floating point numbers.6

Agresti, A. and Coull, B. A. (1998). Approximate Is Better than “Exact” for Interval Estimation of Binomial Proportions. American Statistician, 5 (2): 119-126

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7:1–26.

Hesterberg, T., et al. Bootstrap Methods and Permutation Tests. In: Introduction to the Practice of Statistics.

Lee, H. K. H. and Clyde, M. A. (2004). Lossless online Bayesian bagging. Journal of Machine Learning Research, 5:143–151.

Owen, A. B. and Eckles, D. (2011). Bootstrapping data arrays of arbitrary order. http://arxiv.org/abs/1106.2125

Oza, N. (2001). Online bagging and boosting. In Systems, man and cybernetics, 2005 IEEE international conference on, volume 3, pages 2340–2345. IEEE.

  1. The Wald interval also gives zero-width intervals when observations are all either y=1 or y=0. []
  2. That is, it contains the true value closer to 100 * (1 – alpha)% of the time than the others. This example is a favorite of Art Owen’s. []
  3. Hesterberg et al. (PDF) is a good introduction to the bootstrap. Efron (1979) is the first paper on the bootstrap. []
  4. In what sense is this “exact” or “lossless”? This online method is exactly the same as the offline Bayesian bootstrap in which one takes a n-dimensional draw from the Dirichlet. On the other hand, the Poisson(1) method is often seen as an online approximation to the offline bootstrap. []
  5. Why is this? See the paper, but the summary is that U{0, 2} has the lowest kurtosis, and Poisson(1) has lower kurtosis than Exp(1). []
  6. This is especially useful if one is doing factorial weighting, as we do in the paper, where multiplication of weights for different grouping factors is required. []

Against between-subjects experiments

A less widely known reason for using within-subjects experimental designs in psychological science. In a within-subjects experiment, each participant experiences multiple conditions (say, multiple persuasive messages), while in a between-subjects experiment, each participant experiences only one condition.

If you ask a random social psychologist, “Why would you run a within-subjects experiment instead of a between-subjects experiments?”, the most likely answer is “power” — within-subjects experiments provide more power. That is, with the same number of participants, within-subjects experiments allow investigators to more easily tell that observed differences between conditions are not due to chance.1

Why do within-subjects experiments increase power? Because responses by the same individual are generally dependent; more specifically, they are often positively correlated. Say an experiment involves evaluating products, people, or policy proposals under different conditions, such as the presence of different persuasive cues or following different primes. It is often the case that participants who rate an item high on a scale under one condition will rate other items high on that scale under other condition. Or participants with short response times for one task will have relatively short response times for another task. Et cetera. This positive association might be due to stable characteristics of people or transient differences such as mood. Thus, the increase in power is due to heterogeneity in how individuals respond to the stimuli.

However, this advantage of within-subjects designs is frequently overridden in social psychology by the appeal of between-subjects designs. The latter are widely regarded as “cleaner” as they avoid carryover effects — in which one condition may effect responses to subsequent conditions experienced by the same participant. They can also be difficult to design when studies involve deception — even just deception about the purpose of the study — and one-shot encounters. Because of this, between-subjects designs are much more common in social psychology than within-subjects designs: investigators don’t regard the complexity of conducting within-subjects designs as worth it for the gain in power, which they regard as the primary advantage of within-subjects designs.

I want to point out another — but related — reason for using within-subjects designs: between-subjects experiments often do not allow consistent estimation of the parameters of interest. Now, between-subjects designs are great for estimating average treatment effects (ATEs), and ATEs can certainly be of great interest. For example, if one is interested how a design change to a web site will effect sales, an ATE estimated from an A-B test with the very same population will be useful. But this isn’t enough for psychological science for two reasons. First, social psychology experiments are usually very different from the circumstances of potential application: the participants are undergraduate students in psychology and the manipulations and situations are not realistic. So the ATE from a psychology experiment might not say much about the ATE for a real intervention. Second, social psychologists regard themselves as building and testing theories about psychological processes. By their nature, psychological processes occur within individuals. So an ATE won’t do — in fact, it can be a substantially biased estimate of the psychological parameter of interest.

To illustrate this problem, consider an example where the outcome of an experiment is whether the participant says that a job candidate should be hired. For simplicity, let’s say this is a binary outcome: either they say to hire them or not. Their judgements might depend on some discrete scalar X. Different participants may have different thresholds for hiring the applicant, but otherwise be effected by X in the same way. In a logistic model, that is, each participant has their own intercept but all the slopes are the same. This is depicted with the grey curves below.2

Comparison of marginal and conditional logit functions

Marginal (blue) and conditional (grey) expectation functions

These grey curves can be estimated if one has multiple observations per participant at different values of X. However, in a between-subjects experiment, this is not the case. As an estimate of a parameter of the psychological process common to all the participants, the estimated slope from a between-subjects experiment will be biased. This is clear in the figure above: the blue curve (the marginal expectation function) is shallower than any of the individual curves.

More generally, between-subjects experiments are good for estimating ATEs and making striking demonstrations. But they are often insufficient for investigating psychological processes since any heterogeneity — even only in intercepts — produces biased estimates of the parameters of psychological processes, including parameters that are universal in the population.

I see this as a strong motivation for doing more within-subjects experiments in social psychology. Unlike the power motivation for within-subjects designs, this isn’t solved by getting a larger sample of individuals. Instead, investigators need to think carefully about whether their experiments estimate any quantity of interest when there is substantial heterogeneity — as there generally is.3

  1. And to more precisely estimate these differences. Though social psychologist often don’t care about estimation, since many social psychological theories are only directional. []
  2. This example is very directly inspired by Alan Agresti’s Categorical Data Analysis, p. 500. []
  3. The situation is made a bit “better” by the fact that social psychologists are often only concerned with determining the direction of effects, so maybe aren’t worried that their estimates of parameters are biased. Of course, this is a problem in itself if the direction of the effect varies by individual. Here I have only treated the simpler case of universal function subject to a random shift. []

Marginal evidence for psychological processes

Some comments on problems with investigating psychological processes using estimates of average (i.e. marginal) effects. Hence the play on words in the title.

Social psychology makes a lot of being theoretical. This generally means not just demonstrating an effect, but providing evidence about the psychological processes that produce it. Psychological processes are, it is agreed, intra-individual processes. To tell a story about a psychological process is to posit something going on “inside” people. It is quite reasonable that this is how social psychology should work — and it makes it consistent with much of cognitive psychology as well.

But the evidence that social psychology uses to support these theories about these intra-individual processes is largely evidence about effects of experimental conditions (or, worse, non-manipulated measures) averaged across many participants. That is, it is using estimates of marginal effects as evidence of conditional effects. This is intuitively problematic. Now, there is no problem when using experiments to study effects and processes that are homogenous in the population. But, of course, they aren’t: heterogeneity abounds. There is variation in how factors affect different people. This is why the causal inference literature has emphasized the differences among the average treatment effect, (average) treatment effect on the treated, local average treatment effect, etc.

Not only is this disconnect between marginal evidence and conditional theory trouble in the abstract, we know it has already produced many problems in the social psychology literature.1 Baron and Kenny (1986) is the most cited paper published in the Journal of Personality and Social Psychology, the leading journal in the field. It paints an rosy picture of what it is like to investigate psychological processes. The methods of analysis it proposes for investigating processes are almost ubiquitous in social psych.2 The trouble is that this approach is severely biased in the face of heterogeneity in the processes under study. This is usually described as problem of correlated error terms, omitted-variables bias, or adjusting for post-treatment variables. This is all true. But, in the most common uses, it is perhaps more natural to think of it as a problem of mixing up marginal (i.e. average) and conditional effects.3

What’s the solution? First, it is worth saying that average effects are worth investigating! Especially if you are evaluating a intervention or drug that might really be used — or if you are working at another level of analysis than psychology. But if psychological processes are your thing, you must do better.

Social psychologists sometimes do condition on individual characteristics, but often this is a measure of a single trait (e.g., need for cognition) that cannot plausibly exhaust all (or even much) of the heterogeneity in the effects under study. Without much larger studies, they cannot condition on more characteristics because of estimation problems (too many parameters for their N). So there is bound to be substantial heterogeneity.

Beyond this, I think social psychology could benefit from a lot more within-subjects experiments. Modern statistical computing (e.g., tools for fitting mixed-effects or multilevel models) makes it possible — even easy — to use such data to estimate effects of the manipulated factors for each participant. If they want to make credible claims about processes, then within-subjects designs — likely with many measurements of each person — are a good direction to more thoroughly explore.

  1. The situation is bad enough that I (and some colleagues) certainly don’t even take many results in social psych as more than providing a possibly interesting vocabulary. []
  2. Luckily, my sense is that they are waning a bit, partially because of illustrations of the method’s bias. []
  3. To translate to the terms used before, note that we want to condition on unobserved (latent) heterogeneity. If one doesn’t, then there is omitted variable bias. This can be done with models designed for this purpose, such as random effects models. []