Using covariates to increase the precision of randomized experiments

A simple difference-in-means estimator of the average treatment effect (ATE) from a randomized experiment is a good start, but may often leave a lot of additional precision on the table. Even if you haven’t used covariates (pre-treatment variables observed for your units) in the design of the experiment (e.g., this is often difficult to do in streaming random assignment in Internet experiments; see our paper), you can use them to increase the precision of your estimates in the analysis phase. Here are some simple ways to do that. I’m not including a whole range of more sophisticated/complicated approaches. And, of course, if you don’t have any covariates for the units in your experiments — or they aren’t very predictive of your outcome, this all won’t help you much.


Prior to the experiment you could do stratified randomization (i.e., blocking) according to some categorical covariate (making sure that there are the same number of units of, e.g., each gender, country, or paid/free account type in each treatment). But you can also do something similar after the fact, which is known as post-stratification: compute an ATE within each stratum and then combine the stratum-level estimates, weighting by the total number of observations in each stratum. For details, and proofs showing this often won’t be much worse than blocking, consult Miratrix, Sekhon & Yu (2013).
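The after-the-fact estimator described above (stratum-level ATEs combined with weights proportional to stratum size) can be sketched in a few lines. This is a minimal illustration, not a full implementation: the function name, the simulated data, and the three-level covariate are all assumptions made for the example, and it computes a point estimate only.

```python
import numpy as np

def post_stratified_ate(y, treated, stratum):
    """Weighted average of within-stratum differences in means.

    Weights are each stratum's share of all observations.
    """
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    stratum = np.asarray(stratum)
    n = len(y)
    estimate = 0.0
    for s in np.unique(stratum):
        in_s = stratum == s
        y_t = y[in_s & treated]
        y_c = y[in_s & ~treated]
        if len(y_t) == 0 or len(y_c) == 0:
            raise ValueError(f"stratum {s!r} lacks treated or control units")
        estimate += (in_s.sum() / n) * (y_t.mean() - y_c.mean())
    return estimate

# Simulated experiment (hypothetical): the covariate strongly predicts the outcome.
rng = np.random.default_rng(0)
n = 2000
stratum = rng.integers(0, 3, size=n)
treated = rng.random(n) < 0.5
y = 2.0 * stratum + 1.0 * treated + rng.normal(size=n)  # true ATE is 1.0

naive = y[treated].mean() - y[~treated].mean()
adjusted = post_stratified_ate(y, treated, stratum)
print(f"difference in means: {naive:.3f}, post-stratified: {adjusted:.3f}")
```

Both estimators target the same ATE here; the post-stratified one simply removes the part of the noise that comes from chance imbalance in the covariate across arms.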

Regression adjustment with a single covariate

Often what you most want to adjust for is a single numeric covariate,[1] such as a lagged version of your outcome (i.e., your outcome from some convenient period before treatment). You can simply use ordinary least squares regression to adjust for this covariate by regressing your outcome on both a treatment indicator and the covariate. Even better (particularly if treatment and control are of different sizes by design), you should regress your outcome on: a treatment indicator, the covariate centered such that it has mean zero, and the product of the two.[2] The coefficient on the treatment indicator is then your estimate of the ATE. Asymptotically (and usually in practice with a reasonably sized experiment), this will increase precision and it is pretty easy to do. For more on this, see Lin (2012).
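As a rough sketch of the interacted regression described above, the following fits the model by ordinary least squares with NumPy and reads off the coefficient on the treatment indicator. The function name and the simulated lagged outcome are assumptions for illustration, and standard errors are deliberately left out of this sketch.

```python
import numpy as np

def interacted_regression_ate(y, treated, x):
    """OLS of y on treatment, the mean-centered covariate, and their product.

    Returns the coefficient on the treatment indicator.
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(treated, dtype=float)
    xc = np.asarray(x, dtype=float)
    xc = xc - xc.mean()  # center so the treatment coefficient is the ATE estimate
    design = np.column_stack([np.ones_like(y), t, xc, t * xc])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

# Simulated experiment (hypothetical): x plays the role of a lagged outcome.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
treated = rng.random(n) < 0.3  # unequal arm sizes by design
y = x + 0.5 * treated + rng.normal(size=n)  # true ATE is 0.5

naive = y[treated].mean() - y[~treated].mean()
adjusted = interacted_regression_ate(y, treated, x)
print(f"difference in means: {naive:.3f}, regression-adjusted: {adjusted:.3f}")
```

Centering matters here: without it, the coefficient on the treatment indicator would be the estimated effect at a covariate value of zero rather than at the sample mean.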

Higher-dimensional adjustment

If you have a lot more covariates to adjust for, you may want to use some kind of penalized regression. For example, you could use the Lasso (L1-penalized regression); see Bloniarz et al. (2016).

Use out-of-sample predictions from any model

Maybe you instead want to use neural nets, trees, or an ensemble of a bunch of models? That’s fine, but if you want to be able to do valid statistical inference (i.e., get 95% confidence intervals that actually cover 95% of the time), you have to be careful. The easiest way to be careful in many Internet industry settings is just to use historical data to train the model and then get out-of-sample predictions Yhat from that model for your present experiment. You then just subtract Yhat from Y and use the simple difference-in-means estimator on the result. Aronow and Middleton (2013) provide some technical details and extensions. A simple extension that makes this more robust to changes over time is to use this out-of-sample Yhat as a covariate, as described above.[3]
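To make the out-of-sample recipe concrete, here is a minimal sketch. A polynomial fit stands in for whatever model you prefer (neural nets, trees, an ensemble); the outcome process, sample sizes, and variable names are all assumptions invented for the simulation. The key point is only that the model never sees data from the experiment.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    """Hypothetical outcome process: a covariate x and an untreated outcome."""
    x = rng.normal(size=n)
    y0 = np.sin(x) + 0.5 * x**2 + rng.normal(scale=0.5, size=n)
    return x, y0

# 1. Train on historical data only (no units from the present experiment).
x_hist, y_hist = simulate(5000)
model = np.polynomial.Polynomial.fit(x_hist, y_hist, deg=4)

# 2. Run the experiment and get out-of-sample predictions Yhat.
x, y0 = simulate(2000)
treated = rng.random(2000) < 0.5
y = y0 + 0.3 * treated  # true ATE is 0.3
y_hat = model(x)

# 3. Difference in means on Y - Yhat instead of on Y.
resid = y - y_hat
naive = y[treated].mean() - y[~treated].mean()
adjusted = resid[treated].mean() - resid[~treated].mean()
print(f"difference in means: {naive:.3f}, using Y - Yhat: {adjusted:.3f}")
print(f"variance of Y: {y.var():.3f}, variance of Y - Yhat: {resid.var():.3f}")
```

Because Yhat is a fixed function of pre-treatment covariates, subtracting it changes the variance of the outcome but not the expected difference between arms.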

  1. As Winston Lin notes in the comments and as is implicit in my comparison with post-stratification, as long as the number of covariates is small and not growing with sample size, the same asymptotic results apply.
  2. Note that if the covariate is binary or, more generally, categorical, then this exactly coincides with the post-stratified estimator considered above.
  3. I added this sentence in response to Winston Lin’s comment.

Adjusting biased samples

Nate Cohn at The New York Times reports on how one 19-year-old black man is having an outsized impact on the USC/LAT panel’s estimates of support for Clinton in the U.S. presidential election. It happens that the sample doesn’t have enough other people with similar demographics and voting history (covariates) to this panelist, so he is getting a large weight in computing the overall averages for the populations of interest, such as likely voters:

There is a 19-year-old black man in Illinois who has no idea of the role he is playing in this election.

He is sure he is going to vote for Donald J. Trump.

And he has been held up as proof by conservatives — including outlets like Breitbart News and The New York Post — that Mr. Trump is excelling among black voters. He has even played a modest role in shifting entire polling aggregates, like the Real Clear Politics average, toward Mr. Trump.

As usual, Andrew Gelman suggests that the solution to this problem is a technique he calls “Mr. P” (multilevel regression and post-stratification). I wanted to comment on some practical tradeoffs among common methods. Maybe these are useful notes, which can be read alongside another nice piece by Nate Cohn on how different adjustment methods can yield very different polling results.


Complete post-stratification is when you compute the mean outcome (e.g., support for Clinton) for each stratum of people, such as 18-24-year-old black men, defined by the covariates X. Then you combine these, weighting by the size of each group in the population of interest. This really only works when you have a lot of data compared with the number of strata, and the number of strata grows very fast in the number of covariates you want to adjust for.
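A toy version of complete post-stratification, as described above, might look like the following. The function name, the two cells, and the population counts are hypothetical; the example is rigged so that one cell contains a single respondent, which is the situation the panelist in the story is in.

```python
import numpy as np

def post_stratify(y, cell, population_counts):
    """Combine per-cell sample means using each cell's population share.

    population_counts maps a cell label to its count in the target population.
    """
    y = np.asarray(y, dtype=float)
    cell = np.asarray(cell)
    total = sum(population_counts.values())
    estimate = 0.0
    for label, count in population_counts.items():
        in_cell = cell == label
        if not in_cell.any():
            raise ValueError(f"no respondents in cell {label!r}")
        estimate += (count / total) * y[in_cell].mean()
    return estimate

# Hypothetical sample: four respondents in cell "a", one in cell "b".
y = [1, 1, 1, 0, 0]
cell = ["a", "a", "a", "a", "b"]
population_counts = {"a": 50, "b": 50}  # assumed known from, e.g., a census

print("raw sample mean:", np.mean(y))
print("post-stratified:", post_stratify(y, cell, population_counts))
```

Half of the post-stratified estimate rests on the lone respondent in cell "b", so flipping that one answer would move the estimate by 0.5.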

Modeling sample inclusion and weighting

When people talk about survey weighting, often what they mean is weighting by the inverse of the estimated probability of inclusion in the sample. You model selection into the survey S using, e.g., logistic regression on the covariates X and some interactions. This can be done with regularization (i.e., priors, shrinkage) since many of the terms in the model might be estimated with very few observations. Especially without enough regularization, this can result in very large weights when you don’t have enough of some particular type in your sample.
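Here is a small sketch of inverse-probability weighting with a regularized selection model. It assumes, for the sake of the simulation, that covariates are observed for everyone in a population frame so that inclusion S can be modeled directly; the ridge-penalized Newton solver, the single binary covariate, and all numbers are illustrative assumptions rather than a recommended production setup.

```python
import numpy as np

def fit_logistic_ridge(X, s, penalty=1.0, n_iter=50):
    """Ridge-penalized logistic regression fit by Newton's method.

    The intercept is left unpenalized. Returns [intercept, coefficients...].
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    P = penalty * np.eye(X1.shape[1])
    P[0, 0] = 0.0
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        grad = X1.T @ (s - p) - P @ beta
        hess = (X1 * (p * (1.0 - p))[:, None]).T @ X1 + P
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Hypothetical population: young people are rarely sampled and answer differently.
rng = np.random.default_rng(2)
N = 20000
young = rng.integers(0, 2, size=N)
y = (rng.random(N) < 0.3 + 0.4 * young).astype(float)
sampled = rng.random(N) < np.where(young == 1, 0.02, 0.10)

# Model selection into the sample, then weight by the inverse fitted probability.
X = young[:, None].astype(float)
beta = fit_logistic_ridge(X, sampled.astype(float))
p_hat = 1.0 / (1.0 + np.exp(-(beta[0] + X[:, 0] * beta[1])))
weights = 1.0 / p_hat[sampled]

raw_mean = y[sampled].mean()
weighted_mean = np.average(y[sampled], weights=weights)
population_mean = y.mean()
print(f"population: {population_mean:.3f}, raw: {raw_mean:.3f}, weighted: {weighted_mean:.3f}")
print(f"largest weight: {weights.max():.1f}, smallest weight: {weights.min():.1f}")
```

The rarely sampled group gets weights several times larger than everyone else, which is exactly how a handful of respondents can end up driving an estimate.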

Modeling the outcome and integrating

You fit a model predicting the response (e.g., support for Clinton) Y with the covariates X. You regularize this model in some way so that the estimate for each person is going to “borrow strength” from other people with similar Xs. So now you have a fitted response Yhat for each unique X. To get an estimate for a particular population of interest, integrate out over the distribution of X in that population. Gelman’s preferred version “Mr. P” uses a multilevel (aka hierarchical Bayes, random effects) model for the outcome, but other regularization methods may often be appealing.
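The two steps above (regularized fitted values per cell, then integrating over the population distribution of X) can be illustrated with a deliberately crude stand-in for a multilevel model: each cell mean is shrunk toward the grand mean by an amount that depends on the cell's sample size. This is not Mr. P itself; the shrinkage rule, the `prior_strength` constant, and the data are assumptions chosen to keep the sketch short.

```python
import numpy as np

def shrunken_cell_means(y, cell, labels, prior_strength=5.0):
    """Partial pooling: shrink each cell mean toward the grand mean.

    Cells with few respondents are pulled harder toward the grand mean.
    """
    y = np.asarray(y, dtype=float)
    cell = np.asarray(cell)
    grand_mean = y.mean()
    fitted = {}
    for label in labels:
        in_cell = cell == label
        n_cell = in_cell.sum()
        cell_mean = y[in_cell].mean() if n_cell > 0 else grand_mean
        w = n_cell / (n_cell + prior_strength)
        fitted[label] = w * cell_mean + (1.0 - w) * grand_mean
    return fitted

def integrate_over_population(fitted, population_counts):
    """Average the fitted values over the population distribution of cells."""
    total = sum(population_counts.values())
    return sum(population_counts[c] / total * fitted[c] for c in population_counts)

# Hypothetical sample: ten respondents in cell "a", a single one in cell "b".
y = [1] * 6 + [0] * 4 + [0]
cell = ["a"] * 10 + ["b"]
population_counts = {"a": 50, "b": 50}

fitted = shrunken_cell_means(y, cell, population_counts.keys())
estimate = integrate_over_population(fitted, population_counts)
print(fitted, estimate)
```

The single respondent in cell "b" no longer determines that cell's estimate alone, because the fitted value borrows from the rest of the sample. The price is that the result depends on the outcome model and its regularization.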

This is nice because there can be some substantial efficiency gains (i.e., more precision) by making use of the outcome information. But there are also some practical issues. First, you need a model for each outcome in your analysis, rather than just having weights you could use for all outcomes and all recodings of outcomes. Second, the implicit weights that this process puts on each observation can vary from outcome to outcome, or even for different codings (e.g., a dichotomization of answers on a numeric scale) of the same outcome. In a reply to his post, Gelman notes that you would need a different model for each outcome, but that some joint model for all outcomes would be ideal. Of course, the latter joint modeling approach, while appealing in some ways (many statisticians love having one model that subsumes everything…), means that adding a new outcome to the analysis would change all prior results.


Side note: Other methods, not described here, also work towards the aim of matching characteristics of the population distribution (e.g., iterative proportional fitting / raking). They strike me as overly specialized and not easy to adapt and extend.

A deluge of experiments

The Atlantic reports on the data deluge and its value for innovation.[1] I particularly liked how Erik Brynjolfsson and Andrew McAfee, who wrote the Atlantic piece, highlight the value of experimentation for addressing causal questions, and that many of the questions we care about are causal.[2]

In writing about experimentation, they report that Hal Varian, Google’s Chief Economist, estimates that Google runs “100-200 experiments on any given day”. This struck me as incredibly low! I would have guessed more like 10,000 or maybe more like 100,000.

The trick of course is how one individuates experiments. Say Google has an automatic procedure whereby each ad has a (small) random set of users who are prevented from seeing it and are shown the next best ad instead. Is this one giant experiment? Or one experiment for each ad?

This is a bit of a silly question.[3]

But when most people, even statisticians and scientists, think of an experiment in this context, they think of something like Google or Amazon making a particular button bigger. (Maybe somebody thought making that button bigger would improve a particular metric.) They likely don’t think of automatically generating an experiment for every button, such that a random sample of users sees that particular button slightly bigger. It’s these latter kinds of procedures that lead to thinking about tens of thousands of experiments.

That’s the real deluge of experiments.

  1. I don’t know that I would call much of it ‘innovation’. There is some outright innovation, but a lot of that is in the general strategies for using the data. There is much more gained in minor tweaking and optimization of products and services.
  2. Perhaps they even overstate the power of simple experiments. For example, they do not mention the fact that the results of these kinds of experiments often change over time, so that what you learned 2 months ago is no longer true.
  3. Note that two single-factor experiments over the same population with independent random assignment can be regarded as a single experiment with two factors.

Against between-subjects experiments

A less widely known reason for using within-subjects experimental designs in psychological science. In a within-subjects experiment, each participant experiences multiple conditions (say, multiple persuasive messages), while in a between-subjects experiment, each participant experiences only one condition.

If you ask a random social psychologist, “Why would you run a within-subjects experiment instead of a between-subjects experiment?”, the most likely answer is “power”: within-subjects experiments provide more power. That is, with the same number of participants, within-subjects experiments allow investigators to more easily tell that observed differences between conditions are not due to chance.[1]

Why do within-subjects experiments increase power? Because responses by the same individual are generally dependent; more specifically, they are often positively correlated. Say an experiment involves evaluating products, people, or policy proposals under different conditions, such as the presence of different persuasive cues or following different primes. It is often the case that participants who rate an item high on a scale under one condition will rate other items high on that scale under other conditions. Or participants with short response times for one task will have relatively short response times for another task. Et cetera. This positive association might be due to stable characteristics of people or transient differences such as mood. Thus, the increase in power is due to heterogeneity in how individuals respond to the stimuli.
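A short worked calculation makes the role of the correlation explicit. It is a simplified sketch under assumptions that are not in the text above: two conditions, n participants in total, the same response variance \sigma^2 in both conditions, correlation \rho between one participant's two responses, and no carryover effects.

```latex
\begin{align*}
% Between-subjects: n/2 participants per condition, independent groups
\operatorname{Var}\left(\bar{Y}_1 - \bar{Y}_2\right)_{\text{between}}
  &= \frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2} = \frac{4\sigma^2}{n} \\
% Within-subjects: every participant responds under both conditions
\operatorname{Var}\left(Y_{i1} - Y_{i2}\right)
  &= \sigma^2 + \sigma^2 - 2\rho\sigma^2 = 2\sigma^2(1 - \rho) \\
\operatorname{Var}\left(\bar{Y}_1 - \bar{Y}_2\right)_{\text{within}}
  &= \frac{2\sigma^2(1 - \rho)}{n} \\
% Ratio of the two variances
\frac{\operatorname{Var}_{\text{within}}}{\operatorname{Var}_{\text{between}}}
  &= \frac{1 - \rho}{2}
\end{align*}
```

Even with \rho = 0, the within-subjects design halves the variance because each participant contributes two responses; with \rho = 0.5, the variance is a quarter of the between-subjects value.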

However, this advantage of within-subjects designs is frequently overridden in social psychology by the appeal of between-subjects designs. The latter are widely regarded as “cleaner” as they avoid carryover effects, in which one condition may affect responses to subsequent conditions experienced by the same participant. Within-subjects experiments can also be difficult to design when studies involve deception (even just deception about the purpose of the study) and one-shot encounters. Because of this, between-subjects designs are much more common in social psychology than within-subjects designs: investigators don’t regard the complexity of conducting within-subjects designs as worth it for the gain in power, which they regard as the primary advantage of within-subjects designs.

I want to point out another (but related) reason for using within-subjects designs: between-subjects experiments often do not allow consistent estimation of the parameters of interest. Now, between-subjects designs are great for estimating average treatment effects (ATEs), and ATEs can certainly be of great interest. For example, if one is interested in how a design change to a web site will affect sales, an ATE estimated from an A-B test with the very same population will be useful. But this isn’t enough for psychological science for two reasons. First, social psychology experiments are usually very different from the circumstances of potential application: the participants are undergraduate students in psychology and the manipulations and situations are not realistic. So the ATE from a psychology experiment might not say much about the ATE for a real intervention. Second, social psychologists regard themselves as building and testing theories about psychological processes. By their nature, psychological processes occur within individuals. So an ATE won’t do; in fact, it can be a substantially biased estimate of the psychological parameter of interest.

To illustrate this problem, consider an example where the outcome of an experiment is whether the participant says that a job candidate should be hired. For simplicity, let’s say this is a binary outcome: either they say to hire them or not. Their judgements might depend on some discrete scalar X. Different participants may have different thresholds for hiring the applicant, but otherwise be affected by X in the same way. In a logistic model, that is, each participant has their own intercept but all the slopes are the same. This is depicted with the grey curves below.[2]

[Figure: Comparison of marginal and conditional logit functions. Caption: Marginal (blue) and conditional (grey) expectation functions.]

These grey curves can be estimated if one has multiple observations per participant at different values of X. However, in a between-subjects experiment, this is not the case. As an estimate of a parameter of the psychological process common to all the participants, the estimated slope from a between-subjects experiment will be biased. This is clear in the figure above: the blue curve (the marginal expectation function) is shallower than any of the individual curves.
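The attenuation can be checked numerically without any data. The sketch below builds a family of logistic curves that share one slope but have different intercepts, averages them to get the marginal curve, and compares slopes on the logit scale. The slope of 1, the grid of intercepts, and the range of X are arbitrary assumptions for illustration.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

slope = 1.0                              # common within-person slope (assumed)
intercepts = np.linspace(-4.0, 4.0, 41)  # hypothetical spread of personal thresholds
x = np.linspace(-3.0, 3.0, 61)

# One conditional curve per participant, then the average across participants.
conditional = expit(intercepts[:, None] + slope * x[None, :])
marginal = conditional.mean(axis=0)

# Slope of the marginal curve on the logit scale, by central difference at x = 0.
mid = len(x) // 2
marginal_logit = logit(marginal)
marginal_slope = (marginal_logit[mid + 1] - marginal_logit[mid - 1]) / (x[mid + 1] - x[mid - 1])

print(f"conditional slope: {slope:.2f}, marginal slope near x = 0: {marginal_slope:.2f}")
```

Every participant in this construction has exactly the same slope, yet the averaged curve has a noticeably smaller one. Increasing the number of participants would not change that; only the spread of intercepts does.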

More generally, between-subjects experiments are good for estimating ATEs and making striking demonstrations. But they are often insufficient for investigating psychological processes since any heterogeneity — even only in intercepts — produces biased estimates of the parameters of psychological processes, including parameters that are universal in the population.

I see this as a strong motivation for doing more within-subjects experiments in social psychology. Unlike the power motivation for within-subjects designs, this isn’t solved by getting a larger sample of individuals. Instead, investigators need to think carefully about whether their experiments estimate any quantity of interest when there is substantial heterogeneity, as there generally is.[3]

  1. And to more precisely estimate these differences. Though social psychologists often don’t care about estimation, since many social psychological theories are only directional.
  2. This example is very directly inspired by Alan Agresti’s Categorical Data Analysis, p. 500.
  3. The situation is made a bit “better” by the fact that social psychologists are often only concerned with determining the direction of effects, so maybe aren’t worried that their estimates of parameters are biased. Of course, this is a problem in itself if the direction of the effect varies by individual. Here I have only treated the simpler case of a universal function subject to a random shift.

Marginal evidence for psychological processes

Some comments on problems with investigating psychological processes using estimates of average (i.e. marginal) effects. Hence the play on words in the title.

Social psychology makes a lot of being theoretical. This generally means not just demonstrating an effect, but providing evidence about the psychological processes that produce it. Psychological processes are, it is agreed, intra-individual processes. To tell a story about a psychological process is to posit something going on “inside” people. It is quite reasonable that this is how social psychology should work — and it makes it consistent with much of cognitive psychology as well.

But the evidence that social psychology uses to support these theories about these intra-individual processes is largely evidence about effects of experimental conditions (or, worse, non-manipulated measures) averaged across many participants. That is, it is using estimates of marginal effects as evidence of conditional effects. This is intuitively problematic. Now, there is no problem when using experiments to study effects and processes that are homogeneous in the population. But, of course, they aren’t: heterogeneity abounds. There is variation in how factors affect different people. This is why the causal inference literature has emphasized the differences among the average treatment effect, (average) treatment effect on the treated, local average treatment effect, etc.

Not only is this disconnect between marginal evidence and conditional theory trouble in the abstract, we know it has already produced many problems in the social psychology literature.[1] Baron and Kenny (1986) is the most cited paper published in the Journal of Personality and Social Psychology, the leading journal in the field. It paints a rosy picture of what it is like to investigate psychological processes. The methods of analysis it proposes for investigating processes are almost ubiquitous in social psych.[2] The trouble is that this approach is severely biased in the face of heterogeneity in the processes under study. This is usually described as a problem of correlated error terms, omitted-variables bias, or adjusting for post-treatment variables. This is all true. But, in the most common uses, it is perhaps more natural to think of it as a problem of mixing up marginal (i.e. average) and conditional effects.[3]

What’s the solution? First, it is worth saying that average effects are worth investigating! Especially if you are evaluating an intervention or drug that might really be used, or if you are working at another level of analysis than psychology. But if psychological processes are your thing, you must do better.

Social psychologists sometimes do condition on individual characteristics, but often this is a measure of a single trait (e.g., need for cognition) that cannot plausibly exhaust all (or even much) of the heterogeneity in the effects under study. Without much larger studies, they cannot condition on more characteristics because of estimation problems (too many parameters for their N). So there is bound to be substantial heterogeneity.

Beyond this, I think social psychology could benefit from a lot more within-subjects experiments. Modern statistical computing (e.g., tools for fitting mixed-effects or multilevel models) makes it possible — even easy — to use such data to estimate effects of the manipulated factors for each participant. If they want to make credible claims about processes, then within-subjects designs — likely with many measurements of each person — are a good direction to more thoroughly explore.

  1. The situation is bad enough that I (and some colleagues) certainly don’t even take many results in social psych as more than providing a possibly interesting vocabulary.
  2. Luckily, my sense is that they are waning a bit, partially because of illustrations of the method’s bias.
  3. To translate to the terms used before, note that we want to condition on unobserved (latent) heterogeneity. If one doesn’t, then there is omitted variable bias. This can be done with models designed for this purpose, such as random effects models.