This post is cross-posted from Andrew Gelman’s Statistical Modeling, Causal Inference, and Social Science. There’s more discussion over there.
In a randomized experiment (i.e. RCT, A/B test, etc.) units are randomly assigned to treatments (i.e. conditions, variants, etc.). Let’s focus on Bernoulli randomized experiments for now, where each unit is independently assigned to treatment with probability q and to control otherwise.
Thomas Aquinas argued that God’s knowledge of the world upon creation of it is a kind of practical knowledge: knowing something is the case because you made it so. One might think that that in randomized experiments we have a kind of practical knowledge: we know that treatment was randomized because we randomized it. But unlike Aquinas’s God, we are not infallible, we often delegate, and often we are in the position of consuming reports on other people’s experiments.
So it is common to perform and report some tests of the null hypothesis that this process did indeed generate the data. For example, one can test that the sample sizes in treatment and control aren’t inconsistent with this. This is common in at least in the Internet industry (see, e.g., Kohavi, Tang & Xu on “sample ratio mismatch”), where it is often particularly easy to automate. Perhaps more widespread is testing whether the means of pre-treatment covariates in treatment and control are distinguishable; these are often called balance tests. One can do per-covariate tests, but if there are a lot of covariates then this can generate confusing false positives. So often one might use some test for all the covariates jointly at once.
Some experimentation systems in industry automate various of these tests and, if they reject at, say, p < 0.001, show prominent errors or even watermark results so that they are difficult to share with others without being warned. If we’re good Bayesians, we probably shouldn’t give up on our prior belief that treatment was indeed randomized just because some p-value is less than 0.05. But if we’ve got p < 1e-6, then — for all but the most dogmatic prior beliefs that randomization occurred as planned — we’re going to be doubtful that everything is alright and move to investigate.
In my own digital field and survey experiments, we indeed run these tests. Some of my papers report the results, but I know there’s at least one that doesn’t (though we did the tests) and another where we just state they were all not significant (and this can be verified with the replication materials). My sense is that reporting balance tests of covariate means is becoming even more of a norm in some areas, such as applied microeconomics and related areas. And I think that’s a good thing.
Interestingly, it seems that not everyone feels this way.
In particular, methodologists working in epidemiology, medicine, and public health sometimes refer to a “Table 1 fallacy” and advocate against performing and/or reporting these statistical tests. Sometimes the argument is specifically about clinical trials, but often it is more generally randomized experiments.
Stephen Senn argues in this influential 1994 paper:
Indeed the practice [of statistical testing for baseline balance] can accord neither with the logic of significance tests nor with that of hypothesis tests for the following are two incontrovertible facts about a randomized clinical trial:
1. over all randomizations the groups are balanced;
2. for a particular randomization they are unbalanced.
Now, no ‘significant imbalance’ can cause 1 to be untrue and no lack of a significant balance can make 2 untrue. Therefore the only reason to employ such a test must be to examine the process of randomization itself. Thus a significant result should lead to the decision that the treatment groups have not been randomized, and hence either that the trialist has practised deception and has dishonestly manipulated the allocation or that some incompetence, such as not accounting for all patients, has occurred.
In my opinion this is not the usual reason why such tests are carried out (I believe the reason is to make a statement about the observed allocation itself) and I suspect that the practice has originated through confused and false analogies with significance and hypothesis tests in general.
This highlights precisely where my view diverges: indeed the reason I think such tests should be performed is because I think that they could lead to the conclusion that “the treatment groups have not been randomized”. I wouldn’t say this always rises to the level of “incompetence” or “deception”, at least in the applications I’m familiar with. (Maybe I’ll write about some of these reasons at another time — some involve interference, some are analogous to differential attrition.)
It seems that experimenters and methodologists in social science and the Internet industry think that broken randomization is more likely, while methodologists mainly working on clinical trails put a very, very small prior probability on such events. Maybe this largely reflects the real probabilities in these areas, for various reasons. If so, part of the disagreement simply comes from cross-disciplinary diffusion of advice and overgeneralization. However, even some of the same researchers are sometimes involved in randomized experiments that aren’t subject to all the same processes as clinical trials.
Even if there is a small prior probability of broken randomization, if it is very easy to test for it, we still should. One nice feature of balance tests compared with other ways of auditing a randomization and data collection process is that they are pretty easy to take in as a reader.
But maybe there are other costs of conducting and reporting balance tests?
Indeed this gets at other reasons some methodologists oppose balance testing. For example, they argue that it fits into an, often vague, process of choosing estimators in a data-dependent way: researchers run the balance tests and make decisions about how to estimate treatment effects as a result.
This is articulated in a paper in The American Statistician by Mutz, Pemantle & Pham, which includes highlighting how discretion here creates a garden of forking paths. In my interpretation, the most considered and formalized arguments are saying is that conducting balance tests and then using that to determine which covariates to include in the subsequent analysis of treatment effects in randomized experiments has bad properties and shouldn’t be done. Here the idea is that when these tests provide some evidence against the null of randomization for some covariate, researchers sometimes then adjust for that covariate (when they wouldn’t have otherwise); and when everything looks balanced, researchers use this as a justification for using simple unadjusted estimators of treatment effects. I agree with this, and typically one should already specify adjusting for relevant pre-treatment covariates in the pre-analysis plan. Including them will increase precision.
I’ve also heard the idea that these balance tests in Table 1 confuse readers, who see a single p < 0.05 — often uncorrected for multiple tests — and get worried that the trial isn’t valid. More generally, we might think that Table 1 of a paper in a widely read medical journal isn’t the right place for such information. This seems right to me. There are important ingredients to good research that don’t need to be presented prominently in a paper, though it is important to provide information about them somewhere readily inspectable in the package for both pre- and post-publication peer review.
In light of all this, here is a proposal:
- Papers on randomized experiments should report tests of the null hypothesis that treatment was randomized as specified. These will often include balance tests, but of course there are others.
- These tests should follow the maxim “analyze as you randomize“, both accounting for any clustering or blocking/stratification in the randomization and any particularly important subsetting of the data (e.g., removing units without outcome data).
- Given a typically high prior belief that randomization occurred as planned, authors, reviewers, and readers should certainly not use p < 0.05 as a decision criterion here.
- If there is evidence against randomization, authors should investigate, and may often be able to fully or partially fix the problem long prior to peer review (e.g., by including improperly discarded data) or in the paper (e.g., by identifying the problem only affected some units’ assignments, bounding the possible bias).
- While it makes sense to mention them in the main text, there is typically little reason — if they don’t reject with a tiny p-value — for them to appear in Table 1 or some other prominent position in the main text, particularly of a short article. Rather, they should typically appear in a supplement or appendix — perhaps as Table S1 or Table A1.
This recognizes both the value of checking implications of one of the most important assumptions in randomized experiments and that most of the time this test shouldn’t cause us to update our beliefs about randomization much. I wonder if any of this remains controversial and why.