Lossy better than lossless in online bootstrapping

Sometimes an approximate method is in some important sense better than the “exact” one — and not just because it is easier or faster.

In statistical inference, a standard example here is the Agresti-Coull confidence interval for a binomial proportion: the “exact” interval from inverting the binomial test is conservative — giving overly wide intervals with more than the advertised coverage — but the standard (approximate) Wald interval is too narrow.¹ The Agresti-Coull confidence interval, which is a modification of the Wald interval that can be justified on Bayesian grounds, has “better” performance than either.²

Even knowing this, like many people I suspect, I am a sucker for the “exact” over the approximate. The rest of this post gives another example of “approximate is better than exact” that Art Owen and I recently encountered in our work on bootstrapping big data with multiple dependencies.

The bootstrap is a computational method for estimating the sampling uncertainty of a statistic.³ When using bootstrap resampling or bagging, one normally draws observations without replacement from the sample to form a bootstrap replicate. Each replicate then consists of zero or more copies of each observation. If one wants to bootstrap online — that is, one observation at a time — or generally without synchronization costs in a distributed processing setting, machine learning folks have used the Poisson approximation to the binomial. This approximate bootstrap works as follows: for each observation in the sample, take a Poisson(1) draw and include that many of that observation in this replicate.

Since this is a “lossy” approximation, investigators have sometimes considered and advocated “lossless” alternatives (Lee & Clyde, 2004). The Bayesian bootstrap, in which each replicate is a n-dimensional draw from the Dirichlet, can be done exactly online: for each observation, take a Exp(1) draw and use it as the weight for that observation for this replicate.⁴ Being a sucker for methods labeled “lossless” or “exact”, I implemented this method in Hive at Facebook, and used it instead of the already available Poisson method. I even chortled to others, “Now we have an exact version implemented to use instead!”

But is this the best of all possible distributions for bootstrap reweighting? Might there be some other, better distribution (nonnegative, with mean 1 and variance 1)? In particular, what distribution minimizes our uncertainty about the variance of the mean, given the same number of bootstrap replicates?

We examined this question (Owen and Eckles, 2011, section 3.3) and found that the Poisson(1) weights give a sharper estimate of the variance than the Exp(1) weights: the lossy approximation to the standard resampling bootstrap is better than the exact Bayesian reweighting bootstrap. Interestingly, both of these are beat by using “double-or-nothing” U{0, 2} weights — that is, something close to half-sampling.⁵ Furthermore, the Poisson(1) and U{0, 2} versions are more general, since they don’t require using weighting (observations can duplicated) and, when using them as weights, they don’t require using floating point numbers.⁶

Agresti, A. and Coull, B. A. (1998). Approximate Is Better than “Exact” for Interval Estimation of Binomial Proportions. American Statistician, 5 (2): 119-126

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7:1–26.

Hesterberg, T., et al. Bootstrap Methods and Permutation Tests. In: Introduction to the Practice of Statistics.

Lee, H. K. H. and Clyde, M. A. (2004). Lossless online Bayesian bagging. Journal of Machine Learning Research, 5:143–151.

Owen, A. B. and Eckles, D. (2011). Bootstrapping data arrays of arbitrary order. http://arxiv.org/abs/1106.2125

Oza, N. (2001). Online bagging and boosting. In Systems, man and cybernetics, 2005 IEEE international conference on, volume 3, pages 2340–2345. IEEE.

The Wald interval also gives zero-width intervals when observations are all either y=1 or y=0. [↩]
That is, it contains the true value closer to 100 * (1 – alpha)% of the time than the others. This example is a favorite of Art Owen’s. [↩]
Hesterberg et al. (PDF) is a good introduction to the bootstrap. Efron (1979) is the first paper on the bootstrap. [↩]
In what sense is this “exact” or “lossless”? This online method is exactly the same as the offline Bayesian bootstrap in which one takes a n-dimensional draw from the Dirichlet. On the other hand, the Poisson(1) method is often seen as an online approximation to the offline bootstrap. [↩]
Why is this? See the paper, but the summary is that U{0, 2} has the lowest kurtosis, and Poisson(1) has lower kurtosis than Exp(1). [↩]
This is especially useful if one is doing factorial weighting, as we do in the paper, where multiplication of weights for different grouping factors is required. [↩]

Leave a Reply