Is thinking about monetization a waste of our best minds?

I just recently watched this talk by Jack Conte, musician, video artist, and cofounder of Patreon:

Jack dives into how rapidly the Internet has disrupted the business of selling reproducible works, such as recorded music and investigative reporting. And how important — and exciting — it is to build new ways for the people who create these works to make a living doing so. Of course, Jack has some particular ways of doing that in mind — subscriptions and subscription-like patronage of artists, for example via Patreon.

But this also made me think about this much-repeated1 quote from Jeff Hammerbacher (formerly of Facebook, Cloudera, and now doing bioinformatics research):

“The best minds of my generation are thinking about how to make people click ads. That sucks.”

I certainly agree that many other types of research can be very important and impactful, and often more so than working on data infrastructure, machine learning, market design, etc. for advertising. However, Jack Conte’s talk certainly helped make the case for me that monetization of “content” is something that has been disrupted already but needs some of the best minds to figure out new ways for creators of valuable works to make money.

Some of this might be coming up with new arrangements altogether. But it seems like this will continue to occur partly through advertising revenue. Jack highlights how little ad revenue he often saw — even as his videos were getting millions of views. And newspapers have been less able to monetize online attention through advertising than they were in print.

Some of this may reflect that advertising dollars were just really poorly allocated before. But improving this situation will require a mix of work on advertising — certainly beyond just getting people to click on ads — such as providing credible measurement of the effects and ROI of advertising, improving targeting of advertising, and more.

Another side of this question is that advertising remains an important part of our culture and a force for attitude and behavior change. Certainly, looking back on 2016 right now, many people are interested in what effects political advertising had.

So maybe it isn’t so bad if at least some of our best minds are working on online advertising.

  1. So often repeated that Hammerbacher said to Charlie Rose, “That’s going to be on my tombstone, I think.” []

Total war, and armaments as “superior goods”

Hobsbawm on industrialization, mass mobilization, and “total war” in The Age of Extremes: A History of the World, 1914-1991 (ch. 1):

Jane Austen wrote her novels during the Napoleonic wars, but no reader who did not know this already would guess it, for the wars do not appear in her pages, even though a number of the young gentlemen who pass through them undoubtedly took part in them. It is inconceivable that any novelist could write about Britain in the twentieth-century wars in this manner.

The monster of twentieth-century total war was not born full-sized. Nevertheless, from 1914 on, wars were unmistakably mass wars. Even in the First World War Britain mobilized 12.5 per cent of its men for the forces, Germany 15.4 per cent, France almost 17 per cent. In the Second World War the percentage of the total active labour force that went into the armed forces was pretty generally in the neighborhood of 20 per cent (Milward, 1979, p. 216). We may note in passing that such a level of mass mobilization, lasting for a matter of years, cannot be maintained except by a modern high-productivity industrialized economy, and – or alternatively – an economy largely in the hands of the non-combatant parts of the population. Traditional agrarian economies cannot usually mobilize so large a proportion of their labour force except seasonally, at least in the temperate zone, for there are times in the agricultural year when all hands are needed (for instance to get in the harvest). Even in industrial societies so great a manpower mobilization puts enormous strains on the labour force, which is why modern mass wars both strengthened the powers of organized labour and produced a revolution in the employment of women outside the household: temporarily in the First World War, permanently in the Second World War.

A superior good is something that one purchases more of as income rises. Here it is appealing, at least metaphorically, to see the huge expenditures on industrial armaments as revealing arms to be superior goods in this sense.

Using covariates to increase the precision of randomized experiments

A simple difference-in-means estimator of the average treatment effect (ATE) from a randomized experiment is unbiased and thus a good start, but it often leaves a lot of additional precision on the table. Even if you haven’t used covariates (pre-treatment variables observed for your units) in the design of the experiment (e.g., this is often difficult to do in streaming random assignment in Internet experiments; see our paper), you can use them to increase the precision of your estimates in the analysis phase. Here are some simple ways to do that. I’m not including a whole range of more sophisticated or complicated approaches. And, of course, if you don’t have any covariates for the units in your experiments, or they aren’t very predictive of your outcome, none of this will help you much.

Post-stratification

Prior to the experiment you could do stratified randomization (i.e., blocking) according to some categorical covariate, making sure that there are the same number of, e.g., units of each gender, country, or account type (paid/free) in each treatment. But you can also do something similar after the fact: compute an ATE within each stratum and then combine the strata-level estimates, weighting by the total number of observations in each stratum. For details — and proofs showing this often won’t be much worse than blocking — consult Miratrix, Sekhon & Yu (2013).
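To make this concrete, here is a minimal sketch in Python (my illustration, not from the paper), using simulated data with a single made-up categorical covariate:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
stratum = rng.choice(["free", "paid"], size=n, p=[0.8, 0.2])  # categorical covariate
t = rng.binomial(1, 0.5, size=n)                              # random assignment
y = 1.0 * t + 3.0 * (stratum == "paid") + rng.normal(size=n)  # outcome; true ATE = 1

d = pd.DataFrame({"y": y, "t": t, "stratum": stratum})

# ATE within each stratum, then combine, weighting by stratum size
ates, sizes = {}, {}
for s, g in d.groupby("stratum"):
    ates[s] = g.loc[g.t == 1, "y"].mean() - g.loc[g.t == 0, "y"].mean()
    sizes[s] = len(g)

ate_hat = sum(ates[s] * sizes[s] for s in ates) / sum(sizes.values())
print(ate_hat)
```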

Regression adjustment with a single covariate

Often what you most want to adjust for is a single numeric covariate,1 such as a lagged version of your outcome (i.e., your outcome from some convenient period before treatment). You can simply use ordinary least squares regression to adjust for this covariate by regressing your outcome on both a treatment indicator and the covariate. Even better (particularly if treatment and control are differently sized by design), you should regress your outcome on: a treatment indicator, the covariate centered such that it has mean zero, and the product of the two.2 Asymptotically (and usually in practice with a reasonably sized experiment), this will not hurt precision and typically increases it, and it is pretty easy to do. For more on this, see Lin (2012).
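Here is a rough sketch of that regression (simulated data; the statsmodels formula interface with robust standard errors is just one convenient way to fit it):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)                       # pre-treatment covariate, e.g., a lagged outcome
t = rng.binomial(1, 0.5, size=n)             # random assignment
y = 1.0 * t + 2.0 * x + rng.normal(size=n)   # outcome; true ATE = 1

d = pd.DataFrame({"y": y, "t": t, "x": x})
d["x_c"] = d["x"] - d["x"].mean()            # center the covariate at its sample mean

# Regress the outcome on treatment, the centered covariate, and their interaction,
# with heteroskedasticity-robust (HC2) standard errors.
fit = smf.ols("y ~ t + x_c + t:x_c", data=d).fit(cov_type="HC2")
print(fit.params["t"], fit.bse["t"])         # adjusted ATE estimate and its robust SE
```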

Higher-dimensional adjustment

If you have a lot more covariates to adjust for, you may want to use some kind of penalized regression. For example, you could use the Lasso (L1-penalized regression); see Bloniarz et al. (2016).
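As one hedged sketch in this spirit (not the exact estimator or inference from Bloniarz et al.), you could fit a cross-validated Lasso of the outcome on centered covariates within each arm and shift each arm’s mean to the full-sample covariate means:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, p = 5_000, 200
X = rng.normal(size=(n, p))                               # many pre-treatment covariates
t = rng.binomial(1, 0.5, size=n)                          # random assignment
y = 1.0 * t + X[:, :5].sum(axis=1) + rng.normal(size=n)   # only a few covariates matter

Xc = X - X.mean(axis=0)                       # center at the full-sample covariate means

def adjusted_arm_mean(arm):
    idx = t == arm
    beta = LassoCV(cv=5).fit(Xc[idx], y[idx]).coef_
    # shift the arm's mean to the full-sample covariate means (zero, after centering)
    return y[idx].mean() - Xc[idx].mean(axis=0) @ beta

ate_hat = adjusted_arm_mean(1) - adjusted_arm_mean(0)
print(ate_hat)
```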

Use out-of-sample predictions from any model

Maybe you instead want to use neural nets, trees, or an ensemble of a bunch of models? That’s fine, but if you want to be able to do valid statistical inference (i.e., get 95% confidence intervals that actually cover 95% of the time), you have to be careful. The easiest way to be careful in many Internet industry settings is just to use historical data to train the model and then get out-of-sample predictions Yhat from that model for your present experiment. You then just subtract Yhat from Y and use the simple difference-in-means estimator. Aronow and Middleton (2013) provide some technical details and extensions. A simple extension that makes this more robust to changes over time is to use this out-of-sample Yhat as a covariate, as described above.3
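A minimal sketch of that last step, assuming you already have out-of-sample predictions y_hat from a model trained only on historical, pre-experiment data (stubbed in here for the simulation):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.normal(size=n)                        # covariate also available historically
t = rng.binomial(1, 0.5, size=n)              # random assignment
y = 0.5 * t + 2.0 * x + rng.normal(size=n)    # outcome; true ATE = 0.5

# Stand-in for out-of-sample predictions from a model fit only on pre-experiment data
y_hat = 2.0 * x

resid = y - y_hat                             # subtract the predictions from the outcomes
ate_hat = resid[t == 1].mean() - resid[t == 0].mean()
print(ate_hat)                                # simple difference in means on the residuals
```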

  1. As Winston Lin notes in the comments and as is implicit in my comparison with post-stratification, as long as the number of covariates is small and not growing with sample size, the same asymptotic results apply. []
  2. Note that if the covariate is binary or, more generally, categorical, then this exactly coincides with the post-stratified estimator considered above. []
  3. I added this sentence in response to Winston Lin’s comment. []

Adjusting biased samples

Nate Cohn at The New York Times reports on how one 19-year-old black man is having an outsized impact on the USC/LAT panel’s estimates of support for Clinton in the U.S. presidential election. It happens that the sample doesn’t have enough other people with similar demographics and voting history (covariates) to this panelist, so he is getting a large weight in computing the overall averages for the populations of interest, such as likely voters:

There is a 19-year-old black man in Illinois who has no idea of the role he is playing in this election.

He is sure he is going to vote for Donald J. Trump.

And he has been held up as proof by conservatives — including outlets like Breitbart News and The New York Post — that Mr. Trump is excelling among black voters. He has even played a modest role in shifting entire polling aggregates, like the Real Clear Politics average, toward Mr. Trump.

As usual, Andrew Gelman suggests that the solution to this problem is a technique he calls “Mr. P” (multilevel regression and post-stratification). I wanted to comment on some practical tradeoffs among common methods. Maybe these are useful notes, which can be read alongside another nice piece by Nate Cohn on how different adjustment methods can yield very different polling results.

Post-stratification

Complete post-stratification is when you compute the mean outcome (e.g., support for Clinton) for each stratum of people, such as 18-24-year-old black men, defined by the covariates X. Then you combine these, weighting by the size of each group in the population of interest. This really only works when you have a lot of data compared with the number of strata — and the number of strata grows very fast in the number of covariates you want to adjust for.
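For concreteness, a toy sketch with made-up strata, responses, and population counts (not real polling data):

```python
import pandas as pd

# Toy survey responses with a stratum label and a binary outcome
sample = pd.DataFrame({
    "stratum": ["18-24 black men", "18-24 black men", "65+ white women", "65+ white women"],
    "supports_clinton": [1, 0, 1, 0],
})
# Made-up stratum sizes for the population of interest (e.g., likely voters)
population = pd.Series({"18-24 black men": 1_000_000, "65+ white women": 9_000_000})

stratum_means = sample.groupby("stratum")["supports_clinton"].mean()
weights = population / population.sum()
estimate = (stratum_means * weights).sum()   # weight each stratum's mean by its population share
print(estimate)
```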

Modeling sample inclusion and weighting

When people talk about survey weighting, often what they mean is weighting by the inverse of the estimated probability of inclusion in the sample. You model selection into the survey S using, e.g., logistic regression on the covariates X and some interactions. This can be done with regularization (i.e., priors, shrinkage), since many of the terms in the model might be estimated with very few observations. Especially without enough regularization, this can result in very large weights when you don’t have enough people of some particular type in your sample.
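A minimal sketch of this route, assuming (unrealistically, for simplicity) that you observe covariates for everyone in a frame along with who responded; in practice the inclusion model is usually calibrated against population benchmarks instead:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 50_000
X = rng.normal(size=(n, 3))                              # covariates for everyone in the frame
p_incl = 1 / (1 + np.exp(-(-2.0 + X @ np.array([1.0, -0.5, 0.0]))))
S = rng.binomial(1, p_incl)                              # who ended up in the sample
y = (X[:, 0] + rng.normal(size=n) > 0).astype(float)     # outcome, observed only when S == 1

# Model inclusion with L2-regularized logistic regression; C controls the shrinkage
sel = LogisticRegression(C=1.0).fit(X, S)
p_hat = sel.predict_proba(X)[:, 1]

w = 1.0 / p_hat[S == 1]                                  # inverse-probability-of-inclusion weights
estimate = np.average(y[S == 1], weights=w)
print(estimate)
```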

Modeling the outcome and integrating

You fit a model predicting the response (e.g., support for Clinton) Y with the covariates X. You regularize this model in some way so that the estimate for each person is going to “borrow strength” from other people with similar Xs. So now you have a fitted response Yhat for each unique X. To get an estimate for a particular population of interest, you integrate over the distribution of X in that population. Gelman’s preferred version, “Mr. P”, uses a multilevel (aka hierarchical Bayes, random effects) model for the outcome, but other regularization methods may often be appealing.

This is nice because there can be substantial efficiency gains (i.e., more precision) from making use of the outcome information. But there are also some practical issues. First, you need a model for each outcome in your analysis, rather than just having weights you could use for all outcomes and all recodings of outcomes. Second, the implicit weights that this process puts on each observation can vary from outcome to outcome — or even for different codings (e.g., a dichotomization of answers on a numeric scale) of the same outcome. In a reply to his post, Gelman notes that you would need a different model for each outcome, but that some joint model for all outcomes would be ideal. Of course, the latter joint modeling approach, while appealing in some ways (many statisticians love having one model that subsumes everything…), means that adding a new outcome to the analysis would change all prior results.
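A toy sketch of the outcome-modeling route (made-up cells and counts; a ridge-penalized logistic regression stands in here for the multilevel model that “Mr. P” would actually use):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy survey responses with categorical covariates and a binary outcome
sample = pd.DataFrame({
    "age":  ["18-29", "18-29", "30-64", "65+", "65+"],
    "race": ["black", "white", "white", "black", "white"],
    "supports_clinton": [1, 0, 1, 1, 0],
})
# Made-up population counts (in millions) for every covariate cell in the population of interest
population = pd.DataFrame({
    "age":   ["18-29", "18-29", "30-64", "30-64", "65+", "65+"],
    "race":  ["black", "white", "black", "white", "black", "white"],
    "count": [3, 20, 6, 40, 2, 25],
})

X_sample = pd.get_dummies(sample[["age", "race"]])
X_pop = pd.get_dummies(population[["age", "race"]]).reindex(columns=X_sample.columns, fill_value=0)

# Regularized outcome model; predictions "borrow strength" across similar cells
model = LogisticRegression(C=1.0).fit(X_sample, sample["supports_clinton"])
cell_preds = model.predict_proba(X_pop)[:, 1]

# Integrate over the population distribution of X: weight each cell's prediction by its share
estimate = np.average(cell_preds, weights=population["count"])
print(estimate)
```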


Side note: Other methods, not described here, also work towards the aim of matching characteristics of the population distribution (e.g., iterative proportional fitting / raking). They strike me as overly specialized and not easy to adapt and extend.

It’s better for older workers to go a little fast: DocSend in Snow Crash

My friends at DocSend have just done their public launch (article, TechCrunch Disrupt presentation). DocSend provides easy ways to get analytics for documents (e.g., proposals, pitch decks, reports, memos) you send out, answering questions like: Who actually viewed the document? Which pages did they view? How much time did they spend on each page? The most common use cases for DocSend’s current customers involve sales, marketing, and startup fundraising — mainly sending documents to people outside an organization.

From when Russ, Dave, and Tony started floating these ideas, I’ve pointed out the similarity with an often-forgotten scene1 in Snow Crash, in which a character — Y.T.’s mom — is tracked by her employer (actually the Federal Government) as she reads a memo on a cost-saving program. Here’s an excerpt from Chapter 37:

Y.T.’s mom pulls up the new memo, checks the time, and starts reading it. The estimated reading time is 15.62 minutes. Later, when Marietta [her boss] does her end-of-day statistical roundup, sitting in her private office at 9:00 P.M., she will see the name of each employee and next to it, the amount of time spent reading this memo, and her reaction, based on the time spent, will go something like this:

• Less than 10 min.: Time for an employee conference and possible attitude counseling.
• 10-14 min.: Keep an eye on this employee; may be developing slipshod attitude.
• 14-15.61 min.: Employee is an efficient worker, may sometimes miss important details.
• Exactly 15.62 min.: Smartass. Needs attitude counseling.
• 15.63-16 min.: Asswipe. Not to be trusted.
• 16-18 min.: Employee is a methodical worker, may sometimes get hung up on minor details.
• More than 18 min.: Check the security videotape, see just what this employee was up to (e.g., possible unauthorized restroom break).

Y.T.’s mom decides to spend between fourteen and fifteen minutes reading the memo. It’s better for younger workers to spend too long, to show that they’re careful, not cocky. It’s better for older workers to go a little fast, to show good management potential. She’s pushing forty. She scans through the memo, hitting the Page Down button at reasonably regular intervals, occasionally paging back up to pretend to reread some earlier section. The computer is going to notice all this. It approves of rereading. It’s a small thing, but over a decade or so this stuff really shows up on your work-habits summary.

This is pretty much what DocSend provides. And, despite the emphasis on sales etc., some of their customers are using this for internal HR training — which shifts the power asymmetry in how this technology is used from salespeople selling to companies (which can choose not to buy, etc.) to employers tracking their employees.2

To conclude, it’s worth noting that, at least for a time, product managers at Facebook — Russ’ job before starting DocSend — were required to read Snow Crash as part of their internal training. Though I don’t think the folks running PM bootcamp actually tracked whether their subordinates looked at each page.

  1. I know it’s often forgotten because I’ve tried referring to the scene with many people who have read Snow Crash — or at least claim to have read it… []
  2. Of course, there are some products that do this kind of thing. What distinguishes DocSend is how easy it makes it to add such personalized tracking to simple documents and that this is the primary focus of the product, unlike larger sales tool sets like ClearSlide. []