Academia vs. industry: Harvard CS vs. Google edition

Matt Welsh, a professor in the Harvard CS department, has decided to leave Harvard to continue his post-tenure leave working at Google. Welsh is obviously leaving a sweet job. In fact, it was not long ago that he was writing about how difficult it is to get tenure at Harvard.

So why is he leaving? Well, CS folks doing research in large distributed systems are in a tricky place, since the really big systems are all in industry. And instead of legions of experienced engineers to help build and study these systems, they have a bunch of lazy grad students! One might think, then, that this kind of (tenured) professor to industry move is limited to people creating and studying large deployments of computer systems.

There is a broader pull, I think. For researchers studying many central topics in the social sciences (e.g., social influence), there is a big draw to industry, since it is corporations that are collecting broad and deep data sets describing human behavior. To some extent, this is also a case of industry being appealing for people studying deployment of large deployments of computer systems — but it applies even to those who don’t care much about the “computer” part. In further parallels to the case with CS systems researchers, in industry they have talented database and machine learning experts ready to help, rather than social science grad students who are (like the faculty) too often afraid of math.

Not just predicting the present, but the future: Twitter and upcoming movies

Search queries have been used recently to “predict the present“, as Hal Varian has called it. Now some initial use of Twitter chatter to predict the future:

The chatter in Twitter can accurately predict the box-office revenues of upcoming movies weeks before they are released. In fact, Tweets can predict the performance of films better than market-based predictions, such as Hollywood Stock Exchange, which have been the best predictors to date. (Kevin Kelley)

Here is the paper by Asur and Huberman from HP Labs. Also see a similar use of online discussion forums.

But the obvious question from my previous post is, how much improvement do you get by adding more inputs to the model? That is, how does the combined Hollywood Stock Exchange and Twitter chatter model perform? The authors report adding the number of theaters the movie opens in to both models, but not combining them directly.

Search terms and the flu: preferring complex models

Simplicity has its draws. A simple model of some phenomena can be quick to understand and test. But with the resources we have today for theory building and prediction, it is worth recognizing that many phenomena of interest (e.g., in social sciences, epidemiology) are very, very complex. Using a more complex model can help. It’s great to try many simple models along the way — as scaffolding — but if you have a large enough N in an observational study, a larger model will likely be an improvement.

One obvious way a model gets more complex is by adding predictors. There has recently been a good deal of attention on using the frequency of search terms to predict important goings-on — like flu trends. Sharad Goel et al. (blog post, paper) temper the excitement a bit by demonstrating that simple models using other, existing public data sets outperform the search data. In some cases (music popularity, in particular), adding the search data to the model improves predictions: the more complex combined model can “explain” some of the variance not handled by the more basic non-search-data models.

This echos one big takeaway from the Netflix Prize competition: committees win. The top competitors were all large teams formed from smaller teams and their models were tuned combinations of several models. That is, the strategy is, take a bunch of complex models and combine them.

One way of doing this is just taking a weighted average of the predictions of several simpler models. This works quite well when your measure of the value of your model is root mean squared error (RMSE), since RMSE is convex.

While often the larger model “explains” more of the variance, what “explains” means here is just that the R-squared is larger: less of the variance is error. More complex models can be difficult to understand, just like the phenomena they model. We will continue to need better tools to understand, visualize, and evaluate our models as their complexity increases. I think the committee metaphor will be an interesting and practical one to apply in the many cases where the best we can do is use a weighted average of several simpler, pretty good models.

Advanced Soldier Sensor Information System and Technology

Yes, that spells ASSIST.

Check out this call for proposals from DARPA (also see Wired News). This research program is designed to create and evaluate systems that use sensors to capture soldiers’ experiences in the field, thus allowing for (spatially and temporally) distant review and analysis of this data, as well as augmenting their abilities while still in the field.

I found it interesting to consider differences in requirements between this program and others that would apply some similar technologies and involve similar interactions — but for other purposes. For example, two such uses are (1) everyday life recording for social sharing and memory and (2) rich data collection as part of ethnographic observation and participation.

When doing some observation myself, I strung my cameraphone around my neck and used Waymarkr to automatically capture a photo every minute or so. Check out the results from my visit to a flea market in San Francisco.

Photos of two ways to wear a cameraphone from Waymarkr. Incidentally, Waymarkr uses the cell-tower-based location API created for ZoneTag, a project I worked on at Yahoo! Research Berkeley.

Also, for a use more like (1) in a fashion context, see Blogging in Motion. This project (for Yahoo! Hack Day) created a “auto-blogging purse” that captures photos (again using ZoneTag) whenever the wearer moves around (sensed using GPS).