Search queries in referrer headers: Technical knowledge, privacy, and the status quo

I have been fascinated by Christopher Soghoian’s complaint to the FTC about Google’s practices of including search query information in the HTTP referrer header.

In summary, Google has taken proactive efforts to ensure that Web site owners that get visitors from Google search receive the search terms entered by Google’s users. Meanwhile, Google has agreed that search query data is personally sensitive information and that it does not disclose this information, except under specific, limited circumstances; this is reflected in its privacy policy. Note that Google has not just let the URL do the work, but has specifically worked to make the referrer header include search terms (and additional information) when it has adopted techniques that would otherwise prevent these disclosures from being made. (For a fuller summary, see his blog post and this WSJ article. Or this article at Search Engine Land.)

I am not going to discuss the ethics and legal issues in this particular case. Instead, I just want to draw attention to how this issue reveals the importance of technical knowledge in thinking about privacy issues.

A common response from people working in the Internet industry is that Soghoian is a non-techie who has suddenly “discovered” referrer headers. For example, Danny Sullivan writes “former FTC employee discovers browsers sends referrer strings, turns it into google conspiracy”. (Of course, Soghoian is actually technically savvy, as reading the complaint to the FTC makes clear.)

What’s going on here? Folks with technical knowledge perceive search query disclosure as the status quo (though I bet most don’t often think about the consequences of clicking on a link after a sensitive search).

But how would most Internet users be aware of this? Certainly not through Google’s statements, or through warnings from Web browsers. One of the few ways I think users might realize this is happening is through query-highlighting on forums, mailing list archives, and spammy pages. So a super-rational user who cares to think about how that works might guess something like this is going on. But I doubt most users would actively work out the mechanisms involved. Furthermore, their observations likely radically underdetermine the mechanism anyway, since it is quite reasonable that a Web browser could do this kind of highlighting directly, especially for formulaic sites, like forums. Even casual use of Web analytics software (such as Google Analytics) may not make it clear that this per-user information is being provided, since aggregated data could reasonably be used to present summaries of top search queries leading to a Web site.1
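The mechanism behind query highlighting can be sketched in a few lines. This is a minimal illustration, not any particular site's code: it assumes the referring search engine puts the terms in a `q` URL parameter, and the function names and the example referrer URL are made up for the sketch.

```python
import re
from urllib.parse import urlparse, parse_qs


def search_terms_from_referrer(referrer: str) -> list[str]:
    """Return the search terms carried in a referrer URL's `q` parameter, if any."""
    query = parse_qs(urlparse(referrer).query)
    # parse_qs decodes "+" and percent-escapes, so "flu+symptoms" becomes "flu symptoms".
    terms = query.get("q", [""])[0]
    return terms.split()


def highlight(text: str, terms: list[str]) -> str:
    """Wrap each case-insensitive occurrence of a search term in <mark> tags."""
    if not terms:
        return text
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)


# Hypothetical value of the Referer header sent with a click on a search result.
referrer = "http://www.google.com/search?q=flu+symptoms&hl=en"
terms = search_terms_from_referrer(referrer)
print(terms)
print(highlight("Flu symptoms vary from person to person.", terms))
```

The point of the sketch is that the receiving site needs nothing beyond the request itself: the terms arrive with each individual visit, not as an aggregate.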

This should be a reminder of why empirical studies of privacy attitudes and behaviors are useful: us techie folks often have severe blind spots. I don’t think this is just a matter of differences in expectations; it also involves differences in preferences. Over time, these expectations change our sense of the status quo, from which we can calibrate our preferences and intentions.

Google has worked to ensure that referrer headers continue to include search query information — even as it adopts techniques that would make this not happen simply by the standard inclusion of the URL there.2 A difference in beliefs about the status quo puts these actions by Google in a different context. For us techies, that is just maintaining the status quo (which may seem more desirable, since we know it’s the industry-wide standard). For others, it might seem more like Google putting advertisers and Web site owners above its promises to its users about their sensitive data.

  1. Google does separately provide aggregated query data to Web site owners.
  2. See Danny Sullivan’s post following some changes by Google that could have ended including search queries in referrer headers.

Economic imperialism and causal inference

And I, for one, welcome our new economist overlords…

Readers not in academic social science may take the title of this post as indicating I’m writing about the use of economic might to imperialist ends.1 Rather, economic imperialism is a practice of economists (and acolytes) in which they invade research territories that traditionally “belong” to other social scientific disciplines.2 See this comic for one way you can react to this.3

Economists bring their theoretical, statistical, and research-funding resources to bear on problems that might not be considered economics. For example, freakonomists like Levitt study sumo wrestlers and the effects of the legalization of abortion on crime. But, hey, if the Commerce Clause means that Congress can legislate everything, then, for the same reasons, economists can — no, must — study everything.

I am not an economist by training, but I have recently had reason to read quite a bit in econometrics. Overall, I’m impressed.4 Economists have recently taken causal inference — learning about cause and effect relationships, often from observational data — quite seriously. In the eyes of some, this has precipitated a “credibility revolution” in economics. Certainly, papers in economics and (especially) econometrics journals consider threats to the validity of causal inference at length.

On the other hand, causal inference in the rest of the social sciences is simultaneously over-inhibited and under-inhibited. As Judea Pearl observes in his book Causality, lack of clarity about statistical models (that social scientists often don’t understand) and causality has induced confusion about distinctions between statistical and causal issues (i.e., between estimation methods and identification).5

So, on the one hand, many psychologists stick to experiments. Randomized experiments are, generally, the gold standard for investigating cause–effect relationships, so this can and often does go well. However, social psychologists have recently been obsessed with using “mediation analysis” to investigate the mechanisms by which causes they can manipulate produce effects of interest. Investigators often manipulate some factors experimentally and then measure one or more variables they believe fully or partially mediate the effect of those factors on their outcome. Then, under the standard Baron & Kenny approach, psychologists fit a few regression models, including regressing the outcome on both the experimentally manipulated variables and the simply measured (mediating) variables. The assumptions required for this analysis to identify any effects of interest are rarely satisfied (e.g., that effects on individuals are homogeneous).6 So psychologists are often over-inhibited (experiments only please!) and under-inhibited (mediation analysis).
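To make the worry concrete, here is a small simulation sketch of the regression step described above. Everything in it is an assumption chosen for illustration: the effect sizes, the sample size, and the particular violation, which is an unmeasured common cause of the mediator and the outcome (a different failure than the heterogeneous-effects example). The `ols` helper is a plain normal-equations solver written for this sketch, not a library routine.

```python
import random


def ols(columns, y):
    """Least-squares coefficients of y on the given columns (intercept added first)."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)]
         + [sum(row[r] * yi for row, yi in zip(X, y))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(k):
            if r != i:
                factor = A[r][i] / A[i][i]
                A[r] = [a - factor * b for a, b in zip(A[r], A[i])]
    return [A[i][k] / A[i][i] for i in range(k)]


def simulate(n=20000, seed=0):
    rng = random.Random(seed)
    T, M, Y = [], [], []
    for _ in range(n):
        t = float(rng.random() < 0.5)      # randomized treatment
        u = rng.gauss(0, 1)                # unmeasured cause of both M and Y
        m = 1.0 * t + u + rng.gauss(0, 1)  # mediator: true effect of T on M is 1.0
        y = 0.2 * t + 0.5 * m + u + rng.gauss(0, 1)  # true direct effect 0.2, true M effect 0.5
        T.append(t)
        M.append(m)
        Y.append(y)
    return T, M, Y


T, M, Y = simulate()
_, total = ols([T], Y)         # outcome on treatment only
_, direct, b = ols([T, M], Y)  # outcome on treatment and measured mediator
print(f"total effect of T:     {total:.2f} (true 0.70)")
print(f"'direct' effect of T:  {direct:.2f} (true 0.20)")
print(f"coefficient on M:      {b:.2f} (true 0.50)")
```

Because the treatment is randomized, the total effect estimate should land near the true 0.7. The mediator is not randomized, so under these assumptions its coefficient should drift toward about 1.0 rather than 0.5, and the estimated direct effect toward about -0.3 rather than 0.2.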

Likewise, in more observational studies (in psychology, sociology, education, etc.), investigators are sometimes wary of making explicit causal claims. So instead of carefully stating the causal assumptions that would justify different causal conclusions, readers are left with phrases like “suggests” and “is consistent with” followed by causal claims. Authors then recommend that further research be conducted to better support these causal conclusions. With these kinds of recommendations awaiting, no wonder that economists find the territory ready for taking: they can just show up with econometrics tools and get to work on hard-won questions that rightly belong to others!

  1. Well, if economists have better funding sources, this might apply in some sense.
  2. For arguments in favor of economic imperialism, see Lazear, E.P. (1999). Economic imperialism. NBER Working Paper No. 7300.
  3. Or see this comic for imperialism by physicists.
  4. At least by the contemporary literature on what I’ve been reading about: IVs, encouragement designs, endogenous interactions, matching estimators. But it is true that in some of these areas econometrics has been able to fruitfully borrow from work on potential outcomes in statistics and epidemiology.
  5. Econometricians have made similar observations.
  6. For a bit on this topic, see the discussion and links to papers here.

Homophily and peer influence are messy business

Some social scientists have recently been getting themselves into trouble (and limelight) claiming that they have evidence of direct and indirect “contagion” (peer influence effects) in obesity, happiness, loneliness, etc. Statisticians and methodologists — and even science journalists — have pointed out their troubles. In observational data, peer influence effects are confounded with those of homophily and common external causes. That is, people are similar to other people in their social neighborhood because ties are more likely to form between similar people, and many external events that could cause the outcome are localized in networks (e.g., fast food restaurant opens down the street).
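The homophily half of this confound is easy to reproduce in a toy simulation. The sketch below is built entirely on assumptions made up for illustration (a single latent trait, a hard similarity threshold for tie formation, the sample sizes), and it deliberately contains no peer influence at all: nobody's outcome depends on anyone else's.

```python
import random


def correlation(pairs, values):
    """Pearson correlation between values[i] and values[j] over the given (i, j) pairs."""
    xs = [values[i] for i, _ in pairs]
    ys = [values[j] for _, j in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n_people=2000, n_candidate_pairs=50000, seed=1):
    rng = random.Random(seed)
    trait = [rng.gauss(0, 1) for _ in range(n_people)]
    # Outcome depends only on a person's own trait plus noise: no influence term.
    outcome = [z + rng.gauss(0, 1) for z in trait]
    candidates, friends = [], []
    for _ in range(n_candidate_pairs):
        i, j = rng.sample(range(n_people), 2)
        candidates.append((i, j))
        # Homophily: a tie forms only between people with similar traits.
        if abs(trait[i] - trait[j]) < 0.5:
            friends.append((i, j))
    return outcome, candidates, friends


outcome, candidates, friends = simulate()
friend_corr = correlation(friends, outcome)
random_corr = correlation(candidates, outcome)
print(f"outcome correlation among friends:      {friend_corr:.2f}")
print(f"outcome correlation among random pairs: {random_corr:.2f}")
```

Friends' outcomes come out clearly correlated while random pairs' outcomes do not, even though influence is absent by construction. An analysis that only sees the network and the outcomes cannot tell this apart from contagion without further assumptions.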

Econometricians1 have worked out the conditions necessary for peer influence effects to be identifiable.2 Very few studies have plausibly satisfied these requirements. But even if an investigator meets these requirements, it is worth remembering that homophily and peer influence are still tricky to think about, let alone produce credible quantitative estimates of.

As Andrew Gelman notes, homophily can depend on network structure and information cascades (a kind of peer influence effect) to enable the homophilous relationships to form. Likewise, the success or failure of influence in a relationship can affect that relationship. For example, once I convert you to my way of thinking (let’s say, about climate change), we’ll be better friends. To me, it seems like some of the downstream consequences of our similarity should be attributed to peer influence. If I get fat and so do you, it could be peer influence in many ways: maybe that’s because I convinced you that owning a propane grill is more environmentally friendly (and then we both ended up grilling a lot more red meat). Sounds like peer influence to me. But it’s not that me getting fat caused you to.

Part of the problem here is looking only at peer influence effects in a single behavior or outcome at once. I look forward to the “clear thinking and adequate data” (Manski) that will allow us to better understand these processes in the future. Until then: scientists, please at least be modest in your claims and radical policy recommendations. This is messy business.

  1. They do statistics but speak a different language than big “S” statisticians, kind of like machine learning folks.
  2. For example, see Manski, C. F. (2000). Economic analysis of social interactions. Journal of Economic Perspectives, 14(3):115–136. Economists call peer influence effects endogenous interactions and contextual interactions.

Aardvark’s use of Wizard of Oz prototyping to design their social interfaces

The Wall Street Journal’s Venture Capital Dispatch reports on how Aardvark, the social question asking and answering service recently acquired by Google, used a Wizard of Oz prototype to learn about how their service concept would work without building all the tech before knowing if it was any good.

Aardvark employees would get the questions from beta test users and route them to users who were online and would have the answer to the question. This was done to test out the concept before the company spent the time and money to build it, said Damon Horowitz, co-founder of Aardvark, who spoke at Startup Lessons Learned, a conference in San Francisco on Friday.

“If people like this in super crappy form, then this is worth building, because they’ll like it even more,” Horowitz said of their initial idea.

At the same time it was testing a “fake” product powered by humans, the company started building the automated product to replace humans. While it used humans “behind the curtain,” it gained the benefit of learning from all the questions, including how to route the questions and the entire process with users.

This is a really good idea, as I’ve argued before on this blog and in a chapter for developers of mobile health interventions. What better way to (a) learn about how people will use and experience your service and (b) get training data for your machine learning system than to have humans-in-the-loop run the service?
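One way to picture how points (a) and (b) fit together is to give the human and the automated router the same interface, so that every human decision doubles as a labeled example. The sketch below is hypothetical throughout: the class names, the word-overlap scoring, and the sample questions are inventions for illustration and say nothing about how Aardvark actually built its system.

```python
from collections import Counter, defaultdict
from typing import Callable


class HumanRouter:
    """Wizard-of-Oz stage: a person picks the answerer, and every choice is logged."""

    def __init__(self, ask_operator: Callable[[str], str]):
        self.ask_operator = ask_operator
        self.log: list[tuple[str, str]] = []

    def route(self, question: str) -> str:
        answerer = self.ask_operator(question)
        self.log.append((question, answerer))  # becomes training data later
        return answerer


class LearnedRouter:
    """Automated stage: route by word overlap with questions the operator sent to each answerer."""

    def __init__(self, log: list[tuple[str, str]]):
        self.words: dict[str, Counter] = defaultdict(Counter)
        for question, answerer in log:
            self.words[answerer].update(question.lower().split())

    def route(self, question: str) -> str:
        tokens = question.lower().split()
        return max(self.words, key=lambda a: sum(self.words[a][t] for t in tokens))


# A dictionary lookup stands in for a live operator making routing decisions.
operator_choices = {
    "best ramen in SF?": "alice",
    "how do I fix a flat bike tire?": "bob",
}
wizard = HumanRouter(operator_choices.__getitem__)
for q in operator_choices:
    wizard.route(q)

auto = LearnedRouter(wizard.log)
print(auto.route("good ramen near me?"))
```

Since both classes expose the same `route` method, the rest of a service would not need to know which one is behind the curtain.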

My friend Chris Streeter wondered whether this was all done by Aardvark employees or whether workers on Amazon Mechanical Turk may have also been involved, especially in identifying the expertise of the early users of the service so that the employees could route the questions to the right place. I think this highlights how different parts of a service can draw on human and non-human intelligence in a variety of ways — via a micro-labor market, using skilled employees who will gain hands-on experience with customers, etc.

I also wonder what UIs the humans-in-the-loop used to accomplish this. It’d be great to get a peek. I’d expect that these were certainly rough around the edges, as was the Aardvark customer-facing UI.

Aardvark does a good job of being a quite sociable agent (e.g., when using it via instant messaging) that also gets out of the way of the human–human interaction between question askers and answerers. I wonder how the language used by humans to coordinate and hand off questions may have played into creating a positive para-social interaction with vark.

Public once, public always? Privacy, egosurfing, and the availability heuristic

The Library of Congress has announced that it will be archiving all Twitter posts (tweets). You can find positive reaction on Twitter. But some have also wondered about privacy concerns. Fred Stutzman, for example, points out how, even assuming that only unprotected accounts are being archived, this can still be problematic.1 While some people have Twitter usernames that easily identify their owners and many allow themselves to be found based on an email address that is publicly associated with their identity, there are also many that do not. If, at a future time, such an account becomes associated with their identity for a larger audience than they desire, they can make their whole account viewable only by approved followers2, delete the account, or delete some of the tweets. Of course, this information may remain elsewhere on the Internet for a short or long time. But in contrast, the Library of Congress archive will be much more enduring and likely outside of individual users’ control.3 While I think it is worth examining the strategies that people adopt to cope with inflexible or difficult-to-use privacy controls in software, I don’t intend to do that here.

Instead, I want to relate this discussion to my continued interest in how activity streams and other information consumption interfaces affect their users’ beliefs and behaviors through the availability heuristic. In response to some comments on his first post, Stutzman argues that people overestimate the degree to which content once public on the Internet is public forever:

So why is it that we all assume that the content we share publicly will be around forever?  I think this is a classic case of selection on the dependent variable.  When we Google ourselves, we are confronted with what’s there as opposed to what’s not there.  The stuff that goes away gets forgotten, and we concentrate on things that we see or remember (like a persistent page about us that we don’t like).  In reality, our online identities decay, decay being a stochastic process.  The internet is actually quite bad at remembering.

This unconsidered “selection on the dependent variable” is one way of thinking about some cases of how the availability heuristic (and the use of ease-of-retrieval information more generally) operates. But I actually think the latter is more general and more useful for describing the psychological processes involved. For example, it highlights both that there are many occurrences or interventions that can influence which cases are available to mind and that even if people have thought about cases where their content disappeared at some point, this may not be easily retrieved when making particular privacy decisions or offering opinions on others’ actions.

Stutzman’s example is but one way that the availability heuristic and existing Internet services combine to affect privacy decisions. For example, consider how activity streams like Facebook News Feed influence how people perceive their audience. News Feed shows items drawn from an individual’s friends’ activities, and they often have some reciprocal access. However, the items in the activity stream are likely unrepresentative of this potential and likely audience. “Lurkers” (people who consume but do not produce) are not as available to mind, and prolific producers are too available to mind for how often they are in the actual audience for some new shared content. This can, for example, lead to making self-disclosures that are not appropriate for the actual audience.

  1. This might not be the case; see Michael Zimmer and this New York Times article.
  2. Why don’t people do this in the first place? Many may not be aware of the feature, but even if they are, there are reasons not to use it. For example, it makes any participation in topical conversations (e.g., around a hashtag) difficult or impossible.
  3. Or at least this control would have to be via Twitter, likely before archiving: “We asked them [Twitter] to deal with the users; the library doesn’t want to mediate that.”