Aardvark’s use of Wizard of Oz prototyping to design their social interfaces

The Wall Street Journal’s Venture Capital Dispatch reports on how Aardvark, the social question asking and answering service recently acquired by Google, used a Wizard of Oz prototype to learn about how their service concept would work without building all the tech before knowing if it was any good.

Aardvark employees would get the questions from beta test users and route them to users who were online and would have the answer to the question. This was done to test out the concept before the company spent the time and money to build it, said Damon Horowitz, co-founder of Aardvark, who spoke at Startup Lessons Learned, a conference in San Francisco on Friday.

“If people like this in super crappy form, then this is worth building, because they’ll like it even more,” Horowitz said of their initial idea.

At the same time it was testing a “fake” product powered by humans, the company started building the automated product to replace humans. While it used humans “behind the curtain,” it gained the benefit of learning from all the questions, including how to route the questions and the entire process with users.

This is a really good idea, as I’ve argued before on this blog and in a chapter for developers of mobile health interventions. What better way to (a) learn about how people will use and experience your service and (b) get training data for your machine learning system than to have humans-in-the-loop run the service?

My friend Chris Streeter wondered whether this was all done by Aardvark employees or whether workers on Amazon Mechanical Turk may have also been involved, especially in identifying the expertise of the early users of the service so that the employees could route the questions to the right place. I think this highlights how different parts of a service can draw on human and non-human intelligence in a variety of ways — via a micro-labor market, using skilled employees who will gain hands-on experience with customers, etc.

I also wonder what UIs the humans-in-the-loop used to accomplish this. It’d be great to get a peak. I’d expect that these were certainly rough around the edges, as was the Aardvark customer-facing UI.

Aardvark does a good job of being a quite sociable agent (e.g., when using it via instant messaging) that also gets out of the way of the human–human interaction between question askers and answers. I wonder how the language used by humans to coordinate and hand-off questions may have played into creating a positive para-social interaction with vark.

“Discovering Supertaskers”: Challenges in identifying individual differences from behavior

Some new research from the University of Utah suggests that a small fraction of the population consists of “supertaskers” whose performance is not reduced by multitasking, such as when completing tasks on a mobile phone while driving.

“Supertaskers did a phenomenal job of performing several different tasks at once,” Watson says. “We’d all like to think we could do the same, but the odds are overwhelmingly against it.” (Wired News & Science News)

The researchers, Watson and Strayer, argue that they have good evidence for the existence of this individual variation. One can find many media reports of this “discovery” of “supertaskers” (e.g., Psychology Today). I do not think this conclusion is well justified.

First, let’s consider the methods used in this research. 100 college students each completed driving tasks and an auditory task on a mobile phone — separately and in combination — over a single 1.5 hour session. The auditory task is designed to measure differences in executive attention by requiring participants do hold past items in memory while completing math tasks. The researchers identified “supertaskers” as those participants who met the following “stringent” requirements: they were both (a) in the top 25% of participants in performance in the single-task portions and (b) and not different in their dual-task performance on at least three of the four measures by more than the standard error. Since two of the four measures are associated with each of the two tasks (driving: brake reaction time, following distance; mobile phone task: memory performance, math performance), this requires that ‘’supertaskers” do as well on both measures of either the driving or mobile phone task and one measure of the other task.

There may be many issues with the validity of the inference in this work. I want to focus on one in particular: the inference from the observation of differences between participants’ performance in a single 1.5 hour session to the conclusion that there are stable, “trait” differences among participants, such that some are “supertaskers”. This conclusion is simply not justified. To illustrate this, let’s consider how the methods of this study differ from those usually (and reasonably) used by psychologists to reach such conclusions.

Psychologists often study individual differences using the following approach. First, identify some plausible trait of individuals. Second, construct a questionnaire or other (perhaps behavioral) test that measures that trait. Third, demonstrate that this test has high reliability — that is, that the differences between people are much larger than the differences between the same person taking the test at different times. Fourth, then use this test to measure the trait and see if it predicts differences in some experiment. A key point here is that in order to conclude that the test measures a stable individual difference (i.e., a trait) researchers need to establish high test-retest reliability; otherwise, the test might just be measuring differences in temporary mood.

Returning to Watson and Strayer’s research, it is easy to see the problem: we have no idea whether the variation observed should be attributed to stable individual differences (i.e., being a “supertasker”) or to unstable differences. That is, if we brought those same “supertasker” participants back into the lab and they did another session, would they still exhibit the same lack of performance difference between the single- and dual-task conditions? This research gives us no reason that expect that they would.

Watson and Strayer do some additional analysis with the aim of ruling out their observations being a fluke. One might think this addresses my criticism, but it does not. They

performed a Monte Carlo simulation in which randomly selected single-dual task pairs of variables from the existing data set were obtained for each of the 4 dependent measures and then subjected to the same algorithm that was used to classify the supertaskers.

That is, they broke apart the single-task and dual-task data for each participant and created new simulated participants by randomly sampling pairs single- and dual-task data. They found that on this analysis there would be only 1/15th of the observed ‘’supertaskers”. This is a good analysis to do. However, this just demonstrates that being labeled a “supertasker” is likely caused by the single- and dual-task data being generated by the same person in the same session. This stills leaves it quite open (and more plausible to me) that participants’ were in varying states for the session and this explains their (temporary) “supertasking”. It also allows that this greater frequency of “supertaskers” is due to participants who do well in whatever task they are given first being more likely to do well in subsequent tasks.

My aim in this post is to suggest some challenges that this kind of approach has to face. Part of my interest in this is that I’m quite sympathetic to identifying stable, observed differences in behavior and then “working backwards” to characterizing the traits that explain these downstream differences. This  exactly the approach that Maurits Kaptein and I are taking in our work on persuasion profiling: we observe how individuals respond to the use of different influence strategies and use this to (a) construct a “persuasion profile” for that individual and (b) characterize how much variation in the effects of these strategies there is in the population.

However, a critical step in this process is ruling out the alternative explanation that the observed differences are primarily due to differences in, e.g., mood, rather than stable individual differences. One way to do this is to observe the behavior in multiple sessions and multiple contexts. Another way to rule out this alternative explanation is if you observe a complex pattern of behavioral differences that previous work suggests could not be the result of temporary, unstable differences — or at least is more easily explained by previous theories about the relevant traits. That is, I’m enthusiastic about identifying stable, observed differences in behavior, but I don’t want to see researchers abandon the careful methods that have been used in the past to make the case for a new individual difference.

Watson, Strayer, and colleagues have apparently begun doing work that could be used to show the stability of the observed differences. The discussion section of their paper refers to some additional unpublished research in which they invited their “supertaskers” from this study and another study back into the lab and had them do some similar tasks measuring executive attention (but not driving) while in an fMRI machine. They report greater “coherence” in their performance in this second study and the previous study than control participants and better performance for “supertaskers” on dual-N-back tasks. But this is short of showing high test-retest reliability.

Since little is said about this work, I hesitate to conclude anything from it or criticize it. I’ve contacted the authors with the hope of learning more. My current sense is that Watson and Strayer’s entire case for “supertaskers” hinges on research of this kind.

References

Watson, J. M., & Strayer, D. L. (2010). Supertaskers: Profiles in Extraordinary Multi-tasking Ability. Psychonomic Bulletin and Review. Forthcoming. Retrieved from http://www.psych.utah.edu/lab/appliedcognition/publications/supertaskers.pdf

Search terms and the flu: preferring complex models

Simplicity has its draws. A simple model of some phenomena can be quick to understand and test. But with the resources we have today for theory building and prediction, it is worth recognizing that many phenomena of interest (e.g., in social sciences, epidemiology) are very, very complex. Using a more complex model can help. It’s great to try many simple models along the way — as scaffolding — but if you have a large enough N in an observational study, a larger model will likely be an improvement.

One obvious way a model gets more complex is by adding predictors. There has recently been a good deal of attention on using the frequency of search terms to predict important goings-on — like flu trends. Sharad Goel et al. (blog post, paper) temper the excitement a bit by demonstrating that simple models using other, existing public data sets outperform the search data. In some cases (music popularity, in particular), adding the search data to the model improves predictions: the more complex combined model can “explain” some of the variance not handled by the more basic non-search-data models.

This echos one big takeaway from the Netflix Prize competition: committees win. The top competitors were all large teams formed from smaller teams and their models were tuned combinations of several models. That is, the strategy is, take a bunch of complex models and combine them.

One way of doing this is just taking a weighted average of the predictions of several simpler models. This works quite well when your measure of the value of your model is root mean squared error (RMSE), since RMSE is convex.

While often the larger model “explains” more of the variance, what “explains” means here is just that the R-squared is larger: less of the variance is error. More complex models can be difficult to understand, just like the phenomena they model. We will continue to need better tools to understand, visualize, and evaluate our models as their complexity increases. I think the committee metaphor will be an interesting and practical one to apply in the many cases where the best we can do is use a weighted average of several simpler, pretty good models.

Multitasking among tasks that share a goal: action identification theory

Right from the start of today’s Media Multitasking Workshop1,  it’s clear that one big issue is just what people are talking about when they talk about multitasking. In this post, I want to highlight the relationship between defining different kinds of multitasking and people’s representations of the hierarchical structure of action.

It is helpful to start with a contrast between two kinds of cases.

Distributing attention towards a single goal

In the first, there is a single task or goal that involves dividing one’s attention, with the targets of attention somehow related, but of course somewhat independent. Patricia Greenfield used Pac-Man as an example: each of the ghosts must be attended to (in addition to Pac-Man himself), and each is moving independently, but each is related to the same larger goal.

Distributing attention among different goals

In the second kind of case, there are two completely unrelated tasks that divide attention, as in playing a game (e.g., solitaire) while also attending to a speech (e.g., in person, on TV). Anthony Wagner noted that in Greenfield’s listing of the benefits and costs of media multitasking, most of the listed benefits applied to the former case, while the costs she listed applied to the later. So keeping these different senses of multitasking straight is important.

Complications

But the conclusion should not be to think that this is a clear and stable distinction that slices multitasking phenomena in just the right way. Consider one ways of putting this distinction: the primary and secondary task can either be directed at the same goal or directed at different goals (or tasks). Let’s dig into this a bit more.2

Byron Reeves pointed out that sometimes “the IMing is about the game.” So we could distinguish whether the goal of the IMing is the same as the goal of the in-game task(s). But this making this kind of distinction requires identity conditions for goals or tasks that enable this distinction. As Ulrich Mayr commented, goals can be at many different levels, so in order to use goal identity as the criterion, one has to select a level in the hierarchy of goals.

Action identities and multitasking

We can think about this hierarchy of goals as the network of identities for an action that are connected with the “by” relation: one does one thing by doing (several) other things. If these goals are the goals of the person as they represent them, then this is the established approach taken by action identification theory (Vallacher & Wegner, 1987) — and this could be valuable lens for thinking about this. Action identification theory claims that people can report an action identity for what they are doing, and that this identity is the “prepotent identity”. This prepotent identity is generally the highest level identity under which the action is maintainable. This means that the prepotent identity is at least somewhat problematic if used to make this distinction between these two types of multitasking because then the distinction would be dependent on, e.g., how automatic or functionally transparent the behaviors involved are.

For example, if I am driving a car and everything is going well, I may represent the action as “seeing my friend Dave”. I may also represent my simultaneous, coordinating phone call with Dave under this same identity. But if driving becomes more difficult, then my prepotent identity will decrease in level in order to maintain the action. Then these two tasks would not share the prepotent action identity.

Prepotent action identities (i.e. the goal of the behavior as represented by the person in the moment) do not work to make this distinction for all uses. But I think that it actually does help makes some good distinctions about the experience of multitasking, especially if we examine change in action identities over time.

To return to case of media multitasking, consider the headline ticker on 24-hour news television. The headline ticker can be more or less related to what the talking heads are going on about. This could be evaluated as a semantic, topical relationship. But considered as a relationship of goals — and thus action identities — we can see that perhaps sometimes the goals coincide even when the content is quite different. For example, my goal may simply to be “get the latest news”, and I may be able to actually maintain this action — consuming both the headline ticker and the talking heads’ statements — under this high level identity. This is an importantly different case then if I don’t actually maintain the action at the level, but instead must descend to — and switch between — two (or more) lower level identities that are associated the two streams of content.

References

Vallacher, R. R., & Wegner, D. M. (1987). What do people think they’re doing? Action identification and human behavior. Psychological Review, 94(1), 3-15. 

  1. The full name is the “Seminar on the impacts of media multitasking on children’s learning and development”. []
  2. As I was writing this, the topic re-emerged in the workshop discussion. I made some comments, but I think I may not have made myself clear to everyone. Hopefully this post is a bit of an improvement. []

Being a lobster and using a hammer: “homuncular flexibility” and distal attribution

Jaron Lanier (2006) calls the ability of humans to learn to control virtual bodies that are quite different than our own “homuncular flexibility”. This is, for him, a dangerous idea. The idea is that the familiar mapping of the body represented in the cortical homunculus is only one option – we can flexibly act (and perceive) using quite other mappings, e.g., to virtual bodies. Your body can be tracked, and these movements can be used to control a lobster in virtual reality – just as one experiences (via head-mounted display, haptic feedback, etc.) the virtual space from the perspective of the lobster under your control.

This name and description makes this sound quite like science fiction. In this post, I assimilate homuncular flexibility to the much more general phenomenon of distal attribution (Loomis, 1992; White, 1970). When I have a perceptual experience, I can just as well attribute that experience – and take it as being directed at or about – more proximal or distal phenomena. For example, I can attribute it to my sensory surface, or I can attribute it to a flower in the distance. White (1970) proposed that more distal attribution occurs when the afference (perception) is lawfully related to efference (action) on the proximal side of that distal entity. That is, if my action and perception are lawfully related on “my side” of that entity in the causal tree, then I will make attributions to that entity. Loomis (1992) adds the requirement that this lawful relationship be successfully modeled. This is close, but not quite right, for if I can make distal attributions even in the absence of an actual lawful relationship that I successfully model, my (perhaps inaccurate) modeling of a (perhaps non-existent) lawful relationship will do just fine.

Just as I attribute a sensory experience to a flower and not the air between me and the flower, so the blind man or the skilled hammer-user can attribute a sensory experience to the ground or the nail, rather than the handle of the cane or hammer. On consideration, I think we can see that these phenomena are very much what Lanier is talking about. When I learn to operate (and, not treated by Lanier, 2006, sense) my lobster-body, it is because I have modeled an efference–afference relationship, yielding a kind transparency. This is a quite familiar kind of experience. It might still be a quite dangerous or exciting idea, but its examples are ubiquitous, not restricted to virtual reality labs.

Lanier paraphrases biologist Jim Boyer as counting this capability as a kind of evolutionary artifact – a spandrel in the jargon of evolutionary theory. But I think a much better just-so evolutionary story can be given: it is this capability – to make distal attributions to the limits of the efference-afference relationships we successfully model – that makes us able to use tools so effectively. At an even more basic and general level, it is this capability that makes it possible for us to communicate meaningfully: our utterances have their meaning in the context of triangulating with other people such that the content of what we are saying is related to the common cause of both of our perceptual experiences (Davidson, 1984).

References

Davidson, D. (1984). Inquiries into Truth and Interpretation. Oxford: Clarendon Press.

Lanier, J. (2006). Homuncular flexibility. Edge.

Loomis, J. M. (1992). Distal attribution and presence. Presence: Teleoperators and Virtual Environments, 1(1), 113-119.

White, B. W. (1970). Perceptual findings with the vision-substitution system. IEEE Transactions on Man-Machine Systems, 11(1), 54-58.