The Wall Street Journal’s Venture Capital Dispatch reports on how Aardvark, the social question asking and answering service recently acquired by Google, used a Wizard of Oz prototype to learn about how their service concept would work without building all the tech before knowing if it was any good.
Aardvark employees would get the questions from beta test users and route them to users who were online and would have the answer to the question. This was done to test out the concept before the company spent the time and money to build it, said Damon Horowitz, co-founder of Aardvark, who spoke at Startup Lessons Learned, a conference in San Francisco on Friday.
“If people like this in super crappy form, then this is worth building, because they’ll like it even more,” Horowitz said of their initial idea.
At the same time it was testing a “fake” product powered by humans, the company started building the automated product to replace humans. While it used humans “behind the curtain,” it gained the benefit of learning from all the questions, including how to route the questions and the entire process with users.
This is a really good idea, as I’ve argued before on this blog and in a chapter for developers of mobile health interventions. What better way to (a) learn about how people will use and experience your service and (b) get training data for your machine learning system than to have humans-in-the-loop run the service?
My friend Chris Streeter wondered whether this was all done by Aardvark employees or whether workers on Amazon Mechanical Turk may have also been involved, especially in identifying the expertise of the early users of the service so that the employees could route the questions to the right place. I think this highlights how different parts of a service can draw on human and non-human intelligence in a variety of ways — via a micro-labor market, using skilled employees who will gain hands-on experience with customers, etc.
I also wonder what UIs the humans-in-the-loop used to accomplish this. It’d be great to get a peak. I’d expect that these were certainly rough around the edges, as was the Aardvark customer-facing UI.
Aardvark does a good job of being a quite sociable agent (e.g., when using it via instant messaging) that also gets out of the way of the human–human interaction between question askers and answers. I wonder how the language used by humans to coordinate and hand-off questions may have played into creating a positive para-social interaction with vark.
Today I’m attending the Social Mobile Media Workshop at Stanford University. It’s organized by researchers from Stanford’s HStar, Tampere University of Technology, and the Naval Postgraduate School. What follows is some still jagged thoughts that were prompted by the presentation this morning, rather than a straightforward account of the presentations.1
A big theme of the workshop this morning has been transitions among production and consumption — and the critical role of annotations and context-awareness in enabling many of the user experiences discussed. In many ways, this workshop took me back to thinking about mobile media sharing, which was at the center of a good deal of my previous work. At Yahoo! Research Berkeley we were informed by Marc Davis’s vision of enabling “the billions of daily media consumers to become daily media producers.” With ZoneTag we used context-awareness, sociality, and simplicity to influence people to create, annotate, and share photos from their mobile phones (Ahern et al. 2006, 2007).
Enabling and encouraging these behaviors (for all media types) remains a major goal for designers of participatory media; and this was explicit at several points throughout the workshop (e.g., in Teppo Raisanen’s broad presentation on persuasive technology). This morning there was discussion about the technical requirements for consuming, capturing, and sending media. Cases that traditionally seem to strictly structure and separate production and consumption may be (1) in need of revision and increased flexibility or (2) actually already involve production and consumption together through existing tools. Media production to be part of a two-way communication, it must be consumed, whether by peers or the traditional producers.
As an example of the first case, Sarah Lewis (Stanford) highlighted the importance of making distance learning experiences reciprocal, rather than enforcing an asymmetry in what media types can be shared by different participants. In a past distance learning situation focused on the African ecosystem, it was frustrating that video was only shared from the participants at Stanford to participants at African colleges — leaving the latter to respond only via text. A prototype system, Mobltz, she and her colleagues have built is designed to change this, supporting the creation of channels of media from multiple people (which also reminded me of Kyte.tv).
As an example of the second case, Timo Koskinenen (Nokia) presented a trial of mobile media capture tools for professional journalists. In this case the work flow of what is, in the end, a media production practice, involves also consumption in the form of review of one’s own materials and other journalists, as they edit, consider what new media to capture.
Throughout the sessions themselves and conversations with participants during breaks and lunch, having good annotations continued to come up as a requirement for many of the services discussed. While I think our ZoneTag work (and the free suggested tags Web service API it provides) made a good contribution in this area, as has a wide array of other work (e.g., von Ahn & Dabbish 2004, licensed in Google Image Labeler), there is still a lot of progress to make, especially in bringing this work to market and making it something that further services can build on.
Ahern, S., Davis, M., Eckles, D., King, S., Naaman, M., Nair, R., et al. (2006). ZoneTag: Designing Context-Aware Mobile Media Capture. In Adjunct Proc. Ubicomp (pp. 357-366).
Ahern, S., Eckles, D., Good, N. S., King, S., Naaman, M., & Nair, R. (2007). Over-exposed?: privacy patterns and considerations in online and mobile photo sharing. In Proc. CHI 2007 (pp. 357-366). ACM Press.
Ahn, L. V., & Dabbish, L. (2004). Labeling images with a computer game. In Proc. CHI 2004 (pp. 319-326).
- Blogging something at this level of roughness is still new for me… [↩]
Amazon Mechanical Turk is a platform and market for human intelligence tasks (HITs) that are submitted by requesters and completed by workers (or “turkers”). Each HIT is associated with a payment, often a few cents. This post covers some basics of Mechanical Turk and shows its lack of designed-in support for dynamic reprioritization is problematic for some uses. I also mention some other factors that influence latency and throughput.
With mTurk one can create a HIT that asks someone to rate some search results for a query, evaluate the credibility of a Wikipedia article, draw a sheep facing left, enter names for a provided color, annotate a photo of a person with pose information, or create a storyboard illustrating a new product idea. So Mechanical Turk can be used in many ways for basic research, building a training set for machine learning, or actually enabling a (perhaps prototype) service in use through a kind of Wizard-of-Oz approach. Additionally, I’ve used mTurk to code images captured by participants in a lab experiment (more on this in another post or article).
When creating HITs, a requester can specify a QuestionForm (QF) (e.g., via command line tools or an SDK) that is then presented to the worker by Amazon. This can include images, free text answers, multiple choice, etc. Additionally one can embed Flash or Java objects in it. But the easiest way of creating HITs is to use a QF and not create a Java or Flash application of one’s own. This is especially true for HITs that are handled well by the basic question form. The other option is to create an ExternalQuestion (EQ), which is hosted on one’s own server and is displayed in an iFrame. This provides greater freedom but requires additional development and it is you that must host the page (though you can do so through Amazon’s S3). QF HITs (without embeds) also offer a familiar interface to workers (though it is possible to create a more efficient, custom interface by, e.g., making all the targets larger). So when possible, it is often preferable to use a QF rather than an EQ.
For some of the uses of mTurk for powering a service, it can be important to minimize latency for specific HITs1, including prioritizing particular new HITs over previously created HITs. For example, after some HIT has not been completed for a specific period after creation, it may still be important to complete it, but when it is completed may become less important. This can happen easily if the value of a HIT being completed has a sharp drop off after some time.
This should be done while maintaining high throughput; that is, you don’t want to reduce the rate at which your HITs are completed. When there are more HITs of the same type, workers can check a box to immediately start the next HIT of the same type when they submit the current one (see screenshot). Workers will often complete many HITs of the same type in a row. So throughput can drop substantially if any workers run out of HITs of the same type at any point: they may switch to another HIT type, or if they do your HITs once more appear, then there will be a delay. As we’ll see, these two requirements don’t seem to be well met by the platform — or at least certain uses of it.
Mechanical Turk does not provide a mechanism for prioritizing HITs of the same type, so without deleting all but particular high-priority HITs of that type, there is not a way to ensure that some particular HIT gets done before the rest. And deleting the other HITs would hurt throughput and increase latency for any new high-priority HITs added in the near future (since workers won’t simply start these once they finish their previous HITs).
EQ HITs allow one to avoid this problem. Unlike with QF HITs (without Flash and Java embeds), one does not have to specify the full content of the HIT in advance. When a worker accepts an EQ HIT, you can dynamically serve up the HIT you want to depending on changing priorities. But this means that you can’t take advantage of, e.g., the simplicity of creating and managing data from QF HITs. So though there are ways of coping, adding dynamic reprioritization to Mechanical Turk would be a boon for time-sensitive uses.
There are, of course, other factors that influence latency and throughput on mTurk when (EQ) HITs are reprioritized. Here are a few:
- HIT and sub-tasks duration. How long does it take for workers to complete a HIT, which may be composed of multiple sub-tasks? A worker cannot be assigned a new HIT until they complete (or reject) the previous one. This can be somewhat avoided by creating longer HITs that are subdivided into dynamically selected sub-tasks. This can be done with an EQ HIT or an embedded Flash or Java application in a QF HIT. But the sub-task duration is always a limiting factor, unless one is willing to force abortion of the current sub-task, replacing it will still in progress (with an EQ, Flash, or Java).
- Available workers. How many workers are logged into mTurk and completing task? How many are currently switching HIT types? This can vary with the time of day.
- Appeal of your HITs. How much do workers like your HITs — are they fun? How much do you pay for how much you ask? How many of their completed assignments do you approve?
- Reliability. How accurate or precise must your results be? How many workers do you need to complete a HIT before you have reliable results? Do other workers need to complete meta-HITs before the data can be used?
- I use the term HIT somewhat loosely in this article. There are at least three uses that each differ in their identity conditions. (1) There are HITs considered as human intelligence tasks, and thus divided as we divide tasks; this means that a HIT in another sense can be composed of multiple HITs in this sense (tasks or sub-tasks). (2) There are HITs in Amazon’s technical sense of the term: a HIT is something that has the same HIT ID and therefore has the same specification. In QF HITs without embeds, this means all instances (assignments) of a HIT are the same in content; but in EQ HIT this is not necessarily true, since the content can be determined when assigned. (3) Finally, there is what Amazon calls assignments, specific instances of a HITs that are only completed once. [↩]