Mechanical Turk

Aardvark’s use of Wizard of Oz prototyping to design their social interfaces

By Dean Eckles. Posted on April 26, 2010

The Wall Street Journal’s Venture Capital Dispatch reports on how Aardvark, the social question asking and answering service recently acquired by Google, used a Wizard of Oz prototype to learn about how their service concept would work without building all the tech before knowing if it was any good.

Aardvark employees would get the questions from beta test users and route them to users who were online and would have the answer to the question. This was done to test out the concept before the company spent the time and money to build it, said Damon Horowitz, co-founder of Aardvark, who spoke at Startup Lessons Learned, a conference in San Francisco on Friday.

“If people like this in super crappy form, then this is worth building, because they’ll like it even more,” Horowitz said of their initial idea.

At the same time it was testing a “fake” product powered by humans, the company started building the automated product to replace humans. While it used humans “behind the curtain,” it gained the benefit of learning from all the questions, including how to route the questions and the entire process with users.

This is a really good idea, as I’ve argued before on this blog and in a chapter for developers of mobile health interventions. What better way to (a) learn about how people will use and experience your service and (b) get training data for your machine learning system than to have humans-in-the-loop run the service?

My friend Chris Streeter wondered whether this was all done by Aardvark employees or whether workers on Amazon Mechanical Turk may have also been involved, especially in identifying the expertise of the early users of the service so that the employees could route the questions to the right place. I think this highlights how different parts of a service can draw on human and non-human intelligence in a variety of ways — via a micro-labor market, using skilled employees who will gain hands-on experience with customers, etc.

I also wonder what UIs the humans-in-the-loop used to accomplish this. It’d be great to get a peek. I’d expect that these were rough around the edges, as was the Aardvark customer-facing UI.

Aardvark does a good job of being a quite sociable agent (e.g., when using it via instant messaging) that also gets out of the way of the human–human interaction between question askers and answerers. I wonder how the language used by humans to coordinate and hand off questions may have played into creating a positive para-social interaction with vark.

Multitasking among tasks that share a goal: action identification theory

By Dean Eckles. Posted on July 15, 2009 (updated August 4, 2009)

Right from the start of today’s Media Multitasking Workshop ((The full name is the “Seminar on the impacts of media multitasking on children’s learning and development”.)), it’s clear that one big issue is just what people are talking about when they talk about multitasking. In this post, I want to highlight the relationship between defining different kinds of multitasking and people’s representations of the hierarchical structure of action.

It is helpful to start with a contrast between two kinds of cases.

Distributing attention towards a single goal

In the first, there is a single task or goal that involves dividing one’s attention, with the targets of attention somehow related, but of course somewhat independent. Patricia Greenfield used Pac-Man as an example: each of the ghosts must be attended to (in addition to Pac-Man himself), and each is moving independently, but each is related to the same larger goal.

Distributing attention among different goals

In the second kind of case, there are two completely unrelated tasks that divide attention, as in playing a game (e.g., solitaire) while also attending to a speech (e.g., in person, on TV). Anthony Wagner noted that in Greenfield’s listing of the benefits and costs of media multitasking, most of the listed benefits applied to the former case, while the costs she listed applied to the latter. So keeping these different senses of multitasking straight is important.

Complications

But the conclusion should not be to think that this is a clear and stable distinction that slices multitasking phenomena in just the right way. Consider one way of putting this distinction: the primary and secondary task can either be directed at the same goal or directed at different goals (or tasks). Let’s dig into this a bit more. ((As I was writing this, the topic re-emerged in the workshop discussion. I made some comments, but I think I may not have made myself clear to everyone. Hopefully this post is a bit of an improvement.))

Byron Reeves pointed out that sometimes “the IMing is about the game.” So we could distinguish whether the goal of the IMing is the same as the goal of the in-game task(s). But making this kind of distinction requires identity conditions for goals or tasks. As Ulrich Mayr commented, goals can be at many different levels, so in order to use goal identity as the criterion, one has to select a level in the hierarchy of goals.

Action identities and multitasking

We can think about this hierarchy of goals as the network of identities for an action that are connected with the “by” relation: one does one thing by doing (several) other things. If these goals are the goals of the person as they represent them, then this is the established approach taken by action identification theory (Vallacher & Wegner, 1987), and this could be a valuable lens for thinking about multitasking. Action identification theory claims that people can report an action identity for what they are doing, and that this identity is the “prepotent identity”. This prepotent identity is generally the highest-level identity under which the action is maintainable. This means that the prepotent identity is at least somewhat problematic if used to make the distinction between these two types of multitasking, because then the distinction would depend on, e.g., how automatic or functionally transparent the behaviors involved are.

For example, if I am driving a car and everything is going well, I may represent the action as “seeing my friend Dave”. I may also represent my simultaneous, coordinating phone call with Dave under this same identity. But if driving becomes more difficult, then my prepotent identity will decrease in level in order to maintain the action. Then these two tasks would not share the prepotent action identity.

Prepotent action identities (i.e., the goal of the behavior as represented by the person in the moment) do not work to make this distinction for all uses. But I think that they actually do help make some good distinctions about the experience of multitasking, especially if we examine change in action identities over time.

To return to the case of media multitasking, consider the headline ticker on 24-hour news television. The headline ticker can be more or less related to what the talking heads are going on about. This could be evaluated as a semantic, topical relationship. But considered as a relationship of goals, and thus of action identities, we can see that perhaps sometimes the goals coincide even when the content is quite different. For example, my goal may simply be to “get the latest news”, and I may be able to actually maintain this action (consuming both the headline ticker and the talking heads’ statements) under this high-level identity. This is an importantly different case than if I don’t actually maintain the action at that level, but instead must descend to, and switch between, two (or more) lower-level identities that are associated with the two streams of content.

References

Vallacher, R. R., & Wegner, D. M. (1987). What do people think they’re doing? Action identification and human behavior. Psychological Review, 94(1), 3-15.

Etching by Da Vinci? Representing legend, culture, and language

By Dean Eckles. Posted on April 27, 2009 (updated May 13, 2009)

A photo I took in Piazza della Signoria of an etching, reportedly a self-portrait of Leonardo da Vinci that he etched behind his back on a dare onto the side of the Palazzo Vecchio.

Is this etching a self-portrait by Leonardo da Vinci created hundreds of years ago? That’s what I was told by a Californian friend who had “gone native” in Florence. Another matter: is this, in fact, a commonly believed and shared legend, and what other variations are there on it?

I shared the story with some fellow visitors in Florence on a lunch-time return to the piazza. Ed Chi tried to verify the rumor using a Web search, but with no success. At least in English, there didn’t seem to be much about this on the Web. (See my photo and comments on Flickr.)

I posted the photo on Flickr. I asked questions on LinkedIn and Yahoo! Answers, with no success. I also asked for help from workers on Mechanical Turk. Here’s part of how I asked for help:

There is a portrait etched in stone on the wall of Palazzo Vecchio in Piazza della Signoria in Florence (Firenza), Italy. It is close behind the copy of the David there. I have heard that there is a legend that this is a self-portrait by Leonardo da Vinci. I am looking for any information about this legend, alternate versions of the legend, or information about the real source of the portrait.

What results have been offered seem to suggest that this legend exists — though perhaps it is “actually” (at least as captured online, since perhaps the Leonardo theorists aren’t as active digital content creators) about Michelangelo:

Palazzo Vecchio in Italian Wikipedia
Florentine Legends: Fact or Fiction (in Italian)
Curiosities in Florence

The best way of finding out actually seemed to be my Flickr photo itself, since that’s where Daniel Witting provided the first two links above. However, this was a few months after the photo was first posted to Flickr. Turkers also provided a couple of useful links (“Curiosities” above) on a shorter schedule and at a higher price. (I should have also tried uClue, where many former Google Answers researchers now work. This was recommended by Max Harper, who has studied Q&A sites in detail.)

–

Question and answer services along the lines of Yahoo! Answers rose to global (and U.S.) significance only after success in Korea, where Naver Knowledge iN pioneered the use of an online community to power a Q&A site. A major motivation in Korea was the limited amount of Korean content online. With Naver’s offering, Korea’s Internet-savvy, English-literate population made information newly available in Korean (and did plenty of other interesting work).

This is not as significant a motivation for Q&A sites used by English-speaking folks in the U.S., but the present case is an exception.

Some of the questions that made this case interesting to me:

What culturally shared beliefs become manifest online? During this whole process, I and others wondered whether perhaps this local legend was only shared orally. It seems that it is represented online after all (at least the Michelangelo variant), but it could have been otherwise.
How does the pair of languages a task requires knowledge of determine the processes, structures, and communities that are optimal for completing the task? For example, it seems quite important whether the target or source language has many more speakers than the other. (One could think about this simplistically in terms of conditional probabilities of skill with language A given skill with language B and vice versa.)
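
The conditional-probability framing in the last question can be made concrete with a small sketch. The speaker counts below are made up purely for illustration, and the function name is hypothetical. It only shows how asymmetric the two probabilities become when one language has far more speakers than the other.

```python
def conditional_skill_probabilities(speakers_a, speakers_b, speakers_both):
    """Return (P(A | B), P(B | A)) from speaker counts.

    speakers_a / speakers_b: people skilled in language A / language B.
    speakers_both: people skilled in both (a subset of each group).
    """
    if speakers_a <= 0 or speakers_b <= 0:
        raise ValueError("each language needs at least one speaker")
    if speakers_both > min(speakers_a, speakers_b):
        raise ValueError("bilingual count cannot exceed either total")
    p_a_given_b = speakers_both / speakers_b
    p_b_given_a = speakers_both / speakers_a
    return p_a_given_b, p_b_given_a


# Made-up numbers: A is a large language, B a much smaller one.
p_a_given_b, p_b_given_a = conditional_skill_probabilities(
    speakers_a=1_000_000, speakers_b=50_000, speakers_both=10_000
)
```

With these invented counts, a randomly chosen speaker of the smaller language is far more likely to also know the larger one than the reverse, which suggests recruiting helpers from the smaller-language community when a task needs both.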

Reprioritizing human intelligence tasks for low latency and high throughput on Mechanical Turk

By Dean Eckles. Posted on July 24, 2008 (updated December 3, 2009)

Amazon Mechanical Turk is a platform and market for human intelligence tasks (HITs) that are submitted by requesters and completed by workers (or “turkers”). Each HIT is associated with a payment, often a few cents. This post covers some basics of Mechanical Turk and shows how its lack of designed-in support for dynamic reprioritization is problematic for some uses. I also mention some other factors that influence latency and throughput.

With mTurk one can create a HIT that asks someone to rate some search results for a query, evaluate the credibility of a Wikipedia article, draw a sheep facing left, enter names for a provided color, annotate a photo of a person with pose information, or create a storyboard illustrating a new product idea. So Mechanical Turk can be used in many ways for basic research, building a training set for machine learning, or actually enabling a (perhaps prototype) service in use through a kind of Wizard-of-Oz approach. Additionally, I’ve used mTurk to code images captured by participants in a lab experiment (more on this in another post or article).

When creating HITs, a requester can specify a QuestionForm (QF) (e.g., via command line tools or an SDK) that is then presented to the worker by Amazon. This can include images, free text answers, multiple choice, etc. Additionally, one can embed Flash or Java objects in it. But the easiest way of creating HITs is to use a plain QF and not create a Java or Flash application of one’s own. This is especially true for HITs that are handled well by the basic question form. The other option is to create an ExternalQuestion (EQ), which is hosted on one’s own server and is displayed in an iframe. This provides greater freedom but requires additional development, and you must host the page yourself (though you can do so through Amazon’s S3). QF HITs (without embeds) also offer a familiar interface to workers (though it is possible to create a more efficient, custom interface by, e.g., making all the targets larger). So when possible, it is often preferable to use a QF rather than an EQ.
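
To give a sense of how little an EQ specification contains, here is a minimal sketch that builds the XML wrapper for an ExternalQuestion. The element names and the schema namespace are written from my recollection of Amazon's ExternalQuestion schema and should be checked against the current Mechanical Turk documentation. The helper name, the example URL, and the default frame height are hypothetical.

```python
from xml.sax.saxutils import escape

# Assumed namespace for the ExternalQuestion schema; verify before use.
EXTERNAL_QUESTION_XMLNS = (
    "http://mechanicalturk.amazonaws.com/"
    "AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd"
)


def build_external_question(url, frame_height=600):
    """Return ExternalQuestion XML pointing at a page you host yourself."""
    return (
        f'<ExternalQuestion xmlns="{EXTERNAL_QUESTION_XMLNS}">'
        f"<ExternalURL>{escape(url)}</ExternalURL>"
        f"<FrameHeight>{int(frame_height)}</FrameHeight>"
        "</ExternalQuestion>"
    )


# Hypothetical URL; the page behind it decides what the worker actually sees.
question_xml = build_external_question(
    "https://example.com/task?batch=1&mode=demo"
)
```

The XML names only a URL and a frame height. Everything the worker sees is decided by the hosted page at the time it is loaded, which is what makes the later reprioritization trick possible.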

For some of the uses of mTurk for powering a service, it can be important to minimize latency for specific HITs ((I use the term HIT somewhat loosely in this article. There are at least three uses that each differ in their identity conditions. (1) There are HITs considered as human intelligence tasks, and thus divided as we divide tasks; this means that a HIT in another sense can be composed of multiple HITs in this sense (tasks or sub-tasks). (2) There are HITs in Amazon’s technical sense of the term: a HIT is something that has the same HIT ID and therefore has the same specification. In QF HITs without embeds, this means all instances (assignments) of a HIT are the same in content; but in EQ HITs this is not necessarily true, since the content can be determined when assigned. (3) Finally, there is what Amazon calls assignments, specific instances of a HIT that are only completed once.)), including prioritizing particular new HITs over previously created HITs. For example, after some HIT has not been completed for a specific period after creation, it may still be important to complete it, but when it is completed may become less important. This can happen easily if the value of a HIT being completed has a sharp drop-off after some time.

This should be done while maintaining high throughput; that is, you don’t want to reduce the rate at which your HITs are completed. When there are more HITs of the same type, workers can check a box to immediately start the next HIT of the same type when they submit the current one (see screenshot). Workers will often complete many HITs of the same type in a row. So throughput can drop substantially if any workers run out of HITs of the same type at any point: they may switch to another HIT type, and even if they return once your HITs appear again, there will be a delay. As we’ll see, these two requirements don’t seem to be well met by the platform, or at least by certain uses of it.

Mechanical Turk does not provide a mechanism for prioritizing HITs of the same type, so without deleting all but particular high-priority HITs of that type, there is not a way to ensure that some particular HIT gets done before the rest. And deleting the other HITs would hurt throughput and increase latency for any new high-priority HITs added in the near future (since workers won’t simply start these once they finish their previous HITs).

EQ HITs allow one to avoid this problem. Unlike with QF HITs (without Flash and Java embeds), one does not have to specify the full content of the HIT in advance. When a worker accepts an EQ HIT, you can dynamically serve up whichever task you want, depending on changing priorities. But this means that you can’t take advantage of, e.g., the simplicity of creating and managing data from QF HITs. So though there are ways of coping, adding dynamic reprioritization to Mechanical Turk would be a boon for time-sensitive uses.
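
As a rough sketch of what the server behind an EQ HIT might do, the pool below picks, at the moment a worker accepts an assignment, whichever pending task is currently most valuable. Everything here is hypothetical: the class and method names are invented, and the exponential half-life decay is just one assumed way to model a value that drops off with age. None of it is part of the Mechanical Turk API.

```python
import time


class TaskPool:
    """Hypothetical server-side pool behind a single EQ HIT type."""

    def __init__(self):
        # task_id -> (created_at, base_value, half_life_seconds)
        self._tasks = {}

    def add(self, task_id, base_value, half_life_seconds, now=None):
        created_at = time.time() if now is None else now
        self._tasks[task_id] = (created_at, base_value, half_life_seconds)

    def current_value(self, task_id, now):
        created_at, base_value, half_life = self._tasks[task_id]
        age = max(0.0, now - created_at)
        # Assumed decay model: value halves every `half_life` seconds.
        return base_value * 0.5 ** (age / half_life)

    def next_task(self, now=None):
        """Remove and return the id of the most valuable task right now."""
        if not self._tasks:
            return None
        now = time.time() if now is None else now
        best = max(self._tasks, key=lambda t: self.current_value(t, now))
        del self._tasks[best]
        return best


pool = TaskPool()
pool.add("old-task", base_value=1.0, half_life_seconds=60, now=0)
pool.add("new-task", base_value=1.0, half_life_seconds=60, now=300)
first = pool.next_task(now=300)   # the fresh task outranks the stale one
second = pool.next_task(now=300)  # the stale task is still served afterwards
```

Because stale tasks stay in the pool rather than being deleted, workers who chain HITs of this type keep receiving work, so throughput is preserved while new high-priority tasks jump the queue.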

There are, of course, other factors that influence latency and throughput on mTurk when (EQ) HITs are reprioritized. Here are a few:

HIT and sub-task duration. How long does it take for workers to complete a HIT, which may be composed of multiple sub-tasks? A worker cannot be assigned a new HIT until they complete (or return) the previous one. This can be somewhat avoided by creating longer HITs that are subdivided into dynamically selected sub-tasks. This can be done with an EQ HIT or an embedded Flash or Java application in a QF HIT. But the sub-task duration is always a limiting factor, unless one is willing to force abortion of the current sub-task, replacing it while still in progress (with an EQ, Flash, or Java).
Available workers. How many workers are logged into mTurk and completing tasks? How many are currently switching HIT types? This can vary with the time of day.
Appeal of your HITs. How much do workers like your HITs — are they fun? How much do you pay for how much you ask? How many of their completed assignments do you approve?
Reliability. How accurate or precise must your results be? How many workers do you need to complete a HIT before you have reliable results? Do other workers need to complete meta-HITs before the data can be used?
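
The reliability question directly affects latency, since a result cannot be used until enough workers have answered. One common way to handle this is a majority vote with an agreement threshold, sketched below. It is an illustration under assumed parameters (the function name, the minimum of three votes, and the 60% agreement threshold are all hypothetical), not a method prescribed by Mechanical Turk.

```python
from collections import Counter


def aggregate_answers(answers, min_votes=3, min_agreement=0.6):
    """Return the majority answer, or None if more assignments are needed.

    answers: labels submitted by different workers for the same task.
    min_votes: assumed minimum number of workers before trusting a result.
    min_agreement: assumed share of workers that must agree.
    """
    if len(answers) < min_votes:
        return None
    label, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return label
    return None


settled = aggregate_answers(["left", "left", "right"])
too_few = aggregate_answers(["left", "right"])
no_consensus = aggregate_answers(["left", "right", "up"])
```

A `None` result means the task should go back into the pool for another assignment, so stricter thresholds buy accuracy at the cost of more payments and longer waits.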