Reprioritizing human intelligence tasks for low latency and high throughput on Mechanical Turk

Amazon Mechanical Turk is a platform and market for human intelligence tasks (HITs) that are submitted by requesters and completed by workers (or “turkers”).  Each HIT is associated with a payment, often a few cents. This post covers some basics of Mechanical Turk and shows its lack of designed-in support for dynamic reprioritization is problematic for some uses. I also mention some other factors that influence latency and throughput.

With mTurk one can create a HIT that asks someone to rate some search results for a query, evaluate the credibility of a Wikipedia article, draw a sheep facing left, enter names for a provided color, annotate a photo of a person with pose information, or create a storyboard illustrating a new product idea. So Mechanical Turk can be used in many ways for basic research, building a training set for machine learning, or actually enabling a (perhaps prototype) service in use through a kind of Wizard-of-Oz approach. Additionally, I’ve used mTurk to code images captured by participants in a lab experiment (more on this in another post or article).

When creating HITs, a requester can specify a QuestionForm (QF) (e.g., via command line tools or an SDK) that is then presented to the worker by Amazon. This can include images, free text answers, multiple choice, etc. Additionally one can embed Flash or Java objects in it. But the easiest way of creating HITs is to use a QF and not create a Java or Flash application of one’s own. This is especially true for HITs that are handled well by the basic question form. The other option is to create an ExternalQuestion (EQ), which is hosted on one’s own server and is displayed in an iFrame. This provides greater freedom but requires additional development and it is you that must host the page (though you can do so through Amazon’s S3). QF HITs (without embeds) also offer a familiar interface to workers (though it is possible to create a more efficient, custom interface by, e.g., making all the targets larger). So when possible, it is often preferable to use a QF rather than an EQ.

For some of the uses of mTurk for powering a service, it can be important to minimize latency for specific HITs1, including prioritizing particular new HITs over previously created HITs. For example, after some HIT has not been completed for a specific period after creation, it may still be important to complete it, but when it is completed may become less important. This can happen easily if the value of a HIT being completed has a sharp drop off after some time.

This should be done while maintaining high throughput; that is, you don’t want to reduce the rate at which your HITs are completed. When there are more HITs of the same type, workers can check a box to immediately start the next HIT of the same type when they submit the current one (see screenshot). Workers will often complete many HITs of the same type in a row. So throughput can drop substantially if any workers run out of HITs of the same type at any point: they may switch to another HIT type, or if they do your HITs once more appear, then there will be a delay. As we’ll see, these two requirements don’t seem to be well met by the platform — or at least certain uses of it.

Mechanical Turk does not provide a mechanism for prioritizing HITs of the same type, so without deleting all but particular high-priority HITs of that type, there is not a way to ensure that some particular HIT gets done before the rest. And deleting the other HITs would hurt throughput and increase latency for any new high-priority HITs added in the near future (since workers won’t simply start these once they finish their previous HITs).

EQ HITs allow one to avoid this problem. Unlike with QF HITs (without Flash and Java embeds), one does not have to specify the full content of the HIT in advance. When a worker accepts an EQ HIT, you can dynamically serve up the HIT you want to depending on changing priorities. But this means that you can’t take advantage of, e.g., the simplicity of creating and managing data from QF HITs. So though there are ways of coping, adding dynamic reprioritization to Mechanical Turk would be a boon for time-sensitive uses.

There are, of course, other factors that influence latency and throughput on mTurk when (EQ) HITs are reprioritized. Here are a few:

  • HIT and sub-tasks duration. How long does it take for workers to complete a HIT, which may be composed of multiple sub-tasks? A worker cannot be assigned a new HIT until they complete (or reject) the previous one. This can be somewhat avoided by creating longer HITs that are subdivided into dynamically selected sub-tasks. This can be done with an EQ HIT or an embedded Flash or Java application in a QF HIT. But the sub-task duration is always a limiting factor, unless one is willing to force abortion of the current sub-task, replacing it will still in progress (with an EQ, Flash, or Java).
  • Available workers. How many workers are logged into mTurk and completing task? How many are currently switching HIT types? This can vary with the time of day.
  • Appeal of your HITs. How much do workers like your HITs — are they fun? How much do you pay for how much you ask? How many of their completed assignments do you approve?
  • Reliability. How accurate or precise must your results be? How many workers do you need to complete a HIT before you have reliable results? Do other workers need to complete meta-HITs before the data can be used?
  1. I use the term HIT somewhat loosely in this article. There are at least three uses that each differ in their identity conditions. (1) There are HITs considered as human intelligence tasks, and thus divided as we divide tasks; this means that a HIT in another sense can be composed of multiple HITs in this sense (tasks or sub-tasks). (2) There are HITs in Amazon’s technical sense of the term: a HIT is something that has the same HIT ID and therefore has the same specification. In QF HITs without embeds, this means all instances (assignments) of a HIT are the same in content; but in EQ HIT this is not necessarily true, since the content can be determined when assigned. (3) Finally, there is what Amazon calls assignments, specific instances of a HITs that are only completed once. []

2 Responses to this post.

  1. Posted by Mike Krieger on July 25th, 2008 at 9:46 am

    Hi Dean,

    Great post! I think you’ve really nailed down the issues that affect throughput on mTurk. I was just finishing up a similar post when I read yours this morning, so I rewrote it to include your findings and elaborate on one additional one (task visibility) that I’ve come across as a throughput factor. Let me know what you think,


  2. Posted by Brendan O'Connor on July 25th, 2008 at 11:20 pm

    Yeah, there’s definitely room to improve the decisionmaking of which questions to present when. Lots of the experiments on our blog use straight up MTurk with static EQ’s, but we’re moving to having all question htmls dynamically generated, so we can follow a strict priority ordering of questions, or weighted random sample them, whatever.

    The worst is if you have a batch of HITs and they start stalling out at 50% done, sometimes if you kill the job then resubmit those 50%, with the exact same payments and everything, suddenly you can get much higher throughput. I think what’s happening is the HIT group gets buried in the Turkers’ HIT browsing interface because it’s not new. Totally lame.

Respond to this post