Mechanical Turk - The Poisoned Well?
Todd M. Gureckis · 16 minutes to read
Note: Several of the ideas here came up in discussion with Jordan Suchow and Dave Eargle, but I won't drag them down if this ends up controversial for some reason.
I've recently seen several academic researchers complain (1, 2) on social media that the data they obtain from Amazon Mechanical Turk (AMT) has degraded in "quality" in recent months. This reminds me a bit of the famous "AMT bot panic" of the summer of 2018. Has Amazon Mechanical Turk become a "poisoned well"?[1]
Although this is a genuine concern for my own research studies[2], my interest in this issue is maybe a bit more academic. What does it mean for an online labor market to "degrade in quality"? I suppose labor markets, in general, do sometimes "degrade" in that the productivity of the labor supply goes down.
For instance, people often talk about the "brain drain" from small rural towns within the US as weakening the labor market in these areas. The logic is that the more capable workers seek higher pay and more interesting jobs in major cities. The people who stay are generally less adaptable and possibly worse workers for some types of jobs.
This example is quite different from what has been alleged about AMT, particularly in recent months. The current zeitgeist is that the COVID-19 pandemic caused mass unemployment in the US as well as the need for social distancing. As a result, many more people started working on crowdworking sites like AMT to make a few extra dollars while stuck at home. Given the flexibility of the work and the ability to do it online, it does make sense that it would be appealing for at least some of these unemployed workers. So the idea is that this rush of new workers into the AMT labor market somehow lowered the quality of the work being done.
I'm not an economist and so I might not know all the issues here, but it seems counter-intuitive for a labor market to decrease in quality at the same time that there is an increase in available workers. One obvious reason: unless the supply of work also goes up, an influx of workers increases competition in the market, and generally that should result in improvements in labor productivity/quality.[3]
Theories of decline
So what are some of the more plausible theories of the "decline"? One version of the story might be that the new entrants to this labor market are simply less able. They perhaps have lower literacy or educational background and so perform more poorly on tests assessing cognitive ability. This could be true, although I haven't seen any particular analysis of who these new workers might be (UPDATE 6/9/2021: David Rand shared a paper led by Antonio Arechar looking at changes in the composition of the AMT worker population during the pandemic). Generally, it shouldn't really be a problem for scientific uses of AMT if the labor pool is now more diverse and more representative of cognitive ability within the US or the world. Perhaps all the past studies misestimated many effects by effectively sampling slightly more tech-oriented people who wanted to use AMT before it was a life-saving income source.
However, I'm a little skeptical about this theory because it makes such strong and unflattering assumptions about the new entrants to the labor market. Lots of people who previously were too busy holding down a regular job might, upon entering the AMT market, increase the overall quality of work done. Your temporarily unemployed bartender or Broadway actress is probably reasonably high on ability for most complex cognitive tasks. Also, the complaint researchers are often making is not that the data simply reflect lower levels of performance, but that the data from some workers are fully and clearly junk (failing basic sanity checks).
Perhaps a more reasonable explanation is that a greater fraction of the pool is doing it for money and not for fun. Historically, I'd guess that at least some people did AMT tasks the same way that people do crossword puzzles. Each task or HIT is a little puzzle to solve, and why not make a bit of money while doing it? The new entrants are there to put in their time and make enough money to offset their lost income. This might mean that they focus their energy more efficiently, using techniques that Turkers sometimes call "grinding," where you try to maximize your efficiency at tasks. If so, these new workers might be more sophisticated in the sense of looking for loopholes and tricks to get through a task quickly, including tools that might help fill out forms (e.g., scripts that auto-fill text fields). The even more extreme version of this is tech-savvy workers (or collections of workers) who program or install bots that try to automate the completion of tasks.[4]
Anyway, let's just say I like this latter theory better because it lines up with the idea that (1) more workers are doing this for real money, not just "volunteerism," and (2) these workers are highly optimized and professional (for lack of a better term). If you work on something every day in order to supplement your income, you probably do figure out tricks for efficiency.
What are the scientists to do?[5]
I think it'd be premature to say "ok, fuck it, the workers have gotten too smart and are hacking my task." Any reaction like that assumes something about online data collection that was never true. It has always been true that paid work online was exploitable by people trying to maximize profit. At best, researchers simply ignored this fact, or managed to get by just because the sophistication and prevalence of the people exploiting their particular task was low.
There are 7+ billion people on the planet. Many of them live in countries (or even regions within rich countries) with extremely low wages and are in poverty. If researchers at some wealthy university start offering up ways to earn enough income to feed a family by clicking on some things on a tablet or cell phone, you had better be assured that people will figure out how to earn that money, and automate it with layers of increasing sophistication. This is, in my view, a singularly positive feature of the Internet, not a bug or complication.
One consequence is that doing high-quality online research is actually very, very difficult. Researchers often seem to think that collecting behavioral data online should be as simple as putting a form up on the web and waiting for good, honest data to come flowing in at very low cost. That might have worked when the pool was mostly people just doing stuff for fun. However, those days are gone and are likely never going to return.
It is a mistake to lose sight of the fact that any type of behavioral data collection is a very complex topic, both ethically and in terms of research design. On the science side, it has elements of an "art" or "craft," just like any other careful measurement (say, wet lab benchwork). Tacit knowledge, understanding, and adaptation to changing technologies have to be front and center in all online research efforts.
Reframing the issue in terms of incentives
One helpful way to think about this is that the incentives of the researcher and those of the participant/subject need to align. When there is misalignment, one side is likely not going to get what it wanted. For example, many researchers in decision-making prefer "incentive-compatible" research designs. The idea is that the setup of the experiment asks the subject to try to earn as much money as they can. However, in order to earn the most money, the subject needs to basically cooperate with the design of the task (e.g., read the instructions carefully and perform the task with high levels of sustained attention).
This is just one example of aligning incentives between researchers and participants. And even this can get quite complicated with things like satisficing behavior ("I've made enough money from this task at this point so I will dial it in the rest of the time," etc.). There are solutions worked out to almost all of these issues (e.g., randomly selecting one trial from the entire task as the basis for the reward). However, it still requires very careful research design and really the right type of research question. Opinion polling, where there is no ground-truth answer that can be incentivized, is harder to study in these environments and generally relies more on indirect methods for assessing data quality.
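To make the "one random trial pays" idea concrete, here is a minimal sketch in Python. It assumes the per-trial payoffs have already been recorded server-side; the function and variable names (`compute_bonus`, `trial_payoffs`, the $3 cap) are invented for illustration and not taken from any particular toolkit.

```python
import random

def compute_bonus(trial_payoffs, max_bonus=3.00):
    """Pay out one trial, chosen uniformly at random, as the bonus.

    trial_payoffs: hypothetical list of dollar amounts earned on each trial,
    taken from server-side records rather than anything the client reports.
    """
    if not trial_payoffs:
        return 0.0
    selected = random.choice(trial_payoffs)            # every trial equally likely
    return round(min(max(selected, 0.0), max_bonus), 2)

# Example: a 100-trial task where each trial paid either $0 or $0.50
payoffs = [random.choice([0.0, 0.50]) for _ in range(100)]
print(compute_bonus(payoffs))
```

Because any trial could be the one that pays, a participant's best strategy is to stay engaged on every trial rather than coasting after the first few.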
What to do when bots attack
Perhaps a somewhat extreme version of this "poisoned well" idea is that the entire labor pool on AMT could become contaminated with bots -- computer scripts that complete the tasks in semi-human ways but distort the scientific conclusions one draws from a sample. This is a real risk, but one that I (and Dave and Jordan) think can still be addressed under the framework of incentives. The idea is that writing a bot is costly. If your task is a small one-off study, it simply isn't worth it for a bot developer to target your research. However, if your task/script is built on a common platform, for instance something like Qualtrics, Google Forms, or even jsPsych, then it might make more sense to study the inner workings of these webpages and develop a good bot that can fake responses.
Many researchers say they use Qualtrics because it is so easy to build the study this way compared to custom programming. However, the widespread use of these tools does open loopholes for technical exploitation. There are definitely ways around this and some of these tool builders do try to track things like WorkerIds that might be judged to be bots, etc...
Ultimately, raising the cost for bot makers should increase the quality of data. It is as simple as that. Costs can be raised by making it more difficult to write a bot (which requires understanding a bit about what tools bot developers are likely to use) and also by keeping the overall prevalence of any one tool low (so that there isn't a huge susceptible population of tasks from one scripting approach). This is a kind of "herd immunity" concept for bots, where diversity is your shield.
Interface elements like CAPTCHAs and other human intelligence tasks folded into studies can also help by raising the cost to bot makers. The idea is that it is not worth trying to defeat a really good/novel CAPTCHA to earn a $3 bonus from a researcher with a total study budget of $300. However, it makes a lot of sense to defeat a general-purpose CAPTCHA that is used everywhere on the web, because the upside is being able to auto-complete millions of dollars worth of possible tasks. A researcher-friendly version of CAPTCHAs is sanity-check questions that you can use to exclude people. If you have several common-sense questions that any reasonable human should agree on, then bots will often fail them and you can use this as a filter. Nick Byrd has a nice blog post detailing several of these methods.
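As a rough illustration of that filter (a sketch, not the method from any particular study), suppose each participant's answers to a few common-sense items are stored in a dict; the items and the "all correct" threshold below are made up for the example.

```python
# Hypothetical common-sense items and the answer any attentive human should give.
SANITY_CHECKS = {
    "A feather is heavier than a bowling ball.": False,
    "Ice is colder than boiling water.": True,
    "Most people have more than two eyes.": False,
}

def passes_sanity_checks(responses, min_correct=3):
    """responses: dict mapping item text -> participant's True/False answer."""
    n_correct = sum(
        responses.get(item) == answer for item, answer in SANITY_CHECKS.items()
    )
    return n_correct >= min_correct

participant = {
    "A feather is heavier than a bowling ball.": False,
    "Ice is colder than boiling water.": True,
    "Most people have more than two eyes.": False,
}
print(passes_sanity_checks(participant))  # True -> keep this participant's data
```

The point is not that these particular items are clever; it is that a generic bot has no cheap way to answer an arbitrary, study-specific question correctly.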
Various technical solutions do increase the cost to requesters, though. For online studies this extra cost means more programming knowledge, more thinking about possible exploits, and so on. Perhaps many researchers just don't want to deal with this kind of stuff and so will move to another labor market that is less contaminated by bots. This should work for a short time, until enough of the money moves to the new system and bot makers begin to take notice and move too.
Another approach is to just consider paying for bad data as a cost of doing online data collection. It is like how stores budget for a certain amount of theft because the cost to prevent all theft is too high. In this case the cost is being paid by both the bot maker (development costs) and the experimenter (paying for useless data). This seems reasonable, but scientifically it could be problematic, in part because it can be difficult to know exactly who to exclude.
A report from the trenches
In a study run in the last several weeks (definitely "pandemic-era AMT"), we had people read some fairly complex instructions that included a description of a lottery that would determine their reward. The instructions were followed by comprehension checks that required non-trivial inferences about the task. Failing to get 100% correct answers sent participants back through the instructions, looping until they passed. The distribution of the number of completed loops among 300 subjects is here[6]:
Now, it is perhaps shocking that one participant completed the instruction loop 22 times, in part because these instructions are very tedious. However, let's say going through them three times is pretty normal (honestly, it takes me three times if I haven't had my morning coffee); then about 16% of the participants show up as "troubling." And even within that group one might guess there were some lower-literacy or non-native English readers in the pool who had trouble, so perhaps 16% bad data on AMT is the upper bound for our type of study. This doesn't seem all that bad, especially since excluding the people who looped the instructions more than 10 times is not hard to do in a pre-registered analysis plan.
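The gating logic behind that design is simple to sketch. Here is a minimal Python version, assuming hypothetical show_instructions() and ask_comprehension_questions() helpers supplied by the experiment code (in a real task this would of course live in the browser-side script).

```python
def run_instruction_phase(show_instructions, ask_comprehension_questions):
    """Loop the instructions until every comprehension question is answered
    correctly, and record how many passes it took.

    Both arguments are hypothetical callables: show_instructions() displays
    the instruction pages, and ask_comprehension_questions() returns the
    fraction of questions answered correctly on this pass.
    """
    loops = 0
    while True:
        loops += 1
        show_instructions()
        if ask_comprehension_questions() == 1.0:   # require 100% correct
            break
    return loops   # saved with the data; large values flag "troubling" subjects

# Toy demonstration with stand-in callables: this "participant" fails twice, then passes.
attempts = iter([0.6, 0.8, 1.0])
print(run_instruction_phase(lambda: None, lambda: next(attempts)))  # prints 3
```

The recorded loop count then makes the pre-registered exclusion trivial (e.g., drop anyone with more than 10 loops before analysis).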
On the other hand, we published one recent study which included common-sense reasoning "checks" as part of the study design. In Study 1, using Qualtrics + AMT resulted in nearly 40% of the participants being excluded prior to analysis. These individuals failed to answer simple physical reasoning questions that had trivially obvious answers (no "traps" like the cognitive reflection tasks). Again, you can catch this behavior with proper study design, but this did suggest a much larger proportion of inattentive subjects, bad actors, and/or bots. (We of course still paid all participants according to our IRB rules, and so this raised the cost of the study quite a lot.)
Recommendation letters
Scientists rely heavily on recommendations from peers for hiring. We are probably in need of widespread systems, perhaps beyond the AMT Qualifications system, that establish trust and reputation about workers (and requesters as well). For instance, if you know a particular worker did a good job on your task, you might invite them back for more studies or give them a good reference to pass along to other requesters. We don't have many widespread systems for this type of trust on AMT, although some websites like TurkPrime (aka CloudResearch) apparently do, for a fee, provide some type of quality assurance.[7]
Again, this can be viewed as a general instantiation of incentives. If obtaining the credentials/references requires a rather large threshold of "honest" labor, then it becomes not worth it for a nefarious actor. Of course, thinking one step ahead like a proper game theorist: the more you create an "elite class" of accounts that get offered selectively lucrative tasks, the more you will find people obtaining that credential and then possibly selling it to others for a profit. Also, there are serious scientific concerns about drawing inferences from what effectively becomes a "panel" (non-representative sampling in the extreme). It seems an open frontier to think about making generalizable conclusions from environments where you are never clear who is a real human and who is a script.
A nightmare world of human-bot hybrids
Once I tried to dream up a particularly ghoulish nightmare about online labor markets (cue dim lights and a flashlight uplighting my face).
It might seem that making a script that can pass some of the manipulation checks included by a researcher is difficult. Certainly, very sophisticated bots could be written that employ elements of artificial intelligence or other techniques.
However, there are commonly available browser plugins which essentially "record" the sequence of button clicks and mouse movements a human makes when interfacing with a website. This recording can then be "played back" many times to submit a form over and over again. Furthermore, some of these plugins output a script which represents the instructions needed to mimic the recorded actions of the human (e.g., in pseudo-code the script might say "click on the blue button, then toggle the age variable to 22, then click ok, then type the phrase 'ok thanks' into the text box, etc..."). This script can be downloaded and edited to add random elements or generalizations. For example, instead of always choosing the age "22," a simple adjustment might be to select a random age. Or instead of always typing 'ok thanks' into a text box, the script could be modified to randomly sample from a set of phrases. These modifications might take only a few moments for an experienced user of such a tool, and then the script can be shared across a network of machines to attack the particular survey, bypassing even some manipulation checks.
Under this attack, one human's effort in taking a survey is amplified many times over by quickly automating this behavior, generalizing it, and repeating it. Similarly, human-bot hybrids can be used, where a worker completes the CAPTCHA at the start of the task and then initiates the bot for the remainder of the task.
What can be done in this situation? I'm not really sure. Perhaps one solution is to have substantial task diversity. So imagine you include some comprehension questions that require non-trivial inferences to verify that you understand the task (this is something I've been advocating for almost a decade). Currently this is seen as a "gold standard" method in papers and does require extra development time. The next-level version of this is to draw different questions randomly from a relatively large pool, as sketched below. This means a human-bot hybrid will have to solve these questions afresh across many runs of the task, again raising the cost.
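A minimal sketch of what that rotation could look like, assuming a hypothetical QUESTION_POOL of comprehension items; the pool contents and the choice of three questions per session are invented for the example, and a real pool would be task-specific and much larger.

```python
import random

# Hypothetical pool of comprehension items about an imaginary task.
QUESTION_POOL = [
    {"q": "Which button ends a trial?", "a": "the red button"},
    {"q": "How is your bonus determined?", "a": "one randomly chosen trial"},
    {"q": "What happens if you answer incorrectly?", "a": "the instructions repeat"},
    {"q": "How many rounds are in the task?", "a": "ten"},
    {"q": "Can you change an answer after submitting it?", "a": "no"},
]

def draw_comprehension_questions(n=3, seed=None):
    """Sample n distinct questions for this participant's session."""
    rng = random.Random(seed)   # pass a per-session seed if you want reproducibility
    return rng.sample(QUESTION_POOL, n)

for item in draw_comprehension_questions(seed=42):
    print(item["q"])
```

A script that memorized the answers to one session's questions now has to handle any subset of the pool, which is exactly the cost increase we are after.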
Some simple types of obfuscation might help too. For instance, each time your task is served up to a person, your code could define custom HTML elements and then randomly rename all of them on each page load to nonsense strings. This means that code that does things like "select the HTML button called 'ok' and click it" becomes "select the element called 'lkjasldiuq123iehlkj' and click it," but the name is different each time the page is loaded, forcing the script to generalize yet again.
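Here is one rough way that could look on the server side in Python. This is a sketch under assumptions: the template, the placeholder names, and the "el_" prefix are all invented for illustration, and no toolkit mentioned in this post ships this exact feature.

```python
import secrets

# Template with logical element names; the server fills in random ids per request.
PAGE_TEMPLATE = """
<button id="{ok_button}">OK</button>
<input id="{age_field}" type="number">
"""

def render_obfuscated_page(template=PAGE_TEMPLATE):
    """Return (html, id_map): HTML with fresh nonsense ids, plus the mapping
    the server keeps so it can interpret the submitted form."""
    id_map = {
        "ok_button": "el_" + secrets.token_hex(8),
        "age_field": "el_" + secrets.token_hex(8),
    }
    return template.format(**id_map), id_map

html, id_map = render_obfuscated_page()
print(html)      # ids differ on every request
print(id_map)    # e.g. {'ok_button': 'el_3fa4...', 'age_field': 'el_9c1b...'}
```

A recorded click-bot that memorized the id of the "OK" button on one page load will then fail on the next, because that id no longer exists; only the server's id_map knows how to interpret the submitted form.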
If you aren't doing these things, are your data trustworthy?
Let's take stock of some of the ideas floated so far. Our goal is to raise the cost for bots, scripts, and "uncooperative" subjects by:
- Avoiding common/popular software packages
- Creating non-standard user interface elements
- Using CAPTCHAs and other human intelligence checks
- Creating databases or panels of "good workers"
- Using rotating sets of non-trivial comprehension questions
- Using incentive-compatible designs (e.g., where trials are randomly selected for bonus payment uniformly throughout the task)
- Programmatically obfuscating your Javascript and HTML to be non-human readable and different for each page request
Now ask yourself if you have ever read a paper that passes all these checks (simple as they are in the big scope of things). I don't even use all of these things in my studies! Does this mean the data is untrustworthy? If the people who argue the AMT pool is poisoned are right, then I guess the answer is yes. Any study that doesn't meet these minimal checks, and possibly more, is likely presenting data+bots as evidence for some scientific claim about human cognition. I'm a bit more optimistic, but it is something to consider.
Did I say earlier that online research is harder than it seems? All of this raises the development costs of doing research. It takes time to implement all of these checks. They introduce new potential for bugs in your scientific code.
The cost of identifying the "best practices" and then implementing them is one reason that we developed psiTurk in the first place. The more you think about these issues, the more you realize that it is complicated and requires technical solutions. For instance, in psiTurk you never compute a bonus amount in JavaScript. Instead, this is handed off to a server-side process. The reason is that a nefarious subject could alter the JavaScript to change their bonus amount simply by opening the developer tools in the browser. The development community around psiTurk thought about these issues and tried to come up with smart ways to deal with them.
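To make the principle concrete, here is a generic server-side sketch (not psiTurk's actual API): the client only identifies the worker, and the server recomputes the bonus from its own records, so nothing typed into the browser's developer tools can change the payout. The route name, the load_trial_payoffs helper, and the bonus cap are all hypothetical.

```python
from flask import Flask, jsonify

app = Flask(__name__)
MAX_BONUS = 3.00  # hypothetical cap, enforced server-side

def load_trial_payoffs(worker_id):
    """Hypothetical stand-in for a database lookup: in a real deployment this
    would read the per-trial payoffs the *server* recorded for this worker,
    never a total reported by the client."""
    return [0.0, 0.50, 0.50, 0.0, 0.50]   # dummy data for the sketch

@app.route("/compute_bonus/<worker_id>")
def compute_bonus(worker_id):
    payoffs = load_trial_payoffs(worker_id)
    bonus = min(sum(payoffs), MAX_BONUS)   # recomputed entirely from server records
    return jsonify({"worker_id": worker_id, "bonus": round(bonus, 2)})
```

The design choice is simply that the browser is treated as untrusted input; anything that touches money or exclusion decisions lives behind the server boundary.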
The high cost of costs
I've implied a few times now that raising the costs of bad behavior is something that researchers need to think more about. However, I see limitations even to this. Perhaps the biggest one is that as you add cognitive costs to your tasks to exclude bots, you may be cognitively exhausting your real human participants. If every n-th trial of your experiment requires solving a puzzle while holding your breath upside down under water, well, then I bet the data on the other trials will be bad too.
This is one place where value-alignment between researchers and participants becomes difficult, because the things that increase costs for bad actors also increase costs for the good actors embedded in your study, making the interaction awkward, complex, and confusing. Several studies that use complex economic incentives to try to shape behavior are hard to perform online because understanding the lottery that will determine your reward requires a nearly PhD-level understanding of probability theory.
The real goal (thanks, Kevin Smith) is to invent asymmetric costs which tax inattentive people, bots, and human-machine hybrids, but not attentive real humans. I'm certainly interested in people's ideas about this.
Burn it all down?
The bottom line is that the issues around scientific research in online labor markets are pretty complex. In this post I focused on incentivizing real human behavior but didn't at all get into the numerous ethical issues about these labor markets in terms of fair wages and abuse.
Perhaps one solution is to try to emphasize more altruistic, non-incentivized studies. Imagine you just host your task on your lab webpage and invite people to take part for free, for fun, and in order to learn more about themselves. Instead of spending your money on wages on AMT, you spend it on google/twitter/facebook ads promoting your lab website. Make your task fun, enjoyable, good to look at, and engaging. Now, I submit that no one will make a bot to target your website because there is no incentive for doing so. (The exception is if you have accumulated serious academic enemies.) In addition, the people who do the task will pay more attention because they want to do your task. If they didn't want to do it, they would just leave and go look at some other website.
So, weirdly, the way out of the mess might be to sidestep the poisoned well altogether and focus on citizen science and the general altruistic tendency of humankind outside of capitalist market mechanisms[8]. I suspect that such systems are less efficient for researchers in terms of subjects per hour, but at some point the development costs of conducting good research on AMT simply become too much.
Another approach is to move to new markets (e.g., Prolific) and just avoid the unregulated mess that AMT has become. I'm sort of sympathetic to this as a short-term solution, but I have some doubts that these new markets can avoid similar issues creeping in. I highly doubt paying people to take part in studies is going away; in fact, I suspect it will continue to increase in prevalence. Looking forward, I think the general research community really needs to take the issue of data quality more seriously. Personally, I'm a fan of building better open-source tools to ensure higher-quality data, but honestly the complexity of this issue suggests it could become a productive field of methodology study in itself.
Footnotes
In full disclosure, although I have contributed to some tools that aid interactions with Amazon Mechanical Turk I have no particular horse in this race. I make no money from AMT personally and I use many other websites in my own research including Prolific, and if it turned out that AMT was burned to the ground, I wouldn't be a voice of objection. ↩︎
I've not been super systematic about analyzing this effect yet because it is ongoing research, but recently I replicated pretty well a study I first conducted on AMT in 2013. Thus, things might not be bad in all cases, and understanding whether there is some lurking association with good/bad experiences on AMT is very interesting to explore. ↩︎
Of course, crowd-labor markets like AMT are not the same as local labor markets for in-person jobs. For one, the transaction to "hire" a worker is entirely digital and often done with essentially no pre-screening of "applicants" (all these terms in quotes because obviously they are a bad match for AMT's design). Thus efficient allocation between employer-side demand and worker-side supply is probably unrealized. ↩︎
Another hypothesis I have seen thrown around is that as machine learning has become more widespread, more and more of the tasks on AMT are rote "click" tasks (e.g., labeling common objects in photographs), and so people are just kind of mindlessly clicking on psychology tasks as well. I'm not sure I buy this exactly, because workers often rate psychology tasks as more interesting/fun than the "normal fare" of AMT tasks. Doing a psychology experiment might seem like a refreshing (and slightly above-median-wage) change of pace, catching a worker's attention and curiosity. ↩︎
Modeled on "What are the youth to do? I think we should destroy the bogus capitalist process that is destroying youth culture by mass marketing and commercial paranoia behavior control" - Thurston Moore ↩︎
Thanks to Pam Osborn Popp for sharing her in-progress data. ↩︎
There are likely several complex privacy issues involved in setting up systems like this, particularly for academic researchers. It is akin to researchers at one university, with one particular IRB, "telling" researchers at another university, "don't trust this subject, they gave me bad data." ↩︎
I bet you didn't expect to go from "AMT has problems lately" to "abolish capitalism" but these are the times we live in. ↩︎