Crowdsourcing Web Data with Amazon Mechanical Turk Example

The amount of data available on the web is astounding, and if you have the computer programming skills you can often write simple "web scraping" code to automatically harvest that data. But what if you don't have the computer programming skills? You could spend a lot of time learning those skills, or you could hire someone else to write the code for you, but these options cost time and money. A faster and cheaper solution is to use crowdsourcing. In this post I will walk through an example of using Amazon Mechanical Turk for collecting data from the web.

In this example I have a list of Twitter IDs and I want to find the Twitter account name associated with each of those IDs. Every Twitter account has a unique ID number that never changes, but the Twitter account name, or "handle," is the more commonly used way of identifying a Twitter user. For example, my Twitter ID is 16016329 and my Twitter account name is @pjlamberson. (You can find your Twitter ID here.) You could use this same procedure to have workers look up any data on the web — for example, to collect online reviews, telephone numbers, addresses, or any other type of online data.

The first thing to do is go to Amazon Mechanical Turk and click on Get Started on the Get Results side of the screen. Click Create an Account and register as a Requester. Once you're back to the main requester screen click Create, then click New Project.


There are many types of tasks (aka HITs, or Human Intelligence Tasks) that Amazon Mechanical Turk workers (aka "Turkers") can do. In this example, we are going to use the Data Collection task. So, click on Data Collection and then click Create Project.



On the next screen you enter the properties for your project including how much you want to pay per HIT and how many people you want to complete each task. You can see the properties I chose below, but you may want to change your properties if you expect your task to take longer or if you want multiple workers to complete each HIT to ensure higher quality data.

When you're finished specifying your project properties, click on Design Layout.

For my data collection task I have a spreadsheet full of Twitter IDs. You can see a sample of the spreadsheet at:

For each row in the spreadsheet, I need the Turker to go to the web address where XXXXXXXXXXX is replaced with the Twitter ID in that row of the spreadsheet. The corresponding instructions for this task to the Turker are shown below.

To refer to the variable Id that shows up in my spreadsheet I use the syntax ${Id}. The variable name Id matches the header of the corresponding column in my spreadsheet. Mechanical Turk will automatically create one HIT for each row of the spreadsheet. If you have multiple variables that change for each HIT, you can have multiple columns in the spreadsheet and refer to the column with header "ColumnX" as ${ColumnX} in your instructions. For each HIT, the placeholder ${ColumnX} will be replaced with the value in the ColumnX column of the corresponding row of your spreadsheet.
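
The substitution itself is easy to picture in code. The sketch below is not Mechanical Turk's implementation; it uses Python's standard `string.Template`, which happens to accept the same `${Name}` syntax, to show how one HIT is produced per spreadsheet row. The instruction text and the `render_hits` helper are made up for illustration.

```python
from string import Template

# Hypothetical instruction text; the real task URL is not reproduced here.
INSTRUCTIONS = Template(
    "Look up the account for Twitter ID ${Id} and enter its handle."
)

def render_hits(template, rows):
    """Return one rendered instruction string per spreadsheet row."""
    # Each row is a dict keyed by column header, e.g. {"Id": "16016329"}.
    return [template.substitute(row) for row in rows]

# Two rows, using the two IDs mentioned in this post.
rows = [{"Id": "16016329"}, {"Id": "23779644"}]
hits = render_hits(INSTRUCTIONS, rows)

for text in hits:
    print(text)
```

A row that lacks a column named in the template makes `substitute` raise a `KeyError`, which is a useful reminder that placeholder names and column headers have to match exactly, including capitalization.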

The HTML source code for the HIT design shown above is available at If you click on the Source button on your Design Layout page, you can replace the example source code with this to reproduce my HIT.

Once you have entered your instructions to the Turkers, hit Preview. If the instructions look like you want them to (note the variable placeholder will still be just a placeholder until you upload your spreadsheet later), hit Finish.

Now you need to upload the spreadsheet containing the variables that change for each HIT and open up the job to the Turkers. To do this click Publish Batch and then click Choose File. Your spreadsheet should be in csv format, have one row for each HIT, and one column for each variable that changes from HIT to HIT. In my case, there is just the one variable, Id. Once you have selected your file, click Upload.
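
Before uploading, it can be worth checking that every placeholder in your instructions has a matching column in the csv file. Here is a minimal sketch of such a check using only the Python standard library; the function names and the sample text are hypothetical, and the check runs on your own machine, not on Mechanical Turk.

```python
import csv
import io
import re

# Matches placeholders of the form ${Name} and captures Name.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")

def missing_columns(template_text, csv_text):
    """Return placeholder names used in the template that have no csv column."""
    wanted = set(PLACEHOLDER.findall(template_text))
    reader = csv.DictReader(io.StringIO(csv_text))
    have = set(reader.fieldnames or [])
    return sorted(wanted - have)

def count_hits(csv_text):
    """Count data rows, i.e. the number of HITs the file would generate."""
    return sum(1 for _ in csv.DictReader(io.StringIO(csv_text)))

# Hypothetical instruction text and a two-row input file with one column, Id.
template_text = "Find the handle for Twitter ID ${Id}."
csv_text = "Id\n16016329\n23779644\n"

print(missing_columns(template_text, csv_text))
print(count_hits(csv_text))
```

To check a real file, read its contents with `open(path, newline="").read()` and pass the result in as `csv_text`.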

Now you will see a preview of your HITs with the variable(s) filled in with values from your spreadsheet. In my case, I have this:

Now, instead of ${Id} showing up in the web address, the first Twitter ID from my spreadsheet, 23779644, has been substituted. By clicking on Next HIT, you can see the other HITs that have been created using the data from the spreadsheet. Take a look at a few to make sure everything looks the way you want it to and then click Next.

On the next page you will see how much your batch of HITs is going to cost, which is a combination of the per HIT fee you set to pay the Turkers and the fee Amazon charges you to use the service. You may need to add funds to your account through Amazon payments in order to pay for the work. Once you have done that click Publish HITs and wait for the magic to happen.
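
If you want to estimate the bill before you get to this page, the arithmetic is simple. The sketch below assumes the fee is a flat percentage of what you pay the workers; the 20% rate and the other numbers are placeholders I made up for illustration, so check Amazon's current pricing before relying on them.

```python
from decimal import Decimal

def batch_cost(num_hits, reward_per_assignment, assignments_per_hit, fee_rate):
    """Total cost = worker rewards plus a percentage fee on those rewards."""
    rewards = (
        Decimal(num_hits)
        * Decimal(assignments_per_hit)
        * Decimal(reward_per_assignment)
    )
    fee = rewards * Decimal(fee_rate)
    return rewards + fee

# Hypothetical batch: 500 HITs at $0.05 each, one worker per HIT,
# and an assumed 20% fee (not necessarily Amazon's actual rate).
total = batch_cost(500, "0.05", 1, "0.20")
print(total)
```

Passing the dollar amounts as strings keeps `Decimal` from inheriting floating point rounding error, which matters when you multiply small per-HIT rewards by large batch sizes.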

This is the fun part. While you surf the web aimlessly, go grab a coffee, play solitaire, or get some really important work done, the tasks you posted are being completed by members of the thousands of Turkers that you are connected to through the platform. It's as if you're the CEO of a major company with a massive workforce waiting to do your bidding at a moment's notice. You can watch the progress the Turkers are making on the results page.

In short order your tasks will be complete and you can see what the Turkers came up with by clicking the Results button. Here is a portion of my results:

If you're satisfied with the results you can approve them so that the Turkers get paid, or if a Turker did not do a satisfactory job you can reject the work. If you do nothing, the HITs will automatically be approved after a set time that you specified when setting up the HITs. To download the results just click on Download CSV and then right click and select Download linked file... on the here link and you're all set!
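
If you asked several workers to complete each HIT, the downloaded file will have one row per worker per HIT, and you still need to boil those down to one answer per ID. A simple approach is a majority vote. The sketch below assumes the results file has columns named `Input.Id` and `Answer.handle`; those names depend on how your HIT was designed, so treat them as placeholders. The sample rows, including the misspelled handle and `example_user`, are invented for illustration.

```python
import csv
import io
from collections import Counter, defaultdict

def majority_answers(csv_text, id_col="Input.Id", answer_col="Answer.handle"):
    """Map each input ID to (most common answer, share of workers who gave it)."""
    votes = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Normalize so "@PJLamberson" and "pjlamberson" count as the same answer.
        answer = row[answer_col].strip().lstrip("@").lower()
        votes[row[id_col]][answer] += 1
    result = {}
    for item_id, counter in votes.items():
        answer, count = counter.most_common(1)[0]
        result[item_id] = (answer, count / sum(counter.values()))
    return result

# Invented results: three workers per ID, one of whom made a typo.
results_csv = """Input.Id,Answer.handle
16016329,@pjlamberson
16016329,PJLamberson
16016329,pjlambersen
23779644,example_user
23779644,@example_user
23779644,example_user
"""

res = majority_answers(results_csv)
for item_id, (answer, share) in res.items():
    print(item_id, answer, round(share, 2))
```

IDs where the winning share is low are good candidates for a manual check or a second round of HITs.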

Crowdsourcing and Open Innovation Examples

When I teach about Crowdsourcing and Open Innovation in my Social Dynamics course at Kellogg we look at a ton of examples of how innovative organizations are using these tools to connect with a global network of problem solvers, innovators, and regular people to make more accurate predictions, find better problem solutions, and speed the pace of innovation. One of my students recently suggested that I compile a list of these examples so that the class could have all of the links in one place. So, here is a roughly annotated list of some of my favorite examples. Some of these are platforms you can use and others are organizations that are using or have used crowds in innovative ways. If you have others, I would love to hear about them.

Processing unstructured data

Many organizations today have more data than they know what to do with. Much of this data is what we call “unstructured”: it’s not a nice spreadsheet that we can feed into a regression or even into a fancy machine learning algorithm. Instead the data is in the form of images or massive amounts of text that we don’t really know how to handle. Crowdsourcing has proven to be a very effective way to process this kind of unstructured data into something usable.

The New York Times and the Sarah Palin Emails

The New York Times asked the crowd to help comb through the thousands of pages of Sarah Palin’s email released by court order and flag newsworthy content. Then the professional editors would take a closer look at flagged items and do the background research to put together a real news story.

Galaxy Zoo

The crowd helps process the hundreds of thousands of pictures of galaxies taken by the Hubble Space Telescope by categorizing galaxies by shape, color, and other features.


Workers help process data on gene sequence alignment.


You know those online security questions where you have to enter some fuzzy text or a blurry number? Sometimes you are actually helping to process scanned text that was too muddled for a computer to read.


Duolingo is an awesome FREE language learning app. Its secret to staying free is that while people are using the app to learn a language, Duolingo makes money from the documents that they translate. It was started by the same people that came up with reCaptcha.

How photoshopped is an image?

This is an academic study that used crowdsourcing to help form a rating system for how much an image has been digitally altered. Outstanding example of how crowds can be used to “train computers” to process unstructured data.


Focused particularly on document processing, like reading handwritten form responses.


Workers process text, do research, tag, and categorize your data.

Expert tasks that are inefficient to bring in house

Often times we have jobs to be done where we could really use a little bit of an expert, but we don’t really need a whole employee. After all, hiring people is expensive. Any time you hire someone it costs money to find them, you have to give them a desk, and a phone, and a computer, and benefits, and usually you’re stuck with them for a while. Sometimes it would be nice to have just a part of an employee - say, 1/4 of a marketer, 1/10 of a graphic designer, 1/10 of a web designer, and 1/3 of a data scientist. Crowdsourcing effectively lets you do this.

99 designs

Designers compete for your logo or web design. I have heard many, many stories of students having great experiences with this platform.


Find freelance programmers, developers, designers, writers, and marketers.


Teams from across the globe compete to deliver the best code. Used by companies like Google, Pfizer, Microsoft, Intel, Geico, and ESPN.


The crowd both submits and chooses clever T-shirt designs like the now famous “Communist Party” shirt.


Similar to Threadless, but more art focused. Designs can be printed on T-shirts, mugs, posters, etc.


Buy or submit stock photo images.


Have an idea for a great product? Submit it on Quirky. Used by companies like Bed, Bath, and Beyond, Target, Toys R Us, and Ace Hardware.


Crowdfunding

Crowdfunding allows us to distribute the risk of funding new projects across a huge number of people. It’s also a great way of using a “Measure and React Strategy” because in many cases, like on Kickstarter, people effectively commit to buy your product before you have to take the risk of producing it.


The most prominent crowdfunding platform on the Web, Kickstarter started out funding arts projects but has grown to much, much more, raising millions of dollars in startup capital for projects like the Pebble smart watch.


Crowdfunding for personal causes like medical expenses.


Raises money to fund MBA students by connecting students and alumni.


Designed to let regular people try their hand at being a venture capitalist.


Another crowd funding tool focused on funding startups, which has now expanded into a whole group of crowd funding and investment companies.


Startup funding specifically aimed at Israeli companies.

Open innovation

One of the most powerful uses of the crowd is through open innovation platforms. This application is designed to take advantage of the super additive benefits of diversity. For more on how the power of diversity leads to better problem solving, I highly recommend Scott E. Page’s book The Difference, which was once aptly described as “an airplane book if you’re on a flight to Singapore."


The largest, most developed open innovation platform hosts challenges of all sorts, but especially problems in chemistry and engineering. Prizes for solutions often extend into the tens of thousands of dollars.


Like TopCoder, but for data analytics.


Many of the examples above overlap across multiple categories. These do too, and I gave up on trying to label them.

The NetFlix Prize

One of the first and all time greatest examples of a prize contest incentivizing diverse problem solvers to come together to solve a difficult question.


Amazon Mechanical Turk is a platform for all of the above. The most developed and effective crowdsourcing platform on the web, with a massive population of workers (aka Turkers), Mechanical Turk can be used for processing data, running experiments, disseminating surveys, … you name it.


Instead of playing solitaire or minesweeper, why not kill time by helping to solve protein folding puzzles with implications for combatting diseases like Parkinson's and HIV?

Big Data and the Wisdom of Crowds are not the same

I was surprised this week to find an article on Big Data in the New York Times Men's Fashion Edition of the Style Magazine. Finally! Something in the Fashion issue that I can relate to, I thought. Unfortunately, the article by Andrew Ross Sorkin (author of Too Big To Fail) made one crucial mistake. The downfall of the article was conflating two distinct concepts that are both near and dear to my research, Big Data and the Wisdom of Crowds, which led to a completely wrong conclusion.

Big Data is what it sounds like — using very large datasets for ... well for whatever you want. How big is Big depends on what you're doing.  At a recent workshop on Big Data at Northwestern University, Luís Amaral defined Big Data to be basically any data that is too big for you to handle using whatever methods are business as usual for you. So, if you're used to dealing with data in Excel on a laptop, then data that needs a small server and some more sophisticated analytics software is Big for you. If you're used to dealing with data on a server, then your Big might be data that needs a room full of servers.

The Wisdom of Crowds is the idea that, as collectives, groups of people can make more accurate forecasts or come up with better solutions to problems than the individuals in them could on their own. A different recent New York Times article has some great examples of the Wisdom of Crowds. The article talks about how the Navy has used groups to help make forecasts, and in particular forecasts for the locations of lost items like "sunken ships, spent warheads and downed pilots in vast, uncharted waters." The article tells one incredible story of how they used this idea to locate a missing submarine, the Scorpion:

"... forecasters draw on expertise from diverse but relevant areas — in the case of finding a submarine, say, submarine command, ocean salvage, and oceanography experts, as well as physicists and engineers. Each would make an educated guess as to where the ship is ... This is how Dr. Craven located the Scorpion.

“I knew these guys and I gave probability scores to each scenario they came up with,” Dr. Craven said. The men bet bottles of Chivas Regal to keep matters interesting, and after some statistical analysis, Dr. Craven zeroed in on a point about 400 miles from the Azores, near the Sargasso Sea, according to a detailed account in “Blind Man’s Bluff,” by Christopher Drew and Sherry Sontag. The sub was found about 200 yards away."

This is a perfect example of the Wisdom of Crowds: by pooling the forecasts of a diverse group, they came up with an accurate collective forecast.

So, how do Big Data and The Wisdom of Crowds get mixed up? The mixup comes from the fact that a lot of Big Data is data on the behavior of crowds. The central example in Sorkin's article is data from Twitter, and in particular data that showed a lot of people on Twitter were very unhappy with antigay comments made by Phil Robertson, the star of A&E's Duck Dynasty. The short version of the story is that A&E initially terminated Robertson in response to the Twitter data, but Sorkin argues this was a business mistake because Twitter users are "not exactly regular watchers of the camo-wearing Louisiana clan whose members openly celebrate being 'rednecks'." He also cites evidence that data from Twitter does not provide accurate election predictions for essentially the same reason: the people that are tweeting are not a representative sample of the people that are voting. All of this is correct. Using a big dataset does not mean that you don't have to worry about having a biased sample. No matter how big your dataset, a biased sample can lead to incorrect conclusions. A classic example is the prediction by The Literary Digest in 1936 that Alf Landon would be the overwhelming winner of the presidential election that year. In fact, Franklin Roosevelt carried 46 of the 48 states. The prediction was based on a huge poll with 2.4 million respondents, but the problem was that the sample for the poll drew primarily on Literary Digest subscribers and on automobile and telephone owners. This sample tended to be more affluent than the average voter, and thus favored Landon's less progressive policies.
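
A small simulation makes the point about sample size versus sample bias concrete. Everything in the sketch below is made up: a hypothetical electorate in which 40% of voters are affluent, affluent voters favor candidate A 70/30, and everyone else favors candidate B 70/30. It compares a huge poll drawn only from affluent voters with a small poll drawn at random from everyone. It is not a reconstruction of the actual 1936 poll.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_voter(affluent):
    """Create one hypothetical voter; affluent voters lean toward candidate A."""
    p_a = 0.7 if affluent else 0.3
    return {"affluent": affluent, "vote_a": rng.random() < p_a}

# Hypothetical electorate of 200,000 voters, about 40% of them affluent.
population = [make_voter(rng.random() < 0.4) for _ in range(200_000)]

def share_a(voters):
    """Fraction of the given voters who support candidate A."""
    return sum(v["vote_a"] for v in voters) / len(voters)

truth = share_a(population)

# Big but biased: 50,000 respondents, all drawn from affluent voters.
affluent_only = [v for v in population if v["affluent"]]
big_biased = share_a(rng.sample(affluent_only, 50_000))

# Small but representative: 1,000 respondents drawn from everyone.
small_random = share_a(rng.sample(population, 1_000))

print("true share for A:    ", round(truth, 3))
print("big biased poll:     ", round(big_biased, 3))
print("small random poll:   ", round(small_random, 3))
```

The biased poll calls the election for the wrong candidate no matter how many respondents it has, while the random poll lands within a few points of the truth with fifty times fewer.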

So, Sorkin is on the right track to write a great article on how sample bias is still important even when you have Big Data. This is a really important point that a lot of people don't appreciate. But unfortunately the article veers off that track when it starts talking about the Wisdom of Crowds. The Wisdom of Crowds is not about combining data on large groups, but about combining the predictions, forecasts, or ideas of groups (they don't even have to be that large). If you want to use the Wisdom of Crowds to predict an election winner, you don't collect data on who they're tweeting about, you ask them who they think is going to win. If you want to use the Wisdom of Crowds to decide whether or not you should fire Phil Robertson, you ask them, "Do you think A&E will be more profitable if they fire Phil Robertson or not?" As angry as all of those tweets were, many of those angry voices on Twitter would probably concede that Robertson's remarks wouldn't damage the show's standing with its core audience.

The scientific evidence shows that using crowds is a pretty good way to make a prediction, and it often outperforms forecasts based on experts or Big Data. For example, looking at presidential elections from 1988 to 2004, relatively small Wisdom of Crowds forecasts outperformed the massive Gallup Poll by 0.3 percentage points (Wolfers and Zitzewitz, 2006). This isn't a huge margin, but keep in mind that the Gallup presidential polls are among the most expensive, sophisticated polling operations in history, so the fact that the crowd forecasts are even in the ballpark, let alone better, is pretty significant.

The reason the Wisdom of Crowds works is that when some people forecast too high and others forecast too low, their errors cancel out and bring the average closer to the truth. The accuracy of a crowd forecast depends both on the accuracy of the individuals in the crowd and on their diversity, that is, how likely their errors are to be in opposite directions. The great thing about it is that you can make up for low accuracy with high diversity, so even crowds in which the individual members are not that great on their own can make pretty good predictions as collectives. In fact, as long as some of the individual predictions are on both sides of the true answer, the crowd forecast will always be closer to the truth than the average individual in the crowd. It's a mathematical fact that is true 100% of the time. Sorkin concludes his article, based on the examples of inaccurate predictions from Big Data with biased samples, by writing, "A crowd may be wise, but ultimately, the crowd is no wiser than the individuals in it." But this is exactly backwards. A more accurate statement would be, "A crowd may or may not be wise, but ultimately, it's always at least as wise as the individuals in it. Most of the time it's wiser."
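
One standard way to make this precise measures error as squared distance from the truth. Under that measure, the crowd's error equals the average individual error minus the diversity of the forecasts (their spread around the crowd average), an identity sometimes called the diversity prediction theorem. The sketch below checks the identity on four made-up guesses; the function name and the numbers are mine, chosen only for illustration.

```python
def crowd_decomposition(forecasts, truth):
    """Return (crowd error, average individual error, diversity), all squared."""
    n = len(forecasts)
    crowd = sum(forecasts) / n
    crowd_error = (crowd - truth) ** 2
    avg_individual_error = sum((f - truth) ** 2 for f in forecasts) / n
    diversity = sum((f - crowd) ** 2 for f in forecasts) / n
    return crowd_error, avg_individual_error, diversity

# Hypothetical guesses at a quantity whose true value is 400.
forecasts = [380, 420, 450, 310]
truth = 400

crowd_error, avg_error, diversity = crowd_decomposition(forecasts, truth)
print("crowd error:             ", crowd_error)
print("average individual error:", avg_error)
print("diversity:               ", diversity)
print("average error - diversity:", avg_error - diversity)
```

Because diversity can never be negative, the crowd's squared error can never exceed the average individual's, and it is strictly smaller whenever the forecasts differ at all.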

Social Dynamics Videos

While I've been teaching Social Dynamics and Networks at Kellogg, I've amassed a collection of links to interesting videos on social dynamics. Here they are:

Duncan Watts TEDx talk on "The Myth of Common Sense"

Nicholas Christakis TED talk on "The hidden influence of social networks"; TED talk on "How social networks predict epidemics."

James Fowler talking about social influence on the Colbert Report.

Sinan Aral TEDx talk on "Social contagion"; at PopTech 2010 on "Social contagion"; at Nextwork on "Social contagion"; at the International Conference on Weblogs and Social Media on "Content and causality in social networks."

Scott E. Page on "Leveraging Diversity", and at TEDxUofM on "Putting Milk Crates on the Internet."

Eli Pariser TED talk on "Beware online 'filter bubbles'"

Freakonomics podcast on "The Folly of Prediction"

Damon Centola on "Network Contagion."

Jure Leskovec on "The Web as a Laboratory for Studying Humanity"

There are several good videos of talks from the Web Science Meets Network Science conference at Northwestern: Duncan Watts, Albert-Laszlo Barabasi, Jure Leskovec, and Sinan Aral.

The "Did You Know?" series of videos has some incredible information about, well, information. More info here.

Training Computers with Crowds

Computers are awesome, but they don't know how to do much on their own; you have to train them. Crowdsourcing turns out to be a great way to do this. Suppose you would like to have an algorithm to measure something, like whether a tweet about a movie is positive or negative. You might want to know this so you can count positive and negative tweets about a particular movie and use that information to predict box office success (like Asur and Huberman do in this paper). You could try and think of all of the positive and negative words that you know and then only count tweets that include those words, but you'd probably miss a lot. You could categorize all of the tweets yourself, or hire a student to do it, but by the time you finished the movie would be on late night cable TV. You need a computer algorithm so you can pull thousands of tweets and count them quickly, but a computer just doesn't know the difference between a positive tweet and a negative tweet until you train it.

That's where the crowd comes in. People can easily judge the tone of a tweet, and you don't have to be an expert to do it. So, what you can do is gather a pile of tweets (say, a few thousand), put them up on Amazon Mechanical Turk, and let the crowd label them as positive or negative. At a few cents per tweet you can do this for something in the ballpark of a hundred bucks. Now that you have a pile of labeled tweets, you can train the computer. There are lots of fancy terms for it (language model classifiers, self organizing fuzzy neural networks, ...), but basically, you run a regression.  The independent variables are things the computer can measure, like how many times certain words appear, and the dependent variable is whether the tweet is positive or negative. You estimate the regression (a.k.a. train the classifier) on the tweets labeled by the crowd, and now you have an algorithm that can label new tweets that the crowd hasn't labeled. When the next movie is coming out, you harvest the unlabeled tweets and feed them through the computer to see how many are positive and negative.
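
To show what "run a regression" means here, the following is a bare-bones logistic regression on word counts, written in plain Python so nothing needs to be installed. It is a toy sketch: the six labeled "tweets" are invented stand-ins for crowd labels, and the learning rate and number of passes are arbitrary. For a real project you would more likely reach for an existing machine learning library.

```python
import math
from collections import Counter

def featurize(text, vocab):
    """Independent variables: how many times each vocabulary word appears."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def score(x, weights, bias):
    """Predicted probability that a tweet with features x is positive."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 / (1 + math.exp(-z))

def train(examples, vocab, epochs=200, lr=0.5):
    """Fit logistic regression weights by stochastic gradient ascent."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = featurize(text, vocab)
            err = label - score(x, weights, bias)
            bias += lr * err
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias

def predict(text, vocab, weights, bias):
    """Probability that a new, unlabeled tweet is positive."""
    return score(featurize(text, vocab), weights, bias)

# Invented stand-ins for crowd-labeled tweets: 1 = positive, 0 = negative.
labeled = [
    ("loved this movie great fun", 1),
    ("great acting and a great story", 1),
    ("what a fun night loved it", 1),
    ("terrible movie boring plot", 0),
    ("boring and way too long", 0),
    ("awful acting terrible story", 0),
]
vocab = sorted({w for text, _ in labeled for w in text.lower().split()})
weights, bias = train(labeled, vocab)

for tweet in ["great fun", "boring and terrible"]:
    print(tweet, "->", round(predict(tweet, vocab, weights, bias), 2))
```

Words the model never saw during training are simply ignored, which is one reason a few thousand labeled tweets work much better than six.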

This is exactly how Hany Farid at Dartmouth trained his algorithm for detecting how much digital photographs have been altered.  On its own the computer can measure lots of fancy statistical features of the image, but judging how significant the alteration of the image is requires a human. So, he gave lots of pairs of original and altered images to people on MTurk and had them rate how altered the images were.  Then he essentially let the computer figure out what image characteristics for the altered images correlate with high alteration scores (but in a much fancier way than just a regular regression).  Now, he has a trained algorithm that can read in photographs where we don't have the original and predict how altered the image is.

"Predicting the Present" at the CIA

The CIA is using tools similar to those we teach in the Kellogg Social Dynamics and Networks course to "predict the present" according to an AP article (see also this NPR On the Media interview).

While accurately predicting the future is often impossible, it can be pretty challenging just to know what's happening right now.  Predicting the present is the idea of using new tools to get a faster, better picture of what's happening in the present.  For example, the US Bureau of Labor Statistics essentially gathers the pricing information that goes into the Consumer Price Index (CPI) by hand (no joke, read how they do it here). This means that the government's measure of CPI (and thus inflation) is always a month behind, which is not good for making policy in a world where decades-old investment banks can collapse in a few days.

To speed the process up, researchers at MIT developed the Billion Prices Project, which as the name implies collects massive quantities of price data from across the Internet to get a more rapid estimate of CPI. The measure works, and is much more responsive than the government's measure. For example, in the wake of the Lehman collapse, the BPP detected deflationary movement almost immediately while it took more than a month for those changes to show up in the government's numbers.

"Social Media in Tornado Alley"

A recent New York Times newsletter contained an article, "Social Media in Tornado Alley," in which they describe how resident-posted YouTube videos and reporters' Twitter feeds contributed to their coverage of the tornado devastation in Joplin, Missouri.  They created a video by piecing together YouTube clips, and their reporter Brian Stelter "immediately began filing a stream of Twitter updates that provide a unique and up to the second account of what he was seeing on the ground there."

It's interesting to see how the Times and other "traditional" news organizations are folding social media into their portfolio.  On the one hand, social media is seen as a challenge to traditional news sources, since so much information is available via the Web. But, I think we're seeing how organizations like the Times can serve as curators of this information by collecting the most interesting/important/reliable pieces and adding expert commentary and analysis.

Crowdsourcing the Palin Email Release

Slate reports that several major news outlets, including the Washington Post and the New York Times, are planning to use crowdsourcing to scour thousands of pages of Sarah Palin's emails from her time as Governor of Alaska that will be released on Friday.

In many ways this is a perfect crowdsourcing task.  It would be hugely time consuming for news reporters to sift through the more than 24,000 pages of email themselves.  And automating this process would be next to impossible because what counts as "interesting" is very difficult to program into a natural language processor. On the other hand, it is relatively easy for humans to pick out.  The task comes with built in motivation: first, people are personally interested in reading Palin's emails; second, Palin's detractors are motivated to try and dig up embarrassing information and supporters will be motivated to respond; and third, finding something interesting comes with the promise of acknowledgement in the pages of a major news outlet.  All this adds up to the fact that you don't need to pay anyone to do this and do it well.  The biggest potential pitfall is that crowdsourcing relies fundamentally on local information.  Each individual looks through a handful of emails, which is good for finding particular juicy quotes, but not so good for identifying larger patterns.  To combat this, the news outlets could rely on wiki-like interfaces where the crowdsourcers could post "leads" that other individuals could add to in order to piece together larger narratives.

Diversity Trumps Accuracy in Large Groups

In a recent paper with Scott Page, forthcoming in Management Science, we show that when combining the forecasts of large numbers of individuals, it is more important to select forecasters that are different from one another than those that are individually accurate.  In fact, as the group size goes to infinity, only diversity (covariance) matters.  The idea is that in large groups, even if the individuals are not that accurate, if they are diverse then their errors will cancel each other out.  In small groups, this law of large numbers logic doesn’t hold, so it is more important that the forecasters are individually accurate.  We think this result is increasingly relevant as organizations turn to prediction markets and crowdsourced forecasts to inform their decisions.
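
A back-of-the-envelope version of the argument, which is a simplification and not necessarily the exact setup in the paper: suppose each of $n$ forecasters makes an error $e_i$ with mean zero, variance $\sigma_i^2$, and covariance $c_{ij}$ with forecaster $j$. The expected squared error of the simple average of their forecasts is then

```latex
\begin{aligned}
\mathbb{E}\!\left[\Big(\tfrac{1}{n}\sum_{i=1}^{n} e_i\Big)^{2}\right]
  &= \frac{1}{n^{2}}\sum_{i=1}^{n}\sigma_i^{2}
   + \frac{1}{n^{2}}\sum_{i \neq j} c_{ij} \\
  &= \frac{1}{n}\,\overline{\sigma^{2}}
   + \frac{n-1}{n}\,\overline{c},
\end{aligned}
\qquad
\overline{\sigma^{2}} = \frac{1}{n}\sum_{i=1}^{n}\sigma_i^{2},
\quad
\overline{c} = \frac{1}{n(n-1)}\sum_{i \neq j} c_{ij}.
```

The first term, which reflects individual accuracy, shrinks like $1/n$, while the weight on the average covariance $\overline{c}$ approaches one. So as long as the individual variances stay bounded, the error of a large group is driven almost entirely by how correlated its members' errors are.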

Crowdsourcing a clinical trial

Ars Technica has an article today about a crowdsourced clinical trial to evaluate the effectiveness of using lithium for treating ALS (Lou Gehrig’s Disease).  Over 3500 patients participated in tracking their disease symptoms online and 150 of them were treated with the drug.  The results showed no significant impact of the drug on ALS symptoms.  The company that ran the study, PatientsLikeMe, was founded by three MIT engineers, and they published an article describing the trial in Nature Biotechnology.

From the press release:

“This is the first time a social network has been used to evaluate a treatment in a patient population in real time,” says ALS pioneer and PatientsLikeMe Co-Founder Jamie Heywood. “While not a replacement for the gold standard double blind clinical trial, our platform can provide supplementary data to support effective decision-making in medicine and discovery. Patients win when reliable data is made available, sooner.”