Crowdsourcing Web Data with Amazon Mechanical Turk Example

The amount of data available on the web is astounding, and if you have the computer programming skills you can often write simple "web scraping" code to automatically harvest that data. But what if you don't have those skills? You could spend a lot of time learning them, or you could hire someone else to write the code for you, but both options cost time and money. A faster and cheaper solution is to use crowdsourcing. In this post I will walk through an example of using Amazon Mechanical Turk to collect data from the web.

In this example I have a list of Twitter IDs and I want to find the Twitter account name associated with each of those IDs. Every Twitter account has a unique ID number that never changes, but the Twitter account name, or "handle," is the more commonly used way of identifying a Twitter user. For example, my Twitter ID is 16016329 and my Twitter account name is @pjlamberson. (You can find your Twitter ID here.) You could use this same procedure to have workers look up any data on the web — for example, to collect online reviews, telephone numbers, addresses, or any other type of online data.

The first thing to do is go to Amazon Mechanical Turk and click on Get Started on the Get Results side of the screen. Click Create an Account and register as a Requester. Once you're back to the main requester screen click Create, then click New Project.


There are many types of tasks (aka HITs, short for Human Intelligence Tasks) that Amazon Mechanical Turk workers (aka "Turkers") can do. In this example, we are going to use the Data Collection task. So, click on Data Collection and then click Create Project.



On the next screen you enter the properties for your project including how much you want to pay per HIT and how many people you want to complete each task. You can see the properties I chose below, but you may want to change your properties if you expect your task to take longer or if you want multiple workers to complete each HIT to ensure higher quality data.

When you're finished specifying your project properties, click on Design Layout.

For my data collection task I have a spreadsheet full of Twitter IDs. You can see a sample of the spreadsheet at:

For each row in the spreadsheet, I need the Turker to go to the web address where XXXXXXXXXXX is replaced with the Twitter ID in that row of the spreadsheet. The corresponding instructions for this task to the Turker are shown below.

To refer to the variable Id that shows up in my spreadsheet I use the syntax ${Id}. The variable name Id matches the header of the corresponding column in my spreadsheet. Mechanical Turk will automatically create one HIT for each row of the spreadsheet. If you have multiple variables that change for each HIT, you can have multiple columns in the spreadsheet and refer to the column with header "ColumnX" as ${ColumnX} in your instructions. For each HIT, the placeholder ${ColumnX} will be replaced with the value of ColumnX from the corresponding row of your spreadsheet.
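The placeholder substitution described above can be mimicked locally, which is a handy way to proofread your instructions before uploading anything. The following is only a sketch of the idea, not Mechanical Turk's own code: Python's string.Template happens to use the same ${Name} syntax, the instruction text is a hypothetical stand-in for your real HIT layout, and the two IDs are the ones mentioned in this post.

```python
import csv
import io
from string import Template

# Hypothetical instruction text; your real HIT layout will differ.
INSTRUCTIONS = Template("Find the Twitter account name for ID ${Id}.")

# Stand-in for the spreadsheet: a header row, then one row per HIT.
SAMPLE_CSV = "Id\n23779644\n16016329\n"


def render_hits(csv_text, template):
    """Return one filled-in instruction string per spreadsheet row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each row is a dict keyed by column header, so ${Id} picks up the Id column.
    return [template.substitute(row) for row in reader]


hits = render_hits(SAMPLE_CSV, INSTRUCTIONS)
for hit in hits:
    print(hit)
```

If a placeholder in the template has no matching column header, substitute raises a KeyError, which is a quick way to catch a misspelled header before you publish.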

The HTML source code for the HIT design shown above is available at If you click on the Source button on your Design Layout page, you can replace the example source code with this to reproduce my HIT.

Once you have entered your instructions to the Turkers, hit Preview. If the instructions look like you want them to (note the variable placeholder will still be just a placeholder until you upload your spreadsheet later), hit Finish.

Now you need to upload the spreadsheet containing the variables that change for each HIT and open up the job to the Turkers. To do this click Publish Batch and then click Choose File. Your spreadsheet should be in csv format, have one row for each HIT, and one column for each variable that changes from HIT to HIT. In my case, there is just the one variable, Id. Once you have selected your file, click Upload.
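If your IDs live somewhere other than a spreadsheet, a few lines of Python can produce a file in the shape described above: one header row naming the variable, then one row per HIT. This is a minimal sketch; the function name build_input_csv is made up for illustration, and the two IDs are the ones mentioned in this post.

```python
import csv
import io


def build_input_csv(twitter_ids, header="Id"):
    """Return CSV text with a header row followed by one row per ID."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header])  # must match the ${Id} placeholder in the HIT
    for twitter_id in twitter_ids:
        writer.writerow([twitter_id])
    return buffer.getvalue()


csv_text = build_input_csv([23779644, 16016329])
print(csv_text)

# To save it for upload, write csv_text to a file of your choosing, e.g.:
# with open("twitter_ids.csv", "w", newline="") as f:
#     f.write(csv_text)
```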

Now you will see a preview of your HITs with the variable(s) filled in with values from your spreadsheet. In my case, I have this:

Now, instead of ${Id} showing up in the web address, the first Twitter ID from my spreadsheet, 23779644, has been substituted. By clicking on Next HIT, you can see the other HITs that have been created using the data from the spreadsheet. Take a look at a few to make sure everything looks the way you want it to and then click Next.

On the next page you will see how much your batch of HITs is going to cost, which is a combination of the per-HIT reward you set to pay the Turkers and the fee Amazon charges you to use the service. You may need to add funds to your account through Amazon Payments in order to pay for the work. Once you have done that, click Publish HITs and wait for the magic to happen.
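To get a rough idea of the total before you reach this page, you can do the arithmetic yourself: the number of HITs times the number of workers per HIT times the reward, plus Amazon's fee. The sketch below assumes the fee is a simple percentage of worker pay. That is a simplification, and the 20% used in the example is only a placeholder, so check Amazon's current pricing for the real rate and any minimums.

```python
from decimal import Decimal


def batch_cost(num_hits, reward_per_assignment, assignments_per_hit, fee_rate):
    """Estimate total batch cost: worker pay plus a percentage fee.

    reward_per_assignment and fee_rate are passed as strings (e.g. "0.05")
    so that Decimal keeps the cents exact.
    """
    worker_pay = Decimal(num_hits) * assignments_per_hit * Decimal(reward_per_assignment)
    fee = worker_pay * Decimal(fee_rate)  # assumed flat-percentage fee
    return worker_pay + fee


# Example: 100 HITs, one worker each, 5 cents per HIT, placeholder 20% fee.
total = batch_cost(100, "0.05", 1, "0.20")
print(f"Estimated cost: ${total:.2f}")
```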

This is the fun part. While you surf the web aimlessly, go grab a coffee, play solitaire, or get some really important work done, the tasks you posted are being completed by some of the thousands of Turkers that you are connected to through the platform. It's as if you're the CEO of a major company with a massive workforce waiting to do your bidding at a moment's notice. You can watch the progress the Turkers are making on the results page.

In short order your tasks will be complete and you can see what the Turkers came up with by clicking the Results button. Here is a portion of my results:

If you're satisfied with the results you can approve them so that the Turkers get paid, or if a Turker did not do a satisfactory job you can reject the work. If you do nothing, the HITs will automatically be approved after a set time that you specified when setting up the HITs. To download the results just click on Download CSV and then right click and select Download linked file... on the here link and you're all set!
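If you asked for multiple workers to complete each HIT, the downloaded CSV will contain several answers per Twitter ID, and a simple majority vote is one way to settle on a single answer. The sketch below is an illustration only. The column names Input.Id and Answer.handle are assumptions, so check the header row of your own results file, and the sample rows are made up to show one worker disagreeing.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up sample results: three workers answered the same HIT.
SAMPLE_RESULTS = (
    "Input.Id,Answer.handle\n"
    "16016329,@pjlamberson\n"
    "16016329,PJLamberson \n"
    "16016329,@someoneelse\n"
)


def normalize(handle):
    """Strip whitespace, a leading @, and case so equivalent answers match."""
    return handle.strip().lstrip("@").lower()


def majority_answers(csv_text, id_column="Input.Id", answer_column="Answer.handle"):
    """Map each ID to the most common normalized answer among its workers."""
    votes = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        votes[row[id_column]][normalize(row[answer_column])] += 1
    # On a tie, most_common returns whichever answer was seen first.
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


answers = majority_answers(SAMPLE_RESULTS)
print(answers)
```

IDs where the workers split evenly are worth checking by hand before you approve or reject the work.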

Clustering and the Ignorance of Crowds

Over on the Cheap Talk blog (@CheapTalkBlog), Jeff Ely (@jeffely) has an interesting post about the "Ignorance of Crowds." The basic idea is that when there are lots of connections among people, each individual has less incentive to seek out costly information — e.g. subscribe to the newspaper — on their own, because instead they can just get that information ("free ride") from others. More connections means more free riding and fewer informed individuals.

I take a much more complicated route to the same conclusion in "Network Games with Local Correlation and Clustering." Besides being sufficiently mathematically intractable to, hopefully, be published, the paper does show a few other things too. In particular, I look at how network clustering affects "public goods provision," which is the fancy term for what Jeff Ely calls subscribing to the newspaper. Lots of real social networks are highly clustered. This means that if I'm friends with Jack and Jill, there is a good chance that Jack and Jill are friends with each other. What I find in the paper is that clustering increases public goods provision. In other words, when people are members of tight-knit communities, more people should subscribe to the newspaper (and volunteer, and pick up trash, and ...).

It's pretty clear that the Internet, social media, etc. are increasing the number of contacts that we have, but an interesting question that I haven't seen any research on is how these technologies are affecting clustering (if at all).

Social Dynamics of the Bin Laden Death Celebration

Many people, including myself, have been a little disturbed by the wild celebrations of Osama bin Laden’s death. An article in the New York Times quotes a number of psychologists who explain the partying as natural, cathartic “pure existential release.” It’s not until the last two paragraphs of the article that they hit on what I think was the real driving force behind the “chanting and frat-party revelry”: crowd dynamics. The article says, “in a crowd of like-minded people, the most intense drives for justice become the norm: People who may have felt a mix of emotions in response to the news can be swept up in the general revelry.”

The dynamic is similar to that detailed by Cass Sunstein in his book Going to Extremes. (I’m currently writing a paper that develops formal models to explain the going-to-extremes dynamic.) Sunstein describes a pile of social psychology research demonstrating that when like-minded individuals discuss their opinions, they become more extreme, rather than converging to the mean. A prime example is risk taking among teenagers: a bunch of kids who would never try driving their car 150 miles an hour or shotgunning cases of beer on their own will turn into drunken race car drivers in a crowd of their peers. I imagine the dynamic was much the same around the Georgetown bars last Sunday night. Riots can erupt the same way. Most people wouldn’t think of throwing bricks through store windows or setting cop cars on fire, but in the midst of a rioting crowd our behavior can be much different.