The amount of data available on the web is astounding, and if you have the computer programming skills you can often write simple "web scraping" code to automatically harvest that data. But what if you don't have the computer programming skills? You could spend a lot of time learning those skills, or you could hire someone else to write the code for you, but these options cost time and money. A faster and cheaper solution is to use crowdsourcing. In this post I will walk through an example of using Amazon Mechanical Turk to collect data from the web.
In this example I have a list of Twitter IDs and I want to find the Twitter account name associated with each of those IDs. Every Twitter account has a unique ID number that never changes, but the Twitter account name, or "handle," is the more commonly used way of identifying a Twitter user. For example, my Twitter ID is 16016329 and my Twitter account name is @pjlamberson. (You can find your Twitter ID here.) You could use this same procedure to have workers look up any data on the web, such as online reviews, telephone numbers, or addresses.
The first thing to do is go to Amazon Mechanical Turk and click on Get Started on the Get Results side of the screen. Click Create an Account and register as a Requester. Once you're back to the main requester screen click Create, then click New Project.
There are many types of tasks (aka HITs, short for Human Intelligence Tasks) that Amazon Mechanical Turk workers (aka "Turkers") can do. In this example, we are going to use the Data Collection task. So, click on Data Collection and then click Create Project.
On the next screen you enter the properties for your project including how much you want to pay per HIT and how many people you want to complete each task. You can see the properties I chose below, but you may want to change your properties if you expect your task to take longer or if you want multiple workers to complete each HIT to ensure higher quality data.
When you're finished specifying your project properties, click on Design Layout.
For my data collection task I have a spreadsheet full of Twitter IDs. You can see a sample of the spreadsheet at: http://bit.ly/mTurkData
For each row in the spreadsheet, I need the Turker to go to the web address https://twitter.com/intent/user?user_id=XXXXXXXXXXX where XXXXXXXXXXX is replaced with the Twitter ID in that row of the spreadsheet. The corresponding instructions for this task to the Turker are shown below.
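If you want to check the web addresses yourself before posting the job, the pattern above can be sketched in a few lines of Python. This is only an illustration: the function name `profile_url` is my own, and the two IDs are the ones mentioned in this post.

```python
# Sketch: build the lookup address for each Twitter ID.
# The URL pattern comes from the post; profile_url is a hypothetical helper name.
INTENT_URL = "https://twitter.com/intent/user?user_id={twitter_id}"


def profile_url(twitter_id):
    """Return the web address a Turker would visit for one Twitter ID."""
    return INTENT_URL.format(twitter_id=twitter_id)


if __name__ == "__main__":
    for twitter_id in ["16016329", "23779644"]:
        print(profile_url(twitter_id))
```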
To refer to the ID variable from my spreadsheet, I use the syntax ${Id}. The variable name Id matches the header of the corresponding column in my spreadsheet. Mechanical Turk will automatically create one HIT for each row of the spreadsheet. If you have multiple variables that change for each HIT, you can have multiple columns in the spreadsheet and refer to the column with the header "ColumnX" as ${ColumnX} in your instructions. For each HIT, the placeholder ${ColumnX} will be replaced with the value in the ColumnX column of that HIT's row.
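Python's standard `string.Template` class happens to use the same ${Name} placeholder syntax, so it makes a handy way to picture the substitution. This is a rough analogy, not Mechanical Turk's actual implementation, and the instruction text below is a shortened stand-in for the real HIT layout.

```python
from string import Template

# Stand-in for the HIT instructions; the real layout is HTML with more detail.
instructions = Template(
    "Go to https://twitter.com/intent/user?user_id=${Id} "
    "and enter the account name you find there."
)

# One dictionary per spreadsheet row, keyed by column header.
rows = [{"Id": "23779644"}, {"Id": "16016329"}]

# One filled-in set of instructions (one "HIT") per row.
hits = [instructions.substitute(row) for row in rows]

if __name__ == "__main__":
    for hit in hits:
        print(hit)
```

With more columns, each row dictionary would simply carry more keys, one for each ${ColumnX} placeholder in the template.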
The HTML source code for the HIT design shown above is available at http://bit.ly/mTurkSource. If you click on the Source button on your Design Layout page, you can replace the example source code with this code to reproduce my HIT.
Once you have entered your instructions to the Turkers, hit Preview. If the instructions look the way you want them to (note that the variable placeholder will remain a placeholder until you upload your spreadsheet later), hit Finish.
Now you need to upload the spreadsheet containing the variables that change for each HIT and open up the job to the Turkers. To do this click Publish Batch and then click Choose File. Your spreadsheet should be in csv format, have one row for each HIT, and one column for each variable that changes from HIT to HIT. In my case, there is just the one variable, Id. Once you have selected your file, click Upload.
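If your IDs live somewhere other than a spreadsheet, a file in this shape can be produced with Python's built-in `csv` module. The sketch below is minimal and the helper name `build_batch_csv` is my own; the only thing taken from the steps above is the layout, with a header row named Id followed by one row per HIT.

```python
import csv
import io


def build_batch_csv(ids, header="Id"):
    """Return CSV text with one header row and one row per HIT."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header])          # must match ${Id} in the HIT layout
    for twitter_id in ids:
        writer.writerow([twitter_id])  # one row becomes one HIT
    return buffer.getvalue()


if __name__ == "__main__":
    text = build_batch_csv(["23779644", "16016329"])
    # Save the text to a file that can be picked with Choose File.
    with open("twitter_ids.csv", "w", newline="") as f:
        f.write(text)
```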
Now you will see a preview of your HITs with the variable(s) filled in with values from your spreadsheet. In my case, I have this:
Now, instead of ${Id} showing up in the web address, the first Twitter ID from my spreadsheet, 23779644, has been substituted. By clicking on Next HIT, you can see the other HITs that have been created using the data from the spreadsheet. Take a look at a few to make sure everything looks the way you want it to and then click Next.
On the next page you will see how much your batch of HITs is going to cost, which is a combination of the per-HIT reward you set to pay the Turkers and the fee Amazon charges you to use the service. You may need to add funds to your account through Amazon Payments in order to pay for the work. Once you have done that, click Publish HITs and wait for the magic to happen.
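To estimate the bill before you get to this page, the arithmetic can be sketched as below. The fee percentage is a placeholder assumption, not Amazon's actual rate, so check the current fee schedule and pass in the real number. All parameter names are my own.

```python
def batch_cost(num_hits, reward_cents, workers_per_hit=1, fee_percent=20):
    """Estimate the total batch cost in dollars.

    fee_percent is an ASSUMED placeholder, not Amazon's published rate.
    """
    # What the Turkers are paid in total, in whole cents.
    worker_pay_cents = num_hits * workers_per_hit * reward_cents
    # Amazon's fee, modeled here as a flat percentage of worker pay.
    fee_cents = worker_pay_cents * fee_percent / 100
    return (worker_pay_cents + fee_cents) / 100


if __name__ == "__main__":
    # Example: 100 HITs at 5 cents each, one worker per HIT.
    print(batch_cost(num_hits=100, reward_cents=5))
```

Asking several workers to complete each HIT multiplies the worker pay, and therefore the fee, by the same factor.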
This is the fun part. While you surf the web aimlessly, go grab a coffee, play solitaire, or get some really important work done, the tasks you posted are being completed by some of the thousands of Turkers that you are connected to through the platform. It's as if you're the CEO of a major company with a massive workforce waiting to do your bidding at a moment's notice. You can watch the progress the Turkers are making on the results page.
In short order your tasks will be complete and you can see what the Turkers came up with by clicking the Results button. Here is a portion of my results:
If you're satisfied with the results you can approve them so that the Turkers get paid, or if a Turker did not do a satisfactory job you can reject the work. If you do nothing, the HITs will automatically be approved after a set time that you specified when setting up the HITs. To download the results, click on Download CSV, then right-click the "here" link and select Download linked file... and you're all set!
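If you asked multiple workers to complete each HIT, one simple way to combine their answers is a majority vote over the downloaded file. The sketch below assumes the results CSV has an input column named Input.Id and an answer column named Answer.handle. Both names are assumptions: the answer column depends on the field name in your HIT layout, so check the header row of your own download. The sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def majority_answers(csv_text, id_col="Input.Id", answer_col="Answer.handle"):
    """Return the most common answer per ID (column names are assumptions)."""
    votes = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Normalize so "@Name" and "name" count as the same answer.
        answer = row[answer_col].strip().lstrip("@").lower()
        votes[row[id_col]][answer] += 1
    return {tid: counts.most_common(1)[0][0] for tid, counts in votes.items()}


# Made-up sample rows: three workers answered the same HIT.
SAMPLE = (
    "Input.Id,Answer.handle\n"
    "16016329,@pjlamberson\n"
    "16016329,pjlamberson\n"
    "16016329,pjlambersen\n"
)

if __name__ == "__main__":
    print(majority_answers(SAMPLE))
```

IDs where the vote is close are good candidates for a manual check before you approve or reject the work.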