Big Data

Crowdsourcing Web Data with Amazon Mechanical Turk Example

The amount of data available on the web is astounding, and if you have the programming skills you can often write simple "web scraping" code to automatically harvest that data. But what if you don't have those skills? You could spend a lot of time learning them, or you could hire someone else to write the code for you, but these options cost time and money. A faster and cheaper solution is to use crowdsourcing. In this post I will walk through an example of using Amazon Mechanical Turk to collect data from the web.

In this example I have a list of Twitter IDs and I want to find the Twitter account name associated with each of those IDs. Every Twitter account has a unique ID number that never changes, but the Twitter account name, or "handle," is the more commonly used way of identifying a Twitter user. For example, my Twitter ID is 16016329 and my Twitter account name is @pjlamberson. (You can find your Twitter ID here.) You could use this same procedure to have workers look up any data on the web — for example, to collect online reviews, telephone numbers, addresses, or any other type of online data.

The first thing to do is go to Amazon Mechanical Turk and click on Get Started on the Get Results side of the screen. Click Create an Account and register as a Requester. Once you're back to the main requester screen click Create, then click New Project.


There are many types of tasks (aka HITs, or Human Intelligence Tasks) that Amazon Mechanical Turk workers (aka "Turkers") can do. In this example, we are going to use the Data Collection task. So, click on Data Collection and then click Create Project.


On the next screen you enter the properties for your project including how much you want to pay per HIT and how many people you want to complete each task. You can see the properties I chose below, but you may want to change your properties if you expect your task to take longer or if you want multiple workers to complete each HIT to ensure higher quality data.

When you're finished specifying your project properties, click on Design Layout.

For my data collection task I have a spreadsheet full of Twitter IDs. You can see a sample of the spreadsheet at: http://bit.ly/mTurkData

For each row in the spreadsheet, I need the Turker to go to the web address https://twitter.com/intent/user?user_id=XXXXXXXXXXX where XXXXXXXXXXX is replaced with the Twitter ID in that row of the spreadsheet. The corresponding instructions for this task to the Turker are shown below.

To refer to the Id variable from my spreadsheet I use the syntax ${Id}. The variable name Id matches the header of the corresponding column in my spreadsheet. Mechanical Turk will automatically create one HIT for each row of the spreadsheet. If you have multiple variables that change for each HIT, you can have multiple columns in the spreadsheet and refer to a column with header "ColumnX" using ${ColumnX} in your instructions. For each HIT, the placeholder ${ColumnX} will be replaced with the value of ColumnX from the corresponding row of your spreadsheet.
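If you want to sanity-check what this substitution will produce before publishing, here is a minimal Python sketch of what Mechanical Turk does with each row (the file name twitter_ids.csv is just a stand-in for your own spreadsheet; the Id column header matches the ${Id} placeholder):

```python
import csv

# The HIT instructions contain this placeholder URL.
TEMPLATE_URL = "https://twitter.com/intent/user?user_id=${Id}"

# Hypothetical spreadsheet name; one column per template variable,
# with header "Id" to match the ${Id} placeholder in the design.
with open("twitter_ids.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Mechanical Turk performs this substitution once per row,
        # producing one HIT per row of the spreadsheet.
        print(TEMPLATE_URL.replace("${Id}", row["Id"]))
```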

The HTML source code for the HIT design shown above is available at http://bit.ly/mTurkSource. If you click on the Source button on your Design Layout page, you can replace the example source code with this to reproduce my HIT.

Once you have entered your instructions to the Turkers, hit Preview. If the instructions look like you want them to (note the variable placeholder will still be just a placeholder until you upload your spreadsheet later), hit Finish.

Now you need to upload the spreadsheet containing the variables that change for each HIT and open up the job to the Turkers. To do this click Publish Batch and then click Choose File. Your spreadsheet should be in csv format, have one row for each HIT, and one column for each variable that changes from HIT to HIT. In my case, there is just the one variable, Id. Once you have selected your file, click Upload.
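In case it helps, here is a minimal sketch of producing such a file in Python (the list of IDs and the file name are placeholders for your own data):

```python
import csv

# Placeholder IDs; in practice these come from wherever your data lives.
twitter_ids = ["23779644", "16016329"]

with open("twitter_ids.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id"])        # one column per variable that changes per HIT
    for tid in twitter_ids:
        writer.writerow([tid])     # one row per HIT
```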

Now you will see a preview of your HITs with the variable(s) filled in with values from your spreadsheet. In my case, I have this:

Now, instead of ${Id} showing up in the web address, the first Twitter Id from my spreadsheet, 23779644, has been substituted. By clicking on Next HIT, you can see the other HITs that have been created using the data from the spreadsheet. Take a look at a few to make sure everything looks the way you want and then click Next.

On the next page you will see how much your batch of HITs is going to cost, which is a combination of the per HIT fee you set to pay the Turkers and the fee Amazon charges you to use the service. You may need to add funds to your account through Amazon payments in order to pay for the work. Once you have done that click Publish HITs and wait for the magic to happen.
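If you want a rough sense of the total before you get to this page, the arithmetic is straightforward (the 20% fee rate below is purely illustrative; use whatever rate Amazon quotes for your batch):

```python
# Back-of-the-envelope batch cost estimate. The fee_rate is a made-up
# placeholder -- Amazon's actual fee is shown when you publish the batch.
num_hits = 1000          # one HIT per spreadsheet row
assignments_per_hit = 1  # workers per HIT, set in the project properties
reward_per_hit = 0.05    # dollars paid to the Turker per completed HIT
fee_rate = 0.20          # hypothetical Amazon fee as a fraction of the reward

total = num_hits * assignments_per_hit * reward_per_hit * (1 + fee_rate)
print(f"Estimated batch cost: ${total:.2f}")  # -> Estimated batch cost: $60.00
```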

This is the fun part. While you surf the web aimlessly, grab a coffee, play solitaire, or get some really important work done, the tasks you posted are being completed by some of the thousands of Turkers you are connected to through the platform. It's as if you're the CEO of a major company with a massive workforce waiting to do your bidding at a moment's notice. You can watch the progress the Turkers are making on the results page.

In short order your tasks will be complete and you can see what the Turkers came up with by clicking the Results button. Here is a portion of my results:

If you're satisfied with the results you can approve them so that the Turkers get paid, or if a Turker did not do a satisfactory job you can reject the work. If you do nothing, the HITs will automatically be approved after the time you specified when setting up the HITs. To download the results, click Download CSV, then right-click the here link, select Download linked file..., and you're all set!
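Once the file is downloaded, pulling out the answers takes just a few lines. A hedged sketch (the exact column names depend on your HIT design; batch results files typically prefix your spreadsheet columns with Input. and the form fields from your layout with Answer., so check the header row of your own file):

```python
import csv

# "results.csv" is the file downloaded from the Results page.
# "Answer.account_name" is a hypothetical column name -- it depends on
# what the answer field in your HIT layout is called.
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["Input.Id"], row["Answer.account_name"])
```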

Big Data and the Wisdom of Crowds are not the same

I was surprised this week to find an article on Big Data in the New York Times Men's Fashion Edition of the Style Magazine. Finally! Something in the Fashion issue that I can relate to, I thought. Unfortunately, the article by Andrew Ross Sorkin (author of Too Big To Fail) made one crucial mistake. The downfall of the article was conflating two distinct concepts that are both near and dear to my research, Big Data and the Wisdom of Crowds, which led to a completely wrong conclusion.

Big Data is what it sounds like — using very large datasets for ... well for whatever you want. How big is Big depends on what you're doing.  At a recent workshop on Big Data at Northwestern University, Luís Amaral defined Big Data to be basically any data that is too big for you to handle using whatever methods are business as usual for you. So, if you're used to dealing with data in Excel on a laptop, then data that needs a small server and some more sophisticated analytics software is Big for you. If you're used to dealing with data on a server, then your Big might be data that needs a room full of servers.

The Wisdom of Crowds is the idea that, as collectives, groups of people can make more accurate forecasts or come up with better solutions to problems than the individuals in them could on their own. A different recent New York Times article has some great examples of the Wisdom of Crowds. The article talks about how the Navy has used groups to help make forecasts, and in particular forecasts for the locations of lost items like "sunken ships, spent warheads and downed pilots in vast, uncharted waters." The article tells one incredible story of how they used this idea to locate a missing submarine, the Scorpion:

"... forecasters draw on expertise from diverse but relevant areas — in the case of finding a submarine, say, submarine command, ocean salvage, and oceanography experts, as well as physicists and engineers. Each would make an educated guess as to where the ship is ... This is how Dr. Craven located the Scorpion.

“I knew these guys and I gave probability scores to each scenario they came up with,” Dr. Craven said. The men bet bottles of Chivas Regal to keep matters interesting, and after some statistical analysis, Dr. Craven zeroed in on a point about 400 miles from the Azores, near the Sargasso Sea, according to a detailed account in “Blind Man’s Bluff,” by Christopher Drew and Sherry Sontag. The sub was found about 200 yards away."

This is a perfect example of the Wisdom of Crowds: by pooling the forecasts of a diverse group, they came up with an accurate collective forecast.

So, how do Big Data and The Wisdom of Crowds get mixed up? The mixup comes from the fact that a lot of Big Data is data on the behavior of crowds. The central example in Sorkin's article is data from Twitter, and in particular data that showed a lot of people on Twitter were very unhappy with antigay comments made by Phil Robertson, the star of A&E's Duck Dynasty. The short version of the story is that A&E initially terminated Robertson in response to the Twitter data, but Sorkin argues this was a business mistake because Twitter users are "not exactly regular watchers of the camo-wearing Louisiana clan whose members openly celebrate being 'rednecks'." He also cites evidence that data from Twitter does not provide accurate election predictions for essentially the same reason — the people that are tweeting are not a representative sample of the people that are voting. All of this is correct. Using a big dataset does not mean that you don't have to worry about having a biased sample. No matter how big your dataset, a biased sample can lead to incorrect conclusions.

A classic example is the prediction by The Literary Digest in 1936 that Alf Landon would be the overwhelming winner of the presidential election that year. In fact, Franklin Roosevelt carried 46 of the 48 states. The prediction was based on a huge poll with 2.4 million respondents, but the problem was that the sample drew primarily on Literary Digest subscribers and automobile and telephone owners. This sample tended to be more affluent than the average voter, and thus favored Landon's less progressive policies.
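A quick simulation makes the point concrete. The numbers below are made up purely for illustration: 60% of a scaled-down electorate favors candidate A, but supporters of candidate B are five times easier for the pollster to reach, so a huge poll drawn from that biased frame calls the election wrong while a small uniform sample gets it right.

```python
import random

random.seed(0)

# Made-up, scaled-down electorate: 60% favor candidate A, 40% favor B.
# B supporters sit in a group the pollster reaches 5x more easily.
population = [("A", 1)] * 60_000 + [("B", 5)] * 40_000
votes = [v for v, _ in population]
weights = [w for _, w in population]

def support_for_A(sample):
    return sum(v == "A" for v in sample) / len(sample)

biased_poll = random.choices(votes, weights=weights, k=50_000)  # big but biased
unbiased_poll = random.sample(votes, k=500)                     # small but unbiased

print("True support for A:      60.0%")
print(f"Biased poll (n=50,000):  {support_for_A(biased_poll):.1%}")   # ~23%, wrongly predicts B
print(f"Unbiased poll (n=500):   {support_for_A(unbiased_poll):.1%}") # ~60%, correctly predicts A
```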

So, Sorkin is on the right track to write a great article on how sample bias is still important even when you have Big Data. This is a really important point that a lot of people don't appreciate. But unfortunately the article veers off that track when it starts talking about the Wisdom of Crowds. The Wisdom of Crowds is not about combining data on large groups, but about combining the predictions, forecasts, or ideas of groups (they don't even have to be that large). If you want to use the Wisdom of Crowds to predict an election winner, you don't collect data on who people are tweeting about; you ask a crowd who they think is going to win. If you want to use the Wisdom of Crowds to decide whether or not you should fire Phil Robertson, you ask a crowd, "Do you think A&E will be more profitable if they fire Phil Robertson or not?" As angry as all of those tweets were, many of those angry voices on Twitter would probably concede that Robertson's remarks wouldn't damage the show's standing with its core audience.

The scientific evidence shows that using crowds is a pretty good way to make a prediction, and it often outperforms forecasts based on experts or Big Data. For example, looking at presidential elections from 1988 to 2004, relatively small Wisdom of Crowds forecasts outperformed the massive Gallup Poll by 0.3 percentage points (Wolfers and Zitzewitz, 2006). This isn't a huge margin, but keep in mind that the Gallup presidential polls are among the most expensive, sophisticated polling operations in history, so the fact that the crowd forecasts are even in the ballpark, let alone better, is pretty significant.

The reason the Wisdom of Crowds works is that when some people forecast too high and others forecast too low, their errors cancel out and bring the average closer to the truth. The accuracy of a crowd forecast depends both on the accuracy of the individuals in the crowd and on their diversity — how likely their errors are to be in opposite directions. The great thing about it is that you can make up for low accuracy with high diversity, so even crowds in which the individual members are not that great on their own can make pretty good predictions as collectives. In fact, as long as some of the individual predictions are on both sides of the true answer, the crowd forecast will always be closer to the truth than the average individual in the crowd. It's a mathematical fact that is true 100% of the time. Sorkin concludes his article, based on the examples of inaccurate predictions from Big Data with biased samples, by writing, "A crowd may be wise, but ultimately, the crowd is no wiser than the individuals in it." But this is exactly backwards. A more accurate statement would be, "A crowd may or may not be wise, but ultimately, it's always at least as wise as the individuals in it. Most of the time it's wiser."
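A tiny numerical illustration of that cancellation (the guesses below are made up; the only thing that matters is that some fall above the true value and some below):

```python
# Toy illustration of crowd error vs. average individual error.
truth = 100
guesses = [70, 85, 95, 120, 140]  # some guesses above the truth, some below

crowd_guess = sum(guesses) / len(guesses)                                   # 102.0
crowd_error = abs(crowd_guess - truth)                                      # 2.0
avg_individual_error = sum(abs(g - truth) for g in guesses) / len(guesses)  # 22.0

print(f"Crowd guess: {crowd_guess}, crowd error: {crowd_error}")
print(f"Average individual error: {avg_individual_error}")
# Over- and under-estimates cancel when averaged, so the crowd's error (2)
# is far smaller than the average individual's error (22).
```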