You don't need "Big Data"

The Wall Street Journal recently ran an interesting article about the rise of "Big Data" in business decision making.  The author, Dennis Berman, makes the case for using big data  by pointing out that human decision making is prone to all sorts of errors and biases (referencing Daniel Kahneman's fantastic new book, Thinking, Fast and Slow). There's anchoring, hindsight bias, availability bias, overconfidence, loss aversion, status quo bias, and the list goes on and on. Berman suggests that big data — crunching massive data sets looking for patterns and making predictions — may be the solution to overcoming these flaws in our judgement.

I agree with Berman that big data offers tremendous opportunities, and he's also right to emphasize the ever increasing speed with which we can gather and analyze all that data. But you don't need terabytes of data or a self-organizing fuzzy neural network to improve your decisions. In many cases, all you need is a simple model.

Consider this example from the classic paper "Clinical versus Actuarial Judgment" by Dawes, Faust and Meehl (1989). Twenty-nine judges with varying levels of experience were presented with the scores of 861 patients on the Minnesota Multiphasic Personality Inventory (MMPI), which scores patients on 11 different dimensions and is commonly used to diagnose psychopathologies. The judges were asked to diagnose the patients as either psychotic or neurotic, and their answers were compared with diagnoses from more extensive examinations that occurred over a much longer period of time. On average the judges were correct 62% of the time, and the best individual judge correctly diagnosed 67% of the patients. But none of the judges performed as well as the "Goldberg Rule". The Goldberg Rule is not a fancy model based on reams of data; it's not even a simple linear regression. The rule is just the following simple formula: add three specific dimensions from the test, subtract two others and compare the result to 45. If the result is less than 45 the diagnosis is neurosis; if it is 45 or greater, psychosis. The Goldberg Rule correctly diagnosed 70% of the patients.

It's impressive that this simple non-optimal rule beat every single individual judge, but Dawes and company didn't stop there. The judges were provided with additional training on 300 more samples in which they were given the MMPI scores and the correct diagnosis. After the training, still no single judge beat the Goldberg Rule. Finally, the judges were given not just the MMPI scores, but also the prediction of the Goldberg Rule along with statistical information on the average accuracy of the formula, and still the rule outperformed every judge. This means that the judges were more likely to override the rule based on their personal judgement when the rule was actually correct than when it was incorrect.

This is just one of many studies that have shown time and time again that simple models outperform individual judgement. In his book, "Expert Political Judgment," Philip Tetlock examined 28,000 forecasts of political and economic outcomes by experts and concluded, “It is impossible to find any domain in which humans clearly outperformed crude extrapolation algorithms, less still sophisticated statistical ones.”

And the great thing is, you don't have to be a mathematician or statistician to benefit from the decision-making advantage of models. As Robyn Dawes has shown, even the wrong model typically outperforms individual judgement. So the next time you face an important decision, before you fire up the supercomputer, write down the factors that you think are the most important, assign them weights and add them up. Even something as simple as making a pro and con list, adding the pros and subtracting the cons, is likely to result in a better decision. As Dawes writes, “The whole trick is to know what variables to look at and then know how to add.”
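To make the "write down the factors, assign weights, add them up" recipe concrete, here is a minimal sketch in Python. The factor names, weights, and ratings are all made up for illustration; they are assumptions, not anything from Dawes. A pro and con list is just the special case where every weight is +1 or -1.

```python
def score_option(ratings, weights):
    """Return the weighted sum of one option's factor ratings."""
    return sum(weights[name] * rating for name, rating in ratings.items())


# Hypothetical weights: how much each factor matters (negative = a cost).
weights = {"salary": 3, "commute": 2, "growth": 2, "risk": -1}

# Hypothetical ratings on a 1-10 scale for two options being compared.
options = {
    "job_a": {"salary": 7, "commute": 4, "growth": 6, "risk": 3},
    "job_b": {"salary": 5, "commute": 8, "growth": 7, "risk": 2},
}

scores = {name: score_option(ratings, weights) for name, ratings in options.items()}
best = max(scores, key=scores.get)

print(scores)
print("Pick:", best)
```

The point is not that these particular weights are right. The point is that committing to them on paper before looking at the options keeps the decision consistent from one option to the next.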

Clustering and the Ignorance of Crowds

Over on the Cheap Talk blog (@CheapTalkBlog), Jeff Ely (@jeffely) has an interesting post about the "Ignorance of Crowds." The basic idea is that when there are lots of connections among people, each individual has less incentive to seek out costly information — e.g. subscribe to the newspaper — on their own, because instead they can just get that information ("free ride") from others. More connections means more free riding and fewer informed individuals.

I take a much more complicated route to the same conclusion in "Network Games with Local Correlation and Clustering." Besides being sufficiently mathematically intractable to, hopefully, be published, the paper does show a few other things too. In particular, I look at how network clustering affects "public goods provision," which is the fancy term for what Jeff Ely calls subscribing to the newspaper. Lots of real social networks are highly clustered. This means that if I'm friends with Jack and Jill, there is a good chance that Jack and Jill are friends with each other. What I find in the paper is that clustering increases public goods provision. In other words, when people are members of tight-knit communities, more people should subscribe to the newspaper (and volunteer, and pick up trash, and ...).
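For readers who want to see what "clustered" means numerically, here is a small sketch of the standard local clustering coefficient: the fraction of pairs of my friends who are also friends with each other. The friendship network below is a toy example I made up, and the function name is my own; this is not the model from the paper.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of pairs of node's neighbors that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two friends: no pairs to check
    linked_pairs = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    possible_pairs = k * (k - 1) / 2
    return linked_pairs / possible_pairs


# Toy undirected friendship network, stored as adjacency sets.
friends = {
    "me": {"jack", "jill", "sam"},
    "jack": {"me", "jill"},
    "jill": {"me", "jack"},
    "sam": {"me"},
}

for person in friends:
    print(person, round(local_clustering(friends, person), 2))
```

Here Jack and Jill know each other but neither knows Sam, so only one of my three pairs of friends is connected and my coefficient is 1/3.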

It's pretty clear that the Internet, social media, etc. are increasing the number of contacts that we have, but an interesting question that I haven't seen any research on is how these technologies are affecting clustering (if at all).

The S&P credit downgrade, turmoil in the markets, and the 1973 toilet paper shortage

On Friday, August 5, Standard & Poor's downgraded the credit rating of long-term U.S. debt to AA+. On Monday, the first day the markets were open after the downgrade, the Dow Jones Industrial Average dropped 5.6 percent and the S&P 500 fell 6.7 percent, the biggest single-day drops since the crisis in 2008. A lot of people might be confused about this turmoil in the markets, since U.S. debt is still considered one of the safest investments there is. Jay Forrester, founder of the field of System Dynamics, calls puzzles like this the "counterintuitive behavior of social systems."

Undoubtedly, the world economy is incredibly complex, and no individual or organization has a complete picture of how it works or where it's headed. Through pricing, the market is supposed to aggregate all of the pieces of partial information that we each hold and then converge to the "truth"; that is, prices should reflect true underlying value. In some situations this can actually work. Prediction markets have been shown to be valuable tools for businesses to harvest the "wisdom of the crowds" and assess the probabilities that future events occur. But this mechanism works best when individuals place their trades independently based on their own private information. In the real world, market dynamics are fundamentally social dynamics, and as such they are subject to cascades of panic and the accumulation of overconfidence (what Alan Greenspan famously referred to as "irrational exuberance"; see also Robert Shiller).


The current panic illustrates how even when there is no fundamental basis for a panic, social dynamics can amplify the signal of a panic to the point where an actual crisis ensues. The gas shortages of 1979 are a classic example of this phenomenon. The Iranian revolution sharply cut oil imports to the US from Iran. Nervous consumers rushed to top off their tanks and even to hoard gasoline at home. This drained the supply of gasoline at filling stations, leading to an actual gasoline shortage. Word-of-mouth and media coverage reinforced consumer fears of shortages, leading to even more topping off and hoarding, as well as government policies such as odd/even day purchase rules that actually further incentivized consumers to top off frequently and store gasoline at home. Surprisingly, despite the very real shortage of gasoline at filling stations, US oil imports for the year actually increased in 1979 compared with 1978. The crisis was caused by social dynamics, not an actual drop in supply. (See Sterman, Business Dynamics p. 212).

A similar but more comical crisis occurred in 1973 when Johnny Carson made a joke saying, "You know what’s disappearing from the supermarket shelves?  Toilet paper.  There’s an acute shortage of toilet paper in the United States."  Consumers rushed out to stock up on toilet paper, leading to a real toilet paper shortage in the US that lasted several days.  Even though Carson tried to correct the joke a few days later, by that time toilet paper was in fact in short supply because people were hoarding it at home.

Crowdsourcing the Palin Email Release

Slate reports that several major news outlets, including the Washington Post and the New York Times, are planning to use crowdsourcing to scour the thousands of pages of emails from Sarah Palin's time as Governor of Alaska that will be released on Friday.

In many ways this is a perfect crowdsourcing task. It would be hugely time-consuming for news reporters to sift through the more than 24,000 pages of email themselves. And automating this process would be next to impossible because what counts as "interesting" is very difficult to program into a natural language processor. On the other hand, it is relatively easy for humans to pick out. The task comes with built-in motivation: first, people are personally interested in reading Palin's emails; second, Palin's detractors are motivated to try to dig up embarrassing information and supporters will be motivated to respond; and third, finding something interesting comes with the promise of acknowledgement in the pages of a major news outlet. All this adds up to the fact that you don't need to pay anyone to do this and do it well. The biggest potential pitfall is that crowdsourcing relies fundamentally on local information. Each individual looks through a handful of emails, which is good for finding particular juicy quotes, but not so good for identifying larger patterns. To combat this, the news outlets could rely on wiki-like interfaces where the crowdsourcers could post "leads" that other individuals could add to in order to piece together larger narratives.

"I mourn the loss of thousands of precious lives ... "

There is a great story on the Atlantic’s website about a fake quotation that exploded on Facebook and Twitter after Osama Bin Laden’s death. The quote, wrongly attributed to Martin Luther King Jr., is:

“I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.”

The author of the article, Megan McArdle, traces the origins of the wrongly attributed quote to a Facebook post from a 24-year-old Penn State graduate student (check out the article for the fascinating story). This brings up some interesting issues about rumors and social media. An open question regarding information and the web is whether technologies like social media and the Internet in general increase or decrease the prevalence of false information. On the one hand, the “wisdom of the crowd” might be able to pick out the truth from falsehoods. True statements will be repeated and spread, while false statements will be recognized by a great enough number of people to squelch them. On the other hand, we know that systems like this with strong positive feedbacks can converge to suboptimal solutions. If you think of retweeting some piece of information as casting a vote that it is true, we might expect information cascades of the sort described theoretically by Bikhchandani et al. In this case, two things seem to have happened. Initially, there was a sort of information cascade that led to the spread of the quote. Then it wasn’t the wisdom of the crowds that led to the squelching of the rumor, but the efforts of knowledgeable individuals tracing the quotation back to the initial post. What the Internet provided was a way to uncover the roots of the false information for those willing to take the time to look.
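To give a feel for how a cascade can lock in a falsehood, here is a deliberately simplified sketch in the spirit of the sequential-choice model of Bikhchandani et al. Each person gets a private signal (True means "looks genuine"), sees what everyone before them did, and then either repeats the quote or not. The threshold of 2 and the rule that people follow their own signal until the public evidence outweighs it are assumptions of this toy version, not a faithful reproduction of the paper.

```python
def run_cascade(signals, threshold=2):
    """Simulate sequential choices; return the list of actions taken.

    signals: private signals in order of arrival (True = "looks genuine").
    threshold: assumed net public evidence at which people ignore their signal.
    """
    evidence = 0  # net signals that observers can infer from earlier actions
    actions = []
    for signal in signals:
        if evidence >= threshold:
            action = True   # herd: repeat it regardless of own signal
        elif evidence <= -threshold:
            action = False  # herd: ignore it regardless of own signal
        else:
            action = signal  # still informative: act on own signal
            evidence += 1 if signal else -1
        actions.append(action)
    return actions


# Two early believers are enough to sweep along four later skeptics.
signals = [True, True, False, False, False, False]
actions = run_cascade(signals)
print(actions)
```

In this run four of the six private signals say the quote is bogus, yet everyone repeats it, because once the herd starts, later actions stop revealing anything about what people privately know.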

Homophily and Information Spread

This article in Wired covers new research on networks and information by Sinan Aral (Northwestern B.A. in Political Science, MIT Sloan PhD, now at NYU Stern) and Marshall Van Alstyne. The article describes research on the email communications of members of an executive recruiting firm, and says, “those who relied on a tight cluster of homophilic contacts received more novel information per unit of time.” The article is confusing, though, because it mixes several distinct network concepts: homophily, strong ties, clustering, and “bandwidth.” Homophily is the tendency for people to be connected to other people who are similar to them; birds of a feather flock together. In his seminal paper, “The Strength of Weak Ties,” Mark Granovetter defined the strength of a tie as “a (probably linear) combination of the amount of time, the emotional intensity, the intimacy (mutual confiding), and the reciprocal services which characterize the tie”. Clustering measures the tendency of our friends to be friends with each other. And bandwidth is a less standard term in the social networks literature that captures the total amount of information that flows through a given tie per unit time (and thus is about the same thing as strength of a tie).

After reading the Wired piece, I’m left wondering if it is

  1. strong or “high bandwidth” ties through which we communicate a lot of total information,
  2. homophilic ties with people that are similar to us,
  3. ties with people that are members of a tightly knit cluster of friends, or
  4. all of the above

that provide us with the most novelty in our information diet.

A look at the original research article makes it clearer why the Wired article was so confusing. The actual argument has a lot of moving pieces to it. The first piece is that structurally diverse networks tend to have lower bandwidth ties. Here structurally diverse appears to mean not highly clustered. So, you talk more to the people in your personal clique than to people outside of your tightly knit group. The second piece relates structural diversity to information diversity. They find that the more structurally diverse the network, the more diverse the information that flows through it. So far, this seems to line up with the standard Granovetter weak ties story. The third piece is that increasing bandwidth also increases information diversity, and more importantly, increasing bandwidth increases the total volume of new (non-redundant) information that an individual receives. The idea here is that if you get tons of information from someone, some of it is going to be new.

Finally, since both structural diversity and bandwidth increase information diversity, but structural diversity decreases with increased bandwidth, they set up a head-to-head battle to see whether the information diversity benefits of increasing bandwidth outweigh the costs of reducing structural diversity. They have three main findings on this front that characterize when bandwidth is beneficial:

  • “All else equal, we expect that the greater the information overlap among alters, the less valuable structural diversity will be in providing access to novel information.”
  • “All else equal, the broader the topic space, the more valuable channel bandwidth will be in providing access to novel information.”
  • “All else equal, ... the higher the refresh rate, the more valuable channel bandwidth will be in providing access to novel information.”