Why Some Stories Go Viral (Maybe)

I read a(nother) article on Fast Company today about why some stories "go viral." (Mathematically speaking, why some things go viral and others don't boils down to a  simple equation.)

The article cites research by Jonah Berger and Katherine Milkman that finds articles with more emotional content, especially positive emotional content, are more likely to spread. A quick read of the article seems to promise an easy path to getting your own content on your blog, YouTube, or Twitter to take off. For example, the article cites Gawker editor Neetzan Zimmerman's success, pointing out his posts generate about 30 million views per month — the kind of statistics that get marketers salivating. The scientific research by Berger and Milkman is interesting and well done, but we have to be careful about how far we take the conclusions.

There are two interrelated issues. The first has to do with the "base rate." Part of Berger and Milkman's paper looks at what factors make articles on the New York Times online more likely to wind up on the "most emailed" list. They find, for example, that "a one standard deviation increase in the amount of anger an article evokes increases the odds that it will make the most e-mailed list by 34%."  In this case, the base rate is the percent of articles overall that make the most emailed list. When we hear that writing an especially angry article makes it 34% more likely to get on the most emailed list, it sounds like angry articles have a really good chance of being shared, but this isn't necessarily the case. What we know is that the probability of making the most emailed list given that the article is especially angry equal 1.34 times the base rate — but if the base rate is really low, 1.34 times it will be small too. Suppose for example that only 1 out of every 1000 articles makes the most emailed list, then what the result says is that 1.34 out of every thousand angry articles makes the most emailed list. 1.34 out of a thousand doesn't sound nearly as impressive as "34% more likely."

The second issue has to do with the overall predictability of making the most emailed list. The model that shows the 34% boost for angry content has an R-squared of .28. This model has more than 20 variables including things like article word count, topic, and where the article appeared on the webpage. But even knowing all of these variables, we still can't accurately predict if an article will make the most emailed list or not. All we know is that on average articles with some features are more likely to make the list than articles with other features. But for any particular article, we really can't do a very good job of predicting what's going to happen.

To get a better understanding of this idea, here's another example. In Ohio, 37% of registered voters are registered as Republicans and 36% are registered as Democrats. In Missouri, 39% are registered as Republicans and 37% are registered as Democrats. On average, registered voters in Missouri are more likely to be Republican than registered voters in Ohio, but just because someone is from Missouri doesn't mean we can confidently say they're a Republican. If we only looked at people from Ohio and Missouri, knowing which state a person is from wouldn't be a very good predictor of their party affiliation.