Social Media

What it takes to "Go Viral"

It seems like we hear a new story every week: a video, or a rumor, or a song, or a commercial has "gone viral," spreading across the web like wildfire, racing to the top of the most tweeted list, and grabbing headlines in real, old-fashioned news media. These memes can be disgusting (like the Domino's pizza video), controversial (like the recent Kony 2012 video), or entertaining ("Friday"?). They can be disasters for companies (see Domino's above), or marketing campaigns that reach hundreds of thousands, or even millions, of viewers for relatively little investment (1300 foot drop, the Old Spice Guy). Given the potential impact of these "memes," there is a lot of interest in what exactly determines whether or not a video, or a message, or a rumor goes viral. Here's a simple model that explains why some things do and some things don't.

Let's consider the example of a YouTube video. Suppose that on average, every person that views the video tells c of their friends about it per day (c stands for contacts), and suppose that some fraction i of the people that hear about the video actually watch it and start telling other people about it themselves (i stands for infectivity, and captures something like how interesting the video is). Finally, suppose that on average, each person that is actively spreading word of the video does so for d days before they get bored and stop telling people about the video (d stands for duration).

To keep things simple, suppose that there are a total of N people in the population, and every one of these people is either actively spreading the video, or not actively spreading the video, but susceptible to becoming a video spreader. Let I denote the number of people currently spreading (i.e. infected) and S the number of people that are susceptible, but not currently spreading the video. So, I+S=N.

To see if the video goes viral or not, we just have to compare the rate at which people are becoming infected to the rate at which people are discontinuing sharing the video. It helps to think of a bath tub — the level of water in the bath tub represents the number of people spreading the video. The rate that water flows in through the faucet is the rate at which new people are becoming infected with the video spreading virus; the rate at which water drains out is the rate at which people are stopping spreading the video. If the rate at which water flows in is higher than the rate at which it drains out, the tub will keep filling up. On the other hand, if the drain is more open than the faucet, the bath tub will never fill up.

So, we have to figure out the rate at which new people are starting to spread the video and the rate at which people currently sharing the video are stopping. The second one is easier. If I people are currently sharing the video and each one of them shares it for d days on average, then each day we expect I/d people to stop spreading the video. For the first rate, we have I people actively sharing the video. On average, each one of them shares the video with c contacts per day, resulting in a total of cI contacts per day for the whole population. But not all of these contacts result in a new person sharing the video. First, some of these people will already be sharing the video. The probability that a given person is not currently sharing the video is S/N, the fraction of "susceptible" people in the population. So, we expect cIS/N instances per day in which a person shares the video with someone that is not currently spreading the video. Given such a contact, we said that a fraction i of these will result in a new person sharing the video. Putting it all together, the rate at which new people are becoming infected with the video sharing virus is ciIS/N.

Now we have to compare our two rates. The video will go viral if ciIS/N>I/d. Dividing both sides by I and multiplying both sides by d, this becomes cidS/N>1. Finally, we can make life a little simpler by assuming that initially almost no one knows about the video, so the number of susceptible people S and the total population N are about the same. Then S/N is approximately 1, so the condition simplifies to just cid>1.
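Here is a minimal sketch of the two rates and the threshold test in Python. The function names and all of the numbers are made up for illustration; only the formulas ciIS/N, I/d, and cid>1 come from the model above.

```python
def infection_rate(c, i, I, S, N):
    """New spreaders per day: c * i * I * S / N."""
    return c * i * I * S / N


def dropout_rate(I, d):
    """Spreaders who stop per day: I / d."""
    return I / d


def goes_viral(c, i, d):
    """Threshold test when S/N is approximately 1: c * i * d > 1."""
    return c * i * d > 1


# Hypothetical values, chosen only to illustrate the comparison.
N = 1_000_000          # total population
I = 10                 # people currently spreading the video
S = N - I              # people who are susceptible
c, i, d = 5, 0.1, 3    # contacts per day, infectivity, duration in days

inflow = infection_rate(c, i, I, S, N)
outflow = dropout_rate(I, d)
print("faucet beats drain:", inflow > outflow)
print("cid > 1:", goes_viral(c, i, d))
```

With so few people infected, S/N is nearly 1, so the full comparison of rates and the shortcut cid>1 give the same answer.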

This simple inequality tells us whether or not the video will go viral. It says that if the average number of contacts, times the infectivity, times the duration is greater than one, the video will spread; otherwise it will die out. Right at cid=1 there is a tipping point: crossing this threshold causes a discontinuous jump in the outcome, from a video that fizzles out to one that keeps spreading.
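To see the tipping point in action, here is a rough simulation sketch. It steps the bath tub forward one day at a time, which is my own simplification (the model itself is stated in terms of rates), and the parameter values are invented for illustration.

```python
def simulate(c, i, d, N=1_000_000, I0=10, days=60):
    """Step the model one day at a time; return the number of spreaders each day."""
    I = float(I0)
    history = [I]
    for _ in range(days):
        S = N - I                          # susceptible people
        new_spreaders = c * i * I * S / N  # water in through the faucet
        dropouts = I / d                   # water out through the drain
        I = min(max(I + new_spreaders - dropouts, 0.0), float(N))
        history.append(I)
    return history


# Same contacts and infectivity, two different durations (hypothetical values).
fizzles = simulate(c=5, i=0.1, d=1.5)   # cid = 0.75, below the threshold
spreads = simulate(c=5, i=0.1, d=3.0)   # cid = 1.5, above the threshold

print("cid = 0.75, spreaders after 60 days:", round(fizzles[-1], 4))
print("cid = 1.50, spreaders after 60 days:", round(spreads[-1]))
```

Both runs start with the same ten spreaders. Below the threshold the count shrinks toward zero; above it the count grows until enough of the population is already spreading that S/N is no longer close to 1.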

This model makes a lot of assumptions that don't really hold (big ones are that people have roughly the same number of contacts on average, and that people basically interact at random), but it gives us a basic understanding of the process. Even in more complicated models, where we make fewer simplifying assumptions, there is typically a similar tipping point, and increasing either contacts, infectivity, or duration increases the chance of crossing that threshold.

So, there you have it — everything you need to go viral: a network with enough contacts (c); a product, or message, that sounds interesting enough to be infectious (i), and with enough staying power so that people keep telling their friends about it for a long time (d).

Social Dynamics Videos

While I've been teaching Social Dynamics and Networks at Kellogg, I've amassed a collection of links to interesting videos on social dynamics. Here they are:

Duncan Watts TEDx talk on "The Myth of Common Sense"

Nicholas Christakis TED talk on "The hidden influence of social networks"; TED talk on "How social networks predict epidemics."

James Fowler talking about social influence on the Colbert Report.

Sinan Aral TEDx talk on "Social contagion"; at PopTech 2010 on "Social contagion"; at Nextwork on "Social contagion"; at the International Conference on Weblogs and Social Media on "Content and causality in social networks."

Scott E. Page on "Leveraging Diversity", and at TEDxUofM on "Putting Milk Crates on the Internet."

Eli Pariser TED talk on "Beware online 'filter bubbles'"

Freakonomics podcast on "The Folly of Prediction"

Damon Centola on "Network Contagion."

Jure Leskovec on "The Web as a Laboratory for Studying Humanity"

There are several good videos of talks from the Web Science Meets Network Science conference at Northwestern: Duncan Watts, Albert-Laszlo Barabasi, Jure Leskovec, and Sinan Aral.

The "Did You Know?" series of videos has some incredible information about, well, information. More info here.

Training Computers with Crowds

Computers are awesome, but they don't know how to do much on their own; you have to train them. Crowdsourcing turns out to be a great way to do this. Suppose you would like to have an algorithm to measure something, like whether a tweet about a movie is positive or negative. You might want to know this so you can count positive and negative tweets about a particular movie and use that information to predict box office success (like Asur and Huberman do in this paper). You could try to think of all of the positive and negative words that you know and then only count tweets that include those words, but you'd probably miss a lot. You could categorize all of the tweets yourself, or hire a student to do it, but by the time you finished the movie would be on late night cable TV. You need a computer algorithm so you can pull thousands of tweets and count them quickly, but a computer just doesn't know the difference between a positive tweet and a negative tweet until you train it.

That's where the crowd comes in. People can easily judge the tone of a tweet, and you don't have to be an expert to do it. So, what you can do is gather a pile of tweets (say a few thousand), put them up on Amazon Mechanical Turk, and let the crowd label them as positive or negative. At a few cents per tweet you can do this for something in the ballpark of a hundred bucks. Now that you have a pile of labeled tweets, you can train the computer. There are lots of fancy terms for it (language model classifiers, self organizing fuzzy neural networks, ...) but basically, you run a regression. The independent variables are things the computer can measure, like how many times certain words appear, and the dependent variable is whether the tweet is positive or negative. You estimate the regression (a.k.a. train the classifier) on the tweets labeled by the crowd, and now you have an algorithm that can label new tweets that the crowd hasn't labeled. When the next movie is coming out, you harvest the unlabeled tweets and feed them through the computer to see how many are positive and negative.
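Here is a toy version of that "regression" in Python: a logistic regression on word counts, fit with stochastic gradient descent. The six tweets and their labels are invented stand-ins for what the crowd would give you, and the function names are mine. A real project would use thousands of tweets and an off-the-shelf library.

```python
import math
from collections import Counter


def features(text):
    """Independent variables: how many times each word appears in the tweet."""
    return Counter(text.lower().split())


def predict_prob(weights, bias, feats):
    """Estimated probability that a tweet is positive."""
    score = bias + sum(weights.get(word, 0.0) * n for word, n in feats.items())
    return 1.0 / (1.0 + math.exp(-score))


def train(labeled, epochs=200, lr=0.5):
    """Fit a logistic regression by stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in labeled:
            feats = features(text)
            error = label - predict_prob(weights, bias, feats)
            bias += lr * error
            for word, n in feats.items():
                weights[word] = weights.get(word, 0.0) + lr * error * n
    return weights, bias


# Made-up tweets standing in for crowd labels (1 = positive, 0 = negative).
labeled = [
    ("loved this movie great acting", 1),
    ("great fun loved it", 1),
    ("best movie this year", 1),
    ("boring movie terrible acting", 0),
    ("terrible waste of time", 0),
    ("worst movie boring plot", 0),
]
weights, bias = train(labeled)

# New tweets that the crowd never labeled.
for tweet in ["loved it great plot", "boring and terrible"]:
    p = predict_prob(weights, bias, features(tweet))
    print(tweet, "->", "positive" if p > 0.5 else "negative")
```

Words the classifier never saw in training (like "and") simply get a weight of zero, which is one reason you want a big pile of labeled tweets.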

This is exactly how Hany Farid at Dartmouth trained his algorithm for detecting how much digital photographs have been altered. On its own the computer can measure lots of fancy statistical features of the image, but judging how significant the alteration of the image is requires a human. So, he gave lots of pairs of original and altered images to people on MTurk and had them rate how altered the images were. Then he essentially let the computer figure out what image characteristics for the altered images correlate with high alteration scores (but in a much fancier way than just a regular regression). Now, he has a trained algorithm that can read in photographs where we don't have the original and predict how altered the image is.

Clustering and the Ignorance of Crowds

Over on the Cheap Talk blog (@CheapTalkBlog), Jeff Ely (@jeffely) has an interesting post about the "Ignorance of Crowds." The basic idea is that when there are lots of connections among people, each individual has less incentive to seek out costly information — e.g. subscribe to the newspaper — on their own, because instead they can just get that information ("free ride") from others. More connections means more free riding and fewer informed individuals.

I take a much more complicated route to the same conclusion in "Network Games with Local Correlation and Clustering." Besides being sufficiently mathematically intractable to, hopefully, be published, the paper does show a few other things too. In particular, I look at how network clustering affects "public goods provision," which is the fancy term for what Jeff Ely calls subscribing to the newspaper. Lots of real social networks are highly clustered. This means that if I'm friends with Jack and Jill, there is a good chance that Jack and Jill are friends with each other. What I find in the paper is that clustering increases public goods provision. In other words, when people are members of tight knit communities, more people should subscribe to the newspaper (and volunteer, and pick up trash, and ...)
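For the curious, here is one common way to put a number on clustering: for a given person, the fraction of pairs of their friends who are also friends with each other (the local clustering coefficient). This is a sketch with a made-up four-person network, not the measure or the model from the paper.

```python
from itertools import combinations


def clustering_coefficient(friends, person):
    """Fraction of pairs of a person's friends who are friends with each other."""
    neighbors = friends[person]
    if len(neighbors) < 2:
        return 0.0  # no pairs of friends to check
    pairs = list(combinations(sorted(neighbors), 2))
    linked = sum(1 for a, b in pairs if b in friends[a])
    return linked / len(pairs)


# Hypothetical friendship network; every friendship is listed in both directions.
friends = {
    "me": {"jack", "jill", "joe"},
    "jack": {"me", "jill"},
    "jill": {"me", "jack"},
    "joe": {"me"},
}

for person in friends:
    print(person, round(clustering_coefficient(friends, person), 2))
```

In this example Jack and Jill know each other but neither knows Joe, so only one of my three pairs of friends is connected.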

It's pretty clear that the Internet, social media, etc. are increasing the number of contacts that we have, but an interesting question that I haven't seen any research on is: how are these technologies affecting clustering (if at all)?

"Predicting the Present" at the CIA

The CIA is using tools similar to those we teach in the Kellogg Social Dynamics and Networks course to "predict the present" according to an AP article (see also this NPR On the Media interview).

While accurately predicting the future is often impossible, it can be pretty challenging just to know what's happening right now. Predicting the present is the idea of using new tools to get a faster, better picture of what's happening in the present. For example, the US Bureau of Labor Statistics essentially gathers the pricing information that goes into the Consumer Price Index (CPI) by hand (no joke, read how they do it here). This means that the government's measure of CPI (and thus inflation) is always a month behind, which is not good for making policy in a world where decades old investment banks can collapse in a few days.

To speed the process up, researchers at MIT developed the Billion Prices Project, which as the name implies collects massive quantities of price data from across the Internet to get a more rapid estimate of CPI. The measure works, and is much more responsive than the government's measure. For example, in the wake of the Lehman collapse, the BPP detected deflationary movement almost immediately while it took more than a month for those changes to show up in the government's numbers.

A Gephi Visualization of Gephi on Twitter

This is a visualization of Twitter accounts that follow and are followed by @gephi that I made using ... Gephi. I collected the data using NodeXL. Two accounts are linked in the network if one follows the other on Twitter. Nodes are sized according to their degree. The modularity clustering algorithm finds 8 different groups among the accounts. The blue group in the upper left, where I live, contains most of the network science crowd: @duncanjwatts, @ladadimc, @barabasi, @davidlazer, etc. The green group in the lower right seems to be data/visualization folks. I filtered out all of the nodes with degree less than four; before filtering, there was a large contingent of accounts that followed @gephi but had no other connections in the network.
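I did all of this by pointing and clicking in Gephi, but the degree filter is easy to sketch in Python. The account names below are made up, and I'm assuming a single-pass filter, where degrees are counted once in the full network and low-degree nodes are dropped without recounting afterwards.

```python
from collections import defaultdict


def degrees(edges):
    """Number of distinct accounts each account is linked to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {node: len(linked) for node, linked in neighbors.items()}


def filter_by_degree(edges, min_degree=4):
    """Keep only links between accounts whose degree is at least min_degree."""
    deg = degrees(edges)
    keep = {node for node, k in deg.items() if k >= min_degree}
    return [(a, b) for a, b in edges if a in keep and b in keep]


# Hypothetical links: two accounts are linked if one follows the other.
edges = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("hub", "e"),
    ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
]

print(degrees(edges))
print(filter_by_degree(edges))
```

Account "e" is linked only to the hub, so it is the kind of node that disappears after filtering, along with "c" and "d", which have degree three.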

Why Google Ripples will be a lot less cool than it sounds.

Google+ now has a new feature, Ripples, that allows you to see a network visualization of the diffusion of a post (see the Gizmodo article here). The pictures are cool, but the original post has to be public, and then it has to be shared by one Google+ user to other Google+ users. But, the chances of interesting ripples happening very often are pretty slim; here's why.

Bakshy, Hofman, Mason, and Watts looked at exactly this kind of cascade on Twitter, which is a great platform for this kind of research for several reasons. First, everything is effectively public, so there are none of the privacy issues of Facebook, and we don't have to limit ourselves to looking at just the messages that people choose to make public like we do on Google+. Second, "retweeting" messages is an established part of Twitter culture, so we expect to find cascades. Finally, since tweets are limited to 140 characters, links are often shortened using services like bit.ly. This means that if I create a link to a New York Times article and you create a link to the same page independently, those links will be different, so the researchers can tell the difference between a cascade that my post creates and one that yours creates.

Some of the cascades that Bakshy et al. found are shown in this figure.

They looked at 74 million chains like these initiated by more than 1.6 million Twitter users during two months in 2009. A lot of interesting things came out of the study, but the most important one for Google Ripples is that 98 percent of the URLs were never reposted. That's not good for Ripples. The latest number puts the entire Google+ user population at only 43.6 million users, and since only a small fraction of these users' posts will be public posts, even if people share other people's posts on Google+ as frequently as they retweet links on Twitter (which is unlikely), we still can't expect to see many Ripples that look like anything but a lonely circle.
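To make "never reposted" concrete, here is a small sketch of how you might measure cascade sizes if you had a record of which post each reshare came from. The data layout and the post ids are hypothetical; this is not the procedure Bakshy et al. used.

```python
from collections import Counter


def cascade_sizes(posts):
    """Count the posts in each cascade, keyed by the original post's id.

    posts maps each post id to the id of the post it reshared,
    or to None if it is an original post.
    """
    def root(post_id):
        # Follow the chain of reshares back to the original post.
        while posts[post_id] is not None:
            post_id = posts[post_id]
        return post_id

    return Counter(root(post_id) for post_id in posts)


# Hypothetical data: four original posts, only one of which gets reshared.
posts = {
    "a": None, "b": None, "c": None, "d": None,
    "a1": "a", "a2": "a", "a3": "a1",
}

sizes = cascade_sizes(posts)
never_reposted = sum(1 for size in sizes.values() if size == 1) / len(sizes)
print(dict(sizes))
print("share of originals never reposted:", never_reposted)
```

A cascade of size one is the lonely circle: an original post that nobody passed along.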

Northwestern's Defeat to the Illini as Seen on Twitter

The title says it all.  Here's the link.

Marketing and Social Media

If you are looking for tips on social media and marketing I suggest checking out the Facebook page of Hunter & Bard.

Hunter & Bard is headed up by Shira Abel, whom I was fortunate to meet when she took my class in the Kellogg-Recanati International EMBA program in Tel Aviv. (As of today Shira has exactly 518.1111... times more followers on Twitter than I do.) The page is packed with social media and marketing information. (In fact, I have Shira to thank for sharing the Twitter Terrorists story with me.)

Twitter Terrorists: False information + positive feedbacks = real panic

Another example of how false information, amplified through positive feedbacks, can lead to real panic: in Veracruz, Mexico, two people posted messages on Twitter reporting kidnappings at a local school. The messages spread rapidly through social media, leading frightened parents to rush to try to save their children. The panic caused dozens of car accidents and jammed the city's emergency phone lines.

Amnesty International was quoted saying, "The lack of safety creates an atmosphere of mistrust in which rumours that circulate on social networks are part of people's efforts to protect themselves, since there is very little trustworthy information." As with many "tipping point" phenomena, before the spark that set off the visible cascade, there was most likely a "contextual tipping point" that made the resulting contagion possible. Governments or managers have to realize that the only way to reliably prevent these cascades is by changing the context, not by stamping out all of the sparks.