Research

Recruiting for Postdoctoral Scholar

UPDATE: This position is now closed.

I recently received an NIH R01 grant for $1.74 million to fund a project examining how team communication networks impact collaboration success. Part of this funding will support a postdoctoral scholar to work with me at UCLA on building a computational model of team collaboration. (For related models see Hong and Page, 2001 and 2004; Lazer and Friedman, 2007.) See below for the complete position description and application instructions (downloadable here).

The Department of Communication Studies at UCLA is recruiting for a Postdoctoral Scholar to help develop a computational model of team networks and collaboration.

The successful candidate will collaborate with Professor PJ Lamberson on an NIH funded project examining the characteristics of successful teams, the leading indicators of impending team failure, and potential policies for increasing the productivity of team science and problem solving. The project will employ a computational agent-based modeling approach. In addition to collaborating with Professor Lamberson, the postdoc will also have the opportunity to work closely with other members of the project team including Nosh Contractor, Leslie DeChurch, and Brian Uzzi from Northwestern University’s School of Communication, Kellogg School of Management, and the Northwestern Institute on Complex Systems (NICO). A wide variety of disciplinary backgrounds will be considered. Key qualifications are experience with computational modeling, complex systems, and network analysis.

To apply, please send:

1. A cover letter explaining your interests and qualifications for the position,

2. A CV, and

3. At least two letters of recommendation

to pj@social-dynamics.org.

Applications will be considered as they are received, and the position will remain open until filled.

The University of California is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability, age or protected veteran status. For the complete University of California nondiscrimination and affirmative action policy see:

http://policy.ucop.edu/doc/4000376/NondiscrimAffirmAct

Big Data and the Wisdom of Crowds are not the same

I was surprised this week to find an article on Big Data in the New York Times Men's Fashion Edition of the Style Magazine. Finally, I thought, something in the Fashion issue that I can relate to. Unfortunately, the article by Andrew Ross Sorkin (author of Too Big To Fail) makes one crucial mistake: it conflates two distinct concepts that are both near and dear to my research, Big Data and the Wisdom of Crowds, and that confusion leads to a completely wrong conclusion.

Big Data is what it sounds like — using very large datasets for ... well for whatever you want. How big is Big depends on what you're doing.  At a recent workshop on Big Data at Northwestern University, Luís Amaral defined Big Data to be basically any data that is too big for you to handle using whatever methods are business as usual for you. So, if you're used to dealing with data in Excel on a laptop, then data that needs a small server and some more sophisticated analytics software is Big for you. If you're used to dealing with data on a server, then your Big might be data that needs a room full of servers.

The Wisdom of Crowds is the idea that, as collectives, groups of people can make more accurate forecasts or come up with better solutions to problems than the individuals in them could on their own. A different recent New York Times article has some great examples of the Wisdom of Crowds. The article talks about how the Navy has used groups to help make forecasts, and in particular forecasts for the locations of lost items like "sunken ships, spent warheads and downed pilots in vast, uncharted waters." The article tells one incredible story of how they used this idea to locate a missing submarine, the Scorpion:

"... forecasters draw on expertise from diverse but relevant areas — in the case of finding a submarine, say, submarine command, ocean salvage, and oceanography experts, as well as physicists and engineers. Each would make an educated guess as to where the ship is ... This is how Dr. Craven located the Scorpion.

“I knew these guys and I gave probability scores to each scenario they came up with,” Dr. Craven said. The men bet bottles of Chivas Regal to keep matters interesting, and after some statistical analysis, Dr. Craven zeroed in on a point about 400 miles from the Azores, near the Sargasso Sea, according to a detailed account in “Blind Man’s Bluff,” by Christopher Drew and Sherry Sontag. The sub was found about 200 yards away."

This is a perfect example of the Wisdom of Crowds: by pooling the forecasts of a diverse group, they came up with an accurate collective forecast.

So, how do Big Data and the Wisdom of Crowds get mixed up? The mixup comes from the fact that a lot of Big Data is data on the behavior of crowds. The central example in Sorkin's article is data from Twitter, and in particular data that showed a lot of people on Twitter were very unhappy with antigay comments made by Phil Robertson, the star of A&E's Duck Dynasty. The short version of the story is that A&E initially terminated Robertson in response to the Twitter data, but Sorkin argues this was a business mistake because Twitter users are "not exactly regular watchers of the camo-wearing Louisiana clan whose members openly celebrate being 'rednecks'." He also cites evidence that data from Twitter does not provide accurate election predictions for essentially the same reason — the people that are tweeting are not a representative sample of the people that are voting.

All of this is correct. Using a big dataset does not mean that you don't have to worry about having a biased sample. No matter how big your dataset, a biased sample can lead to incorrect conclusions. A classic example is The Literary Digest's 1936 prediction that Alf Landon would be the overwhelming winner of that year's presidential election. In fact, Franklin Roosevelt carried 46 of the 48 states. The prediction was based on a huge poll with 2.4 million respondents, but the sample drew primarily on Literary Digest subscribers and automobile and telephone owners. This sample tended to be more affluent than the average voter, and thus favored Landon's less progressive policies.

So, Sorkin is on the right track to write a great article on how sample bias still matters even when you have Big Data. This is a really important point that a lot of people don't appreciate. But unfortunately the article veers off that track when it starts talking about the Wisdom of Crowds. The Wisdom of Crowds is not about combining data on large groups, but about combining the predictions, forecasts, or ideas of groups (they don't even have to be that large). If you want to use the Wisdom of Crowds to predict an election winner, you don't collect data on who people are tweeting about; you ask them who they think is going to win. If you want to use the Wisdom of Crowds to decide whether or not you should fire Phil Robertson, you ask the crowd, "Do you think A&E will be more profitable if they fire Phil Robertson or not?" As angry as all of those tweets were, many of those angry voices on Twitter would probably concede that Robertson's remarks wouldn't damage the show's standing with its core audience.

The scientific evidence shows that using crowds is a pretty good way to make a prediction, and it often outperforms forecasts based on experts or Big Data. For example, looking at presidential elections from 1988 to 2004, relatively small Wisdom of Crowds forecasts outperformed the massive Gallup Poll by 0.3 percentage points (Wolfers and Zitzewitz, 2006). This isn't a huge margin, but keep in mind that the Gallup presidential polls are among the most expensive, sophisticated polling operations in history, so the fact that the crowd forecasts are even in the ballpark, let alone better, is pretty significant.

The reason the Wisdom of Crowds works is that when some people forecast too high and others forecast too low, their errors cancel out and bring the average closer to the truth. The accuracy of a crowd forecast depends both on the accuracy of the individuals in the crowd and on their diversity — how likely their errors are to be in opposite directions. The great thing about it is that you can make up for low accuracy with high diversity, so even crowds whose individual members are not that great on their own can make pretty good predictions as collectives. In fact, as long as some of the individual predictions fall on both sides of the true answer, the crowd forecast will always be closer to the truth than the average individual in the crowd. It's a mathematical fact that is true 100% of the time. Sorkin concludes his article, based on the examples of inaccurate predictions from Big Data with biased samples, by writing, "A crowd may be wise, but ultimately, the crowd is no wiser than the individuals in it." But this is exactly backwards. A more accurate statement would be, "A crowd may or may not be wise, but ultimately, it's always at least as wise as the individuals in it. Most of the time it's wiser."
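
One way to see why this is guaranteed is the squared-error version of the claim, sometimes called the diversity prediction theorem: the crowd's squared error equals the average individual squared error minus the variance (diversity) of the predictions. Here is a minimal numerical check, with made-up forecasts purely for illustration:

```python
import numpy as np

truth = 42.0
forecasts = np.array([30.0, 38.0, 45.0, 51.0, 60.0])  # individual predictions (made up)

crowd = forecasts.mean()                          # the crowd's forecast
crowd_sq_error = (crowd - truth) ** 2             # squared error of the crowd
avg_sq_error = np.mean((forecasts - truth) ** 2)  # average individual squared error
diversity = np.mean((forecasts - crowd) ** 2)     # how spread out the predictions are

# Crowd error = average individual error - diversity,
# so the crowd can never do worse than the average individual.
assert np.isclose(crowd_sq_error, avg_sq_error - diversity)
print(crowd_sq_error, avg_sq_error, diversity)    # 7.84, 114.8, 106.96
```

Because the diversity term can never be negative, the crowd only ties the average individual when everyone predicts exactly the same thing.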

A Scientist's Take on the Princeton Facebook Paper

Spechler and Cannarella's paper predicting the death of Facebook has been taking a lot of flak. While I do think there are some issues applying their model to Facebook and MySpace, they're not the ones that most people are citing.

The most common complaint about the Princeton Facebook paper that I've seen is that Facebook is not a disease. Facebook may not be a disease, but that doesn't mean a model that describes how diseases spread isn't a good model for how Facebook spreads. Models based on the disease spread analogy have been used for decades in marketing. The famous "Bass Model" is just a relabeled disease model. Frank Bass's original paper has been cited thousands of times and was named one of the ten most influential papers in Management Science. While it's received its fair share of criticism, the entirety of The Tipping Point is based on the disease spread analogy. Gladwell even writes, "... ideas and behavior and messages and products sometimes behave just like outbreaks of infectious disease."

Interestingly, one of the major points of Spechler and Cannarella's paper is that online social networks do NOT spread just like a disease; that's why they had to modify the original SIR disease model in the first place. (See an explanation here.)

But the critics have missed this point and are fixated on particulars of the disease analogy. For example, Lance Ulanoff at Mashable (who has one of the more evenhanded critiques) asks, "How can you recover from a disease you never had?" He's referring to the fact that in Spechler and Cannarella's model, some people start off in the Recovered population before they've ever been infected. These are people who have never used Facebook and never will. It is a bit confusing that they're referred to as "recovered" in the paper, but if we just called them "people who aren't using Facebook and never will," that would solve the issue. Ulanoff has the same sort of quibble with the term "recovery," writing, "The impulse to leave a social network probably does spread like a virus. But I wouldn't call it 'recovery.' It's leaving that's the infection." Ok, fine, call it leaving; that doesn't change the model's predictions. Confusing terminology doesn't mean the model is wrong.

All of this brings up another interesting point: how could we test whether the model is right? First off, this is a flawed question. To quote the statistician George E. P. Box, "... all models are wrong, but some are useful." Models, by definition, are simplified representations of the real world. In the process of simplification we leave out things that matter, but we try to make sure that we leave the most important stuff in, so that the model is still useful. Maps are a good analogy. Maps are simplified representations of geography. No map completely reproduces the land it represents, and different maps focus on different features. Topographic maps show elevation changes and road maps show highways. One kind is good for hiking the Appalachian Trail, another is good for navigating from New York City to Boston. Models are the same — they leave out some details and focus on others so that we can have a useful understanding of the phenomenon in question. The SIR model and Spechler and Cannarella's extension leave out all sorts of details of disease spread and the spread of social networks, but that doesn't mean they're not useful or that they can't make accurate predictions.


Spechler and Cannarella fit their model to data on MySpace users (more specifically, Google searches for MySpace), and the model fits pretty well. But this is a low bar to pass. It just means that by changing the model parameters, we can make the adoption curve in the model match the shape of the adoption curve in the data. Since both go up and then down, and there are enough model parameters that we can tune the speed of the rise and fall fairly precisely, it's not surprising that there are parameter values for which the two curves match pretty well.

There are two better ways the model could be tested. The first is easier, but it only tests the predictive power of the model, not how well it actually matches reality. For this test, Spechler and Cannarella could fit the model to the first few years of MySpace data, say from 2004 to 2007, and see how well it predicts MySpace's subsequent decline.
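
Here is a rough sketch of what that first test could look like in code. The "data" below is a made-up stand-in for the MySpace search series, and the ODE is my reading of the modified SIR model (an adoption parameter beta and an abandonment parameter gamma, with abandonment driven by the recovered population); the point is just the workflow: fit on the early window only, then compare the forecast with the later decline.

```python
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import curve_fit

def modified_sir(y, t, beta, gamma):
    S, I, R = y
    joining = beta * S * I    # adoption spreads through contact
    leaving = gamma * I * R   # abandonment driven by those who already left
    return [-joining, joining - leaving, leaving]

def active_users(t, beta, gamma, I0, R0):
    y0 = [1.0 - I0 - R0, I0, R0]                  # population fractions
    return odeint(modified_sir, y0, t, args=(beta, gamma))[:, 1]

t = np.linspace(0, 10, 120)                       # years since launch
rng = np.random.default_rng(0)
observed = active_users(t, 1.8, 0.9, 0.01, 0.05)  # stand-in "search volume"
observed = observed + rng.normal(0, 0.005, t.size)

early = t <= 3.5                                  # the "2004-2007" window
params, _ = curve_fit(active_users, t[early], observed[early],
                      p0=[1.0, 0.5, 0.02, 0.05], bounds=(0, np.inf))
forecast = active_users(t, *params)               # compare with the later decline
```

If the forecast tracks the out-of-sample decline, that is evidence the model has some predictive power, even though it still wouldn't tell us the mechanism is right.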

The second test is a higher bar to clear, but provides real validation of the model. The model has several parameters — most importantly there is an "infectivity" parameter (β in the paper) and a recovery parameter (γ). These parameters could be estimated directly by measuring how often people contact each other with messages about these social networks and how likely it is for any given message to result in someone either adopting or disadopting use of the network. For diseases, this is what epidemiologists do. They measure how infectious a disease is and how long it takes for someone to recover, on average. Put these two parameters together with how often people come into contact (where the definition of "contact" depends on the disease — what counts as a contact for the flu is different from HIV, for example), and you can predict how quickly a disease is likely to spread or die out. (Kate Winslet explains it all in this clip from Contagion.) So, you could estimate these parameters for Facebook and MySpace at the individual level, and then plug those parameters into the model and see if the resulting curves match the real aggregate adoption curves.

Collecting data on the individual model parameters is tough. Even for diseases, which are much simpler than social contagions, it takes lab experiments and lots of observation to estimate these parameters. But even if we knew the parameters, chances are the model wouldn't fit very well. There are a lot of things left out of this model (most notably, in my opinion, competition from rival networks).

Spechler and Cannarella's model is wrong, but not for the reasons most critics are giving. Is it useful? I think so, but not for predicting when Facebook will disappear. Instead it might better capture the end of the latest fashion trend or Justin Bieber fever. 


Joshua Spechler and John Cannarella's Facebook is Dying Paper

This morning my email is blowing up with links to articles describing research by Joshua Spechler and John Cannarella, two Princeton PhD students, that predicts Facebook will lose 80% of its user base between 2015 and 2017. Are they right?

The paper is getting plenty of criticism, but as far as I can tell most of the critics haven't read or didn't understand the math in the paper. Let's take a closer look. Spechler and Cannarella's starting point is a basic model of disease spread called the SIR model. The SIR model (and its marketing variant, the Bass model) has been applied to study the spread of innovations for decades. Without calling it by its name, I discussed applying the SIR model to the spread of memes online in the previous post on "What it Takes to Go Viral".

The SIR model is pretty simple. Imagine everyone in the world is in one of three states: Susceptible, Infected, or Recovered. Every time a Susceptible person bumps into an Infected person, there is a chance they become Infected too. Once a person is Infected, they stay Infected for a while, but eventually they get better and become Recovered. The whole model is summed up by this "stock and flow" diagram.

[Figure: "stock and flow" diagram of the SIR model]
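
For concreteness, here is a minimal sketch of those dynamics in code, assuming the standard textbook SIR equations for a well-mixed population (the parameter values are arbitrary):

```python
import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma):
    S, I, R = y
    new_infections = beta * S * I   # Susceptible -> Infected flow
    recoveries = gamma * I          # Infected -> Recovered flow
    return [-new_infections, new_infections - recoveries, recoveries]

t = np.linspace(0, 30, 301)
S, I, R = odeint(sir, [0.99, 0.01, 0.0], t, args=(1.5, 0.25)).T
# I(t) rises, peaks, and then decays as the pool of Susceptibles is used up.
```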


Spechler and Cannarella update this model by making the recovery rate proportional to the number of recovered individuals. In other words, as more people "recover," the rate of recovery increases. In terms of Facebook, this would be interpreted as an increasing social pressure to leave Facebook as more people leave Facebook. In our diagram, this amounts to adding another feedback loop — the "abandonment" feedback loop in red below:

[Figure: the SIR diagram with the "abandonment" feedback loop added in red]


The effect of adding this loop is that recovery is slower in the beginning, because few people have recovered so there isn't much social pressure to recover, but it accelerates as the recovered population grows. For Facebook, this means that once people start leaving, they'll leave in droves. When Spechler and Cannarella fit this model to the data, the best fit predicts that this mass exodus from Facebook will occur between 2015 and 2017.
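
Here is a sketch of that modified model, with the caveat that the exact functional form (abandonment proportional to the product of the Infected and Recovered populations) is my reading of the paper rather than something quoted from it:

```python
import numpy as np
from scipy.integrate import odeint

def sir_with_abandonment(y, t, beta, gamma):
    S, I, R = y
    joining = beta * S * I   # adoption spreads through contact, as before
    leaving = gamma * I * R  # the red "abandonment" feedback loop
    return [-joining, joining - leaving, leaving]

t = np.linspace(0, 30, 301)
# Seed a small Recovered population (people who will never join); with
# R = 0 at the start, no one would ever feel pressure to leave.
S, I, R = odeint(sir_with_abandonment, [0.94, 0.01, 0.05], t, args=(1.5, 1.0)).T
# I(t) rises and plateaus, then collapses as R grows: the predicted mass exodus.
```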

To test their model they fit it to data on MySpace (they use Google Search data, which is a cool idea) and find that it fits pretty well. But, here's where we need to start being skeptical. First, just because the model fits the data well doesn't mean that the model captures what's really happening. It just means that you can manipulate the parameters of the model to produce a curve that goes up and down with a shape similar to the up and down curve that describes the users of MySpace over time. This isn't too surprising.

More problematic is that the model doesn't account for what is most likely the biggest single reason that people left MySpace — Facebook. In this model, the reason people leave MySpace is that everyone else is leaving MySpace — MySpace becomes uncool and there is a social pressure to not be on MySpace. But in reality, people probably didn't feel pressure to not be on MySpace, they left MySpace because they felt pressure to be on Facebook because that's where everyone else was.

I think this is an interesting model, but it's probably better suited to other phenomena. When I was in junior high, it was cool to "tight roll" your jeans, as demonstrated by these ladies.

[Photo: tight-rolled jeans]

By the time I was in high school, no one would be caught dead tight rolling their jeans. This is the kind of dynamic that Spechler and Cannarella's model captures.

It's quite possible that Facebook will pass away, but probably only if something new comes along to displace it, not because people are embarrassed if someone finds out they still have an account.


Why Some Stories Go Viral (Maybe)

I read a(nother) article on Fast Company today about why some stories "go viral." (Mathematically speaking, why some things go viral and others don't boils down to a simple equation.)
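
For readers who haven't seen that earlier post, the equation in question is presumably the standard epidemic threshold: content keeps spreading only if each person who shares it passes it along, on average, to more than one new person. In the SIR notation used in the posts above (transmission rate β, recovery rate γ), that condition is R0 = β/γ > 1; anything below that threshold fizzles out.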

The article cites research by Jonah Berger and Katherine Milkman that finds articles with more emotional content, especially positive emotional content, are more likely to spread. A quick read of the article seems to promise an easy path to getting your own content on your blog, YouTube, or Twitter to take off. For example, the article cites Gawker editor Neetzan Zimmerman's success, pointing out his posts generate about 30 million views per month — the kind of statistics that get marketers salivating. The scientific research by Berger and Milkman is interesting and well done, but we have to be careful about how far we take the conclusions.

There are two interrelated issues. The first has to do with the "base rate." Part of Berger and Milkman's paper looks at what factors make articles on the New York Times online more likely to wind up on the "most emailed" list. They find, for example, that "a one standard deviation increase in the amount of anger an article evokes increases the odds that it will make the most e-mailed list by 34%." In this case, the base rate is the percentage of articles overall that make the most emailed list. When we hear that writing an especially angry article makes it 34% more likely to get on the most emailed list, it sounds like angry articles have a really good chance of being shared, but this isn't necessarily the case. What we know is that the probability of making the most emailed list given that the article is especially angry is about 1.34 times the base rate — but if the base rate is really low, 1.34 times it will be small too. Suppose, for example, that only 1 out of every 1,000 articles makes the most emailed list; then the result says that about 1.34 out of every thousand especially angry articles make the list. 1.34 out of a thousand doesn't sound nearly as impressive as "34% more likely."
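
To make the base-rate point concrete, here is a toy calculation with made-up numbers. Berger and Milkman report the effect on the odds of making the list; converting the odds back to a probability shows how small the absolute chance stays:

```python
base_rate = 1 / 1000                      # suppose 1 in 1,000 articles makes the list
base_odds = base_rate / (1 - base_rate)
angry_odds = 1.34 * base_odds             # "increases the odds ... by 34%"
angry_rate = angry_odds / (1 + angry_odds)
print(f"{angry_rate:.4%} of especially angry articles make the list")  # about 0.134%
```

At such a low base rate the odds and the probability are nearly identical, which is why treating "34% higher odds" as roughly 1.34 times the base rate is a safe approximation here.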

The second issue has to do with the overall predictability of making the most emailed list. The model that shows the 34% boost for angry content has an R-squared of .28. This model has more than 20 variables including things like article word count, topic, and where the article appeared on the webpage. But even knowing all of these variables, we still can't accurately predict if an article will make the most emailed list or not. All we know is that on average articles with some features are more likely to make the list than articles with other features. But for any particular article, we really can't do a very good job of predicting what's going to happen.

To get a better understanding of this idea, here's another example. In Ohio, 37% of registered voters are registered as Republicans and 36% are registered as Democrats. In Missouri, 39% are registered as Republicans and 37% are registered as Democrats. On average, registered voters in Missouri are more likely to be Republican than registered voters in Ohio, but just because someone is from Missouri doesn't mean we can confidently say they're a Republican. If we only looked at people from Ohio and Missouri, knowing which state a person is from wouldn't be a very good predictor of their party affiliation.

Obesity Epidemic?

Today on Slate there is a nice little GIF (that originally appeared on The Atlantic) showing how obesity rates have changed over time by state. Slate seems to suggest that the geographic progression of obesity rates might indicate some sort of social contagion. But, as many others (and here) before me have pointed out, we have to be very careful when trying to draw inferences about social contagion. If we take a look at a map of household income by state, we see that there is a lot of overlap between the poorest states and those with the highest obesity rates.

[Map: household income by state]

There are lots of potential causal connections here. For example, income might affect the types of stores and restaurants available, which in turn affect obesity rates. For a more careful look at some data on the social contagion of obesity, have a look at our paper that examines obesity rates, screen time, and social networks in adolescents.


As a side note, it's interesting to compare the map of the "obesity epidemic" to a map of something we know spreads through person to person contagion, like the swine flu (image from the New York Times).

[Map: H1N1 swine flu cases (image from the New York Times)]

Unlike the obesity epidemic, swine flu jumps all over the place, which obviously has to do with air travel.

New paper on social contagion of obesity

Along with a team of researchers led by epidemiologist David Shoham from Loyola University, I recently published a paper in PLoS ONE examining the social contagion of obesity. As many of you know, this is a hotly debated topic of research that was kicked off by the work of James Fowler and Nicholas Christakis published in the New England Journal of Medicine. (See this post for my two cents on the debate.) The central criticism of this research surrounds the issue of separating friendship selection from influence, which in some sense was laid to rest by Cosma Shalizi and Andrew Thomas.

One alternative approach is to use a "generative model," which is exactly what my coauthors and I do. Specifically, we use the SIENA program developed by Tom Snijders and colleagues. Essentially, this model assumes that people make choices about their friendships and behavior just like economists and marketers assume people make choices about where to live or what car to buy.
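
As a cartoon of that choice mechanism (not the actual SIENA machinery, which lives in the RSiena package), each actor evaluates the candidate changes available to them and picks one with probabilities given by a multinomial logit over scores for how attractive each resulting state is:

```python
import numpy as np

def choice_probabilities(scores):
    """Multinomial logit: higher-scoring options are chosen more often."""
    weights = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return weights / weights.sum()

# Made-up scores for three candidate changes (e.g., add a tie to a similar-BMI
# classmate, drop an existing tie, or change nothing).
scores = np.array([1.2, 0.3, 0.8])
print(choice_probabilities(scores))  # roughly [0.48, 0.20, 0.32]
```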

In our paper, we apply the model to data from two high schools in the Add Health study. We use the model to understand social influences on body size, physical activity, and "screen time" (time spent watching TV, playing video games, or on the computer). In short, here's what we find:

  • In both schools, students are more likely to select friends who have a similar BMI (body mass index); that is, there is homophily on BMI.
  • In both schools, there is evidence that students are influenced by their friends' BMI.
  • There is no evidence for homophily on screen time in either school, and there is evidence that students are influenced by their friends' screen time in only one of the two schools.
  • In one of the two schools there was evidence for homophily on playing sports, but in both schools there was evidence that students influenced their friends when it comes to playing sports.

Satellite Symposium on Economics in Networks at NetSci 2012

Anyone interested in applying tools from economics to studying networks or tools from network science to studying economics is invited to the satellite symposium on Economics in Networks to be held in conjunction with NetSci 2012 at Northwestern University in Evanston on Tuesday, June 19. We have a great line-up of speakers and registration is free.

Social Dynamics Videos

While I've been teaching Social Dynamics and Networks at Kellogg, I've amassed a collection of links to interesting videos on social dynamics. Here they are:

Duncan Watts TEDx talk on "The Myth of Common Sense"

Nicholas Christakis TED talk on "The hidden influence of social networks"; TED talk on "How social networks predict epidemics."

James Fowler talking about social influence on the Colbert Report.

Sinan Aral TEDx talk on "Social contagion"; at PopTech 2010 on "Social contagion"; at Nextwork on "Social contagion"; at the International Conference on Weblogs and Social Media on "Content and causality in social networks."

Scott E. Page on "Leveraging Diversity", and at TEDxUofM on "Putting Milk Crates on the Internet."

Eli Pariser TED talk on "Beware online 'filter bubbles'"

Freakonomics podcast on "The Folly of Prediction"

Damon Centola on "Network Contagion."

Jure Leskovec on "The Web as a Laboratory for Studying Humanity"

There are several good videos of talks from the Web Science Meets Network Science conference at Northwestern: Duncan Watts, Albert-Laszlo Barabasi, Jure Leskovec, and Sinan Aral.

The "Did You Know?" series of videos has some incredible information about, well, information. More info here.

Free online complex systems course by Scott Page!

Yesterday I talked with Scott E. Page from the University of Michigan and Santa Fe Institute and he pointed me to an amazing opportunity. He will be teaching a free online course on complex systems and modeling. The course is called Model Thinking and will start in mid to late January. Scott has a short video introduction to the course on the site.

This seems almost too good to be true. Scott is a fantastic teacher — I have been lucky enough to co-teach courses with him at the University of Michigan and in the ICPSR summer program. He's the author of The Difference, Diversity and Complexity, and Complex Adaptive Systems: An Introduction to Computational Models of Social Life. He's a member of the American Academy of Arts and Sciences. The list goes on and on. He is a super clear thinker and has an infectious enthusiasm for modeling and complex systems.

I would recommend this course to anyone who isn't on the verge of finding a cure for cancer or solving world hunger, because otherwise whatever you're doing probably won't be more impactful than taking this course.