Scaling Social Science with Apache Hadoop
This post was contributed by researcher Scott Golder, who studies social networks at Cornell University. Scott was previously a research scientist at HP Labs and the MIT Media Laboratory.
The methods of social science are dear in time and money and getting dearer every day.
— George C. Homans, Social Behavior: Its Elementary Forms, 1974.
When Homans, one of my favorite 20th century social scientists, wrote the above, one reason the data needed to do social science was expensive was that collecting it didn’t scale very well. If conducting an interview or lab experiment takes an hour, two interviews or experiments take two hours. The amount of data you can collect this way grows linearly with the number of graduate students you can send into the field (or with the number of hours you can make them work!). But as our collective body of knowledge has accumulated, and the “low-hanging fruit” questions have been answered, the complexity of our questions is growing faster than our practical capacity to answer them.
Things are about to change. We’re reaching the end of what philosopher Thomas Kuhn might call “normal science” in the social sciences — a period of time when scholarly progress grows incrementally using widely-accepted methods. This doesn’t mean an end to interviews, surveys, or lab experiments as important social science methods. Though questions about interpersonal behavior and small groups are undoubtedly still interesting, what we really want to know — what we’ve always wanted to know — is how entire societies work. The most interesting findings are going to have to come some other way.
“Computational social science” represents a turn toward the use of large archives of naturalistically-created behavioral data. These data come from a variety of places, including popular social web services like Facebook and Twitter, consumer services like Amazon, weblog and email archives, mobile telephone networks, or even custom-built sensor networks. What these data have in common is that they grow as byproducts of people’s everyday lives. People email, shop and talk for their own reasons, without thinking about how the digital traces of their activity provide naturalistic data for social scientists.
That the data are created naturalistically is important both methodologically and theoretically. Though social scientists care what people think, it’s also important to observe what people do, especially if what they think they do turns out to be different from what they actually do. When responding to surveys or interviews, subjects might honestly mis-remember and mis-report the past. They might deliberately omit some things that embarrass them, or rationalize post-hoc and justify actions differently from how they reasoned about them at the time. Collecting data on actual behavior is seen by many as the gold standard of social science, and experimental methods have had, and continue to have, many successes across the social sciences, including findings on how people interpret probabilities in decision making, and how people develop beliefs about status hierarchies along racial, gender and other dimensions. But it was recognized long ago that findings within a lab might not generalize to the whole world. What we need to do now is measure the whole world in a controlled way. The web services named above do just that. Want to know how corporations really work? Look at their email. Want to know about racial preferences in dating? Look at online dating profiles (or even server logs).
I believe that Hadoop is going to play a large role in analyzing these data and therefore in generating social science advances very soon. In the infancy of the social web, even successful systems had only thousands or tens of thousands of users (in contrast with tens of millions today), and creating an archive of all of the system’s data was as simple as doing SELECT * on each table in a MySQL database. But in an ironic twist, this method’s undoing would be the success of the social web itself. In Homans’ day, the questions grew faster than the data; today the data is growing faster than we can store and process it.
Enter Hadoop. Last year, I decided to invest some time in learning to write my data analysis processing programs using MapReduce. Cornell is lucky enough to have a project called WebLab whose resources include a 50-plus node Hadoop cluster, and I am lucky enough to be allowed to use it. As soon as I ran some test cases on it — single-process implementations of computations that took 4.5 hours on my beefy workstation took 3 minutes when implemented in MapReduce — I was sold.
In social network analysis, the research area I work in, the main questions of interest concern how patterns of social relationships affect individuals’ behavior and create social structure at the macro, or societal, level. Network analysts work in many areas of sociological interest, such as markets, employment, individual well-being, opinion formation, and others. Often, the structural properties of these networks are important predictors of individual behavior, but the computations required to calculate these measures are prohibitive. Right now, for example, I’m struggling to work with a comparatively large dataset comprising about 8 million people.
I learned quickly that it’s not the size of the data that kills you, it’s the size of the metadata. Though it’s relatively easy to count the number of friends or neighbors everyone has, other calculations, such as the average number of “steps” between each pair of people, have much more demanding computational requirements. Algorithms that are O(n²) or bigger in their space or time requirements become prohibitive: a network with 8 million members has a whopping 64 trillion relations between (all pairs of) members. No individual workstation, no matter how fancy, is equal to such a task. With enough disk space and RAM, and some fancy programming tricks that repeatedly swap to/from disk only the data necessary for parts of computations, you might be able to process all that data, once, and it might take several weeks to do even that. Distributing the same computation over a large number of Hadoop nodes and finishing the process in minutes or hours means that it’s possible to iterate rapidly. Iterating rapidly means fixing bugs rapidly, and trying variations rapidly. I can process weighted and non-weighted versions of the same graph in quick succession, with only a small code change, for example.
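To make the contrast between the cheap and the expensive calculation concrete, here is a minimal single-machine sketch in Python. It is not the code I run on the cluster; the toy graph and the function names (`degree_counts`, `bfs_distances`, `average_path_length`) are made up for illustration. Counting friends touches each person once, while the average number of steps needs a search from every person to every other person.

```python
from collections import deque

def degree_counts(adj):
    """Cheap measure: number of neighbors per person (one pass over the data)."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def bfs_distances(adj, source):
    """Breadth-first search: number of steps from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def average_path_length(adj):
    """Expensive measure: mean steps over all reachable ordered pairs.

    Runs one search per person, so the work grows with the number of pairs.
    """
    total_steps = 0
    pair_count = 0
    for source in adj:
        for target, steps in bfs_distances(adj, source).items():
            if target != source:
                total_steps += steps
                pair_count += 1
    return total_steps / pair_count if pair_count else 0.0

# Hypothetical toy network: a chain of four people, a - b - c - d.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

degrees = degree_counts(toy)
avg_steps = average_path_length(toy)

# The pair count behind the figure quoted above: 8 million squared.
ordered_pairs = 8_000_000 ** 2
```

On four people the difference is invisible. At 8 million people, the degree count is still a single pass, while the pairwise measure has to account for tens of trillions of pairs.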
Learning to process data using MapReduce is a skill that scales. The benefits of MapReduce over conventional programming are, in my opinion, equal to (or greater than) the benefits of conventional programming over analyzing data by hand. It takes a sizable initial time investment to learn to think in this way, especially if you haven’t been exposed to functional programming before (most non-computer scientists haven’t). But after getting the basic idea (the application of a series of transformations and compressions to data), the usefulness of the skill continues to grow naturally. The data is going to keep getting larger and more detailed, as more people experience more of their social and economic lives online. But the size of Hadoop clusters is going to get larger as well, and often increasing the number of nodes a job is processed on is as simple as changing one parameter in a configuration file. Academics can request access to the National Science Foundation’s TeraGrid system, and academics as well as recreational or corporate data crunchers can use MapReduce with their own clusters or cloud-based services like Amazon’s Elastic MapReduce.
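For readers who haven’t seen the pattern, the “transformations and compressions” idea can be sketched in a few lines of plain Python. This is only an in-process imitation of the map, group, and reduce steps, not Hadoop’s actual API; the names `map_edge`, `reduce_counts`, and `run_mapreduce` and the sample friendship list are invented for the example. It counts how many friends each person has.

```python
from collections import defaultdict

def map_edge(edge):
    """Transformation: turn one friendship record into (person, 1) pairs."""
    person_a, person_b = edge
    yield person_a, 1
    yield person_b, 1

def reduce_counts(person, ones):
    """Compression: collapse all values for one person into a single count."""
    return person, sum(ones)

def run_mapreduce(records, mapper, reducer):
    """Imitate the framework: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical input: one undirected friendship per record.
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("cat", "dan")]

friend_counts = run_mapreduce(edges, map_edge, reduce_counts)
```

The mapper looks at one record at a time and the reducer looks at one key at a time, so neither needs to see the whole dataset. That independence is what lets a real cluster spread the same two functions across many nodes.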
Another of my favorite scholars, British sociologist Anthony Giddens, once remarked that because we live in a modern world in which people naturally reflect on their own behavior and the behavior of others, the professional sociologist is “at most one step ahead of the enlightened lay practitioner”. And that was before we lived in a world of gigantic datasets. Besides their use in web search at Yahoo and Google, Hadoop and MapReduce have been touted as having tremendous potential in the area of business intelligence. The Economist just recently focused on this very issue. Though corporations are generally not releasing their internally-generated data, governments and the media are starting to get into the act, with data.gov in the U.S. and Guardian Datastore in the U.K. User-contributed sites like Swivel and ManyEyes contain data sets of many different kinds, though of relatively small sizes. Clearly, many people are interested in questions of social scientific importance and the stories these data can tell us. I think that’s a really good thing, and I’m excited for the long-term prospects of both “professional” and “amateur” data analysis. In the same way that the DIY movement and publications like Make Magazine have inspired laypeople to become interested in some of the principles and practices of engineering, public datasets can perhaps inspire interest in the social sciences. Right now, the datasets are small and not particularly interoperable, but I have some confidence that will change over time. Imagine a world in which mashups aren’t just songs and videos, but terabytes of data, where the data input path specified in a Hadoop configuration file isn’t a local directory containing one’s own data, but rather a URI pointing to some stranger’s (or company’s or government’s) publicly-available archives.
There’s a long way to go. Business practices, technologies and tools, and social science training each have years of advances to make before such a reality can become possible.
Until now, scientific computing has largely been the domain of the natural sciences, fields like fluid dynamics, astrophysics and bioinformatics. The computational social science revolution that is just beginning is mostly attributed to the growth in data available from the sources I’ve mentioned and surely from scores more, and I agree; you can’t have data analysis without the data. Another important part of that story is computation on a cheap, pervasive, distributed cloud, and tools like Hadoop to process and analyze it all.
For an overview, see this article published in Science last year.
They might also provide insightful but non-intuitive ideas that open up whole new lines of inquiry in your research. So these methods have many positive and indispensable qualities, too.
You can start with these two, very different, papers: Email as Spectroscopy or Communication (and Coordination) in a Modern, Complex Organization.
An excellent choice is Cynthia Feliciano’s Gendered Racial Exclusion among White Internet Daters. The dating service OkCupid also has a blog which is suggestive and interesting, but not quite controlled enough to be persuasive as social science.
Anthony Giddens, The Consequences of Modernity, 1990.
“The Data Deluge”, The Economist, 10 February 2010.
See the WebUse project to see the ways in which internet users are and are not representative of larger populations.