Analyzing Culture with Google Books: Is It Social Science?
OPINION: Discovering fun facts by graphing terms found among the 5 million volumes of the Google Books project sure is amusing — but this pursuit dubbed ‘culturomics’ is not the same as being an historian.
For more stories about all things Google, see the links at the end of this article.
Earlier this year, a group of scientists — mostly in mathematics and evolutionary psychology — published an article in Science titled “Quantitative Analysis of Culture Using Millions of Digitized Books.” The authors’ technique, called “culturomics,” would, they said, “extend the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.” The authors employed a “corpus” of more than 5 million books — 500 billion words — that have been scanned by Google as part of the Google Books project. These books, the authors assert, represent about 4 percent of all the books ever published, and will allow the kind of statistically significant analysis common to many sciences.
This sounds impressive. The authors point out that 500 billion words is more than any human could reasonably read in a lifetime. Their main method of analysis is to count the number of times a particular word or phrase (referred to as an n-gram) occurs over time in this corpus. (You can try your own hand at n-grams with Google's Ngram Viewer.) Their full data set includes over 2 billion such “culturomic trajectories.” One of the examples the authors give is to trace the usage of the year “1951.” They note that “1951” was not discussed much before the actual year 1951, that it appeared a lot in 1951, and that its usage dropped off after 1951. They call this evidence of collective memory.
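For readers curious about what such a count involves, here is a minimal sketch in Python. It is not the authors' code. The toy corpus, the function names, and the choice to divide each year's count by that year's total number of n-grams are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return each run of n consecutive tokens as one space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_frequency(corpus, phrase):
    """Map each year to the share of its n-grams that match `phrase`.

    `corpus` is a list of (year, text) pairs. Tokenizing on whitespace is a
    simplification; a real pipeline would handle punctuation and OCR errors.
    """
    target = phrase.lower()
    n = len(target.split())
    hits = Counter()    # occurrences of the phrase, per year
    totals = Counter()  # all n-grams of the same length, per year
    for year, text in corpus:
        for gram in ngrams(text.lower().split(), n):
            totals[year] += 1
            if gram == target:
                hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical toy corpus, invented for illustration only.
toy_corpus = [
    (1950, "plans for 1951 were modest"),
    (1951, "in 1951 the census of 1951 began"),
    (1952, "the events of 1951 faded quickly"),
]

trajectory = yearly_frequency(toy_corpus, "1951")
peak_year = max(trajectory, key=trajectory.get)
```

On this invented corpus the trajectory for “1951” peaks in 1951, which is the shape of the result the authors describe. Whether that shape tells us anything about collective memory is a separate question.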
I initially reacted to this article with skepticism. As I read more — including a recent piece (one might call it a puff piece) in Nature on one of the co-authors, Erez Lieberman Aiden, in which he was dubbed “the prophet of digital humanities” — my skepticism became stronger. I think culturomics is a nifty tool, but we need to be cautious and critical about this kind of digital data and about claims that culturomics could make “much of what [historians] do trivially easy.” Historians do much more than follow trajectories, so I am not so sure that culturomics will lead to a new way of doing historical work. It’s not the game-changer it’s been claimed to be.
I would not call myself a Luddite — I use digital resources all the time, in my research and my teaching. I have hundreds of PDFs of books I have downloaded from a variety of online sources — Early English Books Online, Eighteenth Century Collections Online, Gallica (the digital service of the French National Library), and yes, Google Books — that I use in my research.
But when I read the Science article, I was immediately struck by what seems to me to be a fundamental flaw in its methodology: its reliance on Google Books for its sample. Google Books has focused on digitizing academic libraries. I would argue that books found in academic libraries are not necessarily representative of cultural trends across society. As any historian knows, every scholarly library is different and every library has its biases. And surely I am not the only historian who has noticed that the digitizing policy of Google Books does not, and perhaps cannot, result in anything like a uniform, or a uniformly random, sample of all books in a given period. Google’s ability to digitize books is dependent on a number of factors: the willingness of libraries to open their collections for digitization; the condition of the books being digitized; copyright regulations, which allow only “snippets” of many 20th-century books; and the quality of the digitization process itself.
The authors further narrow their range by admitting only publications for which they have “metadata” — that is, author, title, year, immediately confining the range of publications to books, and not periodicals or other more ephemeral literature — and to the period after 1800. The article itself gives no clue as to how the authors obtained this metadata. But surely it skews their data set even more toward a certain kind of book, while treating books as interchangeable pieces of data. In this universe, one book is much like another.
The authors equate size with representativeness and quantity of data with rigor. I am not sure that is true. I do not deny that some of their results are interesting, particularly the tracing of linguistic and grammatical changes over time, which is like watching a speeded-up newsreel. But some of the results are simply banal. The year “1951” appears most often in 1951. The word “slavery” appears more often during the U.S. Civil War. The word “influenza” appears more often during pandemics. Duh. Are these even historical questions?
Perhaps most disturbing to me are the underlying assumptions of such work about the humanities and about what scholars in the humanities do. One assumption is that the humanities need to be more like science and that we need to be more like scientists: that quantitative knowledge is the only legitimate knowledge and that humanities scholars are therefore not “rigorous.” For well over a century, historians and their critics have debated whether their discipline is a science or an art. When the journal Past and Present was founded by a group of Marxist historians in the early 1950s, it was billed as “a journal of scientific history.” By the mid-1960s this had changed to simply “a journal of historical studies.” On the one hand, there are plenty of examples of humanities scholars who have been using sophisticated digital tools and quantification for years. The Cambridge population survey, with birth and death information gleaned from thousands of parish record books all over England, revolutionized social history when it began in the 1960s. When I was in graduate school in the 1980s, the SPSS statistical package could be mastered as an alternative to a second language. As cultural history became more prominent, quantitative history became less fashionable, but it never disappeared.
On the other hand, as these examples indicate, there is not just one kind of historical or, more broadly, humanities scholarship as the Science authors seem to think. Not all of us trace ideas over time. Some of us look at the people who had those ideas and the places they lived and worked, and the people they knew, and how they lived. Not all of this can be found in books but must be traced across a variety of published, manuscript and material media. Although the culturomics people are confident that they can apply their methods to manuscripts and maps, I’m not going to wait for that possibility.
Much like the digital catalog that replaced the long-lost card catalog, such a sweeping tool leaves out the chance juxtapositions and serendipities that often tell us much more than the texts themselves. I spent many years off and on at the British Library reading advertisements in the microfilmed Burney collection of 18th-century newspapers. Now these have been digitized, and I can search for “anatomy lectures” and come up with dozens of hits that took me many eye-straining hours to find. But the search cannot tell me that on the previous page, or in the previous issue, there was an ad for a patent medicine, or a live animal combat, or another fascinating bit of 18th-century London life that lends meaning and context to the bare entry.
It is revealing of another kind of bias that the long list of authors of the Science article includes no historians, in fact no one from the humanities (Louis Menand also pointed this out in an interview in The New York Times). To be fair, “R. Darnton” and “C. Rosenberg” (presumably the Harvard historians Robert Darnton and Charles Rosenberg) are thanked at the end. The Nature article goes out of its way to point out that Erez Lieberman Aiden studied history and philosophy and even creative writing, which is something like saying I took physics in college, and therefore I can publish on quantum mechanics in Nature. Both articles show a nearly complete lack of understanding of what historians and other humanities scholars actually do.
When Lieberman Aiden and his co-authors presented their findings at the meeting of the American Historical Association in January, AHA President Tony Grafton expressed cautious praise of this new tool. In the Nature article he sounds decidedly more anxious: “You can’t help but worry that this is going to sweep the deck of all money for the humanities everywhere else.”
Indeed.
***
More Stories About Google
How Google Disrespected Mexican History
Dear Google: Do Not Track Me
The Government, Google and Lady Gaga
Google Street View Ruffles European Feathers