New Research Tool: Culturomics

I just heard about a digital resource that went online yesterday. Called Culturomics, this tool allows a researcher to comb through the last two hundred years of books and chart the frequency with which specific words occur. Here’s a more detailed description of the database, as well as some words of caution about its limitations, from the “Interpretation” page.

The Google Labs N-gram Viewer is the first tool of its kind, capable of precisely and rapidly quantifying cultural trends based on massive quantities of data. It is a gateway to culturomics! The browser is designed to enable you to examine the frequency of words (banana) or phrases (‘United States of America’) in books over time. You’ll be searching through over 5.2 million books: ~4% of all books ever published!

There are lots of different things you can check, like your favorite word (Supercalifragilisticexpialidocious) or person (President Taft; Chief Justice Taft) or part of the holiday (Christmas Tree).

It can be fun to compare things, too; whether it’s people (Galileo, Darwin, Freud, Einstein), pieces of music (Beethoven’s First, Beethoven’s Second, Beethoven’s Third, Beethoven’s Fourth, Beethoven’s Fifth, Beethoven’s Sixth, Beethoven’s Seventh, Beethoven’s Eighth, Beethoven’s Ninth), facts about grammar (sneaked, snuck), or increasingly precise values for the speed of light (‘2.99796, 2.997925, 2.99792458’).

The browser allows you to search different collections of books (called ‘corpora’). You’ll definitely want to try taking advantage of more than one corpus. For instance, compare ‘centre, center’ in both American and British English. Corpora are available in English, Chinese, French, German, Hebrew, Russian, and Spanish, so you can examine effects in many different cultures and compare them to one another (‘feminism’ in English vs. ‘féminisme’ in French, for instance.) If you look carefully, you can occasionally see evidence of censorship (such as ‘Marc Chagall’ in the German corpus under the Nazis.)

But even with all that data, you’ll need to carefully interpret your results. Some effects are due to changes in the language we use to describe things (‘The Great War’ vs. ‘World War I’). Others are due to actual changes in what interests us (note how ‘slavery’ peaks during the Civil War and during the Civil Rights movement.)

Watch out for the time period you are looking into: the best data is the data for English between 1800 and 2000. Before 1800, there aren’t enough books to reliably quantify many of the queries that first come to mind; after 2000, the corpus composition undergoes subtle changes around the time of the inception of the Google Books project. The other corpora are smaller, and can’t be used to go as far back in time; their metadata has also not been subjected to as much scrutiny as English in the bicentennial period.

Notice that the potential for music research is included in the above description. Data is displayed in the form of a graph, which plots the frequency of occurrence by decade. After just a few minutes of playing around with the search engine, I found some pretty interesting stuff! For example, here’s a screen capture of the graph showing data for the search term “French horn” [obtained from]

Notice the spike in the graph: use of the words “French horn,” at least in the documents archived in this database, seems to peak between 1940 and 1950. What does that mean? I have no idea! But it is suggestive, and seems like a viable topic for further research. Hooked yet? How about another search term, this time “Kopprasch.”

The peak this time is between 1980 and 2000. Again, interpreting this data would take a bit more research, but having this kind of raw statistic at your fingertips is an amazing tool in itself. And for a final example, how about “horn concerto”?

As you can already see, there are numerous possibilities for this search engine in music research, and I’m sure it will see much more use over the coming weeks and months.
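
For anyone who would rather check a peak numerically than by eye, here is a rough sketch in Python. It is only an illustration under some assumptions: it supposes you already have, for each year, the number of times a phrase occurred and the total number of words in that year's books. The function names and the sample numbers are made up for this example, and nothing here talks to the Culturomics site itself.

```python
from collections import defaultdict


def decade_frequencies(yearly):
    """Turn {year: (matches, total_words)} into {decade: relative frequency}.

    Relative frequency is matches divided by total words, pooled over
    all the years that fall in the same decade.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for year, (hits, words) in yearly.items():
        decade = (year // 10) * 10  # e.g. 1947 -> 1940
        matches[decade] += hits
        totals[decade] += words
    # Skip decades with no words at all to avoid dividing by zero.
    return {d: matches[d] / totals[d] for d in sorted(totals) if totals[d] > 0}


def peak_decade(yearly):
    """Return the decade with the highest relative frequency."""
    freqs = decade_frequencies(yearly)
    return max(freqs, key=freqs.get)


# Hypothetical counts, invented purely for illustration.
sample = {
    1935: (12, 1_000_000),
    1938: (15, 1_200_000),
    1942: (40, 1_100_000),
    1947: (38, 1_000_000),
    1955: (20, 1_500_000),
}

print(peak_decade(sample))
```

Dividing by the total number of words matters because far more books were published in some decades than in others, so raw counts alone could make a phrase look more popular than it really was.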

On another note, I’ll probably be posting less frequently over the next couple of weeks because of the holidays, but I plan to resume regular posts in the New Year. I want to wish all of my readers a safe and enjoyable holiday season!


