N-grams, So What?


Digital tool of the week: the n-gram. I have spent at least a good hour on Google’s N-gram Viewer, which charts how often a word or phrase appears, year by year, across the corpus of books digitized by Google.

So far I have been using the N-gram Viewer to confirm old hypotheses. Baby steps. Today I ran a search on ‘zombie’ within the parameters of the 20th century, which largely served to affirm the findings I took ages to assemble for my undergraduate dissertation. In the n-gram graph, ‘zombie’ surfaces only in 1928 (which according to further research stems from the circulation of William B. Seabrook’s Haiti travelogue The Magic Island, arguably the first mention of the zombie in English writing), then enjoys a brief blip of popularity before sinking back into obscurity through the 1930s. We see vague interest floundering throughout the 1940s and 1950s, thanks to the black-and-white B-movie zombie craze. Oddly, the spike of popularity I expected in the late 1960s, coinciding with George A. Romero’s landmark film Night of the Living Dead, is not as pronounced as I thought it would be, possibly because Romero did not really use the word ‘zombie’ in his film but rather ‘ghoul’, and the word only got tagged on later once the film reached cult status.

This brings me to one of the fallibilities of the n-gram, which Ted Underwood has aptly pointed out in his blogpost ‘How Not To Do Things With Words’: the keyword used is all-important. Were I to go blindly searching for the keyword ‘zombie’ alone, the results would be undeniably skewed. Similarly, an n-gram comparing the frequency of mentions of Ernest Hemingway against F. Scott Fitzgerald shows Fitzgerald winning when only their surnames are used in the search, but Hemingway comes out on top when full names are used. These are details one expects statisticians to deal with, not literature students. But the world changes.
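For the curious, here is a rough sketch of why the keyword matters so much. It is not how Google’s viewer actually works, just a toy illustration: the function name `ngram_frequency`, the naive whitespace tokenizing and the two-line “corpus” are all my own inventions. It counts how often a phrase turns up among all the word sequences of the same length, so a surname (one word) and a full name (two words) are measured against different totals.

```python
from collections import Counter

def ngram_frequency(text, phrase):
    """Share of all n-grams in `text` that match `phrase` (case-insensitive).

    Tokenizing is a naive whitespace split, an assumption made for brevity;
    punctuation is not handled.
    """
    tokens = text.lower().split()
    target = tuple(phrase.lower().split())
    n = len(target)
    # Every run of n consecutive words in the text.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return Counter(ngrams)[target] / len(ngrams)

# Hypothetical toy corpus keyed by year; the sentences are invented.
toy_corpus = {
    1925: "fitzgerald published gatsby while ernest hemingway wrote in paris",
    1926: "ernest hemingway published a novel and hemingway became famous",
}

for year, text in sorted(toy_corpus.items()):
    surname = ngram_frequency(text, "hemingway")
    full_name = ngram_frequency(text, "ernest hemingway")
    print(year, round(surname, 3), round(full_name, 3))
```

In the invented 1926 line, ‘hemingway’ appears twice but ‘ernest hemingway’ only once, so the two searches give different curves for the same author. The same thing happens, on a vastly larger scale, with surnames versus full names in the real viewer.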

Franco Moretti has said of the Google Books database: ‘It’s like the invention of the telescope. All of a sudden, an enormous amount of matter becomes visible.’ He was quoted in the Chronicle in an article by Marc Parry. In it, Parry trails Stanford researchers Moretti and Matt Jockers in their Literature Lab, where a team of English literature, history and computer science experts pursues digital literary research. At the time of the report (published in 2010), they were text-mining 19th-century novels. Parry, in a deft turn of phrase, produces the perfect metaphor to explain text-mining to the layman: ‘…they cast giant digital nets into that megapot of words, trawling around like intelligence agents hunting for patterns in the chatter of terrorists.’

Some days the whole operation of big data and text mining and concordances begins to sound frighteningly like the Book Machine of Lagado, in Jonathan Swift’s Gulliver’s Travels:

‘[The machine] was twenty feet square, placed in the middle of the room (…) composed of several bits of wood (…) covered, on every square, with paper pasted on them; and on these papers were written all the words of their language, in their several moods, tenses, and declensions; but without any order. The professor then desired me “to observe; for he was going to set his engine at work.” The pupils, at his command, took each of them hold of an iron handle, whereof there were forty fixed round the edges of the frame; and giving them a sudden turn, the whole disposition of the words was entirely changed. He then commanded six-and-thirty of the lads, to read the several lines softly, as they appeared upon the frame; and where they found three or four words together that might make part of a sentence, they dictated to the four remaining boys, who were scribes. This work was repeated three or four times, and at every turn, the engine was so contrived, that the words shifted into new places, as the square bits of wood moved upside down.

Six hours a day the young students were employed in this labour; and the professor showed me several volumes in large folio, already collected, of broken sentences, which he intended to piece together, and out of those rich materials, to give the world a complete body of all arts and sciences; which, however, might be still improved, and much expedited, if the public would raise a fund for making and employing five hundred such frames in Lagado, and oblige the managers to contribute in common their several collections.’

I guess that would be Swift, presciently and unpleasantly, giving us his thoughts on text analysis from the 18th century.

How do we prevent the literary take on Big Data from turning into the nonsensical Lagado book machine?  Moretti and Jockers say in the Chronicle article that they ask themselves every day: so what?  So we can read one thousand novels in one click of a button.  So what?   So, I suppose, that one day we can make generalizations about all of literature and then prove they are not just generalizations.  We just have to wade through all the statistics first.  Among other things.

  • Swift, Jonathan. Gulliver’s Travels. New York: Airmont, 1963.



