Visualizing 27 years, 12 million words of the Humanist list

Launch the visualization


Back in September, I spent some time playing around with a little project called Textplot, which converts a document into a network of terms by computing similarities between the distribution patterns of individual pairs of words. When you pass the network through a force-directed layout algorithm, it folds out into a kind of conceptual atlas of the document, a two-dimensional diagram that teases out the underlying topic structure of the text. War and Peace, for example, turns into a big triangle: war on the left, peace on the right, Tolstoy’s essays about history on top.

Under the hood, each word is converted into a probability density function across the width of the document – this makes it possible to compute a really fine-grained similarity score between any two words, which can then be used as an edge weight in the network:
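The idea is easy to sketch with numpy and scipy. This is a simplified stand-in, not Textplot's actual implementation – the function names, the grid size, and the choice of the Bhattacharyya coefficient as the overlap measure are my own here:

```python
import numpy as np
from scipy.stats import gaussian_kde  # assumes scipy is available

def density(offsets, doc_length, grid=512):
    """Kernel density estimate of a word's positions, sampled on a
    shared grid over the normalized width of the document."""
    xs = np.linspace(0, 1, grid)
    ys = gaussian_kde(np.array(offsets) / doc_length)(xs)
    return ys / ys.sum()  # normalize so every word traces the same area

def similarity(d1, d2):
    """Overlap between two densities (Bhattacharyya coefficient):
    1.0 for identical distributions, near 0 for disjoint ones."""
    return float(np.sum(np.sqrt(d1 * d2)))
```

Two words that rise and fall together across the document score near one; a word that clusters at the start and one that clusters at the end score near zero – and that score becomes the edge weight.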


In a lot of ways, the density functions look like the time-series plots that crop up in visualizations of how the frequency of a word changes over time – most notably, of course, the Google Ngram viewer, but also projects like the New York Times’ “Chronicle” tool, which does something similar for the NYT corpus going back to the middle of the 19th century. They’re not exactly the same thing, though. The biggest difference is that the density functions are normalized so that they always trace out the same amount of area over the X-axis, regardless of how many times the word shows up in the document – this is what makes it possible to compare the distributions of any two words, even if one shows up 1,000 times and the other just 20. The raw word counts in the n-gram viewers, by contrast, show the absolute difference in frequency between words. But the gist is similar – both capture information about how something fluctuates over time.
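The normalization point is easy to see with toy numbers – a word that shows up 1,000 times and one that shows up 20 times, spread across the corpus in the same way, become identical once each is divided by its own total (the counts here are invented for illustration):

```python
import numpy as np

# hypothetical per-period counts for two words with very different totals
common = np.array([400, 350, 250])   # 1,000 occurrences overall
rare   = np.array([8, 7, 5])         # 20 occurrences overall

# after normalization both trace out the same total area,
# so only the *shape* of the distribution matters
norm_common = common / common.sum()
norm_rare = rare / rare.sum()
```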

With the novels, though, the “time” axis isn’t really time at all, but instead what Matt Jockers calls the “novel time” of the text – the interval between the beginning and the end. This got me thinking – what would happen if the X-axis actually were, in fact, time in the literal sense of the word? What if the “text” were actually a huge corpus of documents, daisy-chained together in chronological order into a single mega-document, so that the “novel time” of the text corresponds to the “historical time” of the corpus? Would Textplot surface some kind of broad, diachronic shift in semantic focus, in the way that it captures the linear progressions in texts like Walden and the Divine Comedy?

Data cleaning

I decided to try this out with the Humanist list, the venerable, 27-year-old email listserv started by Willard McCarty at the University of Toronto in 1987. This seemed like a good place to start for a couple of reasons. The fulltext archive can be downloaded directly – and, as a built-in freebie, email is inherently chronological, which meant that I didn’t have to do any prep work to make sure that the documents were in the right order. I was also interested to try this with a corpus that I don’t know very much about. I’ve subscribed to the Humanist for the last couple of years, but I’ve only read it on and off (for a long time, Gmail flagged it as spam!), and I certainly don’t know anything about what the list was like for the 25 years before 2012. It’s an interesting opportunity, though, to test the usefulness of this kind of approach – without reading the entire thing, what could I learn about it?

I downloaded all 27 of the year-long archive files, and started by writing a little Python script to scrub out the large quantity of non-human-readable “header” information that gets prepended to most of the emails, leaving just the text that had actually been typed out by people (for the most part). Then I concatenated all of the files into a single, 80-megabyte humanist.txt file, which Textplot parses out into a cool 11.5 million words, consisting of 138,476 unique types. Last, I made a couple of tweaks to the logic that determines which words get added to the final network – I wanted to pick words that are the most characteristic of a particular period in the history of the corpus, in an effort to get the most coherent portrait of the diachronic shift over time (more on this later). Once all the pieces were in place, I built out the graph, fired up Gephi, flipped on Force Atlas 2, and watched the network open up into a big, spindly line: 1987 on the left, 2014 on the right:
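For reference, the prep stage amounted to roughly this – the header heuristic and the file layout here are hypothetical, and the real script was tuned to the quirks of the actual archives:

```python
import glob
import re

# lines that look like email routing headers (a hypothetical heuristic)
HEADER = re.compile(r'^(From|To|Date|Subject|Received|Message-ID|X-[\w-]+):', re.I)

def strip_headers(text):
    """Drop header-ish lines, keeping just the human-written body text."""
    return '\n'.join(
        line for line in text.splitlines() if not HEADER.match(line)
    )

def build_corpus(paths, out_path='humanist.txt'):
    """Concatenate the year-long archive files, in order, into one file."""
    with open(out_path, 'w') as out:
        for path in sorted(paths):  # sorted() keeps the years chronological
            with open(path) as f:
                out.write(strip_headers(f.read()) + '\n')

# e.g. build_corpus(glob.glob('archives/humanist.*.txt'))
```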


Almost immediately, I found myself tabbing back and forth between Gephi and the iPython terminal, pulling up the density functions for different terms to see how they do (or don’t) map onto the positions of the nodes in the network. This got annoying pretty quickly, so I decided to write some code that would take the raw GML that comes out of Gephi and turn it into an interactive, d3-powered viewer that makes it easier to compare the positions of the nodes in the network with the distribution profiles of the words across the history of the corpus:
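The conversion step is straightforward with networkx. Something like the sketch below, where the graph is flattened into the node/link structure that d3 expects – though the exact attribute names depend on what the GML exporter writes out, so treat the `x`/`y`/`weight` keys here as assumptions:

```python
import json
import networkx as nx  # assumes networkx is installed

def graph_to_json(g):
    """Flatten a laid-out graph into a d3-friendly node/link payload.
    Assumes each node carries x/y position attributes."""
    return {
        'nodes': [
            {'id': n, 'x': d.get('x'), 'y': d.get('y')}
            for n, d in g.nodes(data=True)
        ],
        'links': [
            {'source': s, 'target': t, 'weight': d.get('weight', 1.0)}
            for s, t, d in g.edges(data=True)
        ],
    }

# e.g. json.dump(graph_to_json(nx.read_gml('humanist.gml')),
#                open('graph.json', 'w'))
```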


As the network is panned and zoomed, the time axis at the bottom of the screen will automatically refocus so that it always shows the (approximate) temporal range of the current viewport:


And, to see the temporal pattern of an individual word, click on the label to open up a little chart that shows the kernel density estimate for that word, aligned below the “minimap” at the top right, which makes it easy to see how the final placement of the word compares to the density profile:


The correspondence is pretty tight, though not exact. Most of the nodes line up pretty closely with the approximate “center of mass” of their density functions, although some words get pulled off into weird positions that don’t really make sense. This is usually because the distribution of the word is really distinctly multimodal (it clusters in more than one place), which causes the layout algorithm to drag the node into a kind of median position, a no-man’s-land between the different regions of the network that the word is bound to.
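The bimodal failure mode is easy to reproduce with toy numbers – the center of mass of a density with two well-separated modes lands in the valley between them, right where the word barely appears at all:

```python
import numpy as np

xs = np.linspace(0, 1, 1001)

# a toy bimodal density: two narrow bumps, at 0.2 and 0.8
bimodal = np.exp(-(xs - 0.2) ** 2 / 0.002) + np.exp(-(xs - 0.8) ** 2 / 0.002)
bimodal /= bimodal.sum()

# the "center of mass" falls in the empty middle, not on either mode
centroid = float((xs * bimodal).sum())
```

This is roughly the median position a force-directed layout drags such a node into – pulled equally toward both regions of the network it is bound to.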

Reading the Humanist

But, overall, the layout does seem to tease out a kind of condensed, visual intellectual history of the list. To the left, at the very beginning, the list is dominated by words related to the hardware and software of the mid-80s – mainframe, microcomputer, workstation, wordperfect, printer, macintosh, vax, diskette, hypercard, bitnet, compatible, command, modem, compuserve, telnet, etc. It seems pragmatic, grounded in the day-to-day praxis of computing, and relatively low on theory. Beyond the technical terms, the academic conversation seems to center on language and textual studies – cyrillic, sanskrit, diacritics, arabic, grammar, lexicography, morphological, phonetic. And, probably not unrelated, there also seems to be a focus on religious studies – bible, hebrew, religious, church.

Then, the 90s seem to be about growth. There’s a sudden explosion of place names – Philadelphia, Pittsburgh, Pennsylvania, Georgetown, Quebec, Rutgers, Ottawa, Lancashire, Buffalo, Sussex, Florida, Sheffield, Washington, Oregon, Africa, Michigan, Toronto, Berkeley, Iowa, California, USA. And, in the first couple years of the aughts, there seems to be a broadening of scope, a turn outwards – press, national, forum, education, consortium, worldwide, dialogue, commerce. 2002 is anchored by issues, an interestingly flexible word with notes both of critical engagement – “issues in computational literary analysis,” etc. – and of difficulties, challenges, concerns.

Starting around 2000, there’s an uptick in words that seem broadly related to the day-to-day work of administering an academic discipline – creating structures for exchanging and communicating ideas, preserving results, evaluating the quality of work, etc – dissemination, evaluation, preservation, speakers, invited, lecturer, workshop, organised, proposals, interviews, submission, deadline. By the end of the decade, a distinctly type-2 DH is anchored by collaborative, and other words that hang together with an understanding of DH as something that takes place across traditional disciplinary and professional lines – team, alliance, intersection, technologists, interdisciplinary, roundtable.

The “spatial turn” seems to register around the same time (spatial, gis), and the network then drifts into more or less the present moment. PhD, studentship, and postdoctoral, all of which have been gradually gaining ground since about the turn of the century, surge around 2012, perhaps tracking the formalization of DH as a discrete field of study, instead of just a methodology that gets mixed into existing disciplines? At the far right is a cluster of terms related to social media and modern web products, which provides a tidy counterweight to the 80s-era technologies to the left – gmail, wordpress, ipad, blogspot. And, of course, twitter and facebook, both of which peak in unison before rapidly falling off in the spring of 2012, which seems to have been the moment of peak-DH-social-media? It’s fascinating to see just how late in the game digitalhumanities (and then, a bit later, the abbreviated dh) come into view – neither existed before about 2005.

This is especially fun for me because the history of the list maps almost exactly onto my own life! I was born on June 25, 1987, just 44 days after Willard McCarty sent the first message on May 12:

“This is test number 1. Please acknowledge.”

Next steps

This seems to work well for the Humanist, but I’m curious to see how the same technique generalizes to other corpora – especially at really large scales, and over much longer temporal intervals. I’m thinking about trying to do something similar with the newly released feature-count data set from HathiTrust, which provides page-level term counts for 250,000 volumes published between 1431 and 2010. Would you get the same kind of broad, unified, coherent diachronic progression that surfaced out of the Humanist? Or is that the exception, not the rule?