Literary MRIs (or, tuning Textplot)

Last week I posted some instructions for getting up and running with Textplot. But, in the step that involves actually using Textplot, there was some handwaving:

textplot generate pg2600.txt war-and-peace.gml --bandwidth 10000

Why 10,000 for the bandwidth? And, what about the other open parameters that the CLI app exposes – term_depth, skim_depth, kernel, etc.? These have a big impact on the structure of the final graphs, and it’s often worth fiddling with them a bit to find the best combination. This can especially make sense if you’re working with texts that are really long, or a big corpus of chronologically-stacked documents – out of the box, Textplot is tuned to work well with roughly novel-length documents (say, a couple hundred thousand words), and there are a handful of settings that should be tweaked if you go out of that range.

And, either way, I think this kind of knob-turning is useful intellectually. It’s a good guard against tea-leaf-reading and confirmation bias, a never-distant risk with this kind of work. By drawing out the same text with really different parameters, you get a sense of what’s “durable” about the structure of the network (and, possibly, solid ground for literary-critical argument) – and what’s just ephemeral, an artifact of the layout algorithm, or some kind of quirk in the particular combination of words and connections that slipped above the different thresholds for inclusion.
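One way to make this kind of knob-turning routine is to script it. The sketch below only builds the command lines for a small grid of settings. It assumes the CLI takes `--term_depth` and `--skim_depth` flags named after the parameters (only `--bandwidth` appears in the command above), and the `sweep_commands` helper and the output file names are made up for illustration.

```python
from itertools import product


def sweep_commands(text_path, term_depths, skim_depths, bandwidths):
    """Build one `textplot generate` command per parameter combination."""
    commands = []
    for td, sd, bw in product(term_depths, skim_depths, bandwidths):
        # Hypothetical naming scheme, so each graph records its own settings.
        out_path = f"td{td}-sd{sd}-b{bw}.gml"
        commands.append([
            "textplot", "generate", text_path, out_path,
            "--term_depth", str(td),  # assumed flag name
            "--skim_depth", str(sd),  # assumed flag name
            "--bandwidth", str(bw),   # flag used in the command above
        ])
    return commands


commands = sweep_commands("pg2600.txt", [500, 1000], [10], [5000, 10000])
for cmd in commands:
    print(" ".join(cmd))
    # To actually render each graph: subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to check the grid before committing to what could be a long batch of renders.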

There are three main parameters worth looking at:

term_depth

The simplest is the term_depth, which is just the number of words that get added to the network. Although in theory you could use all the words in the text, this isn’t actually very interesting – most words appear just once or twice (see Zipf’s Law), which doesn’t give enough information to say anything statistically meaningful about whether the word “represents” any particular part of the document. Most connections would be noise, a reflection of the incidental co-occurrences of low-frequency terms.

For now, Textplot gets around this by just taking the top N most frequent words in the document, once stopwords are removed. I think there are probably cleverer ways to do this – I’ve been trying to find ways of picking out particularly “clumpy” words that appear both very frequently and in really distinct clusters. But, for now, just skimming off the most frequent words seems to be a reasonably good, low-magic solution.
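As a rough illustration of the "top N after stopwords" idea, here is a minimal sketch. It is not Textplot's actual tokenizer: the regex tokenization, the tiny `STOPWORDS` set, and the `top_terms` helper are all assumptions made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real run would use a much fuller one.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "was", "he", "his", "that"}


def top_terms(text, term_depth):
    """Return the `term_depth` most frequent non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(term_depth)]


sample = "The war and the peace of the war. He saw the war, and peace was his hope."
print(top_terms(sample, 2))
```

Everything below the cutoff is simply dropped, which is why raising term_depth pulls in progressively rarer words.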

It goes without saying, the fewer the words, the smaller and simpler the network. Here’s War and Peace in 100 words, which still shows the basic war / peace opposition, but not much else:

td100-sd10-b10000-web2

term_depth: 100 / skim_depth: 10 / bandwidth: 10000

500 words surfaces the historiography cluster, and seems to pull out a minimally “complete” model of the document:

td500-sd10-b10000-web2

term_depth: 500 / skim_depth: 10 / bandwidth: 10000

1000 words (the default):

td1000-sd10-b10000-web2

term_depth: 1000 / skim_depth: 10 / bandwidth: 10000

By 2,000 words, the “war” section looks more confused:

td2000-sd10-b10000-web2

term_depth: 2000 / skim_depth: 10 / bandwidth: 10000

Pretty quickly, you start to dip down into the reservoir of terms that appear too infrequently to be interesting. At 4,000 words, these start to clutter up the scene in the form of little clumps around the edges that correspond to specific sections in the text that use a distinctive set of words (e.g., the spoke on the far right is the famous wolf hunting scene in Book 7). But, though these are often recognizable and coherent, they don’t add much information about the general structure of the layout, and it all starts to look tangled and haphazard:

td4000-sd10-b10000-web2

term_depth: 4000 / skim_depth: 10 / bandwidth: 10000

At the far end of the spectrum, it becomes basically meaningless. At 10,496 words (all of them), the words that appear just once get linked up into “chains” with the other words that show up immediately before and after, which, when passed through the layout algorithm, get shot off as a set of weird, shroud-like tendrils in the periphery:

td8000-sd10-b10000-web2

term_depth: 10496 / skim_depth: 10 / bandwidth: 10000

A ghost? An octopus? The angel of history? Who knows. It’s almost like focusing a lens – too few words, and important things are missing; too many, and the structural essence gets drowned out by the noise of the low-frequency terms. I wonder – is it always true that the basic structural essence of a text is instantiated in about 500 words? Or is that just a function of the length of the document? Or, does it vary in a meaningful way from author to author, genre to genre? Beyond just eyeballing it – could it be quantified? I’m most interested in this kind of question. I think the visualizations are fun as literary intuition pumps, to borrow an idea from Daniel Dennett (via my wife) – provocations that force you to formalize and evaluate your mental model of something. But, in the long run, I think they might be intellectually interesting to the extent that they can point the way to new ways to classify and compare texts en masse.

skim_depth

The skim depth is the number of “siblings” that each word is connected to in the network. For example, in War and Peace, these 20 words have distribution patterns most similar to “napoleon”:

If the skim depth is 5, then these edges will be added to the network:

napoleon -> war (0.65319871313854128)
napoleon -> military (0.64782349297012154)
napoleon -> men (0.63958189887106576)
napoleon -> order (0.63636730075877446)
napoleon -> general (0.62621616907584432)

Or, if 6, then an edge to “russia” will also be added:

napoleon -> russia (0.62233286026418089)

And so on and so forth. Roughly speaking, this controls the “clumpiness” or “connectivity” of the graph. If it’s low, then words will only be connected to words that show up in very similar patterns, and the nodes will tend to ball up into really assortative little clusters. If it’s higher, then more connections will be added for each word, but each additional connection will be weaker than the last. These weaker edges will tend to bridge across topic clusters formed by the more statistically significant edges, which produces a smoother, more evenly-bound structure.
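The edge selection can be sketched in a few lines. This assumes the pairwise similarity scores have already been computed and stored in a nested dict; the `skim_edges` helper is hypothetical, and the scores are the "napoleon" values listed above, rounded to four places.

```python
def skim_edges(similarities, skim_depth):
    """Keep the `skim_depth` strongest edges for each term.

    `similarities` maps term -> {other_term: score}, higher = more similar.
    """
    edges = []
    for term, scores in similarities.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        for other, score in ranked[:skim_depth]:
            edges.append((term, other, score))
    return edges


# Scores for "napoleon" from the listing above, rounded.
similarities = {
    "napoleon": {
        "war": 0.6532,
        "military": 0.6478,
        "men": 0.6396,
        "order": 0.6364,
        "general": 0.6262,
        "russia": 0.6223,
    }
}

for source, target, weight in skim_edges(similarities, 5):
    print(f"{source} -> {target} ({weight})")
```

Raising `skim_depth` from 5 to 6 adds only the next-weakest edge (to "russia"), which is the sense in which each additional connection is weaker than the last.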

Here’s War and Peace with 1000 words and a (very low) skim depth of 3, which gives a much more spindly and sparsely-connected version of the triangle:

td1000-sd3-b10000-web2

term_depth: 1000 / skim_depth: 3 / bandwidth: 10000

At 10 (the default), it’s bound together more tightly:

td1000-sd10-b10000-web2

term_depth: 1000 / skim_depth: 10 / bandwidth: 10000

At 20, the edges of the triangle start to get smoothed out:

td1000-sd20-b10000-web2

term_depth: 1000 / skim_depth: 20 / bandwidth: 10000

Even more so at 50:

td1000-sd50-b10000-web2

term_depth: 1000 / skim_depth: 50 / bandwidth: 10000

And, beyond a certain point, it just turns into a ball – words end up getting linked to words that they don’t have much in common with at all. Only the broadest strokes survive:

td1000-sd200-b10000-web2

term_depth: 1000 / skim_depth: 200 / bandwidth: 10000

bandwidth

Last but not least, the bandwidth controls the smoothness of the underlying probability density functions. For example, here’s “napoleon” in War and Peace, with a 500-word bandwidth:

bw500

This is probably too low. The idea is that the density function should extract a kind of statistical “trend” for a word – it should smooth out the noise, but not so much that it papers over important information about the overall pattern. Here, though, the distribution looks “undersmoothed,” in the sense that it’s splitting apart clusters that intuitively seem like they should hang together (for instance, the side-by-side peaks around word 375k – those look like they belong to the same “unit” of plot, to my eye). Here’s the resulting layout, which looks sort of diffuse and half-baked:

td1000-sd10-b500-web2

term_depth: 1000 / skim_depth: 10 / bandwidth: 500
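To see how bandwidth changes a single word's density function, here is a small stdlib-only sketch. It assumes a Gaussian kernel and uses invented word offsets (two bursts of a term about 4,000 words apart); the `kde` and `count_peaks` helpers are illustrative, not Textplot's implementation.

```python
import math


def kde(offsets, bandwidth, text_length, samples=1001):
    """Sample a Gaussian kernel density estimate of a term's word offsets."""
    step = text_length / (samples - 1)
    xs = [i * step for i in range(samples)]
    norm = len(offsets) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - o) / bandwidth) ** 2) for o in offsets) / norm
        for x in xs
    ]


def count_peaks(ys):
    """Count strict local maxima in a sampled curve."""
    return sum(1 for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] > ys[i + 1])


# Hypothetical offsets: two bursts of a word in a 100,000-word text.
offsets = [40_000 + i * 100 for i in range(11)] + [44_000 + i * 100 for i in range(11)]

narrow = kde(offsets, bandwidth=500, text_length=100_000)
wide = kde(offsets, bandwidth=5_000, text_length=100_000)
print("peaks at bandwidth 500: ", count_peaks(narrow))
print("peaks at bandwidth 5000:", count_peaks(wide))
```

With the narrow bandwidth the two bursts should register as separate peaks, while the wider one folds them into a single hump. That is the same trade-off as in the "napoleon" plots, at toy scale.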

5,000 looks like a better representation of the high-level trend:

bw5000

And, the layout seems to come closer into focus:

td1000-sd10-b5000-web2

term_depth: 1000 / skim_depth: 10 / bandwidth: 5000

10,000 seems like a sweet spot for War and Peace:

bw10000

td1000-sd10-b10000-web2

term_depth: 1000 / skim_depth: 10 / bandwidth: 10000

The historiography cluster is more elongated and drifts up higher from the main body, and the war and peace sections each seem to split into two recognizable sub-sections – for war, the actual battles (guns, battery, cannon, cavalry, flank, artillery) as opposed to the things that come before and after (campfires, road, wagons, carts, prisoners, captured). And, for peace, the social life of adults (drawing, room, hall, guests, anna, pavlovna, bezukhov, bear) versus the domestic life of families and children (marriage, mother, natasha, countess, supper, papa, household).

At 20,000, the density functions start to get a bit abstract:

bw20000

And the main war <-> peace axis starts to compress towards the center:

td1000-sd10-b20000-web2

term_depth: 1000 / skim_depth: 10 / bandwidth: 20000

And, at 50,000, it pretty much breaks down:

bw50000

td1000-sd10-b50000-web2

term_depth: 1000 / skim_depth: 10 / bandwidth: 50000

It seems like the higher bandwidths (but only to a point) do a better job of sharpening out the conceptual architecture of the document? I suppose this makes sense – the higher bandwidths tend to iron out the incidental unevenness in the distributions, with the effect of linking words more “accurately” with the other terms that ebb and flow in the most similar high-level patterns?

They remind me of MRI images – cross-sectional slices of the insides of something, each true but incomplete. Not unlike this MRI of a piece of broccoli:

animation

War and Peace, with a bandwidth ranging from 500 to 50,000 words

  • Micki

    Another hugely useful post, thank you David! I think those ‘broccoli’ trees are a promising way to render useful information from the time axis into a spatial one… Looking forward to seeing you in about a week!