Textplot refresh – Python 3, PyPI, CLI app

A couple weeks ago, I flew back to the east coast to give some Neatline workshops at Davidson College, which was lovely. I finally got to meet Mark Sample in person, after following him on Twitter for many years (we brainstormed about some very cool possible e-lit / digital art collaborations), and I learned about some really interesting projects in the English and History departments that are making use of Neatline.

I also had the chance to spend some time with Mark’s “Data Culture” class, where I talked a bit about Textplot. After the class, I got some questions about how to get up and running with the code, which reminded me that I really need to get back on this project. It got pushed to the back burner in December when I got sucked into all the fascinating things happening over at the Open Syllabus Project, but I’ve been fiddling with it on and off for the last few months, and I’ve been meaning to write up some recent experiments. And, I’m giving a paper about the project at the ACH conference in June, so this seemed like a good time to give the code a little refresh, wrap it up as a usable piece of software, and fill out the documentation a bit.

Three big changes – first, I moved the code over to Python 3, in an effort to build up karma with the Python gods (homebrew and venv make it easy to work with multiple versions, if you’re on 2). And, it’s now published as a PyPI package, which means it can be installed with a regular pip install textplot. Last, I added a little CLI app that makes it possible to build out a graph without having to fire up a Python shell – stuff like, textplot generate war-and-peace.txt war-and-peace.gml.
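
Because the CLI is just an executable, it can also be scripted for batch jobs. The following is a minimal sketch, not part of Textplot itself: the helper names (build_generate_command, commands_for_directory, run_all) are hypothetical, and the only thing assumed about the CLI is the textplot generate IN_PATH OUT_PATH form shown above.

```python
import shlex
import subprocess
from pathlib import Path


def build_generate_command(in_path, out_path, extra_args=()):
    """Build the argument list for `textplot generate IN_PATH OUT_PATH`."""
    return ["textplot", "generate", str(in_path), str(out_path), *extra_args]


def commands_for_directory(text_dir):
    """One command per .txt file, writing a .gml file next to each text."""
    return [
        build_generate_command(path, path.with_suffix(".gml"))
        for path in sorted(Path(text_dir).glob("*.txt"))
    ]


def run_all(text_dir, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for cmd in commands_for_directory(text_dir):
        print(shlex.join(cmd))
        if not dry_run:
            # Requires the `textplot` executable to be on the PATH,
            # i.e. the virtual environment must be active.
            subprocess.run(cmd, check=True)
```

Calling run_all("texts") with the default dry_run=True only prints the commands, which is a cheap way to check the paths before committing to a long run.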

As much as I love Python, the packaging / dependency-management toolchain is kind of difficult to work with, so I figured I’d write out a full list of steps to configure the environment, install the code, and wire up Textplot with Gephi. This assumes you’re on a Mac, but (I think) it should all more or less work on Windows, except for the very first part about installing Python.

Part 1: Generating a graph from a text file

First, we’ll use the textplot executable to convert a plaintext file into a Gephi-readable GML file:

  1. If you don’t already have it, install Homebrew, a package manager for OSX:

    ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

    (On Windows, I believe the easiest thing is just to download an installer from python.org.)

  2. Then, use brew to install Python 3:

    brew install python3

  3. Next, create a directory to house your Python virtual environments:

    mkdir ~/.env

  4. Then, create a new environment for textplot:

    pyvenv ~/.env/textplot

  5. And then activate the environment:

    . ~/.env/textplot/bin/activate

    Now, you should see a little (textplot) to the left of your prompt on the command line, which tells you which virtual environment is active. When you open a new shell, you’ll need to run this command again to flip on the right environment.

  6. Next, install Textplot with:

    pip install textplot

  7. Textplot uses a couple of pretty heavyweight libraries (namely numpy/scipy and scikit-learn), so this might take a few minutes to run, depending on what kind of computer you’re using. Once it’s finished, run:

    textplot generate --help

    And you should see some help output that lists out the different arguments and flags for the CLI command:

    Usage: textplot generate [OPTIONS] IN_PATH OUT_PATH
    
      Convert a text into a GML file.
    
    Options:
      --term_depth INTEGER            The total number of terms in the network.
      --skim_depth INTEGER            The number of words each word is connected
                                      to in the network.
      --d_weights                     If set, connect "close" terms with low edge
                                      weights.
      --bandwidth INTEGER             The kernel bandwidth.
      --samples INTEGER               The number of times the kernel density is
                                      sampled.
      --kernel [gaussian|tophat|epanechnikov|exponential|linear|cosine]
                                      The kernel function.
      --help                          Show this message and exit.
    
  8. Now, the fun part – let’s build out a term network from a text. For the time being, Textplot needs a plain text file (not HTML, PDF, etc.). If you don’t have one handy, grab War and Peace from Project Gutenberg:

    wget http://www.gutenberg.org/cache/epub/2600/pg2600.txt

  9. And use the textplot generate command to build the network and write it out as a GML file:

    textplot generate pg2600.txt war-and-peace.gml --bandwidth 10000

    (More in a bit about how to choose values for bandwidth and the other open parameters.) This will take a few seconds to run – Textplot has to tokenize the text, compute probability densities for all of the words, and index a big term matrix that stores the similarities between the unique pairs.
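
To make the tokenize / density / similarity pipeline a bit more concrete, here is a toy sketch in plain Python. It is an illustration under stated assumptions, not Textplot's actual code: the function names are made up, the bandwidth and samples arguments only mirror the CLI flags of the same name, the kernel is the Gaussian one from the list of kernel choices, and the "overlap" score is an assumed stand-in for whatever similarity measure Textplot really uses.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_offsets(tokens):
    """Map each term to the list of positions where it occurs."""
    offsets = defaultdict(list)
    for position, token in enumerate(tokens):
        offsets[token].append(position)
    return offsets


def density(offsets, text_length, bandwidth, samples):
    """Gaussian kernel density estimate of a term's positions.

    The estimate is evaluated at `samples` evenly spaced points across
    the text and normalized so that the sampled values sum to 1.
    """
    points = [i * (text_length - 1) / (samples - 1) for i in range(samples)]
    values = [
        sum(math.exp(-0.5 * ((x - o) / bandwidth) ** 2) for o in offsets)
        for x in points
    ]
    total = sum(values)
    return [v / total for v in values]


def overlap(p, q):
    """Shared mass of two normalized densities (assumed similarity score)."""
    return sum(min(a, b) for a, b in zip(p, q))


# Toy text: "war" and "battle" cluster at the start, "peace" at the end.
tokens = tokenize("war battle " * 50 + "filler " * 200 + "peace home " * 50)
offsets = term_offsets(tokens)
dens = {
    term: density(offsets[term], len(tokens), bandwidth=20, samples=100)
    for term in ("war", "battle", "peace")
}
print("war/battle:", overlap(dens["war"], dens["battle"]))
print("war/peace: ", overlap(dens["war"], dens["peace"]))
```

Terms that show up in the same stretches of the text get a score near 1, and terms that never share a neighborhood get a score near 0. A larger bandwidth smooths each density over more of the text, so more pairs end up looking similar.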

Part 2: Compute a force-directed layout in Gephi

Once this is finished, we can take the GML file generated by Textplot and load it into Gephi, which can compute a force-directed layout. If you don’t already have it, get Gephi from gephi.github.io.
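
Before opening the file in Gephi, it can be worth a quick sanity check that the graph has roughly the number of nodes and edges you expect. The sketch below is a rough regex-based count, not a full GML parser, and the sample graph is hand-written for illustration (the exact attributes Textplot writes, such as the edge weight key, are an assumption here).

```python
import re


def summarize_gml(gml_text):
    """Count node and edge blocks and collect quoted labels from GML text."""
    return {
        "nodes": len(re.findall(r"\bnode\s*\[", gml_text)),
        "edges": len(re.findall(r"\bedge\s*\[", gml_text)),
        "labels": re.findall(r'\blabel\s+"([^"]*)"', gml_text),
    }


# Hand-written sample in the general GML layout (illustrative only).
SAMPLE = """
graph [
  node [
    id 0
    label "war"
  ]
  node [
    id 1
    label "peace"
  ]
  edge [
    source 0
    target 1
    weight 0.5
  ]
]
"""

print(summarize_gml(SAMPLE))

# For a real file, something like:
#   with open("war-and-peace.gml", encoding="utf-8") as fh:
#       print(summarize_gml(fh.read()))
```

If the node count is far off from the term depth you asked for, that is usually a sign that something went wrong upstream, before any time is spent on the layout.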

  1. Once Gephi is up and running, click File > Open, and select the war-and-peace.gml file. Set “Graph Type” to “Undirected,” and click “OK” to pull in the file.

  2. In the “Layout” panel on the bottom left, select the “Force Atlas 2” option from the dropdown select, and then click “Run.” These kinds of layout algorithms don’t ever actually finish (they’re basically just physics simulations, so they’re happy to churn away forever). But, they do settle into a kind of equilibrium, after which more iterations of the algorithm won’t do much to change the overall structure.

    Once the layout stabilizes, flip on the “Prevent Overlap” option in the “Layout” panel and let it run for a few more seconds, which gives Gephi time to spread out nodes that end up stacked on top of each other. (It’s best not to check this at the start, since it can sometimes “encumber” more complex networks and slow down the unfolding of the high-level structure.) Once it settles down again, click “Stop.”

  3. Now, we can fiddle with the style settings and save a render of the network. Click over into the “Preview” tab on the top, and then click the “Refresh” button at the bottom of the left panel.

  4. By default, Gephi will just render the nodes as gray circles, which isn’t much use to us in this case, since we want to be able to see the words that the nodes represent. In the “Node Labels” section on the left, flip on “Show Labels” to display the words and uncheck “Proportional Size,” which we won’t need for now. And, in the “Nodes” section, set “Opacity” to 0 to hide the node circles. After making changes, hit “Refresh” to update the preview render.

  5. I usually end up tweaking the style of the labels to make them a bit easier to read. Open the configuration box for the “Color” setting, click the swatch next to “Custom,” and pick some kind of dark color that looks good on the eyes. And, back in “Preview Settings,” I like to set “Outline Size” to 3.0, which puts a white border around the labels and makes them easier to read against the backdrop of the edges.

  6. Last, down in the “Edges” field set, drop the “Thickness” down to 0.3 and the “Opacity” to 30 – this makes the edges less dense, which makes it easier to see the high-level structure of the connections.

  7. One last step before saving off a render of the network – it can sometimes make sense to rotate the layout into some kind of orientation that makes it easier to think/talk about. This is purely aesthetic – the rotation that comes out of Force Atlas 2 doesn’t have any meaning (as opposed to the relative positions of the nodes, which will be very similar from one run of the layout to another). In this case, though, I want the “triangle” shape of War and Peace to look as triangle-y as possible, with the main axis between “war” and “peace” running horizontal along the bottom of the layout.

    To do this, click back into the “Overview” tab, and then select either the “Clockwise Rotate” or “Counter-Clockwise Rotate” option, depending on which direction you want to go. Set the angle to 1.0, which gives you the most precision, and then click “Run” to spin the layout.

  8. Last but not least, we can save off the final image. Click File > Export > SVG/PDF/PNG, and select “PNG” from the “File Format” dropdown. Click the “Options” button, and enter some kind of reasonably large dimension into the “Width” and “Height” fields – I usually draw out a really big render, around 20,000 pixels, and then crop it down to size in Photoshop. Click OK, and then “Save” to write the image.

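
On the rotation step: the reason it is safe to spin the layout freely is that a rotation is a rigid transformation, so every pairwise distance between nodes stays the same. A small sketch of that idea, with made-up coordinates (the node positions below are hypothetical, and the rotation here pivots on the origin, which may not be the pivot Gephi uses):

```python
import math


def rotate(points, degrees):
    """Rotate every (x, y) point counter-clockwise about the origin."""
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return {
        name: (x * cos_t - y * sin_t, x * sin_t + y * cos_t)
        for name, (x, y) in points.items()
    }


def distance(points, a, b):
    """Euclidean distance between two named points."""
    (x1, y1), (x2, y2) = points[a], points[b]
    return math.hypot(x2 - x1, y2 - y1)


# Hypothetical node positions, standing in for a finished layout.
layout = {"war": (0.0, 0.0), "peace": (3.0, 4.0), "history": (1.0, 5.0)}
rotated = rotate(layout, 30)

for a, b in [("war", "peace"), ("war", "history"), ("peace", "history")]:
    print(a, b, distance(layout, a, b), distance(rotated, a, b))
```

The two distance columns agree up to floating-point noise, whatever the angle, which is why only the relative positions carry meaning.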

And, the final render (click to zoom, which is powered by osd-dzi-viewer):

[Image: the final render of the War and Peace network]

If you give this a try, let me know how it goes!

  • scott.enderle@gmail.com

    Went very smoothly! I have a question for you: have you tried markov clustering with this? This is a good explanation: https://www.cs.ucsb.edu/~xyan/classes/CS595D-2009winter/MCL_Presentation2.pdf

    I find it works pretty well for creating topic clusters from topic model networks: https://github.com/senderle/tmtk

    • dclure

      Hey Scott,

      Very interesting, I will definitely give that a shot! Thanks for providing the implementation, looks really robust. Graph clustering is something I don’t know much about, but want to learn more – it’s the obvious next step after something like Textplot.

      • scott.enderle@gmail.com

        Thanks — though your code is much better documented & deployed! I’ve got to work tmtk into a real Pypi package and break it out of its current monolithic state.

        I’m actually trying to add it to textplot right now — I’ll post an “issue” and patch at github if I can get it working.

        • dclure

          That would be awesome. Let me know if anything is unclear in the Textplot code.

  • Matthew Santone

    Not sure if this is still active. Is there any way to not remove characters like + and - when the text is tokenized? I’m working with wine data that uses terms like med-, med+ and those are being lost. Thanks!