Open Syllabus 2.0 | Citation extraction pipeline

An NER and entity-linking pipeline that identifies references to books and articles in college course syllabi. This is similar to traditional citation-extraction projects that match structured bibliographic strings (e.g., in scientific papers). With syllabi, however, we have to account for a large amount of fuzziness and inconsistency, since syllabi are messy documents with essentially no standardization in how texts are referenced.

We take a two-step approach. First, starting with a bibliographic database of ~65 million books and articles, we surface a set of candidate matches based on a raw keyword match of tokens from the title of the work and the author’s last name. (We index the title and author token sequences in a space-optimized trie, implemented in Rust, which makes it possible to match all possible references with a single linear scan through the tokens in a document.) This produces a large set of matches: places where the title and author of a text appear in close proximity in a document. In many cases, this is enough to identify a work. For example, if the tokens “attention is all you need” and “vaswani” appear in close proximity, then this is almost certainly a reference to the paper. The difficulty, though, is that large bibliographic databases also contain a set of works (generally books) with short titles that consist of relatively frequent tokens, and where the author name is also somewhat common. Take the “Politics” by Aristotle: if “politics” and “aristotle” appear within ~10 tokens of each other, then it might be a reference to the Politics, but it might also just be an incidental co-occurrence of the words in regular prose. These works produce millions of false-positive matches, which then need to be pruned.
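To make the candidate-matching step concrete, here is a minimal sketch in Python. It is not the production code (which is a space-optimized trie in Rust); it uses a plain dictionary-based trie. The names `TokenTrie`, `build_trie`, `candidate_matches`, the `max_gap` parameter, and the simple regex tokenizer are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


class TokenTrie:
    """Trie keyed on tokens; the None key holds payloads for complete sequences."""

    def __init__(self):
        self.root = {}

    def add(self, tokens, payload):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node.setdefault(None, []).append(payload)

    def scan(self, tokens):
        """One pass over tokens, yielding (start, end, payload) for each indexed sequence."""
        active = []  # partial matches still alive: (start index, trie node)
        for i, tok in enumerate(tokens):
            next_active = []
            # Extend every live partial match, and try starting a new one here.
            for start, node in active + [(i, self.root)]:
                child = node.get(tok)
                if child is None:
                    continue
                for payload in child.get(None, ()):
                    yield start, i + 1, payload
                next_active.append((start, child))
            active = next_active


def build_trie(works):
    # works maps a work id to (title, author surname).
    trie = TokenTrie()
    for work_id, (title, surname) in works.items():
        trie.add(tokenize(title), ("title", work_id))
        trie.add(tokenize(surname), ("author", work_id))
    return trie


def candidate_matches(trie, text, max_gap=10):
    """Pair title and author hits for the same work that lie within max_gap tokens."""
    tokens = tokenize(text)
    titles, authors = defaultdict(list), defaultdict(list)
    for start, end, (kind, work_id) in trie.scan(tokens):
        (titles if kind == "title" else authors)[work_id].append((start, end))
    matches = []
    for work_id, title_spans in titles.items():
        for t in title_spans:
            for a in authors.get(work_id, []):
                gap = max(t[0], a[0]) - min(t[1], a[1])  # tokens between the spans
                if gap <= max_gap:
                    matches.append((work_id, t, a))
    return matches


works = {
    "w1": ("Attention Is All You Need", "Vaswani"),
    "w2": ("Politics", "Aristotle"),
}
trie = build_trie(works)
reference = "Week 3: Vaswani et al., Attention Is All You Need (2017)."
prose = "Aristotle wrote about ethics, rhetoric, and the politics of the Greek city."
print(candidate_matches(trie, reference))
print(candidate_matches(trie, prose))
```

Both strings produce a candidate here. The second is exactly the kind of incidental co-occurrence that keyword proximity alone cannot rule out.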

To do this accurately, we need to incorporate contextual information from the document. We apply a validation model that extracts features from the document contexts before, between, and after the raw title and author keyword sequences, and predicts whether the match is a legitimate text reference. The model is implemented in PyTorch (LSTMs over character and word embeddings), trained on ~12k hand-labeled examples, and reaches ~90% accuracy.
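As a rough illustration of the three context regions the validation model looks at, the sketch below slices a token list around a title span and an author span. The function name `split_contexts` and the `window` size are hypothetical, and the sketch covers only the slicing. The actual model encodes these contexts with LSTMs, which is not shown.

```python
def split_contexts(tokens, title_span, author_span, window=8):
    """Return the tokens before, between, and after a title/author match.

    Spans are (start, end) token offsets with an exclusive end. The window
    size is an assumed value, not one taken from the real pipeline.
    """
    first, second = sorted([title_span, author_span])
    last_end = max(first[1], second[1])
    return {
        "before": tokens[max(0, first[0] - window):first[0]],
        "between": tokens[first[1]:second[0]],
        "after": tokens[last_end:last_end + window],
    }


tokens = "week 3 vaswani et al attention is all you need 2017".split()
contexts = split_contexts(tokens, title_span=(5, 10), author_span=(2, 3), window=3)
print(contexts)
```

In a syllabus-style reference the "between" context tends to be short and formulaic ("et al", a comma, a year), whereas in running prose it is usually ordinary sentence material. That is the sort of signal a model over these contexts can pick up.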
