Graphing citations between books using RoBERTa and D3.js

Thiago Lira
8 min read · Apr 9, 2021

Have you ever thought about all the relations between the books you’ve read? Whether the authors of your favorite books have read one another, or even whether they have read the same people from before their time? Which books were read by basically everyone?

In this post I will share my pet project: fine-tuning RoBERTa to detect citations in free-form (i.e. not just academic citations) between my books, and generating a nice big directed graph with hundreds of books and their relations. My whole Kindle library and more!

You can check the graph of citations, made from my Kindle library and augmented with hundreds more books from Goodreads, here! (If possible, play with the graph on a desktop, because it sucks on mobile.)

Most of the code I’ll be going through in this post is available here!

D3.js force graph very elegantly displays the relations between my books.

Citations in free-form?

So, what am I calling citations in “free-form”? Well, any time an author mentions another book in the text. For example, from Neil Gaiman’s The View from the Cheap Seats:

“Children listened to them and enjoyed them, but children were not the primary audience, no more than they were the intended audience of Beowulf, or The Odyssey.”

Why citations in “free-form”? Because bibliographical/academic citations are absent from older books and even from some new ones. Depending on the type of book, the author doesn’t need to construct a giant list of citations with precise bibliographical information at the end. And with Deep Learning and enough data, we might very well build a model that works on both kinds of citations and be done with it (at the end of the day, it’s all text/data).

Quote by Will Durant — The Story of Philosophy

Why am I doing this?

I love to read books about books, e.g. Will Durant’s The Story of Philosophy and Italo Calvino’s Why Read the Classics, and I’m always curious to know which books were formative in the lives of my favorite authors (and, on a bigger scale, even in society as a whole).

The idea, then, was to make something that would receive the text of some non-fiction book as input and output all the books cited inside it. Ideally the output from Why Read the Classics would be [‘The Odyssey’, ‘Candide’, ‘100 Years of Solitude’, …] and from The Story of Philosophy [“Basically everything ever written by a Greek, French or German guy (or gal) with way too much time on their hands”].

The process then would be to:

  1. Get the text from all my books.
  2. Get metadata for all my books: authors, original publication date, etc. (*)
  3. Search every book for citations of every other book I have.
  4. Build a rad looking graph illustrating all these citations with the books categorized by publication date.
  5. EXTRA: Search for citations from my books to books I don’t own! (**)(***)

(*) This was a somewhat convoluted process. It is hard to get the original publication date because it is sometimes historical/uncertain, and there are dozens (or even hundreds!) of editions of many books. So I settled on importing all my books into Calibre and then using a Goodreads plugin to get the original publication date for each of them! Fun fact: it is pure pain and suffering to support very, very distant dates in code (like BC dates).
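To give a taste of the date pain: Python’s datetime stops at year 1 (datetime.MINYEAR), so BC dates can’t even be represented with it. A minimal workaround, not necessarily what my code does, is storing publication years as signed integers:

    import datetime

    # datetime.date(-375, 1, 1) raises ValueError, since datetime.MINYEAR == 1:
    # ~375 BC is simply not representable.
    assert datetime.MINYEAR == 1

    def publication_year(year: int, bc: bool = False) -> int:
        """Return a sortable signed year; BC years become negative."""
        return -year if bc else year

    books = {
        "The Republic": publication_year(375, bc=True),  # ~375 BC
        "The Prince": publication_year(1532),
    }
    print(sorted(books.items(), key=lambda kv: kv[1]))  # BC dates sort first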

(**) I did this by using a dataset scraped from Goodreads that has metadata for thousands of books, old and new. For my purposes I just filtered the dataset, removing books with no ratings/reviews and modern Fantasy/Fiction books that would pollute the analysis and probably aren’t cited many times (which doesn’t mean I don’t like Fantasy books, I LOVE Fantasy). A sketch of this filtering follows these notes.

(***) Unfortunately, these extra citations are just one-way. I don’t have the text from these books to check which books they cite in turn. So… this part of the graph can only be the target of citations from the books I have!
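Here is a hedged sketch of the filtering from (**), assuming pandas and hypothetical column names (ratings_count, text_reviews_count, genres, original_publication_year) plus an arbitrary 1950 cutoff for “modern”:

    import pandas as pd

    # Column names and the cutoff are assumptions, not the dataset's schema.
    df = pd.read_csv("goodreads_books.csv")

    # Drop books nobody has rated or reviewed.
    df = df[(df["ratings_count"] > 0) & (df["text_reviews_count"] > 0)]

    # Drop modern Fantasy/Fiction that would pollute the citation graph.
    modern_genre = df["genres"].str.contains("Fantasy|Fiction", na=False)
    df = df[~(modern_genre & (df["original_publication_year"] > 1950))]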

The naive solution (and why this is more complicated than it seems)

One simple solution is to build a list with a bunch of book names and just search for each title in the text of every book you own. This actually works pretty well for most books (technical considerations on how to do this efficiently come later). We start to have problems with shorter book titles like “The Prince” or “The Republic”, since these strings might very well appear in a book with absolutely no relation to that particular work by Machiavelli. See the following example:

"The Prince reads Marcus Aurelius' Meditations to relax." "Marcus Aurelius reads The Prince to relax."

We can’t always expect that quotations in this form will be, well, quoted, or in italics, or whatever. So how do we detect the string “The Prince” as a citation in the latter sentence but not in the former? There are many ways to go about this, and after iterating for a while I settled on an NLP technique called NER, or named entity recognition.

Creating an NER dataset for the task

NER models consume some text and return a label for each word, together with its location in the text.

The model has to be trained on a specific set of labels, using text annotated beforehand. In my case there is a single label, [BOOK]. The relevant information is the pair of string indices where the tagged text begins and ends. NER models work by associating a tag with each word (‘no tag’ is also a tag!) and assigning each word in the text a probability of belonging to each tag seen in the training data, e.g. [“The Prince”: (0.8 ‘BOOK’, 0.2 ‘NO_TAG’), “Marcus Aurelius”: (0.1 ‘BOOK’, 0.9 ‘NO_TAG’)].
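To make that concrete, here is a minimal sketch with made-up numbers (real model output is per sub-word token, but the idea is the same):

    # Made-up per-word tag probabilities; real output is per sub-word token.
    tokens = ["Marcus", "Aurelius", "reads", "The", "Prince", "to", "relax", "."]
    probs = [  # (P(BOOK), P(NO_TAG)) for each token
        (0.05, 0.95), (0.04, 0.96), (0.01, 0.99),
        (0.91, 0.09), (0.94, 0.06), (0.02, 0.98), (0.01, 0.99), (0.01, 0.99),
    ]
    tags = ["BOOK" if p_book > p_no else "NO_TAG" for p_book, p_no in probs]
    print(list(zip(tokens, tags)))
    # Adjacent BOOK tokens are then merged into a single span: "The Prince".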

Of course, I had to manually annotate a full dataset (a thousand examples!) so that the model could learn, from context, what is and what isn’t a book. It’s quite magical, really, when you see it working. Here are some examples from my dataset.

To create my dataset I searched all my books for some ambiguous book titles such as “Ulysses” (the character from The Odyssey or the book by James Joyce?), “The Prince” (the book by Machiavelli or literally just some prince being referenced?) and “The Republic” (Plato’s book or… you get the idea). Then I used Doccano to manually annotate each passage and mark whether it is a citation or not.
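For reference, a Doccano sequence-labeling export is JSONL, one record per passage, with character offsets for every span (exact field names vary between Doccano versions, so treat this as a sketch):

    import json

    # One exported record: offsets index into "text"; text[18:25] == "Ulysses".
    line = '{"text": "He kept a copy of Ulysses on his desk.", "labels": [[18, 25, "BOOK"]]}'
    record = json.loads(line)
    start, end, tag = record["labels"][0]
    print(record["text"][start:end], tag)  # -> Ulysses BOOK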

About Augmenting the Dataset

After some failed attempts to make my models work well on new validation data, I started to augment my training dataset by swapping book titles around to create new training examples. With some probability, the code swaps the citation in an example with another random book title from my dataset and stores the result as a new training example (while still keeping the old one in the dataset).
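A minimal sketch of the swap, assuming each example stores character offsets for its citation (the field names are hypothetical, not my exact schema):

    import random

    def augment(examples, all_titles, swap_prob=0.5):
        """Keep every original example; sometimes add a title-swapped copy."""
        augmented = list(examples)
        for ex in examples:
            if not ex["labels"] or random.random() > swap_prob:
                continue
            start, end, tag = ex["labels"][0]
            new_title = random.choice(all_titles)
            text = ex["text"][:start] + new_title + ex["text"][end:]
            augmented.append(
                {"text": text, "labels": [(start, start + len(new_title), tag)]}
            )
        return augmented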

Doing this dramatically improved my model’s performance on validation data. Before this augmentation, the model was overfitting: it would only label a book if that specific title appeared in many examples in my dataset. This was probably happening because (by Deep Learning standards) my ~1000-annotation dataset is quite small.

Fine-Tuning RoBERTa

Huggingface’s transformers library makes it child’s play to download a SOTA model and then fine-tune it on a specific task. You can even load a model with new layers initialized specifically to be fine-tuned on a new task; for this project’s NER task, the transformers library provides a simple RobertaForTokenClassification class.
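Loading it takes two lines (the tag set here is just NO_TAG and BOOK; the exact labels depend on your annotation scheme):

    from transformers import RobertaForTokenClassification, RobertaTokenizerFast

    # Two labels: NO_TAG and BOOK. The token-classification head is
    # freshly initialized and learned during fine-tuning.
    model = RobertaForTokenClassification.from_pretrained("roberta-base", num_labels=2)
    tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")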

The hardest part was converting my annotations from Doccano’s format to something PyTorch would understand, and I pretty much copied the code from here for that. And, of course, playing with the hyper-parameters until the training yields good results. There is a lot going on in this part, so I’m just gonna link my notebook for fine-tuning RoBERTa here.

Finally, here are some test strings and the outputs from my model after fine-tuning! Look at these beautiful citations being automatically detected.
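For reference, inference looks roughly like this (reusing the tokenizer and a model object like the one above, but loaded from the fine-tuned checkpoint; assuming label id 1 means BOOK):

    import torch

    model.eval()
    sentence = "Marcus Aurelius reads The Prince to relax."
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits    # shape: (1, seq_len, num_labels)
    pred = logits.argmax(dim=-1)[0]        # one tag id per sub-word token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    print([t for t, tag in zip(tokens, pred.tolist()) if tag == 1])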

Gluing all this together

So far I’ve written about collecting the dataset, annotating and preprocessing it, and finally fine-tuning RoBERTa.

Now, how to actually process the text from all my books? The “algorithm” I wrote goes something like this:

  1. Load all books as lists of strings in memory.
  2. For each line, use a regex (+) to detect any book titles.
  3. If the detected title is too short (e.g. fewer than 3 words) (*), run the whole line through RoBERTa and see if the NER model “validates” the title as a book title.
  4. For each match, pull its metadata, create a new node on the graph (if needed) and add a link between the book whose text we are currently reading and the book that was cited.
  5. Convert this graph to D3.js format, add some metadata for each book and save it.

(+) I ended up using a giant pre-compiled regex of the form r”Book1|Book2|…” to look for matches, because it was the fastest method I found. More on this in this StackOverflow answer.
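A minimal version of that pre-compiled regex, with titles escaped, sorted longest-first so overlapping titles match correctly, and wrapped in word boundaries:

    import re

    titles = ["The Prince", "The Republic", "Meditations", "The Odyssey"]

    # Escape titles, sort longest-first (Python tries alternatives left to
    # right), and add word boundaries so "The Princess" doesn't match.
    pattern = re.compile(
        r"\b(?:"
        + "|".join(re.escape(t) for t in sorted(titles, key=len, reverse=True))
        + r")\b"
    )

    print(pattern.findall("Marcus Aurelius reads The Prince to relax."))
    # -> ['The Prince']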

(*) I chose this heuristic because I’m assuming that if a longer title like “The Hound of the Baskervilles” or “The Ballad of Reading Gaol” shows up in some book’s text, it has a VERY high probability of being a citation, so I don’t need to waste CPU cycles confirming it with RoBERTa.

How to take less than 8 hours to process all my books with RoBERTa

By using multiprocessing! My book-processing function receives as arguments the list of lines from a book, its metadata, and my model and tokenizer to parse the text. By creating a Pool of workers we can give each of them one book at a time to process asynchronously; when a worker finishes a book it starts processing another from the list. Just don’t overdo it, since each RoBERTa instance takes ~1GB of RAM!
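Here is a stripped-down sketch of that pool, with the model loading and the citation search stubbed out. One deviation from my actual code: instead of passing the model as an argument, this sketch loads a (stand-in) model once per worker via the Pool initializer, which is one way to keep each worker near a single ~1GB copy.

    from multiprocessing import Pool

    # Stand-ins for the real model/tokenizer; each worker holds its own copy.
    _model, _tokenizer = None, None

    def init_worker():
        # Runs once in every worker process; the real code would load the
        # fine-tuned RoBERTa and its tokenizer here (~1GB per worker).
        global _model, _tokenizer
        _model, _tokenizer = "roberta-stub", "tokenizer-stub"

    def process_book(book):
        lines, title = book
        # The real function runs the regex pass plus the RoBERTa check;
        # here a plain substring test stands in for it.
        citations = [line for line in lines if "The Prince" in line]
        return title, citations

    if __name__ == "__main__":
        books = [(["Marcus Aurelius reads The Prince to relax."], "Some Essay")]
        with Pool(processes=4, initializer=init_worker) as pool:
            # imap_unordered hands a new book to whichever worker frees up first.
            for title, citations in pool.imap_unordered(process_book, books):
                print(title, citations)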

Graph-ing everything up!

The graph is built with networkx on the Python side of the code, just so I could plot it and run the PageRank algorithm to see which books are the most “linked” in my data. Turns out it’s The Prince!

[(‘The Prince’, 0.01081877301303925),
(‘Hamlet’, 0.01061937268317937),
(‘The Divine Comedy’, 0.007740738803800382),
(‘The Art of War’, 0.006072207782683749),
(‘Romeo and Juliet’, 0.0051880591489493295),
(‘Phaedrus’, 0.004890520471208026),
(‘Poetics’, 0.003865239838704173),
(‘Alice in Wonderland’, 0.0037039373900383415)]
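That ranking takes only a few networkx calls. A toy version, where each edge points from the citing book to the cited one:

    import networkx as nx

    # Toy citation graph: edges go from the citing book to the cited book.
    G = nx.DiGraph()
    G.add_edges_from([
        ("The Story of Philosophy", "The Prince"),
        ("The Story of Philosophy", "The Republic"),
        ("Why Read the Classics", "The Odyssey"),
        ("Some Essay", "The Prince"),
    ])

    # PageRank rewards books that are cited by well-cited books.
    ranks = nx.pagerank(G)
    print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:3])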

We just need some simple code to create a JSON with all these links and nodes and load it into D3.js. Here is the format of each node and each link in the final graph. One thing to note is the metadata we can save on each node to use later on the JavaScript side of the code.
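networkx can also serialize straight into the node-link JSON that D3.js force layouts consume; a minimal sketch, with a hypothetical year attribute standing in for the real per-node metadata:

    import json
    import networkx as nx
    from networkx.readwrite import json_graph

    G = nx.DiGraph()
    G.add_edge("Some Essay", "The Prince")
    G.nodes["The Prince"]["year"] = 1532  # metadata read later on the JS side

    data = json_graph.node_link_data(G)   # {"nodes": [...], "links": [...]}
    with open("graph.json", "w") as f:
        json.dump(data, f)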

Again, you can see the end result here!

Some closing remarks

Just some notes that might be useful to someone working with a similar problem:

  1. I never dramatically improved performance by toying with RoBERTa’s fine-tuning hyper-parameters. Once you’ve chosen something that converges and are still unsatisfied, I’ll wager that getting more/better data, or augmenting it like I did, will give you more substantial gains in performance.
  2. Something that wasn’t obvious to me: when you are using multiprocessing, you should see close to 100% CPU use if your code is running as fast as it possibly can (in retrospect it sounds kind of obvious now). If it is not close to 100%, this probably means your code is doing I/O, or some processes are stuck (or deadlocked!) and can’t proceed with the computation, or you could spawn more processes to speed everything up.
