Building a co-occurrence matrix with d3 to analyze overlapping topics in dissertations
Thanks Min An!
The goal of my master’s degree research is to spark new collaboration opportunities between researchers from different fields. But before doing that I need to take a step back and see if there is any collaboration happening already. Cool, but how do I start doing this?
When authors write the abstracts for their work they also add some keywords to it. My first guess was that with these keywords I could start to see how theses from different knowledge areas interact with each other.
So I got all the 2016 dissertations from my university and built a matrix to visualize the works with overlapping keywords.
I had two main jobs: get the data and build the d3 visualization.
1️⃣ — Getting the data
Every time a student gets a degree she needs to send the final dissertation to this site: http://repositorio.ufpe.br/. To get the keywords from these dissertations I had to:
- Get the list of all dissertations
- Download the PDFs
- Extract text and get the keywords of each dissertation
🎓 1 — Get the list of all dissertations
Thanks to Scrapy this part was much easier than I thought it would be. To extract data from websites with the framework we need to write a Spider.
“Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites).” Source
In the Spider class we define the base URL, how Scrapy should handle the pagination and also define the data that will be extracted.
I got this metadata:
📁 2 — Download the PDFs
I built a Python script to download the PDFs using Requests, “an elegant and simple HTTP library for Python, built for human beings”.
🔡 3 — Extract text and get the keywords of each dissertation
For this task I used PyPDF2 and spaCy. Usually the authors put the keywords below the abstract, so I go through the pages of the PDF and, if a page contains words like “keywords” or “palavras-chave” (Portuguese for “keywords”), I save its contents in a txt file.
I use spaCy to tokenize the text as the PDF extraction is not perfect.
After this process I have the text of every page that has the word “keywords” in it. To get only the keywords themselves I go through the words and keep everything that comes after “keywords”, because usually they are the last thing on the page. Then I finally load the data into a dataframe to do some transformations and save it to a csv file.
Are keywords good for what I want to do?
One thing I noticed is that the keywords can be very generic. And if my goal is to check how topics from different research areas interact, I should analyze the abstracts, because there is a better chance of finding points of interaction there.
Let’s take a real example to make it clearer. For the dissertation “Cognitive and non-cognitive aspects in the adaptation of (im)migrants college students” the keywords are:
- Academic Experiences
- Executive Functions
- University Students
But while reading the abstract I found this:
“To achieve the goals the following analysis were performed: descriptive statistics, T-test, analysis of variance (ANOVA), correlational exploratory bivariate analyzes, exploratory factorial analysis and multiple linear regressions.”
This thesis is from the Graduate Program in Cognitive Psychology, but it has topics from Statistics and maybe also from Computer Science, right? And the keywords cannot show that. So one of my next steps will be to build a deep learning model with text from the abstracts.
But we need to start somewhere, so I did something simple: I took the top 10 keywords from each research field and assumed that these words can be used to characterize the field.
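Here is a minimal sketch of that counting step. The original pipeline ran in Python, so this JavaScript version is only an illustration that matches the visualization code later on. The record shape (`program`, `keywords`) and the function name `topKeywordsByProgram` are assumptions, not the project's actual code.

```javascript
// Hypothetical input: one record per dissertation.
const dissertations = [
  { program: "Cognitive Psychology", keywords: ["Executive Functions", "University Students"] },
  { program: "Cognitive Psychology", keywords: ["executive functions", "Memory"] },
  { program: "Statistics", keywords: ["Regression", "ANOVA"] },
];

// Count how often each keyword appears within a program and keep the
// `limit` most frequent ones.
function topKeywordsByProgram(records, limit = 10) {
  const counts = new Map(); // program -> Map(keyword -> count)
  for (const { program, keywords } of records) {
    if (!counts.has(program)) counts.set(program, new Map());
    const perProgram = counts.get(program);
    // Normalize case and whitespace; the Set makes sure a keyword is
    // counted at most once per dissertation.
    const unique = new Set(keywords.map((k) => k.trim().toLowerCase()));
    for (const keyword of unique) {
      perProgram.set(keyword, (perProgram.get(keyword) || 0) + 1);
    }
  }
  const result = {};
  for (const [program, perProgram] of counts) {
    result[program] = [...perProgram.entries()]
      // Most frequent first; ties are broken alphabetically.
      .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
      .slice(0, limit)
      .map(([keyword]) => keyword);
  }
  return result;
}

const top = topKeywordsByProgram(dissertations, 10);
console.log(top["Cognitive Psychology"]);
```

Lowercasing before counting is a design choice of this sketch: it merges “Executive Functions” and “executive functions” into one keyword.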
First do it, then do it right, then do it better — Addy Osmani
Ok, now we are ready to start building the visualization.
2️⃣ — Building the visualization
And now we can dive deep into the d3 magic.
1 — Transform the data
The data looks like this:
We will build the matrix row by row. First we read the data and create an array for every row.
Since we’ll draw the squares row by row, we need to go through each row and, inside it, through each column.
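The row-by-row idea can be sketched like this. The data sample from the project is not reproduced here, so the input shape (`programs`, and `links` with `source`, `target` and `value`) and the name `buildMatrix` are assumptions made for illustration.

```javascript
// Hypothetical input: a list of programs plus the links between them.
const programs = [
  { name: "Cognitive Psychology", area: "Humanities" },
  { name: "Statistics", area: "Exact Sciences" },
  { name: "Computer Science", area: "Exact Sciences" },
];
// Each link says: program `source` and program `target` share `value` matches.
const links = [
  { source: 0, target: 1, value: 3 },
  { source: 1, target: 2, value: 5 },
];

function buildMatrix(nodes, edges) {
  // One array per row; each cell knows its column (x), row (y) and value (z).
  const matrix = nodes.map((_row, y) =>
    nodes.map((_col, x) => ({ x, y, z: 0 }))
  );
  for (const { source, target, value } of edges) {
    // Co-occurrence is symmetric, so fill both halves of the matrix.
    matrix[source][target].z += value;
    matrix[target][source].z += value;
  }
  return matrix;
}

const matrix = buildMatrix(programs, links);
// Walk the matrix row by row, then column by column inside each row.
for (const row of matrix) {
  console.log(row.map((cell) => cell.z).join(" "));
}
```

Keeping `x` and `y` inside every cell means a square can later find its own position without knowing which row it was drawn in.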
2 — Place the squares and add color
To place each square in the right place we’ll use the d3.scaleBand() scale. This scale takes an array of values as its domain and assigns each item the start coordinate of a band. All bands have the same width (the bandwidth) and, together with any padding, they fill the total width of the range.
How d3.scaleBand() works
This scale is awesome because we don’t need to do the computations by hand.
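To see what the scale saves us from, here is a hand-rolled sketch of the arithmetic for the simplest case only (no padding, no rounding). The function name `bandLayout` is made up for this example; in the real visualization d3.scaleBand() does this work.

```javascript
// Hand-rolled version of the band arithmetic, simplest case only:
// no padding and no rounding.
function bandLayout(domain, rangeStart, rangeEnd) {
  const step = (rangeEnd - rangeStart) / Math.max(1, domain.length);
  const position = new Map(
    domain.map((item, i) => [item, rangeStart + i * step])
  );
  return {
    bandwidth: step,                 // width shared by every band
    x: (item) => position.get(item), // start coordinate of an item's band
  };
}

const layout = bandLayout([0, 1, 2, 3], 0, 400);
console.log(layout.bandwidth);            // 100
console.log([0, 1, 2, 3].map(layout.x));  // [ 0, 100, 200, 300 ]

// With d3 loaded, the equivalent scale would be:
//   const x = d3.scaleBand().domain([0, 1, 2, 3]).range([0, 400]);
//   x(2);          // 200
//   x.bandwidth(); // 100
```

The last band starts at 300 and is 100 wide, so it ends exactly at 400, the end of the range.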
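To add a color to each square you can pass the d3.schemeCategory20 color scheme (an array of 20 colors that ships with d3 v4 and was removed in v5) to a d3.scaleOrdinal() scale.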
That’s a lot of new code, so let’s take a look:
We also create a scale to set the opacity of each square. In our case the opacity value is the number of dissertations that have at least one keyword match. One keyword match means that a keyword appears in both dissertations.
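A minimal sketch of how such a value could be computed and turned into an opacity. It reads the rule as “count the pairs of dissertations, one from each program, that share at least one keyword”, which is one possible interpretation. The sample data and the names `hasMatch`, `countMatches` and `opacityFor` are assumptions.

```javascript
// Hypothetical keyword lists: one inner array per dissertation.
const psychology = [["statistics", "memory"], ["attention"]];
const statistics = [["statistics", "regression"], ["memory", "anova"]];

// True when two dissertations share at least one keyword.
function hasMatch(keywordsA, keywordsB) {
  const seen = new Set(keywordsA);
  return keywordsB.some((keyword) => seen.has(keyword));
}

// Count the pairs of dissertations (one from each program) with a match.
function countMatches(programA, programB) {
  let total = 0;
  for (const a of programA) {
    for (const b of programB) {
      if (hasMatch(a, b)) total += 1;
    }
  }
  return total;
}

// Map a count in [0, maxCount] linearly onto an opacity in [0, 1],
// clamping anything outside that interval.
function opacityFor(count, maxCount) {
  if (maxCount <= 0) return 0;
  return Math.min(1, Math.max(0, count / maxCount));
}

const z = countMatches(psychology, statistics);
console.log(z, opacityFor(z, 4)); // 2 0.5
```

With d3 loaded, `d3.scaleLinear().domain([0, maxCount]).range([0, 1]).clamp(true)` gives the same linear, clamped mapping as `opacityFor`.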
3 — Add columns
We do the same thing we did with rows but now we rotate them:
4 — Add text labels
To add the labels we use the row and column selections and append an SVG text element to each of them.
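Steps 3 and 4 can be sketched together. This is an illustration and not the project's code: it recreates the row groups so it can stand alone, and it assumes `svg` is a d3 selection, `matrix` is an array of rows, `names[i]` is the label of program i, and `x` is a d3.scaleBand() whose domain holds the row indices. The function name `drawRowsAndColumns` and the class names are made up.

```javascript
// Pure helpers, so the geometry can be checked without a browser.
const rowTransform = (offset) => `translate(0,${offset})`;
// A column is a row group moved along x and rotated a quarter turn.
const columnTransform = (offset) => `translate(${offset})rotate(-90)`;

function drawRowsAndColumns(svg, matrix, names, x) {
  const rows = svg.selectAll(".row")
    .data(matrix)
    .enter().append("g")
    .attr("class", "row")
    .attr("transform", (d, i) => rowTransform(x(i)));

  // Row labels sit just left of the matrix, vertically centered in the band.
  rows.append("text")
    .attr("x", -6)
    .attr("y", x.bandwidth() / 2)
    .attr("dy", ".32em")
    .attr("text-anchor", "end")
    .text((d, i) => names[i]);

  const columns = svg.selectAll(".column")
    .data(matrix)
    .enter().append("g")
    .attr("class", "column")
    .attr("transform", (d, i) => columnTransform(x(i)));

  // Inside the rotated group the text runs upward, above the matrix.
  columns.append("text")
    .attr("x", 6)
    .attr("y", x.bandwidth() / 2)
    .attr("dy", ".32em")
    .attr("text-anchor", "start")
    .text((d, i) => names[i]);

  return { rows, columns };
}

console.log(columnTransform(100)); // translate(100)rotate(-90)
```

Because the labels are drawn at negative coordinates, the `svg` group passed in needs a top and left margin large enough to show them.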
5 — Add sorting functionality
Reorganizing the matrix rows can help the analysis. In our case we can order them alphabetically, by the number of connections of each program (i.e. research field) or by clusters for each knowledge area. This is simple to do by changing the domain of the d3.scaleBand() scale.
Each item of the domain array corresponds to a program id, so if we sort this array in different ways we get different coordinates for each matrix square.
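A minimal sketch of the three orderings. The sample programs and the field names `count` and `area` are assumptions made for this example.

```javascript
// Hypothetical programs; `count` is the number of connections and `area`
// the knowledge area used for clustering.
const programs = [
  { name: "Statistics", count: 7, area: "Exact Sciences" },
  { name: "Cognitive Psychology", count: 3, area: "Humanities" },
  { name: "Computer Science", count: 9, area: "Exact Sciences" },
];

const indices = programs.map((_program, i) => i);

// Each ordering is an array of program indices, usable as a scaleBand domain.
const orders = {
  name: [...indices].sort((a, b) =>
    programs[a].name.localeCompare(programs[b].name)),
  // Most connected programs first.
  count: [...indices].sort((a, b) => programs[b].count - programs[a].count),
  // Group by area, then alphabetically inside each area.
  area: [...indices].sort((a, b) =>
    programs[a].area.localeCompare(programs[b].area) ||
    programs[a].name.localeCompare(programs[b].name)),
};

console.log(orders.name);  // [ 1, 2, 0 ]
console.log(orders.count); // [ 2, 0, 1 ]
console.log(orders.area);  // [ 2, 0, 1 ]
```

Calling `x.domain(orders.count)` on the band scale then gives every row and column a new coordinate without touching the data itself.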
Then we add the functionality to the html select tag:
To add a nice effect we also add a transition to animate the reordering.
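The select handler and the animation can be sketched as follows. This is an illustration under several assumptions: d3 (v4 or later) is loaded globally, the page has a `<select id="order">` whose option values are keys of `orders`, and the groups and squares were drawn with the classes "row", "column" and "cell". The names `attachSorting` and `staggerDelay` are made up.

```javascript
// Pure helper: elements further along the axis start moving a bit later.
const staggerDelay = (position, factor = 4) => position * factor;

function attachSorting(svg, x, orders) {
  d3.select("#order").on("change", function () {
    // `this` is the <select> element; its value picks one ordering.
    x.domain(orders[this.value]);

    const t = svg.transition().duration(2500);

    // Move each row to its new y position, then each cell to its new x.
    t.selectAll(".row")
      .delay((d, i) => staggerDelay(x(i)))
      .attr("transform", (d, i) => `translate(0,${x(i)})`)
      .selectAll(".cell")
      .delay((d) => staggerDelay(x(d.x)))
      .attr("x", (d) => x(d.x));

    // Columns follow the same scale, keeping their rotation.
    t.selectAll(".column")
      .delay((d, i) => staggerDelay(x(i)))
      .attr("transform", (d, i) => `translate(${x(i)})rotate(-90)`);
  });
}

console.log(staggerDelay(100)); // 400
```

The handler only changes the domain of the scale; the transition then reads the new coordinates from the same `x` that was used to draw the matrix.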
The final result
This is just the beginning. Now I need to work on how to display the keywords and the dissertations. And of course I’ll come back here to share them with you haha o/ \o
Thanks for reading! 😁