Building a co-occurrence matrix with d3 to analyze overlapping topics in dissertations

Thanks Min An!

The goal of my master’s degree research is to spark new collaboration opportunities between researchers from different fields. But before doing that I need to take a step back and see if there is any collaboration happening already. Cool, but how do I start doing this?

When authors write the abstract for their work they also add some keywords to it. My first guess was that these keywords would let me see how theses from different knowledge areas interact with each other.

So I got all the dissertations from my University (2016) and built a matrix to visualize the works with overlapping keywords.

I had two main jobs: get the data and build the d3 visualization.

1️⃣ — Getting the data

Every time a student gets a degree she needs to send the final dissertation to this site: http://repositorio.ufpe.br/. To get the keywords from these dissertations I had to:

  1. Get the list of all dissertations
  2. Download the PDFs
  3. Extract text and get the keywords of each dissertation

🎓 1 — Get the list of all dissertations

Thanks to Scrapy this part was much easier than I thought it would be. To extract data from websites with the framework we need to write a Spider.

“Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites).” Source

In the Spider class we define the base URL, how Scrapy should handle the pagination and also define the data that will be extracted.

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'myspider'

        # All dissertations by issued date
        start_urls = ['http://www.repositorio.ufpe.br/handle/123456789/50/browse?type=dateissued']

        def parse(self, response):
            # Follow links to dissertation pages
            for href in response.css('.artifact-title > a::attr(href)'):
                yield response.follow('http://www.repositorio.ufpe.br' + href.extract(), self.parse_dissertation)

            # Follow pagination links
            for href in response.css('.next-page-link::attr(href)'):
                yield response.follow('http://www.repositorio.ufpe.br/handle/123456789/50/' + href.extract(), self.parse)

        def parse_dissertation(self, response):
            def extract_with_css(query):
                # First match, stripped, or None when the page has no such element
                value = response.css(query).extract_first()
                return value.strip() if value else None

            def extract_date(query):
                # The issue date is the fifth <span> matched by the query
                return response.css(query)[4].extract().strip()

            yield {
                'autor': extract_with_css('.simple-item-view-authors > span::text'),
                'titulo': extract_with_css('.item-summary-view-metadata > h1::text'),
                'data': extract_date(".simple-item-view-other > span::text"),
                'abstract': extract_with_css(".simple-item-view-description > div::text"),
                'area': extract_with_css(".ds-referenceSet-list > li > a::text"),
                'pdf': extract_with_css('.file-wrapper > div > a::attr(href)'),
            }

I got this metadata (the spider's field names are in Portuguese; here they are in English):

    {
    'author',
    'title',
    'date',
    'abstract',
    'research_field',
    'url_to_pdf'
    }

📁 2 — Download the PDFs

I built a Python script to download the PDFs using Requests, “an elegant and simple HTTP library for Python, built for human beings”.

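    import requests

    # new_file_name and link come from the metadata collected by the spider
    with open(new_file_name, 'wb') as pdf:
        temp = requests.get("http://repositorio.ufpe.br" + link, stream=True)
        # Write the response to disk in 512-byte blocks
        for block in temp.iter_content(512):
            if not block:
                break
            pdf.write(block)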

🔡 3 — Extract text and get the keywords of each dissertation

For this task I used PyPDF2 and spaCy. Authors usually put the keywords below the abstract, so I go through the pages of each PDF and, if I find words like “keywords” or “palavras-chave” (Portuguese for “keywords”), I save the contents of that page in a txt file.

I use spaCy to tokenize the text as the PDF extraction is not perfect.

After this process I have the text of every page that has the word “keywords” in it. To get only the keywords themselves I go through the words and take everything that comes after “keywords”, because the keywords are usually the last thing on the page. Then I load the data into a dataframe to do some transformations and save it to a csv file.
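Here is a minimal sketch of that last step, assuming the page text has already been extracted (the PyPDF2 and spaCy parts are left out). The function name `extract_keywords`, the marker pattern and the separators are illustrative assumptions, not the original script:

```python
import re

# Markers that introduce the keyword list (English and Portuguese).
MARKER_RE = re.compile(r"(?:palavras[- ]chave|key[- ]?words)\s*:?", re.IGNORECASE)


def extract_keywords(page_text):
    """Return the keywords that follow the last marker found on a page."""
    matches = list(MARKER_RE.finditer(page_text))
    if not matches:
        return []
    # Keywords are assumed to be the last thing on the page.
    tail = page_text[matches[-1].end():]
    # Assumption: keywords are separated by ".", ";" or ",".
    parts = re.split(r"[.;,]+", tail)
    # Collapse line breaks and repeated spaces inside each keyword.
    cleaned = (" ".join(part.split()).lower() for part in parts)
    return [keyword for keyword in cleaned if keyword]


page = "Abstract\nSome text about the study.\nKeywords: Adaptation. Migration; Higher\neducation."
print(extract_keywords(page))
```

Real PDFs are messier than this sample, so the separators and markers would need tuning against the actual files.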

Are keywords good for what I want to do?

One thing I noticed is that the keywords can be very generic. If my goal is to check how topics from different research areas interact, I should analyze the abstracts, because there is a better chance of finding points of interaction there.

Let’s take a real example to make this clearer. For the dissertation “Cognitive and non-cognitive aspects in the adaptation of (im)migrants college students” the keywords are:

But while reading the abstract I found this:

“To achieve the goals the following analysis were performed: descriptive statistics, T-test, analysis of variance (ANOVA), correlational exploratory bivariate analyzes, exploratory factorial analysis and multiple linear regressions.”

This thesis is from the Graduate Program in Cognitive Psychology, but it has topics from Statistics and maybe also from Computer Science, right? And the keywords cannot show that. So one of my next steps will be to build a deep learning model with text from the abstracts.

But we need to start somewhere, so I did something simple: I took the top 10 keywords from each research field and assumed that these words can be used to characterize the field.
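Picking the top keywords per field can be sketched with a `Counter`. This assumes the data is available as (research field, keywords) pairs, one per dissertation. The function name `top_keywords_by_field` and the sample records are made up for illustration:

```python
from collections import Counter, defaultdict


def top_keywords_by_field(records, n=10):
    """Map each research field to its n most frequent keywords."""
    counts = defaultdict(Counter)
    for field, keywords in records:
        # Count each keyword once per dissertation.
        counts[field].update(set(keywords))
    return {
        # Sort by frequency (highest first), then alphabetically to break ties.
        field: [kw for kw, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for field, counter in counts.items()
    }


# Made-up records: (research field, keywords of one dissertation).
records = [
    ("PSICOLOGIA COGNITIVA", ["adaptation", "migration"]),
    ("PSICOLOGIA COGNITIVA", ["adaptation", "learning"]),
    ("ANTROPOLOGIA", ["ritual", "culture"]),
]
print(top_keywords_by_field(records, n=2))
```

Breaking ties alphabetically is an arbitrary choice that keeps the result stable between runs.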

“First do it, then do it right, then do it better” (Addy Osmani)

Ok, now we are ready to start building the visualization.

2️⃣ — Building the visualization

We’ll use Mike Bostock’s Les Misérables Co-occurrence Matrix as our “template”. Let’s start by creating the svg and adding a rect as its background:

    var margin = {
            top: 285,
            right: 0,
            bottom: 10,
            left: 285
        },
        width = 700,
        height = 700;
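    // Note: "graph" matches a <graph> element; use "#graph" if the container is an element with id="graph"
    var svg = d3.select("graph").append("svg").attr("width", width).attr("height", height);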

    svg.append("rect")
        .attr("class", "background")
        .attr("width", width - margin.right)
        .attr("height", height - margin.top)
        .attr("transform", "translate(" + margin.right + "," + margin.top + ")");

    svg.append("rect")
        .attr("class", "background")
        .attr("width", width)
        .attr("height", height);

And now we can dive deep into the d3 magic.

1 — Transform the data

The data looks like this:

    {
        "nodes": [{
                "group": "humanas",
                "index": 0,
                "name": "ADMINISTRAÇÃO"
            },
            {
                "group": "humanas",
                "index": 1,
                "name": "ANTROPOLOGIA"
            },
            [...]
        ],
        "links": [{
                "source": 0,
                "target": 0,
                "value": 0.0
            }, {
                "source": 0,
                "target": 1,
                "value": 2.0
            },
            [...]
        ]
    }

We will build the matrix row by row. First we read the data and create an array for every row.

    d3.json("data/data.json", function(data) {
        var matrix = [];
        var nodes = data.nodes;
        var total_items = nodes.length;

        // Create rows for the matrix
        nodes.forEach(function(node) {
            node.count = 0;
            node.group = groupToInt(node.group);

            matrix[node.index] = d3.range(total_items).map(item_index => {
                return {
                    x: item_index,
                    y: node.index,
                    z: 0
                };
            });
        });

        // Fill matrix with data from links and count how many times each item appears
        data.links.forEach(function(link) {
            matrix[link.source][link.target].z += link.value;
            matrix[link.target][link.source].z += link.value;
            nodes[link.source].count += link.value;
            nodes[link.target].count += link.value;
        });
    });

As we’ll build the squares row by row, we need to go through each row and then through each column of that row.

2 — Place the squares and add color

To place each square in the right place we’ll use the d3.scaleBand() scale. This scale takes an array of values (the domain) and assigns a coordinate to each array item. With no padding set, the bandwidth is the total width divided by the number of items and each item starts one bandwidth after the previous one, so the bands fill the whole width.

How d3.scaleBand() works

This scale is awesome because we don’t need to make the computations by hand.
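To make the arithmetic concrete, here is a small sketch of what a band scale without padding computes. It is written in Python like the data scripts, it is not d3's implementation, and `band_scale` is a hypothetical helper:

```python
def band_scale(domain, range_start, range_end):
    """Return (positions, bandwidth) for a band scale with no padding."""
    bandwidth = (range_end - range_start) / max(1, len(domain))
    # Each item starts one bandwidth after the previous one.
    positions = {item: range_start + i * bandwidth for i, item in enumerate(domain)}
    return positions, bandwidth


positions, bandwidth = band_scale([0, 1, 2, 3], 0, 700)
print(bandwidth)       # 175.0
print(positions[2])    # 350.0

# Changing the order of the domain moves every item to a new coordinate.
positions, _ = band_scale([2, 0, 3, 1], 0, 700)
print(positions[2])    # 0.0
```

The second call shows why reordering the domain is enough to rearrange the whole matrix.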

To add a color to each square you can use an ordinal scale built from the d3.schemeCategory20 color scheme (available in d3 v4).

That’s a lot of new code, so let’s take a look:

    d3.json("data/data.json", function(data) {

        [...] //transform the data

        var matrixScale = d3.scaleBand().range([0, width]).domain(d3.range(total_items));
        var opacityScale = d3.scaleLinear().domain([0, 10]).range([0.3, 1.0]).clamp(true);
        var colorScale = d3.scaleOrdinal(d3.schemeCategory20);

        // Draw each row (translating the y coordinate)
        var rows = svg.selectAll(".row")
            .data(matrix)
            .enter().append("g")
            .attr("class", "row")
            .attr("transform", (d, i) => {
                return "translate(0," + matrixScale(i) + ")";
            });

        var squares = rows.selectAll(".cell")
            .data(d => d.filter(item => item.z > 0))
            .enter().append("rect")
            .attr("class", "cell")
            .attr("x", d => matrixScale(d.x))
            .attr("width", matrixScale.bandwidth())
            .attr("height", matrixScale.bandwidth())
            .style("fill-opacity", d => opacityScale(d.z))
            .style("fill", d => {
                return nodes[d.x].group == nodes[d.y].group ? colorScale(nodes[d.x].group) : "grey";
            })
            .on("mouseover", mouseover)
            .on("mouseout", mouseout);
    });

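We also create a scale to set the opacity of each square. In our case the value behind the opacity is the number of dissertations that have at least one keyword match, where a keyword match means that the same keyword appears in both dissertations. The scale maps values from 0 to 10 onto opacities from 0.3 to 1.0, and clamp(true) keeps any value above 10 at full opacity.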
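The link values themselves are computed before the JSON file is written. Here is one possible reading of the description above, sketched in Python: for every pair of research fields, count the pairs of dissertations (one from each field) that share at least one keyword. The function `build_graph`, the sample records and the choice to skip pairs inside the same field are assumptions for illustration, and the real weighting may differ:

```python
import json
from itertools import combinations


def build_graph(dissertations, groups):
    """Build the nodes/links structure from (field, keywords) records."""
    fields = sorted({field for field, _ in dissertations})
    index = {field: i for i, field in enumerate(fields)}
    nodes = [{"group": groups[f], "index": index[f], "name": f} for f in fields]

    values = {}
    for (field_a, kws_a), (field_b, kws_b) in combinations(dissertations, 2):
        # Assumption: pairs inside the same field are skipped in this sketch.
        if field_a == field_b:
            continue
        # A match means at least one keyword appears in both dissertations.
        if set(kws_a) & set(kws_b):
            pair = tuple(sorted((index[field_a], index[field_b])))
            values[pair] = values.get(pair, 0) + 1

    links = [{"source": s, "target": t, "value": float(v)}
             for (s, t), v in sorted(values.items())]
    return {"nodes": nodes, "links": links}


# Made-up records: (research field, keywords of one dissertation).
dissertations = [
    ("PSICOLOGIA COGNITIVA", ["adaptation", "statistics"]),
    ("ANTROPOLOGIA", ["culture", "statistics"]),
    ("ANTROPOLOGIA", ["ritual"]),
]
groups = {"PSICOLOGIA COGNITIVA": "humanas", "ANTROPOLOGIA": "humanas"}

graph = build_graph(dissertations, groups)
print(json.dumps(graph, indent=2))
```

Each pair of fields appears only once in `links`, which fits the d3 code above because it fills both `matrix[source][target]` and `matrix[target][source]`.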

3 — Add columns

We do the same thing we did with rows but now we rotate them:

    d3.json("data/data.json", function(data) {

        [...] //transform the data

        [...] //place the squares and add color

        var columns = svg.selectAll(".column")
            .data(matrix)
            .enter().append("g")
            .attr("class", "column")
            .attr("transform", (d, i) => {
                return "translate(" + matrixScale(i) + ")rotate(-90)";
            });
    });

4 — Add text labels

To add the labels we use the rows and columns selections and append an svg text element to each one.

    d3.json("data/data.json", function(data) {

        [...] //transform the data

        [...] //place the squares and add color

        [...] //add columns

        // Row labels sit to the left of the matrix, right-aligned
        rows.append("text")
            .attr("class", "label")
            .attr("x", -5)
            .attr("y", matrixScale.bandwidth() / 2)
            .attr("dy", ".32em")
            .attr("text-anchor", "end")
            .text((d, i) => capitalize_Words(nodes[i].name));

        // Column labels (the original also set "y" to 100 first, but this value overrode it)
        columns.append("text")
            .attr("class", "label")
            .attr("y", matrixScale.bandwidth() / 2)
            .attr("dy", ".32em")
            .attr("text-anchor", "start")
            .text((d, i) => capitalize_Words(nodes[i].name));
    });

5 — Add sorting functionality

Reordering the matrix rows can help the analysis. In our case we can order them alphabetically, by the number of connections of each program (i.e. research field), or by clusters for each knowledge area. This is simple to do by changing the domain of the d3.scaleBand() scale.

Each item of the domain array corresponds to a program id, so if we sort this array in different ways we get different coordinates for each matrix square.

    // Precompute the orders.
    var orders = {
        name: d3.range(total_items).sort((a, b) => {
            return d3.ascending(nodes[a].name, nodes[b].name);
        }),
        count: d3.range(total_items).sort((a, b) => {
            return nodes[b].count - nodes[a].count;
        }),
        group: d3.range(total_items).sort((a, b) => {
            return nodes[b].group - nodes[a].group;
        })
    };

Then we hook the reordering up to the HTML select tag:

    d3.select("#order").on("change", function() {
        changeOrder(this.value);
    });

    function changeOrder(value) {
        matrixScale.domain(orders[value]);
        var t = svg.transition().duration(2000);

        t.selectAll(".row")
            .delay((d, i) => matrixScale(i) * 4)
            .attr("transform", function(d, i) {
                return "translate(0," + matrixScale(i) + ")";
            })
            .selectAll(".cell")
            .delay(d => matrixScale(d.x) * 4)
            .attr("x", d => matrixScale(d.x));

        t.selectAll(".column")
            .delay((d, i) => matrixScale(i) * 4)
            .attr("transform", (d, i) => "translate(" + matrixScale(i) + ")rotate(-90)");
    }

To add a nice effect we also add a transition to animate the reordering.

The final result

And that’s it! Now we just add some white lines and a tooltip for each square. You can see the final result here or check the final code here.

This is just the beginning. Now I need to work on how to display the keywords and the dissertations. And of course I’ll come back here to share them with you haha o/ \o

Thanks for reading! 😁