One of the best things about software engineering is being able to build your own tools. Each visualisation goes through a series of stages to reach its final form, a process managed and automated using software I’ve designed and built. This software runs through a series of sometimes complex tasks to produce the interactive visualisations and the PDF documents for printing.
Each visualisation throws up a unique set of challenges, and involves careful exploration to discover the best approach to access, organise and present the data.
The most important requirement is that the data exists in an organised form somewhere that is publicly accessible. When dealing with large volumes of data it isn’t practical to source the data by hand, so I automate this by ‘scraping’ data from the data source.
For this I use a piece of software called Playwright, which is actually designed for running automated tests against websites and web-apps. In normal use it allows developers to write commands that mimic a real user interacting with a site or app, as a way of verifying that things work as expected. I’ve repurposed this software to interact with the data source and collate the data I need for a visualisation.
Sometimes the data is well structured, organised and consistent, making this part of the process relatively easy, but sometimes the way the data is structured can make things tricky and laborious. I always scrape data in a responsible way, ‘caching’ (storing) data and images on my local machine, and ensuring interactions happen at lower speeds so that there is minimal stress on the server from which the data is being scraped.
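To give a sense of what this looks like in practice, here is a minimal Playwright sketch in TypeScript. The URL, selectors and field names are hypothetical, and a real scraper handles far more than this (pagination, retries, downloading the images themselves), but the shape is the same: visit a page, pull out structured records, cache them locally, and pause between requests.

```typescript
import { chromium } from 'playwright';
import { mkdir, writeFile } from 'node:fs/promises';

// Hypothetical catalogue URL and selectors, for illustration only.
const CATALOGUE_URL = 'https://example.org/catalogue';

async function scrapeCatalogue() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(CATALOGUE_URL);

  // Collect one record per catalogue entry on the page.
  const items = await page.locator('.catalogue-entry').evaluateAll((nodes) =>
    nodes.map((node) => ({
      title: node.querySelector('.title')?.textContent?.trim() ?? '',
      date: node.querySelector('.date')?.textContent?.trim() ?? '',
      imageUrl: node.querySelector('img')?.getAttribute('src') ?? '',
    }))
  );

  // Cache the raw data locally so the site only needs to be visited once.
  await mkdir('cache', { recursive: true });
  await writeFile('cache/catalogue.json', JSON.stringify(items, null, 2));

  // In a full scraper, a pause like this sits between page visits
  // to keep the load on the server low.
  await page.waitForTimeout(2000);

  await browser.close();
}

scrapeCatalogue();
```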
Once the data has been sourced, it needs to be cleaned up and organised. This can involve simple things like removing unwanted text from titles, or more complex things like converting vague dates into something more solid that can be used for ordering. The same can also apply to any images, which can sometimes need work to remove things like unwanted borders.
Another important part of this phase is removing any unwanted or unusable data. There might be duplicate entries, or items that are missing critical data. For example, there are no surviving colour reproductions of some of van Gogh’s paintings, so I decided to remove them from the dataset.
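A rough sketch of that clean-up step might look like the following. The field names, the title clean-up rule and the date handling are illustrative assumptions rather than the actual rules used, but they show the general pattern: tidy, normalise, then filter out anything that can’t be used.

```typescript
interface RawItem {
  title: string;
  date: string; // e.g. 'c. 1888' or 'Spring 1889'
  imageUrl: string;
}

interface CleanItem {
  title: string;
  sortYear: number; // a concrete value that can be used for ordering
  imageUrl: string;
}

// Tidy titles, resolve vague dates to a sortable year, and drop unusable entries.
function cleanItems(raw: RawItem[]): CleanItem[] {
  const seen = new Set<string>();
  const cleaned: CleanItem[] = [];

  for (const item of raw) {
    // Remove unwanted text from the title (hypothetical rule).
    const title = item.title.replace(/\s*\(copy\)\s*$/i, '').trim();

    // Convert a vague date like 'c. 1888' into something usable for ordering.
    const yearMatch = item.date.match(/\d{4}/);
    if (!yearMatch) continue; // missing critical data

    // Skip duplicates and items with no usable image.
    if (!item.imageUrl || seen.has(title)) continue;
    seen.add(title);

    cleaned.push({ title, sortYear: Number(yearMatch[0]), imageUrl: item.imageUrl });
  }

  return cleaned;
}
```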
The image associated with each item of data needs to be analysed to extract information about its colour palette. This is done using a ‘k-means clustering’ algorithm, a common unsupervised machine-learning technique that involves sampling the pixels that make up an image and analysing how they relate to each other in terms of colour. This builds up clusters of related colours to produce a limited colour palette for the image.
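The sketch below shows the core of that clustering step, assuming pixels have already been sampled from the image into RGB triples. The real analysis may differ in how pixels are sampled, which colour space is used and how convergence is detected, but the assign-then-update loop is the essence of k-means.

```typescript
type RGB = [number, number, number];

const distanceSq = (a: RGB, b: RGB): number =>
  (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2;

// Cluster sampled pixels into k representative colours: the palette.
function kMeansPalette(pixels: RGB[], k: number, iterations = 10): RGB[] {
  // Start with k pixels chosen at random as the initial cluster centres.
  let centres: RGB[] = Array.from(
    { length: k },
    () => pixels[Math.floor(Math.random() * pixels.length)]
  );

  for (let iter = 0; iter < iterations; iter++) {
    const sums = Array.from({ length: k }, () => [0, 0, 0]);
    const counts = new Array(k).fill(0);

    // Assign each pixel to its nearest centre.
    for (const pixel of pixels) {
      let best = 0;
      for (let c = 1; c < k; c++) {
        if (distanceSq(pixel, centres[c]) < distanceSq(pixel, centres[best])) best = c;
      }
      sums[best][0] += pixel[0];
      sums[best][1] += pixel[1];
      sums[best][2] += pixel[2];
      counts[best]++;
    }

    // Move each centre to the mean colour of the pixels assigned to it.
    centres = centres.map((centre, c) =>
      counts[c] === 0
        ? centre
        : ([sums[c][0] / counts[c], sums[c][1] / counts[c], sums[c][2] / counts[c]] as RGB)
    );
  }

  return centres;
}
```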
By far the greatest amount of work to get to this point has gone into building the software that enables the grid layouts. My original visualisation used a simple rectangular grid, but I wanted a lot more flexibility over the layout. It turned out that nothing appropriate existed, so I built what I needed, in the form of two libraries.
After trying a few different approaches, I realised I could use a mathematical construct called a Coons patch to map a grid onto any four-sided surface, allowing the grid to be distorted however I required. I have open-sourced the code as an npm package written in TypeScript called Warp Grid. It has a powerful and flexible API allowing for some novel and interesting grid designs, and by combining multiple grids and extruding shapes from paths, complex shapes can be created.
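To show the underlying idea (this is a sketch of the maths, not the Warp Grid API), a Coons patch blends four boundary curves so that any (u, v) coordinate in the unit square maps to a point inside the four-sided region they enclose:

```typescript
type Point = { x: number; y: number };
type Curve = (t: number) => Point; // parameterised from 0 to 1

// Evaluate a Coons patch: blend four boundary curves to find the point at (u, v).
// top/bottom run left-to-right; left/right run top-to-bottom. Corners must match.
function coonsPoint(
  top: Curve,
  bottom: Curve,
  left: Curve,
  right: Curve,
  u: number,
  v: number
): Point {
  const lerp = (a: Point, b: Point, t: number): Point => ({
    x: a.x + (b.x - a.x) * t,
    y: a.y + (b.y - a.y) * t,
  });

  // Blend between the top/bottom curves and between the left/right curves.
  const acrossV = lerp(top(u), bottom(u), v);
  const acrossU = lerp(left(v), right(v), u);

  // Bilinear blend of the four corner points.
  const corners = lerp(
    lerp(top(0), top(1), u),
    lerp(bottom(0), bottom(1), u),
    v
  );

  // Coons formula: sum of the two blends minus the corner term.
  return { x: acrossV.x + acrossU.x - corners.x, y: acrossV.y + acrossU.y - corners.y };
}

// Sampling coonsPoint over a regular lattice of (u, v) values yields grid
// lines warped to fit whatever four-sided boundary the curves describe.
```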
There are a few different approaches to creating the grids. I can either work in Adobe Illustrator and export data representing the shapes I want to use, or I can work completely programmatically, using code alone to generate and position the paths and shapes that will become the grids.
As well as deciding the shapes of the grids, I decide on the most appropriate way to order and distribute the data across the grid cells, so that the data is presented in the most interesting way possible.
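As a simple illustration of what that distribution step can look like, here is one possible strategy, with hypothetical types: order the artworks chronologically and fill the cells in reading order. In practice the ordering might be driven by colour, subject or something else entirely.

```typescript
interface GridCell {
  row: number;
  column: number;
}

interface Artwork {
  title: string;
  sortYear: number;
  palette: [number, number, number][];
}

// One possible strategy: order the artworks chronologically and assign them
// to cells in reading order (left to right, top to bottom).
function assignToCells(artworks: Artwork[], cells: GridCell[]): Map<GridCell, Artwork> {
  const ordered = [...artworks].sort((a, b) => a.sortYear - b.sortYear);
  const orderedCells = [...cells].sort(
    (a, b) => a.row - b.row || a.column - b.column
  );

  const assignment = new Map<GridCell, Artwork>();
  orderedCells.forEach((cell, i) => {
    if (ordered[i]) assignment.set(cell, ordered[i]);
  });
  return assignment;
}
```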
A finished piece comprises an interactive visualisation and a set of print-ready PDFs. First of all, the grid data needs to be turned into an image. This is done by stepping through the grid produced in the last step and creating a visual representation of each piece of data.
A kind of vector image called an SVG (Scalable Vector Graphics) file is generated. There are broadly two kinds of images: raster images, which are made from pixels, and vector images, which are made from shapes and lines. The latter are infinitely scalable and can be broken apart into individual pieces, which is useful for making an image interactive. The vector image is then paired with data to create the interactive visualisation.
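A stripped-down sketch of that step might look like this, assuming each grid cell has already been reduced to an SVG path and given a fill colour from its artwork’s palette (the types and fields here are hypothetical):

```typescript
// Build an SVG string from grid cells, one path per cell.
interface CellShape {
  pathData: string; // SVG path 'd' attribute describing the warped cell
  fill: string;     // e.g. the dominant palette colour of the assigned artwork
  id: string;       // used later to wire up interactivity
}

function renderSvg(cells: CellShape[], width: number, height: number): string {
  const paths = cells
    .map(
      (cell) =>
        `<path id="${cell.id}" d="${cell.pathData}" fill="${cell.fill}" />`
    )
    .join('\n  ');

  return `<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 ${width} ${height}">
  ${paths}
</svg>`;
}
```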
The SVG is then converted into a set of PDFs using a separate set of information that defines how a given visualisation will be printed: the size of the paper and how the image should be positioned on the page.
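That print information is essentially a small data structure per visualisation. The shape below is an illustrative assumption rather than the actual format, but it captures the kind of detail involved: paper dimensions and how the image sits on the page.

```typescript
// Illustrative shape for per-visualisation print settings.
interface PrintSpec {
  paperWidthMm: number;
  paperHeightMm: number;
  // How the image is positioned on the page.
  marginMm: { top: number; right: number; bottom: number; left: number };
  align: 'centre' | 'top' | 'bottom';
}

const a2Portrait: PrintSpec = {
  paperWidthMm: 420,
  paperHeightMm: 594,
  marginMm: { top: 20, right: 20, bottom: 20, left: 20 },
  align: 'centre',
};
```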