georgiarichards/preventabledeathstracker

Preventable Deaths Tracker Web Scraper and Analysis

Introduction

This repository is a JavaScript rewrite of the Preventable Deaths Scraper, initially carried out by summer intern and Oxford computer scientist Alex Colby. The rewrite focuses on four things: explainability (all code is documented), speed (asynchronous code lets us scrape whilst fetching), the ability to run the scraper on a server (using Node.js), and automated processing.

We also provide a custom WordPress Gutenberg plugin to be used alongside the scraped data. It takes the form of a custom block that renders a heatmap over coroner areas, as defined by the Coroners' Society.

Installation and Usage

Scraper

To install the scraper, you will need Node.js. Once it is installed, install the scraper's dependencies by running the following command in the root directory of this repository:

npm install

The scraper can then be run with the following command in the root directory of this repository:

npm run fetch

This will then save the scraped data to src/data/reports.csv.

Corrections

We've attempted to fully automate the scraping process, but there are some things that we can't automate. These include:

  • Severe typos in some fields (e.g. /0206/2023 is given as a date)
  • Transpositions of fields (e.g. a name being replaced with a ref number)
  • Ambiguity in destinations (e.g. is University Hospitals of Derby and Burton NHS FT one destination or two?)

In these cases, we keep JSON files recording manual corrections in the src/correct/manual_replace directory. These need to be updated periodically so that the scraper stays accurate.
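
As a rough illustration of how such a corrections file might be applied, the sketch below maps a raw field value to its manual correction and leaves unknown values untouched. It is written in Python for brevity (the scraper itself is JavaScript), and the mapping layout, the helper names `is_valid_date` and `apply_corrections`, and the corrected date value are all assumptions for illustration, not the repository's actual format:

```python
import json
from datetime import datetime

# Hypothetical corrections mapping: raw scraped value -> corrected value.
# The real files in src/correct/manual_replace may be structured differently,
# and the corrected date below is an assumed example.
SAMPLE_CORRECTIONS = json.loads('{"/0206/2023": "02/06/2023"}')


def is_valid_date(text):
    """Return True if text parses as a dd/mm/yyyy date."""
    try:
        datetime.strptime(text, "%d/%m/%Y")
        return True
    except ValueError:
        return False


def apply_corrections(value, corrections):
    """Return the manual correction for value, or value itself if none exists."""
    return corrections.get(value, value)


raw = "/0206/2023"
fixed = apply_corrections(raw, SAMPLE_CORRECTIONS)
print(is_valid_date(raw), is_valid_date(fixed))
```

A value that fails validation and has no entry in the mapping is the kind of case that would need a new manual correction.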

To update these corrections, first install Node.js, then run the following command in the root directory of this repository:

npm install

Then manual corrections for all fields can be added by running the following command in the root directory of this repository:

npm run correct:update all

This will open an interactive prompt for each failed parse, allowing you to correct, skip or mark the field entry as uncorrectable. Other options for updating individual columns' corrections are listed by running npm run correct:update -- --help.

Analyses

All analyses are written in Python and require Python 3.8 or above. You'll also need to have pip installed.

Year Count Analysis

To install the dependencies for the year count analysis, run the following command in the root directory of this repository:

pip install -r src/analyse/aggregation/requirements.txt

The year count analysis can then be run with the following command in the root directory of this repository:

python src/analyse/aggregation/year-counts.py

This will save the number of reports per year to src/data/year-counts.csv, in the following format:

year  count
2013  173
2014  559
2015  490
...   ...
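
The aggregation can be sketched in a few lines. This is a minimal illustration, not the contents of year-counts.py: it assumes reports carry a `date` column in dd/mm/yyyy form (as in the sample rows later in this README), uses only the standard library, and works on made-up inline rows rather than the real reports.csv. The helper name `count_reports_per_year` is hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for src/data/reports.csv.
SAMPLE_CSV = """ref,date,area
2023-0168,22/05/2023,Avon
2023-0074,27/02/2023,Essex
2014-0001,03/01/2014,Suffolk
"""


def count_reports_per_year(csv_file):
    """Count rows per year, taking the year from a dd/mm/yyyy date column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        parts = row["date"].split("/")
        # Only count rows with three slash-separated parts ending in a 4-digit year.
        if len(parts) == 3 and len(parts[2]) == 4 and parts[2].isdigit():
            counts[int(parts[2])] += 1
    return dict(sorted(counts.items()))


year_counts = count_reports_per_year(io.StringIO(SAMPLE_CSV))
print(year_counts)
```

To try this against real data, pass an open file handle for src/data/reports.csv instead of the `io.StringIO` sample, assuming the column is indeed named `date`.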

A shortcut for running the analysis is defined in the package.json file and can be run as follows:

npm run analyse:year-counts

Medical Cause Analysis

To install the dependencies for the medical cause analysis, run the following command in the root directory of this repository:

pip install -r src/analyse/natural-language/requirements.txt

The cause analysis can then be run with the following command in the root directory of this repository:

python src/analyse/natural-language/cause-tags.py

This will save the analysis to src/data/medical-cause-reports.csv with an additional column, tags, which contains the predicted causes of death for each report (this column may be blank when prediction fails).

The annotated reports look like this:

ref        date        area          ...  tags
2023-0168  22/05/2023  Avon          ...  [('cerebrovascular accident/event/haemorrhage', 0.434), ...]
2023-0166  19/05/2023  Warwickshire  ...  nan
2023-0074  27/02/2023  Essex         ...  [('spontaneous subarachnoid haemorrhage', 0.513), ...]
2023-0073  28/02/2023  Somerset      ...  nan
2023-0071  23/02/2023  Suffolk       ...  [('biventricular failure', 0.380), ...]
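
For downstream use, the tags column needs turning back into structured data. The sketch below is one possible way to do that, assuming each non-empty cell is stored as the text of a Python list of (label, score) tuples, as the sample rows suggest, and that failed predictions appear as a blank cell or nan. The helper names `parse_tags` and `top_tag` are hypothetical and not part of the repository.

```python
import ast


def parse_tags(cell):
    """Parse a tags cell into a list of (label, score) tuples; blank or nan gives []."""
    if cell is None:
        return []
    text = str(cell).strip()
    if text == "" or text.lower() == "nan":
        return []
    # literal_eval only accepts Python literals, so it is safer than eval here.
    return [(str(label), float(score)) for label, score in ast.literal_eval(text)]


def top_tag(cell):
    """Return the highest-scoring label, or None when there are no tags."""
    tags = parse_tags(cell)
    return max(tags, key=lambda pair: pair[1])[0] if tags else None


example = "[('biventricular failure', 0.380)]"
print(parse_tags(example), top_tag("nan"))
```

Treating blank and nan cells as an empty list keeps later code free of special cases for reports where prediction failed.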

A shortcut for running the analysis is defined in the package.json file and can be run as follows:

npm run analyse:label-medical

WordPress Plugins

The WordPress plugins are written for the WordPress Gutenberg block editor. To install the plugins, you'll need Node.js. Once it is installed, install the plugins by running the following commands in either of the plugins' project directories:

npm install -g @wordpress/env
npm install

You can then run the development server and build the plugin as follows:

wp-env start
npm run start

Layout

There are six main directories in the src directory:

  • analyse: Analysis of the scraped data (mostly in Python).
  • correct: Correcting/cleaning the scraped data.
  • data: The raw report data.
  • fetch: Fetching/scraping the report data.
  • parse: Parsing the scraped data (e.g. HTML -> CSV).
  • write: Writing to both the reports.csv file and the log file.

The plugins directory contains WordPress plugins to be used with the report CSVs produced by the scraper (these are probably only of interest if you work with data visualisation or WordPress plugins).

All JavaScript code is documented with JSDoc, and all Python code is written in interactive Python files (you should be able to run these like a Jupyter notebook).