Preventable Deaths Tracker Web Scraper and Analysis

Introduction

This repository is a JavaScript rewrite of the Preventable Deaths Scraper, initially carried out by summer intern and Oxford computer scientist Alex Colby. The rewrite focuses on the explainability of the scraper (all code is documented), its speed (we use async code to scrape whilst fetching), and the ability to run it on a server (using Node.js) and automate its processing.

We also provide a custom WordPress Gutenberg plugin to be used alongside the scraped data. This takes the form of a custom block that renders a heatmap over coroner areas, as defined by the Coroners' Society.

Installation and Usage

Scraper

To install the scraper, you will need Node.js. Once Node.js is installed, install the scraper's dependencies by running the following command in the root directory of this repository:

npm install

You can then run the scraper with the following command in the root directory of this repository:

npm run fetch

This will then save the scraped data to src/data/reports.csv.

Corrections

We've attempted to fully automate the scraping process, but there are some things that we can't automate. These include:

Severe typos in some fields (e.g. /0206/2023 is given as a date)
Transpositions of fields (e.g. a name being replaced with a ref number)
Ambiguity in destinations (e.g. is University Hospitals of Derby and Burton NHS FT one destination or two?)

In these cases, we keep JSON files recording manual corrections in the src/correct/manual_replace directory. These need to be updated periodically so that the scraper stays accurate as new reports are published.
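
As a rough illustration of the idea, the sketch below validates a scraped date and falls back to a manual replacement table when parsing fails. It is written in Python for brevity rather than in the scraper's JavaScript. The `MANUAL_DATE_FIXES` mapping, the `normalise_date` helper and the corrected value are all hypothetical, and the actual layout of the JSON files in src/correct/manual_replace may well differ.

```python
from datetime import datetime
from typing import Dict, Optional

# Hypothetical manual corrections, keyed by the raw scraped text.
# The real corrections live in JSON files; this dict only mimics the idea,
# and the replacement value here is an assumed example.
MANUAL_DATE_FIXES: Dict[str, str] = {"/0206/2023": "02/06/2023"}


def normalise_date(raw: str, fixes: Dict[str, str]) -> Optional[str]:
    """Return a DD/MM/YYYY date, or None if the value needs manual review."""
    # Prefer a recorded manual correction over the raw scraped value.
    candidate = fixes.get(raw, raw).strip()
    try:
        parsed = datetime.strptime(candidate, "%d/%m/%Y")
    except ValueError:
        return None  # unparseable and no usable correction recorded
    return parsed.strftime("%d/%m/%Y")
```

A value that returns `None` here corresponds to a failed parse that would need a new manual correction.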

To update these corrections, first install Node.js, then run the following command in the root directory of this repository:

npm install

Then manual corrections for all fields can be added by running the following command in the root directory of this repository:

npm run correct: update all

This will open up an interactive prompt for each failed parse, allowing you to correct, skip or mark the field entry as uncorrectable. Other options for updating individual columns' corrections are available by running npm run correct: update -- --help.

Analyses

All analyses are written in Python and require Python 3.8 or above. You'll also need to have pip installed.

Year Count Analysis

To install the dependencies for the year count analysis, you can run the following command in the root directory of this repository:

pip install -r src/analyse/aggregation/requirements.txt

You can then run the year count analysis with the following command in the root directory of this repository:

python src/analyse/aggregation/year-counts.py

This will save the number of reports per year to src/data/year-counts.csv, in the following format:

year	count
2013	173
2014	559
2015	490
...	...
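
The aggregation itself is simple. A minimal sketch of counting reports per year is shown below. It is not the code in year-counts.py: the `count_reports_by_year` helper is hypothetical, and it assumes dates are DD/MM/YYYY strings as in the sample rows elsewhere in this README.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable


def count_reports_by_year(dates: Iterable[str]) -> Counter:
    """Count reports per year from DD/MM/YYYY date strings."""
    counts: Counter = Counter()
    for raw in dates:
        try:
            year = datetime.strptime(raw.strip(), "%d/%m/%Y").year
        except ValueError:
            continue  # skip dates that could not be parsed
        counts[year] += 1
    return counts
```

For example, `count_reports_by_year(["22/05/2023", "19/05/2023", "/0206/2023"])` counts two reports for 2023 and skips the malformed third value.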

A shortcut to run the analysis is defined in the package.json file and can be run like so:

npm run analyse:year-counts

Medical Cause Analysis

To install the dependencies for the medical cause analysis, you can run the following command in the root directory of this repository:

pip install -r src/analyse/natural-language/requirements.txt

You can then run the cause analysis with the following command in the root directory of this repository:

python src/analyse/natural-language/cause-tags.py

This will save the analysis to src/data/medical-cause-reports.csv with an additional column tags which contains the predicted causes of death for each report (this column may be blank when prediction fails).

The annotated reports look like this:

ref	date	area	...	tags
2023-0168	22/05/2023	Avon	...	[('cerebrovascular accident/event/haemorrhage', 0.434), ...]
2023-0166	19/05/2023	Warwickshire	...	nan
2023-0074	27/02/2023	Essex	...	[('spontaneous subarachnoid haemorrhage', 0.513), ...]
2023-0073	28/02/2023	Somerset	...	nan
2023-0071	23/02/2023	Suffolk	...	[('biventricular failure', 0.380), ...]
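
If you want to work with the tags column downstream, something like the sketch below may help. It assumes each non-blank cell holds a complete Python-style list of (cause, number) pairs, not the elided `...` form shown above, and it treats the number as a score, which this README does not define. The `parse_tags` and `top_cause` helpers are hypothetical and not part of the repository.

```python
import ast
from typing import List, Optional, Tuple


def parse_tags(cell: Optional[str]) -> List[Tuple[str, float]]:
    """Parse one tags cell into (cause, score) pairs; blank or nan gives []."""
    text = (cell or "").strip()
    if not text or text.lower() == "nan":
        return []
    # literal_eval only evaluates Python literals, so it is safe on untrusted text.
    return [(str(cause), float(score)) for cause, score in ast.literal_eval(text)]


def top_cause(cell: Optional[str]) -> Optional[str]:
    """Return the cause with the highest score, or None if there are no tags."""
    tags = parse_tags(cell)
    return max(tags, key=lambda pair: pair[1])[0] if tags else None
```

Rows where prediction failed (shown as nan above) yield an empty list, so they can be filtered out without special handling.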

A shortcut to run the analysis is defined in the package.json file and can be run like so:

npm run analyse:label-medical

WordPress Plugins

The WordPress plugins are written for the WordPress block editor (Gutenberg). To install the plugins, you'll need Node.js. Once Node.js is installed, you can install the plugins by running the following commands in either of the plugins' project directories:

npm install -g @wordpress/env
npm install

You can then run the development server and build the plugin as so:

wp-env start
npm run start

Layout

There are 6 main directories in the src directory:

analyse: Analysis of the scraped data (mostly in Python).
correct: Correcting/cleaning the scraped data.
data: The raw report data.
fetch: Fetching/scraping the report data.
parse: Parsing the scraped data (i.e. HTML -> CSV).
write: Writing to both the reports.csv file and the log file.

The plugins directory contains WordPress plugins to be used with the report CSVs produced by the scraper (these are probably only of interest if you work with data visualisation or WordPress plugins).

All JavaScript code is documented with JSDoc, and all Python code is written as interactive Python files (you should be able to run these like a Jupyter notebook).

georgiarichards/preventabledeathstracker
