typedb-osi/typedb-bio

TypeDB Bio: Biomedical Knowledge Graph

Overview | Installation | Datasets | Examples | How You Can Help | Further Learning

Overview

TypeDB Bio is an open-source biomedical knowledge graph that supports research in areas such as drug discovery, precision medicine and drug repurposing. It gives biomedical researchers an intuitive way to query interconnected, heterogeneous biomedical data in a single place.

For example, by querying for the virus SARS-CoV-2, we can find the associated human protein, proteasome subunit alpha type-2 (PSMA2), a component of the proteasome, implicated in SARS-CoV-2 replication, and its encoding gene (PSMA2). Additionally, we can identify the drug carfilzomib, a known inhibitor of the proteasome that could therefore be researched as a potential treatment for patients with Covid-19.

By examining these specific relationships and their attributes, we can further investigate any connected biological components and better understand how they relate to one another. This helps researchers study the mechanisms of protein interactions, infections and the immune response more efficiently, and helps them find targets for the development of treatments or drugs. We can also expand our search to include contextual information, as shown below:

image

The team behind TypeDB Bio is a partnership between GSK, Oxford PharmaGenesis and Vaticle.

The schema that models the underlying knowledge graph, together with the declarative query language TypeQL, makes writing complex queries straightforward and intuitive. Furthermore, TypeDB's automated reasoning lets TypeDB Bio act as an intelligent biomedical database that infers implicit knowledge from the explicitly stored data. TypeDB Bio can represent biological facts, draw inferences from new findings and enforce research constraints, all at query time.

Installation

Prerequisites: Python >= 3.10, JDK >= 11, TypeDB Core >= 2.18.0, TypeDB Python Driver >= 2.18.0, TypeDB Studio >= 2.18.0

Clone this repo:

git clone https://github.com/vaticle/typedb-bio.git

Download the CORD-NER data set from this link and add it to this directory: dataset/cordner

Set up a virtual environment and install the dependencies:

cd <path/to/typedb-bio>/
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Start TypeDB:

typedb server

Start the loader script:

python loader.py

Config options can be set in config.ini. Some options can be overridden with command line arguments. For help with those arguments:

python loader.py -h

If using TypeDB Enterprise or Cloud, the connection password can only be supplied via command line for security:

python loader.py -p my-password
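
The precedence described above (defaults in config.ini, optionally overridden on the command line, with the password accepted only as an argument) could be sketched roughly as follows. This is an illustration, not the loader's actual code: the [typedb] section, the address and database options, and the -a/-d flags are assumptions made for the example; only -p appears in this README. Run python loader.py -h for the real options.

```python
import argparse
import configparser

# Hypothetical config.ini content. The section and option names are
# assumptions for illustration and may not match the repository's file.
EXAMPLE_INI = """
[typedb]
address = localhost:1729
database = typedb-bio
"""


def load_settings(ini_text, argv):
    """Merge INI defaults with command-line overrides (the command line wins)."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    settings = dict(config["typedb"])

    parser = argparse.ArgumentParser()
    parser.add_argument("-a", "--address")   # hypothetical flag
    parser.add_argument("-d", "--database")  # hypothetical flag
    parser.add_argument("-p", "--password")  # -p is the flag shown in this README
    args = parser.parse_args(argv)

    # Only arguments that were actually supplied replace the INI values.
    for key, value in vars(args).items():
        if value is not None:
            settings[key] = value
    return settings


if __name__ == "__main__":
    print(load_settings(EXAMPLE_INI, ["-p", "my-password"]))
```

Keeping the password out of the INI text and accepting it only as an argument mirrors the security note above.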

Now grab a coffee (or two) while the loader builds the schema and data for you!

Testing

Install the test dependencies:

pip install -r requirements_test.txt

Run the tests:

python -m pytest -v -s tests

Development

Install the development dependencies:

pip install -r requirements_dev.txt
pre-commit install

Examples

TypeQL queries can be run in TypeDB Studio, in TypeDB Console, or through the driver APIs. We recommend TypeDB Studio for the best visual experience.

# Which drugs interact with the genes associated with the SARS virus?

match
$virus isa virus, has virus-name "SARS";
$gene isa gene;
$drug isa drug;
$rel1 ($gene, $virus) isa gene-virus-association;
$rel2 ($gene, $drug) isa drug-gene-interaction;
offset 0; limit 20;
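
To run the same query for other viruses or page sizes, it can help to generate the query text from a small helper. The Python sketch below only builds the TypeQL string from the example above; it does not connect to TypeDB. The function name drugs_for_virus_query is hypothetical, and the quote escaping assumes TypeQL string literals accept backslash escapes.

```python
def drugs_for_virus_query(virus_name, limit=20, offset=0):
    """Build the example TypeQL match query for a given virus name."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")

    # Escape backslashes and double quotes so the name stays inside the
    # string literal (assumes TypeQL accepts backslash escapes).
    escaped = virus_name.replace("\\", "\\\\").replace('"', '\\"')

    return "\n".join([
        "match",
        f'$virus isa virus, has virus-name "{escaped}";',
        "$gene isa gene;",
        "$drug isa drug;",
        "$rel1 ($gene, $virus) isa gene-virus-association;",
        "$rel2 ($gene, $drug) isa drug-gene-interaction;",
        f"offset {offset}; limit {limit};",
    ])


if __name__ == "__main__":
    print(drugs_for_virus_query("SARS"))
```

The resulting text can be pasted into TypeDB Studio or TypeDB Console, or passed to whichever driver API you use.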

Datasets

Currently the datasets we've integrated include:

  • CORD-NER: The CORD-19 dataset that the White House released has been annotated and made publicly available. It uses various NER methods to recognise named entities on CORD-19 with distant or weak supervision.
  • UniProt: We’ve downloaded the reviewed human subset and ingested gene, transcript and protein identifiers.
  • Coronaviruses: This is an annotated dataset of coronaviruses and their potential drug targets put together by Oxford PharmaGenesis based on literature review.
  • DGIdb: We’ve taken the Interactions TSV which includes all drug-gene interactions.
  • Human Protein Atlas: The Normal Tissue Data includes the expression profiles for proteins in human tissues.
  • Reactome: This dataset connects pathways and their participating proteins.
  • DisGeNET: We’ve taken the curated gene-disease associations dataset, which contains associations from UniProt, CGI, ClinGen, Genomics England, CTD, PsyGeNET and Orphanet.
  • SemMed: This is a subset of the SemMed version 4.0 database.
  • TissueNet: A dataset of protein-protein interactions.

In progress:

  • CORD-19: We incorporate the original corpus which includes peer-reviewed publications from bioRxiv, medRxiv and others.
    • TODO: write loader script

We plan to add many more datasets!

How You Can Help

This is an ongoing project and we need your help! If you want to contribute, here are some ways to help:

  • Migrate more data sources (e.g. clinical trials, DrugBank, Excelra)
  • Extend the schema by adding relevant rules
  • Create a website
  • Write tutorials and articles for researchers to get started

If you wish to get in touch, please talk to us on the #typedb-bio channel on our Discord (link here).

Further Learning