Quick Start¶
git clone https://github.com/Knowledge-Graph-Hub/kg-covid-19
cd kg-covid-19
python3 -m venv venv && source venv/bin/activate # optional
pip install .
python run.py download
python run.py transform
python run.py merge
Download the Knowledge Graph¶
Prebuilt versions of the KG-COVID-19 knowledge graph, built from all available data, are available in the following serialization formats:
See here for a description of the KGX TSV format
Previous builds are available for download here. Each build contains the following data:
raw: the data ingested for this build
transformed: the transformed data from each source
stats: detailed statistics about the contents of the KG
Jenkinsfile: the exact commands used to generate the KG
kg-covid-19.nt.gz: an RDF/N-Triples version of the KG
kg-covid-19.tar.gz: a KGX TSV version of the KG
kg-covid-19.jnl.gz: the Blazegraph journal file (for loading an endpoint)
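For example, the TSV and N-Triples artifacts can be unpacked with standard command-line tools (file names as listed above):

tar -xvzf kg-covid-19.tar.gz
gunzip kg-covid-19.nt.gz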
Knowledge Graph Hub concept¶
A Knowledge Graph Hub (KG Hub) is a framework for downloading and transforming data to a central location in order to build knowledge graphs (KGs) from different combinations of data sources, in an automated, YAML-driven way. The workflow consists of three steps:
Download data
Transform data for each data source into two TSV files (edges.tsv and nodes.tsv) as specified here
Merge the graphs for each data source of interest using KGX to produce a merged knowledge graph
To facilitate interoperability of datasets, Biolink categories are added to nodes and Biolink association types are added to edges during transformation.
A more thorough explanation of the KG-hub concept is here.
KG-COVID-19 project¶
The KG-COVID-19 project is the first instantiation of such a KG Hub. Thus, KG-COVID-19 is a framework that follows the design patterns of the KG Hub to download and transform COVID-19/SARS-COV-2-related datasets and emit a knowledge graph that can then be used for machine learning or other uses, to produce actionable knowledge.
The codebase¶
Prerequisites¶
Java/JDK is required for the transform step to work properly. See here for installation instructions.
Computational requirements¶
On a commodity server with 200 GB of memory, generation of the knowledge graph containing all source data requires a total of 3.7 hours (0.13 hours, 1.5 hours, and 2.1 hours for the download, transform, and merge step, respectively), with a peak memory usage of 34.4 GB and disk use of 37 GB. An estimate of the current build time on a typical server is also available here.
Installation¶
git clone https://github.com/Knowledge-Graph-Hub/kg-covid-19
cd kg-covid-19
python3 -m venv venv && source venv/bin/activate # optional
pip install .
Running the pipeline¶
python run.py download
python run.py transform
python run.py merge
Jupyter notebook¶
We have also prepared a Jupyter notebook demonstrating how to run the pipeline to generate a KG, as well as how to use other tooling, such as graph sampling for generating holdouts, and graph querying.
A few organizing principles used for data ingest¶
UniProtKB identifiers are used for genes and proteins, where possible
For drug/compound identifiers, there is a preferred namespace. If a dataset provides identifiers from multiple namespaces, the choice is made in descending order of preference: CHEBI > CHEMBL > DRUGBANK > PUBCHEM
Less is more: for each data source, we ingest only the subset of data that is most relevant to the knowledge graph in question (here, it’s KG-COVID-19)
We avoid ingesting data from a source that isn’t authoritative for the data in question (e.g. we do not ingest protein interaction data from a drug database)
Each ingest should make an effort to add provenance data by adding a provided_by column for each node and edge in the output TSV file, populated with the source of each datum
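For example, a nodes.tsv row with provenance might look like this (tab-separated; the provided_by value is illustrative):

id                name        category        provided_by
CHEBI:45783       imatinib    biolink:Drug    [source name]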
Querying the graph¶
A SPARQL endpoint for the merged knowledge graph is available here. For a better experience, consider using yasgui for your querying needs (for yasgui, set http://kg-hub-rdf.berkeleybop.io/blazegraph/sparql as your SPARQL endpoint).
If you are not sure where to start, here are some example SPARQL queries.
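As a minimal sketch, you can also query the endpoint programmatically, for example with the third-party SPARQLWrapper package (pip install sparqlwrapper; not part of this codebase):

from SPARQLWrapper import SPARQLWrapper, JSON

# Count all triples in the KG -- a simple smoke test of the endpoint
sparql = SPARQLWrapper("http://kg-hub-rdf.berkeleybop.io/blazegraph/sparql")
sparql.setQuery("SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results["results"]["bindings"][0]["triples"]["value"])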
How to Contribute¶
Write code to ingest data¶
The most urgent need is for code to ingest data from new sources.
Find a data source to ingest:
An issue tracker with a list of new data sources is here.
Look at the data file(s), and plan how you are going to write out data to nodes and edges:
You’ll need to write out a nodes.tsv file describing each entity you are ingesting, and an edges.tsv describing the relationships between entities, as described here.

nodes.tsv should have at least these columns (you can add more columns if you like):

id name category

id should be a CURIE that uses one of these identifiers. They are enumerated here. For genes, a UniProt ID is preferred, if available.

category should be a Biolink category in CURIE format, for example biolink:Gene.

edges.tsv should have at least these columns:

subject edge_label object relation

subject and object should be ids that are present in the nodes.tsv file (again, as CURIEs that use one of these).

edge_label should be a CURIE for the Biolink edge_label that describes the relationship. relation should be a CURIE for the term from the Relation Ontology.
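Putting this together, a minimal, hypothetical pair of files might look like the following (tab-separated). Both edge endpoints appear as ids in nodes.tsv; here UniProtKB:P0DTC2 is the SARS-CoV-2 spike protein, UniProtKB:Q9BYF1 is human ACE2, and RO:0002434 is "interacts with":

nodes.tsv:

id                   name                 category
UniProtKB:P0DTC2     spike glycoprotein   biolink:Protein
UniProtKB:Q9BYF1     ACE2                 biolink:Protein

edges.tsv:

subject              edge_label               object               relation
UniProtKB:P0DTC2     biolink:interacts_with   UniProtKB:Q9BYF1     RO:0002434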
Read how to make a PR, and fork the repo:
Read these instructions about how to make a pull request in GitHub. Fork the code and set up your development environment.
Add a block to download.yaml to download the data file(s) for your source:
Add a block of YAML containing the URL of the file you need to download for the source (and optionally a brief description) to download.yaml, like so. Each item will be downloaded when the run.py download command is executed:
#
# brief comment about this source, one or more blocks with a url: (and optionally a local_name:, to avoid name collisions)
#
-
  # first file
  url: http://curefordisease.org/some_data.txt
  local_name: some_data.txt
-
  # second file
  url: http://curefordisease.org/some_more_data.txt
  local_name: some_more_data.txt
Add code to ingest and transform data:
Add a new sub-directory in kg_covid_19/transform_utils with a unique name for your source. If the data come from a scientific paper, consider prepending the PubMed ID to the name of the source (e.g. pmid28355270_hcov229e_a549_cells).
In this sub-directory, write a class that ingests the file(s) you added above in the YAML, which will be in data/raw/[file name without path]. Your class should have a constructor and a run() function, which is called to perform the ingest. It should output data for all nodes and edges into data/transformed/[source name], in TSV format, as described here (a minimal sketch is shown below).
Also add the following metadata in the comments of your script:
data source
files used
release version that you are ingesting
documentation on which fields are relevant and how they map to node and edge properties
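Here is a minimal, hypothetical sketch of such a class. The class name, file names, and use of the standard csv module (rather than any repo-specific base class or helpers) are illustrative assumptions, not the project's actual API:

import csv
import os


class ExampleSourceTransform:
    """Hypothetical ingest for a source named 'example_source'."""

    def __init__(self, input_dir: str = "data/raw",
                 output_dir: str = "data/transformed"):
        self.input_file = os.path.join(input_dir, "some_data.txt")
        self.output_dir = os.path.join(output_dir, "example_source")

    def run(self) -> None:
        os.makedirs(self.output_dir, exist_ok=True)
        nodes_path = os.path.join(self.output_dir, "nodes.tsv")
        edges_path = os.path.join(self.output_dir, "edges.tsv")
        with open(nodes_path, "w", newline="") as n, \
                open(edges_path, "w", newline="") as e:
            nodes = csv.writer(n, delimiter="\t")
            edges = csv.writer(e, delimiter="\t")
            # Required columns, plus provided_by for provenance
            nodes.writerow(["id", "name", "category", "provided_by"])
            edges.writerow(["subject", "edge_label", "object",
                            "relation", "provided_by"])
            # Parse self.input_file here and write one row per node/edge.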
In kg_covid_19/transform.py, add a key/value pair to DATA_SOURCES. The key should be the [source name] above, and the value should be the name of the class above. Also add an import statement for the class. For example:
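(Hypothetical names; the module path follows the sub-directory layout described above.)

from kg_covid_19.transform_utils.example_source.example_source import (
    ExampleSourceTransform,
)

DATA_SOURCES = {
    # ... existing sources ...
    "example_source": ExampleSourceTransform,
}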
In merge.yaml, add a block for your new source, something like:
SOURCE_NAME:
  input:
    format: tsv
    filename:
      - data/transformed/[source_name]/nodes.tsv
      - data/transformed/[source_name]/edges.tsv
Submit your PR on GitHub, and link the GitHub issue for the data source you ingested.
You might want to run pylint and mypy and fix any issues before submitting your PR.
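For example, from the repository root (assuming both tools are installed):

pylint kg_covid_19
mypy kg_covid_19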
Contributors¶
Acknowledgements¶
We gratefully acknowledge and thank all COVID-19 data providers for making their data available.