Knowledge Graphs for Microbial data
The framework of this repository is derived from kg-cookiecutter.
Knowledge Graph Hub concept
Please see here
Prerequisites
Java (JDK) is required for the transform step to work properly. Installation instructions can be found here.
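Before running the transform step, it can help to confirm that a Java runtime is on the PATH. The sketch below is only an illustration and is not part of this repository; the helper name `java_available` is hypothetical.

```python
import shutil


def java_available(executable: str = "java") -> bool:
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if java_available():
        print("Java found on PATH.")
    else:
        print("Java not found; install a JDK before running the transform step.")
```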
Setup
Create a Python virtual environment (venv, Anaconda, etc.).
pip install poetry
git clone https://github.com/Knowledge-Graph-Hub/kg-microbe
cd kg-microbe
poetry install
Pipeline Stages:
Download
Transform
Merge
Download
This step downloads all files from the URLs declared in the download.yaml file.
script - poetry run kg download
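To make the idea concrete, here is a minimal sketch of how a list of declared URLs could be mapped to local file paths. It does not reproduce the repository's downloader and performs no network access. The `url` and `local_name` keys, the `data/raw` output directory, the example.org URLs, and the function name `plan_downloads` are all assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def plan_downloads(entries, output_dir="data/raw"):
    """Map each entry's URL to the local path it would be saved under."""
    plan = []
    for entry in entries:
        url = entry["url"]
        # Fall back to the last URL path segment when no local name is given.
        name = entry.get("local_name") or Path(urlparse(url).path).name
        plan.append((url, Path(output_dir) / name))
    return plan


# Hypothetical entries, standing in for the parsed contents of download.yaml.
entries = [
    {"url": "https://example.org/files/condensed_traits_NCBI.csv"},
    {"url": "https://example.org/ontologies/chebi.owl", "local_name": "chebi.owl"},
]

for url, destination in plan_downloads(entries):
    print(f"{url} -> {destination}")
```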
Files currently downloaded:
1. Traits data from the bacteria-archaea-traits repository. Only ‘condensed_traits_NCBI.csv’ is considered for now.
2. Environments data from the same repository, provided as a conversion table titled ‘environments.csv’.
3. ROBOT jar and shell script files. ROBOT is used to convert ontologies from OWL format into OBOJSON format so that nodes and edges can be extracted from them. We also use the ‘extract’ feature of ROBOT to get subsets of ontologies. Documentation on ROBOT can be found here.
4. CHEBI.owl, used as a dictionary when running OGER to annotate ‘carbon substrate’ information from the traits data.
5. NCBITaxon.owl, used as the ontology source to capture organismal classification information.
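The two ROBOT operations mentioned above (converting OWL to OBO JSON and extracting a subset) can be sketched as command lines assembled in Python. This is an illustration, not the invocation this repository uses: the executable name `robot`, the file paths, the `BOT` extraction method, and the example term are assumptions, and the helper names are hypothetical.

```python
import shlex


def robot_convert_command(owl_path, json_path, robot="robot"):
    """Build a `robot convert` call; the output format follows the file extension."""
    return [robot, "convert", "--input", owl_path, "--output", json_path]


def robot_extract_command(owl_path, term, subset_path, method="BOT", robot="robot"):
    """Build a `robot extract` call that pulls a subset around one term."""
    return [
        robot, "extract",
        "--method", method,
        "--input", owl_path,
        "--term", term,
        "--output", subset_path,
    ]


# Illustrative paths and term only; adjust to the files actually downloaded.
convert = robot_convert_command("data/raw/chebi.owl", "data/raw/chebi.json")
extract = robot_extract_command(
    "data/raw/ncbitaxon.owl", "NCBITaxon:2", "data/raw/ncbitaxon_subset.owl"
)

print(shlex.join(convert))
print(shlex.join(extract))
```

Each list can be passed to `subprocess.run(command, check=True)` once ROBOT is installed and reachable under the chosen executable name.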
Transform
In this step, we create nodes and edges corresponding to three of the downloaded files mentioned above (#1, #4 and #5).
scripts
All together -
poetry run kg transform
OR
Running transforms individually:
For traits data =>
poetry run kg transform -s TraitsTransform
For CHEBI.owl =>
poetry run kg transform -s ChebiTransform
For NCBITaxon.owl =>
poetry run kg transform -s NCBITransform
For BacDive data =>
poetry run kg transform -s BacDiveTransform
For MediaDive data =>
poetry run kg transform -s MediaDiveTransform
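When several transforms need to be run one at a time, for example to isolate a failure, the individual commands above can be driven from a small script. The sketch below only assembles and prints the commands; the helper name `transform_command` is hypothetical, and the source names are the ones listed above.

```python
import shlex

# Transform names as listed in this README.
SOURCES = [
    "TraitsTransform",
    "ChebiTransform",
    "NCBITransform",
    "BacDiveTransform",
    "MediaDiveTransform",
]


def transform_command(source):
    """Build the command line that runs a single transform."""
    return ["poetry", "run", "kg", "transform", "-s", source]


for source in SOURCES:
    # Replace print with subprocess.run(command, check=True) to execute.
    command = transform_command(source)
    print(shlex.join(command))
```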
Merge
In this step, the outputs of all the above transforms are merged, and cumulative nodes and edges files are generated.
script - poetry run kg merge
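Conceptually, merging means combining the per-transform node and edge tables while collapsing duplicates. The sketch below illustrates that idea on small in-memory TSV tables. It is not the merge logic used by `kg merge`; the column names, the sample rows, and the helper names `read_tsv` and `merge_rows` are assumptions for illustration.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_rows(tables, key_fields):
    """Concatenate tables, keeping the first row seen for each key."""
    merged = {}
    for table in tables:
        for row in table:
            key = tuple(row[field] for field in key_fields)
            merged.setdefault(key, row)
    return list(merged.values())


# Hypothetical node tables from two transforms that share one node.
traits_nodes = read_tsv("id\tname\nNCBITaxon:2\tBacteria\nCHEBI:17234\tglucose\n")
chebi_nodes = read_tsv("id\tname\nCHEBI:17234\tglucose\n")

nodes = merge_rows([traits_nodes, chebi_nodes], key_fields=["id"])
for node in nodes:
    print(node["id"], node["name"])
```

Edges could be de-duplicated the same way by passing a composite key such as `["subject", "predicate", "object"]`.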
Data
The final merged data is available here