Evaluation for primekg

Evaluator: Not specified

Evaluated on: 2025-08-14

This is a manual evaluation intended to identify potential barriers to reuse.


Access Level and Types

Question | Answer | Comment
Access to data outside of the knowledge graph | Y | ClinicalBERT-based embeddings were used to group disease nodes, providing an embedding-derived version of the graph
API or online access to the knowledge graph | N |
Multiple access options available | Y | Available via Harvard Dataverse as the raw KG (kg_raw.csv) and the largest connected component (kg_giant.csv): https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM
Source code availability | Y | Full source code is available on GitHub: https://github.com/mims-harvard/PrimeKG
Downloadable knowledge graph | Y | The Harvard Dataverse repository hosts the downloadable KG and intermediate files: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM

Section Score: 4/5
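
As a quick check of the downloadable files noted above, the edges per relation type can be tallied with the Python standard library. This is a minimal sketch, not part of the PrimeKG tooling: the column name "relation" is an assumption that should be checked against the header of the downloaded CSV, and the in-memory rows are placeholders rather than real PrimeKG data.

```python
import csv
import io
from collections import Counter

# Assumed column name; verify against the header of the downloaded file.
RELATION_COL = "relation"


def count_relations(handle):
    """Count edges per relation type from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return Counter(row[RELATION_COL] for row in reader)


# Tiny in-memory stand-in for a downloaded file.
# These rows are placeholders for illustration, not real PrimeKG content.
sample = io.StringIO(
    "relation,x_id,x_type,y_id,y_type\n"
    "rel_a,n1,type_1,n2,type_2\n"
    "rel_a,n3,type_1,n4,type_2\n"
    "rel_b,n1,type_1,n5,type_3\n"
)
relation_counts = count_relations(sample)

# With a real download, the same function would be called as:
#   with open("kg_giant.csv", newline="") as f:
#       relation_counts = count_relations(f)
```

Streaming rows through csv.DictReader avoids loading the whole file into memory, which may matter for a graph of this size.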

Provenance of Nodes and Edges

Question | Answer | Comment
Source list provided | Y | The paper lists 20 primary data sources, including DisGeNET, DrugBank, UMLS, and Orphanet: https://www.nature.com/articles/s41597-023-01960-3
Source versions information | Y | Explicit versions and download dates are provided for each dataset in the Methods/Data Records section
Import dependencies | Y | Partially: tools such as goatools, beautifulsoup, regex scripts, and vocabulary mappings are mentioned in the GitHub repo, but not all formal dependencies are listed
Node and edge sources | Y | Each node records its source; edges are annotated by type and origin
Edges deduplication | Y | Duplicates and self-loops were removed during KG preprocessing and merging
Triples source details | Y | Clear documentation of which triples were derived from which resource (e.g., drug–protein from DrugBank, phenotype–disease from HPO)
Edge type schema | Y | The paper documents a schema of 30 edge types and their origin ontologies

Section Score: 7/7
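
The deduplication step recorded above (dropping duplicate edges and self-loops) can be illustrated with a short sketch. This is a generic illustration under stated assumptions, not the PrimeKG authors' implementation: edges are modeled as hypothetical (relation, source, target) tuples, and only exact duplicates are treated as duplicates, so reversed edges are kept.

```python
def clean_edges(edges):
    """Drop self-loops and exact duplicate edges, keeping first-seen order."""
    seen = set()
    cleaned = []
    for relation, source, target in edges:
        if source == target:
            # Self-loop: the edge connects a node to itself.
            continue
        key = (relation, source, target)
        if key in seen:
            # Exact duplicate of an edge already kept.
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned


# Placeholder edges for illustration only.
edges = [
    ("rel_a", "n1", "n2"),
    ("rel_a", "n1", "n2"),  # duplicate
    ("rel_b", "n3", "n3"),  # self-loop
    ("rel_a", "n2", "n1"),  # reversed direction, kept in this sketch
]
cleaned = clean_edges(edges)
```

Whether reversed edges count as duplicates depends on whether the graph is treated as directed; that choice is left open here.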

Documented standards, schema, construction

Question | Answer | Comment
Biological usable data | Y | Clinical and pharmacological text features are readable and interpretable (e.g., Mayo Clinic descriptions, DrugBank pharmacodynamics)
Resolvable IDs | Y | Uses Mondo, DrugBank, HPO, MeSH, Entrez Gene IDs, and UMLS CUIs, which are mappable and resolvable via external resources
Construction documentation | Y | Construction is documented extensively in the paper and the GitHub repo
Transformation documentation | Y | Transformations such as self-loop removal, duplicate dropping, phenotype-disease resolution, and mapping across ontologies are documented
Schema used | Y | Node and edge formats, and their standardized schema, are explained in the methodology and data files

Section Score: 5/5

Update frequency and versioning

Question | Answer | Comment
Stable versions | N | No version tags (e.g., v1.0, v1.1) are mentioned or used on Dataverse or GitHub
Public tracker information |  | The GitHub Issues tab is not actively used for public feature requests or bug tracking
Knowledge graph contact information | Y | Maintained by the Zitnik Lab at Harvard, with lab contact and GitHub maintainers listed
Updated annually | N | Only one release (May 2022) is available to date
Prior versions access | N | No archived prior versions and no changelog indicating updates

Section Score: 1/4

Evaluation - Metrics and Fitness for Purpose

Question | Answer | Comment
Use case provided | Y | An autism case study demonstrates disease concept resolution and clinical alignment
Evaluation against other models | Y | Compared to other KGs (e.g., SPOKE); benchmarks and references to prior systems are included
Defined scope | Y | Focused on disease-centric precision medicine with defined coverage: 17,080 diseases, 10 biological scales, 20 sources
Multiple evaluation methods | Y | Structural connectivity, edge density, text embedding-based grouping, and clinical relevance were tested
Accuracy metrics | Y | Partially: uses similarity thresholds (e.g., cosine ≥ 0.98 for disease grouping); no formal metrics such as precision/recall are provided

Section Score: 5/5
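
The similarity threshold noted under accuracy metrics (cosine ≥ 0.98 for disease grouping) can be made concrete with a small sketch. It only shows how such a threshold selects candidate pairs; it is not the grouping procedure used in the paper. The function names, node labels, and two-dimensional vectors are hypothetical stand-ins for the ClinicalBERT-based embeddings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_pairs(embeddings, threshold=0.98):
    """Return name pairs whose embeddings meet the similarity threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(embeddings), 2)
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]


# Hypothetical toy embeddings; real embeddings would have many more dimensions.
embeddings = {
    "d1": [1.0, 0.0],
    "d2": [0.99, 0.05],
    "d3": [0.0, 1.0],
}
pairs = similar_pairs(embeddings)
```

A threshold alone yields candidate pairs without any measure of how many are correct, which is why the absence of precision/recall is recorded as a partial gap.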

License Information

Question | Answer | Comment
License | CC BY 4.0 |