Evaluation for primekg

Evaluator: Not specified

Evaluated on: 2025-08-14

This is a manual evaluation intended to identify potential barriers to reuse.

Access Level and Types

Question	Answer	Comment
Access to data outside of the knowledge graph	Y	ClinicalBERT-based embeddings were used to group disease nodes, providing an embedding-derived version of the graph
API or online access to the knowledge graph	N
Multiple access options available	Y	Available via Harvard Dataverse as the raw KG (kg_raw.csv) and its largest connected component (kg_giant.csv) https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM
Source code availability	Y	Full source code is available on GitHub https://github.com/mims-harvard/PrimeKG
Downloadable knowledge graph	Y	Harvard Dataverse Repo hosts the downloadable KG and intermediate files https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM

Section Score: 4/5
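
To illustrate how a raw edge list relates to its largest connected component, here is a minimal sketch using only the Python standard library. It is not the authors' pipeline. The column names `x_index` and `y_index`, the helper names, and the inline sample rows are assumptions for illustration; check the header of the downloaded file before reusing it.

```python
import csv
import io
from collections import defaultdict, deque


def read_edges(handle, x_col="x_index", y_col="y_index"):
    """Read (x, y) node pairs from a CSV handle. Column names are assumed."""
    reader = csv.DictReader(handle)
    return [(row[x_col], row[y_col]) for row in reader]


def largest_connected_component(edges):
    """Return the node set of the largest component, treating edges as undirected."""
    adjacency = defaultdict(set)
    for x, y in edges:
        adjacency[x].add(y)
        adjacency[y].add(x)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Made-up sample standing in for the downloaded CSV (not real PrimeKG rows).
SAMPLE = """x_index,y_index
0,1
1,2
3,4
"""

sample_edges = read_edges(io.StringIO(SAMPLE))
giant = largest_connected_component(sample_edges)
# Keep only edges whose endpoints both fall inside the largest component.
giant_edges = [(x, y) for x, y in sample_edges if x in giant and y in giant]
```

For the real file, replace the `io.StringIO(SAMPLE)` handle with `open(path, newline="")`, where `path` points at the downloaded CSV.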

Provenance of Nodes and Edges

Question	Answer	Comment
Source list provided	Y	The paper lists 20 primary data sources, including DisGeNET, DrugBank, UMLS, and Orphanet https://www.nature.com/articles/s41597-023-01960-3
Source versions information	Y	Explicit versions and download dates provided for each dataset in the Methods/Data Records section
Import dependencies	Y	Partially - Tools such as goatools and BeautifulSoup, regex scripts, and vocabulary mappings are mentioned in the GitHub repo, but no complete formal dependency list is provided
Node and edge sources	Y	Each node carries a node source field; edges are annotated by type and origin
Edges deduplication	Y	Duplicates and self-loops were removed during KG preprocessing and merging
Triples source details	Y	Clear documentation on what triples were derived from which resource (e.g., drug–protein from DrugBank, phenotype–disease from HPO)
Edge type schema	Y	The paper documents a schema of 30 edge types and their source ontologies

Section Score: 7/7
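
The deduplication and self-loop removal noted in the table can be sketched as follows. This is a hedged illustration, not the code from the PrimeKG repository. It assumes a duplicate means an exact repeat of a directed (relation, x, y) triple; whether reversed edges count as duplicates is not stated here. The function name and the sample identifiers are hypothetical.

```python
def clean_edges(edges):
    """Drop self-loops and exact duplicate (relation, x, y) triples.

    First-seen order is preserved. Direction matters: (r, a, b) and
    (r, b, a) are treated as different edges (an assumption).
    """
    seen = set()
    cleaned = []
    for relation, x, y in edges:
        if x == y:
            continue  # self-loop: both endpoints are the same node
        key = (relation, x, y)
        if key in seen:
            continue  # exact duplicate of an edge already kept
        seen.add(key)
        cleaned.append(key)
    return cleaned


# Made-up triples for illustration only.
raw_edges = [
    ("drug_protein", "D1", "P1"),
    ("drug_protein", "D1", "P1"),   # duplicate
    ("protein_protein", "P1", "P1"),  # self-loop
    ("disease_phenotype", "X1", "H1"),
]
cleaned_edges = clean_edges(raw_edges)
```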

Documented standards, schema, construction

Question	Answer	Comment
Biological usable data	Y	Clinical and pharmacological text features are readable and interpretable (e.g., Mayo Clinic descriptions, DrugBank pharmacodynamics)
Resolvable IDs	Y	Uses Mondo, DrugBank, HPO, MeSH, Entrez Gene IDs, and UMLS CUIs, which are mappable and resolvable via external resources
Construction documentation	Y	Construction is documented extensively in the paper and the GitHub repo
Transformation documentation	Y	Transformations like self-loop removal, duplicate dropping, phenotype-disease resolution, and mapping across ontologies are documented
Schema used	Y	Node and edge formats, and their standardized schema, are explained in the methodology and data files

Section Score: 5/5

Update frequency and versioning

Question	Answer	Comment
Stable versions	N	No version tags (e.g., v1.0, v1.1) are mentioned or used on Dataverse or GitHub
Public tracker information		GitHub Issues tab is not actively used for public feature requests or bug tracking
Knowledge graph contact information	Y	Maintained by Zitnik Lab at Harvard with lab contact and GitHub maintainers listed
Updated annually	N	Only one release (May 2022) is available as of this evaluation
Prior versions access	N	No archived prior versions or changelog indicating updates

Section Score: 1/4

Evaluation - Metrics and Fitness for Purpose

Question	Answer	Comment
Use case provided	Y	Autism case study demonstrates disease concept resolution and clinical alignment
Evaluation against other models	Y	Compared to other KGs (e.g., SPOKE); benchmarks and references to prior systems included
Defined scope	Y	Focused on disease-centric precision medicine with defined coverage: 17,080 diseases, 10 biological scales, 20 sources
Multiple evaluation methods	Y	Structure connectivity, edge density, text embedding-based grouping, and clinical relevance tested
Accuracy metrics	Y	Partially - Uses similarity thresholds (e.g., cosine ≥ 0.98 for disease grouping); no formal metrics like precision/recall provided

Section Score: 5/5
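
To make the similarity-threshold criterion concrete, here is a minimal sketch of grouping items whose embedding cosine similarity is at least 0.98. It is an illustration under stated assumptions, not the authors' procedure. The toy two-dimensional vectors stand in for real text embeddings, and merging groups transitively with union-find is a choice made for this sketch.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_similarity(embeddings, threshold=0.98):
    """Group names whose pairwise cosine similarity meets the threshold.

    Groups are merged transitively (an assumption of this sketch).
    """
    names = list(embeddings)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


# Toy vectors for illustration only; real embeddings have many more dimensions.
toy_embeddings = {
    "disease_a": [1.0, 0.0],
    "disease_b": [0.99, 0.05],
    "disease_c": [0.0, 1.0],
}
disease_groups = group_by_similarity(toy_embeddings, threshold=0.98)
```

A threshold this strict only tells you which items were merged. Reporting precision and recall would additionally require a labeled reference set of correct groupings.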

License Information

Question	Answer	Comment
License		CC BY 4.0