The KG-Registry now includes a Parquet backend that provides enhanced querying capabilities while keeping data size manageable and maintaining the human-readable YAML files in the registry directory.
DuckDB and other dependencies are included in the project. To install:
# Using poetry (recommended)
poetry install
# Or using pip
pip install duckdb pyarrow oaklib
The oaklib (Ontology Access Kit) dependency enables hierarchical taxon filtering using the NCBI Taxonomy.
# Sync the registry data to Parquet files
python -m kg_registry.cli parquet sync --yaml-file registry/kgs.yml --output-dir registry/parquet
# Get statistics about the registry
python -m kg_registry.cli parquet stats --parquet-dir registry/parquet
# Query resources by category
python -m kg_registry.cli parquet query --category KnowledgeGraph --parquet-dir registry/parquet
# Query resources by domain
python -m kg_registry.cli parquet query --domain genomics --parquet-dir registry/parquet
# Search resources by name or description
python -m kg_registry.cli parquet query --search "drug" --parquet-dir registry/parquet
# Query resources by taxon (organism/species)
# This automatically includes descendant taxa (e.g., Mammalia includes humans, mice, etc.)
python -m kg_registry.cli parquet query --taxon NCBITaxon:9606 --parquet-dir registry/parquet
# Query by taxon without including descendants
python -m kg_registry.cli parquet query --taxon NCBITaxon:9606 --taxon-include-descendants=false --parquet-dir registry/parquet
from kg_registry.parquet_backend import ParquetBackend, DuckDBParquetQuerier
# Method 1: Load data into memory for querying
with ParquetBackend() as backend:
# Load data from Parquet files
backend.load_from_parquet("registry/parquet")
# Query resources
active_kgs = backend.query_resources(
category="KnowledgeGraph",
activity_status="active"
)
# Search resources
drug_resources = backend.search_resources("drug")
# Query resources by taxon (with hierarchical expansion)
human_resources = backend.query_by_taxon("NCBITaxon:9606") # Human and descendants
# Query by taxon without expanding to descendants
exact_taxon_resources = backend.query_by_taxon("NCBITaxon:9606", include_descendants=False)
# Get statistics
stats = backend.get_resource_stats()
print(f"Total resources: {stats['total_resources']}")
# Method 2: Query Parquet files directly without loading into memory
with DuckDBParquetQuerier("registry/parquet") as querier:
# Execute custom SQL query directly on Parquet files
results = querier.execute_query("""
SELECT r.id, r.name, r.category, COUNT(p.product_id) as product_count
FROM resources r
LEFT JOIN resource_products p ON r.id = p.resource_id
WHERE r.activity_status = 'active'
GROUP BY r.id, r.name, r.category
HAVING COUNT(p.product_id) > 0
ORDER BY product_count DESC
LIMIT 10
""")
from kg_registry.parquet_backend import sync_yaml_to_parquet
# Sync YAML data to Parquet files
count = sync_yaml_to_parquet("registry/kgs.yml", "registry/parquet")
print(f"Synced {count} resources to Parquet files")
parquet syncSynchronize YAML data to Parquet files.
python -m kg_registry.cli parquet sync [OPTIONS]
Options:
--yaml-file TEXT Path to YAML file to sync (default: registry/kgs.yml)
--output-dir TEXT Directory to store Parquet files (default: registry/parquet)
parquet statsShow statistics about the registry from Parquet files.
python -m kg_registry.cli parquet stats [OPTIONS]
Options:
--parquet-dir TEXT Directory containing Parquet files (default: registry/parquet)
parquet queryQuery resources from Parquet files.
python -m kg_registry.cli parquet query [OPTIONS]
Options:
--category TEXT Filter by category
--domain TEXT Filter by domain
--status TEXT Filter by activity status
--search TEXT Search in name or description
--taxon TEXT Filter by taxon (NCBI Taxonomy ID, e.g., NCBITaxon:9606)
--taxon-include-descendants BOOL Include descendant taxa (default: True)
--parquet-dir TEXT Directory containing Parquet files (default: registry/parquet)
The KG-Registry web interface can query Parquet files directly using DuckDB-WASM in the browser. This allows for complex queries without having to load the entire database.
To set up the web frontend with Parquet support:
python -m kg_registry.cli parquet sync
/advanced-search.html will automatically
load the Parquet files from /registry/parquet/ and enable querying.The Parquet backend maintains a copy of the YAML data in Parquet format. To keep it synchronized:
parquet sync command after updating YAML files# Find resources relevant to humans and their descendants
with ParquetBackend() as backend:
backend.load_from_parquet("registry/parquet")
# Get resources for human and all descendant taxa (primates, etc.)
human_resources = backend.query_by_taxon("NCBITaxon:9606")
# Get only resources explicitly annotated with human
exact_human_resources = backend.query_by_taxon("NCBITaxon:9606", include_descendants=False)
# Using DuckDBParquetQuerier for efficient querying without loading into memory
with DuckDBParquetQuerier("registry/parquet") as querier:
# Find active knowledge graphs in genomics with products
results = querier.execute_query("""
SELECT r.*
FROM resources r
JOIN resource_domains d ON r.id = d.resource_id
JOIN resource_products p ON r.id = p.resource_id
WHERE r.category = 'KnowledgeGraph'
AND r.activity_status = 'active'
AND d.domain = 'genomics'
GROUP BY r.id
""")
# Find resources for specific organisms using the resource_taxa table
mammalian_resources = querier.execute_query("""
SELECT DISTINCT r.id, r.name, t.taxon
FROM resources r
JOIN resource_taxa t ON r.id = t.resource_id
WHERE t.taxon IN ('NCBITaxon:40674', 'NCBITaxon:9606', 'NCBITaxon:10116')
ORDER BY r.name
""")
# Get comprehensive domain statistics
with DuckDBParquetQuerier("registry/parquet") as querier:
domain_stats = querier.execute_query("""
SELECT d.domain,
COUNT(DISTINCT r.id) as resource_count,
COUNT(DISTINCT p.product_id) as product_count,
COUNT(DISTINCT CASE WHEN r.activity_status = 'active' THEN r.id END) as active_count
FROM resource_domains d
JOIN resources r ON d.resource_id = r.id
LEFT JOIN resource_products p ON r.id = p.resource_id
GROUP BY d.domain
ORDER BY resource_count DESC
""")
# Get taxon statistics
taxon_stats = querier.execute_query("""
SELECT t.taxon,
COUNT(DISTINCT r.id) as resource_count
FROM resource_taxa t
JOIN resources r ON t.resource_id = r.id
GROUP BY t.taxon
ORDER BY resource_count DESC
""")
# Export to JSON for web interface
import json
with open('domain_report.json', 'w') as f:
json.dump(domain_stats, f, indent=2)
with open('taxon_report.json', 'w') as f:
json.dump(taxon_stats, f, indent=2)
The Parquet backend is designed to replace the full DuckDB database file while preserving all functionality:
When adding new features to the Parquet backend: