Parquet Backend for KG-Registry

The KG-Registry now includes a Parquet backend that provides enhanced querying capabilities while keeping data size manageable and maintaining the human-readable YAML files in the registry directory.

Features

  • Efficient Storage: Parquet format provides columnar storage with better compression than a full DuckDB database
  • Fast Querying: DuckDB can directly query Parquet files without loading the entire dataset
  • Human-Readable Data: Original YAML files remain unchanged and editable
  • Rich Query Interface: Python API and CLI commands for querying resources
  • Hierarchical Taxon Filtering: Query resources by organism/species with automatic expansion to descendant taxa
  • Statistics: Built-in analytics and statistics generation
  • Synchronization: Easy sync from YAML files to Parquet files
  • GitHub-Friendly: Parquet files can be version controlled with reasonable storage size

Installation

DuckDB and other dependencies are included in the project. To install:

# Using poetry (recommended)
poetry install

# Or using pip
pip install duckdb pyarrow oaklib

The oaklib (Ontology Access Kit) dependency enables hierarchical taxon filtering using the NCBI Taxonomy.

Quick Start

1. Sync YAML Data to Parquet

# Sync the registry data to Parquet files
python -m kg_registry.cli parquet sync --yaml-file registry/kgs.yml --output-dir registry/parquet

2. Query Resources

# Get statistics about the registry
python -m kg_registry.cli parquet stats --parquet-dir registry/parquet

# Query resources by category
python -m kg_registry.cli parquet query --category KnowledgeGraph --parquet-dir registry/parquet

# Query resources by domain
python -m kg_registry.cli parquet query --domain genomics --parquet-dir registry/parquet

# Search resources by name or description
python -m kg_registry.cli parquet query --search "drug" --parquet-dir registry/parquet

# Query resources by taxon (organism/species)
# This automatically includes descendant taxa (e.g., Mammalia includes humans, mice, etc.)
python -m kg_registry.cli parquet query --taxon NCBITaxon:9606 --parquet-dir registry/parquet

# Query by taxon without including descendants
python -m kg_registry.cli parquet query --taxon NCBITaxon:9606 --taxon-include-descendants=false --parquet-dir registry/parquet

Python API

Basic Usage

from kg_registry.parquet_backend import ParquetBackend, DuckDBParquetQuerier

# Method 1: Load data into memory for querying
with ParquetBackend() as backend:
    # Load data from Parquet files
    backend.load_from_parquet("registry/parquet")
    
    # Query resources
    active_kgs = backend.query_resources(
        category="KnowledgeGraph",
        activity_status="active"
    )
    
    # Search resources
    drug_resources = backend.search_resources("drug")

    # Query resources by taxon (with hierarchical expansion)
    human_resources = backend.query_by_taxon("NCBITaxon:9606")  # Human and descendants

    # Query by taxon without expanding to descendants
    exact_taxon_resources = backend.query_by_taxon("NCBITaxon:9606", include_descendants=False)

    # Get statistics
    stats = backend.get_resource_stats()
    print(f"Total resources: {stats['total_resources']}")

# Method 2: Query Parquet files directly without loading into memory
with DuckDBParquetQuerier("registry/parquet") as querier:
    # Execute custom SQL query directly on Parquet files
    results = querier.execute_query("""
        SELECT r.id, r.name, r.category, COUNT(p.product_id) as product_count
        FROM resources r
        LEFT JOIN resource_products p ON r.id = p.resource_id
        WHERE r.activity_status = 'active'
        GROUP BY r.id, r.name, r.category
        HAVING COUNT(p.product_id) > 0
        ORDER BY product_count DESC
        LIMIT 10
    """)

Syncing Data

from kg_registry.parquet_backend import sync_yaml_to_parquet

# Sync YAML data to Parquet files
count = sync_yaml_to_parquet("registry/kgs.yml", "registry/parquet")
print(f"Synced {count} resources to Parquet files")

CLI Commands

parquet sync

Synchronize YAML data to Parquet files.

python -m kg_registry.cli parquet sync [OPTIONS]

Options:
  --yaml-file TEXT    Path to YAML file to sync (default: registry/kgs.yml)
  --output-dir TEXT   Directory to store Parquet files (default: registry/parquet)

parquet stats

Show statistics about the registry from Parquet files.

python -m kg_registry.cli parquet stats [OPTIONS]

Options:
  --parquet-dir TEXT  Directory containing Parquet files (default: registry/parquet)

parquet query

Query resources from Parquet files.

python -m kg_registry.cli parquet query [OPTIONS]

Options:
  --category TEXT                    Filter by category
  --domain TEXT                      Filter by domain
  --status TEXT                      Filter by activity status
  --search TEXT                      Search in name or description
  --taxon TEXT                       Filter by taxon (NCBI Taxonomy ID, e.g., NCBITaxon:9606)
  --taxon-include-descendants BOOL   Include descendant taxa (default: True)
  --parquet-dir TEXT                 Directory containing Parquet files (default: registry/parquet)

Web Frontend

The KG-Registry web interface can query Parquet files directly using DuckDB-WASM in the browser. This allows for complex queries without having to load the entire database.

To set up the web frontend with Parquet support:

  1. Export the registry data to Parquet files:
    python -m kg_registry.cli parquet sync
    
  2. The advanced search interface at /advanced-search.html will automatically load the Parquet files from /registry/parquet/ and enable querying.

Benefits over Full DuckDB Database

  1. Size: Parquet files are significantly smaller than a full DuckDB database
  2. Version Control: Parquet files can be effectively tracked in Git
  3. Performance: Queries only read the columns they need
  4. Compatibility: Parquet is an open standard supported by many tools
  5. Portability: Parquet files can be easily shared and used with other systems

Data Synchronization

The Parquet backend maintains a copy of the YAML data in Parquet format. To keep it synchronized:

  1. Manual Sync: Run the parquet sync command after updating YAML files
  2. Automated Sync: Integrate the sync command into your CI/CD pipeline
  3. Programmatic Sync: Use the Python API to sync data in scripts

Example Use Cases

1. Finding Resources by Organism

# Find resources relevant to humans and their descendants
with ParquetBackend() as backend:
    backend.load_from_parquet("registry/parquet")

    # Get resources for human and all descendant taxa (primates, etc.)
    human_resources = backend.query_by_taxon("NCBITaxon:9606")

    # Get only resources explicitly annotated with human
    exact_human_resources = backend.query_by_taxon("NCBITaxon:9606", include_descendants=False)

2. Finding Resources with Complex Criteria

# Using DuckDBParquetQuerier for efficient querying without loading into memory
with DuckDBParquetQuerier("registry/parquet") as querier:
    # Find active knowledge graphs in genomics with products
    results = querier.execute_query("""
        SELECT r.*
        FROM resources r
        JOIN resource_domains d ON r.id = d.resource_id
        JOIN resource_products p ON r.id = p.resource_id
        WHERE r.category = 'KnowledgeGraph'
          AND r.activity_status = 'active'
          AND d.domain = 'genomics'
        GROUP BY r.id
    """)

    # Find resources for specific organisms using the resource_taxa table
    mammalian_resources = querier.execute_query("""
        SELECT DISTINCT r.id, r.name, t.taxon
        FROM resources r
        JOIN resource_taxa t ON r.id = t.resource_id
        WHERE t.taxon IN ('NCBITaxon:40674', 'NCBITaxon:9606', 'NCBITaxon:10116')
        ORDER BY r.name
    """)

3. Generating Analytics Reports

# Get comprehensive domain statistics
with DuckDBParquetQuerier("registry/parquet") as querier:
    domain_stats = querier.execute_query("""
        SELECT d.domain,
               COUNT(DISTINCT r.id) as resource_count,
               COUNT(DISTINCT p.product_id) as product_count,
               COUNT(DISTINCT CASE WHEN r.activity_status = 'active' THEN r.id END) as active_count
        FROM resource_domains d
        JOIN resources r ON d.resource_id = r.id
        LEFT JOIN resource_products p ON r.id = p.resource_id
        GROUP BY d.domain
        ORDER BY resource_count DESC
    """)

    # Get taxon statistics
    taxon_stats = querier.execute_query("""
        SELECT t.taxon,
               COUNT(DISTINCT r.id) as resource_count
        FROM resource_taxa t
        JOIN resources r ON t.resource_id = r.id
        GROUP BY t.taxon
        ORDER BY resource_count DESC
    """)

    # Export to JSON for web interface
    import json
    with open('domain_report.json', 'w') as f:
        json.dump(domain_stats, f, indent=2)
    with open('taxon_report.json', 'w') as f:
        json.dump(taxon_stats, f, indent=2)

Migration from Full DuckDB Database

The Parquet backend is designed to replace the full DuckDB database file while preserving all functionality:

  1. YAML files remain authoritative: All edits should still be made to YAML files
  2. Efficient querying: Use Parquet files for complex queries instead of full database
  3. Backward compatibility: CLI interface maintains the same structure
  4. Web support: Advanced search interface works with both backends

Contributing

When adding new features to the Parquet backend:

  1. Update the backend schema if needed
  2. Add corresponding tests
  3. Update this documentation
  4. Ensure YAML files remain the source of truth