Parquet Backend for KG-Registry

The KG-Registry now includes a Parquet backend that adds SQL-based querying while keeping the stored data compact. The human-readable YAML files in the registry directory stay in place and remain editable.

Features

Efficient Storage: The Parquet format provides columnar storage with better compression than a full DuckDB database
Fast Querying: DuckDB can query Parquet files directly, without first loading the entire dataset into memory
Human-Readable Data: Original YAML files remain unchanged and editable
Rich Query Interface: Python API and CLI commands for querying resources
Statistics: Built-in analytics and statistics generation
Synchronization: Easy sync from YAML files to Parquet files
GitHub-Friendly: Parquet files can be version controlled with reasonable storage size

Installation

DuckDB is included as a project dependency. To install it along with the other dependencies:

# Using poetry (recommended)
poetry install

# Or using pip
pip install duckdb pyarrow

Quick Start

1. Sync YAML Data to Parquet

# Sync the registry data to Parquet files
python -m kg_registry.cli parquet sync --yaml-file registry/kgs.yml --output-dir registry/parquet

2. Query Resources

# Get statistics about the registry
python -m kg_registry.cli parquet stats --parquet-dir registry/parquet

# Query resources by category
python -m kg_registry.cli parquet query --category KnowledgeGraph --parquet-dir registry/parquet

# Query resources by domain
python -m kg_registry.cli parquet query --domain genomics --parquet-dir registry/parquet

# Search resources by name or description
python -m kg_registry.cli parquet query --search "drug" --parquet-dir registry/parquet

Python API

Basic Usage

from kg_registry.parquet_backend import ParquetBackend, DuckDBParquetQuerier

# Method 1: Load data into memory for querying
with ParquetBackend() as backend:
    # Load data from Parquet files
    backend.load_from_parquet("registry/parquet")
    
    # Query resources
    active_kgs = backend.query_resources(
        category="KnowledgeGraph",
        activity_status="active"
    )
    
    # Search resources
    drug_resources = backend.search_resources("drug")
    
    # Get statistics
    stats = backend.get_resource_stats()
    print(f"Total resources: {stats['total_resources']}")

# Method 2: Query Parquet files directly without loading into memory
with DuckDBParquetQuerier("registry/parquet") as querier:
    # Execute custom SQL query directly on Parquet files
    results = querier.execute_query("""
        SELECT r.id, r.name, r.category, COUNT(p.product_id) as product_count
        FROM resources r
        LEFT JOIN resource_products p ON r.id = p.resource_id
        WHERE r.activity_status = 'active'
        GROUP BY r.id, r.name, r.category
        HAVING COUNT(p.product_id) > 0
        ORDER BY product_count DESC
        LIMIT 10
    """)

Syncing Data

from kg_registry.parquet_backend import sync_yaml_to_parquet

# Sync YAML data to Parquet files
count = sync_yaml_to_parquet("registry/kgs.yml", "registry/parquet")
print(f"Synced {count} resources to Parquet files")
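
The SQL examples in this document join `resources`, `resource_domains`, and `resource_products`, which suggests that syncing flattens each nested YAML record into one row set per table. The following is a minimal sketch of that idea, not the actual `sync_yaml_to_parquet` implementation. The helper `flatten_resources` is hypothetical, the record shape (the `domains` and `products` keys) is assumed, and plain dictionaries stand in for parsed YAML so the example has no dependencies.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def flatten_resources(records: List[Row]) -> Dict[str, List[Row]]:
    """Split nested resource records into three flat row sets."""
    tables: Dict[str, List[Row]] = {
        "resources": [],
        "resource_domains": [],
        "resource_products": [],
    }
    for rec in records:
        # One row per resource with its scalar fields.
        tables["resources"].append(
            {
                "id": rec["id"],
                "name": rec.get("name"),
                "category": rec.get("category"),
                "activity_status": rec.get("activity_status"),
            }
        )
        # One row per (resource, domain) pair.
        for domain in rec.get("domains", []):
            tables["resource_domains"].append(
                {"resource_id": rec["id"], "domain": domain}
            )
        # One row per (resource, product) pair.
        for product in rec.get("products", []):
            tables["resource_products"].append(
                {"resource_id": rec["id"], "product_id": product["id"]}
            )
    return tables


# Stand-in for the parsed contents of a registry YAML file (assumed shape).
sample = [
    {
        "id": "kg-a",
        "name": "Example KG A",
        "category": "KnowledgeGraph",
        "activity_status": "active",
        "domains": ["genomics"],
        "products": [{"id": "kg-a.graph"}, {"id": "kg-a.api"}],
    },
    {
        "id": "src-b",
        "name": "Example Source B",
        "category": "DataSource",
        "activity_status": "inactive",
        "domains": ["genomics", "chemistry"],
    },
]

tables = flatten_resources(sample)
for name, rows in tables.items():
    print(name, len(rows))
```

Each row set could then be written to its own Parquet file, which is what makes joins on `resource_id` possible at query time.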

CLI Commands

`parquet sync`

Synchronize YAML data to Parquet files.

python -m kg_registry.cli parquet sync [OPTIONS]

Options:
  --yaml-file TEXT    Path to YAML file to sync (default: registry/kgs.yml)
  --output-dir TEXT   Directory to store Parquet files (default: registry/parquet)

`parquet stats`

Show statistics about the registry from Parquet files.

python -m kg_registry.cli parquet stats [OPTIONS]

Options:
  --parquet-dir TEXT  Directory containing Parquet files (default: registry/parquet)

`parquet query`

Query resources from Parquet files.

python -m kg_registry.cli parquet query [OPTIONS]

Options:
  --category TEXT     Filter by category
  --domain TEXT       Filter by domain
  --status TEXT       Filter by activity status
  --search TEXT       Search in name or description
  --parquet-dir TEXT  Directory containing Parquet files (default: registry/parquet)
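
To illustrate how filters like these can combine, the sketch below builds a parameterized SQL query from optional arguments. It is an assumption about how such a command might work internally, not the CLI's actual code: `build_resource_query` is a hypothetical helper, and the `description` column is inferred from the `--search` help text.

```python
from typing import List, Optional, Tuple


def build_resource_query(
    category: Optional[str] = None,
    domain: Optional[str] = None,
    status: Optional[str] = None,
    search: Optional[str] = None,
) -> Tuple[str, List[str]]:
    """Return a SQL string with `?` placeholders and its parameter list."""
    clauses: List[str] = []
    params: List[str] = []
    if category is not None:
        clauses.append("r.category = ?")
        params.append(category)
    if status is not None:
        clauses.append("r.activity_status = ?")
        params.append(status)
    if domain is not None:
        # Domains live in a separate table, so filter with a correlated subquery.
        clauses.append(
            "EXISTS (SELECT 1 FROM resource_domains d "
            "WHERE d.resource_id = r.id AND d.domain = ?)"
        )
        params.append(domain)
    if search is not None:
        # Case-insensitive substring match on name or description.
        clauses.append("(r.name ILIKE ? OR r.description ILIKE ?)")
        pattern = f"%{search}%"
        params.extend([pattern, pattern])
    sql = "SELECT r.* FROM resources r"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


sql, params = build_resource_query(category="KnowledgeGraph", search="drug")
print(sql)
print(params)
```

Keeping user input in the parameter list rather than in the SQL string avoids quoting problems; the pair can be passed to a DuckDB connection as `con.execute(sql, params)`.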

Web Frontend

The KG-Registry web interface can query Parquet files directly using DuckDB-WASM in the browser. This allows complex queries to run in the browser without loading the entire database.

To set up the web frontend with Parquet support:

Export the registry data to Parquet files:
```
python -m kg_registry.cli parquet sync
```
The advanced search interface at /advanced-search.html will automatically load the Parquet files from /registry/parquet/ and enable querying.

Benefits over Full DuckDB Database

Size: Parquet files are significantly smaller than a full DuckDB database
Version Control: Parquet files can be effectively tracked in Git
Performance: Queries only read the columns they need
Compatibility: Parquet is an open standard supported by many tools
Portability: Parquet files can be easily shared and used with other systems

Data Synchronization

The Parquet backend maintains a copy of the YAML data in Parquet format. The copy is only as current as the last sync, so use one of the following to keep it up to date:

Manual Sync: Run the parquet sync command after updating YAML files
Automated Sync: Integrate the sync command into your CI/CD pipeline
Programmatic Sync: Use the Python API to sync data in scripts
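
For local workflows, a script can check whether a sync is needed before running one. The sketch below compares file modification times; `parquet_is_stale` is a hypothetical helper, not part of the KG-Registry API, and the file names are placeholders. In a fresh CI checkout, modification times usually reflect the checkout rather than the last edit, so running the sync unconditionally is the simpler choice there.

```python
import os
import tempfile
from pathlib import Path


def parquet_is_stale(yaml_file: Path, parquet_dir: Path) -> bool:
    """Return True when the Parquet output is missing or older than the YAML file."""
    parquet_files = list(parquet_dir.glob("*.parquet"))
    if not parquet_files:
        return True
    oldest_output = min(p.stat().st_mtime for p in parquet_files)
    return yaml_file.stat().st_mtime > oldest_output


# Demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    yaml_file = root / "kgs.yml"
    parquet_dir = root / "parquet"
    parquet_dir.mkdir()
    yaml_file.write_text("resources: []\n")

    stale_before = parquet_is_stale(yaml_file, parquet_dir)  # no Parquet files yet

    output = parquet_dir / "resources.parquet"
    output.write_bytes(b"")  # placeholder standing in for a synced file
    # Set explicit timestamps so the comparison does not depend on clock resolution.
    os.utime(yaml_file, (1_000, 1_000))
    os.utime(output, (2_000, 2_000))
    stale_after_sync = parquet_is_stale(yaml_file, parquet_dir)

    os.utime(yaml_file, (3_000, 3_000))  # simulate a later YAML edit
    stale_after_edit = parquet_is_stale(yaml_file, parquet_dir)

print(stale_before, stale_after_sync, stale_after_edit)
```

When the check returns True, the script would go on to call the sync command or `sync_yaml_to_parquet`.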

Example Use Cases

1. Finding Resources with Complex Criteria

from kg_registry.parquet_backend import DuckDBParquetQuerier

# Using DuckDBParquetQuerier for efficient querying without loading into memory
with DuckDBParquetQuerier("registry/parquet") as querier:
    # Find active knowledge graphs in genomics that have at least one product.
    # DISTINCT removes the duplicate rows that the joins produce when a
    # resource has several products.
    results = querier.execute_query("""
        SELECT DISTINCT r.*
        FROM resources r
        JOIN resource_domains d ON r.id = d.resource_id
        JOIN resource_products p ON r.id = p.resource_id
        WHERE r.category = 'KnowledgeGraph'
          AND r.activity_status = 'active'
          AND d.domain = 'genomics'
    """)

2. Generating Analytics Reports

import json

from kg_registry.parquet_backend import DuckDBParquetQuerier

# Get comprehensive domain statistics
with DuckDBParquetQuerier("registry/parquet") as querier:
    domain_stats = querier.execute_query("""
        SELECT d.domain,
               COUNT(DISTINCT r.id) as resource_count,
               COUNT(DISTINCT p.product_id) as product_count,
               COUNT(DISTINCT CASE WHEN r.activity_status = 'active' THEN r.id END) as active_count
        FROM resource_domains d
        JOIN resources r ON d.resource_id = r.id
        LEFT JOIN resource_products p ON r.id = p.resource_id
        GROUP BY d.domain
        ORDER BY resource_count DESC
    """)

    # Export to JSON for web interface
    with open('domain_report.json', 'w') as f:
        json.dump(domain_stats, f, indent=2)
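
The return type of `execute_query` is not specified here. If it yields row tuples, or values such as dates that `json` cannot encode, a small conversion step may be needed before export. The sketch below shows one approach under that assumption; `rows_to_records`, the column list, and the sample rows are all illustrative.

```python
import json
from datetime import date
from typing import Any, Dict, List, Sequence, Tuple


def rows_to_records(columns: Sequence[str], rows: List[Tuple]) -> List[Dict[str, Any]]:
    """Pair each row tuple with the column names to get one dict per row."""
    return [dict(zip(columns, row)) for row in rows]


# Made-up column names and rows, standing in for a query result.
columns = ["domain", "resource_count", "last_updated"]
rows = [
    ("genomics", 12, date(2024, 1, 15)),
    ("chemistry", 7, date(2024, 2, 1)),
]

records = rows_to_records(columns, rows)
# default=str converts values that json cannot encode natively (here, dates).
report = json.dumps(records, indent=2, default=str)
print(report)
```

Keyed records are usually easier for a web interface to consume than positional rows, since the consumer does not need to know the column order.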

Migration from Full DuckDB Database

The Parquet backend is designed to replace the full DuckDB database file while preserving all functionality:

YAML files remain authoritative: All edits should still be made to YAML files
Efficient querying: Use Parquet files for complex queries instead of the full database
Backward compatibility: The CLI keeps the same command structure
Web support: Advanced search interface works with both backends

Contributing

When adding new features to the Parquet backend:

Update the backend schema if needed
Add corresponding tests
Update this documentation
Ensure YAML files remain the source of truth