Parquet is a columnar storage file format designed for efficient use with big-data processing frameworks. Because a reader can fetch only the columns and row groups it needs, these files can be queried directly with tools like DuckDB, Apache Spark, or pandas without loading the entire dataset into memory.
File | Description | Download |
---|---|---|
resources.parquet | Main resources table containing all knowledge graph registry entries | Download |
resource_domains.parquet | Resource-domain relationships for easier querying by domain | Download |
resource_products.parquet | Products associated with each resource | Download |
These Parquet files can be queried using various tools:
**Python (pandas)**

```python
import pandas as pd

# Read a Parquet file into a DataFrame
resources_df = pd.read_parquet('resources.parquet')

# Filter to active resources
active_resources = resources_df[resources_df['activity_status'] == 'active']
print(f"Total active resources: {len(active_resources)}")
```
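The `resource_domains.parquet` table can be joined back to the main table with a pandas merge. The sketch below uses hypothetical miniature tables and assumed column names (`id`, `resource_id`, `domain`); the real registry columns may differ.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; the actual
# column names in resources.parquet / resource_domains.parquet may differ.
resources_df = pd.DataFrame({
    "id": ["r1", "r2"],
    "name": ["Resource One", "Resource Two"],
})
domains_df = pd.DataFrame({
    "resource_id": ["r1", "r1", "r2"],
    "domain": ["biology", "chemistry", "biology"],
})

# Inner join: one row per (resource, domain) pair
merged = resources_df.merge(domains_df, left_on="id", right_on="resource_id")
print(merged[["name", "domain"]])
```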
**DuckDB**

```python
import duckdb

# Connect to an in-memory database
conn = duckdb.connect(':memory:')

# Query directly from the Parquet file
result = conn.execute("""
    SELECT category, COUNT(*) AS count
    FROM 'resources.parquet'
    GROUP BY category
    ORDER BY count DESC
""").fetchall()

for category, count in result:
    print(f"{category}: {count} resources")
```
**R**

```r
library(arrow)

# Read the Parquet file
resources <- read_parquet("resources.parquet")

# Explore the data
summary(resources)

# Filter to active resources
active <- resources[resources$activity_status == "active", ]
print(paste("Total active resources:", nrow(active)))
```
For more information about using these Parquet files with the KG Registry, see our Parquet backend documentation.
Try the Advanced Search to query these files directly in your browser using SQL.