MetaDB
A unified query layer for cloud-native data analytics
Overview
MetaDB is Xperiflow's high-performance query interface for analytics on cloud-stored data. It enables SQL-based exploration and analysis of large datasets stored across distributed cloud storage systems—without the complexity of managing connections, credentials, or knowing exact storage locations.
At its core, MetaDB provides three essential capabilities:
- Unified Protocol Access — Query files across different storage zones using intuitive protocols like shared:// or routine://
- Cloud Abstraction — Interact with Azure Blob Storage as if it were a local database
- Analytical Performance — Leverage DuckDB's columnar processing engine optimized for big data analytics
Architecture Overview
The Problem MetaDB Solves
The Challenge of Cloud Data Access
Modern data platforms store vast amounts of data in cloud object storage. While this approach offers scalability and cost-efficiency, it introduces complexity:
| Challenge | Traditional Approach | MetaDB Approach |
|---|---|---|
| Storage Paths | Know exact container names, account URLs, and folder structures | Use simple protocol prefixes like shared:// |
| Authentication | Manage connection strings, SAS tokens, or service principals | Automatic credential injection |
| Data Formats | Load data into memory before querying | Query Parquet files directly in-place |
| Performance | Full file downloads for simple queries | Columnar pushdown—read only what you need |
Why Not Just Use Traditional Databases?
Traditional databases require data ingestion—moving data from storage into database tables before querying. This creates:
- Latency: Minutes or hours of ETL before analysis
- Duplication: Data exists in storage AND the database
- Cost: Compute resources for transformation and storage
MetaDB eliminates this by querying data where it lives—directly in cloud storage.
Core Concepts
Protocols: Your Data Address Book
Protocols in MetaDB are intuitive prefixes that map to physical storage locations. Think of them as bookmarks to different areas of the platform's data landscape.
| Protocol | Purpose | Example Use Case |
|---|---|---|
| shared:// | Platform-wide shared data | Reference data, lookup tables, configurations |
| routine:// | Routine execution artifacts | Model outputs, intermediate results, logs |
| framework:// | Framework and system data | Schema definitions, system metadata |
Conceptual Example:
Query: "Show me all records from the sales forecast"

Instead of:

`az://site-hash-abc123/shared/forecasts/sales_q4.parquet`

you write:

`shared://forecasts/sales_q4.parquet`
The protocol system provides:
- Simplicity — Human-readable paths instead of cloud URIs
- Portability — Same query works across environments (dev, staging, production)
- Security — Access control enforced at the protocol level
The Alias Resolution Engine
When you write a query using protocols, MetaDB's Alias Resolution Engine translates your friendly paths into actual cloud storage addresses.
This translation happens transparently—you never see or manage the underlying cloud paths.
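A minimal sketch of how such an alias resolver could work, assuming a simple prefix-to-URL mapping. The `PROTOCOL_MAP` values below are illustrative stand-ins, not the platform's real storage layout (only the `az://site-hash-abc123/shared/` example appears in this document):

```python
# Sketch of protocol-alias resolution. The mapping values are
# illustrative assumptions, not the platform's actual layout.
PROTOCOL_MAP = {
    "shared://": "az://site-hash-abc123/shared/",
    "routine://": "az://site-hash-abc123/routine/",
    "framework://": "az://site-hash-abc123/framework/",
}

def resolve(path: str) -> str:
    """Translate a protocol path into a physical cloud storage URL."""
    for prefix, base in PROTOCOL_MAP.items():
        if path.startswith(prefix):
            return base + path[len(prefix):]
    raise ValueError(f"Unknown protocol in path: {path}")

print(resolve("shared://forecasts/sales_q4.parquet"))
# -> az://site-hash-abc123/shared/forecasts/sales_q4.parquet
```

The real engine presumably also handles credential injection and access control; this sketch covers only the path-translation step.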
Powered by DuckDB
MetaDB is built on DuckDB, an embedded analytical database engine. DuckDB was chosen for several key reasons:
Columnar Storage Processing
Unlike traditional row-based databases, DuckDB processes data column-by-column. When you query only 3 columns from a 50-column Parquet file, DuckDB reads only those 3 columns—reducing I/O by up to 90%.
Vectorized Execution
DuckDB processes data in batches (vectors) rather than row-by-row, leveraging modern CPU architectures for parallel processing.
Zero-Copy Reads
Data flows directly from cloud storage to query results without intermediate staging.
How MetaDB Differs from MetaFileSystem
Xperiflow provides two complementary systems for working with cloud storage:
| Capability | MetaFileSystem | MetaDB |
|---|---|---|
| Primary Use | File operations | Data analytics |
| Operations | Read, write, copy, delete files | Query, aggregate, join, filter |
| Interface | File system API | SQL queries |
| Data Format | Any file type | Optimized for Parquet/CSV |
| Use Case | "Save this model artifact" | "What's the average across all records?" |
Think of it this way:
- MetaFileSystem = Your cloud file explorer
- MetaDB = Your cloud SQL workbench
Both systems share the same protocol conventions (shared://, routine://, framework://), ensuring a consistent mental model across the platform.
Storage Architecture
The Three Storage Zones
MetaDB organizes data into logical zones, each serving a distinct purpose:
| Zone | Protocol | Purpose | Contents |
|---|---|---|---|
| Shared | shared:// | Platform-wide data | Reference datasets, lookups, global configs |
| Framework | framework:// | System-level data | Schemas, metadata, templates |
| Routine | routine:// | Execution data | Instance data, run artifacts, method outputs |
Storage Zone Details
Shared Zone (shared://)
The shared zone contains platform-wide data accessible across all contexts:
- Reference datasets (country codes, currency mappings)
- Lookup tables shared between routines
- Global configuration data
Framework Zone (framework://)
System-level data supporting the Xperiflow runtime:
- Schema definitions
- Framework metadata
- System templates
Routine Zone (routine://)
Organized by routine instance and execution run:
| Path | Description |
|---|---|
| routine://instances/ | Root for all routine instances |
| routine://instances/[instance_id]/ | Specific routine instance |
| routine://instances/[instance_id]/shared/ | Instance member data (persists across runs) |
| routine://instances/[instance_id]/runs/[run_id]/ | Single execution run |
| routine://instances/[instance_id]/runs/[run_id]/artifacts/ | Output artifacts from the run |
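The layout above can be captured in a small path-building helper. The `artifact_path` function and its argument values are illustrative, not a platform API:

```python
def artifact_path(instance_id: str, run_id: str, name: str) -> str:
    """Compose a routine-zone artifact path following the documented layout."""
    return (f"routine://instances/{instance_id}"
            f"/runs/{run_id}/artifacts/{name}")

print(artifact_path("inst-01", "run-42", "forecast.parquet"))
# -> routine://instances/inst-01/runs/run-42/artifacts/forecast.parquet
```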
Data Locality and Performance
The Principle of Data Locality
MetaDB is designed around a fundamental principle: move computation to the data, not data to the computation.
Traditional: Slow and expensive due to full data transfer
MetaDB: Fast and efficient—process data in-place
How DuckDB Enables This
- Predicate Pushdown: Filters are applied at the storage level, reducing data transfer
- Projection Pushdown: Only requested columns are read from Parquet files
- Partition Pruning: When using Hive partitioning, entire file groups are skipped
Example Impact:
| Scenario | Traditional Approach | MetaDB Approach |
|---|---|---|
| Query 1M rows, need 100 | Download 1M rows, filter locally | Push filter, return 100 rows |
| 50-column table, need 3 | Download all 50 columns | Read only 3 columns |
| 2 years of data, need January | Download 24 months | Read only January partition |
Parquet: The Preferred Format
MetaDB is optimized for Apache Parquet files—a columnar storage format designed for analytics.
Why Parquet?
| Feature | CSV | Parquet |
|---|---|---|
| Storage | Row-based text format | Column-based binary format |
| Compression | Poor | Excellent (up to 90% smaller) |
| Column Selection | Must read all columns | Read only what's needed |
| Schema | Inferred at read time | Embedded with data types |
| Performance | Good for small files | Optimized for analytics at scale |
Hive Partitioning (Future Direction)
Xperiflow is moving toward Hive-style partitioning for large datasets. In this scheme, folder names encode partition values:
| Path | Partition Values |
|---|---|
| sales_data/year=2024/month=01/data.parquet | year=2024, month=01 |
| sales_data/year=2024/month=02/data.parquet | year=2024, month=02 |
| sales_data/year=2025/month=01/data.parquet | year=2025, month=01 |
When you query with a filter on year or month, only relevant partitions are scanned—dramatically improving performance for time-series and historical data.
Integration with Xperiflow Components
Routines and Artifacts
When routines execute, they produce artifacts stored in the routine zone. MetaDB provides direct query access to these artifacts:
Vector Storage (MetaVectorDB)
For AI/ML workloads, MetaVectorDB extends MetaDB with vector similarity search capabilities—storing embeddings as Parquet files and using DuckDB for efficient nearest-neighbor queries.
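As a toy illustration of the nearest-neighbor idea (the embeddings and document names below are made up, and MetaVectorDB's actual interface is not described in this document):

```python
import math

# Toy cosine-similarity nearest-neighbor search over embeddings,
# sketching what MetaVectorDB does at scale. Data is invented.
embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [0.7, 0.7],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

query = [1.0, 0.1]
best = max(embeddings, key=lambda k: cosine(query, embeddings[k]))
print(best)  # doc_a
```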
Best Practices
When to Use MetaDB
✅ Use MetaDB for:
- Ad-hoc data exploration and analysis
- Aggregating data across multiple Parquet files
- Joining reference data with routine outputs
- Quick insights without data loading delays
❌ Consider alternatives for:
- File management operations (use MetaFileSystem)
- Transactional updates (use SQL Server)
- Real-time streaming (use appropriate streaming tools)
Query Optimization Tips
- Filter Early: Apply WHERE clauses to reduce data scanned
- Select Specific Columns: Avoid `SELECT *` when you need only a few columns
- Leverage Partitions: Structure queries to align with partition columns
- Use Appropriate Protocols: Query from the correct zone for fastest access
Glossary
| Term | Definition |
|---|---|
| Alias Resolution | The process of translating protocol paths (e.g., shared://) to actual cloud storage URLs |
| Columnar Storage | Data organization by columns rather than rows, optimized for analytical queries |
| DuckDB | The embedded analytical database engine powering MetaDB |
| Hive Partitioning | A directory-based partitioning scheme where folder names encode partition values |
| MetaFileSystem | Xperiflow's file operation interface (complementary to MetaDB) |
| Parquet | A columnar file format optimized for big data analytics |
| Protocol | A prefix (e.g., shared://, routine://) that maps to a storage zone |
| Pushdown | Optimization where operations (filters, projections) are executed at the storage level |
| Vectorized Execution | Processing data in batches for CPU efficiency |