Author: Luke Heberling, Created: 2026-03-27

MetaDB

A unified query layer for cloud-native data analytics


Overview

MetaDB is Xperiflow's high-performance query interface for analytics on cloud-stored data. It enables SQL-based exploration and analysis of large datasets stored across distributed cloud storage systems—without the complexity of managing connections, credentials, or knowing exact storage locations.

At its core, MetaDB provides three essential capabilities:

  1. Unified Protocol Access — Query files across different storage zones using intuitive protocols like shared:// or routine://
  2. Cloud Abstraction — Interact with Azure Blob Storage as if it were a local database
  3. Analytical Performance — Leverage DuckDB's columnar processing engine optimized for big data analytics

Architecture Overview

[Diagram: MetaDB architecture overview]

The Problem MetaDB Solves

The Challenge of Cloud Data Access

Modern data platforms store vast amounts of data in cloud object storage. While this approach offers scalability and cost-efficiency, it introduces complexity:

| Challenge | Traditional Approach | MetaDB Approach |
| --- | --- | --- |
| Storage Paths | Know exact container names, account URLs, and folder structures | Use simple protocol prefixes like `shared://` |
| Authentication | Manage connection strings, SAS tokens, or service principals | Automatic credential injection |
| Data Formats | Load data into memory before querying | Query Parquet files directly in place |
| Performance | Full file downloads for simple queries | Columnar pushdown: read only what you need |

Why Not Just Use Traditional Databases?

Traditional databases require data ingestion—moving data from storage into database tables before querying. This creates:

  • Latency: Minutes or hours of ETL before analysis
  • Duplication: Data exists in storage AND the database
  • Cost: Compute resources for transformation and storage

MetaDB eliminates this by querying data where it lives—directly in cloud storage.


Core Concepts

Protocols: Your Data Address Book

Protocols in MetaDB are intuitive prefixes that map to physical storage locations. Think of them as bookmarks to different areas of the platform's data landscape.

| Protocol | Purpose | Example Use Case |
| --- | --- | --- |
| `shared://` | Platform-wide shared data | Reference data, lookup tables, configurations |
| `routine://` | Routine execution artifacts | Model outputs, intermediate results, logs |
| `framework://` | Framework and system data | Schema definitions, system metadata |

Conceptual Example:

Query: "Show me all records from the sales forecast"

Instead of: az://site-hash-abc123/shared/forecasts/sales_q4.parquet

You write: shared://forecasts/sales_q4.parquet

The protocol system provides:

  • Simplicity — Human-readable paths instead of cloud URIs
  • Portability — Same query works across environments (dev, staging, production)
  • Security — Access control enforced at the protocol level

The Alias Resolution Engine

When you write a query using protocols, MetaDB's Alias Resolution Engine translates your friendly paths into actual cloud storage addresses.

[Diagram: alias resolution flow from protocol path to cloud storage URL]

This translation happens transparently—you never see or manage the underlying cloud paths.
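The core idea can be sketched in a few lines. The zone-to-URL mapping below is hypothetical (the real engine derives mappings from platform configuration and enforces access control); it only illustrates the translation step.

```python
# Hypothetical zone-to-URL mapping; MetaDB derives the real one from
# platform configuration, so these container names are placeholders.
ZONE_ROOTS = {
    "shared": "az://site-container/shared",
    "routine": "az://site-container/routines",
    "framework": "az://site-container/framework",
}

def resolve(path: str) -> str:
    """Translate a protocol path like shared://forecasts/sales_q4.parquet
    into a concrete cloud storage URL."""
    scheme, _, rest = path.partition("://")
    if scheme not in ZONE_ROOTS:
        raise ValueError(f"unknown protocol: {scheme}://")
    return f"{ZONE_ROOTS[scheme]}/{rest}"
```

Because every query goes through this one translation step, the same query text works unchanged across environments: only the mapping differs.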

Powered by DuckDB

MetaDB is built on DuckDB, an embedded analytical database engine. DuckDB was chosen for several key reasons:

Columnar Storage Processing

Unlike traditional row-based databases, DuckDB processes data column-by-column. When you query only 3 columns from a 50-column Parquet file, DuckDB reads only those 3 columns—reducing I/O by up to 90%.

[Diagram: columnar reads selecting only the queried columns]

Vectorized Execution

DuckDB processes data in batches (vectors) rather than row-by-row, leveraging modern CPU architectures for parallel processing.

Zero-Copy Reads

Data flows directly from cloud storage to query results without intermediate staging.


How MetaDB Differs from MetaFileSystem

Xperiflow provides two complementary systems for working with cloud storage:

| Capability | MetaFileSystem | MetaDB |
| --- | --- | --- |
| Primary Use | File operations | Data analytics |
| Operations | Read, write, copy, delete files | Query, aggregate, join, filter |
| Interface | File system API | SQL queries |
| Data Format | Any file type | Optimized for Parquet/CSV |
| Use Case | "Save this model artifact" | "What's the average across all records?" |

Think of it this way:

  • MetaFileSystem = Your cloud file explorer
  • MetaDB = Your cloud SQL workbench

Both systems share the same protocol conventions (shared://, routine://, framework://), ensuring a consistent mental model across the platform.


Storage Architecture

The Three Storage Zones

MetaDB organizes data into logical zones, each serving a distinct purpose:

| Zone | Protocol | Purpose | Contents |
| --- | --- | --- | --- |
| Shared | `shared://` | Platform-wide data | Reference datasets, lookups, global configs |
| Framework | `framework://` | System-level data | Schemas, metadata, templates |
| Routine | `routine://` | Execution data | Instance data, run artifacts, method outputs |

Storage Zone Details

Shared Zone (shared://)

The shared zone contains platform-wide data accessible across all contexts:

  • Reference datasets (country codes, currency mappings)
  • Lookup tables shared between routines
  • Global configuration data

Framework Zone (framework://)

System-level data supporting the Xperiflow runtime:

  • Schema definitions
  • Framework metadata
  • System templates

Routine Zone (routine://)

Organized by routine instance and execution run:

| Path | Description |
| --- | --- |
| `routine://instances/` | Root for all routine instances |
| `routine://instances/[instance_id]/` | Specific routine instance |
| `routine://instances/[instance_id]/shared/` | Instance member data (persists across runs) |
| `routine://instances/[instance_id]/runs/[run_id]/` | Single execution run |
| `routine://instances/[instance_id]/runs/[run_id]/artifacts/` | Output artifacts from the run |
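Since these paths follow a fixed layout, building them can be reduced to string assembly. The helper below is a sketch of that layout, not a platform API; the instance and run identifiers are placeholders.

```python
def artifact_path(instance_id: str, run_id: str, name: str) -> str:
    """Build a routine-zone artifact path following the layout above.

    Illustrative helper; instance_id and run_id are placeholder values
    supplied by the routine execution context in practice.
    """
    return f"routine://instances/{instance_id}/runs/{run_id}/artifacts/{name}"
```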


Data Locality and Performance

The Principle of Data Locality

MetaDB is designed around a fundamental principle: move computation to the data, not data to the computation.

[Diagram: moving computation to the data vs. transferring data to compute]

Traditional: Slow and expensive due to full data transfer

MetaDB: Fast and efficient—process data in-place

How DuckDB Enables This

  1. Predicate Pushdown: Filters are applied at the storage level, reducing data transfer
  2. Projection Pushdown: Only requested columns are read from Parquet files
  3. Partition Pruning: When using Hive partitioning, entire file groups are skipped

Example Impact:

| Scenario | Traditional Approach | MetaDB Approach |
| --- | --- | --- |
| Query 1M rows, need 100 | Download 1M rows, filter locally | Push filter, return 100 rows |
| 50-column table, need 3 | Download all 50 columns | Read only 3 columns |
| 2 years of data, need January | Download 24 months | Read only the January partition |


Parquet: The Preferred Format

MetaDB is optimized for Apache Parquet files—a columnar storage format designed for analytics.

Why Parquet?

| Feature | CSV | Parquet |
| --- | --- | --- |
| Storage | Row-based text format | Column-based binary format |
| Compression | Poor | Excellent (up to 90% smaller) |
| Column Selection | Must read all columns | Read only what's needed |
| Schema | Inferred at read time | Embedded with data types |
| Performance | Good for small files | Optimized for analytics at scale |

Hive Partitioning (Future Direction)

Xperiflow is moving toward Hive-style partitioning for large datasets. In this scheme, folder names encode partition values:

| Path | Partition Values |
| --- | --- |
| `sales_data/year=2024/month=01/data.parquet` | year=2024, month=01 |
| `sales_data/year=2024/month=02/data.parquet` | year=2024, month=02 |
| `sales_data/year=2025/month=01/data.parquet` | year=2025, month=01 |

When you query with a filter on year or month, only relevant partitions are scanned—dramatically improving performance for time-series and historical data.
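The pruning logic itself is simple enough to sketch: parse `key=value` folder segments out of each path and keep only the paths whose values match the filter. This is an illustration of the mechanism, not the engine's implementation.

```python
def prune_partitions(paths, **filters):
    """Keep only paths whose key=value folder segments match the filters.

    A simplified sketch of Hive-style partition pruning: non-matching
    files are skipped before any data is read.
    """
    kept = []
    for path in paths:
        # Collect partition values encoded in the folder names.
        parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if all(parts.get(key) == value for key, value in filters.items()):
            kept.append(path)
    return kept
```

With the example layout above, filtering on `year="2024", month="01"` leaves a single file to scan out of three.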


Integration with Xperiflow Components

Routines and Artifacts

When routines execute, they produce artifacts stored in the routine zone. MetaDB provides direct query access to these artifacts:

[Diagram: querying routine artifacts through MetaDB]

Vector Storage (MetaVectorDB)

For AI/ML workloads, MetaVectorDB extends MetaDB with vector similarity search capabilities—storing embeddings as Parquet files and using DuckDB for efficient nearest-neighbor queries.
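The nearest-neighbor idea behind that can be shown in plain Python: rank stored vectors by cosine similarity to a query vector and return the best matches. This sketches the concept only; it is not MetaVectorDB's API, and a real implementation would run the similarity computation inside the query engine.

```python
import math

def top_k(query, vectors, k=3):
    """Return the indices of the k vectors most similar to the query,
    ranked by cosine similarity. Assumes non-zero vectors."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:k]
```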


Best Practices

When to Use MetaDB

Use MetaDB for:

  • Ad-hoc data exploration and analysis
  • Aggregating data across multiple Parquet files
  • Joining reference data with routine outputs
  • Quick insights without data loading delays

Consider alternatives for:

  • File management operations (use MetaFileSystem)
  • Transactional updates (use SQL Server)
  • Real-time streaming (use appropriate streaming tools)

Query Optimization Tips

  1. Filter Early: Apply WHERE clauses to reduce data scanned
  2. Select Specific Columns: Avoid SELECT * when you need only a few columns
  3. Leverage Partitions: Structure queries to align with partition columns
  4. Use Appropriate Protocols: Query from the correct zone for fastest access

Glossary

| Term | Definition |
| --- | --- |
| Alias Resolution | The process of translating protocol paths (e.g., `shared://`) to actual cloud storage URLs |
| Columnar Storage | Data organization by columns rather than rows, optimized for analytical queries |
| DuckDB | The embedded analytical database engine powering MetaDB |
| Hive Partitioning | A directory-based partitioning scheme where folder names encode partition values |
| MetaFileSystem | Xperiflow's file operation interface (complementary to MetaDB) |
| Parquet | A columnar file format optimized for big data analytics |
| Protocol | A prefix (e.g., `shared://`, `routine://`) that maps to a storage zone |
| Pushdown | Optimization where operations (filters, projections) are executed at the storage level |
| Vectorized Execution | Processing data in batches for CPU efficiency |
