MetaDB
A unified query layer for cloud-native data analytics
Overview
MetaDB is Xperiflow's high-performance query interface for analytics on cloud-stored data. It enables SQL-based exploration and analysis of large datasets stored across distributed cloud storage systems—without the complexity of managing connections, credentials, or knowing exact storage locations.
At its core, MetaDB provides three essential capabilities:
- Unified Protocol Access — Query files across different storage zones using intuitive protocols like shared:// or routine://
- Cloud Abstraction — Interact with Azure Blob Storage as if it were a local database
- Analytical Performance — Leverage DuckDB's columnar processing engine optimized for big data analytics
Architecture Overview
The Problem MetaDB Solves
The Challenge of Cloud Data Access
Modern data platforms store vast amounts of data in cloud object storage. While this approach offers scalability and cost-efficiency, it introduces complexity:
| Challenge | Traditional Approach | MetaDB Approach |
|---|---|---|
| Storage Paths | Know exact container names, account URLs, and folder structures | Use simple protocol prefixes like shared:// |
| Authentication | Manage connection strings, SAS tokens, or service principals | Automatic credential injection |
| Data Formats | Load data into memory before querying | Query Parquet files directly in-place |
| Performance | Full file downloads for simple queries | Columnar pushdown—read only what you need |
Why Not Just Use Traditional Databases?
Traditional databases require data ingestion—moving data from storage into database tables before querying. This creates:
- Latency: Minutes or hours of ETL before analysis
- Duplication: Data exists in storage AND the database
- Cost: Compute resources for transformation and storage
MetaDB eliminates this by querying data where it lives—directly in cloud storage.
Core Concepts
Protocols: Your Data Address Book
Protocols in MetaDB are intuitive prefixes that map to physical storage locations. Think of them as bookmarks to different areas of the platform's data landscape.
| Protocol | Purpose | Example Use Case |
|---|---|---|
| shared:// | Platform-wide shared data | Reference data, lookup tables, configurations |
| routine:// | Routine execution artifacts | Model outputs, intermediate results, logs |
| framework:// | Framework and system data | Schema definitions, system metadata |
Conceptual Example:
Query: "Show me all records from the sales forecast"

Instead of:

`az://site-hash-abc123/shared/forecasts/sales_q4.parquet`

you write:

`shared://forecasts/sales_q4.parquet`
The protocol system provides:
- Simplicity — Human-readable paths instead of cloud URIs
- Portability — Same query works across environments (dev, staging, production)
- Security — Access control enforced at the protocol level
The Alias Resolution Engine
When you write a query using protocols, MetaDB's Alias Resolution Engine translates your friendly paths into actual cloud storage addresses.
This translation happens transparently—you never see or manage the underlying cloud paths.
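A minimal sketch of how such an alias resolver could work, assuming a simple prefix-to-URL mapping. The `PROTOCOL_MAP` values below are illustrative stand-ins, not the platform's real storage layout (only the `az://site-hash-abc123/shared/` example appears in this document):

```python
# Sketch of protocol-alias resolution. The mapping values are
# illustrative assumptions, not the platform's actual layout.
PROTOCOL_MAP = {
    "shared://": "az://site-hash-abc123/shared/",
    "routine://": "az://site-hash-abc123/routine/",
    "framework://": "az://site-hash-abc123/framework/",
}

def resolve(path: str) -> str:
    """Translate a protocol path into a physical cloud storage URL."""
    for prefix, base in PROTOCOL_MAP.items():
        if path.startswith(prefix):
            return base + path[len(prefix):]
    raise ValueError(f"Unknown protocol in path: {path}")

print(resolve("shared://forecasts/sales_q4.parquet"))
# -> az://site-hash-abc123/shared/forecasts/sales_q4.parquet
```

The real engine presumably also handles credential injection and access control; this sketch covers only the path-translation step.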
Powered by DuckDB
MetaDB is built on DuckDB, an embedded analytical database engine. DuckDB was chosen for several key reasons:
Columnar Storage Processing
Unlike traditional row-based databases, DuckDB processes data column-by-column. When you query only 3 columns from a 50-column Parquet file, DuckDB reads only those 3 columns—reducing I/O by up to 90%.
Vectorized Execution
DuckDB processes data in batches (vectors) rather than row-by-row, leveraging modern CPU architectures for parallel processing.
Zero-Copy Reads
Data flows directly from cloud storage to query results without intermediate staging.
How MetaDB Differs from MetaFileSystem
Xperiflow provides two complementary systems for working with cloud storage:
| Capability | MetaFileSystem | MetaDB |
|---|---|---|
| Primary Use | File operations | Data analytics |
| Operations | Read, write, copy, delete files | Query, aggregate, join, filter |
| Interface | File system API | SQL queries |
| Data Format | Any file type | Optimized for Parquet/CSV |
| Use Case | "Save this model artifact" | "What's the average across all records?" |
Think of it this way:
- MetaFileSystem = Your cloud file explorer
- MetaDB = Your cloud SQL workbench
Both systems share the same protocol conventions (shared://, routine://, framework://), ensuring a consistent mental model across the platform.
Storage Architecture
The Three Storage Zones
MetaDB organizes data into logical zones, each serving a distinct purpose:
| Zone | Protocol | Purpose | Contents |
|---|---|---|---|
| Shared | shared:// | Platform-wide data | Reference datasets, lookups, global configs |
| Framework | framework:// | System-level data | Schemas, metadata, templates |
| Routine | routine:// | Execution data | Instance data, run artifacts, method outputs |
Storage Zone Details
Shared Zone (shared://)
The shared zone contains platform-wide data accessible across all contexts:
- Reference datasets (country codes, currency mappings)
- Lookup tables shared between routines
- Global configuration data
Framework Zone (framework://)
System-level data supporting the Xperiflow runtime:
- Schema definitions
- Framework metadata
- System templates
Routine Zone (routine://)
Organized by routine instance and execution run:
| Path | Description |
|---|---|
| routine://instances/ | Root for all routine instances |
| routine://instances/[instance_id]/ | Specific routine instance |
| routine://instances/[instance_id]/shared/ | Instance member data (persists across runs) |
| routine://instances/[instance_id]/runs/[run_id]/ | Single execution run |
| routine://instances/[instance_id]/runs/[run_id]/artifacts/ | Output artifacts from the run |
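The layout above can be captured in a small path-building helper. The `artifact_path` function and its argument values are illustrative, not a platform API:

```python
def artifact_path(instance_id: str, run_id: str, name: str) -> str:
    """Compose a routine-zone artifact path following the documented layout."""
    return (f"routine://instances/{instance_id}"
            f"/runs/{run_id}/artifacts/{name}")

print(artifact_path("inst-01", "run-42", "forecast.parquet"))
# -> routine://instances/inst-01/runs/run-42/artifacts/forecast.parquet
```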
Data Locality and Performance
The Principle of Data Locality
MetaDB is designed around a fundamental principle: move computation to the data, not data to the computation.
Traditional: Slow and expensive due to full data transfer
MetaDB: Fast and efficient—process data in-place
How DuckDB Enables This
- Predicate Pushdown: Filters are applied at the storage level, reducing data transfer
- Projection Pushdown: Only requested columns are read from Parquet files
- Partition Pruning: When using Hive partitioning, entire file groups are skipped
Example Impact:
| Scenario | Traditional Approach | MetaDB Approach |
|---|---|---|
| Query 1M rows, need 100 | Download 1M rows, filter locally | Push filter, return 100 rows |
| 50-column table, need 3 | Download all 50 columns | Read only 3 columns |
| 2 years of data, need January | Download 24 months | Read only January partition |
Parquet: The Preferred Format
MetaDB is optimized for Apache Parquet files—a columnar storage format designed for analytics.
Why Parquet?
| Feature | CSV | Parquet |
|---|---|---|
| Storage | Row-based text format | Column-based binary format |
| Compression | Poor | Excellent (up to 90% smaller) |
| Column Selection | Must read all columns | Read only what's needed |
| Schema | Inferred at read time | Embedded with data types |
| Performance | Good for small files | Optimized for analytics at scale |
Hive Partitioning (Future Direction)
Xperiflow is moving toward Hive-style partitioning for large datasets. In this scheme, folder names encode partition values:
| Path | Partition Values |
|---|---|
| sales_data/year=2024/month=01/data.parquet | year=2024, month=01 |
| sales_data/year=2024/month=02/data.parquet | year=2024, month=02 |
| sales_data/year=2025/month=01/data.parquet | year=2025, month=01 |
When you query with a filter on year or month, only relevant partitions are scanned—dramatically improving performance for time-series and historical data.
Integration with Xperiflow Components
Routines and Artifacts
When routines execute, they produce artifacts stored in the routine zone. MetaDB provides direct query access to these artifacts:
Vector Storage (MetaVectorDB)
For AI/ML workloads, MetaVectorDB extends MetaDB with vector similarity search capabilities—storing embeddings as Parquet files and using DuckDB for efficient nearest-neighbor queries.
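As a toy illustration of the nearest-neighbor idea (the embeddings and document names below are made up, and MetaVectorDB's actual interface is not described in this document):

```python
import math

# Toy cosine-similarity nearest-neighbor search over embeddings,
# sketching what MetaVectorDB does at scale. Data is invented.
embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [0.7, 0.7],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

query = [1.0, 0.1]
best = max(embeddings, key=lambda k: cosine(query, embeddings[k]))
print(best)  # doc_a
```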
Best Practices
When to Use MetaDB
✅ Use MetaDB for:
- Ad-hoc data exploration and analysis
- Aggregating data across multiple Parquet files
- Joining reference data with routine outputs
- Quick insights without data loading delays
❌ Consider alternatives for:
- File management operations (use MetaFileSystem)
- Transactional updates (use SQL Server)
- Real-time streaming (use appropriate streaming tools)
Query Optimization Tips
- Filter Early: Apply WHERE clauses to reduce data scanned
- Select Specific Columns: Avoid `SELECT *` when you need only a few columns
- Leverage Partitions: Structure queries to align with partition columns
- Use Appropriate Protocols: Query from the correct zone for fastest access
Glossary
| Term | Definition |
|---|---|
| Alias Resolution | The process of translating protocol paths (e.g., shared://) to actual cloud storage URLs |
| Columnar Storage | Data organization by columns rather than rows, optimized for analytical queries |
| DuckDB | The embedded analytical database engine powering MetaDB |
| Hive Partitioning | A directory-based partitioning scheme where folder names encode partition values |
| MetaFileSystem | Xperiflow's file operation interface (complementary to MetaDB) |
| Parquet | A columnar file format optimized for big data analytics |
| Protocol | A prefix (e.g., shared://, routine://) that maps to a storage zone |
| Pushdown | Optimization where operations (filters, projections) are executed at the storage level |
| Vectorized Execution | Processing data in batches for CPU efficiency |