MetaFileSystem
Summary: The MetaFileSystem is Xperiflow's unified storage layer for file data. It provides one consistent way to store and retrieve files across multiple logical stores (framework, project, shared, routine, ephemeral) by routing each request to the right store based on the path's protocol prefix. It supports the usual file system operations and is designed to handle large volumes of data.
Overview
When working with data science workflows, you need reliable file storage that can handle everything from small configuration files to large datasets. The MetaFileSystem solves this by providing a single, consistent interface regardless of where your files are physically stored.
The MetaFileSystem provides:
Feature | Description |
|---|---|
Unified Access | One interface to work with files across all storage locations |
Protocol-Based Routing | Intuitive path prefixes that route to the correct storage |
fsspec Compatibility | Implements fsspec's AbstractFileSystem, so the MetaFileSystem can be used anywhere an fsspec filesystem is expected. |
Think of the MetaFileSystem like a smart filing cabinet that knows exactly where everything is stored, even if the actual documents are spread across different rooms in the building.
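Because it implements fsspec's AbstractFileSystem, a MetaFileSystem instance can be handed to any library that accepts an fsspec filesystem. A minimal sketch, assuming a hypothetical import path (the real module name and constructor arguments may differ):

```python
# Hypothetical import path -- the actual module may differ.
from xperiflow.storage import MetaFileSystem

# Construct the filesystem; it exposes the standard fsspec
# AbstractFileSystem interface (open, ls, cat, info, ...).
fs = MetaFileSystem()

# List the top level of the shared store via its protocol prefix.
print(fs.ls("shared://"))
```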
Why the MetaFileSystem Exists
The Problem
Modern data science and ML workflows generate significant amounts of data: trained models, intermediate datasets, configuration files, artifacts, logs, and user-uploaded files. These files often have different lifecycles, access patterns, and security requirements:
- Some files are project-specific and should only be accessible within that project's context
- Some files need to be shared across routines or people within an organization
- Some files are ephemeral—temporary working data that doesn't need long-term persistence
Without an abstraction layer, you would need to:
- Manage multiple storage connections manually
- Handle different authentication mechanisms for each storage backend
- Manually organize files across disparate systems
- Write different code for each storage type
The Solution
The MetaFileSystem eliminates this complexity by providing:
Benefit | What It Means |
|---|---|
Logical Separation | Different storage contexts (protocols) for different use cases |
Consistent API | The same file operations work across all storage areas |
Security Isolation | Each storage area can have independent access controls |
Path-Based Routing | Simply use a protocol prefix to target the right storage area |
Backend Abstraction | The underlying storage technology can change without affecting your workflows |
Core Concepts
Physical Storage Architecture
The MetaFileSystem abstracts over cloud-based blob storage (such as Azure Blob Storage). This has several important implications:
Aspect | What This Means |
|---|---|
Network Access | File operations involve network calls to cloud storage—they're not free local reads like reading from your laptop's hard drive |
Latency Considerations | Reading and writing files has network latency. For performance-critical operations, minimize the number of file operations |
No Traditional Folders | Blob storage doesn't have true folders. The MetaFileSystem simulates folder structures using path prefixes in file names |
Scalability | Cloud storage scales automatically—you don't need to worry about running out of disk space |
Storage Protocols and File Stores
The MetaFileSystem organizes storage into distinct file stores, each identified by a protocol prefix. When you construct a file path, you specify which store to use by prefixing the path with the protocol:
{protocol}://{path/to/file}
Available File Stores
Protocol | Store Type | Purpose | Typical Contents |
|---|---|---|---|
routine:// | Routine Store | Routine execution data and artifacts | Run outputs, model artifacts, execution results |
shared:// | Shared Store | Cross-routine resources accessible to multiple workflows | Reference data, shared reports, common configurations |
framework:// | Framework Store | System-wide resources provided by Xperiflow | Templates, default configurations, system libraries |
ephemeral:// | Ephemeral Store | Temporary storage for intermediate processing | Cache data, working files, intermediate calculations |
project-[id]:// | Project Store | Project-specific isolated storage | Project configs, analysis results, project datasets |
Example Paths
Path | What It Accesses |
|---|---|
routine://data/cluster_results.parquet | A data file in routine storage |
shared://reports/quarterly_summary.xlsx | A shared report accessible to multiple routines |
framework://templates/default_config.json | A system template provided by Xperiflow |
project-42://analysis/customer_segments.csv | A file specific to Project 42 |
Key Insight: You don't need to know the physical storage location. The protocol tells the system where to look, and the MetaFileSystem handles the rest.
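To make the routing concrete, here is a hedged sketch reusing the `fs` instance from the Overview (paths are illustrative): the code is identical no matter which store the protocol prefix targets.

```python
# Same call pattern, different stores -- only the prefix changes.
for path in (
    "routine://data/cluster_results.parquet",
    "shared://reports/quarterly_summary.xlsx",
    "framework://templates/default_config.json",
    "project-42://analysis/customer_segments.csv",
):
    with fs.open(path, "rb") as f:   # routed by protocol prefix
        head = f.read(16)            # read a few bytes from each store
    print(path, len(head))
```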
Available Storage Types
The MetaFileSystem is designed to handle the diverse file types generated by data science and analytics workflows. Here's what you can store:
File Category | Examples | Typical Extensions |
|---|---|---|
Data Files | Datasets, tables, query results | .parquet, .csv, .json |
Model Artifacts | Trained ML models, model weights | .pkl, .joblib, .onnx |
Configuration | Settings, parameters, mappings | .json, .yaml, .toml |
Reports & Outputs | Generated reports, visualizations | .xlsx, .html, .png |
Logs & Diagnostics | Execution logs, error traces | .log, .txt |
Reference Data | Lookup tables, master data | .parquet, .csv |
Intermediate Results | Temporary processing outputs | .pkl, .parquet |
Important: The MetaFileSystem is optimized for structured and semi-structured data files typical in analytics workflows. It's not currently intended for storing application binaries, media files, or other non-workflow content.
Common Operations
The MetaFileSystem supports standard operations through a consistent interface:
File Operations
Operation | Description | Example Use |
|---|---|---|
Read | Retrieve file contents | Load a dataset for analysis |
Write | Create or update a file | Save model results |
List | View directory contents | Browse available files |
Delete | Remove a file | Clean up temporary data |
Copy | Duplicate a file | Create backup before processing |
Info | Get file metadata | Check file size and modification date |
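In fsspec terms, these operations map onto the standard AbstractFileSystem methods. A sketch, assuming the `fs` instance from the Overview and illustrative paths:

```python
import json

# Write: create or update a file.
fs.pipe("routine://data/metrics.json", json.dumps({"k": 5}).encode())

# Read: retrieve file contents as bytes.
raw = fs.cat("routine://data/metrics.json")

# List: view directory contents.
print(fs.ls("routine://data/"))

# Info: get file metadata (size, timestamps).
print(fs.info("routine://data/metrics.json"))

# Copy: duplicate a file before processing.
fs.copy("routine://data/metrics.json", "routine://backup/metrics.json")

# Delete: clean up temporary data.
fs.rm("routine://data/metrics.json")
```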
Directory Operations
Directories are not explicitly supported in the current MetaFileSystem implementation. Directories are implicit: a directory exists only if it contains at least one file, and deleting a directory simply removes all files under its prefix.
Operation | Description |
|---|---|
Remove Directory | Remove all files under a directory prefix |
Walk | Recursively traverse directory trees |
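Walking an implicit directory tree works as with any fsspec filesystem: `walk` yields `(root, dirs, files)` tuples, much like `os.walk`. A sketch with an illustrative prefix:

```python
# Traverse everything under the routine store's instances prefix.
for root, dirs, files in fs.walk("routine://instances/"):
    for name in files:
        print(f"{root}/{name}")  # directories exist only via these files
```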
Signed URLs
For scenarios where direct file access is needed (e.g., downloads in web applications), the MetaFileSystem can generate signed URLs:
Request:  Generate a temporary access URL
Path:     project-42://reports/customer_segments.xlsx
Response: https://storage.azure.com/container/path/file.xlsx?sig=xxxxx&exp=1234567890
The response is a direct-access URL; the sig and exp query parameters carry the time-limited signature.
Use Cases:
- Allowing users to download files directly
- Embedding files in reports or dashboards
- Sharing files temporarily with external systems
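fsspec defines a `sign(path, expiration=...)` method for exactly this purpose. Whether MetaFileSystem exposes it under that name, and the units of `expiration`, are assumptions in this sketch:

```python
# Assumed: MetaFileSystem implements fsspec's sign(); expiration in
# seconds is an assumption here.
url = fs.sign("project-42://reports/customer_segments.xlsx", expiration=3600)
print(url)  # direct-access URL with a time-limited signature
```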
File Attributes
Files can carry custom attributes—key-value pairs that describe the file beyond standard metadata:
File: routine://instances/abc123/runs/run-001/cluster_model.pkl
───────────────────────────────────────────────────────────────
Standard Metadata:
• Name: cluster_model.pkl
• Size: 15.2 MB
• Created: 2024-12-15 09:30:00
• Modified: 2024-12-15 09:30:00
• Version: 1
Custom Attributes:
• model_type: "kmeans"
• num_clusters: 5
• silhouette_score: 0.72
• training_date: "2024-12-15"
Attributes enable:
- Workflow context: Track processing state
- Integration data: Store external system references
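fsspec has no standard API for custom attributes, so the `metadata=` keyword below is a hypothetical MetaFileSystem extension, shown only to illustrate how attributes might be attached at write time:

```python
import pickle

model_bytes = pickle.dumps({"centroids": [[0.1, 0.2], [0.9, 0.8]]})  # stand-in model

# Hypothetical keyword -- not part of the fsspec standard.
fs.pipe(
    "routine://instances/abc123/runs/run-001/cluster_model.pkl",
    model_bytes,
    metadata={"model_type": "kmeans", "num_clusters": 5, "silhouette_score": 0.72},
)
```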
Access Controls
Access controls are currently governed at the protocol level: there are no per-file or per-directory permissions that can be set per user or group. Finer-grained permissions are a planned enhancement for the MetaFileSystem.
Where the MetaFileSystem is Used
Routine Execution
When routines run, the MetaFileSystem manages all file storage automatically.
Routine Storage Structure
Within routine storage, files are organized hierarchically:
routine://
└── instances/
└── {routine_instance_id}/
├── shared/ ← Instance-level shared files
│ └── {filename}.{ext} (persists across runs)
└── runs/
└── {routine_run_id}/ ← Run-specific data
└── artifacts/
└── {artifact}.{ext}
Storage Scopes Within Routines
Scope | Access | Lifetime | Use Case |
|---|---|---|---|
Routine Instance | Read-only | Persistent | Access artifacts from previous runs |
Shared Instance | Read/Write | Persistent | State shared across runs |
Shared Run | Read/Write | Per-run | Files shared within a single run |
Fileshare | Read/Write | Global | Cross-routine accessible storage |
Common Scenarios
Scenario 1: Accessing Routine Artifacts
After running KMeans clustering, you want to access the results:
- Artifacts stored at: routine://instances/[instance_id]/runs/[run_id]/artifacts/
- Access cluster assignments: System retrieves from optimized storage
- Fast metadata lookup: Know file size and creation time instantly
- Load data: Actual content retrieved only when needed
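A sketch of this flow (placeholder IDs, `fs` as before): metadata comes back from a cheap `info` call, and content is transferred only when the file is actually opened.

```python
path = "routine://instances/inst-001/runs/run-001/artifacts/cluster_assignments.parquet"

meta = fs.info(path)   # fast metadata lookup: size, timestamps
print(meta["size"])

with fs.open(path, "rb") as f:   # content transferred only here
    assignments = f.read()
```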
Scenario 2: Sharing Data Across Routines
Your ML Classification routine needs reference data from a previous analysis:
- Store reference data: shared://reference/customer_categories.parquet
- Multiple routines access: Both KMeans and Classification can read
- Single source of truth: Updates visible to all consumers
- No duplication: Data stored once, accessed many times
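Because the shared store is just another protocol, any consumer can read the reference file with ordinary tooling. A sketch using pandas, which accepts a file-like object:

```python
import pandas as pd

# Both the KMeans and the Classification routine can run this same code.
with fs.open("shared://reference/customer_categories.parquet", "rb") as f:
    categories = pd.read_parquet(f)
```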
Scenario 3: Project-Specific Storage
Different projects need isolated storage:
- Project A files: project-1://analysis/results.parquet
- Project B files: project-2://analysis/results.parquet
- Complete isolation: Projects can't accidentally access each other's data
- Same interface: Use identical code patterns across projects
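Project isolation means the code pattern is identical and only the protocol prefix changes, as in this sketch with a hypothetical helper:

```python
def load_results(project_id: int) -> bytes:
    """Hypothetical helper: same pattern, different project store."""
    with fs.open(f"project-{project_id}://analysis/results.parquet", "rb") as f:
        return f.read()

results_a = load_results(1)  # Project A's store
results_b = load_results(2)  # Project B's store; A's files are unreachable here
```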
Best Practices
1. Choose the Right Storage Area
If you need to... | Use this protocol |
|---|---|
Store routine-specific outputs | routine:// |
Share data across routines | shared:// |
Store temporary/cache files | ephemeral:// |
Access system templates | framework:// |
Store project-specific data | project-[id]:// |
2. Organize Files Logically
Good Structure:
routine://instances/[id]/runs/[run_id]/
├── artifacts/
│ ├── cluster_assignments.parquet
│ └── clustering_metrics.json
└── logs/
└── execution.log
Avoid:
routine://
├── file1.csv
├── file2.csv
├── model.pkl
├── output.xlsx
└── temp.json
3. Consider the Cost of Repeated Reads and Writes
Because the MetaFileSystem is backed by distributed network storage, reads and writes are not as fast or as "free" as local filesystem operations: every call makes a network round-trip to retrieve or send data. Keep this in mind to avoid unnecessary reads and writes.
- Batch operations when possible to reduce network round-trips
- Be mindful of file sizes—large files take longer to transfer
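For example, read a file once and reuse the in-memory copy rather than re-fetching it inside a loop (a sketch, path illustrative):

```python
# One network round-trip up front...
raw = fs.cat("shared://reference/customer_categories.parquet")

# ...then reuse the in-memory bytes, instead of calling fs.cat()
# again on every iteration.
for k in range(3, 8):
    print(k, len(raw))  # stand-in for real per-iteration work on `raw`
```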
4. Use Attributes for Context
Instead of encoding information in file names:
❌ model_kmeans_k5_score72_2024-12-15.pkl
Use attributes:
✅ cluster_model.pkl
Attributes: {model_type: "kmeans", k: 5, score: 0.72, date: "2024-12-15"}
Troubleshooting
File Not Found
Symptom: Attempting to access a file returns "File not found"
Possible Causes:
- Incorrect protocol prefix
- File path typo
- File was deleted or moved
- Accessing a project file without proper project context
Resolution: Verify the complete path including protocol prefix
File Could Not Be Deleted
Symptom: Attempting to delete a file returns "Deleting a system-generated file is not permitted"
Possible Causes:
- Accessing unintended data
- Attempting to free up space by deleting a protected, system-generated file
- Misinterpreting when something is system generated versus externally generated
Resolution: There is no workaround; this is intended behavior. If you are certain the file is externally generated, report it as a bug.
Summary
The MetaFileSystem is the intelligent storage foundation of Xperiflow:
Feature | What It Provides |
|---|---|
Protocol-based routing | Simple, intuitive file access |
Multiple storage areas | Organized, purpose-driven storage |
Signed URLs | Secure external access |
Custom attributes | Rich file metadata |
Backend abstraction | Optimized storage without complexity |
By abstracting storage complexity, the MetaFileSystem lets you focus on your data science workflows while ensuring your files are safely stored, easily accessible, and properly organized.