TidyRun Serialization Guide¶

Overview¶

This page covers serialization only: storing and retrieving Python objects (serialize, deserialize, metadata, encoders, and LazyDict).

For deferred-compute APIs (DAG, Job, ParametrizedJob) and DAG execution patterns, see the dedicated DAG Guide.

The TidyRun serialization framework provides a comprehensive, extensible system for storing and retrieving Python objects, including nested dictionaries, DataFrames, Series, and arbitrary Python objects, using a filesystem hierarchy.

Key Features¶

Type-Aware Encoding: Automatically selects the best format (folder, parquet, HDF5, JSON, or pickle) based on value type
Metadata Sidecars: Each output is accompanied by a .tidyrun metadata file recording the encoding format and version
Lazy Evaluation: Directories deserialize into LazyDict objects that load values on demand; each access reads the value again
Optional S3 Support: s3://... locations work when the boto3 extra is installed
Recursive Concatenation: LazyDict.concat() method provides pandas-style aggregation across nested structures
Parallel Leaf Loading: LazyDict.concat(max_workers=...) can load selected leaf values concurrently
Fallback Chain: When an encoder fails, the next compatible encoder is tried (e.g., parquet → HDF5 when parquet encoding fails)
Extensible Pipeline: Users can provide custom encoder sequences and compose encoders
Direct Import Path: Use tidyrun.serialization for the serialization API

Quick Start¶

Basic Serialization¶

from tidyrun import serialize, deserialize
import pandas as pd

# Serialize a simple value
data = {"experiment": "run_1", "results": 42}
serialize(data, "./output/my_result")

# Deserialize back (returns LazyDict for folders)
loaded = deserialize("./output/my_result")
print(loaded["experiment"])  # "run_1" — loaded on access
print(loaded["results"])     # 42

DataFrame Support¶

import pandas as pd
from tidyrun import serialize, deserialize

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Serializes to Parquet (with metadata sidecar)
serialize(df, "./output/dataframe")

# Deserializes directly as DataFrame
loaded_df = deserialize("./output/dataframe")

Nested Data with LazyDict¶

# Nested dictionary
nested = {
    "run_1": {"metrics": pd.DataFrame(...)},
    "run_2": {"metrics": pd.DataFrame(...)},
}

serialize(nested, "./outputs/runs")

# Deserialize as LazyDict — values load on access
loaded = deserialize("./outputs/runs")

# Lazy access
metrics_1 = loaded["run_1"]["metrics"]  # Loaded only when accessed

# Recursive materialization
full_dict = loaded.to_dict()  # Materializes all nested values

DAG APIs¶

For deferred-compute APIs (DAG, Job, ParametrizedJob) and local multithreaded DAG evaluation, see the dedicated DAG page: DAG Guide.

Optional S3 Support¶

TidyRun can read and write s3://bucket/prefix/... locations when the optional S3 dependency is installed:

pip install tidyrun[s3]

Under the hood, S3 serialization stages through a local temporary directory and then uploads the generated files. Deserialization downloads the object tree to a temporary local directory and then reuses the normal local loader.
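The staging behavior described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not TidyRun's implementation: `stage_and_upload`, `write_tree`, and `upload_file` are hypothetical names, with `write_tree` standing in for the local serializer and `upload_file` for an S3 client upload call.

```python
import tempfile
from pathlib import Path
from typing import Callable


def stage_and_upload(
    write_tree: Callable[[Path], None],
    upload_file: Callable[[Path, str], None],
    prefix: str,
) -> list[str]:
    """Write files into a temporary directory, then upload each one.

    Both callables are hypothetical stand-ins: write_tree plays the role of
    the local serializer, upload_file the role of an S3 client upload.
    """
    uploaded = []
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_tree(root)
        # Walk the staged tree and derive one object key per file.
        for path in sorted(p for p in root.rglob("*") if p.is_file()):
            key = f"{prefix.rstrip('/')}/{path.relative_to(root).as_posix()}"
            upload_file(path, key)
            uploaded.append(key)
    return uploaded


def demo_write(root: Path) -> None:
    # Illustrative file layout only; the real layout is produced by TidyRun.
    (root / "result").mkdir()
    (root / "result.tidyrun").write_text("version = 1\n")
    (root / "result" / "answer.json").write_text("42")


sent = {}
keys = stage_and_upload(
    demo_write,
    lambda path, key: sent.__setitem__(key, path.read_text()),
    "runs/exp_001",
)
```

Because the temporary directory is removed when the `with` block exits, every upload has to finish inside it, which is why the loop sits within the context manager.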

For tests and local development, the S3 round-trip test suite uses the moto mock backend.

Concatenation Across Runs¶

# Given a nested structure like:
# outputs/
#   run_1/
#     metrics.parquet
#   run_2/
#     metrics.parquet
#   ...

loaded = deserialize("./outputs")

# Concatenate all DataFrames, keyed by run ID
combined = loaded.concat(names=["run_id"])
# Returns: DataFrame with multi-index (run_id) and all metrics stacked

# With transformation (e.g., add timestamp)
combined = loaded.concat(
    names=["run_id"],
    transform=lambda df: df.assign(loaded_at=pd.Timestamp.now())
)

# Transform to scalar values (wrapped as a one-row "value" column)
totals = loaded.concat(
    names=["run_id"],
    transform=lambda df: df["metric"].sum(),
)

# With filtering (load only specific runs)
combined = loaded.concat(
    names=["run_id"],
    select=lambda path: path[0] in ["run_1", "run_2"]
)

# Parallel loading of leaf values (useful when leaf files are large)
combined = loaded.concat(
    names=["run_id"],
    max_workers=8,
)

# select(path) argument:
# - path: tuple of keys to the leaf (e.g. ("run_1", "metrics"))
# Note: select is evaluated at multiple depths while traversing nested folders.
# If you index deeper elements like path[1], guard with len(path) first.
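Since select may be called with partial paths during traversal, a predicate that inspects deeper keys should check the path length first. The following is a small sketch of such a guard; the function name `select_run_metrics` and the key names are illustrative only.

```python
def select_run_metrics(path: tuple) -> bool:
    """Keep the "metrics" leaves of run_1 and run_2.

    The predicate may receive a partial path such as ("run_1",) while
    folders are traversed, so deeper elements are inspected only once
    the path is long enough.
    """
    if len(path) >= 1 and path[0] not in ("run_1", "run_2"):
        return False
    if len(path) >= 2 and path[1] != "metrics":
        return False
    return True


candidates = [
    ("run_1",),
    ("run_1", "metrics"),
    ("run_1", "config"),
    ("run_3",),
    ("run_3", "metrics"),
]
kept = [path for path in candidates if select_run_metrics(path)]
print(kept)  # [('run_1',), ('run_1', 'metrics')]
```

It would then be passed as `loaded.concat(names=["run_id"], select=select_run_metrics)`.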

Quick Reference¶

One-Minute Overview¶

TidyRun serialization provides save/load behavior with automatic format selection:

from tidyrun import serialize, deserialize

serialize({"df": my_dataframe, "config": settings}, "./output/result")

result = deserialize("./output/result")
df = result["df"]

What Gets Stored Where¶

Type	Format	File	Notes
`dict`	Folder tree	`key/name/`	Keys encoded via TOML; values recurse
`pd.DataFrame`	Parquet	`*.parquet`	Falls back to HDF5 on failure
`pd.Series`	Parquet	`*.parquet`	Falls back to HDF5 on failure
Scalar (`int`, `str`, `bool`, `float`, `date`, `datetime`, `time`)	JSON	`*.json`	JSON-serializable scalars
Other objects	Pickle	`*.pickle`	Last resort

Every output gets a .tidyrun metadata sidecar recording the selected format.

LazyDict in Practice¶

Directories deserialize as LazyDict objects for on-demand loading:

result = deserialize("./large_output")
value = result["key"]
full_dict = result.to_dict()

Encoder Fallback Chain¶

If one encoder cannot serialize a value, the next compatible encoder is tried:

dict -> folder tree
DataFrame -> parquet, then HDF5
Series -> parquet, then HDF5
JSON-serializable values -> JSON
Everything else -> pickle

Common Patterns¶

Save experiment results:

serialize(
    {
        "config": hyperparams,
        "metrics": pd.DataFrame(training_log),
        "model": trained_model,
    },
    "./experiments/exp_001",
)

Load and compare runs:

runs = deserialize("./experiments")
comparison_table = runs.concat(names=["exp_id"])

API Summary¶

Function	Purpose
`serialize(value, target, encoders=None)`	Save a value to disk
`deserialize(source, encoders=None)`	Load a value from disk
`LazyDict.to_dict()`	Materialize a nested `LazyDict`
`LazyDict.concat(names, transform, select, max_workers=None)`	Recursively concatenate DataFrames with optional parallel leaf loading
`encode_key(key)`	Encode a Python type to a filename-safe key
`decode_key(name)`	Decode a stored key name

Architecture¶

Module Structure¶

src/tidyrun/serialization/
├── __init__.py          # Public API exports
├── types.py             # Type definitions, exceptions, constants
├── paths.py             # Path/location helpers
├── metadata.py          # Metadata I/O and format mapping
├── encoders.py          # Encoder implementations (dict, parquet, hdf5, etc.)
├── lazy_dict.py         # LazyDict class
└── api.py               # Main serialize/deserialize functions

Encoder Pipeline¶

The default encoder pipeline tries encoders in this order:

dict-folder: Maps dictionaries to directory trees (keys become folder names)
dataframe-parquet: Stores DataFrames as .parquet files
series-parquet: Stores pandas Series as .parquet files
pandas-hdf5: Fallback for DataFrames/Series (HDF5 format with key "data")
fallback-json: JSON serialization for JSON-serializable values such as int, float, str, bool, and list
fallback-pickle: Last resort for arbitrary Python objects

Encoders are tried in order; the first whose predicate returns True is used.

Fallback Mechanism¶

When an encoder fails (e.g., parquet cannot serialize a multi-index DataFrame), it raises GoToNextEncoderException to signal "skip me and try the next one." This allows:

DataFrame with multi-index → fails parquet → tries HDF5 ✓
Series when parquet engine unavailable → fails parquet → tries HDF5 ✓
Custom object → fails all structured formats → falls back to pickle ✓
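The combination of predicate matching and exception-driven fallback can be modeled in a few lines. This is a simplified sketch, not the library's code: `Encoder` and `GoToNextEncoder` are illustrative stand-ins for EncoderSpec and GoToNextEncoderException, and the serializers return bytes instead of writing files.

```python
import json
import pickle
from dataclasses import dataclass
from typing import Any, Callable, Iterable


class GoToNextEncoder(Exception):
    """Stand-in for GoToNextEncoderException in this sketch."""


@dataclass(frozen=True)
class Encoder:
    """Illustrative stand-in for EncoderSpec."""
    name: str
    predicate: Callable[[Any], bool]
    serializer: Callable[[Any], bytes]


def encode(value: Any, encoders: Iterable[Encoder]) -> tuple[str, bytes]:
    """Return (encoder name, payload) from the first encoder that succeeds."""
    for encoder in encoders:
        if not encoder.predicate(value):
            continue
        try:
            return encoder.name, encoder.serializer(value)
        except GoToNextEncoder:
            continue  # this encoder opted out; try the next match
    raise ValueError(f"no encoder found for {type(value).__name__}")


def to_json(value: Any) -> bytes:
    try:
        return json.dumps(value).encode()
    except TypeError as exc:
        raise GoToNextEncoder from exc


pipeline = (
    Encoder("fallback-json", lambda v: isinstance(v, (dict, list, str, int, float)), to_json),
    Encoder("fallback-pickle", lambda v: True, pickle.dumps),
)

print(encode({"lr": 0.001}, pipeline)[0])      # fallback-json
print(encode({"tags": {"a"}}, pipeline)[0])    # fallback-pickle
```

The second value passes the JSON predicate but contains a set, so the JSON serializer signals fallback and pickle handles it.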

Metadata Sidecars¶

Every serialized value gets a .tidyrun metadata file:

# output.tidyrun
version = 1
encoding = "dataframe-parquet"
suffix = ".parquet"

This metadata:

Tracks the encoding format used
Enables schema versioning for future compatibility
Allows deserialization without requiring file extension guessing

API Reference¶

Core Functions¶

`serialize(value, target, encoders=None)`¶

Serializes a Python value to disk using the configured encoder pipeline.

Parameters:

value (Any): The value to serialize (dict, DataFrame, Series, scalar, etc.)
target (str | PathLike): Where to write the output (with or without extension)
encoders (Iterable[EncoderSpec], optional): Custom encoder pipeline; defaults to default_encoders()

S3 support:

When target is an s3://... URI, TidyRun stages the output locally and uploads the resulting tree to S3
This requires the optional boto3 dependency, available via pip install tidyrun[s3]

Behavior:

Selects first encoder whose predicate matches the value
Calls the encoder's serializer function
Writes metadata sidecar with encoding info
If serializer raises GoToNextEncoderException, tries next matching encoder

Raises:

TidyRunSerializationError: No encoder found for value type
NotImplementedError: S3 support requested without the optional dependency installed
Various I/O errors (permission denied, disk full, etc.)

Example:

serialize({"data": pd.DataFrame(...)}, "./results/exp_1")
# Writes:
#   results/exp_1.tidyrun       (metadata for the dict folder)
#   results/exp_1/              (folder for the dict)
#   results/exp_1/data.tidyrun  (metadata for the DataFrame)
#   results/exp_1/data.parquet

`deserialize(source, encoders=None)`¶

Deserializes a value from disk using metadata to determine format.

Parameters:

source (str | PathLike): Location to read from (with or without extension)
encoders (Iterable[EncoderSpec], optional): Custom encoder pipeline

S3 support:

When source is an s3://... URI, TidyRun downloads the object tree to a temporary local directory and deserializes from there
This requires the optional boto3 dependency, available via pip install tidyrun[s3]

Returns:

For dict-folder format: LazyDict (not materialized dict)
For DataFrame: pd.DataFrame
For Series: pd.Series
For scalar/pickle: Original Python object

Raises:

TidyRunDeserializationError: Missing metadata, unknown encoder, invalid data
NotImplementedError: S3 source requested without the optional dependency installed, or an unsupported remote location
Various I/O errors

Example:

result = deserialize("./results/exp_1")
# Returns LazyDict with keys from the directory structure

LazyDict¶

`LazyDict.__getitem__(key)`¶

Lazily loads the value associated with key.

loaded = deserialize("./results")
# Value not loaded yet

value = loaded["run_1"]  # Loaded here on first access
value_again = loaded["run_1"]  # Loaded again on access

`LazyDict.to_dict()`¶

Recursively materializes all nested LazyDicts into a plain Python dict.

loaded = deserialize("./nested")
full_dict = loaded.to_dict()  # All values now loaded into memory

`LazyDict.concat(names=None, transform=None, select=None, max_workers=None)`¶

Recursively concatenates DataFrame/Series leaves using pandas.concat.

Parameters:

names (list[str], optional): Names for multi-index levels created from nested keys
transform (Callable[[Any], Any], optional): Function to apply to each leaf value before concatenation (e.g., add metadata)
select (Callable[[tuple], bool], optional): Predicate to filter which values to include; receives the path tuple of keys (e.g. ("run_1", "metrics"))
max_workers (optional): When set, selected leaf values are loaded concurrently (e.g., max_workers=8)

Returns:

pd.DataFrame: Concatenated result with multi-index if nested

Raises:

ValueError: No matching values after filtering
TypeError: Selected values are not DataFrame or Series

Example:

# Simple concatenation
combined = loaded.concat(names=["experiment_id"])

# With transformation
combined = loaded.concat(
    names=["experiment_id"],
    transform=lambda df: df.assign(dataset="train")
)

# With filtering
combined = loaded.concat(
    names=["experiment_id"],
    select=lambda path: "important" in path
)
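To make the select and max_workers options more concrete, the sketch below filters leaf paths first and then loads the survivors in a thread pool. It is an illustration under assumptions, not TidyRun's implementation: `load_leaves` and the `loaders` mapping are hypothetical, and plain lists stand in for DataFrames.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Optional


def load_leaves(
    loaders: dict[tuple, Callable[[], Any]],
    select: Optional[Callable[[tuple], bool]] = None,
    max_workers: Optional[int] = None,
) -> dict[tuple, Any]:
    """Load the selected leaves, optionally in a thread pool.

    loaders maps a path tuple to a zero-argument function that reads one leaf.
    """
    paths = [p for p in loaders if select is None or select(p)]
    if not paths:
        raise ValueError("no matching values after filtering")
    if max_workers is None:
        values = [loaders[p]() for p in paths]
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # map() yields results in input order, so paths and values align.
            values = list(pool.map(lambda p: loaders[p](), paths))
    return dict(zip(paths, values))


loaders = {
    ("run_1", "metrics"): lambda: [0.5, 0.3],
    ("run_2", "metrics"): lambda: [0.6, 0.4],
    ("run_3", "metrics"): lambda: [0.7, 0.5],
}
loaded_leaves = load_leaves(loaders, select=lambda p: p[0] != "run_3", max_workers=2)
print(list(loaded_leaves))  # [('run_1', 'metrics'), ('run_2', 'metrics')]
```

Filtering before loading means rejected leaves are never read, which is what keeps memory use down for large trees.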

Encoders and Predicates¶

`is_mapping(value) -> bool`¶

Returns True if value is a dict (will be encoded as folder).

`is_dataframe(value) -> bool`¶

Returns True if value is a pandas DataFrame (detected at runtime if pandas is installed).

`is_json_serializable(value) -> bool`¶

Returns True if value can round-trip through json.dumps and json.loads.
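A round-trip check of this kind can be written with the standard json module. The sketch below is one plausible reading of the rule, not the library's code: the name `round_trips_through_json` and the use of strict equality are assumptions.

```python
import json
from typing import Any


def round_trips_through_json(value: Any) -> bool:
    """True if json.loads(json.dumps(value)) reproduces value exactly."""
    try:
        return json.loads(json.dumps(value)) == value
    except (TypeError, ValueError):
        return False


print(round_trips_through_json({"lr": 0.001, "tags": ["a", "b"]}))  # True
print(round_trips_through_json((1, 2)))    # False: comes back as [1, 2]
print(round_trips_through_json({1: "a"}))  # False: key comes back as "1"
print(round_trips_through_json({1, 2}))    # False: sets are not JSON types
```

Checking the round trip rather than only whether `json.dumps` succeeds catches values that serialize without error but come back as a different type.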

`can_encode_with_parquet(value) -> bool`¶

Returns True if value is a DataFrame and a parquet engine (pyarrow or fastparquet) is available.

`can_encode_with_hdf5(value) -> bool`¶

Returns True if value is a DataFrame or Series (and pandas is installed).

`default_encoders() -> tuple[EncoderSpec, ...]`¶

Returns the default 6-encoder pipeline in priority order.

Exceptions¶

`TidyRunSerializationError`¶

Raised when serialization fails (e.g., no encoder matches the value type).

from tidyrun import serialize
from tidyrun.serialization import TidyRunSerializationError

try:
    serialize(some_unsupported_type(), "./output")
except TidyRunSerializationError as e:
    print(f"Cannot serialize: {e}")

`TidyRunDeserializationError`¶

Raised when deserialization fails (e.g., missing metadata, invalid format).

from tidyrun import deserialize
from tidyrun.serialization import TidyRunDeserializationError

try:
    deserialize("./invalid_path")
except TidyRunDeserializationError as e:
    print(f"Cannot deserialize: {e}")

`GoToNextEncoderException`¶

Internal exception raised by encoders to signal fallback to next encoder. Users typically don't need to handle this directly.

Key Encoding¶

TidyRun keys (used as folder/file names in the hierarchy) are encoded using TOML for type safety.

`encode_key(key) -> str`¶

Encodes a simple Python type to a path-safe string using TOML.

Supported types: str, int, float, bool, date, datetime, time

String keys are encoded to preserve round-trip type safety. Plain strings are left unquoted when safe (for example, "hello" -> "hello"). Strings that would otherwise be interpreted as another TOML type (for example "true", "42", or date-like values) are TOML-quoted to ensure they decode back to strings.

Encoded keys are used as path parts, so they must satisfy these constraints:

Non-empty
No / or \
Must not start with .
Must not end with .tidyrun

from datetime import date, datetime
from tidyrun.keys import encode_key, decode_key

encoded = encode_key(42)                    # "42"
encoded = encode_key("hello")               # "hello"
encoded = encode_key("true")                # '"true"'
encoded = encode_key(True)                  # "true"
encoded = encode_key(date(2026, 5, 10))     # "2026-05-10"
encoded = encode_key(datetime(2026, 5, 10, 13, 37, 42))  # "2026-05-10T13:37:42"
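The four path-part constraints listed above are simple to check directly. The helper below is a hypothetical sketch for illustration (`is_valid_key_name` is not part of the documented API).

```python
def is_valid_key_name(name: str) -> bool:
    """Check an encoded key against the four path-part constraints."""
    return (
        name != ""
        and "/" not in name
        and "\\" not in name
        and not name.startswith(".")
        and not name.endswith(".tidyrun")
    )


for candidate in ["run_1", "2026-05-10", "", "a/b", ".hidden", "x.tidyrun"]:
    print(repr(candidate), is_valid_key_name(candidate))
```

Only the first two candidates pass; the remaining four each violate one constraint.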

`decode_key(name) -> Key`¶

Decodes a key name back to its original Python type.

original = decode_key("hello")    # "hello"
original = decode_key('"true"')   # "true"
original = decode_key("42")       # 42

Raises:

TidyRunKeyEncodingError: Unsupported key type
TidyRunKeyDecodingError: Invalid encoded name format

Customization¶

Custom Encoder¶

To add support for a custom type:

from tidyrun.serialization import EncoderSpec, serialize

def is_my_type(value):
    return isinstance(value, MyType)

def encode_my_type(value, target):
    # Write value to target location
    ...

def decode_my_type(source):
    # Read and return value from source location
    ...

my_encoder = EncoderSpec(
    name="my-custom-type",
    predicate=is_my_type,
    serializer=encode_my_type,
    deserializer=decode_my_type,
)

# Use custom encoder
from tidyrun.serialization import default_encoders

custom_pipeline = (my_encoder,) + default_encoders()
serialize(my_value, "./output", encoders=custom_pipeline)

Custom Encoder Pipeline¶

To override the default pipeline order:

from tidyrun.serialization import default_encoders, serialize

# Reorder: put HDF5 before Parquet
encoders = default_encoders()
reordered = (
    encoders[0],    # dict-folder
    encoders[3],    # pandas-hdf5
    encoders[1],    # dataframe-parquet
    encoders[2],    # series-parquet
    *encoders[4:],  # fallback-json, fallback-pickle (pandas-hdf5 is not repeated)
)

serialize(df, "./output", encoders=reordered)
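Index-based reordering depends on the exact position of each default encoder. A name-based helper is less brittle; the sketch below assumes that encoder objects expose the name passed to EncoderSpec as a `.name` attribute, which this guide does not confirm. `move_before` is a hypothetical helper, and `SimpleNamespace` objects stand in for real encoders.

```python
from types import SimpleNamespace


def move_before(encoders: tuple, name: str, before: str) -> tuple:
    """Return a new tuple with encoder `name` placed just ahead of `before`."""
    by_name = {e.name: e for e in encoders}
    if name not in by_name or before not in by_name:
        raise KeyError(f"unknown encoder name: {name!r} or {before!r}")
    if name == before:
        return tuple(encoders)
    rest = [e for e in encoders if e.name != name]
    index = next(i for i, e in enumerate(rest) if e.name == before)
    return tuple(rest[:index] + [by_name[name]] + rest[index:])


# Stand-ins for encoder objects; only the .name attribute matters here.
names = [
    "dict-folder", "dataframe-parquet", "series-parquet",
    "pandas-hdf5", "fallback-json", "fallback-pickle",
]
pipeline = tuple(SimpleNamespace(name=n) for n in names)

reordered_by_name = move_before(pipeline, "pandas-hdf5", "dataframe-parquet")
print([e.name for e in reordered_by_name])
```

The printed order places pandas-hdf5 second, with every other encoder keeping its relative position and none duplicated.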

Performance Considerations¶

Lazy Loading¶

LazyDict does not load values until accessed:

loaded = deserialize("./large_structure")  # Fast: only reads metadata

result = loaded["expensive_dataframe"]  # Slow: loads large file here
result = loaded["expensive_dataframe"]  # Loaded again on access
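The load-on-every-access behavior can be modeled with a small mapping class. This is a conceptual sketch, not LazyDict itself: `LazyMapping` and `load_frame` are illustrative names, and a list stands in for a large DataFrame.

```python
from collections.abc import Mapping
from typing import Any, Callable, Iterator


class LazyMapping(Mapping):
    """Mapping whose values are produced by loader functions on every access."""

    def __init__(self, loaders: dict[str, Callable[[], Any]]) -> None:
        self._loaders = loaders

    def __getitem__(self, key: str) -> Any:
        return self._loaders[key]()  # no cache: the loader runs each time

    def __iter__(self) -> Iterator[str]:
        return iter(self._loaders)

    def __len__(self) -> int:
        return len(self._loaders)


calls = []


def load_frame() -> list[int]:
    calls.append("read")  # stands in for an expensive file read
    return [1, 2, 3]


lazy = LazyMapping({"expensive_dataframe": load_frame})
print(len(calls))  # 0: nothing has been loaded yet
first = lazy["expensive_dataframe"]
second = lazy["expensive_dataframe"]
print(len(calls))  # 2: each access ran the loader again
```

When a value is needed more than once, binding it to a local variable after the first access avoids the repeated read.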

Concatenation Memory¶

concat() materializes all selected leaves into memory before concatenating. For very large structures, consider filtering with the select parameter:

# Load all 1000 experiments at once (high memory)
combined = loaded.concat()

# Load only 10 specific experiments (low memory)
combined = loaded.concat(
    select=lambda path: path[0] in experiments[:10]
)

Parquet Engine Selection¶

Parquet serialization uses pyarrow by default; fastparquet is tried if pyarrow is unavailable. HDF5 is tried if parquet encoding fails.

Limitations and Future Work¶

Current Limitations¶

Remote Storage: S3 is supported as an optional backend via tidyrun[s3]. Other cloud storage providers such as GCS and Azure Blob Storage are not yet supported.
Schema Evolution: No automatic schema migration for DataFrames. Users must handle schema changes manually (e.g., use transform in concat).
Parquet Multi-Index: Multi-index DataFrames cannot be serialized to parquet and fall back to HDF5.

Planned Features¶

Additional Remote Backends: Support GCS, Azure Blob Storage, and other fsspec-compatible stores.
Virtual Keys: Glob-like patterns in LazyDict keys for dynamic filtering
Schema Hinting: Pre-load metadata from Parquet files to detect schema mismatches before full materialization
Custom Metadata: Allow users to store arbitrary metadata alongside outputs

Testing¶

Run the full test suite:

pixi run pytest tests/serialization tests/test_keys.py

Key test modules:

tests/serialization/test_api.py: End-to-end serialize/deserialize, metadata, fallback sequencing, S3 round-trip
tests/serialization/test_encoders.py: Encoder predicates and detection logic
tests/serialization/test_lazy_dict.py: LazyDict access patterns and concatenation
tests/test_keys.py: Key encoding and decoding

Examples¶

Example 1: Saving Experiment Results¶

import pandas as pd
from tidyrun import serialize, deserialize

# After running an experiment
results = {
    "config": {"lr": 0.001, "epochs": 100},
    "metrics": pd.DataFrame({
        "epoch": [1, 2, 3],
        "loss": [0.5, 0.3, 0.2]
    }),
    "model_weights": some_large_array,  # Will pickle
}

serialize(results, "./experiments/exp_001")

# Later, load with lazy access
loaded = deserialize("./experiments/exp_001")
print(loaded["config"])  # Loaded on access
model = loaded["model_weights"]  # Pickled data

Example 2: Comparing Multiple Runs¶

runs = {
    "run_a": {
        "metrics": pd.DataFrame({"accuracy": [0.8, 0.85, 0.9]}),
    },
    "run_b": {
        "metrics": pd.DataFrame({"accuracy": [0.75, 0.82, 0.88]}),
    },
}

serialize(runs, "./comparison")

# Load and aggregate
loaded = deserialize("./comparison")
combined = loaded.concat(names=["run_id"])
print(combined)
# Output:
#             accuracy
# run_id
# run_a   0        0.80
#         1        0.85
#         2        0.90
# run_b   0        0.75
#         1        0.82
#         2        0.88

Example 3: Filtered Concatenation¶

# Load results from multiple time periods
results = deserialize("./results_by_month")

# Concatenate only 2026 results, tagging each frame with a period column
combined = results.concat(
    names=["month"],
    select=lambda path: path[0].startswith("2026"),
    transform=lambda df: df.assign(period="2026")
)
