Skip to content

Iceberg Storage

Apache Iceberg provides the table format for feature tables, with Garage (v2.2.0) as the S3-compatible storage backend. Together they deliver ACID transactions, time-travel queries, schema evolution, and snapshot management.

For how Iceberg and Garage fit into the overall architecture, see System Design.

Catalog Structure

All feature tables are organized under a single Iceberg catalog:

quantum_catalog                         -- Iceberg catalog
  └── quantum_features                  -- Database
        ├── molecules                   -- Table
        ├── ansatz_info                 -- Table
        ├── performance_metrics         -- Table
        ├── vqe_results                 -- Table
        ├── initial_parameters          -- Table
        ├── optimal_parameters          -- Table
        ├── vqe_iterations              -- Table
        ├── iteration_parameters        -- Table
        ├── hamiltonian_terms           -- Table
        ├── ml_iteration_features       -- ML feature table
        ├── ml_run_summary              -- ML feature table
        └── processing_metadata         -- Audit table

Catalog Configuration

Configured via compose/spark-defaults.conf, mounted into Spark containers:

Configuration Value Description
Catalog name quantum_catalog Identifier used in SQL queries
Catalog type hadoop Metadata stored in the warehouse path
Warehouse s3a://features/warehouse/ Root location for table data and metadata in Garage
IO implementation HadoopFileIO File I/O layer for reading/writing through S3A

S3A Configuration

Spark reaches Garage through the Hadoop S3A filesystem driver, also in spark-defaults.conf:

Setting Value
Endpoint http://garage:3901
Region garage
Path-style access true
SSL false
Fast upload true

Credentials are provided via AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. In docker-compose.ml.yaml, these map from the project-level S3_ACCESS_KEY and S3_SECRET_KEY.

Physical Storage Layout

s3a://features/warehouse/
└── quantum_features/
    ├── vqe_results/
    │   ├── metadata/
    │   │   ├── v1.metadata.json
    │   │   ├── v2.metadata.json
    │   │   ├── snap-1234567890.avro
    │   │   └── snap-1234567891.avro
    │   └── data/
    │       ├── processing_date=2025-01-10/
    │       │   ├── part-00000.parquet
    │       │   └── part-00001.parquet
    │       └── processing_date=2025-01-11/
    │           └── part-00000.parquet
    ├── molecules/
    │   ├── metadata/
    │   └── data/
    ├── ml_iteration_features/
    │   ├── metadata/
    │   └── data/
    └── ml_run_summary/
        ├── metadata/
        └── data/

Metadata Files

File Type Description
v*.metadata.json Table schema, partition spec, snapshot pointer, table properties
snap-*.avro Snapshot metadata with manifest list references
Manifest lists Lists of manifest files for a given snapshot
Manifest files Lists of data files, partition values, file-level statistics

Snapshot Tagging

Each Spark write creates a new snapshot, tagged with a version identifier from the processing batch ID. This enables reproducible ML training by referencing a specific snapshot tag.

Tagging is done in process_incremental_data(). Initial writes use v_{batch_id} tags; incremental appends use v_incr_{batch_id}.

Garage Integration

Garage (v2.2.0) provides S3-compatible storage for raw JSON data and processed Parquet feature tables. It replaced MinIO from v1.x.

The Garage service is defined in compose/docker-compose.ml.yaml and configured via compose/garage.toml.template (uses env var substitution for secrets).

Buckets

Bucket Purpose Writer
raw-results JSON files from Redpanda Connect Redpanda Connect S3 output
features Processed Parquet files and Iceberg metadata Apache Spark
mlflow-artifacts MLflow experiment artifacts MLflow tracking server

Ports

Port Service
3901 S3 API (used by Redpanda Connect, Spark, rclone)
3903 Admin API (bucket creation, key management)

Raw Data Layout

Written by Redpanda Connect (see Kafka Streaming):

s3://raw-results/
└── experiments/
    └── experiment.vqe/
        ├── 1-1711900800000000000.json
        ├── 2-1711900801000000000.json
        └── ...

File naming: {counter}-{unix_nano_timestamp}.json.

Tip

Garage is S3 compatible. Use aws configure --profile garage with your S3_ACCESS_KEY / S3_SECRET_KEY from .env to browse files with aws s3.

Partitioning Strategy

Partitioning is set for expected query patterns:

Table Partition Columns Rationale
molecules processing_date Time-based filtering for incremental loads
ansatz_info processing_date, basis_set Filter by date and basis set
performance_metrics processing_date, basis_set Performance comparisons across basis sets
vqe_results processing_date, basis_set, backend Query by date, basis set, and backend
initial_parameters processing_date, basis_set Parameter analysis by date and basis set
optimal_parameters processing_date, basis_set Optimized parameter lookup
vqe_iterations processing_date, basis_set, backend Iteration analysis with backend filtering
iteration_parameters processing_date, basis_set Per-iteration parameter tracking
hamiltonian_terms processing_date, basis_set, backend Hamiltonian structure analysis

Iceberg uses partition metadata to skip irrelevant data files at query time, reducing I/O for partition-filtered queries.

Processing Metadata Table

An audit table tracks all processing runs:

Column Type Description
processing_batch_id string Batch identifier
processing_name string Processing job name
processing_timestamp timestamp When the batch was processed
processing_date date Processing date
table_names array<string> Tables written in this batch
table_versions array<string> Snapshot tags for each table
record_counts array<bigint> Records written per table
source_data_info string Source data description

Table Maintenance

Iceberg tables require periodic maintenance including snapshot expiration, data file compaction, and orphan file cleanup. See the Iceberg Maintenance documentation.

References