Docker Compose Deployment¶

The full platform runs as a set of Docker containers managed by Docker Compose. This deployment includes the simulation engine, Apache Kafka for data streaming, Redpanda Connect for S3 sink streaming, Apache Spark for data processing, Apache Airflow for workflow orchestration, Garage for S3-compatible object storage, MLflow for experiment tracking, and monitoring exporters.

MLflow has not yet been integrated into the actual workflow. Ensuring stability of other components, including the simulation module, takes precedence.

Overview¶

Each component runs in its own container on a shared Docker bridge network. The compose file is at compose/docker-compose.ml.yaml. It defines:

Simulation containers - Quantum Pipeline GPU and CPU containers (in the batch profile, started by Airflow DAGs)
Streaming layer - Apache Kafka (KRaft mode), Schema Registry, Redpanda Connect (S3 sink to Garage)
Processing layer - Spark master and worker
Orchestration - Airflow 3.x with CeleryExecutor: API server, scheduler, DAG processor, Celery worker, triggerer, with PostgreSQL and Redis
Storage - Garage v2.2.0 S3-compatible object storage
ML tracking - MLflow experiment tracking server (backed by PostgreSQL + Garage)
Monitoring exporters - StatsD exporter, PostgreSQL exporter, Redis exporter, NVIDIA GPU exporter

Port mappings and dependencies are listed in the service tables below.

Quick Start¶

Prerequisites¶

Docker Engine 24.0 or later
Docker Compose v2.20 or later
NVIDIA Container Toolkit (for GPU containers)
just command runner (recommended)

Step 1: Clone the Repository¶

# codeberg mirror
git clone https://codeberg.org/piotrkrzysztof/quantum-pipeline.git

# or github
git clone https://github.com/straightchlorine/quantum-pipeline.git

cd quantum-pipeline

Step 2: Configure Environment¶

Run the first-time setup script, which generates secrets and Garage configuration:

just setup

This creates a .env file from .env.ml.example and generates the required secrets. You can also do it manually:

cp .env.ml.example .env
chmod 600 .env

Edit .env to set credentials, resource limits, and service ports. See the Environment Variables reference for all options.

Step 3: Build Images¶

Build the simulation container images:

# CPU only
just docker-build cpu

# GPU (set CUDA_ARCH for your GPU, default 8.6/Ampere)
CUDA_ARCH=6.1 just docker-build gpu

# Both
just docker-build all

Step 4: Deploy¶

just up

This runs docker compose --env-file .env -f compose/docker-compose.ml.yaml up -d.

Step 5: Verify¶

Check that all services are running:

docker compose --env-file .env -f compose/docker-compose.ml.yaml ps

Service Configuration¶

Streaming Services¶

Service	Image	Port(s)	Description
`kafka`	`apache/kafka:4.2.0`	9092 (internal), 9094 (external)	Message broker with KRaft mode
`schema-registry`	`confluentinc/cp-schema-registry:8.2.0`	8081	Avro schema management
`redpanda-connect`	`docker.redpanda.com/redpandadata/connect`	4195 (HTTP API: health, metrics)	S3 sink streaming from Kafka to Garage

Kafka runs in KRaft mode (no ZooKeeper). Schema Registry manages Avro schemas. Redpanda Connect replaces the older Kafka Connect + S3 Sink connector setup, streaming data from Kafka topics directly to Garage (S3).

Alternative: Confluent Kafka Connect¶

An override file at compose/docker-compose.ml.kafka-connect.yaml provides a Confluent Kafka Connect worker with the S3 Sink connector (v12.1.0) as an alternative to Redpanda Connect. To use it, pass both compose files and scale Redpanda Connect to zero:

docker compose --env-file .env \
  -f compose/docker-compose.ml.yaml \
  -f compose/docker-compose.ml.kafka-connect.yaml \
  up -d --scale redpanda-connect=0

The override adds two services: kafka-connect (port 8083) which installs the S3 connector plugin on startup, and kafka-connect-init which registers the garage-sink connector pointing at the raw-results bucket. Both depend on Kafka, Schema Registry, and Garage being healthy.

Processing Services¶

Service	Image	Port(s)	Description
`spark-master`	`quantum-pipeline-spark:4.0.2-python3.12`	8080 (UI), 7077 (RPC)	Spark cluster master
`spark-worker`	`quantum-pipeline-spark:4.0.2-python3.12`	-	Spark executor node (default: 2 GB memory, 2 cores)

Orchestration Services¶

Service	Image	Port(s)	Description
`airflow-apiserver`	Custom (docker/airflow)	8084	Airflow API server and web UI
`airflow-scheduler`	Custom (docker/airflow)	-	DAG scheduling
`airflow-dag-processor`	Custom (docker/airflow)	-	DAG parsing and processing
`airflow-worker`	Custom (docker/airflow)	-	Celery task execution
`airflow-triggerer`	Custom (docker/airflow)	-	Deferred task execution
`airflow-init`	Custom (docker/airflow)	-	Database migration, user creation, Spark connection
`postgres`	`postgres:16-alpine`	- (internal only)	Airflow + MLflow metadata database (see note below)
`redis`	`redis:7.2-bookworm`	6379 (internal only)	Celery broker

The Airflow common environment also configures rclone remotes for Garage and Cloudflare R2 via RCLONE_CONFIG_* environment variables. This allows the R2 sync DAG to copy data between Garage (local S3) and Cloudflare R2 (remote S3) without a separate rclone configuration file.

Storage Services¶

Service	Image	Port(s)	Description
`garage`	`dxflrs/garage:v2.2.0`	3901 (S3 API), 3903 (admin)	S3-compatible object storage

Garage stores raw VQE results, feature datasets, and Iceberg warehouse data. Buckets are configured during just setup. The S3 endpoint is http://garage:3901 from within the Docker network. The admin API on port 3903 exposes cluster status and metrics that can be scraped by Prometheus.

ML Tracking¶

Service	Image	Port(s)	Description
`mlflow`	`ghcr.io/mlflow/mlflow:v3.10.1-full`	5000	Experiment tracking and artifact storage

MLflow uses PostgreSQL for the backend store and Garage (S3) for artifact storage (s3://mlflow-artifacts/).

Monitoring Exporters¶

These services expose Prometheus metrics for scraping by an external monitoring server:

Service	Image	Port(s)	Description
`airflow-statsd-exporter`	`prom/statsd-exporter`	9102 (metrics), 9125/udp (StatsD)	Converts Airflow StatsD metrics to Prometheus format
`postgres-exporter`	`prometheuscommunity/postgres-exporter`	9187	PostgreSQL metrics
`redis-exporter`	`oliver006/redis_exporter`	9121	Redis metrics
`nvidia-gpu-exporter`	`utkuozdemir/nvidia_gpu_exporter:1.2.0`	9835	NVIDIA GPU utilization, memory, temperature

Some of the images above float on an unpinned or latest tag: Redpanda Connect and the StatsD, PostgreSQL, and Redis exporters. A redeploy can pull a newer image and change behavior, so pin these to specific versions before any long-running data generation.

Batch Simulation Containers¶

These containers are defined in the batch profile and are not started by just up. They are launched on-demand by Airflow DAGs via docker compose run:

Service	Image	Description
`quantum-pipeline-gpu1`	`quantum-pipeline:gpu`	GPU simulation (device 0, GTX 1060 6GB)
`quantum-pipeline-gpu2`	`quantum-pipeline:gpu`	GPU simulation (device 1, GTX 1050 Ti)
`quantum-pipeline-cpu`	`quantum-pipeline:cpu`	CPU simulation

Each batch container connects to Kafka and Schema Registry, and exports monitoring metrics via the MONITORING_ENABLED, PUSHGATEWAY_URL, and MONITORING_EXPORT_FORMAT environment variables.

Networking¶

All services share a single Docker bridge network (quantum-ml-network). Services communicate using container names as hostnames (e.g., kafka:9092, garage:3901). The Kafka external listener on port 9094 allows access from outside the Docker network.

Volumes¶

The deployment uses named Docker volumes for persistent data:

Volume	Service	Purpose
`quantum-garage-data`	Garage	Object storage data
`quantum-garage-meta`	Garage	Object storage metadata
`quantum-ml-spark-warehouse`	Spark	Spark SQL warehouse
`quantum-ml-spark-ivy-cache`	Spark	Cached JAR dependencies
`quantum-ml-airflow-postgres`	PostgreSQL	Airflow + MLflow metadata
`quantum-ml-airflow-logs`	Airflow	Task execution logs
`quantum-ml-airflow-ivy-cache`	Airflow	Cached JARs for SparkSubmit
`quantum-ml-redis-data`	Redis	Celery broker data
`quantum-ml-kafka-data`	Kafka	Kafka log segments

Bind mounts provide access to project files: DAG definitions (docker/airflow/), Spark configuration (compose/spark-defaults.conf), Garage configuration (compose/garage.toml), Redpanda Connect config (compose/redpanda-connect.yaml), simulation data (data/), and output (gen/).

Justfile Commands¶

Command	Description
`just setup`	First-time setup: generates secrets, Garage config, `.env` file
`just up`	Start the compose stack
`just down`	Force-remove running batch containers, then stop the stack (includes `--profile batch`)
`just logs [service]`	Tail compose logs (all services, or a specific one)
`just docker-build cpu`	Build CPU simulation image
`just docker-build gpu`	Build GPU simulation image
`just docker-build all`	Build both CPU and GPU images

Scaling¶

Scale Spark workers with:

docker compose --env-file .env -f compose/docker-compose.ml.yaml up -d --scale spark-worker=3

For additional simulation instances, add new service definitions with a unique container_name, distinct Kafka topic, appropriate GPU device assignment, and separate output volume mounts.

Stopping the Deployment¶

# Stop services, preserve data (also force-removes any running batch containers)
just down

# Or manually, including volume cleanup
docker compose --env-file .env -f compose/docker-compose.ml.yaml --profile batch down -v