Grafana Dashboards¶

Grafana provides the visualization layer for monitoring simulations. The repository includes a pre-built dashboard covering batch generation progress, VQE performance comparison, scientific accuracy, GPU utilization, system resources, data platform health, databases, and orchestration.

Setup¶

The main Docker Compose file compose/docker-compose.ml.yaml ships the pipeline services and metric exporters, but not the monitoring backends. It assumes a PushGateway, Prometheus, and Grafana are already running. See Running a PushGateway for the collection side.

For a basic Prometheus and Grafana setup, see the Grafana documentation.

Dashboard Import¶

Auto-provisioning¶

If Grafana is configured with the provisioning file at monitoring/grafana/provisioning/dashboards/dashboard.yml, dashboards are loaded automatically from /var/lib/grafana/dashboards inside the container. Mount the dashboard JSON there and Grafana picks it up on startup.

Manual Import¶

The primary dashboard is located at monitoring/grafana/dashboards/quantum-pipeline.json.

To import manually:

1. In Grafana, navigate to Dashboards > Import.
2. Click Upload JSON file and select quantum-pipeline.json from the repository. Alternatively, paste the JSON content directly into the "Import via panel json" text field.
3. In the Prometheus data source dropdown, select your configured Prometheus instance.
4. Click Import.

The dashboard will appear under the name Quantum Pipeline - ML Stack.
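
For scripted setups, the same import can in principle be done through Grafana's HTTP API instead of the UI. The following is a minimal sketch, not something shipped in the repository: it wraps a dashboard model in the body accepted by `POST /api/dashboards/db` and prepares the request without sending it. The function names, the Grafana URL, and the API token are placeholders chosen for illustration. Note that this endpoint stores the JSON as given, so it may not resolve data source placeholders the way the UI import dialog does.

```python
import copy
import json
import urllib.request


def build_import_payload(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap a dashboard model in the body expected by POST /api/dashboards/db."""
    model = copy.deepcopy(dashboard)
    # A null id asks Grafana to create the dashboard instead of updating by numeric id.
    model["id"] = None
    return {"dashboard": model, "overwrite": overwrite}


def build_import_request(grafana_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the request; grafana_url and token are placeholders."""
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To use it, load monitoring/grafana/dashboards/quantum-pipeline.json with `json.load`, pass the result through both functions, and send the request with `urllib.request.urlopen`.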

Template Variables¶

The dashboard includes template variables for interactive filtering. The variables have no display label set, so Grafana shows the raw variable name (lowercase for all but `DS_PROMETHEUS`) in the dashboard header.

Variable	Description	Source
`DS_PROMETHEUS`	Prometheus data source selector	Data source query
`tier`	Batch generation tier	`label_values(qp_batch_total, tier)`
`lane`	Batch generation lane	`label_values(qp_batch_done, lane)`
`optimizer`	Filter by optimization algorithm	`label_values(qp_vqe_total_time, optimizer)`
`molecule`	Filter by molecule symbols	`label_values(qp_vqe_total_time, molecule_symbols)`
`container_type`	Filter by container configuration (cpu, gpu1, gpu2)	`label_values(qp_vqe_total_time, container_type)`

The optimizer, molecule, and container_type variables support multi-select; tier and lane are single-select. Each query variable offers an "All" option for viewing data across all configurations at once.
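
To make the multi-select behavior concrete, the sketch below approximates how a selection ends up inside a `=~"..."` matcher such as `container_type=~"$container_type"`. With a Prometheus data source, Grafana joins multiple selected values into a regex alternation. This is an illustration of that behavior, not Grafana's own code: the function name is hypothetical, the values are assumed to contain no regex metacharacters (so no escaping is done), and the "All" option is not modeled.

```python
def render_matcher(label: str, selected: list[str]) -> str:
    """Render label=~"..." for the values selected in a template variable.

    Assumes values contain no regex metacharacters, so nothing is escaped.
    """
    if not selected:
        raise ValueError("at least one value must be selected")
    if len(selected) == 1:
        pattern = selected[0]
    else:
        # Several values become a regex alternation, e.g. (gpu1|gpu2).
        pattern = "(" + "|".join(selected) + ")"
    return f'{label}=~"{pattern}"'


def render_query(metric: str, label: str, selected: list[str]) -> str:
    """Build a full selector for one metric and one label matcher."""
    return f"{metric}{{{render_matcher(label, selected)}}}"
```

For example, `render_query("qp_sys_cpu_percent", "container_type", ["gpu1", "gpu2"])` produces a selector that matches both GPU containers.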

Dashboard Layout¶

The dashboard has 73 panels organized into 9 rows. Six rows (VQE Performance, Batch Generation, System Resources, Data Platform, Databases, Orchestration) are collapsed by default; three (Quality Metrics, GPU, Block Storage) are expanded.
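
The panel counts can be checked against the dashboard JSON itself. The sketch below assumes the file follows the usual Grafana dashboard model, in which rows are entries of type `row` in the top-level `panels` list, collapsed rows hold their children in a nested `panels` list, and expanded rows are followed by their children as siblings. The function name is hypothetical, and whether quantum-pipeline.json matches this layout exactly is an assumption.

```python
def panels_per_row(dashboard: dict) -> dict[str, int]:
    """Count the panels that belong to each row of a Grafana dashboard model."""
    counts: dict[str, int] = {}
    current = None
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            current = panel.get("title", "")
            # Collapsed rows carry their children in a nested "panels" list.
            counts[current] = len(panel.get("panels", []))
        elif current is not None:
            # Panels after an expanded row belong to it until the next row starts.
            counts[current] += 1
    return counts
```

Running this on the parsed dashboard file and summing the values gives a quick way to confirm the totals listed in this section.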

Row 1: VQE Performance (7 panels)¶

Detailed performance analysis across container configurations.

Panel	Type	PromQL
VQE Efficiency	Gauge	`qp_vqe_efficiency{...}`
Overhead Ratio	Gauge	`qp_vqe_overhead_ratio{...}`
Iterations per Second	Time Series	`qp_vqe_iterations_per_second{...}`
Setup Ratio	Time Series	`qp_vqe_setup_ratio{...}`
VQE Total Time	Time Series	`qp_vqe_total_time{...}`
Time per Iteration	Time Series	`qp_vqe_time_per_iteration{...}`
Iteration Count	Time Series	`qp_vqe_iterations_count{...}`

Row 2: Batch Generation (7 panels)¶

Tracks the state and throughput of ML data generation runs.

Panel	Type	PromQL
Pending	Stat	`sum(qp_batch_pending{tier=~"$tier"})`
Progress	Gauge	`sum(qp_batch_done{tier=~"$tier"}) / sum(qp_batch_total{tier=~"$tier"}) * 100`
Throughput (runs/hour)	Time Series	`increase(qp_batch_done{tier=~"$tier",lane=~"$lane"}[10m]) * 6`
Failed Over Time	Time Series	`qp_batch_failed{tier=~"$tier",lane=~"$lane"}`
Done	Stat	`sum(qp_batch_done{tier=~"$tier"})`
Failed	Stat	`sum(qp_batch_failed{tier=~"$tier"})`
Per-Lane Progress	Bar Gauge	`qp_batch_done{tier=~"$tier",lane=~"$lane"}`
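
The arithmetic behind the Progress and Throughput panels can be sketched as follows. The function names and sample numbers are illustrative only. The factor of 6 scales a 10-minute window to one hour, and `increase()` in Prometheus extrapolates over the window, so the real panel may show non-integer values.

```python
import math


def progress_percent(done: float, total: float) -> float:
    """Mirror sum(qp_batch_done) / sum(qp_batch_total) * 100."""
    if total == 0:
        # No meaningful percentage exists without a total; report NaN.
        return math.nan
    return done / total * 100


def runs_per_hour(increase_10m: float) -> float:
    """Scale the increase seen over a 10-minute window to an hourly rate."""
    windows_per_hour = 60 / 10  # six 10-minute windows in one hour
    return increase_10m * windows_per_hour
```

With hypothetical inputs of 150 completed runs out of 600, `progress_percent` returns 25.0, and 4 runs completed in the last 10 minutes corresponds to 24 runs per hour.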

Row 3: Quality Metrics (4 panels)¶

Monitors the scientific quality of simulation results.

Panel	Type	PromQL
Accuracy Score	Bar Gauge	`qp_vqe_accuracy_score{...}`
Reference Energy	Time Series	`qp_vqe_reference_energy{...}`
Ground State Energy	Time Series	`qp_vqe_minimum_energy{...}`
Energy Error (mHa)	Time Series	`qp_vqe_energy_error_millihartree{...}`

Row 4: GPU (8 panels)¶

GPU metrics sourced from nvidia_gpu_exporter on port 9835.

Panel	Description
GPU Utilization	GPU core utilization percentage
GPU Power Draw	Power consumption in watts
Fan Speed	Fan speed percentage
Power State	Current GPU power state
Graphics Clock (MHz)	Core clock frequency
Memory Free	Available GPU memory
GPU Temperature	Temperature in Celsius
GPU Memory Usage	Memory utilization percentage

Row 5: System Resources (4 panels)¶

Hardware utilization for each simulation container.

Panel	PromQL
CPU Usage	`qp_sys_cpu_percent{container_type=~"$container_type"}`
Memory Usage	`qp_sys_memory_percent{container_type=~"$container_type"}`
Uptime	`qp_sys_uptime_seconds{container_type=~"$container_type"}`
Load Average (1m)	`qp_sys_cpu_load_1m{container_type=~"$container_type"}`
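
The same expressions can be evaluated outside Grafana through the Prometheus HTTP API, which is useful when a panel shows no data. The sketch below builds an instant-query URL and reads the per-container values out of a response. The function names and the Prometheus address are assumptions, and `$container_type` has to be replaced with a concrete regex because the variable only exists inside Grafana.

```python
import urllib.parse


def instant_query_url(prometheus_url: str, promql: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    query = urllib.parse.urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + query


def values_by_container(response: dict) -> dict[str, float]:
    """Map container_type to the sample value of an instant-vector response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    values: dict[str, float] = {}
    for sample in response["data"]["result"]:
        container = sample["metric"].get("container_type", "unknown")
        # Each instant-vector sample is [unix_timestamp, "value-as-string"].
        values[container] = float(sample["value"][1])
    return values
```

Fetch the URL with `urllib.request.urlopen`, parse the body with `json.load`, and pass the result to `values_by_container`.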

Row 6: Data Platform (8 panels)¶

Redpanda Connect pipeline metrics from the internal metrics endpoint on port 4195.

Panel	Description
Redpanda Input Latency (ms)	Input processing latency
Redpanda Output Latency (ms)	Output processing latency
Redpanda Batch Throughput	Message batch processing rate
Redpanda Errors/sec	Error rate from `output_error_total`
Redpanda Output Connection	Output connection status
Redpanda Input Connection	Input connection status
Redpanda Messages In/sec	Inbound message throughput
Redpanda Messages Out/sec	Outbound message throughput

Row 7: Databases (18 panels)¶

Metrics from postgres-exporter on port 9187 and redis-exporter on port 9121.

Covers: exporter health, connections, cache hit ratio, transactions/sec, deadlocks, tuple operations, DB size, active queries, locks (Postgres); keys, evicted keys, network I/O, uptime, hit rate, ops/sec, memory, clients (Redis).

Row 8: Block Storage (8 panels)¶

Garage metrics from the admin API on port 3903.

Covers: disk usage, stored blocks, cluster health, connected nodes, storage nodes, partitions OK, resync queue, admin request rate.

Row 9: Orchestration (9 panels)¶

Airflow metrics from the StatsD exporter on port 9102.

Panel	Description
DAGBag Size	Number of DAGs loaded
Airflow Task Instances	Task instance counts by state
DAG Parse Time	Time to parse DAG files
Scheduler Executable Tasks	Tasks ready for scheduling
Triggerer Capacity	Triggerer slot utilization
Executor Tasks	Tasks in executor by state
Scheduler Heartbeat	Scheduler liveness
DAG Import Errors	DAG loading failures
Pool Slots (default_pool)	Pool slot utilization

Alerting¶

Alert rules are provisioned from monitoring/grafana/provisioning/alerting/rules.yml. They cover batch stalls, accuracy degradation, resource saturation, GPU temperature, and service availability. The rules ship with the configuration but have not been tested yet.

For a complete reference of available metrics and their Prometheus names, see Performance Metrics.