Performance Metrics¶

The simulation module collects metrics during VQE simulation execution, organized into three categories: system metrics, VQE execution metrics, and batch progress metrics. System and VQE metrics are exported by performance_monitor.py; batch progress metrics are pushed by scripts/generate_ml_batch.py.

Metric Categories¶

System Metrics (`qp_sys_*`)¶

System metrics track hardware resource utilization for each simulation container. A background monitoring thread collects them and pushes them to the Prometheus PushGateway at a configurable interval (see MONITORING_INTERVAL under Collection Configuration).

Metric	Prometheus Name	Type	Unit	Description
CPU Usage	`qp_sys_cpu_percent`	Gauge	%	Current CPU utilization percentage
Memory Usage	`qp_sys_memory_percent`	Gauge	%	Current memory utilization percentage
Memory Used	`qp_sys_memory_used_bytes`	Gauge	bytes	Current memory consumption in bytes
CPU Load (1m)	`qp_sys_cpu_load_1m`	Gauge	load	1-minute load average
Container Uptime	`qp_sys_uptime_seconds`	Gauge	seconds	Time since container start
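
The background collection loop described above can be sketched as follows. This is a minimal illustration, not the PerformanceMonitor implementation: the class name `SystemMetricsThread`, the `push_fn` callback, and the helper `collect_system_sample` are hypothetical, and only the two metrics that the standard library can supply (uptime and 1-minute load) are sampled. CPU and memory percentages would need an extra library, which the source does not name.

```python
import os
import threading
import time


def collect_system_sample(started_at):
    """Return one sample as a dict keyed by metric name (stdlib-only subset)."""
    sample = {"qp_sys_uptime_seconds": time.monotonic() - started_at}
    try:
        # 1-minute load average; unavailable on some platforms (e.g. Windows).
        sample["qp_sys_cpu_load_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    return sample


class SystemMetricsThread(threading.Thread):
    """Hypothetical stand-in for the monitor's background thread."""

    def __init__(self, push_fn, interval=10.0):
        super().__init__(daemon=True)
        self.push_fn = push_fn        # callable that receives each sample dict
        self.interval = interval      # seconds between samples
        self.stop_event = threading.Event()
        self.started_at = time.monotonic()

    def run(self):
        while not self.stop_event.is_set():
            try:
                self.push_fn(collect_system_sample(self.started_at))
            except Exception as exc:
                # Log and continue so a failed push never stops monitoring.
                print(f"metric push failed: {exc}")
            # Event.wait acts as a sleep that stop() can interrupt.
            self.stop_event.wait(self.interval)

    def stop(self):
        self.stop_event.set()
        self.join()
```

Using `Event.wait` instead of `time.sleep` lets the thread exit promptly when the simulation finishes, rather than waiting out a full interval.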

GPU Metrics

GPU-specific Prometheus metrics (utilization, memory, temperature) are not exported by PerformanceMonitor directly, because GPU usage varies too rapidly to produce reliable statistics from within the simulation process. Instead, GPU metrics are collected by nvidia_gpu_exporter on port 9835, which queries nvidia-smi on each Prometheus scrape (every 15s by default). The Grafana dashboard has a dedicated GPU row with 8 panels sourced from this exporter.

VQE Execution Metrics (`qp_vqe_*`)¶

VQE execution metrics are exported after each simulation completes. These include timing breakdowns, result values, and derived efficiency metrics.

Metric	Prometheus Name	Type	Description
Total Execution Time	`qp_vqe_total_time`	Gauge	End-to-end simulation time
Hamiltonian Build Time	`qp_vqe_hamiltonian_time`	Gauge	Time to construct the molecular Hamiltonian
Qubit Mapping Time	`qp_vqe_mapping_time`	Gauge	Time for fermionic-to-qubit operator mapping
VQE Optimization Time	`qp_vqe_vqe_time`	Gauge	Time spent in VQE optimization loop
Minimum Energy	`qp_vqe_minimum_energy`	Gauge	Best energy found during optimization (Hartree)
Iterations Count	`qp_vqe_iterations_count`	Gauge	Total optimizer iterations to convergence
Optimal Parameters	`qp_vqe_optimal_parameters_count`	Gauge	Number of optimized variational parameters

Accuracy Metrics (`qp_vqe_*`)¶

These metrics compare VQE results against the Hartree-Fock (HF) reference energy computed by PySCF for each molecule. The HF energy is an upper bound on the exact ground-state energy, so a good VQE result will be at or below it.

Dashboard Metric	Prometheus Name	Type	Description
Reference Energy	`qp_vqe_reference_energy`	Gauge	HF reference energy from PySCF (Ha)
Energy Error (Ha)	`qp_vqe_energy_error_hartree`	Gauge	`VQE_total_energy - HF_energy` (Ha)
Energy Error (mHa)	`qp_vqe_energy_error_millihartree`	Gauge	Same error in millihartree
Accuracy Score	`qp_vqe_accuracy_score`	Gauge	Log-scaled score from 0 to 100

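The accuracy score uses a logarithmic damping function: score = max(0, min(100, 100 * (1 - log10(|error_mHa| + 1) / 5))). A score of 100 means the VQE energy matches the HF reference exactly. Because the formula takes the absolute value of the error, a VQE energy below HF lowers the score by the same amount as one above it, so the score measures distance from the reference rather than solution quality. From the formula, ~1 mHa error (roughly chemical accuracy) scores ~94, ~10 mHa scores ~79, and 99 mHa scores 60. The score reaches 0 at ~100 Ha (99,999 mHa) error.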
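
As a quick check of the formula, the sketch below reproduces the error and score calculation in Python. The function names `energy_error_mha` and `accuracy_score` are illustrative and are not taken from performance_monitor.py.

```python
import math


def energy_error_mha(vqe_energy_ha, hf_energy_ha):
    """Signed error in millihartree; negative when VQE lands below HF."""
    return (vqe_energy_ha - hf_energy_ha) * 1000.0


def accuracy_score(error_mha):
    """Log-scaled score clamped to the range [0, 100]."""
    raw = 100.0 * (1.0 - math.log10(abs(error_mha) + 1.0) / 5.0)
    return max(0.0, min(100.0, raw))


print(round(accuracy_score(1.0), 2))   # 93.98
print(round(accuracy_score(10.0), 2))  # 79.17
```

Note that `accuracy_score(-10.0)` equals `accuracy_score(10.0)`, since only the magnitude of the error enters the formula.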

Note

The reference is the HF ground state energy from PySCF, not a literature or FCI value. Since HF is an approximation, VQE can find lower (better) energies, resulting in negative error values. This is expected behavior for a well-optimized VQE run, not a measurement error.

Derived Efficiency Metrics (`qp_vqe_*`)¶

These are calculated from the timing and iteration data and pushed alongside VQE metrics.

Metric	Prometheus Name	Type	Description
Iterations per Second	`qp_vqe_iterations_per_second`	Gauge	`iterations_count / vqe_time`
Time per Iteration	`qp_vqe_time_per_iteration`	Gauge	`vqe_time / iterations_count`
Overhead Ratio	`qp_vqe_overhead_ratio`	Gauge	`(total_time - vqe_time) / vqe_time`
VQE Efficiency	`qp_vqe_efficiency`	Gauge	`vqe_time / total_time`
Setup Ratio	`qp_vqe_setup_ratio`	Gauge	`(hamiltonian_time + mapping_time) / total_time`
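
The formulas in the table translate directly into code. The following is a sketch under two assumptions: the function name `derived_efficiency` is hypothetical, and a zero denominator yields 0.0 rather than an error (the source does not say how the monitor handles that case).

```python
def derived_efficiency(total_time, vqe_time, hamiltonian_time,
                       mapping_time, iterations_count):
    """Compute the derived qp_vqe_* metrics from raw timings and iterations."""

    def ratio(numerator, denominator):
        # Assumption: report 0.0 instead of raising on a zero denominator.
        return numerator / denominator if denominator else 0.0

    return {
        "qp_vqe_iterations_per_second": ratio(iterations_count, vqe_time),
        "qp_vqe_time_per_iteration": ratio(vqe_time, iterations_count),
        "qp_vqe_overhead_ratio": ratio(total_time - vqe_time, vqe_time),
        "qp_vqe_efficiency": ratio(vqe_time, total_time),
        "qp_vqe_setup_ratio": ratio(hamiltonian_time + mapping_time, total_time),
    }
```

For example, a run with a total time of 10, a VQE time of 8, and 40 iterations gives 5 iterations per unit time, an overhead ratio of 0.25, and an efficiency of 0.8.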

Batch Progress Metrics (`qp_batch_*`)¶

Batch progress metrics are pushed by scripts/generate_ml_batch.py and track the state of ML data generation runs across tiers and lanes.

Metric	Prometheus Name	Type	Labels	Description
Total Runs	`qp_batch_total`	Gauge	`tier`	Total number of runs in the batch
Completed	`qp_batch_done`	Gauge	`tier`, `lane`	Runs completed successfully
Failed	`qp_batch_failed`	Gauge	`tier`, `lane`	Runs that failed
Pending	`qp_batch_pending`	Gauge	`tier`, `lane`	Runs waiting to start
In Progress	`qp_batch_in_progress`	Gauge	`tier`, `lane`	Runs currently executing
Last Completion	`qp_batch_last_completion_ts`	Gauge	`tier`, `lane`	Unix timestamp of most recent completion

Labels¶

Each metric carries the subset of the following labels that applies to it; the Full Metric List shows the exact label set per metric:

Label	Description	Example Values
`container_type`	Simulation container configuration	`cpu`, `gpu1`, `gpu2`
`molecule_symbols`	Chemical formula of the molecule	`H2`, `LiH`, `BeH2`, `H2O`, `NH3`
`optimizer`	Optimization algorithm	`L-BFGS-B`, `COBYLA`, `SLSQP`
`basis_set`	Basis set for the simulation	`sto-3g`, `cc-pvdz`
`molecule_id`	Unique identifier for the molecule instance	`0`, `1`, `2`
`backend_type`	Quantum simulation backend	`aer_simulator`, `statevector`
`tier`	Batch generation tier (batch metrics only)	`sto3g`, `ccpvdz`
`lane`	Batch generation lane (batch metrics only)	`cpu`, `gpu1`, `gpu2`

Full Metric List¶

# System Resource Metrics
qp_sys_cpu_percent{container_type}
qp_sys_memory_percent{container_type}
qp_sys_memory_used_bytes{container_type}
qp_sys_cpu_load_1m{container_type}
qp_sys_uptime_seconds{container_type}

# VQE Performance Metrics
qp_vqe_total_time{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_vqe_time{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_hamiltonian_time{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_mapping_time{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_iterations_count{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_minimum_energy{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_optimal_parameters_count{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}

# Scientific Accuracy Metrics
qp_vqe_reference_energy{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_energy_error_hartree{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_energy_error_millihartree{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_accuracy_score{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}

# Derived Efficiency Metrics
qp_vqe_iterations_per_second{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_time_per_iteration{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_overhead_ratio{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_efficiency{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}
qp_vqe_setup_ratio{container_type, molecule_id, molecule_symbols, optimizer, backend_type, basis_set}

# Batch Progress Metrics
qp_batch_total{tier}
qp_batch_done{tier, lane}
qp_batch_failed{tier, lane}
qp_batch_pending{tier, lane}
qp_batch_in_progress{tier, lane}
qp_batch_last_completion_ts{tier, lane}
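
The `name{labels}` notation above lists label names only. In the Prometheus text exposition format, each sample also carries quoted label values and a numeric value. The sketch below renders one such line; `format_sample` is a hypothetical helper for illustration, and the label values and counts are made up.

```python
def format_sample(name, labels, value):
    """Render one sample line in the Prometheus text exposition format."""

    def escape(label_value):
        # Label values escape backslash, double quote, and newline.
        return (str(label_value)
                .replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))

    body = ",".join(f'{key}="{escape(val)}"' for key, val in labels.items())
    return f"{name}{{{body}}} {value}" if body else f"{name} {value}"


print(format_sample("qp_batch_done", {"tier": "sto3g", "lane": "cpu"}, 12))
# qp_batch_done{tier="sto3g",lane="cpu"} 12
```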

Collection Configuration¶

Environment Variables¶

Monitoring is configured via environment variables:

Variable	Default	Description
`MONITORING_ENABLED`	`false`	Enable monitoring
`MONITORING_INTERVAL`	`10`	Collection interval in seconds
`PUSHGATEWAY_URL`	`http://localhost:9091`	Prometheus PushGateway URL
`MONITORING_EXPORT_FORMAT`	`prometheus`	`json`, `prometheus`, or `both`
`CONTAINER_TYPE`	`unknown`	Label for the container (set automatically in Docker)

Constructor parameters take priority over environment variables, which take priority over settings.py defaults. See the PerformanceMonitor.__init__ method for details.
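
The precedence order can be illustrated with a small resolver. This is a sketch only: `resolve_setting` and the `DEFAULTS` dict are hypothetical stand-ins for the logic in PerformanceMonitor.__init__ and settings.py, with default values copied from the table above.

```python
import os

# Stand-in for the settings.py defaults, taken from the table above.
DEFAULTS = {
    "MONITORING_ENABLED": "false",
    "MONITORING_INTERVAL": "10",
    "PUSHGATEWAY_URL": "http://localhost:9091",
    "MONITORING_EXPORT_FORMAT": "prometheus",
    "CONTAINER_TYPE": "unknown",
}


def resolve_setting(name, explicit=None, environ=None):
    """Return the constructor value, else the env var, else the default."""
    if explicit is not None:
        return explicit
    environ = os.environ if environ is None else environ
    return environ.get(name, DEFAULTS[name])
```

With `MONITORING_INTERVAL=5` in the environment, `resolve_setting("MONITORING_INTERVAL")` returns `"5"`, while `resolve_setting("MONITORING_INTERVAL", explicit=30)` returns `30` regardless of the environment.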

Enabling Monitoring¶

Activate monitoring with environment variables:

export MONITORING_ENABLED=true
export PUSHGATEWAY_URL=http://pushgateway:9091

When enabled, the container will:

Collect CPU and memory metrics in a background thread
Push system metrics to the PushGateway at configurable intervals
Export VQE metrics (iterations, energy, timing) on completion

See Environment Variables for the full list of MONITORING_* variables.

Monitoring Overhead

The monitoring thread introduces minimal overhead and runs independently of VQE computation.

Export Formats¶

Metrics can be exported in multiple formats:

Prometheus (Default)

Metrics are pushed to the PushGateway in Prometheus exposition format. Scrape targets are configured in prometheus.yml.

export MONITORING_ENABLED=true
export MONITORING_EXPORT_FORMAT=prometheus
export PUSHGATEWAY_URL=http://pushgateway:9091

JSON

Metrics are saved to a local JSON file after simulation completion:

export MONITORING_ENABLED=true
export MONITORING_EXPORT_FORMAT=json

Both

Export to both Prometheus and JSON simultaneously:

export MONITORING_ENABLED=true
export MONITORING_EXPORT_FORMAT=both
export PUSHGATEWAY_URL=http://pushgateway:9091

PushGateway Hostname

The PushGateway hostname depends on where you run the gateway. The default http://localhost:9091 is for local development. When the gateway runs on a shared network, set the URL to match its hostname (for example http://pushgateway:9091).

Running a PushGateway¶

The Docker Compose stack in compose/docker-compose.ml.yaml ships the metric exporters and the pipeline services, but it does not bundle a PushGateway, Prometheus, or Grafana. You need to run a PushGateway yourself before any qp_* metric can be collected. If nothing is reachable at PUSHGATEWAY_URL, pushes fail and the error is logged and swallowed, so no metrics are recorded.

The simplest setup runs the official prom/pushgateway image:

docker run -d --name pushgateway -p 9091:9091 prom/pushgateway

Then point the pipeline at it with PUSHGATEWAY_URL and add the gateway as a scrape target in prometheus.yml so Prometheus pulls the pushed metrics.
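
Because failed pushes are logged and swallowed, it can help to confirm that the gateway is reachable before starting a run. The sketch below probes the PushGateway's `/-/ready` endpoint using only the standard library. The function name `pushgateway_ready` is hypothetical and is not part of the pipeline.

```python
import urllib.error
import urllib.request


def pushgateway_ready(base_url, timeout=2.0):
    """Return True if the PushGateway answers 200 on its readiness endpoint."""
    url = base_url.rstrip("/") + "/-/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx response.
        return False
```

Calling `pushgateway_ready("http://localhost:9091")` before launching a batch gives an early signal that metrics would otherwise be silently dropped.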

PushGateway Grouping Key¶

VQE metrics are pushed with a grouping key that prevents overwrites between concurrent simulation runs that differ in container type, molecule, or optimizer:

/metrics/job/qp-vqe/container_type/{type}/molecule/{symbol}/optimizer/{optimizer}

System metrics use a separate job name: qp-sys-{container_type}.
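
To make the grouping scheme concrete, the sketch below builds the push path and the system job name from their parts. The helper names `vqe_push_path` and `sys_job_name` are hypothetical. The sketch also assumes label values are non-empty and contain no `/`, since such values need special encoding in PushGateway URLs.

```python
def vqe_push_path(container_type, molecule, optimizer, job="qp-vqe"):
    """Build the PushGateway path for one VQE grouping key."""
    parts = [
        ("container_type", container_type),
        ("molecule", molecule),
        ("optimizer", optimizer),
    ]
    for label, value in parts:
        # Assumption: reject values that would need special URL encoding.
        if not value or "/" in value:
            raise ValueError(f"unsupported value for {label}: {value!r}")
    return f"/metrics/job/{job}" + "".join(f"/{k}/{v}" for k, v in parts)


def sys_job_name(container_type):
    """Job name used for system metrics."""
    return f"qp-sys-{container_type}"


print(vqe_push_path("gpu1", "H2", "COBYLA"))
# /metrics/job/qp-vqe/container_type/gpu1/molecule/H2/optimizer/COBYLA
```

Two runs that share all three values map to the same path, so the later push replaces the earlier group's metrics of the same name.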

Alerting¶

Alerting rules are defined at monitoring/grafana/provisioning/alerting/rules.yml and cover batch generation stalls, accuracy degradation, resource saturation, GPU overheating, and service availability. Alerting is present in the configuration but has not been tested yet and is not covered in detail here.

For instructions on visualizing metrics in Grafana, see Grafana Dashboards.