Troubleshooting¶
This page provides solutions to common issues encountered when installing, configuring, and running the Quantum Pipeline. Problems are organized by category, each following a consistent Symptom / Cause / Solution format.
Installation Issues¶
Simulation Issues¶
VQE Convergence Failure¶
Symptom: The simulation completes but reports that convergence was not achieved. The optimizer reached the maximum iteration limit without finding a minimum.
Cause: The default iteration limit may be too low for complex molecules, or random initial parameters may have placed the optimizer in a barren plateau, a region where gradients are too small for it to make progress.
Solution:
# Increase the maximum number of iterations
quantum-pipeline -f molecules.json --max-iterations 500
# Try Hartree-Fock initialization to avoid barren plateaus
quantum-pipeline -f molecules.json --init-strategy hf
# Try a different optimizer
quantum-pipeline -f molecules.json --optimizer COBYLA
# Loosen convergence tolerance for faster (less precise) convergence
quantum-pipeline -f molecules.json --convergence --threshold 1e-4
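To see why a looser threshold ends a run sooner, the sketch below counts iterations on a toy energy sequence. It assumes convergence is declared once the change between successive energies drops below the threshold; the pipeline's actual criterion may differ. `toy_energy` and `iterations_to_converge` are illustrative names, not part of quantum-pipeline.

```python
from typing import Optional


def toy_energy(k: int) -> float:
    """Toy energy trace approaching -1.0 geometrically (not real VQE data)."""
    return -1.0 + 0.5 ** k


def iterations_to_converge(threshold: float, max_iterations: int = 500) -> Optional[int]:
    """Return the first iteration whose energy change is below `threshold`.

    Returns None if the iteration limit is reached first, which corresponds
    to the "convergence was not achieved" symptom.
    """
    previous = toy_energy(0)
    for k in range(1, max_iterations + 1):
        current = toy_energy(k)
        if abs(current - previous) < threshold:
            return k
        previous = current
    return None


print(iterations_to_converge(1e-4))                     # looser threshold
print(iterations_to_converge(1e-6))                     # tighter threshold
print(iterations_to_converge(1e-6, max_iterations=10))  # limit reached first
```

On this toy trace the looser threshold stops at iteration 14 and the tighter one at iteration 20, while a limit of 10 iterations ends the run before either is met.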
Out of Memory During Simulation¶
Symptom: The process fails with a MemoryError, or the OOM killer terminates the container, while simulating large molecules.
Cause: Large molecules (12+ qubits) require significant memory for statevector simulation. Memory usage scales exponentially with qubit count.
Solution:
# For Docker deployments, increase container memory limit
docker run --memory=16g straightchlorine/quantum-pipeline:latest ...
# Use a simpler basis set to reduce qubit count
quantum-pipeline -f molecules.json --basis sto3g
# Reduce ansatz repetitions (default is 2, lowering to 1 cuts parameter count)
quantum-pipeline -f molecules.json --ansatz-reps 1
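The exponential scaling can be estimated directly. A dense statevector for n qubits holds 2^n complex amplitudes; the sketch below assumes double-precision complex numbers (16 bytes each) and gives only a lower bound for a single statevector. The simulator's real footprint is larger because of working buffers and operator data. `statevector_bytes` is an illustrative helper, not part of quantum-pipeline.

```python
def statevector_bytes(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for one dense statevector: 2**n amplitudes.

    16 bytes per amplitude assumes double-precision complex numbers.
    """
    return (2 ** n_qubits) * bytes_per_amplitude


for n in (12, 20, 28, 30, 32):
    gib = statevector_bytes(n) / 2 ** 30
    print(f"{n} qubits: {gib:.6f} GiB")
```

Each additional qubit doubles the requirement, so a basis set that removes even a few qubits reduces memory by a large factor. At 30 qubits a single statevector already needs 16 GiB.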
Slow Simulation Performance¶
Symptom: Iterations take significantly longer than expected based on the performance baselines.
Cause: The simulation may be running on CPU when GPU acceleration is available, or system resources are contended.
Solution:
# Verify GPU is being used (check logs for "Using GPU backend")
quantum-pipeline -f molecules.json --log-level DEBUG
# Ensure no other heavy processes are running
top
# Check that the CUDA runtime is accessible
nvidia-smi
Incorrect Energy Values¶
Symptom: The final energy value is significantly higher than expected reference values.
Cause: Random parameter initialization may have led to convergence on a local minimum rather than the global minimum. This is especially common for molecules with complex optimization landscapes.
Solution:
Run several simulations with different random seeds and compare the final energies, or use --init-strategy hf to start from the Hartree-Fock state. HF initialization avoids the worst barren plateaus. For example, L-BFGS-B with random initialization on the 6-31g basis can spend more than 1000 iterations and still arrive at a positive energy for H2, while HF initialization reaches a good result in 50 iterations.
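When comparing runs from several seeds, the variational principle gives a simple selection rule: in a noise-free simulation every VQE energy is an upper bound on the true ground-state energy, so the lowest final energy is the best estimate. The sketch below applies that rule. The seed-to-energy mapping is placeholder data, and how the energies are collected from the pipeline's output is not shown here.

```python
def best_run(final_energies: dict[int, float]) -> tuple[int, float]:
    """Return (seed, energy) of the run with the lowest final energy."""
    if not final_energies:
        raise ValueError("no runs to compare")
    seed = min(final_energies, key=final_energies.get)
    return seed, final_energies[seed]


# Placeholder values for illustration only, not measured results.
runs = {1: -0.42, 2: -1.13, 3: 0.07}
print(best_run(runs))
```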
Docker Issues¶
Docker Socket Permission Denied (GID Mismatch)¶
Symptom: Airflow or batch generation containers fail with permission denied when trying to access /var/run/docker.sock.
Cause: The DOCKER_GID build argument does not match the Docker group ID on the host. The Airflow container needs this GID to access the Docker daemon for launching simulation containers.
Solution:
# Find the Docker group ID on your host
getent group docker | cut -d: -f3
# Rebuild with the correct GID (e.g., if your Docker GID is 999)
DOCKER_GID=999 docker compose build airflow-worker
# Or set it in your .env file
echo "DOCKER_GID=999" >> .env
The default DOCKER_GID is 970. If your host uses a different value, the container's airflow user will not be in the right group and Docker socket access will fail.
Volume Mount Permissions¶
Symptom: Permission denied errors when containers try to write to mounted volumes.
Cause: The container user does not have write access to the host directory.
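Solution: Fix host permissions with chmod -R 777 ./data ./logs, or run the container with --user "$(id -u):$(id -g)" so that it writes as your host user. Mode 777 makes the directories writable by every user on the host, so prefer the --user option outside of local development.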
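Before changing permissions, it can help to confirm which host directories are the problem. The following is a minimal sketch that checks write access for the current host user; it reflects what the container will see only when the container runs with that same UID and GID. `unwritable_dirs` is an illustrative helper, not part of quantum-pipeline.

```python
import os


def unwritable_dirs(paths: list[str]) -> list[str]:
    """Return the paths that are missing or not writable by the current user."""
    return [
        p for p in paths
        if not (os.path.isdir(p) and os.access(p, os.W_OK | os.X_OK))
    ]


# An empty list means both directories exist and are writable.
print(unwritable_dirs(["./data", "./logs"]))
```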
Stale Docker State¶
Symptom: Configuration changes are not taking effect.
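Solution: Rebuild without cache (docker compose build --no-cache), then restart with docker compose down -v && docker compose up -d. The -v flag also removes the named volumes declared in the Compose file, so any data stored in them is deleted; omit -v if you need to keep that data.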
GPU Issues¶
Wrong CUDA_ARCH for Your GPU¶
Symptom: The GPU image builds successfully but simulations crash with CUDA errors, produce incorrect results, or fall back to CPU silently.
Cause: The CUDA_ARCH build argument does not match your GPU's compute capability. The qiskit-aer GPU build compiles CUDA kernels for a specific architecture, and a mismatch means those kernels cannot run on your hardware.
Solution:
# Check your GPU's compute capability
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# Rebuild with the correct architecture
# Common values:
# 6.1 = GTX 10xx (Pascal)
# 7.5 = RTX 20xx (Turing)
# 8.6 = RTX 30xx (Ampere)
# 8.9 = RTX 40xx (Ada Lovelace)
CUDA_ARCH=7.5 just docker-build gpu
If you are unsure, check the NVIDIA CUDA GPUs page for your card's compute capability.
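On hosts with more than one GPU, the nvidia-smi query prints one compute capability per line. The sketch below parses that output and maps each value to the architectures listed in the rebuild comments above. The helper names are illustrative, and whether the build accepts more than one CUDA_ARCH value is not covered here.

```python
# Architecture names taken from the list of common values above.
KNOWN_ARCHS = {
    "6.1": "Pascal (GTX 10xx)",
    "7.5": "Turing (RTX 20xx)",
    "8.6": "Ampere (RTX 30xx)",
    "8.9": "Ada Lovelace (RTX 40xx)",
}


def parse_compute_caps(nvidia_smi_output: str) -> list[str]:
    """Return the distinct compute capabilities, one input line per GPU."""
    caps = []
    for line in nvidia_smi_output.splitlines():
        cap = line.strip()
        if cap and cap not in caps:
            caps.append(cap)
    return caps


def describe(cap: str) -> str:
    """Format the build argument together with the architecture name."""
    return f"CUDA_ARCH={cap} ({KNOWN_ARCHS.get(cap, 'not in the list above')})"


# Example input in the shape printed by the query on a two-GPU host.
for cap in parse_compute_caps("8.6\n7.5\n"):
    print(describe(cap))
```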
CUDA Not Available¶
For CUDA setup issues (RuntimeError: CUDA is not available) or driver version mismatches (CUDA driver version is insufficient), see GPU Acceleration for complete setup and troubleshooting instructions. The GPU image requires a host driver compatible with CUDA 12.6.
GPU Out of Memory¶
Symptom: The simulation fails with the error "CUDA out of memory. Tried to allocate X MiB".
Cause: The molecule requires more GPU memory than is available on the device.
Solution:
# Check GPU memory usage
nvidia-smi
# Use a simpler basis set to reduce memory requirements
quantum-pipeline -f molecules.json --basis sto3g
# Fall back to CPU for very large molecules (omit --gpu to use CPU)
quantum-pipeline -f molecules.json
Kafka Issues¶
Schema Registry Errors¶
Symptom: The client raises SchemaRegistryError: Subject not found, or reports schema compatibility errors.
Solution: Check that the Schema Registry responds with curl http://localhost:8081/subjects. For development only, disable compatibility checks with:
curl -X PUT http://localhost:8081/config \
-H "Content-Type: application/json" \
-d '{"compatibility": "NONE"}'
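For scripted development setups, the same request can be built with Python's standard library. This is a sketch equivalent to the curl command above; `compatibility_request` is an illustrative helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def compatibility_request(registry_url: str, level: str = "NONE") -> urllib.request.Request:
    """Build (but do not send) the PUT /config request."""
    body = json.dumps({"compatibility": level}).encode("utf-8")
    return urllib.request.Request(
        f"{registry_url.rstrip('/')}/config",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = compatibility_request("http://localhost:8081")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request, timeout=5)
```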
Message Delivery Failures¶
Symptom: Messages are produced but never appear in object storage.
Solution: Check Redpanda Connect health with curl http://localhost:4195/ping. Check Garage admin status on port 3903. Review logs with docker compose logs redpanda-connect | tail -50.
Topic Not Created¶
Symptom: Messages fail to publish because the target topic does not exist.
Solution:
# List existing topics
docker compose exec kafka kafka-topics --list --bootstrap-server localhost:9092
# Create topic manually
docker compose exec kafka kafka-topics --create \
--topic experiment.vqe \
--bootstrap-server localhost:9092 \
--partitions 3 \
--replication-factor 1
Spark Issues¶
Spark Job Out of Memory¶
Symptom: Spark job fails with java.lang.OutOfMemoryError: Java heap space or executor lost errors.
Cause: The Spark workers do not have enough memory to process the data volume.
Solution:
# Increase the memory available to each Spark worker in docker-compose
environment:
  SPARK_WORKER_MEMORY: 8g
# Or set executor and driver memory in the Spark configuration
spark.executor.memory=4g
spark.driver.memory=2g
Spark Job Submission Failure¶
Symptom: Airflow task fails with SparkSubmitOperator errors, or the job never appears in the Spark UI.
Solution: Verify Spark master is accessible (curl http://spark-master:8080). Check the Airflow connection at Admin > Connections > spark_default (Host: spark://spark-master, Port: 7077).
S3A Connection Errors¶
Symptom: Spark fails with com.amazonaws.SdkClientException: Unable to execute HTTP request when reading from or writing to object storage.
Cause: The S3 endpoint configuration is incorrect, or credentials are missing.
Solution:
Verify that the following Spark configuration values match your Garage setup:
spark.hadoop.fs.s3a.endpoint = http://garage:3901
spark.hadoop.fs.s3a.access.key = <your-access-key>
spark.hadoop.fs.s3a.secret.key = <your-secret-key>
spark.hadoop.fs.s3a.path.style.access = true
spark.hadoop.fs.s3a.connection.ssl.enabled = false
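Where these properties are set in a given deployment (a defaults file, the Airflow operator's configuration, or the command line) is not specified here. One option when submitting a job by hand is to pass them as spark-submit --conf arguments. The sketch below builds that argument list; `spark_conf_args` is an illustrative helper, and the credentials are left out on purpose and should come from your environment.

```python
def spark_conf_args(conf: dict[str, str]) -> list[str]:
    """Turn a mapping of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


s3a_conf = {
    "spark.hadoop.fs.s3a.endpoint": "http://garage:3901",
    "spark.hadoop.fs.s3a.path.style.access": "true",
    "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
}
print(" ".join(spark_conf_args(s3a_conf)))
```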
Also verify that Garage is running, for example by checking its admin status on port 3903.
Airflow Issues¶
DAG Not Visible in Web UI¶
Symptom: The quantum_feature_processing DAG does not appear in the Airflow web interface.
Solution: Verify the DAG file exists (docker compose exec airflow-webserver ls /opt/airflow/dags/), check for import errors in Admin > DAG Import Errors, and force a rescan with docker compose exec airflow-scheduler airflow dags reserialize.
Task Failures with Retry Exhaustion¶
Symptom: Tasks fail repeatedly and exhaust all retries.
Solution: Check task logs in the Airflow UI (DAG > Task Instance > Logs), verify dependent services with docker compose ps, then clear the failed task:
docker compose exec airflow-webserver airflow tasks clear \
quantum_feature_processing -t run_quantum_processing \
-s 2025-01-01 -e 2025-12-31 --yes
Database Connection Errors¶
Symptom: Airflow fails to start with sqlalchemy.exc.OperationalError: could not connect to server.
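Solution: Check PostgreSQL status (docker compose ps postgres), then initialize the metadata database with docker compose exec airflow-webserver airflow db init. On Airflow 2.7 and later, db init is deprecated in favor of airflow db migrate.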
Monitoring Issues¶
PushGateway Not Reachable¶
Symptom: Simulation logs show ConnectionError: HTTPConnectionPool host='pushgateway' port=9091, or metrics silently fail to appear.
Cause: The PushGateway container is not running, the PUSHGATEWAY_URL environment variable points to the wrong address, or there is a network mismatch between the simulation container and the monitoring stack.
Solution:
# Check PushGateway is running
docker compose ps pushgateway
# Verify the URL from inside the simulation container
docker compose exec quantum-pipeline curl -s http://pushgateway:9091/metrics | head -5
# Check the configured URL
docker compose exec quantum-pipeline env | grep PUSHGATEWAY
# Common fixes:
# - From Docker containers, use http://pushgateway:9091 (service name)
# - From the host, use http://localhost:9091
# - Make sure MONITORING_ENABLED is set to true
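The same reachability check can be scripted with Python's standard library, for example as a pre-flight step before a long run. This is a minimal sketch: the two addresses are the ones listed in the common fixes above, and `pushgateway_metrics_url` and `is_reachable` are illustrative helpers, not part of quantum-pipeline.

```python
import http.client
import urllib.request


def pushgateway_metrics_url(inside_container: bool) -> str:
    """Pick the service name or localhost and append the /metrics path."""
    host = "pushgateway" if inside_container else "localhost"
    return f"http://{host}:9091/metrics"


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET on `url` succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, DNS failures, timeouts and HTTP errors.
        return False


url = pushgateway_metrics_url(inside_container=False)
print(url, "reachable" if is_reachable(url) else "not reachable")
```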
Metrics Not Appearing in Prometheus¶
Symptom: The Grafana dashboard shows "No data" for all panels. Prometheus queries return empty results.
Cause: The PushGateway is not receiving metrics, or Prometheus is not scraping the PushGateway.
Solution:
# Check PushGateway has received metrics
curl http://localhost:9091/metrics | head -50
# Check Prometheus targets
# Navigate to: http://localhost:9090/targets
# Verify pushgateway target shows "UP"
# Verify the simulation was started with monitoring enabled
# Either set the environment variable:
export MONITORING_ENABLED=true
# Or pass the CLI flag:
quantum-pipeline -f molecules.json --enable-performance-monitoring
Grafana Cannot Connect to Prometheus¶
Symptom: Grafana data source test fails with "Connection refused".
Solution: Use http://prometheus:9090 from within Docker, or http://localhost:9090 from the host. Verify via Configuration > Data Sources > Prometheus > Save & Test.
Dashboard Import Fails¶
Symptom: Importing the thesis dashboard JSON fails with a validation error.
Solution: The dashboard was created with Grafana 10.x. Ensure the Prometheus data source is named Prometheus (the default) or update the DS_PROMETHEUS variable in the dashboard JSON to match your data source name.
Batch Generation Issues¶
Container Exits with rc=125¶
Symptom: Batch generation reports a container failure with return code 125.
Cause: The Docker image was not found. This usually means the image has not been built locally or the image name in the batch configuration does not match.
Solution:
# Build the required images
just docker-build cpu
just docker-build gpu
# Verify images exist
docker images | grep quantum-pipeline
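The return codes reported by batch generation follow Docker's documented exit-status convention for docker run: 125 means docker run itself failed before the container started, 126 and 127 mean the container command could not be invoked or found, and any other value is the exit code of the application inside the container. The sketch below encodes that convention; `explain_exit_code` is an illustrative helper, not part of the batch script.

```python
def explain_exit_code(rc: int) -> str:
    """Classify a `docker run` exit code using Docker's documented convention."""
    if rc == 0:
        return "success"
    if rc == 125:
        return "docker run itself failed (for example, image not found)"
    if rc == 126:
        return "the container command could not be invoked"
    if rc == 127:
        return "the container command was not found"
    return f"the application inside the container exited with code {rc}"


for rc in (125, 1):
    print(rc, "->", explain_exit_code(rc))
```

This is why rc=125 points at the image or the Docker setup, while rc=1 points at the simulation itself.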
Container Exits with rc=1¶
Symptom: Batch generation reports a container failure with return code 1.
Cause: The simulation application itself encountered an error. This could be an invalid molecule/optimizer/basis combination, out-of-memory, or a bug.
Solution:
# Check the container logs for the failed run
docker logs <container-id>
# Look at the batch state file for details
python -m json.tool gen/ml_batch_state.json
The batch system is idempotent - rerunning the generation script will skip completed configurations and retry failed ones. Trigger via the vqe_batch_generation Airflow DAG or directly with python scripts/generate_ml_batch.py.
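The skip-and-retry behavior can be pictured with a small sketch. It assumes a state mapping from configuration ID to a status string; the actual layout of gen/ml_batch_state.json is not shown here and may differ, and the configuration IDs, status values, and `configs_to_run` are all hypothetical.

```python
def configs_to_run(all_configs: list[str], state: dict[str, str]) -> list[str]:
    """Skip configurations marked completed; keep failed and never-run ones."""
    return [c for c in all_configs if state.get(c) != "completed"]


# Hypothetical configuration IDs and statuses.
state = {"h2-sto3g-cobyla": "completed", "h2o-sto3g-cobyla": "failed"}
configs = ["h2-sto3g-cobyla", "h2o-sto3g-cobyla", "lih-sto3g-cobyla"]
print(configs_to_run(configs, state))
```

Because completed entries are filtered out on every invocation, rerunning after a partial failure repeats only the work that is still outstanding.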
Batch Progress Not Visible¶
Symptom: Batch generation is running but qp_batch_* metrics do not appear in Grafana.
Cause: The batch script pushes progress metrics to the PushGateway. If the PushGateway is not reachable from the host (where the batch script runs), metrics will not be recorded.
Solution:
Make sure the PushGateway is running and accessible from the host at http://localhost:9091. The batch script runs on the host, not inside a container, so it needs host-level access to the PushGateway.
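To confirm that batch metrics have actually reached the PushGateway, the text returned by its /metrics endpoint can be filtered for the qp_batch_ prefix. The sketch below parses the Prometheus text exposition format; the sample metric name is hypothetical, and `metric_names` is an illustrative helper, not part of the batch script.

```python
def metric_names(exposition_text: str, prefix: str = "qp_batch_") -> list[str]:
    """Return sorted, distinct metric names starting with `prefix`.

    Parses the Prometheus text format: comment lines start with '#', and a
    sample line is '<name>{<labels>} <value>' or '<name> <value>'.
    """
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


# Hypothetical sample; real metric names come from the batch script.
sample = """\
# TYPE qp_batch_example_total counter
qp_batch_example_total{job="batch"} 3
push_time_seconds{job="batch"} 1.7e+09
"""
print(metric_names(sample))
```

An empty result for text fetched from http://localhost:9091/metrics while a batch is running suggests the script cannot reach the PushGateway.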
If your issue is not covered here, check the FAQ or open an issue on the Codeberg repository.