Troubleshooting¶
This page provides solutions to common issues encountered when installing, configuring, and running the Quantum Pipeline. Problems are organized by category, each following a consistent Symptom / Cause / Solution format.
Installation Issues¶
Simulation Issues¶
VQE Convergence Failure¶
Symptom: The simulation completes but reports that convergence was not achieved. The optimizer reached the maximum iteration limit without finding a minimum.
Cause: The default iteration limit may be too low for complex molecules, or the initial parameters are stuck in a barren plateau region.
Solution:
# Increase the maximum number of iterations
python quantum_pipeline.py -f molecules.json --max-iterations 500
# Try a different optimizer
python quantum_pipeline.py -f molecules.json --optimizer COBYLA
# Reduce convergence tolerance for faster (less precise) convergence
python quantum_pipeline.py -f molecules.json --convergence --threshold 1e-4
Out of Memory During Simulation¶
Symptom: The process is killed with MemoryError or the OOM killer terminates the container during simulation of large molecules.
Cause: Large molecules (12+ qubits) require significant memory for statevector simulation. Memory usage scales exponentially with qubit count.
Solution:
# For Docker deployments, increase container memory limit
docker run --memory=16g straightchlorine/quantum-pipeline:latest ...
# Use a simpler basis set to reduce qubit count
python quantum_pipeline.py -f molecules.json --basis sto3g
# Reduce ansatz repetitions (default is 2)
python quantum_pipeline.py -f molecules.json --ansatz-reps 2
Slow Simulation Performance¶
Symptom: Iterations take significantly longer than expected based on the performance baselines.
Cause: The simulation may be running on CPU when GPU acceleration is available, or system resources are contended.
Solution:
# Verify GPU is being used (check logs for "Using GPU backend")
python quantum_pipeline.py -f molecules.json --log-level DEBUG
# Ensure no other heavy processes are running
top
# Check that the CUDA runtime is accessible
nvidia-smi
Incorrect Energy Values¶
Symptom: The final energy value is significantly higher than expected reference values.
Cause: Random parameter initialization may have led to convergence on a local minimum rather than the global minimum. This is especially common for molecules with complex optimization landscapes.
Solution:
Run multiple simulations to increase the chance of finding a good minimum. The GPU-accelerated configuration allows more iterations in the same time budget, improving exploration of the parameter space.
Docker Issues¶
Volume Mount Permissions¶
Symptom: Permission denied errors when containers try to write to mounted volumes.
Cause: The container user does not have write access to the host directory.
Solution: Fix host permissions with chmod -R 777 ./data ./logs or run with --user "$(id -u):$(id -g)".
Stale Docker State¶
Symptom: Configuration changes are not taking effect.
Solution: Rebuild without cache (docker compose build --no-cache) and restart with docker compose down -v && docker compose up -d.
Kafka Issues¶
Schema Registry Errors¶
Symptom: SchemaRegistryError: Subject not found or schema compatibility errors.
Solution: Check health with curl http://localhost:8081/subjects. For development, reset compatibility with:
curl -X PUT http://localhost:8081/config \
-H "Content-Type: application/json" \
-d '{"compatibility": "NONE"}'
Message Delivery Failures¶
Symptom: Messages produced but never appear in MinIO.
Solution: Check connector status with curl http://localhost:8083/connectors/minio-sink/status | python -m json.tool (see minio-sink-config.json). Restart with curl -X POST http://localhost:8083/connectors/minio-sink/restart. Check logs with docker compose logs kafka-connect | tail -50.
Topic Not Created¶
Symptom: Messages fail to publish because the target topic does not exist.
Solution:
# List existing topics
docker compose exec kafka kafka-topics --list --bootstrap-server localhost:9092
# Create topic manually
docker compose exec kafka kafka-topics --create \
--topic vqe_decorated_result_v1 \
--bootstrap-server localhost:9092 \
--partitions 3 \
--replication-factor 1
Spark Issues¶
Spark Job Out of Memory¶
Symptom: Spark job fails with java.lang.OutOfMemoryError: Java heap space or executor lost errors.
Cause: The Spark workers do not have enough memory to process the data volume.
Solution:
# Increase Spark executor memory (see docker-compose.thesis.yaml for thesis defaults)
# https://github.com/straightchlorine/quantum-pipeline/src/branch/master/docker-compose.thesis.yaml
environment:
SPARK_WORKER_MEMORY: 8g
# Or set memory in the Spark configuration
spark.executor.memory=4g
spark.driver.memory=2g
Spark Job Submission Failure¶
Symptom: Airflow task fails with SparkSubmitOperator errors, or the job never appears in the Spark UI.
Solution: Verify Spark master is accessible (curl http://spark-master:8080). Check the Airflow connection at Admin > Connections > spark_default (Host: spark://spark-master, Port: 7077).
S3A Connection Errors¶
Symptom: Spark fails with com.amazonaws.SdkClientException: Unable to execute HTTP request when reading from or writing to MinIO.
Cause: MinIO endpoint configuration is incorrect, or credentials are missing.
Solution:
Verify that the following Spark configuration values are correct (these are set in docker-compose.thesis.yaml):
spark.hadoop.fs.s3a.endpoint = http://minio:9000
spark.hadoop.fs.s3a.access.key = <your-access-key>
spark.hadoop.fs.s3a.secret.key = <your-secret-key>
spark.hadoop.fs.s3a.path.style.access = true
spark.hadoop.fs.s3a.connection.ssl.enabled = false
Also verify that MinIO is running and the target bucket exists:
# Check MinIO health
curl http://minio:9000/minio/health/live
# List buckets (using mc client)
docker compose exec minio mc ls local/
Airflow Issues¶
DAG Not Visible in Web UI¶
Symptom: The quantum_feature_processing DAG does not appear in the Airflow web interface.
Solution: Verify the DAG file exists (docker compose exec airflow-webserver ls /opt/airflow/dags/), check for import errors in Admin > DAG Import Errors, and force a rescan with docker compose exec airflow-scheduler airflow dags reserialize.
Task Failures with Retry Exhaustion¶
Symptom: Tasks fail repeatedly and exhaust all retries.
Solution: Check task logs in the Airflow UI (DAG > Task Instance > Logs), verify dependent services with docker compose ps, then clear the failed task:
docker compose exec airflow-webserver airflow tasks clear \
quantum_feature_processing -t run_quantum_processing \
-s 2025-01-01 -e 2025-12-31 --yes
Database Connection Errors¶
Symptom: Airflow fails to start with sqlalchemy.exc.OperationalError: could not connect to server.
Solution: Check PostgreSQL status (docker compose ps postgres), then initialize with docker compose exec airflow-webserver airflow db init.
GPU Issues¶
For CUDA setup issues (RuntimeError: CUDA is not available) or driver version mismatches (CUDA driver version is insufficient), see GPU Acceleration for complete setup and troubleshooting instructions. The GPU image requires host driver version 520+.
GPU Out of Memory¶
Symptom: CUDA out of memory. Tried to allocate X MiB error during simulation.
Cause: The molecule requires more GPU memory than is available on the device.
Solution:
# Check GPU memory usage
nvidia-smi
# Use a simpler basis set to reduce memory requirements
python quantum_pipeline.py -f molecules.json --basis sto3g
# Fall back to CPU for very large molecules (omit --gpu to use CPU)
python quantum_pipeline.py -f molecules.json
Monitoring Issues¶
Metrics Not Appearing in Prometheus¶
Symptom: The Grafana dashboard shows "No data" for all panels. Prometheus queries return empty results.
Cause: The PushGateway is not receiving metrics, or Prometheus is not scraping the PushGateway.
Solution:
# Check PushGateway has received metrics
curl http://localhost:9091/metrics | head -50
# Check Prometheus targets
# Navigate to: http://localhost:9090/targets
# Verify pushgateway target shows "UP"
# Verify the simulation was started with monitoring enabled
python quantum_pipeline.py -f molecules.json --enable-performance-monitoring
Grafana Cannot Connect to Prometheus¶
Symptom: Grafana data source test fails with "Connection refused".
Solution: Use http://prometheus:9090 from within Docker, or http://localhost:9090 from the host. Verify via Configuration > Data Sources > Prometheus > Save & Test.
PushGateway Errors¶
Symptom: ConnectionError: HTTPConnectionPool host='pushgateway' port=9091 in simulation logs.
Solution: Verify the PushGateway is running (docker compose ps pushgateway) and reachable from the simulation container (docker compose exec quantum-pipeline curl http://pushgateway:9091/metrics).
Dashboard Import Fails¶
Symptom: Importing the thesis dashboard JSON fails with a validation error.
Solution: The dashboard was created with Grafana 10.x. Ensure the Prometheus data source is named Prometheus (the default) or update the DS_PROMETHEUS variable in the dashboard JSON to match your data source name.
If your issue is not covered here, check the FAQ or open an issue on the GitHub repository.