Troubleshooting¶

This page provides solutions to common issues encountered when installing, configuring, and running the Quantum Pipeline. Problems are organized by category, each following a consistent Symptom / Cause / Solution format.

Installation Issues¶

Unsupported Python Version¶

Symptom: pip install quantum-pipeline fails with an error stating the package requires a different Python version, or imports fail after install.

Cause: The active interpreter is outside the supported range.

Solution: Install into a supported Python version. Tools such as uv, conda, or pyenv make it easy to pin the right interpreter. See the Installation guide for the supported versions and detailed setup.

GPU Container Fails to Start¶

Symptom: A GPU container exits immediately with could not select device driver "nvidia" or unknown or invalid runtime name: nvidia.

Cause: The NVIDIA Container Toolkit is not installed, or the Docker daemon is not configured with the nvidia runtime.

Solution: Install the NVIDIA Container Toolkit on the host and restart Docker. See GPU Acceleration for the full driver and runtime setup.

Simulation Issues¶

VQE Convergence Failure¶

Symptom: The simulation completes but reports that convergence was not achieved. The optimizer reached the maximum iteration limit without finding a minimum.

Cause: The default iteration limit may be too low for complex molecules, or the initial parameters are stuck in a barren plateau region.

Solution:

# Increase the maximum number of iterations
quantum-pipeline -f data/molecules.json --max-iterations 500

# Try Hartree-Fock initialization to avoid barren plateaus
quantum-pipeline -f data/molecules.json --init-strategy hf

# Try a different optimizer
quantum-pipeline -f data/molecules.json --optimizer COBYLA

# Loosen convergence tolerance for faster (less precise) convergence
quantum-pipeline -f data/molecules.json --convergence --threshold 1e-4

Out of Memory During Simulation¶

Symptom: The process is killed with MemoryError or the OOM killer terminates the container during simulation of large molecules.

Cause: Statevector simulation of large molecules (12+ qubits) requires significant memory. Memory usage scales exponentially with qubit count: each additional qubit doubles the size of the state vector.

Solution:

# For Docker deployments, increase container memory limit
docker run --memory=16g straightchlorine/quantum-pipeline:latest ...

# Use a simpler basis set to reduce qubit count
quantum-pipeline -f data/molecules.json --basis sto3g

# Reduce ansatz repetitions (default is 2, lowering to 1 cuts parameter count)
quantum-pipeline -f data/molecules.json --ansatz-reps 1
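The exponential scaling named in the cause can be made concrete with a little arithmetic. The sketch below is not part of the pipeline; it assumes one dense state vector stored as double-precision complex numbers (16 bytes per amplitude), and the function name `statevector_bytes` is hypothetical. Treat the result as a lower bound, since the simulator, the operator representation, and the optimizer all hold additional buffers.

```python
def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for one dense state vector: 2**n complex amplitudes.

    bytes_per_amplitude=16 assumes double-precision complex values
    (an assumption, not a value read from the pipeline).
    """
    if num_qubits < 0:
        raise ValueError("num_qubits must be non-negative")
    return (2 ** num_qubits) * bytes_per_amplitude


# Print the lower-bound footprint for a few qubit counts.
for n in (12, 20, 28, 32):
    mib = statevector_bytes(n) / 2**20
    print(f"{n} qubits: {mib:,.3f} MiB for a single state vector")
```

Because the footprint doubles per qubit, switching to a smaller basis set that removes even a few qubits reduces the requirement by a large factor.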

Slow Simulation Performance¶

Symptom: Iterations take significantly longer than expected based on the performance baselines.

Cause: The simulation may be running on CPU when GPU acceleration is available, or system resources are contended.

Solution:

# Verify GPU is being used (check logs for "Using GPU backend")
quantum-pipeline -f data/molecules.json --log-level DEBUG

# Ensure no other heavy processes are running
top

# Check that the CUDA runtime is accessible
nvidia-smi

Incorrect Energy Values¶

Symptom: The final energy value is significantly higher than expected reference values.

Cause: Random parameter initialization may have led to convergence on a local minimum rather than the global minimum. This is especially common for molecules with complex optimization landscapes.

Solution:

Run multiple simulations with different seeds and keep the lowest energy, or use --init-strategy hf to start from the Hartree-Fock state. The HF strategy avoids the worst barren plateaus. For example, L-BFGS-B with random init on 6-31g can spend 1000+ iterations arriving at a positive energy for H2, while HF init reaches a good result in 50 iterations.
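To illustrate why several starting points help, here is a toy sketch. It is not VQE and does not use the pipeline: it assumes a one-dimensional double-well function standing in for the energy landscape, with plain gradient descent as the optimizer. All names (`energy`, `descend`, `best_of`) are hypothetical.

```python
import random


def energy(x: float) -> float:
    """Toy landscape with a global minimum near x = -1.30 and a local one near x = 1.13."""
    return x**4 - 3 * x**2 + x


def gradient(x: float) -> float:
    """Analytic derivative of energy()."""
    return 4 * x**3 - 6 * x + 1


def descend(start: float, lr: float = 0.01, steps: int = 2000) -> float:
    """Plain gradient descent from `start`; returns the energy where it settles."""
    x = start
    for _ in range(steps):
        x -= lr * gradient(x)
    return energy(x)


def best_of(starts) -> float:
    """Run one descent per starting point and keep the lowest energy found."""
    return min(descend(s) for s in starts)


# Seeded random starts: some runs may settle in the shallower well.
rng = random.Random(0)
starts = [rng.uniform(-2.0, 2.0) for _ in range(8)]
print("per-run energies:", [round(descend(s), 4) for s in starts])
print("best energy:", round(best_of(starts), 4))
```

A run that starts on the wrong side of the barrier settles in the shallower well and reports a higher energy, which mirrors the symptom described here. Keeping the minimum over several runs is safe for a variational method because a lower energy is always a better estimate.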

Docker Issues¶

Docker Socket Permission Denied (GID Mismatch)¶

Symptom: Airflow or batch generation containers fail with permission denied when trying to access /var/run/docker.sock.

Cause: The DOCKER_GID build argument does not match the Docker group ID on the host. The Airflow container needs this GID to access the Docker daemon for launching simulation containers.

Solution:

# Find the Docker group ID on your host
getent group docker | cut -d: -f3

# Rebuild with the correct GID (e.g., if your Docker GID is 999)
DOCKER_GID=999 docker compose build airflow-worker

# Or set it in your .env file
echo "DOCKER_GID=999" >> .env

The default DOCKER_GID is 970. If your host uses a different value, the container's airflow user will not be in the right group and Docker socket access will fail.
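The comparison between the host group ID and the configured value can also be scripted. The following sketch assumes you pass it one line in the format printed by `getent group docker`; the function names are hypothetical and the sample entry is made up for illustration.

```python
def gid_from_group_entry(entry: str) -> int:
    """Extract the numeric GID (third colon-separated field) from a group entry.

    Mirrors `cut -d: -f3` applied to a line such as 'docker:x:999:alice'.
    """
    fields = entry.strip().split(":")
    if len(fields) < 3:
        raise ValueError(f"not a group entry: {entry!r}")
    return int(fields[2])


def gid_mismatch(entry: str, configured_gid: int) -> bool:
    """True when the host's Docker GID differs from the configured DOCKER_GID."""
    return gid_from_group_entry(entry) != configured_gid


# Illustrative entry; 970 is the default DOCKER_GID stated in this guide.
sample_entry = "docker:x:999:alice"
if gid_mismatch(sample_entry, configured_gid=970):
    print("Rebuild with DOCKER_GID =", gid_from_group_entry(sample_entry))
```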

Volume Mount Permissions¶

Symptom: Permission denied errors when containers try to write to mounted volumes.

Cause: The container user does not have write access to the host directory.

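Solution: Run the container as your host user with --user "$(id -u):$(id -g)", or fix host permissions with chmod -R 777 ./data ./logs. The chmod variant makes the directories world-writable, so reserve it for local development machines.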

Stale Docker State¶

Symptom: Configuration changes are not taking effect.

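Solution: Rebuild without cache (docker compose build --no-cache) and restart with docker compose down -v && docker compose up -d. The -v flag also deletes the volumes declared in the compose file, so any data stored in them is lost; omit -v if you need to keep it.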

GPU Issues¶

Wrong CUDA_ARCH for Your GPU¶

Symptom: The GPU image builds successfully but simulations crash with CUDA errors, produce incorrect results, or fall back to CPU silently.

Cause: The CUDA_ARCH build argument does not match your GPU's compute capability. The qiskit-aer GPU build compiles CUDA kernels for a specific architecture, and a mismatch means those kernels cannot run on your hardware.

Solution:

# Check your GPU's compute capability
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Rebuild with the correct architecture
# Common values:
#   6.1 = GTX 10xx (Pascal)
#   7.5 = RTX 20xx (Turing)
#   8.6 = RTX 30xx (Ampere)
#   8.9 = RTX 40xx (Ada Lovelace)
CUDA_ARCH=7.5 just docker-build gpu

If you are unsure, check the NVIDIA CUDA GPUs page for your card's compute capability.
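The check above can be automated by comparing the build argument with what the driver reports. This is a sketch under a conservative assumption: it requires an exact match between `CUDA_ARCH` and every detected GPU, which is stricter than CUDA's actual binary compatibility rules. The function names are hypothetical, and the input is expected to be the text printed by the `nvidia-smi --query-gpu=compute_cap` command shown above.

```python
def detected_capabilities(nvidia_smi_output: str) -> list:
    """Split the query output into one compute capability string per GPU."""
    return [line.strip() for line in nvidia_smi_output.splitlines() if line.strip()]


def arch_is_compatible(cuda_arch: str, nvidia_smi_output: str) -> bool:
    """True only if at least one GPU is detected and all of them report cuda_arch.

    Exact matching is a deliberately strict assumption for this sketch.
    """
    caps = detected_capabilities(nvidia_smi_output)
    return bool(caps) and all(cap == cuda_arch for cap in caps)


# Illustrative values: an image built for 7.5 on a host reporting 8.6.
if not arch_is_compatible("7.5", "8.6\n"):
    print("CUDA_ARCH does not match the detected GPU; rebuild the image")
```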

CUDA Not Available¶

For CUDA setup issues (RuntimeError: CUDA is not available) or driver version mismatches (CUDA driver version is insufficient), see GPU Acceleration for complete setup and troubleshooting instructions. The GPU image requires a host driver compatible with CUDA 12.6.

GPU Out of Memory¶

Symptom: CUDA out of memory. Tried to allocate X MiB error during simulation.

Cause: The molecule requires more GPU memory than is available on the device.

Solution:

# Check GPU memory usage
nvidia-smi

# Use a simpler basis set to reduce memory requirements
quantum-pipeline -f data/molecules.json --basis sto3g

# Fall back to CPU for very large molecules (omit --gpu to use CPU)
quantum-pipeline -f data/molecules.json

Kafka Issues¶

Schema Registry Errors¶

Symptom: SchemaRegistryError: Subject not found or schema compatibility errors.

Solution: Check health with curl http://localhost:8081/subjects. For development, reset compatibility with:

curl -X PUT http://localhost:8081/config \
    -H "Content-Type: application/json" \
    -d '{"compatibility": "NONE"}'

Message Delivery Failures¶

Symptom: Messages are produced but never appear in object storage.

Solution: Check Redpanda Connect health with curl http://localhost:4195/ping. Check Garage admin status on port 3903. Review logs with docker compose logs redpanda-connect | tail -50.

Topic Not Created¶

Symptom: Messages fail to publish because the target topic does not exist.

Solution:

# List existing topics
docker compose exec kafka kafka-topics --list --bootstrap-server localhost:9092

# Create topic manually
docker compose exec kafka kafka-topics --create \
    --topic experiment.vqe \
    --bootstrap-server localhost:9092 \
    --partitions 3 \
    --replication-factor 1

Spark Issues¶

Spark Job Out of Memory¶

Symptom: Spark job fails with java.lang.OutOfMemoryError: Java heap space or executor lost errors.

Cause: The Spark workers do not have enough memory to process the data volume.

Solution:

# Increase Spark executor memory in docker-compose
environment:
  SPARK_WORKER_MEMORY: 8g

# Or set memory in the Spark configuration
spark.executor.memory=4g
spark.driver.memory=2g

Spark Job Submission Failure¶

Symptom: Airflow task fails with SparkSubmitOperator errors, or the job never appears in the Spark UI.

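Solution: Verify that the Spark master is accessible (curl http://spark-master:8080). The spark-master hostname resolves only inside the Docker network, so run the check from another container. Then check the Airflow connection at Admin > Connections > spark_default (Host: spark://spark-master, Port: 7077).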

S3A Connection Errors¶

Symptom: Spark fails with com.amazonaws.SdkClientException: Unable to execute HTTP request when reading from or writing to object storage.

Cause: The S3 endpoint configuration is incorrect, or credentials are missing.

Solution:

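Verify that the following Spark configuration values match your Garage setup. The endpoint port must be the one Garage serves its S3 API on (the api_bind_addr under [s3_api] in garage.toml), not the RPC or admin port: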

spark.hadoop.fs.s3a.endpoint = http://garage:3901
spark.hadoop.fs.s3a.access.key = <your-access-key>
spark.hadoop.fs.s3a.secret.key = <your-secret-key>
spark.hadoop.fs.s3a.path.style.access = true
spark.hadoop.fs.s3a.connection.ssl.enabled = false

Also verify that Garage is running:

# Check Garage container status
docker compose ps garage
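A quick way to catch the "credentials are missing" cause is to check a configuration mapping for unset or placeholder values before submitting a job. The sketch below is a standalone helper, not part of the pipeline; the function name is hypothetical, and it assumes you have loaded your Spark settings into a plain dictionary.

```python
REQUIRED_S3A_KEYS = (
    "spark.hadoop.fs.s3a.endpoint",
    "spark.hadoop.fs.s3a.access.key",
    "spark.hadoop.fs.s3a.secret.key",
    "spark.hadoop.fs.s3a.path.style.access",
    "spark.hadoop.fs.s3a.connection.ssl.enabled",
)


def unset_s3a_settings(conf: dict) -> list:
    """Return required keys that are missing, empty, or still a <placeholder>."""
    problems = []
    for key in REQUIRED_S3A_KEYS:
        value = str(conf.get(key, "")).strip()
        is_placeholder = value.startswith("<") and value.endswith(">")
        if not value or is_placeholder:
            problems.append(key)
    return problems


# Illustrative configuration with one placeholder and several missing keys.
example = {
    "spark.hadoop.fs.s3a.endpoint": "http://garage:3901",
    "spark.hadoop.fs.s3a.access.key": "<your-access-key>",
}
print("settings to fix:", unset_s3a_settings(example))
```

This only checks that values are present. It cannot tell whether the endpoint or credentials are actually correct for your Garage instance.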

Airflow Issues¶

DAG Not Visible in Web UI¶

Symptom: The quantum_feature_processing DAG does not appear in the Airflow web interface.

Solution: Verify the DAG file exists (docker compose exec airflow-webserver ls /opt/airflow/dags/), check for import errors in Admin > DAG Import Errors, and force a rescan with docker compose exec airflow-scheduler airflow dags reserialize.

Task Failures with Retry Exhaustion¶

Symptom: Tasks fail repeatedly and exhaust all retries.

Solution: Check task logs in the Airflow UI (DAG > Task Instance > Logs), verify dependent services with docker compose ps, then clear the failed task:

docker compose exec airflow-webserver airflow tasks clear \
    quantum_feature_processing -t run_quantum_processing \
    -s 2025-01-01 -e 2025-12-31 --yes

Database Connection Errors¶

Symptom: Airflow fails to start with sqlalchemy.exc.OperationalError: could not connect to server.

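Solution: Check PostgreSQL status (docker compose ps postgres), then initialize the metadata database with docker compose exec airflow-webserver airflow db init. On Airflow 2.7 and later, airflow db init is deprecated in favor of airflow db migrate.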

Monitoring Issues¶

PushGateway Not Reachable¶

Symptom: ConnectionError: HTTPConnectionPool host='pushgateway' port=9091 in simulation logs, or metrics silently not appearing.

Cause: The PushGateway container is not running, the PUSHGATEWAY_URL environment variable points to the wrong address, or there is a network mismatch between the simulation container and the monitoring stack.

Solution:

# Check PushGateway is running
docker compose ps pushgateway

# Verify the URL from inside the simulation container
docker compose exec quantum-pipeline curl -s http://pushgateway:9091/metrics | head -5

# Check the configured URL
docker compose exec quantum-pipeline env | grep PUSHGATEWAY

# Common fixes:
# - From Docker containers, use http://pushgateway:9091 (service name)
# - From the host, use http://localhost:9091
# - Make sure MONITORING_ENABLED is set to true
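The same reachability check can be done from Python, which is convenient when curl is not installed in the container. This sketch uses only the standard library and is not the pipeline's own monitoring code; the function name `pushgateway_reachable` is hypothetical, and it assumes the PushGateway serves its `/metrics` endpoint as used in the commands above.

```python
import http.client
import urllib.request


def pushgateway_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/metrics answers with HTTP 200.

    Any connection failure, timeout, HTTP error status, or malformed URL
    is reported as False rather than raised.
    """
    url = f"{base_url.rstrip('/')}/metrics"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        return False
```

Call `pushgateway_reachable("http://pushgateway:9091")` from inside a container, or `pushgateway_reachable("http://localhost:9091")` from the host. A result of `False` in one context and `True` in the other points to the network mismatch described in the cause.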

Metrics Not Appearing in Prometheus¶

Symptom: The Grafana dashboard shows "No data" for all panels. Prometheus queries return empty results.

Cause: The PushGateway is not receiving metrics, or Prometheus is not scraping the PushGateway.

Solution:

# Check PushGateway has received metrics
curl http://localhost:9091/metrics | head -50

# Check Prometheus targets
# Navigate to: http://localhost:9090/targets
# Verify pushgateway target shows "UP"

# Verify the simulation was started with monitoring enabled
# Either set the environment variable:
export MONITORING_ENABLED=true

# Or pass the CLI flag:
quantum-pipeline -f data/molecules.json --enable-performance-monitoring

Grafana Cannot Connect to Prometheus¶

Symptom: Grafana data source test fails with "Connection refused".

Solution: Use http://prometheus:9090 from within Docker, or http://localhost:9090 from the host. Verify via Configuration > Data Sources > Prometheus > Save & Test (in Grafana 10.x the menu path is Connections > Data sources).

Dashboard Import Fails¶

Symptom: Importing the thesis dashboard JSON fails with a validation error.

Solution: The dashboard was created with Grafana 10.x. Ensure the Prometheus data source is named Prometheus (the default) or update the DS_PROMETHEUS variable in the dashboard JSON to match your data source name.

Batch Generation Issues¶

Container Exits with rc=125¶

Symptom: Batch generation reports a container failure with return code 125.

Cause: The Docker image was not found. This usually means the image has not been built locally or the image name in the batch configuration does not match.

Solution:

# Build the required images
just docker-build cpu
just docker-build gpu

# Verify images exist
docker images | grep quantum-pipeline
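When triaging batch failures it helps to separate codes produced by `docker run` itself from codes produced by the process inside the container. The sketch below encodes the exit-status conventions documented for `docker run` (125, 126, 127) plus the common 128+signal convention for 137. The function name is hypothetical and the helper is not part of the batch script.

```python
def classify_exit_code(rc: int) -> str:
    """Map a container return code to a coarse failure category."""
    if rc == 0:
        return "success"
    if rc == 125:
        return "docker run itself failed (for example, the image was not found)"
    if rc == 126:
        return "the container command could not be invoked"
    if rc == 127:
        return "the container command was not found"
    if rc == 137:
        return "the process was killed with SIGKILL, often by the OOM killer"
    return "the process inside the container exited with an error"


for code in (0, 1, 125, 137):
    print(code, "->", classify_exit_code(code))
```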

Container Exits with rc=1¶

Symptom: Batch generation reports a container failure with return code 1.

Cause: The simulation application itself encountered an error. This could be an invalid molecule/optimizer/basis combination, out-of-memory, or a bug.

Solution:

# Check the container logs for the failed run
docker logs <container-id>

# Look at the batch state file for details
python -m json.tool gen/ml_batch_state.json

The batch system is idempotent: rerunning the generation script skips completed configurations and retries failed ones. Trigger it via the vqe_batch_generation Airflow DAG or directly with python scripts/generate_ml_batch.py.
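The skip-and-retry behavior can be pictured with a small sketch. It does not reproduce the real batch script or the actual layout of ml_batch_state.json. It assumes a hypothetical state mapping from a configuration identifier to a status string, where only "completed" entries are skipped.

```python
def select_pending(config_ids, state: dict) -> list:
    """Return the configurations a rerun would execute, in their original order.

    Anything not marked "completed" (failed, or never attempted) is selected.
    The state layout here is an assumption made for illustration.
    """
    return [cid for cid in config_ids if state.get(cid) != "completed"]


# Hypothetical identifiers and statuses.
all_configs = ["h2-sto3g-cobyla", "h2-631g-lbfgsb", "lih-sto3g-cobyla"]
state = {"h2-sto3g-cobyla": "completed", "h2-631g-lbfgsb": "failed"}
print("will run:", select_pending(all_configs, state))
```

Because selection depends only on recorded status, running the script twice in a row does no extra work once every configuration is marked completed.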

Batch Progress Not Visible¶

Symptom: Batch generation is running but qp_batch_* metrics do not appear in Grafana.

Cause: The batch script pushes progress metrics to the PushGateway. If the PushGateway is not reachable from the host (where the batch script runs), metrics will not be recorded.

Solution:

Make sure the PushGateway is running and accessible from the host at http://localhost:9091. The batch script runs on the host, not inside a container, so it needs host-level access to the PushGateway.

If your issue is not covered here, check the FAQ or open an issue on the Codeberg repository.