
Troubleshooting

Common deployment issues, diagnostic commands, and fixes.

Quick Diagnostics

Run these commands to get a snapshot of system health:

# All containers status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Health endpoint
curl -s http://localhost:80/health | python3 -m json.tool

# Disk usage
df -h /
docker system df

# Memory usage
free -h
docker stats --no-stream

Common Issues

Container Shows (unhealthy)

# Check logs for the specific service
docker logs zol-<service-name> --tail 50

# Check health check history
docker inspect zol-<service-name> --format='{{json .State.Health.Log}}' | python3 -m json.tool

# Restart the specific service
docker restart zol-<service-name>

Application Can't Connect to Database

All containers must be on the zol-network Docker network.

# Verify network
docker network inspect zol-network

# Check if database is accepting connections
docker exec zol-postgres pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}

# Check DATABASE_URL in the app container
docker exec zol-app printenv DATABASE_URL

Common causes:

  • Wrong password in .env.prod
  • PostgreSQL not fully started yet (wait for (healthy))
  • Network name mismatch between compose files
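
When the cause is PostgreSQL still starting, it can help to wait until Docker reports the container as healthy before retrying the application. The following is a minimal sketch, assuming the zol-postgres container defines a Docker health check. The function name wait_healthy and its default timeout are illustrative and are not part of the deployment scripts.

```shell
# Poll a container's Docker health status until it reports "healthy".
# Usage: wait_healthy <container> [timeout_seconds] [poll_interval_seconds]
wait_healthy() {
  local container="$1" timeout="${2:-60}" interval="${3:-2}" waited=0 status=""
  while [ "$waited" -lt "$timeout" ]; do
    # Typically prints healthy, unhealthy, or starting; stays empty if the
    # container is missing or has no health check (assumption noted above).
    status="$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)"
    if [ "$status" = "healthy" ]; then
      echo "$container is healthy after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "$container not healthy after ${timeout}s (last status: ${status:-unknown})" >&2
  return 1
}
```

For example, wait_healthy zol-postgres 120 && docker restart zol-app restarts the application only once the database reports healthy.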

Port Conflicts

# Check what's using port 80
sudo lsof -i :80
# or
sudo ss -tlnp | grep ':80 '

# Check what's using port 443
sudo lsof -i :443

If another service (like Apache or nginx) is using port 80, stop and disable it. For Apache:

sudo systemctl stop apache2
sudo systemctl disable apache2

Out of Memory (OOM)

Symptoms: containers are killed unexpectedly, and docker ps -a shows Exited (137).

# Check which container was OOM-killed
docker inspect zol-<service> --format='{{.State.OOMKilled}}'

# Check system memory
free -h

# Check container memory usage
docker stats --no-stream

# Check kernel OOM logs
dmesg | grep -i "oom\|killed" | tail -20

Fixes:

  • Reduce PostgreSQL shared_buffers
  • Add swap space: sudo fallocate -l 4G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile (add the swap file to /etc/fstab to keep it after a reboot)
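
Before applying either fix, it is worth knowing which containers were actually OOM-killed. The docker inspect check above can be run across all containers at once. This is a sketch only: the helper name list_oom_killed is hypothetical, and it assumes every relevant container name contains the zol- prefix used in this guide.

```shell
# Print the name of every container (running or exited) whose name contains
# the given prefix and whose last exit was caused by the OOM killer.
list_oom_killed() {
  local prefix="${1:-zol-}" name
  docker ps -a --filter "name=${prefix}" --format '{{.Names}}' |
    while read -r name; do
      # .State.OOMKilled is "true" when the kernel killed the container for memory.
      if [ "$(docker inspect --format '{{.State.OOMKilled}}' "$name" </dev/null)" = "true" ]; then
        echo "$name"
      fi
    done
}
```

Running list_oom_killed with no arguments lists the affected zol- containers, one per line; no output means none were OOM-killed.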

Out of Disk Space

# Check disk usage
df -h

# Docker disk usage breakdown
docker system df -v

# Clean old Docker images
docker image prune -f

# Clean old Docker build cache
docker builder prune -f

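# Broader cleanup: also removes stopped containers, unused networks, dangling images, and build cache
# (add -a to remove all unused images; volumes are kept unless --volumes is given)
docker system prune -f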

Embedding Failures (OpenAI)

Embedding inference is delegated to the OpenAI API (text-embedding-3-large, 1536 dim; this model returns 3072 dimensions by default, so 1536 implies a reduced dimensions setting). Confirm connectivity from inside the app container:

# Test the embedding endpoint with the configured key
docker exec zol-rag-app python -c "
from app.services.embedding_service import EmbeddingService
import asyncio
async def main():
    svc = EmbeddingService()
    v = await svc.embed_text('test')
    print('dim=', len(v))
asyncio.run(main())
"
# Expected: dim= 1536

If this fails, check:

  • EMBEDDING_PROVIDER=openai and EMBEDDING_MODEL=text-embedding-3-large are set in .env.prod
  • OPENAI_API_KEY is set and non-empty
  • The host has outbound HTTPS access to api.openai.com:443

See ADR-0048 for the migration rationale and for the class of compose-override drift bugs that the migration closed.
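
The first two checks above can be scripted against the env file. The sketch below assumes .env.prod uses plain, unquoted KEY=value lines; the helper names env_value and check_embedding_env are illustrative only. It does not test outbound connectivity to api.openai.com.

```shell
# Print the value of KEY from KEY=value lines in an env file (last occurrence wins).
env_value() {
  awk -v key="$1" 'index($0, key "=") == 1 { v = substr($0, length(key) + 2) } END { print v }' "$2"
}

# Report embedding settings that are missing or differ from the expected values.
# Returns the number of problems found (0 means the settings look consistent).
check_embedding_env() {
  local env_file="${1:-.env.prod}" problems=0 provider model
  provider="$(env_value EMBEDDING_PROVIDER "$env_file")"
  model="$(env_value EMBEDDING_MODEL "$env_file")"
  if [ "$provider" != "openai" ]; then
    echo "EMBEDDING_PROVIDER is '${provider}', expected 'openai'"
    problems=$((problems + 1))
  fi
  if [ "$model" != "text-embedding-3-large" ]; then
    echo "EMBEDDING_MODEL is '${model}', expected 'text-embedding-3-large'"
    problems=$((problems + 1))
  fi
  if [ -z "$(env_value OPENAI_API_KEY "$env_file")" ]; then
    echo "OPENAI_API_KEY is missing or empty"
    problems=$((problems + 1))
  fi
  return "$problems"
}
```

Calling check_embedding_env .env.prod prints one line per problem and stays silent when all three settings are present. The key itself is never printed.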

Migrations Failed

# Check current migration state
docker run --rm \
  --network zol-network \
  --env-file .env.prod \
  zol-rag-app:${GIT_SHA} \
  alembic current

# Check migration history
docker run --rm \
  --network zol-network \
  --env-file .env.prod \
  zol-rag-app:${GIT_SHA} \
  alembic history --verbose | head -20

# Retry migrations
docker run --rm \
  --network zol-network \
  --env-file .env.prod \
  zol-rag-app:${GIT_SHA} \
  alembic upgrade head

Forgot Admin Password

# Generate a new bcrypt hash
docker exec zol-app python -c "
from passlib.context import CryptContext
pwd = CryptContext(schemes=['bcrypt'], deprecated='auto')
print(pwd.hash('NEW_PASSWORD'))
"

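# Update in database (paste the generated hash in place of <hash>).
# The command is double-quoted, so escape every $ in the hash as \$;
# otherwise the shell expands $2b, $12, ... and stores a corrupted hash.
docker exec -it zol-postgres psql \
  -U ${POSTGRES_USER} -d ${POSTGRES_DB} \
  -c "UPDATE app.users SET hashed_password = '<hash>' WHERE email = 'admin@zol.be';"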
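
Because a bcrypt hash contains $ characters, pasting it into a shell-quoted SQL string is easy to get wrong. One alternative is to pass the hash as a psql variable and feed the SQL over standard input, where psql expands :'name' into a safely quoted literal (it does not do this for -c commands). This is a sketch under the same assumptions as the commands above (containers zol-app and zol-postgres, table app.users); the function name reset_password is hypothetical.

```shell
# Reset a user's password without pasting the hash into a quoted SQL string.
# Usage: reset_password <email> <new-password>
# Requires POSTGRES_USER and POSTGRES_DB to be set in the calling shell.
reset_password() {
  local email="$1" new_password="$2" hash
  # Generate the bcrypt hash inside the app container. The password is passed
  # as an argument (sys.argv[1]) rather than interpolated into the Python source.
  hash="$(docker exec zol-app python -c '
import sys
from passlib.context import CryptContext
print(CryptContext(schemes=["bcrypt"], deprecated="auto").hash(sys.argv[1]))
' "$new_password")" || return 1
  # The quoted heredoc delimiter stops the shell from touching the SQL;
  # psql substitutes the -v variables as quoted literals.
  docker exec -i zol-postgres psql \
    -U "${POSTGRES_USER}" -d "${POSTGRES_DB}" \
    -v ON_ERROR_STOP=1 -v hash="$hash" -v email="$email" <<'SQL'
UPDATE app.users SET hashed_password = :'hash' WHERE email = :'email';
SQL
}
```

A call such as reset_password admin@zol.be 'NEW_PASSWORD' should report UPDATE 1 on success. Note that the password is briefly visible in the host's process list, so this is best run on a trusted machine.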

Redis Connection Refused

# Check Redis is running with auth
docker exec zol-redis redis-cli -a ${REDIS_PASSWORD} ping
# Expected: PONG

# Check if password matches
docker exec zol-app printenv REDIS_URL
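
To compare the two values without reading them by eye, the password can be extracted from the URL and tested against REDIS_PASSWORD. The sketch below assumes REDIS_URL has the common redis://[user]:password@host:port/db shape and that the password is not percent-encoded; redis_url_password is an illustrative helper name.

```shell
# Print the password component of a redis://[user]:password@host:port/db URL.
# Returns 1 if the URL carries no credentials.
redis_url_password() {
  local url="$1" auth
  case "$url" in
    *@*) ;;
    *) return 1 ;;
  esac
  auth="${url#*://}"   # drop the scheme
  auth="${auth%@*}"    # keep everything before the last @ (user:password)
  printf '%s\n' "${auth#*:}"   # drop the optional user and the first colon
}
```

For example: [ "$(redis_url_password "$(docker exec zol-app printenv REDIS_URL)")" = "$REDIS_PASSWORD" ] && echo "passwords match".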

Keycloak Issues

Common Keycloak problems and their solutions.

# Check Keycloak logs
docker logs zol-keycloak --tail 50

# Verify Keycloak is accepting connections
curl -s http://localhost:8080/health/ready

# Check realm configuration
curl -s http://localhost:8080/realms/zol/.well-known/openid-configuration | python3 -m json.tool

Common causes:

  • Realm import failed: Check that the realm JSON was mounted correctly at startup. Re-import with docker exec zol-keycloak /opt/keycloak/bin/kc.sh import --dir /opt/keycloak/data/import
  • Token validation errors: Verify KEYCLOAK_URL and KEYCLOAK_REALM match in both the backend .env and the frontend configuration. Ensure the JWKS endpoint is reachable from the backend container.
  • Redirect URI mismatch: The Keycloak client must list all valid redirect URIs. Add http://localhost:4000/* for development and the production URL for deployment.
  • CORS errors on login: The Keycloak client's "Web Origins" must include the frontend origin.

WebSocket Connection Failing

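# Test WebSocket endpoint (a successful handshake returns HTTP/1.1 101 Switching Protocols)
# Sec-WebSocket-Key must be a base64-encoded 16-byte value; this is the sample nonce from RFC 6455
curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  http://localhost:80/ws/query

Check that the nginx config forwards the Upgrade and Connection headers so WebSocket upgrades can succeed (the production config includes this).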
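
The handshake test can be turned into a pass/fail check by looking only at the first response line. This is a sketch, not part of the deployment scripts: check_ws_upgrade is a hypothetical name, the default URL is the one used above, and the 5-second limit is arbitrary (the connection stays open after a successful upgrade, so curl is expected to stop on the timeout).

```shell
# Succeed if the endpoint answers a WebSocket upgrade request with HTTP 101.
# Usage: check_ws_upgrade [url]
check_ws_upgrade() {
  local url="${1:-http://localhost:80/ws/query}" status_line
  # -i prints response headers; head keeps only the status line.
  status_line="$(curl -s -i -N --max-time 5 \
    -H "Connection: Upgrade" \
    -H "Upgrade: websocket" \
    -H "Sec-WebSocket-Version: 13" \
    -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
    "$url" | head -n 1)"
  case "$status_line" in
    *" 101"*) echo "WebSocket upgrade accepted"; return 0 ;;
    *) echo "WebSocket upgrade failed: ${status_line:-no response}" >&2; return 1 ;;
  esac
}
```

A non-101 status such as 400 or 426 usually points at the proxy dropping the upgrade headers rather than at the application.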

Debug Commands Cheat Sheet

# Container shell access
docker exec -it zol-app bash
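docker exec -it zol-postgres psql -U ${POSTGRES_USER} -d ${POSTGRES_DB}

# Network debugging (plain TCP connect test; curl cannot speak the PostgreSQL or Redis protocols)
docker exec zol-app bash -c 'timeout 3 bash -c "</dev/tcp/postgres/5432" && echo "Can reach postgres" || echo "Cannot reach postgres"'
docker exec zol-app bash -c 'timeout 3 bash -c "</dev/tcp/redis/6379" && echo "Can reach redis" || echo "Cannot reach redis"'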

# Check environment variables in app container
docker exec zol-app env | sort

# Real-time resource monitoring
docker stats

# Docker events (useful for debugging restarts)
docker events --filter container=zol-app --since 1h
