Troubleshooting
Common deployment issues, diagnostic commands, and fixes.
Quick Diagnostics
Run these commands to get a snapshot of system health:
# All containers status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Health endpoint
curl -s http://localhost:80/health | python3 -m json.tool
# Disk usage
df -h /
docker system df
# Memory usage
free -h
docker stats --no-stream
Common Issues
Container Shows (unhealthy)
# Check logs for the specific service
docker logs zol-<service-name> --tail 50
# Check health check history
docker inspect zol-<service-name> --format='{{json .State.Health.Log}}' | python3 -m json.tool
# Restart the specific service
docker restart zol-<service-name>
Application Can't Connect to Database
All containers must be on the zol-network Docker network.
# Verify network
docker network inspect zol-network
# Check if database is accepting connections
# (${POSTGRES_USER} and ${POSTGRES_DB} expand in your host shell, so they must be set there)
docker exec zol-postgres pg_isready -U "${POSTGRES_USER}" -d "${POSTGRES_DB}"
# Check DATABASE_URL in the app container
docker exec zol-app printenv DATABASE_URL
Common causes:
- Wrong password in .env.prod
- PostgreSQL not fully started yet (wait for (healthy))
- Network name mismatch between compose files
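For the "not fully started yet" case, polling the container's health status is usually more reliable than retrying by hand. The following is a minimal sketch, assuming the container defines a health check; `wait_for_healthy` is a hypothetical helper name, and the 60-second default timeout and 2-second interval are arbitrary choices.

```shell
# Poll a container's Docker health status until it is "healthy" or a timeout expires.
# Hypothetical helper; assumes the container has a HEALTHCHECK configured.
wait_for_healthy() {
  container="$1"
  timeout="${2:-60}"   # seconds; illustrative default
  elapsed=0
  status=""
  while [ "$elapsed" -lt "$timeout" ]; do
    # .State.Health.Status is "starting", "healthy", or "unhealthy"
    status="$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)"
    if [ "$status" = "healthy" ]; then
      echo "$container is healthy after ${elapsed}s"
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  echo "$container not healthy after ${timeout}s (last status: ${status:-unknown})" >&2
  return 1
}

# Usage (not executed here):
#   wait_for_healthy zol-postgres 90 && docker restart zol-app
```

A non-zero return means the database never reported healthy within the timeout, which points at PostgreSQL itself rather than at the application's connection settings.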
Port Conflicts
# Check what's using port 80
sudo lsof -i :80
# or
sudo ss -tlnp | grep :80
# Check what's using port 443
sudo lsof -i :443
If another service (like Apache or nginx) is using port 80, stop it:
sudo systemctl stop apache2
sudo systemctl disable apache2
Out of Memory (OOM)
Symptoms: containers being killed; docker ps -a shows Exited (137).
# Check which container was OOM-killed
docker inspect zol-<service> --format='{{.State.OOMKilled}}'
# Check system memory
free -h
# Check container memory usage
docker stats --no-stream
# Check kernel OOM logs (reading the kernel log may require root)
sudo dmesg | grep -i "oom\|killed" | tail -20
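Exit status 137 is 128 + 9, meaning the process was terminated by signal 9 (SIGKILL). The kernel OOM killer sends SIGKILL, but so does docker kill, which is why the OOMKilled flag is worth checking as well. A small sketch of that arithmetic, with `describe_exit_code` as a hypothetical helper name:

```shell
# Decode a container exit status: values above 128 conventionally mean
# "terminated by signal (status - 128)".
describe_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    echo "exit $code = 128 + $sig: terminated by signal $sig"
  else
    echo "exit $code: process exited on its own"
  fi
}

# Usage (not executed here):
#   describe_exit_code "$(docker inspect zol-app --format='{{.State.ExitCode}}')"
```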
Fixes:
- Reduce PostgreSQL shared_buffers
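- Add swap space (the swap file must be readable by root only; it is not re-enabled after a reboot unless added to /etc/fstab):
sudo fallocate -l 4G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile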
Out of Disk Space
# Check disk usage
df -h
# Docker disk usage breakdown
docker system df -v
# Clean old Docker images
docker image prune -f
# Clean old Docker build cache
docker builder prune -f
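# Broader cleanup: removes stopped containers, unused networks, dangling images, and build cache
# (add -a to also remove all unused images, --volumes to also prune unused volumes)
docker system prune -f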
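The cleanup commands above can be gated on actual disk pressure so they are safe to run from a scheduled job. This is a sketch only: `disk_usage_percent` and `prune_if_full` are hypothetical helper names, and the 85% default threshold is an arbitrary illustrative value.

```shell
# Print the used-space percentage (digits only) for a mount point, default "/".
disk_usage_percent() {
  # -P forces the POSIX one-line-per-filesystem layout; column 5 is the capacity.
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Prune dangling images and build cache only when usage reaches the threshold.
prune_if_full() {
  threshold="${1:-85}"   # percent; illustrative default
  used="$(disk_usage_percent /)"
  if [ "$used" -ge "$threshold" ]; then
    echo "disk at ${used}%, pruning"
    docker image prune -f
    docker builder prune -f
  else
    echo "disk at ${used}%, below ${threshold}%, nothing to do"
  fi
}

# Usage (not executed here):
#   prune_if_full 90
```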
Embedding Failures (OpenAI)
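Embedding inference is delegated to the OpenAI API (text-embedding-3-large, configured for 1536-dimensional output; the model's native size is 3072, so a result of 3072 suggests the dimension setting is not being applied). Confirm connectivity from inside the app container: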
# Test the embedding endpoint with the configured key
docker exec zol-rag-app python -c "
from app.services.embedding_service import EmbeddingService
import asyncio

async def main():
    svc = EmbeddingService()
    v = await svc.embed_text('test')
    print('dim=', len(v))

asyncio.run(main())
"
# Expected: dim= 1536
If this fails, check that (a) EMBEDDING_PROVIDER=openai and EMBEDDING_MODEL=text-embedding-3-large are set in .env.prod, (b) OPENAI_API_KEY is set and non-empty, and (c) the host has outbound HTTPS access to api.openai.com:443. See ADR-0048 for the migration rationale and the class of compose-override drift bugs it closed.
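Checks (a) and (b) can be done without printing the secret. The sketch below assumes .env.prod uses plain KEY=value lines; `check_env_keys` is a hypothetical helper name, and it does not verify that the values are correct, only that they are present.

```shell
# Report which of the given keys are missing or empty in an env file.
# Assumes plain KEY=value lines; a value written as KEY="" would still pass.
check_env_keys() {
  env_file="$1"
  shift
  missing=0
  for key in "$@"; do
    # Require at least one character after "KEY="
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      echo "missing or empty: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (not executed here):
#   check_env_keys .env.prod EMBEDDING_PROVIDER EMBEDDING_MODEL OPENAI_API_KEY
# For (c), any HTTP status code in the output shows that outbound HTTPS works:
#   curl -sS -o /dev/null -w '%{http_code}\n' https://api.openai.com/v1/models
```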
Migrations Failed
# Check current migration state
docker run --rm \
--network zol-network \
--env-file .env.prod \
zol-rag-app:${GIT_SHA} \
alembic current
# Check migration history
docker run --rm \
--network zol-network \
--env-file .env.prod \
zol-rag-app:${GIT_SHA} \
alembic history --verbose | head -20
# Retry migrations
docker run --rm \
--network zol-network \
--env-file .env.prod \
zol-rag-app:${GIT_SHA} \
alembic upgrade head
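The three commands above differ only in the Alembic subcommand, so they can be wrapped in one function. This is a sketch under the same assumptions as the commands themselves (image name, network, and env file as shown); `run_alembic` is a hypothetical helper name. It also fails early when GIT_SHA is unset, which otherwise yields an invalid image reference ending in a bare colon.

```shell
# Run an Alembic subcommand in a throwaway container of the deployed image.
# ${GIT_SHA:?...} aborts with a message if GIT_SHA is unset or empty.
run_alembic() {
  docker run --rm \
    --network zol-network \
    --env-file .env.prod \
    "zol-rag-app:${GIT_SHA:?set GIT_SHA to the deployed image tag}" \
    alembic "$@"
}

# Usage (not executed here):
#   run_alembic current
#   run_alembic history --verbose | head -20
#   run_alembic upgrade head
```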
Forgot Admin Password
# Generate a new bcrypt hash
docker exec zol-app python -c "
from passlib.context import CryptContext
pwd = CryptContext(schemes=['bcrypt'], deprecated='auto')
print(pwd.hash('NEW_PASSWORD'))
"
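# Update in database. Pass the hash as a single-quoted psql variable: bcrypt hashes
# contain "$", which the shell would expand inside a double-quoted -c string.
docker exec -i zol-postgres psql \
-U ${POSTGRES_USER} -d ${POSTGRES_DB} \
-v hash='<hash>' <<'SQL'
UPDATE app.users SET hashed_password = :'hash' WHERE email = 'admin@zol.be';
SQL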
Redis Connection Refused
# Check Redis is running with auth
docker exec zol-redis redis-cli -a ${REDIS_PASSWORD} ping
# Expected: PONG
# Check if password matches
docker exec zol-app printenv REDIS_URL
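To compare the two values without reading them by eye, the password can be extracted from the URL. This sketch assumes REDIS_URL has the common form redis://[user]:password@host:port/db, which the source does not confirm; `redis_url_password` is a hypothetical helper name, and it does not decode URL-encoded characters.

```shell
# Print the password part of a redis:// or rediss:// URL, or nothing if the URL has no credentials.
redis_url_password() {
  printf '%s\n' "$1" |
    # Keep only the userinfo between "://" and "@"
    sed -n 's|^rediss\{0,1\}://\([^@]*\)@.*$|\1|p' |
    # Drop the optional "user" part and the first ":"
    sed 's|^[^:]*:||'
}

# Usage (not executed here):
#   url_pw="$(redis_url_password "$(docker exec zol-app printenv REDIS_URL)")"
#   [ "$url_pw" = "$REDIS_PASSWORD" ] && echo "password matches" || echo "password mismatch"
```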
Keycloak Issues
Common Keycloak problems and their solutions.
# Check Keycloak logs
docker logs zol-keycloak --tail 50
# Verify Keycloak is accepting connections
curl -s http://localhost:8080/health/ready
# Check realm configuration
curl -s http://localhost:8080/realms/zol/.well-known/openid-configuration | python3 -m json.tool
Common causes:
- Realm import failed: Check that the realm JSON was mounted correctly at startup. Re-import with docker exec zol-keycloak /opt/keycloak/bin/kc.sh import --dir /opt/keycloak/data/import
- Token validation errors: Verify KEYCLOAK_URL and KEYCLOAK_REALM match in both the backend .env and the frontend configuration. Ensure the JWKS endpoint is reachable from the backend container.
- Redirect URI mismatch: The Keycloak client must list all valid redirect URIs. Add http://localhost:4000/* for development and the production URL for deployment.
- CORS errors on login: The Keycloak client's "Web Origins" must include the frontend origin.
WebSocket Connection Failing
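# Test WebSocket endpoint (a successful upgrade returns "HTTP/1.1 101 Switching Protocols")
# The key must be base64 of 16 bytes; this is the sample value from RFC 6455.
curl -i -N --http1.1 --max-time 5 \
-H "Connection: Upgrade" \
-H "Upgrade: websocket" \
-H "Sec-WebSocket-Version: 13" \
-H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
http://localhost:80/ws/query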
Check nginx config allows WebSocket upgrades (the production config includes this).
Debug Commands Cheat Sheet
# Container shell access
docker exec -it zol-app bash
docker exec -it zol-postgres psql -U ${POSTGRES_USER} -d ${POSTGRES_DB}
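# Network debugging: TCP reachability via bash's /dev/tcp (assumes bash and timeout exist in the image)
docker exec zol-app bash -c 'timeout 3 bash -c "</dev/tcp/postgres/5432" && echo "Can reach postgres" || echo "Cannot reach postgres"'
docker exec zol-app bash -c 'timeout 3 bash -c "</dev/tcp/redis/6379" && echo "Can reach redis" || echo "Cannot reach redis"'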
# Check environment variables in app container
docker exec zol-app env | sort
# Real-time resource monitoring
docker stats
# Docker events (useful for debugging restarts)
docker events --filter container=zol-app --since 1h