# ComfyUI Gateway -- Troubleshooting Guide

Comprehensive troubleshooting reference for diagnosing and resolving issues with the ComfyUI Gateway. Every section follows the **Symptom -> Cause -> Solution** format with concrete commands you can run immediately.

---

## Table of Contents

1. [ComfyUI Not Reachable](#1-comfyui-not-reachable)
2. [OOM (Out of Memory) Errors](#2-oom-out-of-memory-errors)
3. [Slow Generation](#3-slow-generation)
4. [Webhook Failures](#4-webhook-failures)
5. [Redis Connection Issues](#5-redis-connection-issues)
6. [Storage Errors](#6-storage-errors)
7. [Database Issues](#7-database-issues)
8. [Job Stuck in "running"](#8-job-stuck-in-running)
9. [Rate Limiting Issues](#9-rate-limiting-issues)
10. [Authentication Problems](#10-authentication-problems)

---

## 1. ComfyUI Not Reachable

The gateway returns `COMFYUI_UNREACHABLE` and the `/health` endpoint shows `comfyui.reachable: false`.

### 1a. Wrong COMFYUI_URL

**Symptom**: Gateway starts fine but every job fails with `COMFYUI_UNREACHABLE`. The health endpoint returns `{ ok: false, comfyui: { reachable: false } }`.

**Cause**: The `COMFYUI_URL` in `.env` does not point to a running ComfyUI instance.

**Solution**:

```bash
# 1. Verify what you have configured
grep COMFYUI_URL .env

# 2. Test connectivity from the gateway host
curl -s http://127.0.0.1:8188/
# Expected: HTML page or JSON from ComfyUI

# 3. If ComfyUI is on a different port or host, update .env
# Example: COMFYUI_URL=http://192.168.1.50:8188

# 4. Restart the gateway after changing .env
npm run dev
```

### 1b. Firewall Blocking the Port

**Symptom**: `curl` to the ComfyUI URL times out or returns `Connection refused`, but ComfyUI is confirmed running on that machine.

**Cause**: A host firewall (Windows Defender, iptables, ufw) is blocking the port.

**Solution**:

```bash
# Linux (ufw)
sudo ufw allow 8188/tcp
sudo ufw reload

# Linux (iptables)
sudo iptables -A INPUT -p tcp --dport 8188 -j ACCEPT

# Windows (PowerShell, run as Admin)
New-NetFirewallRule -DisplayName "ComfyUI" -Direction Inbound -LocalPort 8188 -Protocol TCP -Action Allow

# Verify the port is listening
# Linux
ss -tlnp | grep 8188
# Windows
netstat -an | findstr 8188
```

### 1c. Docker Networking

**Symptom**: Gateway running inside Docker cannot reach ComfyUI on `127.0.0.1:8188`.

**Cause**: `127.0.0.1` inside a Docker container refers to the container itself, not the host machine.

**Solution**:

```bash
# Option A: Use Docker's special host DNS name
# (built into Docker Desktop; on Linux with Docker Engine 20.10+, start the
# container with --add-host=host.docker.internal:host-gateway)
COMFYUI_URL=http://host.docker.internal:8188

# Option B: Use the host network mode
docker run --network host comfyui-gateway

# Option C: Put both containers on the same Docker network
docker network create comfy-net
docker run --name comfyui --network comfy-net ...
docker run --name gateway --network comfy-net -e COMFYUI_URL=http://comfyui:8188 ...

# Verify from inside the gateway container
docker exec -it gateway sh -c "wget -qO- http://comfyui:8188/ || echo FAIL"
```

### 1d. WSL2 Networking

**Symptom**: Gateway running on Windows/WSL2 cannot reach ComfyUI running on the other side (host vs WSL or vice-versa).

**Cause**: WSL2 uses a virtual network adapter. The WSL2 guest and Windows host have different IP addresses.

**Solution**:

```bash
# From WSL2, get the Windows host IP
grep nameserver /etc/resolv.conf | awk '{print $2}'
# Example output: 172.25.192.1

# Set COMFYUI_URL to that IP
COMFYUI_URL=http://172.25.192.1:8188

# Alternatively, if ComfyUI runs inside WSL2 and the gateway is on Windows:
# Find WSL2 IP
wsl hostname -I
# Example output: 172.25.198.5
# Set: COMFYUI_URL=http://172.25.198.5:8188

# Make sure ComfyUI is listening on 0.0.0.0, not just 127.0.0.1
# Launch ComfyUI with: python main.py --listen 0.0.0.0
```

### 1e. ComfyUI Not Started or Crashed

**Symptom**: Port is not listening at all.

**Cause**: ComfyUI process is not running.

**Solution**:

```bash
# Check if the process is running
# Linux
ps aux | grep "main.py"
# Windows
tasklist | findstr python

# Start ComfyUI
cd /path/to/ComfyUI
python main.py --listen 0.0.0.0 --port 8188

# Check logs for startup errors
python main.py --listen 0.0.0.0 --port 8188 2>&1 | tail -50

# Verify it is accepting connections
curl -s http://127.0.0.1:8188/ && echo "OK" || echo "NOT REACHABLE"
```

---

## 2. OOM (Out of Memory) Errors

The gateway classifies these as `COMFYUI_OOM` with `retryable: false`.

### 2a. Resolution or Batch Size Too Large

**Symptom**: Job fails with error containing "CUDA out of memory", "allocator backend out of memory", or "failed to allocate".

**Cause**: The requested image dimensions or batch size exceed available VRAM.

**Solution**:

```bash
# 1. Reduce resolution in your job request
# Instead of 2048x2048, try 1024x1024 or 768x768
curl -X POST http://localhost:3000/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{
    "workflowId": "sdxl_realism_v1",
    "inputs": {
      "prompt": "a mountain landscape",
      "width": 1024,
      "height": 1024
    }
  }'

# 2. Reduce batch size to 1
# Set in your job inputs: "batch_size": 1

# 3. Lower the gateway-level limits in .env
MAX_IMAGE_SIZE=1024
MAX_BATCH_SIZE=2
```

### 2b. Too Many Steps

**Symptom**: OOM occurs mid-generation, not immediately at submission.

**Cause**: The sampler accumulates intermediate tensors over many steps.

**Solution**:

```bash
# Reduce steps in the job inputs
# Instead of 50 steps, try 20-30
curl -X POST http://localhost:3000/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{
    "workflowId": "sdxl_realism_v1",
    "inputs": {
      "prompt": "a portrait photo",
      "steps": 20,
      "width": 1024,
      "height": 1024
    }
  }'
```

### 2c. Model Quantization

**Symptom**: Even at low resolution, OOM errors occur because the model is too large for the GPU (common on 8 GB VRAM cards with SDXL).

**Cause**: Full-precision (fp32) or half-precision (fp16) model weights exceed available VRAM.

**Solution**:

```bash
# In ComfyUI, use fp8 or quantized checkpoints
# Update your workflow template to use a quantized model:
# e.g., "ckpt_name": "sdxl_base_1.0_fp8.safetensors"

# Or add --fp8_e4m3fn-unet flag when starting ComfyUI
python main.py --listen 0.0.0.0 --fp8_e4m3fn-unet

# Monitor VRAM usage
nvidia-smi -l 2
```

### 2d. VAE Tiling

**Symptom**: OOM happens during the VAE decode step (after sampling completes).

**Cause**: The VAE decoder processes the entire latent at once, which can be very memory-intensive at high resolutions.

**Solution**:

```
Enable VAE tiling in your ComfyUI workflow by adding a "VAEDecodeTiled" node
instead of "VAEDecode". Tile size of 512 is a good default.

In the workflow JSON template:
{
  "10": {
    "class_type": "VAEDecodeTiled",
    "inputs": {
      "samples": ["3", 0],
      "vae": ["4", 2],
      "tile_size": 512
    }
  }
}
```

---

## 3. Slow Generation

### 3a. GPU Not Being Utilized

**Symptom**: Jobs complete but take much longer than expected. GPU utilization stays near 0%.

**Cause**: ComfyUI is falling back to CPU inference, or the wrong GPU is selected.

**Solution**:

```bash
# 1. Check GPU utilization during a job
nvidia-smi -l 1
# Look for "GPU-Util" column -- should be 80-100% during sampling

# 2. Verify CUDA is available in ComfyUI
# Check ComfyUI startup logs for "Using device: cuda"

# 3. Force GPU selection (multi-GPU systems)
CUDA_VISIBLE_DEVICES=0 python main.py --listen 0.0.0.0

# 4. Verify PyTorch sees the GPU
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
```

### 3b. Model Loading on Every Job

**Symptom**: First job is slow, subsequent jobs with the same workflow are faster, but switching workflows causes long delays.

**Cause**: ComfyUI loads the model from disk each time a different checkpoint is requested. This can take 10-30 seconds per model load.

**Solution**:

```bash
# 1. Increase ComfyUI's model cache
# Start ComfyUI with a larger cache (default is 1 model):
# (caching flags differ between ComfyUI versions; confirm the exact name
# with: python main.py --help)
python main.py --listen 0.0.0.0 --cache-size 3

# 2. Use the same checkpoint across workflows when possible
# Standardize on one checkpoint (e.g., sdxl_base_1.0.safetensors)

# 3. Place models on an SSD, not an HDD
# Move ComfyUI/models/ to an NVMe drive for faster load times
```

### 3c. Queue Depth / Concurrency

**Symptom**: Jobs are queued for a long time before starting. The job stays in `status: "queued"` for minutes.

**Cause**: The worker concurrency is set to 1 (default) and multiple jobs are queued, or the single slot is occupied by a long-running job.

**Solution**:

```bash
# 1. Check current queue state (URLs are quoted so the shell leaves "?" alone)
curl -s "http://localhost:3000/jobs?status=queued" -H "X-API-Key: your-key" | jq '.count'
curl -s "http://localhost:3000/jobs?status=running" -H "X-API-Key: your-key" | jq '.count'

# 2. Increase concurrency if your GPU can handle it (multi-batch)
# Edit .env:
MAX_CONCURRENCY=2
# WARNING: Only increase if you have enough VRAM for parallel jobs.
# Two concurrent 1024x1024 SDXL jobs need ~20+ GB VRAM.

# 3. For multi-GPU setups, run multiple worker processes
# Terminal 1:
CUDA_VISIBLE_DEVICES=0 npm run start:worker
# Terminal 2:
CUDA_VISIBLE_DEVICES=1 npm run start:worker
# Both connect to the same Redis queue
```

### 3d. ComfyUI Startup Time

**Symptom**: The very first job after starting ComfyUI takes 30-60 seconds even for a simple generation.

**Cause**: ComfyUI performs initialization (loading nodes, compiling, warming up CUDA) on the first prompt.

**Solution**:

```bash
# 1. Send a warm-up job immediately after starting ComfyUI
# This is a tiny 64x64 generation that forces initialization
curl -X POST http://localhost:3000/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{
    "workflowId": "sdxl_realism_v1",
    "inputs": {
      "prompt": "test",
      "width": 64,
      "height": 64,
      "steps": 1
    }
  }'

# 2. Increase the gateway timeout to account for cold starts
COMFYUI_TIMEOUT_MS=600000
```

---

## 4. Webhook Failures

Webhook errors appear in logs as `WEBHOOK_DELIVERY_FAILED`.

### 4a. DNS Resolution Failure

**Symptom**: Webhook fails with "getaddrinfo ENOTFOUND" or "DNS lookup failed".

**Cause**: The callback URL hostname cannot be resolved.

**Solution**:

```bash
# 1. Test DNS resolution from the gateway host
nslookup your-webhook-domain.com
dig your-webhook-domain.com

# 2. If using a local hostname (e.g., within Docker), make sure it is resolvable
# Add to /etc/hosts if needed:
echo "192.168.1.50 my-webhook-server" | sudo tee -a /etc/hosts

# 3. Verify the callback URL is correct in your job request
curl -X POST http://localhost:3000/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{
    "workflowId": "sdxl_realism_v1",
    "inputs": { "prompt": "test" },
    "callbackUrl": "https://your-valid-domain.com/webhook"
  }'
```

### 4b. SSL Certificate Errors

**Symptom**: Webhook fails with "self signed certificate", "CERT_HAS_EXPIRED", or "unable to verify the first certificate".

**Cause**: The webhook receiver uses an invalid, expired, or self-signed SSL certificate.

**Solution**:

```bash
# 1. Test the certificate manually
openssl s_client -connect your-webhook-domain.com:443 -servername your-webhook-domain.com < /dev/null 2>&1 | head -20

# 2. Check expiration
echo | openssl s_client -connect your-webhook-domain.com:443 2>/dev/null | openssl x509 -noout -dates

# 3. For development with self-signed certs, set NODE_TLS_REJECT_UNAUTHORIZED
# WARNING: Do NOT use this in production
NODE_TLS_REJECT_UNAUTHORIZED=0 npm run dev

# 4. For production, fix the certificate (use Let's Encrypt or a valid CA)
```

### 4c. Webhook Timeout

**Symptom**: Webhook logs show "AbortError" or "Webhook POST timed out".

**Cause**: The webhook receiver takes longer than 10 seconds to respond. The gateway has a hardcoded 10-second timeout per webhook attempt with 3 retries and exponential backoff.

**Solution**:

```bash
# 1. Ensure your webhook receiver responds quickly
# The receiver should return 200 immediately and process asynchronously
# BAD:  app.post("/webhook", async (req, res) => { await longProcess(); res.send("ok"); })
# GOOD: app.post("/webhook", (req, res) => { res.send("ok"); enqueueWork(req.body); })

# 2. Test receiver response time
time curl -s -o /dev/null -w "%{time_total}" -X POST https://your-webhook.com/callback \
  -H "Content-Type: application/json" -d '{"test": true}'
# Should be < 2 seconds
```

### 4d. Domain Not in Allowlist

**Symptom**: Job creation fails with `Callback domain "example.com" is not in the allowed domains list`.

**Cause**: `WEBHOOK_ALLOWED_DOMAINS` is configured and does not include the callback URL's domain.

**Solution**:

```bash
# 1. Check current setting
grep WEBHOOK_ALLOWED_DOMAINS .env

# 2. Add the domain (comma-separated list)
WEBHOOK_ALLOWED_DOMAINS=your-app.com,n8n.your-domain.com,*.internal.company.com

# 3. Or allow all domains (less secure, suitable for development)
WEBHOOK_ALLOWED_DOMAINS=*

# 4. Restart the gateway
npm run dev
```

### 4e. HMAC Signature Mismatch

**Symptom**: Your webhook receiver receives the POST but HMAC validation fails on your end.

**Cause**: The `WEBHOOK_SECRET` configured in the gateway does not match the secret your receiver uses to validate signatures, or the signature computation differs.

**Solution**:

```bash
# 1. Verify the WEBHOOK_SECRET matches on both sides
grep WEBHOOK_SECRET .env

# 2. The gateway sends: X-Signature: sha256=<hex-digest>
# Computed as: HMAC-SHA256(secret, raw_body_string)
# Verify in Node.js:
node -e "
const crypto = require('crypto');
const secret = 'your-webhook-secret';
const body = '{\"jobId\":\"test\",\"status\":\"succeeded\"}';
const sig = crypto.createHmac('sha256', secret).update(body, 'utf8').digest('hex');
console.log('Expected header: sha256=' + sig);
"

# 3. Common mistakes:
# - Parsing the body before computing HMAC (must use raw string)
# - Using different encodings (gateway uses utf8)
# - Comparing strings case-sensitively (hex is lowercase)
```

---

## 5. Redis Connection Issues

### 5a. Cannot Connect to Redis

**Symptom**: Gateway crashes at startup with "Redis connection error" or "ECONNREFUSED" targeting the Redis port.

**Cause**: Redis server is not running, or the `REDIS_URL` is wrong.

**Solution**:

```bash
# 1. Check if Redis is running
redis-cli ping
# Expected: PONG

# 2. Verify the URL format
# Correct formats:
#   redis://localhost:6379
#   redis://:yourpassword@redis-host:6379/0
#   rediss://user:password@host:6380/0 (TLS)

# 3. Test connectivity
redis-cli -u "redis://localhost:6379" ping

# 4. If Redis is not needed, remove REDIS_URL to use in-memory queue
# Edit .env:
REDIS_URL=
# The gateway falls back to an in-memory queue automatically
```

### 5b. Redis Authentication Failure

**Symptom**: Error message contains "NOAUTH Authentication required" or "ERR invalid password".

**Cause**: Redis requires a password but `REDIS_URL` does not include one, or the password is wrong.

**Solution**:

```bash
# 1. Include the password in the URL
REDIS_URL=redis://:your_redis_password@localhost:6379/0

# 2. Test with redis-cli
redis-cli -a "your_redis_password" ping

# 3. Check Redis config for requirepass
redis-cli CONFIG GET requirepass
```

### 5c. Fallback to In-Memory Queue

**Symptom**: Logs show "No Redis URL configured, using in-memory queue" and you expected BullMQ.

**Cause**: `REDIS_URL` is empty or not set in `.env`.

**Solution**:

```bash
# 1. Set REDIS_URL in .env
REDIS_URL=redis://localhost:6379

# 2. Verify Redis is running
redis-cli ping

# 3. Restart the gateway
npm run dev

# 4. Confirm in logs: should show "Redis URL configured, using BullMQ worker"
```

> **Note**: The in-memory queue is fine for single-instance development deployments.
> For production with multiple workers or durability requirements, use Redis + BullMQ.

---

## 6. Storage Errors

### 6a. Local Disk Permission Denied

**Symptom**: Job fails at the output storage step with "EACCES: permission denied" or `STORAGE_READ_ERROR`.

**Cause**: The gateway process does not have write permissions to `STORAGE_LOCAL_PATH`.

**Solution**:

```bash
# 1. Check the configured path
grep STORAGE_LOCAL_PATH .env
# Default: ./data/outputs

# 2. Ensure the directory exists and is writable
mkdir -p ./data/outputs
chmod 755 ./data/outputs

# 3. Check ownership
ls -la ./data/

# 4. If running as a different user (e.g., in Docker)
chown -R node:node ./data/outputs

# 5. For Docker, mount a volume with correct permissions
# docker run -v /host/path/outputs:/app/data/outputs ...
```

### 6b. S3 Credentials Invalid

**Symptom**: Job fails with `STORAGE_S3_PUT_ERROR` and the underlying error mentions "InvalidAccessKeyId", "SignatureDoesNotMatch", or "AccessDenied".

**Cause**: The `S3_ACCESS_KEY` / `S3_SECRET_KEY` are wrong, expired, or the IAM policy does not grant `s3:PutObject` permission.

**Solution**:

```bash
# 1. Verify credentials are set
grep S3_ACCESS_KEY .env
grep S3_SECRET_KEY .env
grep S3_BUCKET .env

# 2. Test with AWS CLI
aws s3 ls s3://your-bucket/ \
  --endpoint-url http://your-minio:9000 \
  --region us-east-1

# 3. Test a put operation
echo "test" > /tmp/test.txt
aws s3 cp /tmp/test.txt s3://your-bucket/test.txt \
  --endpoint-url http://your-minio:9000

# 4. Minimum IAM policy for the gateway:
# {
#   "Version": "2012-10-17",
#   "Statement": [{
#     "Effect": "Allow",
#     "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListBucket"],
#     "Resource": ["arn:aws:s3:::your-bucket", "arn:aws:s3:::your-bucket/*"]
#   }]
# }
```

### 6c. MinIO Configuration

**Symptom**: S3 storage fails with "socket hang up", "ECONNREFUSED", or "Bucket does not exist".

**Cause**: MinIO endpoint is wrong, the bucket has not been created, or `forcePathStyle` is not enabled (handled automatically by the gateway).

**Solution**:

```bash
# 1. Verify MinIO is running
curl http://localhost:9000/minio/health/live
# Expected: HTTP 200

# 2. Set the correct endpoint in .env
S3_ENDPOINT=http://localhost:9000
S3_BUCKET=comfyui-outputs
S3_ACCESS_KEY=minioadmin
S3_SECRET_KEY=minioadmin
S3_REGION=us-east-1

# 3. Create the bucket if it does not exist
# Using mc (MinIO Client)
mc alias set local http://localhost:9000 minioadmin minioadmin
mc mb local/comfyui-outputs

# Or using AWS CLI
aws s3 mb s3://comfyui-outputs --endpoint-url http://localhost:9000
```

---

## 7. Database Issues

### 7a. SQLite WAL Lock Errors

**Symptom**: Intermittent "SQLITE_BUSY" or "database is locked" errors under concurrent load.

**Cause**: Multiple processes or threads are writing to the SQLite database simultaneously. SQLite WAL mode supports concurrent readers but only one writer.

**Solution**:

```bash
# 1. The gateway already sets optimal pragmas:
#    journal_mode = WAL
#    synchronous = NORMAL
#    busy_timeout = 5000 (5 seconds)

# 2. If running multiple gateway instances, switch to Postgres
DATABASE_URL=postgresql://user:password@localhost:5432/comfyui_gateway

# 3. If you must use SQLite with a single instance, increase busy timeout
# (requires code change or env override):
# The default 5000ms should be sufficient for most single-instance use cases

# 4. Check for stuck WAL files
ls -la ./data/gateway.db*
# You should see: gateway.db, gateway.db-wal, gateway.db-shm

# 5. If the database is corrupted, try recovery
sqlite3 ./data/gateway.db "PRAGMA integrity_check;"
# If it reports errors, back up and recreate:
cp ./data/gateway.db ./data/gateway.db.bak
sqlite3 ./data/gateway.db ".recover" | sqlite3 ./data/gateway_recovered.db
```

### 7b. Postgres Connection Pooling

**Symptom**: Errors like "too many clients already", "remaining connection slots are reserved", or intermittent "Connection terminated unexpectedly".

**Cause**: The gateway opens too many connections to Postgres, exceeding `max_connections`, or connections are not being properly returned to the pool.

**Solution**:

```bash
# 1. Check current connections in Postgres
psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'comfyui_gateway';"

# 2. Check max_connections setting
psql -c "SHOW max_connections;"

# 3. Use a connection pooler like PgBouncer
# Install PgBouncer and point DATABASE_URL to it
DATABASE_URL=postgresql://user:password@localhost:6432/comfyui_gateway

# 4. If running multiple gateway instances, ensure the total pool size
# across all instances does not exceed Postgres max_connections
```

### 7c. Database URL Format

**Symptom**: Gateway crashes at startup with "Invalid connection string" or uses SQLite when you intended Postgres.

**Cause**: The `DATABASE_URL` format is wrong. The gateway checks if the URL starts with `postgres://` or `postgresql://` to select the Postgres backend.

**Solution**:

```bash
# SQLite formats (all valid):
DATABASE_URL=./data/gateway.db
DATABASE_URL=/absolute/path/to/gateway.db

# Postgres formats (must start with postgres:// or postgresql://):
DATABASE_URL=postgresql://user:password@localhost:5432/comfyui_gateway
DATABASE_URL=postgres://user:password@host:5432/dbname?sslmode=require
```

---

## 8. Job Stuck in "running"

### 8a. ComfyUI Crashed During Execution

**Symptom**: A job shows `status: "running"` indefinitely. No progress updates. The gateway health endpoint may show `comfyui.reachable: false`.

**Cause**: ComfyUI crashed (segfault, CUDA error, killed by OOM killer) while processing the job, and the gateway's WebSocket connection was severed.

**Solution**:

```bash
# 1. Check job status (replace <job-id> with the ID of the stuck job)
curl -s http://localhost:3000/jobs/<job-id> | jq '.status'

# 2. Check if ComfyUI is still running
curl -s http://localhost:3000/health | jq '.comfyui.reachable'

# 3. If ComfyUI crashed, restart it
cd /path/to/ComfyUI
python main.py --listen 0.0.0.0

# 4. The stuck job will eventually time out (COMFYUI_TIMEOUT_MS, default 5 min)
# and be marked as failed with COMFYUI_TIMEOUT

# 5. To immediately cancel the stuck job
curl -X POST http://localhost:3000/jobs/<job-id>/cancel \
  -H "X-API-Key: your-key"

# 6. To reduce timeout for faster failure detection
COMFYUI_TIMEOUT_MS=120000
```

### 8b. WebSocket Disconnection

**Symptom**: Job stays "running" but ComfyUI is actually done. The output exists in ComfyUI's history.

**Cause**: The WebSocket connection dropped mid-execution, and the polling fallback failed to pick up the result.

**Solution**:

```bash
# 1. Check ComfyUI history directly
curl -s http://127.0.0.1:8188/history | jq 'keys | length'

# 2. The gateway automatically falls back to HTTP polling if WebSocket fails.
# If polling also fails, the job times out.

# 3. Restart the gateway to reset connections
npm run dev

# 4. Check network stability between gateway and ComfyUI
ping -c 10 <comfyui-host>
```

### 8c. Restart Recovery

**Symptom**: After restarting the gateway, jobs that were "running" remain in that state permanently.

**Cause**: The in-memory queue loses track of running jobs when the process restarts. There is no automatic recovery for in-memory jobs.

**Solution**:

```bash
# 1. For production, use Redis (BullMQ) for durable job queues
REDIS_URL=redis://localhost:6379

# 2. Manually fail stuck jobs via the database
sqlite3 ./data/gateway.db \
  "UPDATE jobs SET status='failed', errorJson='{\"code\":\"GATEWAY_RESTART\",\"message\":\"Job interrupted by gateway restart\"}', completedAt=datetime('now') WHERE status='running';"

# 3. Verify
sqlite3 ./data/gateway.db "SELECT id, status FROM jobs WHERE status='running';"
```

---

## 9. Rate Limiting Issues

### 9a. Identifying You Are Being Rate Limited

**Symptom**: API returns HTTP 429 with body `{ "error": "RATE_LIMITED" }` and a `Retry-After` header.

**Cause**: You exceeded `RATE_LIMIT_MAX` requests within the `RATE_LIMIT_WINDOW_MS` window. Limits are applied per API key or per IP.

**Solution**:

```bash
# 1. Check the response headers
curl -v http://localhost:3000/health -H "X-API-Key: your-key" 2>&1 | grep -iE "x-ratelimit|retry-after"
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 0
# Retry-After: 42

# 2. Wait for the Retry-After period, then retry

# 3. Implement exponential backoff in your client
```

### 9b. Adjusting Rate Limits

**Symptom**: Legitimate usage is being throttled.

**Cause**: Default limits (100 requests/minute) are too low for your workload.

**Solution**:

```bash
# 1. Increase the limit in .env
RATE_LIMIT_MAX=500
RATE_LIMIT_WINDOW_MS=60000

# 2. For burst workloads, widen the window
RATE_LIMIT_MAX=1000
RATE_LIMIT_WINDOW_MS=300000

# 3. Restart the gateway
npm run dev

# 4. Note: Rate limits are per API key (if authenticated) or per IP.
# Different API keys have independent counters.
```

### 9c. Rate Limit Per API Key vs Per IP

**Symptom**: Different clients sharing the same IP are interfering with each other's rate limits.

**Cause**: Without API keys, all requests from the same IP share a single rate-limit bucket.

**Solution**:

```bash
# 1. Assign unique API keys to each client
API_KEYS=client1-key:user,client2-key:user,admin-key:admin

# 2. Each client uses its own X-API-Key header
# Client 1: -H "X-API-Key: client1-key"
# Client 2: -H "X-API-Key: client2-key"

# 3. Each key gets its own independent rate-limit counter
```

---

## 10. Authentication Problems

### 10a. API Key Not Accepted

**Symptom**: Every request returns HTTP 401 with `{ "error": "AUTH_FAILED", "message": "Invalid API key" }`.

**Cause**: The `X-API-Key` header value does not match any entry in `API_KEYS`.

**Solution**:

```bash
# 1. Check configured keys
grep API_KEYS .env
# Format: key1:admin,key2:user

# 2. Ensure your request uses the exact key (no extra whitespace)
curl -H "X-API-Key: mykey123" http://localhost:3000/health

# 3. Keys are case-sensitive and matched exactly

# 4. If API_KEYS is empty, authentication is DISABLED (development mode)
# All requests are treated as admin. Set keys for production:
API_KEYS=sk-prod-abc123:admin,sk-user-xyz789:user
```

### 10b. JWT Token Expired

**Symptom**: Request returns `{ "error": "AUTH_FAILED", "message": "JWT token has expired" }`.

**Cause**: The JWT `exp` claim is in the past.

**Solution**:

```bash
# 1. Decode the JWT to check expiration (without verification)
# Replace <token> with the full JWT string
echo "<token>" | cut -d'.' -f2 | base64 -d 2>/dev/null | jq '.exp'

# 2. Compare with current time
date +%s

# 3. Generate a new token with a longer TTL
# Example using Node.js:
node -e "
const crypto = require('crypto');
const secret = 'your-jwt-secret';
const header = Buffer.from(JSON.stringify({alg:'HS256',typ:'JWT'})).toString('base64url');
const payload = Buffer.from(JSON.stringify({
  sub: 'user-1',
  role: 'admin',
  iat: Math.floor(Date.now()/1000),
  exp: Math.floor(Date.now()/1000) + 86400 // 24 hours
})).toString('base64url');
const sig = crypto.createHmac('sha256', secret).update(header+'.'+payload).digest('base64url');
console.log(header+'.'+payload+'.'+sig);
"
```

### 10c. JWT Signature Invalid

**Symptom**: Request returns `{ "error": "AUTH_FAILED", "message": "Invalid JWT signature" }`.

**Cause**: The JWT was signed with a different secret than what is configured in `JWT_SECRET`.

**Solution**:

```bash
# 1. Verify the secret matches on token-issuer side and gateway side
grep JWT_SECRET .env

# 2. The gateway uses HMAC-SHA256 (HS256) exclusively
# Make sure your token issuer also uses HS256 with the same secret

# 3. Re-generate the token using the correct secret
```

### 10d. No Authentication Header Provided

**Symptom**: Request returns `{ "error": "AUTH_FAILED", "message": "Authentication required. Provide X-API-Key header or Authorization: Bearer token." }`.

**Cause**: The request has no `X-API-Key` header and no `Authorization: Bearer` header, and authentication is enabled (`API_KEYS` or `JWT_SECRET` is set).

**Solution**:

```bash
# Option A: Use API Key
curl -H "X-API-Key: your-key" http://localhost:3000/health

# Option B: Use JWT Bearer token
curl -H "Authorization: Bearer your.jwt.token" http://localhost:3000/health

# Option C: Disable auth for development (NOT for production)
# Remove all values from API_KEYS and JWT_SECRET in .env:
API_KEYS=
JWT_SECRET=
```

### 10e. Insufficient Permissions (Forbidden)

**Symptom**: Request returns HTTP 403 with `{ "error": "FORBIDDEN", "message": "Admin role required for this operation" }`.

**Cause**: You are using a `user` role key to perform an admin-only action (workflow CRUD).

**Solution**:

```bash
# 1. Check which role your key has
grep API_KEYS .env
# Example: sk-user-key:user,sk-admin-key:admin

# 2. Use the admin key for workflow management
curl -H "X-API-Key: sk-admin-key" -X POST http://localhost:3000/workflows ...

# 3. User role can: create jobs, read own jobs, view health/capabilities
#    Admin role can: everything the user can + workflow CRUD + view all jobs
```

---

## Quick Diagnostic Commands

```bash
# Gateway health
curl -s http://localhost:3000/health | jq .

# ComfyUI direct connectivity
curl -s http://127.0.0.1:8188/ | head -5

# Queue status
curl -s "http://localhost:3000/jobs?status=queued" -H "X-API-Key: KEY" | jq '.count'
curl -s "http://localhost:3000/jobs?status=running" -H "X-API-Key: KEY" | jq '.count'

# GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader

# Redis connectivity
redis-cli -u "$REDIS_URL" ping

# SQLite integrity
sqlite3 ./data/gateway.db "PRAGMA integrity_check;"

# Logs (if using pino-pretty)
npm run dev 2>&1 | npx pino-pretty

# Check all configured environment variables
grep -v '^#' .env | grep -v '^$'
```
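
The quick diagnostic commands can also be wrapped in a small pass/fail report. The following is a minimal sketch, not a script shipped with the gateway: the `check` and `run_all` function names, the `GATEWAY_URL` and `SQLITE_DB` variables, and the output format are all hypothetical, and the default addresses are the ones used throughout this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the gateway): turns the
# diagnostic commands from this guide into a pass/fail report.
# GATEWAY_URL and SQLITE_DB are illustrative names; the defaults mirror the
# addresses and paths used in the examples above.

GATEWAY_URL="${GATEWAY_URL:-http://localhost:3000}"
COMFYUI_URL="${COMFYUI_URL:-http://127.0.0.1:8188}"
SQLITE_DB="${SQLITE_DB:-./data/gateway.db}"

# check LABEL COMMAND [ARGS...]
# Runs COMMAND with its output discarded. Prints "[ OK ] LABEL" when the
# command exits with status 0, otherwise prints "[FAIL] LABEL" and returns 1.
check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf '[ OK ] %s\n' "$label"
  else
    printf '[FAIL] %s\n' "$label"
    return 1
  fi
}

# run_all
# Executes every check and returns non-zero if at least one of them failed.
run_all() {
  local failed=0
  # -f makes curl exit non-zero on HTTP errors; --max-time bounds each probe
  check "gateway /health responds" curl -fsS --max-time 5 "$GATEWAY_URL/health" || failed=1
  check "ComfyUI responds" curl -fsS --max-time 5 "$COMFYUI_URL/" || failed=1
  # Redis is optional: only probe it when REDIS_URL is set
  if [ -n "${REDIS_URL:-}" ]; then
    check "Redis answers PING" redis-cli -u "$REDIS_URL" ping || failed=1
  fi
  # SQLite is only probed when the database file exists;
  # integrity_check prints the single word "ok" for a healthy database
  if [ -f "$SQLITE_DB" ]; then
    check "SQLite integrity_check reports ok" \
      sh -c '[ "$(sqlite3 "$1" "PRAGMA integrity_check;")" = "ok" ]' _ "$SQLITE_DB" || failed=1
  fi
  return "$failed"
}
```

To use it, source the file and call `run_all`, or append a `run_all` line to the end of the script. The exit status is non-zero when any check fails, which makes the sketch usable as a container health probe or a pre-deployment gate.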