# ComfyUI Gateway -- Troubleshooting Guide
Comprehensive troubleshooting reference for diagnosing and resolving issues with the
ComfyUI Gateway. Every section follows the **Symptom -> Cause -> Solution** format
with concrete commands you can run immediately.
---
## Table of Contents
1. [ComfyUI Not Reachable](#1-comfyui-not-reachable)
2. [OOM (Out of Memory) Errors](#2-oom-out-of-memory-errors)
3. [Slow Generation](#3-slow-generation)
4. [Webhook Failures](#4-webhook-failures)
5. [Redis Connection Issues](#5-redis-connection-issues)
6. [Storage Errors](#6-storage-errors)
7. [Database Issues](#7-database-issues)
8. [Job Stuck in "running"](#8-job-stuck-in-running)
9. [Rate Limiting Issues](#9-rate-limiting-issues)
10. [Authentication Problems](#10-authentication-problems)
---
## 1. ComfyUI Not Reachable
The gateway returns `COMFYUI_UNREACHABLE` and the `/health` endpoint shows
`comfyui.reachable: false`.
### 1a. Wrong COMFYUI_URL
**Symptom**: Gateway starts fine but every job fails with `COMFYUI_UNREACHABLE`.
The health endpoint returns `{ ok: false, comfyui: { reachable: false } }`.
**Cause**: The `COMFYUI_URL` in `.env` does not point to a running ComfyUI instance.
**Solution**:
```bash
# 1. Verify what you have configured
grep COMFYUI_URL .env
# 2. Test connectivity from the gateway host
curl -s http://127.0.0.1:8188/
# Expected: HTML page or JSON from ComfyUI
# 3. If ComfyUI is on a different port or host, update .env
# Example: COMFYUI_URL=http://192.168.1.50:8188
# 4. Restart the gateway after changing .env
npm run dev
```
### 1b. Firewall Blocking the Port
**Symptom**: `curl` to the ComfyUI URL times out or returns `Connection refused`,
but ComfyUI is confirmed running on that machine.
**Cause**: A host firewall (Windows Defender, iptables, ufw) is blocking the port.
**Solution**:
```bash
# Linux (ufw)
sudo ufw allow 8188/tcp
sudo ufw reload
# Linux (iptables); -I inserts at the top so the rule is evaluated before any existing DROP/REJECT rule
sudo iptables -I INPUT -p tcp --dport 8188 -j ACCEPT
# Windows (PowerShell, run as Admin)
New-NetFirewallRule -DisplayName "ComfyUI" -Direction Inbound -LocalPort 8188 -Protocol TCP -Action Allow
# Verify the port is listening
# Linux
ss -tlnp | grep 8188
# Windows
netstat -an | findstr 8188
```
### 1c. Docker Networking
**Symptom**: Gateway running inside Docker cannot reach ComfyUI on `127.0.0.1:8188`.
**Cause**: `127.0.0.1` inside a Docker container refers to the container itself,
not the host machine.
**Solution**:
```bash
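# Option A: Use Docker's special host DNS name
# (built into Docker Desktop; on Linux with Docker Engine 20.10+, also pass
#  --add-host=host.docker.internal:host-gateway to `docker run`)
COMFYUI_URL=http://host.docker.internal:8188
# Option B: Use the host network mode (Linux; 127.0.0.1 then refers to the host)
docker run --network host comfyui-gateway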
# Option C: Put both containers on the same Docker network
docker network create comfy-net
docker run --name comfyui --network comfy-net ...
docker run --name gateway --network comfy-net -e COMFYUI_URL=http://comfyui:8188 ...
# Verify from inside the gateway container
docker exec -it gateway sh -c "wget -qO- http://comfyui:8188/ || echo FAIL"
```
### 1d. WSL2 Networking
**Symptom**: Gateway running on Windows/WSL2 cannot reach ComfyUI running on the
other side (host vs WSL or vice-versa).
**Cause**: WSL2 uses a virtual network adapter. The WSL2 guest and Windows host
have different IP addresses.
**Solution**:
```bash
# From WSL2 (default NAT networking), get the Windows host IP from the default route
ip route show default | awk '{print $3}'
# Example output: 172.25.192.1
# Set COMFYUI_URL to that IP
COMFYUI_URL=http://172.25.192.1:8188
# Alternatively, if ComfyUI runs inside WSL2 and the gateway is on Windows:
# Find WSL2 IP
wsl hostname -I
# Example output: 172.25.198.5
# Set: COMFYUI_URL=http://172.25.198.5:8188
# Make sure ComfyUI is listening on 0.0.0.0, not just 127.0.0.1
# Launch ComfyUI with: python main.py --listen 0.0.0.0
```
### 1e. ComfyUI Not Started or Crashed
**Symptom**: Port is not listening at all.
**Cause**: ComfyUI process is not running.
**Solution**:
```bash
# Check if the process is running
# Linux
pgrep -af "main.py"
# Windows
tasklist | findstr python
# Start ComfyUI, keeping a copy of the output so startup errors can be reviewed
cd /path/to/ComfyUI
python main.py --listen 0.0.0.0 --port 8188 2>&1 | tee comfyui.log
# In another terminal, inspect the most recent log lines
tail -50 /path/to/ComfyUI/comfyui.log
# Verify it is accepting connections
curl -s http://127.0.0.1:8188/ && echo "OK" || echo "NOT REACHABLE"
```
---
## 2. OOM (Out of Memory) Errors
The gateway classifies these as `COMFYUI_OOM` with `retryable: false`.
### 2a. Resolution or Batch Size Too Large
**Symptom**: Job fails with error containing "CUDA out of memory",
"allocator backend out of memory", or "failed to allocate".
**Cause**: The requested image dimensions or batch size exceeds available VRAM.
**Solution**:
```bash
# 1. Reduce resolution in your job request
# Instead of 2048x2048, try 1024x1024 or 768x768
curl -X POST http://localhost:3000/jobs \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"workflowId": "sdxl_realism_v1",
"inputs": {
"prompt": "a mountain landscape",
"width": 1024,
"height": 1024
}
}'
# 2. Reduce batch size to 1
# Set in your job inputs: "batch_size": 1
# 3. Lower the gateway-level limits in .env
MAX_IMAGE_SIZE=1024
MAX_BATCH_SIZE=2
```
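Clients can also shrink a request before submitting it. The following is a minimal sketch, not the gateway's own logic: `clampJobInputs` is a hypothetical helper, it assumes `MAX_IMAGE_SIZE` bounds each side independently, and it rounds sides down to a multiple of 8 because latent-diffusion models generally expect that.

```javascript
// Hypothetical client-side helper: shrink a job's inputs to fit gateway limits.
// Assumes width and height are present, MAX_IMAGE_SIZE applies to each side,
// and MAX_BATCH_SIZE applies to batch_size.
function clampJobInputs(inputs, limits = { maxImageSize: 1024, maxBatchSize: 2 }) {
  // Cap the side, round down to a multiple of 8, and never go below 64.
  const clampSide = (value) =>
    Math.max(64, Math.floor(Math.min(value, limits.maxImageSize) / 8) * 8);
  return {
    ...inputs,
    width: clampSide(inputs.width),
    height: clampSide(inputs.height),
    batch_size: Math.min(inputs.batch_size ?? 1, limits.maxBatchSize),
  };
}

console.log(
  clampJobInputs({ prompt: 'a mountain landscape', width: 2048, height: 1500, batch_size: 4 })
);
```

Note that this clamps each side separately, so a non-square request changes aspect ratio; scale both sides by the same factor if the ratio matters.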
### 2b. Too Many Steps
**Symptom**: OOM occurs mid-generation, not immediately at submission.
**Cause**: The sampler accumulates intermediate tensors over many steps.
**Solution**:
```bash
# Reduce steps in the job inputs
# Instead of 50 steps, try 20-30
curl -X POST http://localhost:3000/jobs \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"workflowId": "sdxl_realism_v1",
"inputs": {
"prompt": "a portrait photo",
"steps": 20,
"width": 1024,
"height": 1024
}
}'
```
### 2c. Model Quantization
**Symptom**: Even at low resolution, OOM errors occur because the model is too
large for the GPU (common on 8 GB VRAM cards with SDXL).
**Cause**: Full-precision (fp32) or half-precision (fp16) model weights exceed
available VRAM.
**Solution**:
```bash
# In ComfyUI, use fp8 or quantized checkpoints
# Update your workflow template to use a quantized model:
# e.g., "ckpt_name": "sdxl_base_1.0_fp8.safetensors"
# Or add --fp8_e4m3fn-unet flag when starting ComfyUI
python main.py --listen 0.0.0.0 --fp8_e4m3fn-unet
# Monitor VRAM usage
nvidia-smi -l 2
```
### 2d. VAE Tiling
**Symptom**: OOM happens during the VAE decode step (after sampling completes).
**Cause**: The VAE decoder processes the entire latent at once, which can be very
memory-intensive at high resolutions.
**Solution**:
Enable VAE tiling in your ComfyUI workflow by using a "VAEDecodeTiled" node
instead of "VAEDecode". A tile size of 512 is a good default.
In the workflow JSON template:
```json
{
  "10": {
    "class_type": "VAEDecodeTiled",
    "inputs": {
      "samples": ["3", 0],
      "vae": ["4", 2],
      "tile_size": 512
    }
  }
}
```
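If you maintain many workflow templates, the swap can be scripted. This is a rough sketch under assumptions: `useTiledVaeDecode` is a hypothetical helper operating on API-format workflow JSON (node IDs mapped to `class_type` and `inputs`), and some ComfyUI versions may expect additional inputs on `VAEDecodeTiled` beyond `tile_size`.

```javascript
// Return a copy of an API-format workflow in which every VAEDecode node
// is replaced by VAEDecodeTiled. The input workflow is not modified.
function useTiledVaeDecode(workflow, tileSize = 512) {
  const patched = {};
  for (const [nodeId, node] of Object.entries(workflow)) {
    if (node.class_type === 'VAEDecode') {
      patched[nodeId] = {
        ...node,
        class_type: 'VAEDecodeTiled',
        inputs: { ...node.inputs, tile_size: tileSize },
      };
    } else {
      patched[nodeId] = node; // untouched nodes are shared, not cloned
    }
  }
  return patched;
}

const workflow = {
  '3': { class_type: 'KSampler', inputs: {} },
  '10': { class_type: 'VAEDecode', inputs: { samples: ['3', 0], vae: ['4', 2] } },
};
console.log(JSON.stringify(useTiledVaeDecode(workflow)['10'], null, 2));
```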
---
## 3. Slow Generation
### 3a. GPU Not Being Utilized
**Symptom**: Jobs complete but take much longer than expected. GPU utilization
stays near 0%.
**Cause**: ComfyUI is falling back to CPU inference, or the wrong GPU is selected.
**Solution**:
```bash
# 1. Check GPU utilization during a job
nvidia-smi -l 1
# Look for "GPU-Util" column -- should be 80-100% during sampling
# 2. Verify CUDA is available in ComfyUI
# Check the ComfyUI startup logs for a line naming a CUDA device (e.g. "Device: cuda:0 ...")
# 3. Force GPU selection (multi-GPU systems)
CUDA_VISIBLE_DEVICES=0 python main.py --listen 0.0.0.0
# 4. Verify PyTorch sees the GPU
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
```
### 3b. Model Loading on Every Job
**Symptom**: First job is slow, subsequent jobs with the same workflow are faster,
but switching workflows causes long delays.
**Cause**: ComfyUI loads the model from disk each time a different checkpoint is
requested. This can take 10-30 seconds per model load.
**Solution**:
```bash
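# 1. Keep more results cached between prompts
# Recent ComfyUI versions accept --cache-lru N (an LRU cache of node results,
# at the cost of more RAM/VRAM). Confirm which cache flags your build supports
# with `python main.py --help` before relying on this:
python main.py --listen 0.0.0.0 --cache-lru 3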
# 2. Use the same checkpoint across workflows when possible
# Standardize on one checkpoint (e.g., sdxl_base_1.0.safetensors)
# 3. Place models on an SSD, not an HDD
# Move ComfyUI/models/ to an NVMe drive for faster load times
```
### 3c. Queue Depth / Concurrency
**Symptom**: Jobs are queued for a long time before starting.
The job stays in `status: "queued"` for minutes.
**Cause**: The worker concurrency is set to 1 (default) and multiple jobs are
queued, or the single slot is occupied by a long-running job.
**Solution**:
```bash
# 1. Check current queue state
curl -s "http://localhost:3000/jobs?status=queued" -H "X-API-Key: your-key" | jq '.count'
curl -s "http://localhost:3000/jobs?status=running" -H "X-API-Key: your-key" | jq '.count'
# 2. Increase concurrency if your GPU can handle it (multi-batch)
# Edit .env:
MAX_CONCURRENCY=2
# WARNING: Only increase if you have enough VRAM for parallel jobs.
# Two concurrent 1024x1024 SDXL jobs need ~20+ GB VRAM.
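# 3. For multi-GPU setups, run one ComfyUI instance per GPU and one worker per
# instance. CUDA_VISIBLE_DEVICES only affects the ComfyUI (Python) process:
# Terminal 1: CUDA_VISIBLE_DEVICES=0 python main.py --listen 0.0.0.0 --port 8188
# Terminal 2: CUDA_VISIBLE_DEVICES=1 python main.py --listen 0.0.0.0 --port 8189
# Terminal 3: COMFYUI_URL=http://127.0.0.1:8188 npm run start:worker
# Terminal 4: COMFYUI_URL=http://127.0.0.1:8189 npm run start:worker
# (assumes the worker honors COMFYUI_URL from the environment)
# All workers connect to the same Redis queue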
```
### 3d. ComfyUI Startup Time
**Symptom**: The very first job after starting ComfyUI takes 30-60 seconds even
for a simple generation.
**Cause**: ComfyUI performs initialization (loading nodes, compiling, warming up
CUDA) on the first prompt.
**Solution**:
```bash
# 1. Send a warm-up job immediately after starting ComfyUI
# This is a tiny 64x64 generation that forces initialization
curl -X POST http://localhost:3000/jobs \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"workflowId": "sdxl_realism_v1",
"inputs": {
"prompt": "test",
"width": 64,
"height": 64,
"steps": 1
}
}'
# 2. Increase the gateway timeout to account for cold starts
COMFYUI_TIMEOUT_MS=600000
```
---
## 4. Webhook Failures
Webhook errors appear in logs as `WEBHOOK_DELIVERY_FAILED`.
### 4a. DNS Resolution Failure
**Symptom**: Webhook fails with "getaddrinfo ENOTFOUND" or "DNS lookup failed".
**Cause**: The callback URL hostname cannot be resolved.
**Solution**:
```bash
# 1. Test DNS resolution from the gateway host
nslookup your-webhook-domain.com
dig your-webhook-domain.com
# 2. If using a local hostname (e.g., within Docker), make sure it is resolvable
# Add to /etc/hosts if needed:
echo "192.168.1.50 my-webhook-server" | sudo tee -a /etc/hosts
# 3. Verify the callback URL is correct in your job request
curl -X POST http://localhost:3000/jobs \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"workflowId": "sdxl_realism_v1",
"inputs": { "prompt": "test" },
"callbackUrl": "https://your-valid-domain.com/webhook"
}'
```
### 4b. SSL Certificate Errors
**Symptom**: Webhook fails with "self signed certificate", "CERT_HAS_EXPIRED",
or "unable to verify the first certificate".
**Cause**: The webhook receiver uses an invalid, expired, or self-signed SSL certificate.
**Solution**:
```bash
# 1. Test the certificate manually
openssl s_client -connect your-webhook-domain.com:443 -servername your-webhook-domain.com < /dev/null 2>&1 | head -20
# 2. Check expiration
echo | openssl s_client -connect your-webhook-domain.com:443 -servername your-webhook-domain.com 2>/dev/null | openssl x509 -noout -dates
# 3. For development with self-signed certs, set NODE_TLS_REJECT_UNAUTHORIZED
# WARNING: Do NOT use this in production
NODE_TLS_REJECT_UNAUTHORIZED=0 npm run dev
# 4. For production, fix the certificate (use Let's Encrypt or a valid CA)
```
### 4c. Webhook Timeout
**Symptom**: Webhook logs show "AbortError" or "Webhook POST timed out".
**Cause**: The webhook receiver takes longer than 10 seconds to respond.
The gateway has a hardcoded 10-second timeout per webhook attempt with
3 retries and exponential backoff.
**Solution**:
```bash
# 1. Ensure your webhook receiver responds quickly
# The receiver should return 200 immediately and process asynchronously
# BAD: app.post("/webhook", async (req, res) => { await longProcess(); res.send("ok"); })
# GOOD: app.post("/webhook", (req, res) => { res.send("ok"); enqueueWork(req.body); })
# 2. Test receiver response time
curl -s -o /dev/null -w "%{time_total}s\n" -X POST https://your-webhook.com/callback \
-H "Content-Type: application/json" -d '{"test": true}'
# Should be < 2 seconds
```
### 4d. Domain Not in Allowlist
**Symptom**: Job creation fails with `Callback domain "example.com" is not in the
allowed domains list`.
**Cause**: `WEBHOOK_ALLOWED_DOMAINS` is configured and does not include the
callback URL's domain.
**Solution**:
```bash
# 1. Check current setting
grep WEBHOOK_ALLOWED_DOMAINS .env
# 2. Add the domain (comma-separated list)
WEBHOOK_ALLOWED_DOMAINS=your-app.com,n8n.your-domain.com,*.internal.company.com
# 3. Or allow all domains (less secure, suitable for development)
WEBHOOK_ALLOWED_DOMAINS=*
# 4. Restart the gateway
npm run dev
```
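To reason about which callback URLs an allowlist would accept, a small matcher helps. The sketch below is illustrative only: `isCallbackAllowed` is a hypothetical function, and the matching rules (`*` allows any host, `*.example.com` allows subdomains but not the bare domain, everything else must match exactly) are assumptions that may differ from the gateway's implementation. It does not model the case where `WEBHOOK_ALLOWED_DOMAINS` is unset.

```javascript
// Hypothetical matcher for a comma-separated allowlist such as
// "your-app.com,*.internal.company.com".
function isCallbackAllowed(callbackUrl, allowedDomains) {
  const host = new URL(callbackUrl).hostname.toLowerCase();
  const patterns = allowedDomains
    .split(',')
    .map((pattern) => pattern.trim().toLowerCase())
    .filter(Boolean);
  return patterns.some((pattern) => {
    if (pattern === '*') return true;
    // "*.example.com" -> host must end with ".example.com"
    if (pattern.startsWith('*.')) return host.endsWith(pattern.slice(1));
    return host === pattern;
  });
}

const allowlist = 'your-app.com,n8n.your-domain.com,*.internal.company.com';
console.log(isCallbackAllowed('https://hooks.internal.company.com/cb', allowlist));
console.log(isCallbackAllowed('https://example.com/webhook', allowlist));
```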
### 4e. HMAC Signature Mismatch
**Symptom**: Your webhook receiver receives the POST but HMAC validation fails
on your end.
**Cause**: The `WEBHOOK_SECRET` configured in the gateway does not match the secret
your receiver uses to validate signatures, or the signature computation differs.
**Solution**:
```bash
# 1. Verify the WEBHOOK_SECRET matches on both sides
grep WEBHOOK_SECRET .env
# 2. The gateway sends: X-Signature: sha256=<hex>
# Computed as: HMAC-SHA256(secret, raw_body_string)
# Verify in Node.js:
node -e "
const crypto = require('crypto');
const secret = 'your-webhook-secret';
const body = '{\"jobId\":\"test\",\"status\":\"succeeded\"}';
const sig = crypto.createHmac('sha256', secret).update(body, 'utf8').digest('hex');
console.log('Expected header: sha256=' + sig);
"
# 3. Common mistakes:
# - Parsing the body before computing HMAC (must use raw string)
# - Using different encodings (gateway uses utf8)
# - Comparing against an uppercase hex digest (the gateway's hex is lowercase)
```
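On the receiver side, validation might look like the sketch below. `verifyWebhookSignature` is a hypothetical helper; it assumes the `X-Signature: sha256=<hex>` format described above and that your framework gives you the raw, unparsed body. Comparing decoded bytes with `crypto.timingSafeEqual` sidesteps both the hex-case problem and timing leaks.

```javascript
const crypto = require('crypto');

// Verify an "X-Signature: sha256=<hex>" header against the raw request body.
// rawBody must be the exact string or Buffer received, not re-serialized JSON.
function verifyWebhookSignature(rawBody, signatureHeader, secret) {
  if (typeof signatureHeader !== 'string' || !signatureHeader.startsWith('sha256=')) {
    return false;
  }
  const received = Buffer.from(signatureHeader.slice('sha256='.length), 'hex');
  const expected = crypto.createHmac('sha256', secret).update(rawBody, 'utf8').digest();
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && crypto.timingSafeEqual(received, expected);
}

const exampleBody = '{"jobId":"test","status":"succeeded"}';
const exampleSig =
  'sha256=' + crypto.createHmac('sha256', 'your-webhook-secret').update(exampleBody).digest('hex');
console.log(verifyWebhookSignature(exampleBody, exampleSig, 'your-webhook-secret'));
```

In Express, for instance, the raw body is only available if you capture it before JSON parsing (for example with `express.raw()` on the webhook route).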
---
## 5. Redis Connection Issues
### 5a. Cannot Connect to Redis
**Symptom**: Gateway crashes at startup with "Redis connection error" or
"ECONNREFUSED" targeting the Redis port.
**Cause**: Redis server is not running, or the `REDIS_URL` is wrong.
**Solution**:
```bash
# 1. Check if Redis is running
redis-cli ping
# Expected: PONG
# 2. Verify the URL format
# Correct formats:
# redis://localhost:6379
# redis://:yourpassword@redis-host:6379/0
# rediss://user:password@host:6380/0 (TLS)
# 3. Test connectivity
redis-cli -u "redis://localhost:6379" ping
# 4. If Redis is not needed, remove REDIS_URL to use in-memory queue
# Edit .env:
REDIS_URL=
# The gateway falls back to an in-memory queue automatically
```
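A quick way to sanity-check a `REDIS_URL` before starting the gateway is to parse it with Node's built-in `URL` class. This is a sketch, not the gateway's parser: `describeRedisUrl` is a hypothetical helper, and the defaults (port 6379, database 0) are the conventional Redis defaults rather than values confirmed from the gateway source.

```javascript
// Parse a REDIS_URL and report the parts a client would use.
// Returns null for an empty value, which this guide says selects the in-memory queue.
function describeRedisUrl(redisUrl) {
  if (!redisUrl) return null;
  const url = new URL(redisUrl); // throws TypeError on malformed input
  if (url.protocol !== 'redis:' && url.protocol !== 'rediss:') {
    throw new Error(`Unsupported scheme "${url.protocol}" (expected redis: or rediss:)`);
  }
  return {
    tls: url.protocol === 'rediss:',
    host: url.hostname,
    port: Number(url.port || 6379),
    hasPassword: url.password !== '',
    db: Number(url.pathname.slice(1) || 0),
  };
}

console.log(describeRedisUrl('redis://:yourpassword@redis-host:6379/0'));
```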
### 5b. Redis Authentication Failure
**Symptom**: Error message contains "NOAUTH Authentication required" or
"ERR invalid password".
**Cause**: Redis requires a password but `REDIS_URL` does not include one,
or the password is wrong.
**Solution**:
```bash
# 1. Include the password in the URL
REDIS_URL=redis://:your_redis_password@localhost:6379/0
# 2. Test with redis-cli
redis-cli -a "your_redis_password" ping
# 3. Check Redis config for requirepass
redis-cli CONFIG GET requirepass
```
### 5c. Fallback to In-Memory Queue
**Symptom**: Logs show "No Redis URL configured, using in-memory queue" and
you expected BullMQ.
**Cause**: `REDIS_URL` is empty or not set in `.env`.
**Solution**:
```bash
# 1. Set REDIS_URL in .env
REDIS_URL=redis://localhost:6379
# 2. Verify Redis is running
redis-cli ping
# 3. Restart the gateway
npm run dev
# 4. Confirm in logs: should show "Redis URL configured, using BullMQ worker"
```
> **Note**: The in-memory queue is fine for single-instance development deployments.
> For production with multiple workers or durability requirements, use Redis + BullMQ.
---
## 6. Storage Errors
### 6a. Local Disk Permission Denied
**Symptom**: Job fails at the output storage step with "EACCES: permission denied"
or `STORAGE_READ_ERROR`.
**Cause**: The gateway process does not have write permissions to `STORAGE_LOCAL_PATH`.
**Solution**:
```bash
# 1. Check the configured path
grep STORAGE_LOCAL_PATH .env
# Default: ./data/outputs
# 2. Ensure the directory exists and is writable
mkdir -p ./data/outputs
chmod 755 ./data/outputs
# 3. Check ownership
ls -la ./data/
# 4. If running as a different user (e.g., in Docker)
chown -R node:node ./data/outputs
# 5. For Docker, mount a volume with correct permissions
# docker run -v /host/path/outputs:/app/data/outputs ...
```
### 6b. S3 Credentials Invalid
**Symptom**: Job fails with `STORAGE_S3_PUT_ERROR` and the underlying error
mentions "InvalidAccessKeyId", "SignatureDoesNotMatch", or "AccessDenied".
**Cause**: The `S3_ACCESS_KEY` / `S3_SECRET_KEY` are wrong, expired, or the
IAM policy does not grant `s3:PutObject` permission.
**Solution**:
```bash
# 1. Verify credentials are set
grep S3_ACCESS_KEY .env
grep S3_SECRET_KEY .env
grep S3_BUCKET .env
# 2. Test with AWS CLI
aws s3 ls s3://your-bucket/ \
--endpoint-url http://your-minio:9000 \
--region us-east-1
# 3. Test a put operation
echo "test" > /tmp/test.txt
aws s3 cp /tmp/test.txt s3://your-bucket/test.txt \
--endpoint-url http://your-minio:9000
# 4. Minimum IAM policy for the gateway:
# {
# "Version": "2012-10-17",
# "Statement": [{
# "Effect": "Allow",
# "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListBucket"],
# "Resource": ["arn:aws:s3:::your-bucket", "arn:aws:s3:::your-bucket/*"]
# }]
# }
```
### 6c. MinIO Configuration
**Symptom**: S3 storage fails with "socket hang up", "ECONNREFUSED", or
"Bucket does not exist".
**Cause**: MinIO endpoint is wrong, the bucket has not been created, or
`forcePathStyle` is not enabled (handled automatically by the gateway).
**Solution**:
```bash
# 1. Verify MinIO is running
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9000/minio/health/live
# Expected: 200
# 2. Set the correct endpoint in .env
S3_ENDPOINT=http://localhost:9000
S3_BUCKET=comfyui-outputs
S3_ACCESS_KEY=minioadmin
S3_SECRET_KEY=minioadmin
S3_REGION=us-east-1
# 3. Create the bucket if it does not exist
# Using mc (MinIO Client)
mc alias set local http://localhost:9000 minioadmin minioadmin
mc mb local/comfyui-outputs
# Or using AWS CLI
aws s3 mb s3://comfyui-outputs --endpoint-url http://localhost:9000
```
---
## 7. Database Issues
### 7a. SQLite WAL Lock Errors
**Symptom**: Intermittent "SQLITE_BUSY" or "database is locked" errors under
concurrent load.
**Cause**: Multiple processes or threads are writing to the SQLite database
simultaneously. SQLite WAL mode supports concurrent readers but only one writer.
**Solution**:
```bash
# 1. The gateway already sets optimal pragmas:
# journal_mode = WAL
# synchronous = NORMAL
# busy_timeout = 5000 (5 seconds)
# 2. If running multiple gateway instances, switch to Postgres
DATABASE_URL=postgresql://user:password@localhost:5432/comfyui_gateway
# 3. If you must use SQLite with a single instance, increase busy timeout
# (requires code change or env override):
# The default 5000ms should be sufficient for most single-instance use cases
# 4. Check for stuck WAL files
ls -la ./data/gateway.db*
# You should see: gateway.db, gateway.db-wal, gateway.db-shm
# 5. If the database is corrupted, try recovery
sqlite3 ./data/gateway.db "PRAGMA integrity_check;"
# If it reports errors, back up and recreate:
cp ./data/gateway.db ./data/gateway.db.bak
sqlite3 ./data/gateway.db ".recover" | sqlite3 ./data/gateway_recovered.db
```
### 7b. Postgres Connection Pooling
**Symptom**: Errors like "too many clients already", "remaining connection slots
are reserved", or intermittent "Connection terminated unexpectedly".
**Cause**: The gateway opens too many connections to Postgres, exceeding
`max_connections`, or connections are not being properly returned to the pool.
**Solution**:
```bash
# 1. Check current connections in Postgres
psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'comfyui_gateway';"
# 2. Check max_connections setting
psql -c "SHOW max_connections;"
# 3. Use a connection pooler like PgBouncer
# Install PgBouncer and point DATABASE_URL to it
DATABASE_URL=postgresql://user:password@localhost:6432/comfyui_gateway
# 4. If running multiple gateway instances, ensure the total pool size
# across all instances does not exceed Postgres max_connections
```
### 7c. Database URL Format
**Symptom**: Gateway crashes at startup with "Invalid connection string" or
uses SQLite when you intended Postgres.
**Cause**: The `DATABASE_URL` format is wrong. The gateway checks if the URL
starts with `postgres://` or `postgresql://` to select the Postgres backend.
**Solution**:
```bash
# SQLite formats (all valid):
DATABASE_URL=./data/gateway.db
DATABASE_URL=/absolute/path/to/gateway.db
# Postgres formats (must start with postgres:// or postgresql://):
DATABASE_URL=postgresql://user:password@localhost:5432/comfyui_gateway
DATABASE_URL=postgres://user:password@host:5432/dbname?sslmode=require
```
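The prefix rule explains why a near-miss URL silently selects SQLite. A minimal sketch of that rule follows; `selectDatabaseBackend` is a hypothetical name, and treating the check as case-sensitive is an assumption.

```javascript
// Mirrors the rule stated above: a postgres:// or postgresql:// prefix selects
// Postgres; anything else is treated as a SQLite file path.
function selectDatabaseBackend(databaseUrl) {
  return /^postgres(ql)?:\/\//.test(databaseUrl) ? 'postgres' : 'sqlite';
}

for (const url of [
  'postgresql://user:password@localhost:5432/comfyui_gateway',
  'postgres:/user:password@localhost/db', // single slash: falls through to SQLite
  './data/gateway.db',
]) {
  console.log(url, '->', selectDatabaseBackend(url));
}
```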
---
## 8. Job Stuck in "running"
### 8a. ComfyUI Crashed During Execution
**Symptom**: A job shows `status: "running"` indefinitely. No progress updates.
The gateway health endpoint may show `comfyui.reachable: false`.
**Cause**: ComfyUI crashed (segfault, CUDA error, killed by OOM killer) while
processing the job, and the gateway's WebSocket connection was severed.
**Solution**:
```bash
# 1. Check job status (replace your-job-id with the ID returned at job creation)
curl -s "http://localhost:3000/jobs/your-job-id" -H "X-API-Key: your-key" | jq '.status'
# 2. Check if ComfyUI is still running
curl -s http://localhost:3000/health | jq '.comfyui.reachable'
# 3. If ComfyUI crashed, restart it
cd /path/to/ComfyUI
python main.py --listen 0.0.0.0
# 4. The stuck job will eventually time out (COMFYUI_TIMEOUT_MS, default 5 min)
# and be marked as failed with COMFYUI_TIMEOUT
# 5. To immediately cancel the stuck job (replace your-job-id)
curl -X POST "http://localhost:3000/jobs/your-job-id/cancel" \
-H "X-API-Key: your-key"
# 6. To reduce timeout for faster failure detection
COMFYUI_TIMEOUT_MS=120000
```
### 8b. WebSocket Disconnection
**Symptom**: Job stays "running" but ComfyUI is actually done. The output
exists in ComfyUI's history.
**Cause**: The WebSocket connection dropped mid-execution, and the polling
fallback failed to pick up the result.
**Solution**:
```bash
# 1. Check ComfyUI history directly
curl -s http://127.0.0.1:8188/history | jq 'keys | length'
# 2. The gateway automatically falls back to HTTP polling if WebSocket fails.
# If polling also fails, the job times out.
# 3. Restart the gateway to reset connections
npm run dev
# 4. Check network stability between gateway and ComfyUI
ping -c 10 comfyui-host   # replace comfyui-host with your ComfyUI hostname or IP
```
### 8c. Restart Recovery
**Symptom**: After restarting the gateway, jobs that were "running" remain
in that state permanently.
**Cause**: The in-memory queue loses track of running jobs when the process
restarts. There is no automatic recovery for in-memory jobs.
**Solution**:
```bash
# 1. For production, use Redis (BullMQ) for durable job queues
REDIS_URL=redis://localhost:6379
# 2. Manually fail stuck jobs via the database
sqlite3 ./data/gateway.db \
"UPDATE jobs SET status='failed', errorJson='{\"code\":\"GATEWAY_RESTART\",\"message\":\"Job interrupted by gateway restart\"}', completedAt=datetime('now') WHERE status='running';"
# 3. Verify
sqlite3 ./data/gateway.db "SELECT id, status FROM jobs WHERE status='running';"
```
---
## 9. Rate Limiting Issues
### 9a. Identifying You Are Being Rate Limited
**Symptom**: API returns HTTP 429 with body `{ "error": "RATE_LIMITED" }` and
a `Retry-After` header.
**Cause**: You exceeded `RATE_LIMIT_MAX` requests within the `RATE_LIMIT_WINDOW_MS`
window. Limits are applied per API key or per IP.
**Solution**:
```bash
# 1. Check the response headers
curl -s -D - -o /dev/null http://localhost:3000/health -H "X-API-Key: your-key" | grep -iE "x-ratelimit|retry-after"
# X-RateLimit-Limit: 100
# X-RateLimit-Remaining: 0
# Retry-After: 42
# 2. Wait for the Retry-After period, then retry
# 3. Implement exponential backoff in your client
```
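The wait-then-retry logic can be sketched as a small delay calculator. `retryDelayMs` and its `baseMs`/`maxMs` defaults are illustrative assumptions, not gateway settings; the function prefers a numeric `Retry-After` value (in seconds) and otherwise falls back to capped exponential backoff.

```javascript
// Compute how long to wait before retry number `attempt` (0-based).
// A numeric Retry-After header (seconds) takes precedence over the backoff curve.
function retryDelayMs(attempt, retryAfterHeader, { baseMs = 500, maxMs = 60000 } = {}) {
  if (retryAfterHeader != null && retryAfterHeader !== '') {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
    // Non-numeric values (e.g. an HTTP date) are ignored in this sketch.
  }
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

console.log([0, 1, 2, 3].map((attempt) => retryDelayMs(attempt)));
console.log(retryDelayMs(0, '42'));
```

A client loop would call this after each 429 and sleep for the returned duration. Adding a small random jitter to the backoff branch helps keep many clients from retrying at the same instant.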
### 9b. Adjusting Rate Limits
**Symptom**: Legitimate usage is being throttled.
**Cause**: Default limits (100 requests/minute) are too low for your workload.
**Solution**:
```bash
# 1. Increase the limit in .env
RATE_LIMIT_MAX=500
RATE_LIMIT_WINDOW_MS=60000
# 2. For burst workloads, widen the window
RATE_LIMIT_MAX=1000
RATE_LIMIT_WINDOW_MS=300000
# 3. Restart the gateway
npm run dev
# 4. Note: Rate limits are per API key (if authenticated) or per IP.
# Different API keys have independent counters.
```
### 9c. Rate Limit Per API Key vs Per IP
**Symptom**: Different clients sharing the same IP are interfering with each
other's rate limits.
**Cause**: Without API keys, all requests from the same IP share a single
rate-limit bucket.
**Solution**:
```bash
# 1. Assign unique API keys to each client
API_KEYS=client1-key:user,client2-key:user,admin-key:admin
# 2. Each client uses its own X-API-Key header
# Client 1: -H "X-API-Key: client1-key"
# Client 2: -H "X-API-Key: client2-key"
# 3. Each key gets its own independent rate-limit counter
```
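To see why shared IPs interfere and separate keys do not, consider a toy fixed-window limiter. This sketch is purely illustrative: `createRateLimiter` is hypothetical, and the gateway's real algorithm (fixed window, sliding window, or otherwise) is not specified in this guide.

```javascript
// Minimal fixed-window limiter keyed by API key, falling back to client IP.
// `now` is injectable so the behavior can be exercised without real waiting.
function createRateLimiter({ max, windowMs, now = Date.now }) {
  const buckets = new Map(); // bucket key -> { windowStart, count }
  return function check({ apiKey, ip }) {
    const bucketKey = apiKey ? `key:${apiKey}` : `ip:${ip}`;
    const time = now();
    let bucket = buckets.get(bucketKey);
    if (!bucket || time - bucket.windowStart >= windowMs) {
      bucket = { windowStart: time, count: 0 };
      buckets.set(bucketKey, bucket);
    }
    bucket.count += 1;
    const allowed = bucket.count <= max;
    return {
      allowed,
      remaining: Math.max(0, max - bucket.count),
      retryAfterSec: allowed ? 0 : Math.ceil((bucket.windowStart + windowMs - time) / 1000),
    };
  };
}

const demoCheck = createRateLimiter({ max: 2, windowMs: 60000 });
// Two anonymous clients behind one IP drain the same bucket:
console.log(demoCheck({ ip: '203.0.113.7' }).allowed);
console.log(demoCheck({ ip: '203.0.113.7' }).allowed);
console.log(demoCheck({ ip: '203.0.113.7' }).allowed);
// A client with its own key is counted separately:
console.log(demoCheck({ apiKey: 'client1-key', ip: '203.0.113.7' }).allowed);
```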
---
## 10. Authentication Problems
### 10a. API Key Not Accepted
**Symptom**: Every request returns HTTP 401 with `{ "error": "AUTH_FAILED",
"message": "Invalid API key" }`.
**Cause**: The `X-API-Key` header value does not match any entry in `API_KEYS`.
**Solution**:
```bash
# 1. Check configured keys
grep API_KEYS .env
# Format: key1:admin,key2:user
# 2. Ensure your request uses the exact key (no extra whitespace)
curl -H "X-API-Key: mykey123" http://localhost:3000/health
# 3. Keys are case-sensitive and matched exactly
# 4. If API_KEYS is empty, authentication is DISABLED (development mode)
# All requests are treated as admin. Set keys for production:
API_KEYS=sk-prod-abc123:admin,sk-user-xyz789:user
```
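The `key:role` format is easy to get subtly wrong (stray whitespace, a missing role). The sketch below parses it strictly; `parseApiKeys` is a hypothetical helper, and both the whitespace trimming and the restriction to `admin`/`user` roles are assumptions based on the examples in this guide.

```javascript
// Parse API_KEYS ("key1:admin,key2:user") into a Map of key -> role.
// An empty value yields an empty Map, which this guide says disables authentication.
function parseApiKeys(raw = '') {
  const keys = new Map();
  raw.split(',').forEach((entry, index) => {
    const trimmed = entry.trim();
    if (!trimmed) return; // tolerate empty segments such as a trailing comma
    const separator = trimmed.lastIndexOf(':');
    if (separator <= 0) {
      // Report the position, not the value, to keep keys out of logs.
      throw new Error(`API_KEYS entry ${index + 1} is not in key:role form`);
    }
    const role = trimmed.slice(separator + 1);
    if (role !== 'admin' && role !== 'user') {
      throw new Error(`API_KEYS entry ${index + 1} has unknown role "${role}"`);
    }
    keys.set(trimmed.slice(0, separator), role);
  });
  return keys;
}

const parsedKeys = parseApiKeys('sk-prod-abc123:admin,sk-user-xyz789:user');
console.log(parsedKeys.get('sk-prod-abc123'), parsedKeys.get('SK-PROD-ABC123'));
```

The second lookup illustrates the exact, case-sensitive matching described above: a key that differs only in case is not found.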
### 10b. JWT Token Expired
**Symptom**: Request returns `{ "error": "AUTH_FAILED", "message": "JWT token
has expired" }`.
**Cause**: The JWT `exp` claim is in the past.
**Solution**:
```bash
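# 1. Decode the JWT payload to check expiration (without verification)
# JWT segments are unpadded base64url, which `base64 -d` may reject, so decode with Node
node -e "console.log(JSON.parse(Buffer.from(process.argv[1].split('.')[1], 'base64url').toString('utf8')).exp)" "<your-token>"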
# 2. Compare with current time
date +%s
# 3. Generate a new token with a longer TTL
# Example using Node.js:
node -e "
const crypto = require('crypto');
const secret = 'your-jwt-secret';
const header = Buffer.from(JSON.stringify({alg:'HS256',typ:'JWT'})).toString('base64url');
const payload = Buffer.from(JSON.stringify({
sub: 'user-1',
role: 'admin',
iat: Math.floor(Date.now()/1000),
exp: Math.floor(Date.now()/1000) + 86400 // 24 hours
})).toString('base64url');
const sig = crypto.createHmac('sha256', secret).update(header+'.'+payload).digest('base64url');
console.log(header+'.'+payload+'.'+sig);
"
```
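For repeated debugging, the expiry check can be wrapped in a small function. This is a sketch for inspection only: `inspectJwtExpiry` is a hypothetical helper that decodes the payload without verifying the signature, so its output must never be used for authorization decisions. It treats a token as expired once the current time reaches `exp`.

```javascript
// Decode a JWT payload WITHOUT verifying the signature and report expiry.
// nowSeconds is injectable so a specific point in time can be checked.
function inspectJwtExpiry(token, nowSeconds = Math.floor(Date.now() / 1000)) {
  const parts = token.split('.');
  if (parts.length !== 3) throw new Error('Expected three dot-separated JWT segments');
  const payload = JSON.parse(Buffer.from(parts[1], 'base64url').toString('utf8'));
  if (typeof payload.exp !== 'number') {
    return { exp: null, expired: false, secondsLeft: null };
  }
  return {
    exp: payload.exp,
    expired: payload.exp <= nowSeconds,
    secondsLeft: payload.exp - nowSeconds,
  };
}

// Build an unsigned sample token whose exp is one hour from now.
const encodeSegment = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url');
const sampleToken = [
  encodeSegment({ alg: 'HS256', typ: 'JWT' }),
  encodeSegment({ sub: 'user-1', exp: Math.floor(Date.now() / 1000) + 3600 }),
  'signature-not-checked',
].join('.');
console.log(inspectJwtExpiry(sampleToken));
```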
### 10c. JWT Signature Invalid
**Symptom**: Request returns `{ "error": "AUTH_FAILED", "message": "Invalid JWT
signature" }`.
**Cause**: The JWT was signed with a different secret than what is configured in
`JWT_SECRET`.
**Solution**:
```bash
# 1. Verify the secret matches on token-issuer side and gateway side
grep JWT_SECRET .env
# 2. The gateway uses HMAC-SHA256 (HS256) exclusively
# Make sure your token issuer also uses HS256 with the same secret
# 3. Re-generate the token using the correct secret
```
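To confirm whether a given token and secret actually belong together, you can recompute the signature locally. The sketch below assumes HS256 as stated above; `verifyHs256` is a hypothetical helper that checks only the signature, not the `alg` header or the `exp` claim.

```javascript
const crypto = require('crypto');

// Check an HS256 JWT signature against a shared secret.
function verifyHs256(token, secret) {
  const [header, payload, signature] = token.split('.');
  if (!header || !payload || !signature) return false;
  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${header}.${payload}`)
    .digest();
  const received = Buffer.from(signature, 'base64url');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && crypto.timingSafeEqual(received, expected);
}

// Sign a sample token, then verify it with the right and the wrong secret.
const segment = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url');
const signingInput = `${segment({ alg: 'HS256', typ: 'JWT' })}.${segment({ sub: 'user-1' })}`;
const signedToken =
  signingInput + '.' +
  crypto.createHmac('sha256', 'your-jwt-secret').update(signingInput).digest('base64url');
console.log(verifyHs256(signedToken, 'your-jwt-secret'));
console.log(verifyHs256(signedToken, 'some-other-secret'));
```

If the first check fails with the secret from `.env`, the issuer is almost certainly signing with a different secret or algorithm.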
### 10d. No Authentication Header Provided
**Symptom**: Request returns `{ "error": "AUTH_FAILED", "message": "Authentication
required. Provide X-API-Key header or Authorization: Bearer token." }`.
**Cause**: The request has no `X-API-Key` header and no `Authorization: Bearer`
header, and authentication is enabled (API_KEYS or JWT_SECRET is set).
**Solution**:
```bash
# Option A: Use API Key
curl -H "X-API-Key: your-key" http://localhost:3000/health
# Option B: Use JWT Bearer token
curl -H "Authorization: Bearer your.jwt.token" http://localhost:3000/health
# Option C: Disable auth for development (NOT for production)
# Remove all values from API_KEYS and JWT_SECRET in .env:
API_KEYS=
JWT_SECRET=
```
### 10e. Insufficient Permissions (Forbidden)
**Symptom**: Request returns HTTP 403 with `{ "error": "FORBIDDEN", "message":
"Admin role required for this operation" }`.
**Cause**: You are using a `user` role key to perform an admin-only action
(workflow CRUD).
**Solution**:
```bash
# 1. Check which role your key has
grep API_KEYS .env
# Example: sk-user-key:user,sk-admin-key:admin
# 2. Use the admin key for workflow management
curl -H "X-API-Key: sk-admin-key" -X POST http://localhost:3000/workflows ...
# 3. User role can: create jobs, read own jobs, view health/capabilities
# Admin role can: everything the user can + workflow CRUD + view all jobs
```
---
## Quick Diagnostic Commands
```bash
# Gateway health
curl -s http://localhost:3000/health | jq .
# ComfyUI direct connectivity
curl -s http://127.0.0.1:8188/ | head -5
# Queue status
curl -s "http://localhost:3000/jobs?status=queued" -H "X-API-Key: KEY" | jq '.count'
curl -s "http://localhost:3000/jobs?status=running" -H "X-API-Key: KEY" | jq '.count'
# GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
# Redis connectivity
redis-cli -u "$REDIS_URL" ping
# SQLite integrity
sqlite3 ./data/gateway.db "PRAGMA integrity_check;"
# Logs (if using pino-pretty)
npm run dev 2>&1 | npx pino-pretty
# Check all configured environment variables
grep -v '^#' .env | grep -v '^$'
```