1083 lines
29 KiB
Markdown
1083 lines
29 KiB
Markdown
# ComfyUI Gateway -- Troubleshooting Guide
|
|
|
|
Comprehensive troubleshooting reference for diagnosing and resolving issues with the
|
|
ComfyUI Gateway. Every section follows the **Symptom -> Cause -> Solution** format
|
|
with concrete commands you can run immediately.
|
|
|
|
---
|
|
|
|
## Table of Contents
|
|
|
|
1. [ComfyUI Not Reachable](#1-comfyui-not-reachable)
|
|
2. [OOM (Out of Memory) Errors](#2-oom-out-of-memory-errors)
|
|
3. [Slow Generation](#3-slow-generation)
|
|
4. [Webhook Failures](#4-webhook-failures)
|
|
5. [Redis Connection Issues](#5-redis-connection-issues)
|
|
6. [Storage Errors](#6-storage-errors)
|
|
7. [Database Issues](#7-database-issues)
|
|
8. [Job Stuck in "running"](#8-job-stuck-in-running)
|
|
9. [Rate Limiting Issues](#9-rate-limiting-issues)
|
|
10. [Authentication Problems](#10-authentication-problems)
|
|
|
|
---
|
|
|
|
## 1. ComfyUI Not Reachable
|
|
|
|
The gateway returns `COMFYUI_UNREACHABLE` and the `/health` endpoint shows
|
|
`comfyui.reachable: false`.
|
|
|
|
### 1a. Wrong COMFYUI_URL
|
|
|
|
**Symptom**: Gateway starts fine but every job fails with `COMFYUI_UNREACHABLE`.
|
|
The health endpoint returns `{ ok: false, comfyui: { reachable: false } }`.
|
|
|
|
**Cause**: The `COMFYUI_URL` in `.env` does not point to a running ComfyUI instance.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Verify what you have configured
|
|
grep COMFYUI_URL .env
|
|
|
|
# 2. Test connectivity from the gateway host
|
|
curl -s http://127.0.0.1:8188/
|
|
# Expected: HTML page or JSON from ComfyUI
|
|
|
|
# 3. If ComfyUI is on a different port or host, update .env
|
|
# Example: COMFYUI_URL=http://192.168.1.50:8188
|
|
|
|
# 4. Restart the gateway after changing .env
|
|
npm run dev
|
|
```
|
|
|
|
### 1b. Firewall Blocking the Port
|
|
|
|
**Symptom**: `curl` to the ComfyUI URL times out or returns `Connection refused`,
|
|
but ComfyUI is confirmed running on that machine.
|
|
|
|
**Cause**: A host firewall (Windows Defender, iptables, ufw) is blocking the port.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# Linux (ufw)
|
|
sudo ufw allow 8188/tcp
|
|
sudo ufw reload
|
|
|
|
# Linux (iptables)
|
|
sudo iptables -A INPUT -p tcp --dport 8188 -j ACCEPT
|
|
|
|
# Windows (PowerShell, run as Admin)
|
|
New-NetFirewallRule -DisplayName "ComfyUI" -Direction Inbound -Port 8188 -Protocol TCP -Action Allow
|
|
|
|
# Verify the port is listening
|
|
# Linux
|
|
ss -tlnp | grep 8188
|
|
# Windows
|
|
netstat -an | findstr 8188
|
|
```
|
|
|
|
### 1c. Docker Networking
|
|
|
|
**Symptom**: Gateway running inside Docker cannot reach ComfyUI on `127.0.0.1:8188`.
|
|
|
|
**Cause**: `127.0.0.1` inside a Docker container refers to the container itself,
|
|
not the host machine.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# Option A: Use Docker's special host DNS (Linux + Docker Desktop)
|
|
COMFYUI_URL=http://host.docker.internal:8188
|
|
|
|
# Option B: Use the host network mode
|
|
docker run --network host comfyui-gateway
|
|
|
|
# Option C: Put both containers on the same Docker network
|
|
docker network create comfy-net
|
|
docker run --name comfyui --network comfy-net ...
|
|
docker run --name gateway --network comfy-net -e COMFYUI_URL=http://comfyui:8188 ...
|
|
|
|
# Verify from inside the gateway container
|
|
docker exec -it gateway sh -c "wget -qO- http://comfyui:8188/ || echo FAIL"
|
|
```
|
|
|
|
### 1d. WSL2 Networking
|
|
|
|
**Symptom**: Gateway running on Windows/WSL2 cannot reach ComfyUI running on the
|
|
other side (host vs WSL or vice-versa).
|
|
|
|
**Cause**: WSL2 uses a virtual network adapter. The WSL2 guest and Windows host
|
|
have different IP addresses.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# From WSL2, get the Windows host IP
|
|
cat /etc/resolv.conf | grep nameserver | awk '{print $2}'
|
|
# Example output: 172.25.192.1
|
|
|
|
# Set COMFYUI_URL to that IP
|
|
COMFYUI_URL=http://172.25.192.1:8188
|
|
|
|
# Alternatively, if ComfyUI runs inside WSL2 and the gateway is on Windows:
|
|
# Find WSL2 IP
|
|
wsl hostname -I
|
|
# Example output: 172.25.198.5
|
|
# Set: COMFYUI_URL=http://172.25.198.5:8188
|
|
|
|
# Make sure ComfyUI is listening on 0.0.0.0, not just 127.0.0.1
|
|
# Launch ComfyUI with: python main.py --listen 0.0.0.0
|
|
```
|
|
|
|
### 1e. ComfyUI Not Started or Crashed
|
|
|
|
**Symptom**: Port is not listening at all.
|
|
|
|
**Cause**: ComfyUI process is not running.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# Check if the process is running
|
|
# Linux
|
|
ps aux | grep "main.py"
|
|
# Windows
|
|
tasklist | findstr python
|
|
|
|
# Start ComfyUI
|
|
cd /path/to/ComfyUI
|
|
python main.py --listen 0.0.0.0 --port 8188
|
|
|
|
# Check logs for startup errors
|
|
python main.py --listen 0.0.0.0 --port 8188 2>&1 | tail -50
|
|
|
|
# Verify it is accepting connections
|
|
curl -s http://127.0.0.1:8188/ && echo "OK" || echo "NOT REACHABLE"
|
|
```
|
|
|
|
---
|
|
|
|
## 2. OOM (Out of Memory) Errors
|
|
|
|
The gateway classifies these as `COMFYUI_OOM` with `retryable: false`.
|
|
|
|
### 2a. Resolution or Batch Size Too Large
|
|
|
|
**Symptom**: Job fails with error containing "CUDA out of memory",
|
|
"allocator backend out of memory", or "failed to allocate".
|
|
|
|
**Cause**: The requested image dimensions or batch size exceeds available VRAM.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Reduce resolution in your job request
|
|
# Instead of 2048x2048, try 1024x1024 or 768x768
|
|
curl -X POST http://localhost:3000/jobs \
|
|
-H "Content-Type: application/json" \
|
|
-H "X-API-Key: your-key" \
|
|
-d '{
|
|
"workflowId": "sdxl_realism_v1",
|
|
"inputs": {
|
|
"prompt": "a mountain landscape",
|
|
"width": 1024,
|
|
"height": 1024
|
|
}
|
|
}'
|
|
|
|
# 2. Reduce batch size to 1
|
|
# Set in your job inputs: "batch_size": 1
|
|
|
|
# 3. Lower the gateway-level limits in .env
|
|
MAX_IMAGE_SIZE=1024
|
|
MAX_BATCH_SIZE=2
|
|
```
|
|
|
|
### 2b. Too Many Steps
|
|
|
|
**Symptom**: OOM occurs mid-generation, not immediately at submission.
|
|
|
|
**Cause**: The sampler accumulates intermediate tensors over many steps.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# Reduce steps in the job inputs
|
|
# Instead of 50 steps, try 20-30
|
|
curl -X POST http://localhost:3000/jobs \
|
|
-H "Content-Type: application/json" \
|
|
-H "X-API-Key: your-key" \
|
|
-d '{
|
|
"workflowId": "sdxl_realism_v1",
|
|
"inputs": {
|
|
"prompt": "a portrait photo",
|
|
"steps": 20,
|
|
"width": 1024,
|
|
"height": 1024
|
|
}
|
|
}'
|
|
```
|
|
|
|
### 2c. Model Quantization
|
|
|
|
**Symptom**: Even at low resolution, OOM errors occur because the model is too
|
|
large for the GPU (common on 8 GB VRAM cards with SDXL).
|
|
|
|
**Cause**: Full-precision (fp32) or half-precision (fp16) model weights exceed
|
|
available VRAM.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# In ComfyUI, use fp8 or quantized checkpoints
|
|
# Update your workflow template to use a quantized model:
|
|
# e.g., "ckpt_name": "sdxl_base_1.0_fp8.safetensors"
|
|
|
|
# Or add --fp8_e4m3fn-unet flag when starting ComfyUI
|
|
python main.py --listen 0.0.0.0 --fp8_e4m3fn-unet
|
|
|
|
# Monitor VRAM usage
|
|
nvidia-smi -l 2
|
|
```
|
|
|
|
### 2d. VAE Tiling
|
|
|
|
**Symptom**: OOM happens during the VAE decode step (after sampling completes).
|
|
|
|
**Cause**: The VAE decoder processes the entire latent at once, which can be very
|
|
memory-intensive at high resolutions.
|
|
|
|
**Solution**:
|
|
|
|
```
|
|
Enable VAE tiling in your ComfyUI workflow by adding a "VAEDecodeTiled" node
|
|
instead of "VAEDecode". Tile size of 512 is a good default.
|
|
|
|
In the workflow JSON template:
|
|
{
|
|
"10": {
|
|
"class_type": "VAEDecodeTiled",
|
|
"inputs": {
|
|
"samples": ["3", 0],
|
|
"vae": ["4", 2],
|
|
"tile_size": 512
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## 3. Slow Generation
|
|
|
|
### 3a. GPU Not Being Utilized
|
|
|
|
**Symptom**: Jobs complete but take much longer than expected. GPU utilization
|
|
stays near 0%.
|
|
|
|
**Cause**: ComfyUI is falling back to CPU inference, or the wrong GPU is selected.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Check GPU utilization during a job
|
|
nvidia-smi -l 1
|
|
# Look for "GPU-Util" column -- should be 80-100% during sampling
|
|
|
|
# 2. Verify CUDA is available in ComfyUI
|
|
# Check ComfyUI startup logs for "Using device: cuda"
|
|
|
|
# 3. Force GPU selection (multi-GPU systems)
|
|
CUDA_VISIBLE_DEVICES=0 python main.py --listen 0.0.0.0
|
|
|
|
# 4. Verify PyTorch sees the GPU
|
|
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
|
|
```
|
|
|
|
### 3b. Model Loading on Every Job
|
|
|
|
**Symptom**: First job is slow, subsequent jobs with the same workflow are faster,
|
|
but switching workflows causes long delays.
|
|
|
|
**Cause**: ComfyUI loads the model from disk each time a different checkpoint is
|
|
requested. This can take 10-30 seconds per model load.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Increase ComfyUI's model cache
|
|
# Start ComfyUI with a larger cache (default is 1 model):
|
|
python main.py --listen 0.0.0.0 --cache-size 3
|
|
|
|
# 2. Use the same checkpoint across workflows when possible
|
|
# Standardize on one checkpoint (e.g., sdxl_base_1.0.safetensors)
|
|
|
|
# 3. Place models on an SSD, not an HDD
|
|
# Move ComfyUI/models/ to an NVMe drive for faster load times
|
|
```
|
|
|
|
### 3c. Queue Depth / Concurrency
|
|
|
|
**Symptom**: Jobs are queued for a long time before starting.
|
|
The job stays in `status: "queued"` for minutes.
|
|
|
|
**Cause**: The worker concurrency is set to 1 (default) and multiple jobs are
|
|
queued, or the single slot is occupied by a long-running job.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Check current queue state
|
|
curl -s http://localhost:3000/jobs?status=queued | jq '.count'
|
|
curl -s http://localhost:3000/jobs?status=running | jq '.count'
|
|
|
|
# 2. Increase concurrency if your GPU can handle it (multi-batch)
|
|
# Edit .env:
|
|
MAX_CONCURRENCY=2
|
|
|
|
# WARNING: Only increase if you have enough VRAM for parallel jobs.
|
|
# Two concurrent 1024x1024 SDXL jobs need ~20+ GB VRAM.
|
|
|
|
# 3. For multi-GPU setups, run multiple worker processes
|
|
# Terminal 1: CUDA_VISIBLE_DEVICES=0 npm run start:worker
|
|
# Terminal 2: CUDA_VISIBLE_DEVICES=1 npm run start:worker
|
|
# Both connect to the same Redis queue
|
|
```
|
|
|
|
### 3d. ComfyUI Startup Time
|
|
|
|
**Symptom**: The very first job after starting ComfyUI takes 30-60 seconds even
|
|
for a simple generation.
|
|
|
|
**Cause**: ComfyUI performs initialization (loading nodes, compiling, warming up
|
|
CUDA) on the first prompt.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Send a warm-up job immediately after starting ComfyUI
|
|
# This is a tiny 64x64 generation that forces initialization
|
|
curl -X POST http://localhost:3000/jobs \
|
|
-H "Content-Type: application/json" \
|
|
-H "X-API-Key: your-key" \
|
|
-d '{
|
|
"workflowId": "sdxl_realism_v1",
|
|
"inputs": {
|
|
"prompt": "test",
|
|
"width": 64,
|
|
"height": 64,
|
|
"steps": 1
|
|
}
|
|
}'
|
|
|
|
# 2. Increase the gateway timeout to account for cold starts
|
|
COMFYUI_TIMEOUT_MS=600000
|
|
```
|
|
|
|
---
|
|
|
|
## 4. Webhook Failures
|
|
|
|
Webhook errors appear in logs as `WEBHOOK_DELIVERY_FAILED`.
|
|
|
|
### 4a. DNS Resolution Failure
|
|
|
|
**Symptom**: Webhook fails with "getaddrinfo ENOTFOUND" or "DNS lookup failed".
|
|
|
|
**Cause**: The callback URL hostname cannot be resolved.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Test DNS resolution from the gateway host
|
|
nslookup your-webhook-domain.com
|
|
dig your-webhook-domain.com
|
|
|
|
# 2. If using a local hostname (e.g., within Docker), make sure it is resolvable
|
|
# Add to /etc/hosts if needed:
|
|
echo "192.168.1.50 my-webhook-server" | sudo tee -a /etc/hosts
|
|
|
|
# 3. Verify the callback URL is correct in your job request
|
|
curl -X POST http://localhost:3000/jobs \
|
|
-H "Content-Type: application/json" \
|
|
-H "X-API-Key: your-key" \
|
|
-d '{
|
|
"workflowId": "sdxl_realism_v1",
|
|
"inputs": { "prompt": "test" },
|
|
"callbackUrl": "https://your-valid-domain.com/webhook"
|
|
}'
|
|
```
|
|
|
|
### 4b. SSL Certificate Errors
|
|
|
|
**Symptom**: Webhook fails with "self signed certificate", "CERT_HAS_EXPIRED",
|
|
or "unable to verify the first certificate".
|
|
|
|
**Cause**: The webhook receiver uses an invalid, expired, or self-signed SSL certificate.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Test the certificate manually
|
|
openssl s_client -connect your-webhook-domain.com:443 -servername your-webhook-domain.com < /dev/null 2>&1 | head -20
|
|
|
|
# 2. Check expiration
|
|
echo | openssl s_client -connect your-webhook-domain.com:443 2>/dev/null | openssl x509 -noout -dates
|
|
|
|
# 3. For development with self-signed certs, set NODE_TLS_REJECT_UNAUTHORIZED
|
|
# WARNING: Do NOT use this in production
|
|
NODE_TLS_REJECT_UNAUTHORIZED=0 npm run dev
|
|
|
|
# 4. For production, fix the certificate (use Let's Encrypt or a valid CA)
|
|
```
|
|
|
|
### 4c. Webhook Timeout
|
|
|
|
**Symptom**: Webhook logs show "AbortError" or "Webhook POST timed out".
|
|
|
|
**Cause**: The webhook receiver takes longer than 10 seconds to respond.
|
|
The gateway has a hardcoded 10-second timeout per webhook attempt with
|
|
3 retries and exponential backoff.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Ensure your webhook receiver responds quickly
|
|
# The receiver should return 200 immediately and process asynchronously
|
|
# BAD: app.post("/webhook", async (req, res) => { await longProcess(); res.send("ok"); })
|
|
# GOOD: app.post("/webhook", (req, res) => { res.send("ok"); enqueueWork(req.body); })
|
|
|
|
# 2. Test receiver response time
|
|
time curl -s -o /dev/null -w "%{time_total}" -X POST https://your-webhook.com/callback \
|
|
-H "Content-Type: application/json" -d '{"test": true}'
|
|
# Should be < 2 seconds
|
|
```
|
|
|
|
### 4d. Domain Not in Allowlist
|
|
|
|
**Symptom**: Job creation fails with `Callback domain "example.com" is not in the
|
|
allowed domains list`.
|
|
|
|
**Cause**: `WEBHOOK_ALLOWED_DOMAINS` is configured and does not include the
|
|
callback URL's domain.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Check current setting
|
|
grep WEBHOOK_ALLOWED_DOMAINS .env
|
|
|
|
# 2. Add the domain (comma-separated list)
|
|
WEBHOOK_ALLOWED_DOMAINS=your-app.com,n8n.your-domain.com,*.internal.company.com
|
|
|
|
# 3. Or allow all domains (less secure, suitable for development)
|
|
WEBHOOK_ALLOWED_DOMAINS=*
|
|
|
|
# 4. Restart the gateway
|
|
npm run dev
|
|
```
|
|
|
|
### 4e. HMAC Signature Mismatch
|
|
|
|
**Symptom**: Your webhook receiver receives the POST but HMAC validation fails
|
|
on your end.
|
|
|
|
**Cause**: The `WEBHOOK_SECRET` configured in the gateway does not match the secret
|
|
your receiver uses to validate signatures, or the signature computation differs.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Verify the WEBHOOK_SECRET matches on both sides
|
|
grep WEBHOOK_SECRET .env
|
|
|
|
# 2. The gateway sends: X-Signature: sha256=<hex>
|
|
# Computed as: HMAC-SHA256(secret, raw_body_string)
|
|
# Verify in Node.js:
|
|
node -e "
|
|
const crypto = require('crypto');
|
|
const secret = 'your-webhook-secret';
|
|
const body = '{\"jobId\":\"test\",\"status\":\"succeeded\"}';
|
|
const sig = crypto.createHmac('sha256', secret).update(body, 'utf8').digest('hex');
|
|
console.log('Expected header: sha256=' + sig);
|
|
"
|
|
|
|
# 3. Common mistakes:
|
|
# - Parsing the body before computing HMAC (must use raw string)
|
|
# - Using different encodings (gateway uses utf8)
|
|
# - Comparing strings case-sensitively (hex is lowercase)
|
|
```
|
|
|
|
---
|
|
|
|
## 5. Redis Connection Issues
|
|
|
|
### 5a. Cannot Connect to Redis
|
|
|
|
**Symptom**: Gateway crashes at startup with "Redis connection error" or
|
|
"ECONNREFUSED" targeting the Redis port.
|
|
|
|
**Cause**: Redis server is not running, or the `REDIS_URL` is wrong.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Check if Redis is running
|
|
redis-cli ping
|
|
# Expected: PONG
|
|
|
|
# 2. Verify the URL format
|
|
# Correct formats:
|
|
# redis://localhost:6379
|
|
# redis://:yourpassword@redis-host:6379/0
|
|
# rediss://user:password@host:6380/0 (TLS)
|
|
|
|
# 3. Test connectivity
|
|
redis-cli -u "redis://localhost:6379" ping
|
|
|
|
# 4. If Redis is not needed, remove REDIS_URL to use in-memory queue
|
|
# Edit .env:
|
|
REDIS_URL=
|
|
# The gateway falls back to an in-memory queue automatically
|
|
```
|
|
|
|
### 5b. Redis Authentication Failure
|
|
|
|
**Symptom**: Error message contains "NOAUTH Authentication required" or
|
|
"ERR invalid password".
|
|
|
|
**Cause**: Redis requires a password but `REDIS_URL` does not include one,
|
|
or the password is wrong.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Include the password in the URL
|
|
REDIS_URL=redis://:your_redis_password@localhost:6379/0
|
|
|
|
# 2. Test with redis-cli
|
|
redis-cli -a "your_redis_password" ping
|
|
|
|
# 3. Check Redis config for requirepass
|
|
redis-cli CONFIG GET requirepass
|
|
```
|
|
|
|
### 5c. Fallback to In-Memory Queue
|
|
|
|
**Symptom**: Logs show "No Redis URL configured, using in-memory queue" and
|
|
you expected BullMQ.
|
|
|
|
**Cause**: `REDIS_URL` is empty or not set in `.env`.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Set REDIS_URL in .env
|
|
REDIS_URL=redis://localhost:6379
|
|
|
|
# 2. Verify Redis is running
|
|
redis-cli ping
|
|
|
|
# 3. Restart the gateway
|
|
npm run dev
|
|
|
|
# 4. Confirm in logs: should show "Redis URL configured, using BullMQ worker"
|
|
```
|
|
|
|
> **Note**: The in-memory queue is fine for single-instance development deployments.
|
|
> For production with multiple workers or durability requirements, use Redis + BullMQ.
|
|
|
|
---
|
|
|
|
## 6. Storage Errors
|
|
|
|
### 6a. Local Disk Permission Denied
|
|
|
|
**Symptom**: Job fails at the output storage step with "EACCES: permission denied"
|
|
or `STORAGE_READ_ERROR`.
|
|
|
|
**Cause**: The gateway process does not have write permissions to `STORAGE_LOCAL_PATH`.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Check the configured path
|
|
grep STORAGE_LOCAL_PATH .env
|
|
# Default: ./data/outputs
|
|
|
|
# 2. Ensure the directory exists and is writable
|
|
mkdir -p ./data/outputs
|
|
chmod 755 ./data/outputs
|
|
|
|
# 3. Check ownership
|
|
ls -la ./data/
|
|
|
|
# 4. If running as a different user (e.g., in Docker)
|
|
chown -R node:node ./data/outputs
|
|
|
|
# 5. For Docker, mount a volume with correct permissions
|
|
# docker run -v /host/path/outputs:/app/data/outputs ...
|
|
```
|
|
|
|
### 6b. S3 Credentials Invalid
|
|
|
|
**Symptom**: Job fails with `STORAGE_S3_PUT_ERROR` and the underlying error
|
|
mentions "InvalidAccessKeyId", "SignatureDoesNotMatch", or "AccessDenied".
|
|
|
|
**Cause**: The `S3_ACCESS_KEY` / `S3_SECRET_KEY` are wrong, expired, or the
|
|
IAM policy does not grant `s3:PutObject` permission.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Verify credentials are set
|
|
grep S3_ACCESS_KEY .env
|
|
grep S3_SECRET_KEY .env
|
|
grep S3_BUCKET .env
|
|
|
|
# 2. Test with AWS CLI
|
|
aws s3 ls s3://your-bucket/ \
|
|
--endpoint-url http://your-minio:9000 \
|
|
--region us-east-1
|
|
|
|
# 3. Test a put operation
|
|
echo "test" > /tmp/test.txt
|
|
aws s3 cp /tmp/test.txt s3://your-bucket/test.txt \
|
|
--endpoint-url http://your-minio:9000
|
|
|
|
# 4. Minimum IAM policy for the gateway:
|
|
# {
|
|
# "Version": "2012-10-17",
|
|
# "Statement": [{
|
|
# "Effect": "Allow",
|
|
# "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListBucket"],
|
|
# "Resource": ["arn:aws:s3:::your-bucket", "arn:aws:s3:::your-bucket/*"]
|
|
# }]
|
|
# }
|
|
```
|
|
|
|
### 6c. MinIO Configuration
|
|
|
|
**Symptom**: S3 storage fails with "socket hang up", "ECONNREFUSED", or
|
|
"Bucket does not exist".
|
|
|
|
**Cause**: MinIO endpoint is wrong, the bucket has not been created, or
|
|
`forcePathStyle` is not enabled (handled automatically by the gateway).
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Verify MinIO is running
|
|
curl http://localhost:9000/minio/health/live
|
|
# Expected: HTTP 200
|
|
|
|
# 2. Set the correct endpoint in .env
|
|
S3_ENDPOINT=http://localhost:9000
|
|
S3_BUCKET=comfyui-outputs
|
|
S3_ACCESS_KEY=minioadmin
|
|
S3_SECRET_KEY=minioadmin
|
|
S3_REGION=us-east-1
|
|
|
|
# 3. Create the bucket if it does not exist
|
|
# Using mc (MinIO Client)
|
|
mc alias set local http://localhost:9000 minioadmin minioadmin
|
|
mc mb local/comfyui-outputs
|
|
|
|
# Or using AWS CLI
|
|
aws s3 mb s3://comfyui-outputs --endpoint-url http://localhost:9000
|
|
```
|
|
|
|
---
|
|
|
|
## 7. Database Issues
|
|
|
|
### 7a. SQLite WAL Lock Errors
|
|
|
|
**Symptom**: Intermittent "SQLITE_BUSY" or "database is locked" errors under
|
|
concurrent load.
|
|
|
|
**Cause**: Multiple processes or threads are writing to the SQLite database
|
|
simultaneously. SQLite WAL mode supports concurrent readers but only one writer.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. The gateway already sets optimal pragmas:
|
|
# journal_mode = WAL
|
|
# synchronous = NORMAL
|
|
# busy_timeout = 5000 (5 seconds)
|
|
|
|
# 2. If running multiple gateway instances, switch to Postgres
|
|
DATABASE_URL=postgresql://user:password@localhost:5432/comfyui_gateway
|
|
|
|
# 3. If you must use SQLite with a single instance, increase busy timeout
|
|
# (requires code change or env override):
|
|
# The default 5000ms should be sufficient for most single-instance use cases
|
|
|
|
# 4. Check for stuck WAL files
|
|
ls -la ./data/gateway.db*
|
|
# You should see: gateway.db, gateway.db-wal, gateway.db-shm
|
|
|
|
# 5. If the database is corrupted, try recovery
|
|
sqlite3 ./data/gateway.db "PRAGMA integrity_check;"
|
|
# If it reports errors, back up and recreate:
|
|
cp ./data/gateway.db ./data/gateway.db.bak
|
|
sqlite3 ./data/gateway.db ".recover" | sqlite3 ./data/gateway_recovered.db
|
|
```
|
|
|
|
### 7b. Postgres Connection Pooling
|
|
|
|
**Symptom**: Errors like "too many clients already", "remaining connection slots
|
|
are reserved", or intermittent "Connection terminated unexpectedly".
|
|
|
|
**Cause**: The gateway opens too many connections to Postgres, exceeding
|
|
`max_connections`, or connections are not being properly returned to the pool.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Check current connections in Postgres
|
|
psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'comfyui_gateway';"
|
|
|
|
# 2. Check max_connections setting
|
|
psql -c "SHOW max_connections;"
|
|
|
|
# 3. Use a connection pooler like PgBouncer
|
|
# Install PgBouncer and point DATABASE_URL to it
|
|
DATABASE_URL=postgresql://user:password@localhost:6432/comfyui_gateway
|
|
|
|
# 4. If running multiple gateway instances, ensure the total pool size
|
|
# across all instances does not exceed Postgres max_connections
|
|
```
|
|
|
|
### 7c. Database URL Format
|
|
|
|
**Symptom**: Gateway crashes at startup with "Invalid connection string" or
|
|
uses SQLite when you intended Postgres.
|
|
|
|
**Cause**: The `DATABASE_URL` format is wrong. The gateway checks if the URL
|
|
starts with `postgres://` or `postgresql://` to select the Postgres backend.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# SQLite formats (all valid):
|
|
DATABASE_URL=./data/gateway.db
|
|
DATABASE_URL=/absolute/path/to/gateway.db
|
|
|
|
# Postgres formats (must start with postgres:// or postgresql://):
|
|
DATABASE_URL=postgresql://user:password@localhost:5432/comfyui_gateway
|
|
DATABASE_URL=postgres://user:password@host:5432/dbname?sslmode=require
|
|
```
|
|
|
|
---
|
|
|
|
## 8. Job Stuck in "running"
|
|
|
|
### 8a. ComfyUI Crashed During Execution
|
|
|
|
**Symptom**: A job shows `status: "running"` indefinitely. No progress updates.
|
|
The gateway health endpoint may show `comfyui.reachable: false`.
|
|
|
|
**Cause**: ComfyUI crashed (segfault, CUDA error, killed by OOM killer) while
|
|
processing the job, and the gateway's WebSocket connection was severed.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Check job status
|
|
curl -s http://localhost:3000/jobs/<jobId> | jq '.status'
|
|
|
|
# 2. Check if ComfyUI is still running
|
|
curl -s http://localhost:3000/health | jq '.comfyui.reachable'
|
|
|
|
# 3. If ComfyUI crashed, restart it
|
|
cd /path/to/ComfyUI
|
|
python main.py --listen 0.0.0.0
|
|
|
|
# 4. The stuck job will eventually time out (COMFYUI_TIMEOUT_MS, default 5 min)
|
|
# and be marked as failed with COMFYUI_TIMEOUT
|
|
|
|
# 5. To immediately cancel the stuck job
|
|
curl -X POST http://localhost:3000/jobs/<jobId>/cancel \
|
|
-H "X-API-Key: your-key"
|
|
|
|
# 6. To reduce timeout for faster failure detection
|
|
COMFYUI_TIMEOUT_MS=120000
|
|
```
|
|
|
|
### 8b. WebSocket Disconnection
|
|
|
|
**Symptom**: Job stays "running" but ComfyUI is actually done. The output
|
|
exists in ComfyUI's history.
|
|
|
|
**Cause**: The WebSocket connection dropped mid-execution, and the polling
|
|
fallback failed to pick up the result.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Check ComfyUI history directly
|
|
curl -s http://127.0.0.1:8188/history | jq 'keys | length'
|
|
|
|
# 2. The gateway automatically falls back to HTTP polling if WebSocket fails.
|
|
# If polling also fails, the job times out.
|
|
|
|
# 3. Restart the gateway to reset connections
|
|
npm run dev
|
|
|
|
# 4. Check network stability between gateway and ComfyUI
|
|
ping -c 10 <comfyui-host>
|
|
```
|
|
|
|
### 8c. Restart Recovery
|
|
|
|
**Symptom**: After restarting the gateway, jobs that were "running" remain
|
|
in that state permanently.
|
|
|
|
**Cause**: The in-memory queue loses track of running jobs when the process
|
|
restarts. There is no automatic recovery for in-memory jobs.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. For production, use Redis (BullMQ) for durable job queues
|
|
REDIS_URL=redis://localhost:6379
|
|
|
|
# 2. Manually fail stuck jobs via the database
|
|
sqlite3 ./data/gateway.db \
|
|
"UPDATE jobs SET status='failed', errorJson='{\"code\":\"GATEWAY_RESTART\",\"message\":\"Job interrupted by gateway restart\"}', completedAt=datetime('now') WHERE status='running';"
|
|
|
|
# 3. Verify
|
|
sqlite3 ./data/gateway.db "SELECT id, status FROM jobs WHERE status='running';"
|
|
```
|
|
|
|
---
|
|
|
|
## 9. Rate Limiting Issues
|
|
|
|
### 9a. Identifying You Are Being Rate Limited
|
|
|
|
**Symptom**: API returns HTTP 429 with body `{ "error": "RATE_LIMITED" }` and
|
|
a `Retry-After` header.
|
|
|
|
**Cause**: You exceeded `RATE_LIMIT_MAX` requests within the `RATE_LIMIT_WINDOW_MS`
|
|
window. Limits are applied per API key or per IP.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Check the response headers
|
|
curl -v http://localhost:3000/health -H "X-API-Key: your-key" 2>&1 | grep -i "x-ratelimit"
|
|
# X-RateLimit-Limit: 100
|
|
# X-RateLimit-Remaining: 0
|
|
# Retry-After: 42
|
|
|
|
# 2. Wait for the Retry-After period, then retry
|
|
|
|
# 3. Implement exponential backoff in your client
|
|
```
|
|
|
|
### 9b. Adjusting Rate Limits
|
|
|
|
**Symptom**: Legitimate usage is being throttled.
|
|
|
|
**Cause**: Default limits (100 requests/minute) are too low for your workload.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Increase the limit in .env
|
|
RATE_LIMIT_MAX=500
|
|
RATE_LIMIT_WINDOW_MS=60000
|
|
|
|
# 2. For burst workloads, widen the window
|
|
RATE_LIMIT_MAX=1000
|
|
RATE_LIMIT_WINDOW_MS=300000
|
|
|
|
# 3. Restart the gateway
|
|
npm run dev
|
|
|
|
# 4. Note: Rate limits are per API key (if authenticated) or per IP.
|
|
# Different API keys have independent counters.
|
|
```
|
|
|
|
### 9c. Rate Limit Per API Key vs Per IP
|
|
|
|
**Symptom**: Different clients sharing the same IP are interfering with each
|
|
other's rate limits.
|
|
|
|
**Cause**: Without API keys, all requests from the same IP share a single
|
|
rate-limit bucket.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Assign unique API keys to each client
|
|
API_KEYS=client1-key:user,client2-key:user,admin-key:admin
|
|
|
|
# 2. Each client uses its own X-API-Key header
|
|
# Client 1: -H "X-API-Key: client1-key"
|
|
# Client 2: -H "X-API-Key: client2-key"
|
|
|
|
# 3. Each key gets its own independent rate-limit counter
|
|
```
|
|
|
|
---
|
|
|
|
## 10. Authentication Problems
|
|
|
|
### 10a. API Key Not Accepted
|
|
|
|
**Symptom**: Every request returns HTTP 401 with `{ "error": "AUTH_FAILED",
|
|
"message": "Invalid API key" }`.
|
|
|
|
**Cause**: The `X-API-Key` header value does not match any entry in `API_KEYS`.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Check configured keys
|
|
grep API_KEYS .env
|
|
# Format: key1:admin,key2:user
|
|
|
|
# 2. Ensure your request uses the exact key (no extra whitespace)
|
|
curl -H "X-API-Key: mykey123" http://localhost:3000/health
|
|
|
|
# 3. Keys are case-sensitive and matched exactly
|
|
|
|
# 4. If API_KEYS is empty, authentication is DISABLED (development mode)
|
|
# All requests are treated as admin. Set keys for production:
|
|
API_KEYS=sk-prod-abc123:admin,sk-user-xyz789:user
|
|
```
|
|
|
|
### 10b. JWT Token Expired
|
|
|
|
**Symptom**: Request returns `{ "error": "AUTH_FAILED", "message": "JWT token
|
|
has expired" }`.
|
|
|
|
**Cause**: The JWT `exp` claim is in the past.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Decode the JWT to check expiration (without verification)
|
|
echo "<your-token>" | cut -d'.' -f2 | base64 -d 2>/dev/null | jq '.exp'
|
|
|
|
# 2. Compare with current time
|
|
date +%s
|
|
|
|
# 3. Generate a new token with a longer TTL
|
|
# Example using Node.js:
|
|
node -e "
|
|
const crypto = require('crypto');
|
|
const secret = 'your-jwt-secret';
|
|
const header = Buffer.from(JSON.stringify({alg:'HS256',typ:'JWT'})).toString('base64url');
|
|
const payload = Buffer.from(JSON.stringify({
|
|
sub: 'user-1',
|
|
role: 'admin',
|
|
iat: Math.floor(Date.now()/1000),
|
|
exp: Math.floor(Date.now()/1000) + 86400 // 24 hours
|
|
})).toString('base64url');
|
|
const sig = crypto.createHmac('sha256', secret).update(header+'.'+payload).digest('base64url');
|
|
console.log(header+'.'+payload+'.'+sig);
|
|
"
|
|
```
|
|
|
|
### 10c. JWT Signature Invalid
|
|
|
|
**Symptom**: Request returns `{ "error": "AUTH_FAILED", "message": "Invalid JWT
|
|
signature" }`.
|
|
|
|
**Cause**: The JWT was signed with a different secret than what is configured in
|
|
`JWT_SECRET`.
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Verify the secret matches on token-issuer side and gateway side
|
|
grep JWT_SECRET .env
|
|
|
|
# 2. The gateway uses HMAC-SHA256 (HS256) exclusively
|
|
# Make sure your token issuer also uses HS256 with the same secret
|
|
|
|
# 3. Re-generate the token using the correct secret
|
|
```
|
|
|
|
### 10d. No Authentication Header Provided
|
|
|
|
**Symptom**: Request returns `{ "error": "AUTH_FAILED", "message": "Authentication
|
|
required. Provide X-API-Key header or Authorization: Bearer token." }`.
|
|
|
|
**Cause**: The request has no `X-API-Key` header and no `Authorization: Bearer`
|
|
header, and authentication is enabled (API_KEYS or JWT_SECRET is set).
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# Option A: Use API Key
|
|
curl -H "X-API-Key: your-key" http://localhost:3000/health
|
|
|
|
# Option B: Use JWT Bearer token
|
|
curl -H "Authorization: Bearer your.jwt.token" http://localhost:3000/health
|
|
|
|
# Option C: Disable auth for development (NOT for production)
|
|
# Remove all values from API_KEYS and JWT_SECRET in .env:
|
|
API_KEYS=
|
|
JWT_SECRET=
|
|
```
|
|
|
|
### 10e. Insufficient Permissions (Forbidden)
|
|
|
|
**Symptom**: Request returns HTTP 403 with `{ "error": "FORBIDDEN", "message":
|
|
"Admin role required for this operation" }`.
|
|
|
|
**Cause**: You are using a `user` role key to perform an admin-only action
|
|
(workflow CRUD).
|
|
|
|
**Solution**:
|
|
|
|
```bash
|
|
# 1. Check which role your key has
|
|
grep API_KEYS .env
|
|
# Example: sk-user-key:user,sk-admin-key:admin
|
|
|
|
# 2. Use the admin key for workflow management
|
|
curl -H "X-API-Key: sk-admin-key" -X POST http://localhost:3000/workflows ...
|
|
|
|
# 3. User role can: create jobs, read own jobs, view health/capabilities
|
|
# Admin role can: everything the user can + workflow CRUD + view all jobs
|
|
```
|
|
|
|
---
|
|
|
|
## Quick Diagnostic Commands
|
|
|
|
```bash
|
|
# Gateway health
|
|
curl -s http://localhost:3000/health | jq .
|
|
|
|
# ComfyUI direct connectivity
|
|
curl -s http://127.0.0.1:8188/ | head -5
|
|
|
|
# Queue status
|
|
curl -s http://localhost:3000/jobs?status=queued -H "X-API-Key: KEY" | jq '.count'
|
|
curl -s http://localhost:3000/jobs?status=running -H "X-API-Key: KEY" | jq '.count'
|
|
|
|
# GPU memory
|
|
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
|
|
|
|
# Redis connectivity
|
|
redis-cli -u "$REDIS_URL" ping
|
|
|
|
# SQLite integrity
|
|
sqlite3 ./data/gateway.db "PRAGMA integrity_check;"
|
|
|
|
# Logs (if using pino-pretty)
|
|
npm run dev 2>&1 | npx pino-pretty
|
|
|
|
# Check all configured environment variables
|
|
grep -v '^#' .env | grep -v '^$'
|
|
```
|