---
name: workflow-automation
description: Workflow automation is the infrastructure that makes AI agents
  reliable. Without durable execution, a network hiccup during a 10-step payment
  flow means lost money and angry customers. With it, workflows resume exactly
  where they left off.
risk: critical
source: vibeship-spawner-skills (Apache 2.0)
date_added: 2026-02-27
---

# Workflow Automation

Workflow automation is the infrastructure that makes AI agents reliable.
Without durable execution, a network hiccup during a 10-step payment
flow means lost money and angry customers. With it, workflows resume
exactly where they left off.

This skill covers the platforms (n8n, Temporal, Inngest) and patterns
(sequential, parallel, orchestrator-worker) that turn brittle scripts
into production-grade automation.

Key insight: The platforms make different tradeoffs. n8n optimizes for
accessibility, Temporal for correctness, Inngest for developer experience.
Pick based on your actual needs, not hype.

## Principles

- Durable execution is non-negotiable for money or state-critical workflows
- Events are the universal language of workflow triggers
- Steps are checkpoints - each should be independently retryable
- Start simple, add complexity only when reliability demands it
- Observability isn't optional - you need to see where workflows fail
- Workflows and agents co-evolve - design for both

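The "steps are checkpoints" principle can be sketched without any platform at all. The block below is a minimal in-memory illustration, assuming a `Map` stands in for the durable store a real platform provides; `CheckpointStore`, `processOrder`, and `shipmentServiceUp` are hypothetical names, not part of n8n, Temporal, or Inngest.

```typescript
// Minimal sketch of step checkpointing. An in-memory Map stands in for
// the durable store a real workflow platform would provide (assumption).
class CheckpointStore {
  private results = new Map<string, unknown>();
  public executions: string[] = [];

  // Run `fn` at most once per successful step id; later calls reuse the saved result.
  run<T>(id: string, fn: () => T): T {
    if (this.results.has(id)) {
      return this.results.get(id) as T;
    }
    this.executions.push(id);
    const value = fn(); // if this throws, nothing is saved, so the step can be retried
    this.results.set(id, value);
    return value;
  }
}

// Hypothetical flag simulating an outage of a downstream service.
let shipmentServiceUp = false;

function processOrder(store: CheckpointStore): string {
  const total = store.run("validate", () => 42);
  const chargeId = store.run("charge", () => `charge-for-${total}`);
  return store.run("ship", () => {
    if (!shipmentServiceUp) throw new Error("shipment service down");
    return `shipped:${chargeId}`;
  });
}

const store = new CheckpointStore();
let firstAttemptFailed = false;
try {
  processOrder(store); // fails at the "ship" step
} catch {
  firstAttemptFailed = true;
}

shipmentServiceUp = true;
// Second attempt: "validate" and "charge" are read from the store, only "ship" runs again.
const outcome = processOrder(store);
console.log(outcome, store.executions);
```

The point of the design is that a retry re-enters the same function but only the failed step does real work, which is why each step should be safe to retry on its own.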
## Capabilities

- workflow-automation
- workflow-orchestration
- durable-execution
- event-driven-workflows
- step-functions
- job-queues
- background-jobs
- scheduled-tasks

## Scope

Handled by other skills (topic → skill):

- multi-agent-coordination → multi-agent-orchestration
- ci-cd-pipelines → devops
- data-pipelines → data-engineer
- api-design → api-designer

## Tooling

### Platforms

- n8n
  - When: Low-code automation, quick prototyping, non-technical users
  - Note: Self-hostable, 400+ integrations, great for visual workflows
- Temporal
  - When: Mission-critical workflows, financial transactions, microservices
  - Note: Strongest durability guarantees, steeper learning curve
- Inngest
  - When: Event-driven serverless, TypeScript codebases, AI workflows
  - Note: Best developer experience, works with any hosting
- AWS Step Functions
  - When: AWS-native stacks, existing Lambda functions
  - Note: Tight AWS integration, JSON-based workflow definition
- Azure Durable Functions
  - When: Azure stacks, .NET or TypeScript
  - Note: Good AI agent support, checkpoint and replay

## Patterns

### Sequential Workflow Pattern

Steps execute in order; each step's output becomes the next step's input.

**When to use**: Content pipelines, data processing, ordered operations

Sequential workflow:

"""
Step 1 → Step 2 → Step 3 → Output
   ↓        ↓        ↓
   (checkpoint at each step)
"""

#### Inngest Example (TypeScript)

"""
import { inngest } from "./client";

export const processOrder = inngest.createFunction(
  { id: "process-order" },
  { event: "order/created" },
  async ({ event, step }) => {
    // Step 1: Validate order
    const validated = await step.run("validate-order", async () => {
      return validateOrder(event.data.order);
    });

    // Step 2: Process payment (durable - survives crashes)
    const payment = await step.run("process-payment", async () => {
      return chargeCard(validated.paymentMethod, validated.total);
    });

    // Step 3: Create shipment
    const shipment = await step.run("create-shipment", async () => {
      return createShipment(validated.items, validated.address);
    });

    // Step 4: Send confirmation
    await step.run("send-confirmation", async () => {
      return sendEmail(validated.email, { payment, shipment });
    });

    return { success: true, orderId: event.data.orderId };
  }
);
"""

#### Temporal Example (TypeScript)

"""
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

// The Order type is assumed to be defined elsewhere in the project

const { validateOrder, chargeCard, createShipment, sendEmail } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: '30 seconds',
    retry: {
      maximumAttempts: 3,
      backoffCoefficient: 2,
    }
  });

export async function processOrderWorkflow(order: Order): Promise<void> {
  // Each activity result is recorded in workflow history before the next call
  const validated = await validateOrder(order);
  const payment = await chargeCard(validated.paymentMethod, validated.total);
  const shipment = await createShipment(validated.items, validated.address);
  await sendEmail(validated.email, { payment, shipment });
}
"""

#### n8n Pattern

"""
[Webhook: order.created]
    ↓
[HTTP Request: Validate Order]
    ↓
[HTTP Request: Process Payment]
    ↓
[HTTP Request: Create Shipment]
    ↓
[Send Email: Confirmation]

Configure each node with retry on failure.
Use Error Trigger for dead letter handling.
"""

### Parallel Workflow Pattern

Independent steps run simultaneously, and their results are aggregated.

**When to use**: Multiple independent analyses, data from multiple sources

Parallel workflow:

"""
        ┌→ Step A ─┐
Input ──┼→ Step B ─┼→ Aggregate → Output
        └→ Step C ─┘
"""

#### Inngest Example

"""
export const analyzeDocument = inngest.createFunction(
  { id: "analyze-document" },
  { event: "document/uploaded" },
  async ({ event, step }) => {
    // Run analyses in parallel
    const [security, performance, compliance] = await Promise.all([
      step.run("security-analysis", () =>
        analyzeForSecurityIssues(event.data.document)
      ),
      step.run("performance-analysis", () =>
        analyzeForPerformance(event.data.document)
      ),
      step.run("compliance-analysis", () =>
        analyzeForCompliance(event.data.document)
      ),
    ]);

    // Aggregate results
    const report = await step.run("generate-report", () =>
      generateReport({ security, performance, compliance })
    );

    return report;
  }
);
"""

#### AWS Step Functions (Amazon States Language)

"""
{
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "SecurityAnalysis",
      "States": {
        "SecurityAnalysis": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:...:security-analyzer",
          "End": true
        }
      }
    },
    {
      "StartAt": "PerformanceAnalysis",
      "States": {
        "PerformanceAnalysis": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:...:performance-analyzer",
          "End": true
        }
      }
    }
  ],
  "Next": "AggregateResults"
}
"""

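One design question the parallel pattern leaves open is what the aggregate step should do when a single branch fails. The sketch below is one possible answer in plain TypeScript, assuming partial results are acceptable: every branch runs concurrently and failures are collected instead of rejecting the whole fan-out. `fanOut` and `BranchResult` are hypothetical names and are not part of any platform SDK.

```typescript
// Fan-out/fan-in sketch that tolerates partial failure (illustrative only).
type BranchResult<T> =
  | { name: string; ok: true; value: T }
  | { name: string; ok: false; reason: string };

async function fanOut<T>(
  branches: Record<string, () => Promise<T>>,
): Promise<{ completed: Record<string, T>; failed: Record<string, string> }> {
  // Start every branch at once; each branch converts its own error into a value
  // so one failure cannot reject the combined promise.
  const settled: BranchResult<T>[] = await Promise.all(
    Object.entries(branches).map(async ([name, run]): Promise<BranchResult<T>> => {
      try {
        return { name, ok: true, value: await run() };
      } catch (err) {
        return { name, ok: false, reason: err instanceof Error ? err.message : String(err) };
      }
    }),
  );

  // Fan-in: split results so the aggregate step can decide what to do with gaps.
  const completed: Record<string, T> = {};
  const failed: Record<string, string> = {};
  for (const result of settled) {
    if (result.ok) completed[result.name] = result.value;
    else failed[result.name] = result.reason;
  }
  return { completed, failed };
}

// Usage with stand-in analyzers (hypothetical): one branch fails, two succeed.
const analysis = fanOut<string[]>({
  security: async () => ["open redirect"],
  performance: async () => {
    throw new Error("analyzer timed out");
  },
  compliance: async () => [],
});
analysis.then((report) => console.log(report));
```

Whether to accept partial results or fail the whole workflow is a product decision; a plain `Promise.all` over the branches gives the stricter all-or-nothing behavior.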
### Orchestrator-Worker Pattern

A central coordinator dispatches work to specialized workers.

**When to use**: Complex tasks requiring different expertise, dynamic subtask creation

Orchestrator-worker pattern:

"""
┌─────────────────────────────────────┐
│            ORCHESTRATOR             │
│  - Analyzes task                    │
│  - Creates subtasks                 │
│  - Dispatches to workers            │
│  - Aggregates results               │
└─────────────────────────────────────┘
                   │
       ┌───────────┼───────────┐
       ▼           ▼           ▼
   ┌───────┐   ┌───────┐   ┌───────┐
   │Worker1│   │Worker2│   │Worker3│
   │Create │   │Modify │   │Delete │
   └───────┘   └───────┘   └───────┘
"""

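#### Temporal Example

"""
import { executeChild, proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';
// Assumed module paths for the worker workflows and the ComplexTask type
import { createWorkerWorkflow, modifyWorkerWorkflow, deleteWorkerWorkflow } from './workers';
import type { ComplexTask } from './types';

// Assumption: analyzeTask and aggregateResults are implemented as activities
const { analyzeTask, aggregateResults } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

export async function orchestratorWorkflow(task: ComplexTask) {
  // Orchestrator decides what work needs to be done
  const plan = await analyzeTask(task);

  // Dispatch to specialized worker workflows
  const results = await Promise.all(
    plan.subtasks.map(subtask => {
      switch (subtask.type) {
        case 'create':
          return executeChild(createWorkerWorkflow, { args: [subtask] });
        case 'modify':
          return executeChild(modifyWorkerWorkflow, { args: [subtask] });
        case 'delete':
          return executeChild(deleteWorkerWorkflow, { args: [subtask] });
        default:
          // Fail loudly instead of silently returning undefined
          throw new Error(`Unknown subtask type: ${subtask.type}`);
      }
    })
  );

  // Aggregate results
  return aggregateResults(results);
}
"""
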
#### Inngest with AI Orchestration

"""
export const aiOrchestrator = inngest.createFunction(
  { id: "ai-orchestrator" },
  { event: "task/complex" },
  async ({ event, step }) => {
    // AI decides what needs to be done
    const plan = await step.run("create-plan", async () => {
      return await llm.chat({
        messages: [
          { role: "system", content: "Break this task into subtasks..." },
          { role: "user", content: event.data.task }
        ]
      });
    });

    // Execute each subtask as a durable step
    const results = [];
    for (const subtask of plan.subtasks) {
      const result = await step.run(`execute-${subtask.id}`, async () => {
        return executeSubtask(subtask);
      });
      results.push(result);
    }

    // Final synthesis
    return await step.run("synthesize", async () => {
      return synthesizeResults(results);
    });
  }
);
"""

### Event-Driven Trigger Pattern

Workflows are triggered by events rather than schedules.

**When to use**: Reactive systems, user actions, webhook integrations

Event-driven triggers:

#### Inngest Event-Based

"""
// Define events with TypeScript types
type Events = {
  "user/signed.up": {
    data: { userId: string; email: string };
  };
  "order/completed": {
    data: { orderId: string; total: number };
  };
};

// Function triggered by event
export const onboardUser = inngest.createFunction(
  { id: "onboard-user" },
  { event: "user/signed.up" }, // Trigger on this event
  async ({ event, step }) => {
    // Wait 1 hour, then send welcome email
    await step.sleep("wait-for-exploration", "1 hour");

    await step.run("send-welcome", async () => {
      return sendWelcomeEmail(event.data.email);
    });

    // Wait 3 days for engagement check
    await step.sleep("wait-for-engagement", "3 days");

    const engaged = await step.run("check-engagement", async () => {
      return checkUserEngagement(event.data.userId);
    });

    if (!engaged) {
      await step.run("send-nudge", async () => {
        return sendNudgeEmail(event.data.email);
      });
    }
  }
);

// Send events from anywhere
await inngest.send({
  name: "user/signed.up",
  data: { userId: "123", email: "user@example.com" }
});
"""

#### n8n Webhook Trigger

"""
[Webhook: POST /api/webhooks/order]
    ↓
[Switch: event.type]
    ↓ order.created
[Process New Order Subworkflow]
    ↓ order.cancelled
[Handle Cancellation Subworkflow]
"""

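The Switch-on-event-type routing shown above has a direct code equivalent. The sketch below assumes a webhook body with the two event types from the diagram; `OrderEvent`, `parseOrderEvent`, `routeOrderEvent`, and the returned handler labels are hypothetical. It validates the untrusted payload first, then routes with an exhaustive `switch` so that adding a new event type without a handler becomes a compile-time error.

```typescript
// Typed event routing sketch (illustrative; field names are assumptions).
type OrderEvent =
  | { type: "order.created"; orderId: string; total: number }
  | { type: "order.cancelled"; orderId: string; reason: string };

// Validate an untrusted webhook body before it reaches any workflow.
function parseOrderEvent(body: unknown): OrderEvent {
  if (typeof body !== "object" || body === null) {
    throw new Error("Event body must be an object");
  }
  const { type, orderId, total, reason } = body as Record<string, unknown>;
  if (typeof orderId !== "string") {
    throw new Error("Missing orderId");
  }
  if (type === "order.created" && typeof total === "number") {
    return { type: "order.created", orderId, total };
  }
  if (type === "order.cancelled" && typeof reason === "string") {
    return { type: "order.cancelled", orderId, reason };
  }
  throw new Error(`Unsupported event type: ${String(type)}`);
}

// Route each event type to the name of the sub-workflow that should handle it.
function routeOrderEvent(event: OrderEvent): string {
  switch (event.type) {
    case "order.created":
      return `process-new-order:${event.orderId}`;
    case "order.cancelled":
      return `handle-cancellation:${event.orderId}`;
    default: {
      // Exhaustiveness check: this assignment fails to compile if a case is missing.
      const unhandled: never = event;
      throw new Error(`Unhandled event: ${JSON.stringify(unhandled)}`);
    }
  }
}

const routed = routeOrderEvent(
  parseOrderEvent({ type: "order.created", orderId: "o-1", total: 25 }),
);
console.log(routed);
```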
### Retry and Recovery Pattern

Failed steps are retried automatically with backoff, and failures that exhaust their retries go to dead letter handling.

**When to use**: Any workflow with external dependencies

Retry and recovery:

#### Temporal Retry Configuration

"""
import { proxyActivities } from '@temporalio/workflow';
import type * as activitiesType from './activities';

const activities = proxyActivities<typeof activitiesType>({
  startToCloseTimeout: '30 seconds',
  retry: {
    initialInterval: '1 second',
    backoffCoefficient: 2,
    maximumInterval: '1 minute',
    maximumAttempts: 5,
    nonRetryableErrorTypes: [
      'ValidationError',    // Don't retry validation failures
      'InsufficientFunds',  // Don't retry payment failures
    ]
  }
});
"""

#### Inngest Retry Configuration

"""
import { NonRetriableError } from "inngest";

export const processPayment = inngest.createFunction(
  {
    id: "process-payment",
    retries: 5, // Retry up to 5 times
  },
  { event: "payment/initiated" },
  async ({ event, step, attempt }) => {
    // attempt is 0-indexed retry count

    const result = await step.run("charge-card", async () => {
      try {
        return await stripe.charges.create({...});
      } catch (error) {
        // Narrow the unknown error before reading its code
        if ((error as { code?: string }).code === 'card_declined') {
          // Don't retry card declines
          throw new NonRetriableError("Card declined");
        }
        throw error; // Retry other errors
      }
    });

    return result;
  }
);
"""

#### Dead Letter Handling

"""
// n8n: Use Error Trigger node
[Error Trigger]
    ↓
[Log to Error Database]
    ↓
[Send Alert to Slack]
    ↓
[Create Ticket in Jira]

// Inngest: Handle in onFailure
export const myFunction = inngest.createFunction(
  {
    id: "my-function",
    onFailure: async ({ error, event, step }) => {
      await step.run("alert-team", async () => {
        await slack.postMessage({
          channel: "#errors",
          text: `Function failed: ${error.message}`
        });
      });
    }
  },
  { event: "..." },
  async ({ step }) => { ... }
);
"""

### Scheduled Workflow Pattern

Time-based triggers start workflows for recurring tasks.

**When to use**: Daily reports, periodic sync, batch processing

Scheduled workflows:

#### Inngest Cron

"""
export const dailyReport = inngest.createFunction(
  { id: "daily-report" },
  { cron: "0 9 * * *" }, // Every day at 9 AM
  async ({ step }) => {
    const data = await step.run("gather-metrics", async () => {
      return gatherDailyMetrics();
    });

    await step.run("generate-report", async () => {
      return generateAndSendReport(data);
    });
  }
);

export const syncInventory = inngest.createFunction(
  { id: "sync-inventory" },
  { cron: "*/15 * * * *" }, // Every 15 minutes
  async ({ step }) => {
    await step.run("sync", async () => {
      return syncWithSupplier();
    });
  }
);
"""

#### Temporal Cron Workflow

"""
// Schedule workflow to run on cron
const handle = await client.workflow.start(dailyReportWorkflow, {
  taskQueue: 'reports',
  workflowId: 'daily-report',
  cronSchedule: '0 9 * * *', // 9 AM daily
});
"""

#### n8n Schedule Trigger

"""
[Schedule Trigger: Every day at 9:00 AM]
    ↓
[HTTP Request: Get Metrics]
    ↓
[Code Node: Generate Report]
    ↓
[Send Email: Report]
"""

## Sharp Edges

### Non-Idempotent Steps in Durable Workflows

Severity: CRITICAL

Situation: Writing workflow steps that modify external state

Symptoms:
Customer charged twice. Email sent three times. Database record
created multiple times. Workflow retries cause duplicate side effects.

Why this breaks:
Durable execution gives each step at-least-once execution, not exactly-once.
Completed steps are normally replayed from their recorded results, but a step
that fails partway, times out, or finishes just before the worker crashes
(so its result is never recorded) runs again on retry. Without idempotency
keys, external services can't tell that these calls are retries.

Recommended fix:

Always use idempotency keys for external calls:

#### Stripe example:

"""
await stripe.paymentIntents.create(
  {
    amount: 1000,
    currency: 'usd',
  },
  // Critical! The key is a request option (second argument), not a payment field
  { idempotencyKey: `order-${orderId}-payment` }
);
"""

#### Email example:

"""
await step.run("send-confirmation", async () => {
  const alreadySent = await checkEmailSent(orderId);
  if (alreadySent) return { skipped: true };
  return sendEmail(customer, orderId);
});
"""

#### Database example:

"""
await db.query(`
  INSERT INTO orders (id, ...) VALUES ($1, ...)
  ON CONFLICT (id) DO NOTHING
`, [orderId]);
"""

Generate idempotency keys from stable inputs, not random values.

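The advice to derive idempotency keys from stable inputs can be illustrated with a small self-contained sketch. It assumes a fake in-memory payment API that deduplicates on the key, which is how idempotent endpoints are expected to behave; `idempotencyKey` and `FakePaymentApi` are hypothetical names, not any provider's SDK.

```typescript
// Build a key from stable business inputs so every retry produces the same key.
function idempotencyKey(scope: string, parts: Record<string, string | number>): string {
  const stable = Object.keys(parts)
    .sort() // property order must not change the key
    .map((name) => `${name}=${parts[name]}`)
    .join("&");
  return `${scope}:${stable}`;
}

// Stand-in for an external API that honors idempotency keys (assumption).
class FakePaymentApi {
  private seen = new Map<string, string>();
  public chargesCreated = 0;

  charge(amountCents: number, key: string): string {
    const existing = this.seen.get(key);
    if (existing !== undefined) {
      return existing; // a retry: return the original result, charge nothing
    }
    this.chargesCreated += 1;
    const chargeId = `ch_${this.chargesCreated}_${amountCents}`;
    this.seen.set(key, chargeId);
    return chargeId;
  }
}

const api = new FakePaymentApi();
const paymentKey = idempotencyKey("payment", { orderId: "o-1", amountCents: 1000 });

const firstCharge = api.charge(1000, paymentKey);
const retriedCharge = api.charge(1000, paymentKey); // simulated workflow retry
console.log(paymentKey, firstCharge, retriedCharge, api.chargesCreated);
```

A key built from `Math.random()` or a fresh UUID inside the step would differ on every retry and defeat the deduplication, which is why the inputs need to be stable.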
### Workflow Runs for Hours/Days Without Checkpoints

Severity: HIGH

Situation: Long-running workflows with infrequent steps

Symptoms:
Memory consumption grows. Worker timeouts. Lost progress after
crashes. "Workflow exceeded maximum duration" errors.

Why this breaks:
Workflows hold state in memory until checkpointed. A workflow that
runs for 24 hours with one step per hour accumulates state for 24h.
Workers have memory limits. Functions have execution time limits.

Recommended fix:

Break long workflows into checkpointed steps:

#### WRONG - one long step:

"""
await step.run("process-all", async () => {
  for (const item of thousandItems) {
    await processItem(item); // Hours of work, one checkpoint
  }
});
"""

#### CORRECT - many small steps:

"""
for (const item of thousandItems) {
  await step.run(`process-${item.id}`, async () => {
    return processItem(item); // Checkpoint after each
  });
}
"""

For very long waits, use sleep:

"""
await step.sleep("wait-for-trial", "14 days");
// Doesn't consume resources while waiting
"""

Consider child workflows for long processes:

"""
await step.invoke("process-batch", {
  function: batchProcessor,
  data: { items: batch }
});
"""

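Between "one giant step" and "one step per item" there is a middle ground: fixed-size batches with step IDs derived from the batch index. The helper below is a minimal sketch of that idea; `chunk` and `batchStepIds` are hypothetical helpers, and the right batch size depends on step limits and per-step time limits that vary by platform, so treat the numbers as placeholders.

```typescript
// Split items into fixed-size batches so each batch can become one checkpointed step.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}

// Derive step IDs from the batch index so they are identical on every replay.
function batchStepIds(prefix: string, itemCount: number, size: number): string[] {
  const batchCount = Math.ceil(itemCount / size);
  return Array.from({ length: batchCount }, (_, index) => `${prefix}-batch-${index}`);
}

const batches = chunk([1, 2, 3, 4, 5], 2);
const stepIds = batchStepIds("process", 5, 2);
console.log(batches, stepIds);
```

Larger batches mean fewer checkpoints and more repeated work after a crash; smaller batches mean more steps in the run history. A crash in the middle of a batch re-runs that whole batch, so the per-item work still needs to be idempotent.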
### Activities Without Timeout Configuration

Severity: HIGH

Situation: Calling external services from workflow activities

Symptoms:
Workflows hang indefinitely. Worker pool exhausted. Dead workflows
that never complete or fail. Manual intervention needed to kill stuck
workflows.

Why this breaks:
External APIs can hang forever. Without a timeout, your workflow waits
forever too. Do not assume the platform applies a sensible default:
set explicit timeouts on every activity or step that calls out.

Recommended fix:

Always set timeouts on activities:

#### Temporal:

"""
const activities = proxyActivities<typeof activitiesType>({
  startToCloseTimeout: '30 seconds', // Required!
  scheduleToCloseTimeout: '5 minutes',
  heartbeatTimeout: '10 seconds', // For long activities
  retry: {
    maximumAttempts: 3,
    initialInterval: '1 second',
  }
});
"""

#### Inngest:

"""
// Bound the outbound call itself with an abort signal
const data = await step.run("call-api", async () => {
  const res = await fetch(url, { signal: AbortSignal.timeout(25000) });
  return res.json(); // Return serializable data, not the Response object
});
"""

#### AWS Step Functions:

"""
{
  "Type": "Task",
  "TimeoutSeconds": 30,
  "HeartbeatSeconds": 10,
  "Resource": "arn:aws:lambda:..."
}
"""

Rule: Activity timeout < Workflow timeout

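When a client library offers no timeout option of its own, the call can be wrapped. The helper below is a minimal sketch, assuming the wrapped work accepts an `AbortSignal` so it can actually stop when the deadline passes; `withTimeout` is a hypothetical helper, not a platform API.

```typescript
// Reject if `work` does not settle within `ms`, and signal it to abort.
function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort(); // lets cooperative work (e.g. fetch) stop early
      reject(new Error(`Timed out after ${ms} ms`));
    }, ms);

    work(controller.signal).then(
      (value) => {
        clearTimeout(timer); // do not keep the process alive after success
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Usage: a fast task resolves normally, a task that never settles is cut off.
const fast = withTimeout(async () => "ok", 1000);
const hung = withTimeout(() => new Promise<string>(() => {}), 20);

fast.then((value) => console.log("fast:", value));
hung.catch((error: Error) => console.log("hung:", error.message));
```

Keep the inner deadline shorter than the step or activity timeout so the code fails with a clear error of its own before the platform kills it.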
### Side Effects Outside Step/Activity Boundaries

Severity: CRITICAL

Situation: Writing code that runs during workflow replay

Symptoms:
Random failures on replay. "Workflow corrupted" errors. Different
behavior on replay than initial run. Non-determinism errors.

Why this breaks:
Workflow code runs on EVERY replay. If you generate a random ID in
workflow code, you get a different ID each replay. If you read the
current time, you get a different time. This breaks determinism.

Recommended fix:

WRONG - side effects in workflow code (non-deterministic unless the runtime sandboxes them):

"""
export async function orderWorkflow(order) {
  const orderId = uuid();  // Different every replay!
  const now = new Date();  // Different every replay!
  await activities.process(orderId, now);
}
"""

CORRECT - side effects in activities:

"""
export async function orderWorkflow(order) {
  const orderId = await activities.generateOrderId();  // Recorded
  const now = await activities.getCurrentTime();       // Recorded
  await activities.process(orderId, now);
}
"""

Also CORRECT - the replay-safe helpers your SDK provides. In the Temporal
TypeScript SDK the workflow sandbox supplies deterministic versions:

"""
import { uuid4 } from '@temporalio/workflow';

const orderId = uuid4();  // Deterministic, replay-safe UUID
const now = new Date();   // The sandbox replaces Date with a replay-safe clock
"""

Other Temporal SDKs expose equivalents, such as workflow.now() in Python
and SideEffect in Go.

Side effects that are safe in workflow code:

- Reading function arguments
- Simple calculations (no randomness)
- Logging (usually)

### Retry Configuration Without Exponential Backoff

Severity: MEDIUM

Situation: Configuring retry behavior for failing steps

Symptoms:
Overwhelming failing services. Rate limiting. Cascading failures.
Retry storms causing outages. Being blocked by external APIs.

Why this breaks:
When a service is struggling, immediate retries make it worse.
100 workflows retrying instantly = 100 requests hitting a service
that's already failing. Backoff gives the service time to recover.

Recommended fix:

Always use exponential backoff:

#### Temporal:

"""
const activities = proxyActivities({
  startToCloseTimeout: '30 seconds', // A timeout is required alongside the retry policy
  retry: {
    initialInterval: '1 second',
    backoffCoefficient: 2,        // Waits of 1s, 2s, 4s, 8s between the 5 attempts
    maximumInterval: '1 minute',  // Cap the backoff
    maximumAttempts: 5,
  }
});
"""

#### Inngest (built-in backoff):

"""
{
  id: "my-function",
  retries: 5, // Uses exponential backoff by default
}
"""

#### Manual backoff:

"""
const backoff = (attempt) => {
  const base = 1000;
  const max = 60000;
  const delay = Math.min(base * Math.pow(2, attempt), max);
  const jitter = delay * 0.1 * Math.random(); // Up to 10% extra
  return delay + jitter;
};
"""

Add jitter to prevent a thundering herd.

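To make the arithmetic behind a capped exponential policy concrete, the sketch below computes the wait before each retry. It mirrors the Temporal-style policy fields used above but is plain TypeScript; `RetryPolicy`, `retryDelays`, and `fullJitter` are hypothetical names. It assumes `maximumAttempts` counts the first try, so a policy with 5 attempts has 4 waits.

```typescript
interface RetryPolicy {
  initialIntervalMs: number;
  backoffCoefficient: number;
  maximumIntervalMs: number;
  maximumAttempts: number; // includes the first attempt
}

// Wait (in ms) before retry 1, retry 2, ... with the cap applied.
function retryDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    const uncapped = policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, retry);
    delays.push(Math.min(uncapped, policy.maximumIntervalMs));
  }
  return delays;
}

// "Full jitter": pick a random wait in [0, delay) so retries from many
// workflows spread out instead of arriving together.
function fullJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * delayMs);
}

const fiveAttempts = retryDelays({
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 60000,
  maximumAttempts: 5,
});
// fiveAttempts is [1000, 2000, 4000, 8000]

const eightAttempts = retryDelays({
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 60000,
  maximumAttempts: 8,
});
// eightAttempts is [1000, 2000, 4000, 8000, 16000, 32000, 60000]; the last wait hits the cap

console.log(fiveAttempts, eightAttempts, fullJitter(8000));
```

Full jitter trades a predictable schedule for better spreading; the 10% additive jitter in the manual example above keeps delays closer to the nominal schedule.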
### Storing Large Data in Workflow State

Severity: HIGH

Situation: Passing large payloads between workflow steps

Symptoms:
Slow workflow execution. Memory errors. "Payload too large" errors.
Expensive storage costs. Slow replays.

Why this breaks:
Workflow state is persisted and replayed. A 10MB payload is stored,
serialized, and deserialized on every step. This adds latency and
cost. Some platforms have hard limits (e.g., Step Functions 256KB).

Recommended fix:

WRONG - large data in workflow:

"""
await step.run("fetch-data", async () => {
  const largeDataset = await fetchAllRecords(); // 100MB!
  return largeDataset; // Stored in workflow state
});
"""

CORRECT - store reference, not data:

"""
const fetchResult = await step.run("fetch-data", async () => {
  const largeDataset = await fetchAllRecords();
  const s3Key = await uploadToS3(largeDataset);
  return { s3Key }; // Just the reference
});

const processed = await step.run("process-data", async () => {
  const data = await downloadFromS3(fetchResult.s3Key);
  return processData(data);
});
"""

For Step Functions, use S3 for large payloads (AWS SDK service integration):

"""
{
  "Type": "Task",
  "Resource": "arn:aws:states:::aws-sdk:s3:putObject",
  "Parameters": {
    "Bucket": "my-bucket",
    "Key.$": "$.outputKey",
    "Body.$": "$.largeData"
  }
}
"""

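The "store a reference, not the data" fix is often called the claim-check pattern. The sketch below is one way to apply it automatically, assuming an in-memory map stands in for S3 or a database and using the 256 KB figure mentioned above as a default threshold; `toStepOutput`, `InMemoryBlobStore`, and `payloadBytes` are hypothetical names.

```typescript
const MAX_STATE_BYTES = 256 * 1024; // default threshold; adjust to your platform's limit

// Size of the value once serialized as UTF-8 JSON.
function payloadBytes(value: unknown): number {
  return new TextEncoder().encode(JSON.stringify(value)).length;
}

// Stand-in for S3, GCS, or a database table (assumption).
class InMemoryBlobStore {
  private blobs = new Map<string, string>();
  put(key: string, value: unknown): void {
    this.blobs.set(key, JSON.stringify(value));
  }
  get<T>(key: string): T {
    const raw = this.blobs.get(key);
    if (raw === undefined) throw new Error(`No blob stored under ${key}`);
    return JSON.parse(raw) as T;
  }
}

type StepOutput<T> =
  | { kind: "inline"; value: T }
  | { kind: "ref"; key: string; bytes: number };

// Return small values inline; park large ones in the blob store and return a reference.
function toStepOutput<T>(
  stepId: string,
  value: T,
  store: InMemoryBlobStore,
  limitBytes: number = MAX_STATE_BYTES,
): StepOutput<T> {
  const bytes = payloadBytes(value);
  if (bytes <= limitBytes) {
    return { kind: "inline", value };
  }
  const key = `step-output/${stepId}`; // stable key, so a retried step overwrites its own blob
  store.put(key, value);
  return { kind: "ref", key, bytes };
}

const blobStore = new InMemoryBlobStore();
const small = toStepOutput("summarize", { rows: 3 }, blobStore);
const large = toStepOutput("fetch-data", "x".repeat(300 * 1024), blobStore);
console.log(small.kind, large.kind);
```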
### Missing Dead Letter Queue or Failure Handler

Severity: HIGH

Situation: Workflows that exhaust all retries

Symptoms:
Failed workflows silently disappear. No alerts when things break.
Customer issues discovered days later. Manual recovery impossible.

Why this breaks:
Even with retries, some workflows will fail permanently. Without
dead letter handling, you don't know they failed. The customer
waits forever, you're unaware, and there's no data to debug.

Recommended fix:

Inngest onFailure handler:

"""
export const myFunction = inngest.createFunction(
  {
    id: "process-order",
    onFailure: async ({ error, event, step }) => {
      // In onFailure, `event` is the failure event; the triggering event is nested inside it
      const original = event.data.event;

      // Log to error tracking
      await step.run("log-error", () =>
        sentry.captureException(error, { extra: { event: original } })
      );

      // Alert team
      await step.run("alert", () =>
        slack.postMessage({
          channel: "#alerts",
          text: `Order ${original.data.orderId} failed: ${error.message}`
        })
      );

      // Queue for manual review
      await step.run("queue-review", () =>
        db.insert(failedOrders, {
          orderId: original.data.orderId,
          error: error.message,
          event: original,
        })
      );
    }
  },
  { event: "order/created" },
  async ({ event, step }) => { ... }
);
"""

n8n Error Trigger:

"""
[Error Trigger] → [Log to DB] → [Slack Alert] → [Create Ticket]
"""

Temporal: catch failures inside the workflow (try/catch around activity calls)
and run an alerting or compensating activity, or handle the failure where the
client awaits the workflow result.

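The essence of dead letter handling is small: when retries are exhausted, record enough context to investigate and replay later instead of dropping the failure. The sketch below shows that shape in plain TypeScript, assuming an array stands in for a real queue or table and omitting backoff waits for brevity; `runWithDeadLetter` and `DeadLetter` are hypothetical names.

```typescript
interface DeadLetter<I> {
  input: I;          // original payload, kept so the job can be replayed
  error: string;     // last error message seen
  attempts: number;  // how many times the handler ran
  failedAt: string;  // ISO timestamp for triage
}

// Try the handler up to maxAttempts times; on exhaustion, park the input
// in the dead letter list instead of losing it.
async function runWithDeadLetter<I, O>(
  input: I,
  handler: (input: I, attempt: number) => Promise<O>,
  deadLetters: DeadLetter<I>[],
  maxAttempts: number = 3,
): Promise<O | undefined> {
  let lastError = "unknown error";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await handler(input, attempt);
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  deadLetters.push({
    input,
    error: lastError,
    attempts: maxAttempts,
    failedAt: new Date().toISOString(),
  });
  return undefined;
}

// Usage: one order recovers on its second attempt, another fails permanently.
const deadLetters: DeadLetter<{ orderId: string }>[] = [];

const recovered = runWithDeadLetter(
  { orderId: "o-1" },
  async (order, attempt) => {
    if (attempt === 0) throw new Error("transient network error");
    return `processed ${order.orderId}`;
  },
  deadLetters,
);

const abandoned = runWithDeadLetter(
  { orderId: "o-2" },
  async () => {
    throw new Error("inventory service rejected order");
  },
  deadLetters,
);

Promise.all([recovered, abandoned]).then((results) => console.log(results, deadLetters));
```

In production the dead letter record would go to durable storage and trigger an alert, as in the handlers above.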
### n8n Workflow Without Error Trigger

Severity: MEDIUM

Situation: Building production n8n workflows

Symptoms:
Workflow fails silently. Errors only visible in execution logs.
No alerts, no recovery, no visibility until someone notices.

Why this breaks:
n8n doesn't notify on failure by default. Without an error workflow
(one that starts with an Error Trigger node) connected to alerting,
failures are only visible in the UI. Production failures go unnoticed.

Recommended fix:

Every production n8n workflow needs:

1. Error Trigger node
   - Lives in an error workflow that is assigned in the production workflow's settings
   - Catches any node failure in the workflow
   - Provides error details and context

2. Connected error handling:

"""
[Error Trigger]
    ↓
[Set: Extract Error Details]
    ↓
[HTTP: Log to Error Service]
    ↓
[Slack/Email: Alert Team]
"""

3. Consider dead letter pattern:

"""
[Error Trigger]
    ↓
[Redis/Postgres: Store Failed Job]
    ↓
[Separate Recovery Workflow]
"""

Also use:

- Retry on node failures (built-in)
- Node timeout settings
- Workflow timeout

### Long-Running Temporal Activities Without Heartbeat

Severity: MEDIUM

Situation: Activities that run for more than a few seconds

Symptoms:
Activity timeouts even when work is progressing. Lost work when
workers restart. Can't cancel long-running activities.

Why this breaks:
Temporal detects stuck activities via heartbeat. Without heartbeats,
Temporal can't tell whether an activity is working or stuck. Long activities
appear hung, may time out, and can't be gracefully cancelled, because
cancellation requests reach an activity through its heartbeats.

Recommended fix:

For any activity > 10 seconds, add heartbeat:

"""
import { Context, CancelledFailure } from '@temporalio/activity';

export async function processLargeFile(fileUrl: string): Promise<void> {
  const ctx = Context.current();
  const chunks = await downloadChunks(fileUrl);

  for (let i = 0; i < chunks.length; i++) {
    // Check for cancellation
    if (ctx.cancellationSignal.aborted) {
      throw new CancelledFailure('Activity cancelled');
    }

    await processChunk(chunks[i]);

    // Report progress
    ctx.heartbeat({ progress: (i + 1) / chunks.length });
  }
}
"""

Configure heartbeat timeout:

"""
const activities = proxyActivities({
  startToCloseTimeout: '10 minutes',
  heartbeatTimeout: '30 seconds', // Must heartbeat every 30s
});
"""

If no heartbeat arrives for 30s, the activity is considered stuck.

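The server side of heartbeating can be pictured as a watchdog that compares the time of the last heartbeat against the configured timeout. The sketch below is a simplified model of that idea and not Temporal's implementation; `HeartbeatMonitor` is a hypothetical class, and time is passed in explicitly so the behavior is easy to follow.

```typescript
// Simplified model of heartbeat-based liveness detection (illustrative only).
class HeartbeatMonitor {
  private readonly timeoutMs: number;
  private lastBeatMs: number;
  private lastDetails: unknown = undefined;

  constructor(timeoutMs: number, startedAtMs: number) {
    this.timeoutMs = timeoutMs;
    this.lastBeatMs = startedAtMs;
  }

  // Called by the worker each time it makes progress.
  beat(nowMs: number, details?: unknown): void {
    this.lastBeatMs = nowMs;
    this.lastDetails = details;
  }

  // True once more than `timeoutMs` has passed since the last heartbeat.
  isStuck(nowMs: number): boolean {
    return nowMs - this.lastBeatMs > this.timeoutMs;
  }

  // Last reported progress, which a retried attempt could use to resume.
  resumePoint(): unknown {
    return this.lastDetails;
  }
}

const monitor = new HeartbeatMonitor(30000, 0);
monitor.beat(10000, { chunk: 1 });

const healthyAt35s = monitor.isStuck(35000); // 25s since last beat: still healthy
const stuckAt41s = monitor.isStuck(41000);   // 31s since last beat: considered stuck
console.log(healthyAt35s, stuckAt41s, monitor.resumePoint());
```

This is also why heartbeat details are worth sending: when an attempt is declared stuck and retried, the last reported progress tells the next attempt where to pick up.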
## Validation Checks

### External Calls Without Idempotency Key

Severity: ERROR

Stripe/payment calls should use idempotency keys

Message: Payment call without an idempotency key. Add an idempotency key to prevent duplicate charges on retry.

### Email Sending Without Deduplication

Severity: WARNING

Email sends in workflows should check for already-sent

Message: Email sent in workflow without deduplication check. Retries may send duplicate emails.

### Temporal Activities Without Timeout

Severity: ERROR

All Temporal activities need timeout configuration

Message: proxyActivities without timeout. Add startToCloseTimeout to prevent indefinite hangs.

### Inngest Steps Calling External APIs Without Timeout

Severity: WARNING

External API calls should have timeouts

Message: External API call in step without timeout. Add timeout to prevent workflow hangs.

### Random Values in Workflow Code

Severity: ERROR

Random values break determinism on replay

Message: Random value in workflow code. Move to activity/step or use the SDK's replay-safe helper (e.g. uuid4 or sideEffect).

### Date.now() in Workflow Code

Severity: ERROR

Current time breaks determinism on replay

Message: Current time in workflow code. Use the SDK's replay-safe clock (e.g. workflow.now()) or move to activity/step.

### Inngest Function Without onFailure Handler

Severity: WARNING

Production functions should have failure handlers

Message: Inngest function without onFailure handler. Add failure handling for production reliability.

### Step Without Error Handling

Severity: WARNING

Steps should handle errors gracefully

Message: Step without try/catch. Consider handling specific error cases.

### Potentially Large Data Returned from Step

Severity: INFO

Large data in workflow state slows execution

Message: Returning potentially large data from step. Consider storing in S3/DB and returning reference.

### Retry Without Backoff Configuration

Severity: WARNING

Retries should use exponential backoff

Message: Retry configured without backoff. Add backoffCoefficient and initialInterval.

## Collaboration

### Delegation Triggers

- user needs multi-agent coordination -> multi-agent-orchestration (Workflow provides infrastructure, orchestration provides patterns)
- user needs tool building for workflows -> agent-tool-builder (Tools that workflows can invoke)
- user needs Zapier/Make integration -> zapier-make-patterns (No-code automation platforms)
- user needs browser automation in workflow -> browser-automation (Playwright/Puppeteer activities)
- user needs computer control in workflow -> computer-use-agents (Desktop automation activities)
- user needs LLM integration in workflow -> llm-architect (AI-powered workflow steps)

## Related Skills

Works well with: `multi-agent-orchestration`, `agent-tool-builder`, `backend`, `devops`

## When to Use

- User mentions or implies: workflow
- User mentions or implies: automation
- User mentions or implies: n8n
- User mentions or implies: temporal
- User mentions or implies: inngest
- User mentions or implies: step function
- User mentions or implies: background job
- User mentions or implies: durable execution
- User mentions or implies: event-driven
- User mentions or implies: scheduled task
- User mentions or implies: job queue
- User mentions or implies: cron
- User mentions or implies: trigger

## Limitations

- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.