
# Testing Superpowers Skills

This document describes how to test Superpowers skills, particularly the integration tests for complex skills like `subagent-driven-development`.

## Overview

Testing skills that involve subagents, workflows, and complex interactions requires running actual Claude Code sessions in headless mode and verifying their behavior through session transcripts.

## Test Structure

```
tests/
├── claude-code/
│   ├── test-helpers.sh                    # Shared test utilities
│   ├── test-subagent-driven-development-integration.sh
│   ├── analyze-token-usage.py             # Token analysis tool
│   └── run-skill-tests.sh                 # Test runner (if present)
```

## Running Tests

### Integration Tests

Integration tests execute real Claude Code sessions with actual skills:

```bash
# Run the subagent-driven-development integration test
cd tests/claude-code
./test-subagent-driven-development-integration.sh
```

**Note:** Integration tests can take 10-30 minutes as they execute real implementation plans with multiple subagents.

### Requirements

- Must run from the superpowers plugin directory (not from a temp directory)
- Claude Code must be installed and available as the `claude` command
- The local dev marketplace must be enabled: `"superpowers@superpowers-dev": true` in `~/.claude/settings.json` (see the fragment below)
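
For reference, the relevant fragment of `~/.claude/settings.json` looks roughly like this (a sketch; a real settings file contains other keys alongside `enabledPlugins`):

```json
{
  "enabledPlugins": {
    "superpowers@superpowers-dev": true
  }
}
```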

## Integration Test: `subagent-driven-development`

### What It Tests

The integration test verifies that the `subagent-driven-development` skill:

1. **Plan Loading**: Reads the plan once at the beginning
2. **Full Task Text**: Provides complete task descriptions to subagents (doesn't make them read files)
3. **Self-Review**: Ensures subagents perform self-review before reporting
4. **Review Order**: Runs spec compliance review before code quality review
5. **Review Loops**: Uses review loops when issues are found
6. **Independent Verification**: Spec reviewer reads code independently, doesn't trust implementer reports

### How It Works

1. **Setup**: Creates a temporary Node.js project with a minimal implementation plan
2. **Execution**: Runs Claude Code in headless mode with the skill
3. **Verification**: Parses the session transcript (`.jsonl` file) to verify the following (a sketch of such checks appears after this list):
   - Skill tool was invoked
   - Subagents were dispatched (Task tool)
   - TodoWrite was used for tracking
   - Implementation files were created
   - Tests pass
   - Git commits show proper workflow
4. **Token Analysis**: Shows token usage breakdown by subagent
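
As a rough sketch of what the verification step looks like, checks can be written as line-oriented greps over the transcript, in the same style as the template later in this document. The `"name":"Task"` and `"name":"TodoWrite"` patterns are assumptions mirroring the `"name":"Skill"` pattern that the template greps for:

```bash
# Sketch of transcript checks for the verification step above. Assumes each
# tool invocation shows up as a '"name":"<Tool>"' fragment on some line of
# the .jsonl transcript (one message per line), as the template's Skill
# check implies. grep -c counts matching lines, which is close enough for
# a pass/fail check.
SESSION_FILE="$HOME/.claude/projects/<project-dir>/<session-id>.jsonl"  # placeholder

TASKS=$(grep -c '"name":"Task"' "$SESSION_FILE" || true)        # subagent dispatches
TODOS=$(grep -c '"name":"TodoWrite"' "$SESSION_FILE" || true)   # task tracking
echo "Task dispatches: $TASKS, TodoWrite calls: $TODOS"
```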

### Test Output

```
========================================
 Integration Test: subagent-driven-development
========================================

Test project: /tmp/tmp.xyz123

=== Verification Tests ===

Test 1: Skill tool invoked...
  [PASS] subagent-driven-development skill was invoked

Test 2: Subagents dispatched...
  [PASS] 7 subagents dispatched

Test 3: Task tracking...
  [PASS] TodoWrite used 5 time(s)

Test 6: Implementation verification...
  [PASS] src/math.js created
  [PASS] add function exists
  [PASS] multiply function exists
  [PASS] test/math.test.js created
  [PASS] Tests pass

Test 7: Git commit history...
  [PASS] Multiple commits created (3 total)

Test 8: No extra features added...
  [PASS] No extra features added

========================================
 Token Usage Analysis
========================================

Usage Breakdown:
----------------------------------------------------------------------------------------------------
Agent           Description                          Msgs      Input     Output      Cache     Cost
----------------------------------------------------------------------------------------------------
main            Main session (coordinator)             34         27      3,996  1,213,703 $   4.09
3380c209        implementing Task 1: Create Add Function     1          2        787     24,989 $   0.09
34b00fde        implementing Task 2: Create Multiply Function     1          4        644     25,114 $   0.09
3801a732        reviewing whether an implementation matches...   1          5        703     25,742 $   0.09
4c142934        doing a final code review...                    1          6        854     25,319 $   0.09
5f017a42        a code reviewer. Review Task 2...               1          6        504     22,949 $   0.08
a6b7fbe4        a code reviewer. Review Task 1...               1          6        515     22,534 $   0.08
f15837c0        reviewing whether an implementation matches...   1          6        416     22,485 $   0.07
----------------------------------------------------------------------------------------------------

TOTALS:
  Total messages:         41
  Input tokens:           62
  Output tokens:          8,419
  Cache creation tokens:  132,742
  Cache read tokens:      1,382,835

  Total input (incl cache): 1,515,639
  Total tokens:             1,524,058

  Estimated cost: $4.67
  (at $3/$15 per M tokens for input/output)

========================================
 Test Summary
========================================

STATUS: PASSED
```

## Token Analysis Tool

### Usage

Analyze token usage from any Claude Code session:

```bash
python3 tests/claude-code/analyze-token-usage.py ~/.claude/projects/<project-dir>/<session-id>.jsonl
```

### Finding Session Files

Session transcripts are stored in `~/.claude/projects/` with the working directory path encoded:

```bash
# Example for /Users/jesse/Documents/GitHub/superpowers/superpowers
SESSION_DIR="$HOME/.claude/projects/-Users-jesse-Documents-GitHub-superpowers-superpowers"

# Find recent sessions
ls -lt "$SESSION_DIR"/*.jsonl | head -5
```

### What It Shows

- **Main session usage**: Token usage by the coordinator (you or the main Claude instance)
- **Per-subagent breakdown**: Each Task invocation with:
  - Agent ID
  - Description (extracted from the prompt)
  - Message count
  - Input/output tokens
  - Cache usage
  - Estimated cost
- **Totals**: Overall token usage and cost estimate

### Understanding the Output

- **High cache reads**: Good; it means prompt caching is working
- **High input tokens on main**: Expected; the coordinator holds the full context
- **Similar costs per subagent**: Expected; each subagent gets a task of similar complexity
- **Cost per task**: Typically $0.05-$0.15 per subagent, depending on the task (see the worked estimate below)
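
The estimate is easy to sanity-check by hand. Using the totals from the sample run above, and assuming the analyzer applies the flat $3/M input rate to cache reads as well (which matches its "Total input (incl cache)" line):

```bash
# Reproduce the $4.67 estimate from the sample run: 1,515,639 total input
# tokens (incl. cache) at $3/M plus 8,419 output tokens at $15/M.
awk 'BEGIN { printf "$%.2f\n", (1515639 * 3 + 8419 * 15) / 1e6 }'
# => $4.67
```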

## Troubleshooting

### Skills Not Loading

**Problem**: Skill not found when running headless tests

**Solutions**:

1. Ensure you're running *from* the superpowers directory: `cd /path/to/superpowers && tests/...`
2. Check that `~/.claude/settings.json` has `"superpowers@superpowers-dev": true` under `enabledPlugins`
3. Verify the skill exists in the `skills/` directory

### Permission Errors

**Problem**: Claude is blocked from writing files or accessing directories

**Solutions**:

1. Use the `--permission-mode bypassPermissions` flag
2. Use `--add-dir /path/to/temp/dir` to grant access to test directories (both flags are combined in the example below)
3. Check file permissions on the test directories
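
For example, a minimal invocation combining both flags (the prompt and directory are placeholders):

```bash
# Grant write access to the test directory and skip permission prompts;
# adjust the prompt and path to your setup.
claude -p "Your test prompt here" \
  --permission-mode bypassPermissions \
  --add-dir /tmp/my-test-project
```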

### Test Timeouts

**Problem**: The test takes too long and times out

**Solutions**:

1. Increase the timeout: `timeout 1800 claude ...` (30 minutes)
2. Check for infinite loops in the skill logic
3. Review subagent task complexity

### Session File Not Found

**Problem**: Can't find the session transcript after a test run

**Solutions**:

1. Check the correct project directory under `~/.claude/projects/`
2. Use `find ~/.claude/projects -name "*.jsonl" -mmin -60` to find recent sessions
3. Verify the test actually ran (check for errors in the test output)

## Writing New Integration Tests

### Template

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/test-helpers.sh"

# Create a test project and make sure it is cleaned up on exit
TEST_PROJECT=$(create_test_project)
trap 'cleanup_test_project "$TEST_PROJECT"' EXIT

# Set up test files...
cd "$TEST_PROJECT"

# Run Claude with the skill (from the plugin directory, so skills load)
PROMPT="Your test prompt here"
cd "$SCRIPT_DIR/../.." && timeout 1800 claude -p "$PROMPT" \
  --allowed-tools=all \
  --add-dir "$TEST_PROJECT" \
  --permission-mode bypassPermissions \
  2>&1 | tee output.txt

# Find and analyze the session: resolve the working directory, then apply
# the encoding used under ~/.claude/projects (slashes become dashes,
# keeping the leading dash, e.g. /Users/jesse/... -> -Users-jesse-...)
WORKING_DIR="$(cd "$SCRIPT_DIR/../.." && pwd)"
WORKING_DIR_ESCAPED="$(echo "$WORKING_DIR" | sed 's|/|-|g')"
SESSION_DIR="$HOME/.claude/projects/$WORKING_DIR_ESCAPED"
SESSION_FILE=$(find "$SESSION_DIR" -name "*.jsonl" -type f -mmin -60 | sort -r | head -1)

# Verify behavior by parsing the session transcript
if grep -q '"name":"Skill".*"skill":"your-skill-name"' "$SESSION_FILE"; then
    echo "[PASS] Skill was invoked"
fi

# Show token analysis
python3 "$SCRIPT_DIR/analyze-token-usage.py" "$SESSION_FILE"
```

### Best Practices

1. **Always clean up**: Use `trap` to clean up temp directories
2. **Parse transcripts**: Don't grep user-facing output; parse the `.jsonl` session file (see the `jq` sketch below)
3. **Grant permissions**: Use `--permission-mode bypassPermissions` and `--add-dir`
4. **Run from the plugin dir**: Skills only load when running from the superpowers directory
5. **Show token usage**: Always include token analysis for cost visibility
6. **Test real behavior**: Verify actual files created, tests passing, commits made
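
As an illustration of practice 2, the template's grep can be replaced with a structured query when `jq` is available. The content-block shape assumed here (`tool_use` entries with `name` and `input.skill`) is inferred from the template's grep pattern and the transcript fields documented in the next section, so treat it as a sketch:

```bash
# A jq alternative to grepping raw transcript text: select assistant
# messages whose content includes a Skill tool_use for the skill under
# test. jq -e exits nonzero when nothing matches, so the `if` works.
if jq -e '
     select(.type == "assistant")
     | .message.content[]?
     | select(.type == "tool_use" and .name == "Skill"
              and .input.skill == "your-skill-name")
   ' "$SESSION_FILE" > /dev/null; then
  echo "[PASS] Skill was invoked"
fi
```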

## Session Transcript Format

Session transcripts are JSONL (JSON Lines) files where each line is a JSON object representing a message or tool result.

### Key Fields

```json
{
  "type": "assistant",
  "message": {
    "content": [...],
    "usage": {
      "input_tokens": 27,
      "output_tokens": 3996,
      "cache_read_input_tokens": 1213703
    }
  }
}
```

### Tool Results

```json
{
  "type": "user",
  "toolUseResult": {
    "agentId": "3380c209",
    "usage": {
      "input_tokens": 2,
      "output_tokens": 787,
      "cache_read_input_tokens": 24989
    },
    "prompt": "You are implementing Task 1...",
    "content": [{"type": "text", "text": "..."}]
  }
}
```

The `agentId` field links to subagent sessions, and the `usage` field contains the token usage for that specific subagent invocation.
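
If `jq` is installed, these documented fields are enough to aggregate usage directly. A minimal sketch (an illustration, not how `analyze-token-usage.py` is implemented) that sums output tokens per subagent:

```bash
# Sum output tokens per subagent from .toolUseResult entries, using only
# the fields documented above. jq's `add` treats null as identity, so
# entries without usage data don't break the sum.
jq -rs '
  map(select(.type == "user" and .toolUseResult.agentId != null))
  | group_by(.toolUseResult.agentId)
  | .[]
  | "\(.[0].toolUseResult.agentId)\t\(map(.toolUseResult.usage.output_tokens) | add)"
' "$SESSION_FILE"
```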