
# Testing Superpowers Skills

This document describes how to test Superpowers skills, particularly the integration tests for complex skills like `subagent-driven-development`.

## Overview

Testing skills that involve subagents, workflows, and complex interactions requires running actual Claude Code sessions in headless mode and verifying their behavior through session transcripts.

## Test Structure

```
tests/
├── claude-code/
│   ├── test-helpers.sh                    # Shared test utilities
│   ├── test-subagent-driven-development-integration.sh
│   ├── analyze-token-usage.py             # Token analysis tool
│   └── run-skill-tests.sh                 # Test runner (if present)
```

## Running Tests

### Integration Tests

Integration tests execute real Claude Code sessions with actual skills:

```bash
# Run the subagent-driven-development integration test
cd tests/claude-code
./test-subagent-driven-development-integration.sh
```

**Note:** Integration tests can take 10-30 minutes as they execute real implementation plans with multiple subagents.

### Requirements

- Must run from the superpowers plugin directory (not from a temp directory)
- Claude Code must be installed and available as the `claude` command
- The local dev marketplace must be enabled: `"superpowers@superpowers-dev": true` in `~/.claude/settings.json` (see the fragment below)
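
For reference, the relevant fragment of `~/.claude/settings.json` looks roughly like this (a sketch; a real settings file contains other keys alongside `enabledPlugins`):

```json
{
  "enabledPlugins": {
    "superpowers@superpowers-dev": true
  }
}
```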

## Integration Test: `subagent-driven-development`

### What It Tests

The integration test verifies that the `subagent-driven-development` skill:

1. **Plan Loading**: Reads the plan once at the beginning
2. **Full Task Text**: Provides complete task descriptions to subagents (doesn't make them read files)
3. **Self-Review**: Ensures subagents perform self-review before reporting
4. **Review Order**: Runs spec compliance review before code quality review
5. **Review Loops**: Uses review loops when issues are found
6. **Independent Verification**: Spec reviewer reads code independently, doesn't trust implementer reports

### How It Works

1. **Setup**: Creates a temporary Node.js project with a minimal implementation plan
2. **Execution**: Runs Claude Code in headless mode with the skill
3. **Verification**: Parses the session transcript (`.jsonl` file) to verify the following (a sketch of such checks appears after this list):
   - Skill tool was invoked
   - Subagents were dispatched (Task tool)
   - TodoWrite was used for tracking
   - Implementation files were created
   - Tests pass
   - Git commits show proper workflow
4. **Token Analysis**: Shows token usage breakdown by subagent
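
As a rough sketch of what the verification step looks like, checks can be written as line-oriented greps over the transcript, in the same style as the template later in this document. The `"name":"Task"` and `"name":"TodoWrite"` patterns are assumptions mirroring the `"name":"Skill"` pattern that the template greps for:

```bash
# Sketch of transcript checks for the verification step above. Assumes each
# tool invocation shows up as a '"name":"<Tool>"' fragment on some line of
# the .jsonl transcript (one message per line), as the template's Skill
# check implies. grep -c counts matching lines, which is close enough for
# a pass/fail check.
SESSION_FILE="$HOME/.claude/projects/<project-dir>/<session-id>.jsonl"  # placeholder

TASKS=$(grep -c '"name":"Task"' "$SESSION_FILE" || true)        # subagent dispatches
TODOS=$(grep -c '"name":"TodoWrite"' "$SESSION_FILE" || true)   # task tracking
echo "Task dispatches: $TASKS, TodoWrite calls: $TODOS"
```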

### Test Output

```
========================================
 Integration Test: subagent-driven-development
========================================

Test project: /tmp/tmp.xyz123

=== Verification Tests ===

Test 1: Skill tool invoked...
  [PASS] subagent-driven-development skill was invoked

Test 2: Subagents dispatched...
  [PASS] 7 subagents dispatched

Test 3: Task tracking...
  [PASS] TodoWrite used 5 time(s)

Test 6: Implementation verification...
  [PASS] src/math.js created
  [PASS] add function exists
  [PASS] multiply function exists
  [PASS] test/math.test.js created
  [PASS] Tests pass

Test 7: Git commit history...
  [PASS] Multiple commits created (3 total)

Test 8: No extra features added...
  [PASS] No extra features added

========================================
 Token Usage Analysis
========================================

Usage Breakdown:
----------------------------------------------------------------------------------------------------
Agent           Description                          Msgs      Input     Output      Cache     Cost
----------------------------------------------------------------------------------------------------
main            Main session (coordinator)             34         27      3,996  1,213,703 $   4.09
3380c209        implementing Task 1: Create Add Function     1          2        787     24,989 $   0.09
34b00fde        implementing Task 2: Create Multiply Function     1          4        644     25,114 $   0.09
3801a732        reviewing whether an implementation matches...   1          5        703     25,742 $   0.09
4c142934        doing a final code review...                    1          6        854     25,319 $   0.09
5f017a42        a code reviewer. Review Task 2...               1          6        504     22,949 $   0.08
a6b7fbe4        a code reviewer. Review Task 1...               1          6        515     22,534 $   0.08
f15837c0        reviewing whether an implementation matches...   1          6        416     22,485 $   0.07
----------------------------------------------------------------------------------------------------

TOTALS:
  Total messages:         41
  Input tokens:           62
  Output tokens:          8,419
  Cache creation tokens:  132,742
  Cache read tokens:      1,382,835

  Total input (incl cache): 1,515,639
  Total tokens:             1,524,058

  Estimated cost: $4.67
  (at $3/$15 per M tokens for input/output)

========================================
 Test Summary
========================================

STATUS: PASSED
```

## Token Analysis Tool

### Usage

Analyze token usage from any Claude Code session:

```bash
python3 tests/claude-code/analyze-token-usage.py ~/.claude/projects/<project-dir>/<session-id>.jsonl
```

### Finding Session Files

Session transcripts are stored in `~/.claude/projects/` with the working directory path encoded:

```bash
# Example for /Users/jesse/Documents/GitHub/superpowers/superpowers
SESSION_DIR="$HOME/.claude/projects/-Users-jesse-Documents-GitHub-superpowers-superpowers"

# Find recent sessions
ls -lt "$SESSION_DIR"/*.jsonl | head -5
```

### What It Shows

- **Main session usage**: Token usage by the coordinator (you or the main Claude instance)
- **Per-subagent breakdown**: Each Task invocation with:
  - Agent ID
  - Description (extracted from the prompt)
  - Message count
  - Input/output tokens
  - Cache usage
  - Estimated cost
- **Totals**: Overall token usage and cost estimate

### Understanding the Output

- **High cache reads**: Good; it means prompt caching is working
- **High input tokens on main**: Expected; the coordinator holds the full context
- **Similar costs per subagent**: Expected; each subagent gets a task of similar complexity
- **Cost per task**: Typically $0.05-$0.15 per subagent, depending on the task (see the worked estimate below)
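
The estimate is easy to sanity-check by hand. Using the totals from the sample run above, and assuming the analyzer applies the flat $3/M input rate to cache reads as well (which matches its "Total input (incl cache)" line):

```bash
# Reproduce the $4.67 estimate from the sample run: 1,515,639 total input
# tokens (incl. cache) at $3/M plus 8,419 output tokens at $15/M.
awk 'BEGIN { printf "$%.2f\n", (1515639 * 3 + 8419 * 15) / 1e6 }'
# => $4.67
```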

## Troubleshooting

### Skills Not Loading

**Problem**: Skill not found when running headless tests

**Solutions**:

1. Ensure you're running *from* the superpowers directory: `cd /path/to/superpowers && tests/...`
2. Check that `~/.claude/settings.json` has `"superpowers@superpowers-dev": true` under `enabledPlugins`
3. Verify the skill exists in the `skills/` directory

### Permission Errors

**Problem**: Claude is blocked from writing files or accessing directories

**Solutions**:

1. Use the `--permission-mode bypassPermissions` flag
2. Use `--add-dir /path/to/temp/dir` to grant access to test directories (both flags are combined in the example below)
3. Check file permissions on the test directories
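
For example, a minimal invocation combining both flags (the prompt and directory are placeholders):

```bash
# Grant write access to the test directory and skip permission prompts;
# adjust the prompt and path to your setup.
claude -p "Your test prompt here" \
  --permission-mode bypassPermissions \
  --add-dir /tmp/my-test-project
```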

### Test Timeouts

**Problem**: The test takes too long and times out

**Solutions**:

1. Increase the timeout: `timeout 1800 claude ...` (30 minutes)
2. Check for infinite loops in the skill logic
3. Review subagent task complexity

### Session File Not Found

**Problem**: Can't find the session transcript after a test run

**Solutions**:

1. Check the correct project directory under `~/.claude/projects/`
2. Use `find ~/.claude/projects -name "*.jsonl" -mmin -60` to find recent sessions
3. Verify the test actually ran (check for errors in the test output)

## Writing New Integration Tests

### Template

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/test-helpers.sh"

# Create a test project and make sure it is cleaned up on exit
TEST_PROJECT=$(create_test_project)
trap 'cleanup_test_project "$TEST_PROJECT"' EXIT

# Set up test files...
cd "$TEST_PROJECT"

# Run Claude with the skill (from the plugin directory, so skills load)
PROMPT="Your test prompt here"
cd "$SCRIPT_DIR/../.." && timeout 1800 claude -p "$PROMPT" \
  --allowed-tools=all \
  --add-dir "$TEST_PROJECT" \
  --permission-mode bypassPermissions \
  2>&1 | tee output.txt

# Find and analyze the session: resolve the working directory, then apply
# the encoding used under ~/.claude/projects (slashes become dashes,
# keeping the leading dash, e.g. /Users/jesse/... -> -Users-jesse-...)
WORKING_DIR="$(cd "$SCRIPT_DIR/../.." && pwd)"
WORKING_DIR_ESCAPED="$(echo "$WORKING_DIR" | sed 's|/|-|g')"
SESSION_DIR="$HOME/.claude/projects/$WORKING_DIR_ESCAPED"
SESSION_FILE=$(find "$SESSION_DIR" -name "*.jsonl" -type f -mmin -60 | sort -r | head -1)

# Verify behavior by parsing the session transcript
if grep -q '"name":"Skill".*"skill":"your-skill-name"' "$SESSION_FILE"; then
    echo "[PASS] Skill was invoked"
fi

# Show token analysis
python3 "$SCRIPT_DIR/analyze-token-usage.py" "$SESSION_FILE"
```

### Best Practices

1. **Always clean up**: Use `trap` to clean up temp directories
2. **Parse transcripts**: Don't grep user-facing output; parse the `.jsonl` session file (see the `jq` sketch below)
3. **Grant permissions**: Use `--permission-mode bypassPermissions` and `--add-dir`
4. **Run from the plugin dir**: Skills only load when running from the superpowers directory
5. **Show token usage**: Always include token analysis for cost visibility
6. **Test real behavior**: Verify actual files created, tests passing, commits made
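
As an illustration of practice 2, the template's grep can be replaced with a structured query when `jq` is available. The content-block shape assumed here (`tool_use` entries with `name` and `input.skill`) is inferred from the template's grep pattern and the transcript fields documented in the next section, so treat it as a sketch:

```bash
# A jq alternative to grepping raw transcript text: select assistant
# messages whose content includes a Skill tool_use for the skill under
# test. jq -e exits nonzero when nothing matches, so the `if` works.
if jq -e '
     select(.type == "assistant")
     | .message.content[]?
     | select(.type == "tool_use" and .name == "Skill"
              and .input.skill == "your-skill-name")
   ' "$SESSION_FILE" > /dev/null; then
  echo "[PASS] Skill was invoked"
fi
```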

## Session Transcript Format

Session transcripts are JSONL (JSON Lines) files where each line is a JSON object representing a message or tool result.

### Key Fields

```json
{
  "type": "assistant",
  "message": {
    "content": [...],
    "usage": {
      "input_tokens": 27,
      "output_tokens": 3996,
      "cache_read_input_tokens": 1213703
    }
  }
}
```

### Tool Results

```json
{
  "type": "user",
  "toolUseResult": {
    "agentId": "3380c209",
    "usage": {
      "input_tokens": 2,
      "output_tokens": 787,
      "cache_read_input_tokens": 24989
    },
    "prompt": "You are implementing Task 1...",
    "content": [{"type": "text", "text": "..."}]
  }
}
```

The `agentId` field links to subagent sessions, and the `usage` field contains the token usage for that specific subagent invocation.
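
If `jq` is installed, these documented fields are enough to aggregate usage directly. A minimal sketch (an illustration, not how `analyze-token-usage.py` is implemented) that sums output tokens per subagent:

```bash
# Sum output tokens per subagent from .toolUseResult entries, using only
# the fields documented above. jq's `add` treats null as identity, so
# entries without usage data don't break the sum.
jq -rs '
  map(select(.type == "user" and .toolUseResult.agentId != null))
  | group_by(.toolUseResult.agentId)
  | .[]
  | "\(.[0].toolUseResult.agentId)\t\(map(.toolUseResult.usage.output_tokens) | add)"
' "$SESSION_FILE"
```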