# Testing Superpowers Skills
This document describes how to test Superpowers skills, particularly the integration tests for complex skills like subagent-driven-development.
## Overview
Testing skills that involve subagents, workflows, and complex interactions requires running actual Claude Code sessions in headless mode and verifying their behavior through session transcripts.
## Test Structure

```
tests/
├── claude-code/
│   ├── test-helpers.sh                                    # Shared test utilities
│   ├── test-subagent-driven-development-integration.sh
│   ├── analyze-token-usage.py                             # Token analysis tool
│   └── run-skill-tests.sh                                 # Test runner (if present)
```
## Running Tests

### Integration Tests

Integration tests execute real Claude Code sessions with actual skills:

```bash
# Run the subagent-driven-development integration test
cd tests/claude-code
./test-subagent-driven-development-integration.sh
```

**Note:** Integration tests can take 10-30 minutes, since they execute real implementation plans with multiple subagents.
### Requirements

- Must be run from the superpowers plugin directory (not from a temp directory)
- Claude Code must be installed and available as the `claude` command
- The local dev marketplace must be enabled: `"superpowers@superpowers-dev": true` in `~/.claude/settings.json` (see the fragment below)
## Integration Test: subagent-driven-development

### What It Tests

The integration test verifies that the subagent-driven-development skill behaves correctly on six points:

- **Plan Loading**: Reads the plan once at the beginning
- **Full Task Text**: Provides complete task descriptions to subagents (doesn't make them read files)
- **Self-Review**: Ensures subagents perform self-review before reporting
- **Review Order**: Runs spec compliance review before code quality review
- **Review Loops**: Uses review loops when issues are found
- **Independent Verification**: The spec reviewer reads the code independently and doesn't trust implementer reports
### How It Works

- **Setup**: Creates a temporary Node.js project with a minimal implementation plan
- **Execution**: Runs Claude Code in headless mode with the skill
- **Verification**: Parses the session transcript (the `.jsonl` file; a sketch follows this list) to verify that:
  - The Skill tool was invoked
  - Subagents were dispatched (Task tool)
  - TodoWrite was used for tracking
  - Implementation files were created
  - Tests pass
  - Git commits show the proper workflow
- **Token Analysis**: Shows a token usage breakdown by subagent
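In practice the verification step reduces to pattern matches against the transcript. As a rough sketch, assuming `SESSION_FILE` points at the `.jsonl` transcript and `jq` is installed (the exact content-block shape can vary between Claude Code versions):

```bash
# Check that the skill was invoked (same pattern the template below uses)
grep -q '"name":"Skill".*"skill":"subagent-driven-development"' "$SESSION_FILE" \
  && echo "[PASS] skill invoked"

# Count subagent dispatches: Task tool_use blocks in assistant messages
jq -r 'select(.type == "assistant") | .message.content[]?
       | select(.type? == "tool_use" and .name == "Task") | .name' \
   "$SESSION_FILE" | wc -l
```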
### Test Output

```
========================================
Integration Test: subagent-driven-development
========================================
Test project: /tmp/tmp.xyz123

=== Verification Tests ===

Test 1: Skill tool invoked...
[PASS] subagent-driven-development skill was invoked

Test 2: Subagents dispatched...
[PASS] 7 subagents dispatched

Test 3: Task tracking...
[PASS] TodoWrite used 5 time(s)

Test 6: Implementation verification...
[PASS] src/math.js created
[PASS] add function exists
[PASS] multiply function exists
[PASS] test/math.test.js created
[PASS] Tests pass

Test 7: Git commit history...
[PASS] Multiple commits created (3 total)

Test 8: No extra features added...
[PASS] No extra features added

=========================================
Token Usage Analysis
=========================================

Usage Breakdown:
----------------------------------------------------------------------------------------------------
Agent     Description                                      Msgs   Input   Output       Cache    Cost
----------------------------------------------------------------------------------------------------
main      Main session (coordinator)                         34      27    3,996   1,213,703  $ 4.09
3380c209  implementing Task 1: Create Add Function            1       2      787      24,989  $ 0.09
34b00fde  implementing Task 2: Create Multiply Function       1       4      644      25,114  $ 0.09
3801a732  reviewing whether an implementation matches...      1       5      703      25,742  $ 0.09
4c142934  doing a final code review...                        1       6      854      25,319  $ 0.09
5f017a42  a code reviewer. Review Task 2...                   1       6      504      22,949  $ 0.08
a6b7fbe4  a code reviewer. Review Task 1...                   1       6      515      22,534  $ 0.08
f15837c0  reviewing whether an implementation matches...      1       6      416      22,485  $ 0.07
----------------------------------------------------------------------------------------------------

TOTALS:
  Total messages:            41
  Input tokens:              62
  Output tokens:             8,419
  Cache creation tokens:     132,742
  Cache read tokens:         1,382,835
  Total input (incl cache):  1,515,639
  Total tokens:              1,524,058
  Estimated cost:            $4.67
  (at $3/$15 per M tokens for input/output)

========================================
Test Summary
========================================
STATUS: PASSED
```
## Token Analysis Tool

### Usage

Analyze token usage from any Claude Code session:

```bash
python3 tests/claude-code/analyze-token-usage.py ~/.claude/projects/<project-dir>/<session-id>.jsonl
```
### Finding Session Files

Session transcripts are stored under `~/.claude/projects/`, with the session's working directory path encoded into the directory name:

```bash
# Example for /Users/jesse/Documents/GitHub/superpowers/superpowers
SESSION_DIR="$HOME/.claude/projects/-Users-jesse-Documents-GitHub-superpowers-superpowers"

# Find recent sessions
ls -lt "$SESSION_DIR"/*.jsonl | head -5
```
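To feed the newest transcript straight into the analyzer, the two steps can be combined (a small convenience sketch built from the same `ls -t` idea):

```bash
# Analyze the most recent session in the project directory
SESSION_FILE=$(ls -t "$SESSION_DIR"/*.jsonl | head -1)
python3 tests/claude-code/analyze-token-usage.py "$SESSION_FILE"
```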
### What It Shows

- **Main session usage**: Token usage by the coordinator (you, or the main Claude instance)
- **Per-subagent breakdown**: Each Task invocation, with:
  - Agent ID
  - Description (extracted from the prompt)
  - Message count
  - Input/output tokens
  - Cache usage
  - Estimated cost
- **Totals**: Overall token usage and cost estimate
### Understanding the Output

- **High cache reads**: Good - it means prompt caching is working
- **High input tokens on main**: Expected - the coordinator carries the full context
- **Similar costs per subagent**: Expected - each subagent gets a task of similar complexity
- **Cost per task**: Typically $0.05-$0.15 per subagent, depending on the task (worked example below)
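To make the estimates concrete: the analyzer appears to price all input, including cache tokens, at the $3/M input rate. A back-of-the-envelope check against the sample run's totals reproduces its $4.67 figure:

```bash
# 1,515,639 input tokens (incl. cache) at $3/M + 8,419 output tokens at $15/M
python3 -c 'print(f"${(1_515_639 * 3 + 8_419 * 15) / 1_000_000:.2f}")'  # -> $4.67
```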
## Troubleshooting

### Skills Not Loading

**Problem**: Skill not found when running headless tests

**Solutions**:

- Ensure you're running FROM the superpowers directory: `cd /path/to/superpowers && tests/...`
- Check that `~/.claude/settings.json` has `"superpowers@superpowers-dev": true` in `enabledPlugins`
- Verify the skill exists in the `skills/` directory
### Permission Errors

**Problem**: Claude is blocked from writing files or accessing directories

**Solutions**:

- Use the `--permission-mode bypassPermissions` flag
- Use `--add-dir /path/to/temp/dir` to grant access to test directories (both flags are combined in the example below)
- Check file permissions on the test directories
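Put together, a permissive headless invocation looks like this (the prompt and directory are placeholders):

```bash
claude -p "Your test prompt here" \
  --permission-mode bypassPermissions \
  --add-dir /path/to/temp/dir
```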
### Test Timeouts

**Problem**: The test takes too long and times out

**Solutions**:

- Increase the timeout: `timeout 1800 claude ...` (30 minutes)
- Check for infinite loops in the skill logic
- Review the complexity of the subagent tasks
### Session File Not Found

**Problem**: Can't find the session transcript after a test run

**Solutions**:

- Check for the correct project directory in `~/.claude/projects/`
- Use `find ~/.claude/projects -name "*.jsonl" -mmin -60` to find recent sessions
- Verify the test actually ran (check for errors in the test output)
## Writing New Integration Tests

### Template

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/test-helpers.sh"

# Create a test project and clean it up on exit
TEST_PROJECT=$(create_test_project)
trap 'cleanup_test_project "$TEST_PROJECT"' EXIT

# Set up test files...
cd "$TEST_PROJECT"

# Run Claude with the skill, from the plugin directory so skills load
PROMPT="Your test prompt here"
cd "$SCRIPT_DIR/../.." && timeout 1800 claude -p "$PROMPT" \
  --allowed-tools=all \
  --add-dir "$TEST_PROJECT" \
  --permission-mode bypassPermissions \
  2>&1 | tee output.txt

# Find and analyze the session. Claude Code encodes the working directory
# into the project directory name by replacing "/" with "-".
PLUGIN_DIR="$(cd "$SCRIPT_DIR/../.." && pwd)"
WORKING_DIR_ESCAPED=$(echo "$PLUGIN_DIR" | sed 's|/|-|g')
SESSION_DIR="$HOME/.claude/projects/$WORKING_DIR_ESCAPED"
SESSION_FILE=$(find "$SESSION_DIR" -name "*.jsonl" -type f -mmin -60 | sort -r | head -1)

# Verify behavior by parsing the session transcript
if grep -q '"name":"Skill".*"skill":"your-skill-name"' "$SESSION_FILE"; then
  echo "[PASS] Skill was invoked"
fi

# Show token analysis
python3 "$SCRIPT_DIR/analyze-token-usage.py" "$SESSION_FILE"
```
## Best Practices

- **Always clean up**: Use `trap` to clean up temp directories
- **Parse transcripts**: Don't grep user-facing output - parse the `.jsonl` session file (see the jq example at the end of this document)
- **Grant permissions**: Use `--permission-mode bypassPermissions` and `--add-dir`
- **Run from the plugin dir**: Skills only load when running from the superpowers directory
- **Show token usage**: Always include token analysis for cost visibility
- **Test real behavior**: Verify that files were actually created, tests pass, and commits were made
## Session Transcript Format

Session transcripts are JSONL (JSON Lines) files, where each line is a JSON object representing a message or a tool result.

### Key Fields

```json
{
  "type": "assistant",
  "message": {
    "content": [...],
    "usage": {
      "input_tokens": 27,
      "output_tokens": 3996,
      "cache_read_input_tokens": 1213703
    }
  }
}
```
### Tool Results

```json
{
  "type": "user",
  "toolUseResult": {
    "agentId": "3380c209",
    "usage": {
      "input_tokens": 2,
      "output_tokens": 787,
      "cache_read_input_tokens": 24989
    },
    "prompt": "You are implementing Task 1...",
    "content": [{"type": "text", "text": "..."}]
  }
}
```
The `agentId` field links the result to its subagent session, and the `usage` field contains token usage for that specific subagent invocation.
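Given that shape, a jq one-liner along these lines can pull the per-subagent usage straight out of a transcript; it mirrors what analyze-token-usage.py aggregates (assuming `jq` is installed and the `toolUseResult` fields are as shown above):

```bash
# Print agent ID, input tokens, and output tokens for each subagent result
jq -r '.toolUseResult? | objects | select(.agentId != null)
       | [.agentId, .usage.input_tokens, .usage.output_tokens] | @tsv' \
   session.jsonl
```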