304 lines
9.7 KiB
Markdown
304 lines
9.7 KiB
Markdown
# Testing Superpowers Skills
|
|
|
|
This document describes how to test Superpowers skills, particularly the integration tests for complex skills like `subagent-driven-development`.
|
|
|
|
## Overview
|
|
|
|
Testing skills that involve subagents, workflows, and complex interactions requires running actual Claude Code sessions in headless mode and verifying their behavior through session transcripts.
|
|
|
|
## Test Structure
|
|
|
|
```
|
|
tests/
|
|
├── claude-code/
|
|
│ ├── test-helpers.sh # Shared test utilities
|
|
│ ├── test-subagent-driven-development-integration.sh
|
|
│ ├── analyze-token-usage.py # Token analysis tool
|
|
│ └── run-skill-tests.sh # Test runner (if exists)
|
|
```
|
|
|
|
## Running Tests
|
|
|
|
### Integration Tests
|
|
|
|
Integration tests execute real Claude Code sessions with actual skills:
|
|
|
|
```bash
|
|
# Run the subagent-driven-development integration test
|
|
cd tests/claude-code
|
|
./test-subagent-driven-development-integration.sh
|
|
```
|
|
|
|
**Note:** Integration tests can take 10-30 minutes as they execute real implementation plans with multiple subagents.
|
|
|
|
### Requirements
|
|
|
|
- Must run from the **superpowers plugin directory** (not from temp directories)
|
|
- Claude Code must be installed and available as `claude` command
|
|
- Local dev marketplace must be enabled: `"superpowers@superpowers-dev": true` in `~/.claude/settings.json`
|
|
|
|
## Integration Test: subagent-driven-development
|
|
|
|
### What It Tests
|
|
|
|
The integration test verifies the `subagent-driven-development` skill correctly:
|
|
|
|
1. **Plan Loading**: Reads the plan once at the beginning
|
|
2. **Full Task Text**: Provides complete task descriptions to subagents (doesn't make them read files)
|
|
3. **Self-Review**: Ensures subagents perform self-review before reporting
|
|
4. **Review Order**: Runs spec compliance review before code quality review
|
|
5. **Review Loops**: Uses review loops when issues are found
|
|
6. **Independent Verification**: Spec reviewer reads code independently, doesn't trust implementer reports
|
|
|
|
### How It Works
|
|
|
|
1. **Setup**: Creates a temporary Node.js project with a minimal implementation plan
|
|
2. **Execution**: Runs Claude Code in headless mode with the skill
|
|
3. **Verification**: Parses the session transcript (`.jsonl` file) to verify:
|
|
- Skill tool was invoked
|
|
- Subagents were dispatched (Task tool)
|
|
- TodoWrite was used for tracking
|
|
- Implementation files were created
|
|
- Tests pass
|
|
- Git commits show proper workflow
|
|
4. **Token Analysis**: Shows token usage breakdown by subagent
|
|
|
|
### Test Output
|
|
|
|
```
|
|
========================================
|
|
Integration Test: subagent-driven-development
|
|
========================================
|
|
|
|
Test project: /tmp/tmp.xyz123
|
|
|
|
=== Verification Tests ===
|
|
|
|
Test 1: Skill tool invoked...
|
|
[PASS] subagent-driven-development skill was invoked
|
|
|
|
Test 2: Subagents dispatched...
|
|
[PASS] 7 subagents dispatched
|
|
|
|
Test 3: Task tracking...
|
|
[PASS] TodoWrite used 5 time(s)
|
|
|
|
Test 6: Implementation verification...
|
|
[PASS] src/math.js created
|
|
[PASS] add function exists
|
|
[PASS] multiply function exists
|
|
[PASS] test/math.test.js created
|
|
[PASS] Tests pass
|
|
|
|
Test 7: Git commit history...
|
|
[PASS] Multiple commits created (3 total)
|
|
|
|
Test 8: No extra features added...
|
|
[PASS] No extra features added
|
|
|
|
=========================================
|
|
Token Usage Analysis
|
|
=========================================
|
|
|
|
Usage Breakdown:
|
|
----------------------------------------------------------------------------------------------------
|
|
Agent Description Msgs Input Output Cache Cost
|
|
----------------------------------------------------------------------------------------------------
|
|
main Main session (coordinator) 34 27 3,996 1,213,703 $ 4.09
|
|
3380c209 implementing Task 1: Create Add Function 1 2 787 24,989 $ 0.09
|
|
34b00fde implementing Task 2: Create Multiply Function 1 4 644 25,114 $ 0.09
|
|
3801a732 reviewing whether an implementation matches... 1 5 703 25,742 $ 0.09
|
|
4c142934 doing a final code review... 1 6 854 25,319 $ 0.09
|
|
5f017a42 a code reviewer. Review Task 2... 1 6 504 22,949 $ 0.08
|
|
a6b7fbe4 a code reviewer. Review Task 1... 1 6 515 22,534 $ 0.08
|
|
f15837c0 reviewing whether an implementation matches... 1 6 416 22,485 $ 0.07
|
|
----------------------------------------------------------------------------------------------------
|
|
|
|
TOTALS:
|
|
Total messages: 41
|
|
Input tokens: 62
|
|
Output tokens: 8,419
|
|
Cache creation tokens: 132,742
|
|
Cache read tokens: 1,382,835
|
|
|
|
Total input (incl cache): 1,515,639
|
|
Total tokens: 1,524,058
|
|
|
|
Estimated cost: $4.67
|
|
(at $3/$15 per M tokens for input/output)
|
|
|
|
========================================
|
|
Test Summary
|
|
========================================
|
|
|
|
STATUS: PASSED
|
|
```
|
|
|
|
## Token Analysis Tool
|
|
|
|
### Usage
|
|
|
|
Analyze token usage from any Claude Code session:
|
|
|
|
```bash
|
|
python3 tests/claude-code/analyze-token-usage.py ~/.claude/projects/<project-dir>/<session-id>.jsonl
|
|
```
|
|
|
|
### Finding Session Files
|
|
|
|
Session transcripts are stored in `~/.claude/projects/` with the working directory path encoded:
|
|
|
|
```bash
|
|
# Example for /Users/jesse/Documents/GitHub/superpowers/superpowers
|
|
SESSION_DIR="$HOME/.claude/projects/-Users-jesse-Documents-GitHub-superpowers-superpowers"
|
|
|
|
# Find recent sessions
|
|
ls -lt "$SESSION_DIR"/*.jsonl | head -5
|
|
```
|
|
|
|
### What It Shows
|
|
|
|
- **Main session usage**: Token usage by the coordinator (you or main Claude instance)
|
|
- **Per-subagent breakdown**: Each Task invocation with:
|
|
- Agent ID
|
|
- Description (extracted from prompt)
|
|
- Message count
|
|
- Input/output tokens
|
|
- Cache usage
|
|
- Estimated cost
|
|
- **Totals**: Overall token usage and cost estimate
|
|
|
|
### Understanding the Output
|
|
|
|
- **High cache reads**: Good - means prompt caching is working
|
|
- **High input tokens on main**: Expected - coordinator has full context
|
|
- **Similar costs per subagent**: Expected - each gets similar task complexity
|
|
- **Cost per task**: Typical range is $0.05-$0.15 per subagent depending on task
|
|
|
|
## Troubleshooting
|
|
|
|
### Skills Not Loading
|
|
|
|
**Problem**: Skill not found when running headless tests
|
|
|
|
**Solutions**:
|
|
1. Ensure you're running FROM the superpowers directory: `cd /path/to/superpowers && tests/...`
|
|
2. Check `~/.claude/settings.json` has `"superpowers@superpowers-dev": true` in `enabledPlugins`
|
|
3. Verify skill exists in `skills/` directory
|
|
|
|
### Permission Errors
|
|
|
|
**Problem**: Claude blocked from writing files or accessing directories
|
|
|
|
**Solutions**:
|
|
1. Use `--permission-mode bypassPermissions` flag
|
|
2. Use `--add-dir /path/to/temp/dir` to grant access to test directories
|
|
3. Check file permissions on test directories
|
|
|
|
### Test Timeouts
|
|
|
|
**Problem**: Test takes too long and times out
|
|
|
|
**Solutions**:
|
|
1. Increase timeout: `timeout 1800 claude ...` (30 minutes)
|
|
2. Check for infinite loops in skill logic
|
|
3. Review subagent task complexity
|
|
|
|
### Session File Not Found
|
|
|
|
**Problem**: Can't find session transcript after test run
|
|
|
|
**Solutions**:
|
|
1. Check the correct project directory in `~/.claude/projects/`
|
|
2. Use `find ~/.claude/projects -name "*.jsonl" -mmin -60` to find recent sessions
|
|
3. Verify test actually ran (check for errors in test output)
|
|
|
|
## Writing New Integration Tests
|
|
|
|
### Template
|
|
|
|
```bash
|
|
#!/usr/bin/env bash
|
|
set -euo pipefail
|
|
|
|
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
|
|
source "$SCRIPT_DIR/test-helpers.sh"
|
|
|
|
# Create test project
|
|
TEST_PROJECT=$(create_test_project)
|
|
trap "cleanup_test_project $TEST_PROJECT" EXIT
|
|
|
|
# Set up test files...
|
|
cd "$TEST_PROJECT"
|
|
|
|
# Run Claude with skill
|
|
PROMPT="Your test prompt here"
|
|
cd "$SCRIPT_DIR/../.." && timeout 1800 claude -p "$PROMPT" \
|
|
--allowed-tools=all \
|
|
--add-dir "$TEST_PROJECT" \
|
|
--permission-mode bypassPermissions \
|
|
2>&1 | tee output.txt
|
|
|
|
# Find and analyze session
|
|
WORKING_DIR_ESCAPED=$(echo "$SCRIPT_DIR/../.." | sed 's/\\//-/g' | sed 's/^-//')
|
|
SESSION_DIR="$HOME/.claude/projects/$WORKING_DIR_ESCAPED"
|
|
SESSION_FILE=$(find "$SESSION_DIR" -name "*.jsonl" -type f -mmin -60 | sort -r | head -1)
|
|
|
|
# Verify behavior by parsing session transcript
|
|
if grep -q '"name":"Skill".*"skill":"your-skill-name"' "$SESSION_FILE"; then
|
|
echo "[PASS] Skill was invoked"
|
|
fi
|
|
|
|
# Show token analysis
|
|
python3 "$SCRIPT_DIR/analyze-token-usage.py" "$SESSION_FILE"
|
|
```
|
|
|
|
### Best Practices
|
|
|
|
1. **Always cleanup**: Use trap to cleanup temp directories
|
|
2. **Parse transcripts**: Don't grep user-facing output - parse the `.jsonl` session file
|
|
3. **Grant permissions**: Use `--permission-mode bypassPermissions` and `--add-dir`
|
|
4. **Run from plugin dir**: Skills only load when running from the superpowers directory
|
|
5. **Show token usage**: Always include token analysis for cost visibility
|
|
6. **Test real behavior**: Verify actual files created, tests passing, commits made
|
|
|
|
## Session Transcript Format
|
|
|
|
Session transcripts are JSONL (JSON Lines) files where each line is a JSON object representing a message or tool result.
|
|
|
|
### Key Fields
|
|
|
|
```json
|
|
{
|
|
"type": "assistant",
|
|
"message": {
|
|
"content": [...],
|
|
"usage": {
|
|
"input_tokens": 27,
|
|
"output_tokens": 3996,
|
|
"cache_read_input_tokens": 1213703
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
### Tool Results
|
|
|
|
```json
|
|
{
|
|
"type": "user",
|
|
"toolUseResult": {
|
|
"agentId": "3380c209",
|
|
"usage": {
|
|
"input_tokens": 2,
|
|
"output_tokens": 787,
|
|
"cache_read_input_tokens": 24989
|
|
},
|
|
"prompt": "You are implementing Task 1...",
|
|
"content": [{"type": "text", "text": "..."}]
|
|
}
|
|
}
|
|
```
|
|
|
|
The `agentId` field links to subagent sessions, and the `usage` field contains token usage for that specific subagent invocation.
|