# Skills Improvements from User Feedback
**Date:** 2025-11-28
**Status:** Draft
**Source:** Two Claude instances using superpowers in real development scenarios
---
## Executive Summary
Two Claude instances provided detailed feedback from actual development sessions. Their feedback reveals **systematic gaps** in current skills that allowed preventable bugs to ship despite following the skills.
**Critical insight:** These are problem reports, not just solution proposals. The problems are real; the solutions need careful evaluation.
**Key themes:**
1. **Verification gaps** - We verify operations succeed but not that they achieve intended outcomes
2. **Process hygiene** - Background processes accumulate and interfere across subagents
3. **Context optimization** - Subagents get too much irrelevant information
4. **Self-reflection missing** - No prompt to critique own work before handoff
5. **Mock safety** - Mocks can drift from interfaces without detection
6. **Skill activation** - Skills exist but aren't being read/used
---
## Problems Identified
### Problem 1: Configuration Change Verification Gap
**What happened:**
- Subagent tested "OpenAI integration"
- Set `OPENAI_API_KEY` env var
- Got status 200 responses
- Reported "OpenAI integration working"
- **BUT** the response contained `"model": "claude-sonnet-4-20250514"` - the request was actually served by Anthropic
**Root cause:**
`verification-before-completion` checks that operations succeed, but not that outcomes reflect the intended configuration change.
**Impact:** High - False confidence in integration tests, bugs ship to production
**Example failure pattern:**
- Switch LLM provider → verify status 200 but don't check model name
- Enable feature flag → verify no errors but don't check feature is active
- Change environment → verify deployment succeeds but don't check environment vars
---
### Problem 2: Background Process Accumulation
**What happened:**
- Multiple subagents dispatched during session
- Each started background server processes
- Processes accumulated (4+ servers running)
- Stale processes still bound to ports
- Later E2E test hit stale server with wrong config
- Confusing/incorrect test results
**Root cause:**
Subagents are stateless - they don't know about processes started by previous subagents, and there is no cleanup protocol.
**Impact:** Medium-High - Tests hit wrong server, false passes/failures, debugging confusion
---
### Problem 3: Context Bloat in Subagent Prompts
**What happened:**
- Standard approach: give subagent full plan file to read
- Experiment: give only task + pattern + file + verify command
- Result: Faster and more focused; single-attempt completion was more common
**Root cause:**
Subagents waste tokens and attention on irrelevant plan sections.
**Impact:** Medium - Slower execution, more failed attempts
**What worked:**
```
You are adding a single E2E test to packnplay's test suite.
**Your task:** Add `TestE2E_FeaturePrivilegedMode` to `pkg/runner/e2e_test.go`
**What to test:** A local devcontainer feature that requests `"privileged": true`
in its metadata should result in the container running with `--privileged` flag.
**Follow the exact pattern of TestE2E_FeatureOptionValidation** (at the end of the file)
**After writing, run:** `go test -v ./pkg/runner -run TestE2E_FeaturePrivilegedMode -timeout 5m`
```
---
### Problem 4: No Self-Reflection Before Handoff
**What happened:**
- Added self-reflection prompt: "Look at your work with fresh eyes - what could be better?"
- Implementer for Task 5 identified failing test was due to implementation bug, not test bug
- Traced to line 99: `strings.Join(metadata.Entrypoint, " ")` creating invalid Docker syntax
- Without self-reflection, the implementer would have just reported "test fails" without the root cause
**Root cause:**
Implementers don't naturally step back and critique their own work before reporting completion.
**Impact:** Medium - Bugs handed off to reviewer that implementer could have caught
---
### Problem 5: Mock-Interface Drift
**What happened:**
```typescript
// Interface defines close()
interface PlatformAdapter {
  close(): Promise<void>;
}
// Code (BUGGY) calls cleanup()
await adapter.cleanup();
// Mock (MATCHES BUG) defines cleanup()
vi.mock('web-adapter', () => ({
  WebAdapter: vi.fn().mockImplementation(() => ({
    cleanup: vi.fn().mockResolvedValue(undefined), // Wrong!
  })),
}));
```
- Tests passed
- Runtime crashed: "adapter.cleanup is not a function"
**Root cause:**
The mock was derived from what the buggy code calls, not from the interface definition. TypeScript can't catch inline mocks with wrong method names.
**Impact:** High - Tests give false confidence, runtime crashes
**Why testing-anti-patterns didn't prevent this:**
The skill covers "testing mock behavior" and "mocking without understanding," but not the specific pattern "derive the mock from the interface, not the implementation."
---
### Problem 6: Code Reviewer File Access
**What happened:**
- Code reviewer subagent dispatched
- Couldn't find test file: "The file doesn't appear to exist in the repository"
- File actually exists
- Reviewer didn't know to explicitly read it first
**Root cause:**
Reviewer prompts don't include explicit file reading instructions.
**Impact:** Low-Medium - Reviews fail or incomplete
---
### Problem 7: Fix Workflow Latency
**What happened:**
- Implementer identifies bug during self-reflection
- Implementer knows the fix
- Current workflow: report → I dispatch fixer → fixer fixes → I verify
- Extra round-trip adds latency without adding value
**Root cause:**
Rigid separation between the implementer and fixer roles, even when the implementer has already diagnosed the problem.
**Impact:** Low - Latency, but no correctness issue
---
### Problem 8: Skills Not Being Read
**What happened:**
- `testing-anti-patterns` skill exists
- Neither human nor subagents read it before writing tests
- Would have prevented some issues (though not all - see Problem 5)
**Root cause:**
No enforcement that subagents read relevant skills. No prompt includes skill reading.
**Impact:** Medium - Skill investment wasted if not used
---
## Proposed Improvements
### 1. verification-before-completion: Add Configuration Change Verification
**Add new section:**
````markdown
## Verifying Configuration Changes
When testing changes to configuration, providers, feature flags, or environment:
**Don't just verify the operation succeeded. Verify the output reflects the intended change.**
### Common Failure Pattern
Operation succeeds because *some* valid config exists, but it's not the config you intended to test.
### Examples
| Change | Insufficient check | Required check |
|--------|-------------|----------|
| Switch LLM provider | Status 200 | Response contains expected model name |
| Enable feature flag | No errors | Feature behavior actually active |
| Change environment | Deploy succeeds | Logs/vars reference new environment |
| Set credentials | Auth succeeds | Authenticated user/context is correct |
### Gate Function
```
BEFORE claiming configuration change works:
1. IDENTIFY: What should be DIFFERENT after this change?
2. LOCATE: Where is that difference observable?
- Response field (model name, user ID)
- Log line (environment, provider)
- Behavior (feature active/inactive)
3. RUN: Command that shows the observable difference
4. VERIFY: Output contains expected difference
5. ONLY THEN: Claim configuration change works
Red flags:
- "Request succeeded" without checking content
- Checking status code but not response body
- Verifying no errors but not positive confirmation
```
````
**Why this works:**
Forces verification of INTENT, not just operation success.
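As a concrete illustration, here is a minimal sketch of the gate applied in a test, assuming a hypothetical `/api/chat` endpoint and an expected model prefix for the provider under test (both are illustrative names, not part of the skill):
```typescript
import { expect, test } from "vitest";

// Hypothetical endpoint and expectation - adjust to the service under test.
const BASE_URL = process.env.BASE_URL ?? "http://localhost:3001";
const EXPECTED_MODEL = /^gpt-/; // the observable difference a switch to OpenAI should produce

test("provider switch is actually in effect", async () => {
  const res = await fetch(`${BASE_URL}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages: [{ role: "user", content: "ping" }] }),
  });

  // Insufficient on its own: only proves that *some* valid config exists.
  expect(res.status).toBe(200);

  // The actual gate: verify the output reflects the intended change.
  const body = await res.json();
  expect(body.model).toMatch(EXPECTED_MODEL);
});
```
The status assertion alone would have passed in the Problem 1 scenario; the model assertion is what catches the wrong provider.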
---
### 2. subagent-driven-development: Add Process Hygiene for E2E Tests
**Add new section:**
````markdown
## Process Hygiene for E2E Tests
When dispatching subagents that start services (servers, databases, message queues):
### Problem
Subagents are stateless - they don't know about processes started by previous subagents. Background processes persist and can interfere with later tests.
### Solution
**Before dispatching E2E test subagent, include cleanup in prompt:**
```
BEFORE starting any services:
1. Kill existing processes: pkill -f "<service-pattern>" 2>/dev/null || true
2. Wait for cleanup: sleep 1
3. Verify port free: lsof -i :<port> && echo "ERROR: Port still in use" || echo "Port free"
AFTER tests complete:
1. Kill the process you started
2. Verify cleanup: pgrep -f "<service-pattern>" || echo "Cleanup successful"
```
### Example
```
Task: Run E2E test of API server
Prompt includes:
"Before starting the server:
- Kill any existing servers: pkill -f 'node.*server.js' 2>/dev/null || true
- Verify port 3001 is free: lsof -i :3001 && exit 1 || echo 'Port available'
After tests:
- Kill the server you started
- Verify: pgrep -f 'node.*server.js' || echo 'Cleanup verified'"
```
### Why This Matters
- Stale processes serve requests with wrong config
- Port conflicts cause silent failures
- Process accumulation slows system
- Confusing test results (hitting wrong server)
````
**Trade-off analysis:**
- Adds boilerplate to prompts
- But prevents very confusing debugging
- Worth it for E2E test subagents
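Where the test suite manages its own server, the same hygiene could also be encoded in setup code rather than only in the prompt. A minimal sketch as a Vitest `globalSetup`, assuming the server matches the process pattern `node.*server.js` and listens on port 3001 (both assumptions carried over from the example above):
```typescript
import { execSync } from "node:child_process";
import net from "node:net";

// Intended as a Vitest globalSetup file (an assumption of this sketch):
// kill stale servers and confirm the port is free before the E2E suite starts.
const PROCESS_PATTERN = "node.*server.js";
const PORT = 3001;

function portIsFree(port: number): Promise<boolean> {
  return new Promise((resolve) => {
    const probe = net
      .createServer()
      .once("error", () => resolve(false)) // something is already bound
      .once("listening", () => probe.close(() => resolve(true)))
      .listen(port);
  });
}

export default async function setup(): Promise<void> {
  // Kill anything left over from previous subagents; ignore "no match" errors.
  try {
    execSync(`pkill -f '${PROCESS_PATTERN}'`);
  } catch {
    /* no stale processes - fine */
  }
  await new Promise((r) => setTimeout(r, 1000)); // give the OS time to release the port

  if (!(await portIsFree(PORT))) {
    throw new Error(`Port ${PORT} is still in use - refusing to run E2E tests against a stale server`);
  }
}
```
Failing fast here turns "E2E test silently hit a stale server" into an explicit setup error.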
---
### 3. subagent-driven-development: Add Lean Context Option
**Modify Step 2: Execute Task with Subagent**
**Before:**
```
Read that task carefully from [plan-file].
```
**After:**
````
## Context Approaches
**Full Plan (default):**
Use when tasks are complex or have dependencies:
```
Read Task N from [plan-file] carefully.
```
**Lean Context (for independent tasks):**
Use when task is standalone and pattern-based:
```
You are implementing: [1-2 sentence task description]
File to modify: [exact path]
Pattern to follow: [reference to existing function/test]
What to implement: [specific requirement]
Verification: [exact command to run]
[Do NOT include full plan file]
```
**Use lean context when:**
- Task follows existing pattern (add similar test, implement similar feature)
- Task is self-contained (doesn't need context from other tasks)
- Pattern reference is sufficient (e.g., "follow TestE2E_FeatureOptionValidation")
**Use full plan when:**
- Task has dependencies on other tasks
- Requires understanding of overall architecture
- Complex logic that needs context
````
**Example:**
```
Lean context prompt:
"You are adding a test for privileged mode in devcontainer features.
File: pkg/runner/e2e_test.go
Pattern: Follow TestE2E_FeatureOptionValidation (at end of file)
Test: Feature with `"privileged": true` in metadata results in `--privileged` flag
Verify: go test -v ./pkg/runner -run TestE2E_FeaturePrivilegedMode -timeout 5m
Report: Implementation, test results, any issues."
```
**Why this works:**
Reduces token usage, increases focus, and speeds completion when appropriate.
---
### 4. subagent-driven-development: Add Self-Reflection Step
**Modify Step 2: Execute Task with Subagent**
**Add to prompt template:**
```
When done, BEFORE reporting back:
Take a step back and review your work with fresh eyes.
Ask yourself:
- Does this actually solve the task as specified?
- Are there edge cases I didn't consider?
- Did I follow the pattern correctly?
- If tests are failing, what's the ROOT CAUSE (implementation bug vs test bug)?
- What could be better about this implementation?
If you identify issues during this reflection, fix them now.
Then report:
- What you implemented
- Self-reflection findings (if any)
- Test results
- Files changed
```
**Why this works:**
Catches bugs the implementer can find on their own before handoff. Documented case: the Task 5 implementer identified the entrypoint bug through self-reflection.
**Trade-off:**
Adds ~30 seconds per task, but catches issues before review.
---
### 5. requesting-code-review: Add Explicit File Reading
**Modify the code-reviewer template:**
**Add at the beginning:**
```markdown
## Files to Review
BEFORE analyzing, read these files:
1. [List specific files that changed in the diff]
2. [Files referenced by changes but not modified]
Use Read tool to load each file.
If you cannot find a file:
- Check exact path from diff
- Try alternate locations
- Report: "Cannot locate [path] - please verify file exists"
DO NOT proceed with review until you've read the actual code.
```
**Why this works:**
Explicit instruction prevents "file not found" issues.
---
### 6. testing-anti-patterns: Add Mock-Interface Drift Anti-Pattern
**Add new Anti-Pattern 6:**
````markdown
## Anti-Pattern 6: Mocks Derived from Implementation
**The violation:**
```typescript
// Code (BUGGY) calls cleanup()
await adapter.cleanup();
// Mock (MATCHES BUG) has cleanup()
const mock = {
  cleanup: vi.fn().mockResolvedValue(undefined)
};
// Interface (CORRECT) defines close()
interface PlatformAdapter {
  close(): Promise<void>;
}
```
**Why this is wrong:**
- Mock encodes the bug into the test
- TypeScript can't catch inline mocks with wrong method names
- Test passes because both code and mock are wrong
- Runtime crashes when real object is used
**The fix:**
```typescript
// ✅ GOOD: Derive mock from interface
// Step 1: Open interface definition (PlatformAdapter)
// Step 2: List methods defined there (close, initialize, etc.)
// Step 3: Mock EXACTLY those methods
const mock = {
  initialize: vi.fn().mockResolvedValue(undefined),
  close: vi.fn().mockResolvedValue(undefined), // From interface!
};
// Now test FAILS because code calls cleanup() which doesn't exist
// That failure reveals the bug BEFORE runtime
```
### Gate Function
```
BEFORE writing any mock:
1. STOP - Do NOT look at the code under test yet
2. FIND: The interface/type definition for the dependency
3. READ: The interface file
4. LIST: Methods defined in the interface
5. MOCK: ONLY those methods with EXACTLY those names
6. DO NOT: Look at what your code calls
IF your test fails because code calls something not in mock:
✅ GOOD - The test found a bug in your code
Fix the code to call the correct interface method
NOT the mock
Red flags:
- "I'll mock what the code calls"
- Copying method names from implementation
- Mock written without reading interface
- "The test is failing so I'll add this method to the mock"
```
**Detection:**
When you see runtime error "X is not a function" and tests pass:
1. Check if X is mocked
2. Compare mock methods to interface methods
3. Look for method name mismatches
````
**Why this works:**
Directly addresses the failure pattern from feedback.
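A complementary, type-level guard worth evaluating alongside this anti-pattern (not part of the current skill text): annotating the mock with the interface type lets the compiler flag drift directly. A minimal sketch, assuming Vitest and an interface with `initialize` and `close` as in the fix above:
```typescript
import { vi } from "vitest";

interface PlatformAdapter {
  initialize(): Promise<void>;
  close(): Promise<void>;
}

// Annotating the mock with the interface makes drift a compile error:
// a `cleanup` property here fails excess-property checking, and a
// missing `close` fails the assignment.
const mockAdapter: PlatformAdapter = {
  initialize: vi.fn(async () => {}),
  close: vi.fn(async () => {}),
};

// `satisfies` is an alternative that keeps the concrete mock type
// (handy for reading vi.fn() call records) while still checking it
// against the interface.
const mockAdapter2 = {
  initialize: vi.fn(async () => {}),
  close: vi.fn(async () => {}),
} satisfies PlatformAdapter;
```
This does not replace the gate function - it only helps when the mock is actually annotated - but it turns the "mock has cleanup(), interface has close()" mismatch into a compile error instead of a runtime crash.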
---
### 7. subagent-driven-development: Require Skills Reading for Test Subagents
**Add to prompt template when task involves testing:**
```markdown
BEFORE writing any tests:
1. Read testing-anti-patterns skill:
Use Skill tool: superpowers:testing-anti-patterns
2. Apply gate functions from that skill when:
- Writing mocks
- Adding methods to production classes
- Mocking dependencies
This is NOT optional. Tests that violate anti-patterns will be rejected in review.
```
**Why this works:**
Ensures skills are actually used rather than merely existing.
**Trade-off:**
Adds time to each task, but prevents entire classes of bugs.
---
### 8. subagent-driven-development: Allow Implementer to Fix Self-Identified Issues
**Modify Step 2:**
**Current:**
```
Subagent reports back with summary of work.
```
**Proposed:**
```
Subagent performs self-reflection, then:
IF self-reflection identifies fixable issues:
1. Fix the issues
2. Re-run verification
3. Report: "Initial implementation + self-reflection fix"
ELSE:
Report: "Implementation complete"
Include in report:
- Self-reflection findings
- Whether fixes were applied
- Final verification results
```
**Why this works:**
Reduces latency when the implementer already knows the fix. Documented case: this would have saved one round-trip for the entrypoint bug.
**Trade-off:**
Slightly more complex prompt, but faster end-to-end.
---
## Implementation Plan
### Phase 1: High-Impact, Low-Risk (Do First)
1. **verification-before-completion: Configuration change verification**
- Clear addition, doesn't change existing content
- Addresses high-impact problem (false confidence in tests)
- File: `skills/verification-before-completion/SKILL.md`
2. **testing-anti-patterns: Mock-interface drift**
- Adds new anti-pattern, doesn't modify existing
- Addresses high-impact problem (runtime crashes)
- File: `skills/testing-anti-patterns/SKILL.md`
3. **requesting-code-review: Explicit file reading**
- Simple addition to template
- Fixes concrete problem (reviewers can't find files)
- File: `skills/requesting-code-review/SKILL.md`
### Phase 2: Moderate Changes (Test Carefully)
4. **subagent-driven-development: Process hygiene**
- Adds new section, doesn't change workflow
- Addresses medium-high impact (test reliability)
- File: `skills/subagent-driven-development/SKILL.md`
5. **subagent-driven-development: Self-reflection**
- Changes prompt template (higher risk)
- But documented to catch bugs
- File: `skills/subagent-driven-development/SKILL.md`
6. **subagent-driven-development: Skills reading requirement**
- Adds prompt overhead
- But ensures skills are actually used
- File: `skills/subagent-driven-development/SKILL.md`
### Phase 3: Optimization (Validate First)
7. **subagent-driven-development: Lean context option**
- Adds complexity (two approaches)
- Needs validation that it doesn't cause confusion
- File: `skills/subagent-driven-development/SKILL.md`
8. **subagent-driven-development: Allow implementer to fix**
- Changes workflow (higher risk)
- Optimization, not bug fix
- File: `skills/subagent-driven-development/SKILL.md`
---
## Open Questions
1. **Lean context approach:**
- Should we make it the default for pattern-based tasks?
- How do we decide which approach to use?
- Risk of being too lean and missing important context?
2. **Self-reflection:**
- Will this slow down simple tasks significantly?
- Should it only apply to complex tasks?
- How do we prevent "reflection fatigue" where it becomes rote?
3. **Process hygiene:**
- Should this be in subagent-driven-development or a separate skill?
- Does it apply to other workflows beyond E2E tests?
- How do we handle cases where process SHOULD persist (dev servers)?
4. **Skills reading enforcement:**
- Should we require ALL subagents to read relevant skills?
- How do we keep prompts from becoming too long?
- Risk of over-documenting and losing focus?
---
## Success Metrics
How do we know these improvements work?
1. **Configuration verification:**
- Zero instances of "test passed but wrong config was used"
- Jesse doesn't say "that's not actually testing what you think"
2. **Process hygiene:**
- Zero instances of "test hit wrong server"
- No port conflict errors during E2E test runs
3. **Mock-interface drift:**
- Zero instances of "tests pass but runtime crashes on missing method"
- No method name mismatches between mocks and interfaces
4. **Self-reflection:**
- Measurable: Do implementer reports include self-reflection findings?
- Qualitative: Do fewer bugs make it to code review?
5. **Skills reading:**
- Subagent reports reference skill gate functions
- Fewer anti-pattern violations in code review
---
## Risks and Mitigations
### Risk: Prompt Bloat
**Problem:** Adding all these requirements makes prompts overwhelming
**Mitigation:**
- Phase implementation (don't add everything at once)
- Make some additions conditional (E2E hygiene only for E2E tests)
- Consider templates for different task types
### Risk: Analysis Paralysis
**Problem:** Too much reflection/verification slows execution
**Mitigation:**
- Keep gate functions quick (seconds, not minutes)
- Make lean context opt-in initially
- Monitor task completion times
### Risk: False Sense of Security
**Problem:** Following checklist doesn't guarantee correctness
**Mitigation:**
- Emphasize gate functions are minimums, not maximums
- Keep "use judgment" language in skills
- Document that skills catch common failures, not all failures
### Risk: Skill Divergence
**Problem:** Different skills give conflicting advice
**Mitigation:**
- Review changes across all skills for consistency
- Document how skills interact (Integration sections)
- Test with real scenarios before deployment
---
## Recommendation
**Proceed with Phase 1 immediately:**
- verification-before-completion: Configuration change verification
- testing-anti-patterns: Mock-interface drift
- requesting-code-review: Explicit file reading
**Test Phase 2 with Jesse before finalizing:**
- Get feedback on self-reflection impact
- Validate process hygiene approach
- Confirm skills reading requirement is worth overhead
**Hold Phase 3 pending validation:**
- Lean context needs real-world testing
- Implementer-fix workflow change needs careful evaluation
These changes address real problems documented by users while minimizing risk of making skills worse.