Audit Templates

Templates for auditing and improving existing harness infrastructure.

Advanced profile note: eval and observability sections in this reference apply only when the project explicitly enables advanced agent-platform capabilities. Core ECL harness audits should not fail or lose score just because harness/eval, harness/trace, harness/memory, harness/checkpoints, or harness/metrics are absent.

Audit Checklist

Documentation Audit (25%)

| Item | Check | Score |
|------|-------|-------|
| AGENTS.md exists | `test -f AGENTS.md` | 0/10 |
| AGENTS.md is ~100 lines (not monolithic) | `wc -l AGENTS.md` should be 80-120 | 0/10 |
| docs/ARCHITECTURE.md exists | `test -f docs/ARCHITECTURE.md` | 0/10 |
| Architecture matches reality | Compare layer hierarchy to `go list ./...` | 0/20 |
| docs/DEVELOPMENT.md exists | `test -f docs/DEVELOPMENT.md` | 0/10 |
| Build commands in DEVELOPMENT.md work | Run them and check | 0/10 |
| docs/QUALITY.md exists | `test -f docs/QUALITY.md` | 0/10 |
| Design docs cover major components | Check `docs/design-docs/` | 0/10 |
| Reference docs are complete | Check `docs/references/` | 0/10 |

Total: /100 → Scale to 25%
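
The "Architecture matches reality" check amounts to a set comparison between the packages named in docs/ARCHITECTURE.md and those reported by `go list ./...`. A minimal sketch of that comparison follows. The function name `diffPackages` is hypothetical, and how the documented list is extracted from ARCHITECTURE.md is left open; the sample package names are taken from the drift report template later in this document.

```go
package main

import (
	"fmt"
	"sort"
)

// diffPackages compares documented package paths with actual ones.
// It returns packages present in the code but missing from the docs,
// and packages documented but no longer present in the code.
// The name and signature are illustrative, not part of the harness.
func diffPackages(documented, actual []string) (undocumented, stale []string) {
	doc := make(map[string]bool, len(documented))
	for _, p := range documented {
		doc[p] = true
	}
	act := make(map[string]bool, len(actual))
	for _, p := range actual {
		act[p] = true
		if !doc[p] {
			undocumented = append(undocumented, p)
		}
	}
	for _, p := range documented {
		if !act[p] {
			stale = append(stale, p)
		}
	}
	sort.Strings(undocumented)
	sort.Strings(stale)
	return undocumented, stale
}

func main() {
	documented := []string{"core/types", "core/agent"}
	actual := []string{"core/types", "core/agent", "core/newpkg"}
	undocumented, stale := diffPackages(documented, actual)
	fmt.Println("missing from docs:", undocumented)
	fmt.Println("documented but absent:", stale)
}
```

Each entry in either list would count as one discrepancy when applying partial credit to this item.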

Linter Audit (20%)

| Item | Check | Score |
|------|-------|-------|
| scripts/lint-deps.go exists | `test -f scripts/lint-deps.go` | 0/15 |
| Layer map covers all packages | Compare to `go list ./...` | 0/20 |
| Introducing violation fails lint | Add bad import, run lint | 0/15 |
| scripts/lint-quality.go exists | `test -f scripts/lint-quality.go` | 0/15 |
| Quality rules match QUALITY.md | Compare documented rules to linter | 0/10 |
| Makefile has lint-arch target | `grep lint-arch Makefile` | 0/10 |
| `make lint-arch` passes | Run it | 0/15 |

Total: /100 → Scale to 20%
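
Both "Layer map covers all packages" and "Introducing violation fails lint" presuppose a layer map and a dependency rule. The contents of scripts/lint-deps.go are not shown in this reference, so the following is only a sketch under two assumptions: layers are numbered, and a package must not import a package in a higher-numbered layer. `checkLayers` and the sample data are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// checkLayers reports two kinds of gaps:
//   - unmapped: packages that appear in imports but have no layer assigned
//   - violations: imports that point from a lower layer to a higher one
//
// Dependencies that are not in the layer map (for example the standard
// library) are ignored. The rule itself is an assumption for illustration.
func checkLayers(layers map[string]int, imports map[string][]string) (unmapped, violations []string) {
	for pkg, deps := range imports {
		from, ok := layers[pkg]
		if !ok {
			unmapped = append(unmapped, pkg)
			continue
		}
		for _, dep := range deps {
			if to, known := layers[dep]; known && to > from {
				violations = append(violations,
					fmt.Sprintf("%s (layer %d) imports %s (layer %d)", pkg, from, dep, to))
			}
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(unmapped)
	sort.Strings(violations)
	return unmapped, violations
}

func main() {
	layers := map[string]int{"core/types": 1, "core/agent": 2, "api/v2": 4}
	imports := map[string][]string{
		"core/types":  {"api/v2"},     // deliberate bad import
		"core/agent":  {"core/types"}, // allowed: higher layer imports lower
		"core/newpkg": {"core/types"}, // not in the layer map
	}
	unmapped, violations := checkLayers(layers, imports)
	fmt.Println("unmapped:", unmapped)
	fmt.Println("violations:", violations)
}
```

A non-empty `unmapped` list corresponds to a coverage gap; a deliberately bad import that produces no violation corresponds to a failed violation test.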

Observability Audit (15%)

| Item | Check | Score |
|------|-------|-------|
| harness/trace/ exists | `test -d harness/trace` | 0/25 |
| Trace format covers all tool types | Check `ToolTrace` struct | 0/25 |
| harness/selftest/ exists | `test -d harness/selftest` | 0/25 |
| Observability hook registered | Check hook wiring | 0/25 |

Total: /100 → Scale to 15%

Eval Audit (20%)

Item Check Score
| Item | Check | Score |
|------|-------|-------|
| harness/eval/framework.go exists | `test -f harness/eval/framework.go` | 0/10 |
| harness/eval/runner.go exists | `test -f harness/eval/runner.go` | 0/10 |
| harness/eval/scorer.go exists | `test -f harness/eval/scorer.go` | 0/10 |
| harness/eval/reporter.go exists | `test -f harness/eval/reporter.go` | 0/10 |
| file_ops/ has 5+ tasks | Count JSON files | 0/10 |
| code_gen/ has 5+ tasks | Count JSON files | 0/10 |
| debugging/ has 5+ tasks | Count JSON files | 0/10 |
| refactoring/ has 5+ tasks | Count JSON files | 0/10 |
| Tasks cover new features | Manual review | 0/10 |
| All tasks still work | Run evals | 0/10 |

Total: /100 → Scale to 20%

Quality Automation Audit (10%)

| Item | Check | Score |
|------|-------|-------|
| harness/quality/score.go exists | `test -f harness/quality/score.go` | 0/25 |
| Quality score calculation works | Run it | 0/25 |
| harness/cleanup/tasks.go exists | `test -f harness/cleanup/tasks.go` | 0/25 |
| Cleanup tasks find real issues | Run dry-run | 0/25 |

Total: /100 → Scale to 10%

Integration Audit (10%)

| Item | Check | Score |
|------|-------|-------|
| `go build ./...` passes | Run it | 0/40 |
| `make lint-arch` passes | Run it | 0/30 |
| CI runs harness checks | Check CI config | 0/30 |

Total: /100 → Scale to 10%


Scoring Rubric

How to Score Each Item

  • Binary items (exists/doesn't): 0 or full points
  • Quality items (matches reality): Partial credit based on accuracy
    • 100%: Exact match
    • 75%: Minor discrepancies (1-2 items)
    • 50%: Moderate discrepancies (3-5 items)
    • 25%: Major discrepancies but structure is right
    • 0%: Completely wrong or missing
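
The partial-credit tiers can be read as a lookup from discrepancy count to a credit fraction. The sketch below is one such reading; `partialCredit` is a hypothetical helper, and two details are assumptions: more than five discrepancies counts as "major", and the caller has already established that the item exists (a missing item scores 0 regardless).

```go
package main

import "fmt"

// partialCredit maps a discrepancy count to the rubric's credit fraction.
// structureOK distinguishes the 25% tier (major discrepancies, structure
// still right) from the 0% tier (completely wrong).
func partialCredit(discrepancies int, structureOK bool) float64 {
	switch {
	case discrepancies <= 0:
		return 1.00 // exact match
	case discrepancies <= 2:
		return 0.75 // minor discrepancies
	case discrepancies <= 5:
		return 0.50 // moderate discrepancies
	case structureOK:
		return 0.25 // major discrepancies, structure is right
	default:
		return 0 // completely wrong
	}
}

func main() {
	// "Architecture matches reality" is worth 20 points in the checklist.
	const maxPoints = 20.0
	for _, n := range []int{0, 2, 4, 9} {
		fmt.Printf("%d discrepancies: %.1f of %.0f points\n",
			n, partialCredit(n, true)*maxPoints, maxPoints)
	}
}
```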

Calculating Overall Score

Overall = (Doc × 0.25) + (Linter × 0.20) + (Obs × 0.15) + (Eval × 0.20) + (Quality × 0.10) + (Integration × 0.10)
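
A worked instance of the formula above, as a small Go sketch. The advanced profile note says core audits should not lose score when eval or observability components are absent, but this reference does not say how the formula changes in that case. Renormalizing the remaining weights, as `overall` does here when a dimension is left out, is therefore an assumption and only one of several possible choices.

```go
package main

import "fmt"

// weights mirrors the coefficients in the overall-score formula.
var weights = map[string]float64{
	"Documentation": 0.25,
	"Linters":       0.20,
	"Observability": 0.15,
	"Evals":         0.20,
	"Quality":       0.10,
	"Integration":   0.10,
}

// overall computes the weighted score from per-dimension percentages (0-100).
// Dimensions omitted from percents are skipped and the remaining weights are
// renormalized. That renormalization is an assumption, not a documented rule.
func overall(percents map[string]float64) float64 {
	var sum, weightSum float64
	for dim, p := range percents {
		w, ok := weights[dim]
		if !ok {
			continue // unknown dimension names are ignored
		}
		sum += p * w
		weightSum += w
	}
	if weightSum == 0 {
		return 0
	}
	return sum / weightSum
}

func main() {
	full := map[string]float64{
		"Documentation": 80, "Linters": 60, "Observability": 40,
		"Evals": 50, "Quality": 50, "Integration": 100,
	}
	// Core profile: Observability and Evals are not audited at all.
	core := map[string]float64{
		"Documentation": 80, "Linters": 60, "Quality": 50, "Integration": 100,
	}
	fmt.Printf("full profile: %.1f%%\n", overall(full))
	fmt.Printf("core profile: %.1f%%\n", overall(core))
}
```

With all six dimensions present the weights sum to 1, so the result is the formula as written: 20 + 12 + 6 + 10 + 5 + 10 = 63. For the core profile the weighted sum is 47 over a total weight of 0.65, or about 72.3.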

Score Interpretation

| Score | Status | Action |
|-------|--------|--------|
| 0-20% | Critical | Use Create Mode: build from scratch |
| 21-40% | Poor | Major gaps: extensive improvement needed |
| 41-60% | Fair | Multiple gaps: targeted improvement |
| 61-80% | Good | Minor gaps: polish and expand |
| 81-100% | Excellent | Maintenance mode: keep current |
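
The bands above translate directly into a lookup. In this sketch the function name `interpret` is hypothetical, and fractional scores that fall between two bands (20.5, for example) are assigned to the higher band; the table does not specify that case, so treat it as an assumption.

```go
package main

import "fmt"

// interpret maps an overall score (0-100) to the status label of its band.
// Each case is an inclusive upper bound, so 20 is Critical and 20.5 is Poor.
func interpret(score float64) string {
	switch {
	case score <= 20:
		return "Critical"
	case score <= 40:
		return "Poor"
	case score <= 60:
		return "Fair"
	case score <= 80:
		return "Good"
	default:
		return "Excellent"
	}
}

func main() {
	for _, s := range []float64{18, 47, 63, 85} {
		fmt.Printf("%.0f%% -> %s\n", s, interpret(s))
	}
}
```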

Gap Analysis Templates

Documentation Drift Report

## Documentation Drift Analysis

### ARCHITECTURE.md Layer Hierarchy

**Documented Layers:**

[Copy from ARCHITECTURE.md]

**Actual Package Structure:**
```bash
go list ./... | grep -v vendor
```

**Discrepancies:**

| Documented | Actual | Issue |
|------------|--------|-------|
| core/types | core/types | ✓ Match |
| core/agent | core/agent | ✓ Match |
| - | core/newpkg | Missing from docs |

### Tool Catalog

**Documented Tools:** [count]
**Actual Tools:** [count]

**Missing from docs:**
- ToolA (added in commit abc123)
- ToolB (added in commit def456)

### Error Codes

**Documented Codes:** [count]
**Actual Codes:** [count]

**Missing from docs:**
- 300105 NotFoundError (added in PR #123)

Linter Gap Report

## Linter Gap Analysis

### Layer Map Coverage

**Packages in layer map:** [count]
**Packages in codebase:** [count]

**Missing from layer map:**
| Package | Suggested Layer | Reason |
|---------|-----------------|--------|
| core/newpkg | Layer 2 | Depends only on core/types |
| api/v2 | Layer 4 | New API version |

### Violation Test Results

| Test | Expected | Actual | Status |
|------|----------|--------|--------|
| Bad import in core/types | Fail | Fail | ✓ Pass |
| Bad import in core/agent | Fail | Fail | ✓ Pass |
| Bad import in api/v2 | Fail | Pass | ✗ Gap |

### Quality Rules Coverage

**Rules in QUALITY.md:** [count]
**Rules in lint-quality.go:** [count]

**Missing enforcement:**
- Rule 5: "No hardcoded timeouts" — not checked by linter

Eval Coverage Report

## Eval Coverage Analysis

### Tasks per Category

| Category | Count | Target | Status |
|----------|-------|--------|--------|
| file_ops | 3 | 5+ | ✗ Below target |
| code_gen | 2 | 5+ | ✗ Below target |
| debugging | 5 | 5+ | ✓ Meets target |
| refactoring | 4 | 5+ | ✗ Below target |

### Feature Coverage

| Feature | Has Eval | Priority |
|---------|----------|----------|
| File write | ✓ | - |
| File read | ✓ | - |
| JSON parsing | ✗ | P1 |
| Error handling | ✓ | - |
| New auth module | ✗ | P0 |

### Task Health

| Task ID | Status | Issue |
|---------|--------|-------|
| file_ops_001 | ✓ Works | - |
| code_gen_001 | ✗ Broken | Uses removed API |
| debug_001 | ✓ Works | - |

Improvement Plan Template

## Harness Improvement Plan

**Project:** [Name]
**Audit Date:** YYYY-MM-DD
**Audit Score:** XX%
**Target Score:** 80%+

### Priority Gaps

#### P0 — Critical (Fix Immediately)
1. [Gap description]
   - Impact: [Why this matters]
   - Fix: [Specific action]
   - Effort: [Hours estimate]

#### P1 — High (Fix This Sprint)
1. [Gap description]
   - Impact: [Why this matters]
   - Fix: [Specific action]
   - Effort: [Hours estimate]

#### P2 — Medium (Fix Next Sprint)
1. [Gap description]
   - Impact: [Why this matters]
   - Fix: [Specific action]
   - Effort: [Hours estimate]

#### P3 — Low (Backlog)
1. [Gap description]
   - Impact: [Why this matters]
   - Fix: [Specific action]
   - Effort: [Hours estimate]

### Improvement Timeline

| Week | Focus | Expected Score |
|------|-------|----------------|
| 1 | P0 gaps | 45% → 55% |
| 2 | P1 gaps | 55% → 70% |
| 3 | P2 gaps | 70% → 80% |
| 4 | P3 gaps + polish | 80% → 85% |

### Success Metrics

- [ ] Audit score ≥ 80%
- [ ] No P0 or P1 gaps remaining
- [ ] `make lint-arch` passes
- [ ] All eval categories have 5+ tasks
- [ ] Quality score trend is positive

Before/After Comparison Template

## Improvement Results

**Project:** [Name]
**Improvement Period:** YYYY-MM-DD to YYYY-MM-DD

### Score Comparison

| Dimension | Before | After | Delta |
|-----------|--------|-------|-------|
| Documentation | XX% | XX% | +XX% |
| Linters | XX% | XX% | +XX% |
| Observability | XX% | XX% | +XX% |
| Evals | XX% | XX% | +XX% |
| Quality | XX% | XX% | +XX% |
| Integration | XX% | XX% | +XX% |
| **Overall** | **XX%** | **XX%** | **+XX%** |

### Changes Made

#### Documentation
- Updated ARCHITECTURE.md with [changes]
- Created design doc for [component]
- Added [N] entries to tool catalog

#### Linters
- Added [N] packages to layer map
- Created new linter for [pattern]
- Fixed [N] false positives

#### Evals
- Added [N] new eval tasks
- Removed [N] obsolete tasks
- Updated [N] broken tasks

#### Quality
- Added cleanup task for [pattern]
- Updated quality score weights
- Fixed [N] golden principle violations

### Remaining Gaps

[List any P2/P3 items not yet addressed]

### Recommendations

[Next steps for maintaining/improving harness]

Automated Audit Script

// scripts/audit-harness.go
//
// Automated harness audit. Run: go run scripts/audit-harness.go
//
// Outputs JSON with scores per dimension.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

type AuditResult struct {
	Dimension string  `json:"dimension"`
	Score     float64 `json:"score"`
	MaxScore  float64 `json:"max_score"`
	Percent   float64 `json:"percent"`
	Items     []AuditItem `json:"items"`
}

type AuditItem struct {
	Name    string  `json:"name"`
	Score   float64 `json:"score"`
	Max     float64 `json:"max"`
	Notes   string  `json:"notes,omitempty"`
}

func main() {
	results := []AuditResult{
		auditDocumentation(),
		auditLinters(),
		auditObservability(),
		auditEvals(),
		auditQuality(),
		auditIntegration(),
	}

	// Calculate overall
	weights := map[string]float64{
		"Documentation": 0.25,
		"Linters": 0.20,
		"Observability": 0.15,
		"Evals": 0.20,
		"Quality": 0.10,
		"Integration": 0.10,
	}

	var overall float64
	for _, r := range results {
		overall += r.Percent * weights[r.Dimension]
	}

	// Output
	output := map[string]interface{}{
		"results": results,
		"overall": overall,
	}

	data, _ := json.MarshalIndent(output, "", "  ")
	fmt.Println(string(data))
}

func auditDocumentation() AuditResult {
	r := AuditResult{Dimension: "Documentation", MaxScore: 100}

	// Check files exist
	files := map[string]float64{
		"AGENTS.md": 10,
		"docs/ARCHITECTURE.md": 10,
		"docs/DEVELOPMENT.md": 10,
		"docs/QUALITY.md": 10,
	}

	for file, points := range files {
		if _, err := os.Stat(file); err == nil {
			r.Score += points
			r.Items = append(r.Items, AuditItem{Name: file, Score: points, Max: points})
		} else {
			r.Items = append(r.Items, AuditItem{Name: file, Score: 0, Max: points, Notes: "missing"})
		}
	}

	// Check docs/design-docs/ has files (not just the index)
	if matches, _ := filepath.Glob("docs/design-docs/*.md"); len(matches) > 0 {
		// Exclude index.md from count
		actualDocs := 0
		for _, m := range matches {
			if !strings.HasSuffix(m, "index.md") {
				actualDocs++
			}
		}
		score := min(float64(actualDocs)*5, 20)
		r.Score += score
		r.Items = append(r.Items, AuditItem{Name: "docs/design-docs/", Score: score, Max: 20, Notes: fmt.Sprintf("%d design docs (excluding index)", actualDocs)})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "docs/design-docs/", Score: 0, Max: 20, Notes: "empty or missing"})
	}

	// Check docs/references/ has files
	if matches, _ := filepath.Glob("docs/references/*.md"); len(matches) > 0 {
		score := min(float64(len(matches))*5, 20)
		r.Score += score
		r.Items = append(r.Items, AuditItem{Name: "docs/references/", Score: score, Max: 20, Notes: fmt.Sprintf("%d files", len(matches))})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "docs/references/", Score: 0, Max: 20, Notes: "empty or missing"})
	}

	// Remaining 20 points for AGENTS.md line count
	if data, err := os.ReadFile("AGENTS.md"); err == nil {
		lines := strings.Count(string(data), "\n") // newline count, same as wc -l
		if lines >= 80 && lines <= 150 {
			r.Score += 20
			r.Items = append(r.Items, AuditItem{Name: "AGENTS.md size", Score: 20, Max: 20, Notes: fmt.Sprintf("%d lines", lines)})
		} else if lines < 80 {
			r.Items = append(r.Items, AuditItem{Name: "AGENTS.md size", Score: 10, Max: 20, Notes: fmt.Sprintf("%d lines (too short)", lines)})
			r.Score += 10
		} else {
			r.Items = append(r.Items, AuditItem{Name: "AGENTS.md size", Score: 5, Max: 20, Notes: fmt.Sprintf("%d lines (too long, should be map)", lines)})
			r.Score += 5
		}
	}

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

func auditLinters() AuditResult {
	r := AuditResult{Dimension: "Linters", MaxScore: 100}

	linters := []string{"scripts/lint-deps.go", "scripts/lint-quality.go"}
	for _, l := range linters {
		if _, err := os.Stat(l); err == nil {
			r.Score += 25
			r.Items = append(r.Items, AuditItem{Name: l, Score: 25, Max: 25})
		} else {
			r.Items = append(r.Items, AuditItem{Name: l, Score: 0, Max: 25, Notes: "missing"})
		}
	}

	// Check Makefile
	if data, err := os.ReadFile("Makefile"); err == nil {
		if strings.Contains(string(data), "lint-arch") {
			r.Score += 25
			r.Items = append(r.Items, AuditItem{Name: "Makefile lint-arch", Score: 25, Max: 25})
		} else {
			r.Items = append(r.Items, AuditItem{Name: "Makefile lint-arch", Score: 0, Max: 25, Notes: "target missing"})
		}
	}

	// Remaining 25 for additional linters
	if matches, _ := filepath.Glob("scripts/lint-*.go"); len(matches) > 2 {
		r.Score += 25
		r.Items = append(r.Items, AuditItem{Name: "additional linters", Score: 25, Max: 25, Notes: fmt.Sprintf("%d total", len(matches))})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "additional linters", Score: 0, Max: 25, Notes: "only core linters"})
	}

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

func auditObservability() AuditResult {
	r := AuditResult{Dimension: "Observability", MaxScore: 100}

	dirs := map[string]float64{
		"harness/trace": 35,
		"harness/selftest": 35,
	}

	for dir, points := range dirs {
		if info, err := os.Stat(dir); err == nil && info.IsDir() {
			r.Score += points
			r.Items = append(r.Items, AuditItem{Name: dir, Score: points, Max: points})
		} else {
			r.Items = append(r.Items, AuditItem{Name: dir, Score: 0, Max: points, Notes: "missing"})
		}
	}

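	// Check for observability hook. filepath.Glob has no recursive "**"
	// matching, so walk the tree and match each file's base name instead.
	hookFound := false
	_ = filepath.Walk(".", func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return nil // skip entries that cannot be read
		}
		if ok, _ := filepath.Match("observability*.go", info.Name()); ok && !info.IsDir() {
			hookFound = true
		}
		return nil
	})
	if hookFound {
		r.Score += 30
		r.Items = append(r.Items, AuditItem{Name: "observability hook", Score: 30, Max: 30})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "observability hook", Score: 0, Max: 30, Notes: "not found"})
	}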

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

func auditEvals() AuditResult {
	r := AuditResult{Dimension: "Evals", MaxScore: 100}

	// Framework files (40 points)
	files := []string{
		"harness/eval/framework.go",
		"harness/eval/runner.go",
		"harness/eval/scorer.go",
		"harness/eval/reporter.go",
	}
	for _, f := range files {
		if _, err := os.Stat(f); err == nil {
			r.Score += 10
			r.Items = append(r.Items, AuditItem{Name: f, Score: 10, Max: 10})
		} else {
			r.Items = append(r.Items, AuditItem{Name: f, Score: 0, Max: 10, Notes: "missing"})
		}
	}

	// Dataset categories (60 points, 15 each)
	categories := []string{"file_ops", "code_gen", "debugging", "refactoring"}
	for _, cat := range categories {
		pattern := fmt.Sprintf("harness/eval/datasets/%s/*.json", cat)
		matches, _ := filepath.Glob(pattern)
		if len(matches) >= 5 {
			r.Score += 15
			r.Items = append(r.Items, AuditItem{Name: cat, Score: 15, Max: 15, Notes: fmt.Sprintf("%d tasks", len(matches))})
		} else if len(matches) > 0 {
			score := float64(len(matches)) * 3
			r.Score += score
			r.Items = append(r.Items, AuditItem{Name: cat, Score: score, Max: 15, Notes: fmt.Sprintf("%d tasks (need 5+)", len(matches))})
		} else {
			r.Items = append(r.Items, AuditItem{Name: cat, Score: 0, Max: 15, Notes: "no tasks"})
		}
	}

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

func auditQuality() AuditResult {
	r := AuditResult{Dimension: "Quality", MaxScore: 100}

	items := map[string]float64{
		"harness/quality/score.go": 35,
		"harness/cleanup/tasks.go": 35,
		"docs/QUALITY.md": 30,
	}

	for item, points := range items {
		if _, err := os.Stat(item); err == nil {
			r.Score += points
			r.Items = append(r.Items, AuditItem{Name: item, Score: points, Max: points})
		} else {
			r.Items = append(r.Items, AuditItem{Name: item, Score: 0, Max: points, Notes: "missing"})
		}
	}

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

func auditIntegration() AuditResult {
	r := AuditResult{Dimension: "Integration", MaxScore: 100}

	// Check go.mod exists (a proxy only; this does not run go build ./...)
	if _, err := os.Stat("go.mod"); err == nil {
		r.Score += 40
		r.Items = append(r.Items, AuditItem{Name: "go.mod", Score: 40, Max: 40})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "go.mod", Score: 0, Max: 40, Notes: "missing"})
	}

	// Check Makefile exists
	if _, err := os.Stat("Makefile"); err == nil {
		r.Score += 30
		r.Items = append(r.Items, AuditItem{Name: "Makefile", Score: 30, Max: 30})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "Makefile", Score: 0, Max: 30, Notes: "missing"})
	}

	// Check for CI config
	ciConfigs := []string{".github/workflows", ".gitlab-ci.yml", "Jenkinsfile", ".circleci"}
	found := false
	for _, ci := range ciConfigs {
		if _, err := os.Stat(ci); err == nil {
			found = true
			r.Score += 30
			r.Items = append(r.Items, AuditItem{Name: "CI config", Score: 30, Max: 30, Notes: ci})
			break
		}
	}
	if !found {
		r.Items = append(r.Items, AuditItem{Name: "CI config", Score: 0, Max: 30, Notes: "not found"})
	}

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

func min(a, b float64) float64 {
	if a < b {
		return a
	}
	return b
}

Note: The script uses the strings package, so "strings" must be listed in the import block at the top of the file. Go requires every import declaration to come before other declarations, so an import "strings" placed at the end of the file will not compile.
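
Because the audit script prints JSON, its output can feed a CI gate that enforces the "Audit score ≥ 80%" success metric. The sketch below is one way to do that under stated assumptions: the `passes` helper and the `report` type are hypothetical, and only the `overall` key is taken from the script's output. For a self-contained demonstration, `main` checks an inline sample rather than reading real audit output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// report holds the only field this gate needs from the audit JSON.
type report struct {
	Overall float64 `json:"overall"`
}

// passes reports whether the audit output meets the threshold.
// Both the overall score and the threshold are on a 0-100 scale.
func passes(data []byte, threshold float64) (bool, error) {
	var r report
	if err := json.Unmarshal(data, &r); err != nil {
		return false, fmt.Errorf("parse audit output: %w", err)
	}
	return r.Overall >= threshold, nil
}

func main() {
	// Inline sample standing in for the audit script's output.
	sample := []byte(`{"results": [], "overall": 72.5}`)

	ok, err := passes(sample, 80)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if ok {
		fmt.Println("audit gate: pass")
	} else {
		fmt.Println("audit gate: below 80% target")
	}
}
```

In a real pipeline the sample would be replaced by the bytes read from the audit script's standard output, and a failing gate would exit with a non-zero status so the CI job fails.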