# Audit Templates
Templates for auditing and improving existing harness infrastructure.
Advanced profile note: eval and observability sections in this reference apply only when the
project explicitly enables advanced agent-platform capabilities. Core ECL harness audits should
not fail or lose score just because `harness/eval`, `harness/trace`, `harness/memory`,
`harness/checkpoints`, or `harness/metrics` are absent.
## Audit Checklist
### Documentation Audit (25%)
| Item | Check | Score |
|------|-------|-------|
| AGENTS.md exists | `test -f AGENTS.md` | 0/10 |
| AGENTS.md is ~100 lines (not monolithic) | `wc -l AGENTS.md` should be 80-120 | 0/10 |
| docs/ARCHITECTURE.md exists | `test -f docs/ARCHITECTURE.md` | 0/10 |
| Architecture matches reality | Compare layer hierarchy to `go list ./...` | 0/20 |
| docs/DEVELOPMENT.md exists | `test -f docs/DEVELOPMENT.md` | 0/10 |
| Build commands in DEVELOPMENT.md work | Run them and check | 0/10 |
| docs/QUALITY.md exists | `test -f docs/QUALITY.md` | 0/10 |
| Design docs cover major components | Check docs/design-docs/ | 0/10 |
| Reference docs are complete | Check docs/references/ | 0/10 |
**Total: /100 → Scale to 25%**
### Linter Audit (20%)
| Item | Check | Score |
|------|-------|-------|
| scripts/lint-deps.go exists | `test -f scripts/lint-deps.go` | 0/15 |
| Layer map covers all packages | Compare to `go list ./...` | 0/20 |
| Introducing violation fails lint | Add bad import, run lint | 0/15 |
| scripts/lint-quality.go exists | `test -f scripts/lint-quality.go` | 0/15 |
| Quality rules match QUALITY.md | Compare documented rules to linter | 0/10 |
| Makefile has lint-arch target | `grep lint-arch Makefile` | 0/10 |
| `make lint-arch` passes | Run it | 0/15 |
**Total: /100 → Scale to 20%**
### Observability Audit (15%)
| Item | Check | Score |
|------|-------|-------|
| harness/trace/ exists | `test -d harness/trace` | 0/25 |
| Trace format covers all tool types | Check ToolTrace struct | 0/25 |
| harness/selftest/ exists | `test -d harness/selftest` | 0/25 |
| Observability hook registered | Check hook wiring | 0/25 |
**Total: /100 → Scale to 15%**
### Eval Audit (20%)
| Item | Check | Score |
|------|-------|-------|
| harness/eval/framework.go exists | `test -f harness/eval/framework.go` | 0/10 |
| harness/eval/runner.go exists | `test -f harness/eval/runner.go` | 0/10 |
| harness/eval/scorer.go exists | `test -f harness/eval/scorer.go` | 0/10 |
| harness/eval/reporter.go exists | `test -f harness/eval/reporter.go` | 0/10 |
| file_ops/ has 5+ tasks | Count JSON files | 0/10 |
| code_gen/ has 5+ tasks | Count JSON files | 0/10 |
| debugging/ has 5+ tasks | Count JSON files | 0/10 |
| refactoring/ has 5+ tasks | Count JSON files | 0/10 |
| Tasks cover new features | Manual review | 0/10 |
| All tasks still work | Run evals | 0/10 |
**Total: /100 → Scale to 20%**
### Quality Automation Audit (10%)
| Item | Check | Score |
|------|-------|-------|
| harness/quality/score.go exists | `test -f harness/quality/score.go` | 0/25 |
| Quality score calculation works | Run it | 0/25 |
| harness/cleanup/tasks.go exists | `test -f harness/cleanup/tasks.go` | 0/25 |
| Cleanup tasks find real issues | Run dry-run | 0/25 |
**Total: /100 → Scale to 10%**
### Integration Audit (10%)
| Item | Check | Score |
|------|-------|-------|
| `go build ./...` passes | Run it | 0/40 |
| `make lint-arch` passes | Run it | 0/30 |
| CI runs harness checks | Check CI config | 0/30 |
**Total: /100 → Scale to 10%**
---
## Scoring Rubric
### How to Score Each Item
- **Binary items** (exists/doesn't): 0 or full points
- **Quality items** (matches reality): Partial credit based on accuracy
  - 100%: Exact match
  - 75%: Minor discrepancies (1-2 items)
  - 50%: Moderate discrepancies (3-5 items)
  - 25%: Major discrepancies but structure is right
  - 0%: Completely wrong or missing
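The partial-credit tiers above can be sketched as a small helper. This is an illustration only: the function name `partialCredit` and the `structureRight` flag are hypothetical, and treating "more than 5 discrepancies" as the "major" tier is an assumption, since the rubric gives no number for it.

```go
package main

import "fmt"

// partialCredit returns the fraction of an item's points to award.
// Assumption: more than 5 discrepancies counts as "major"; the rubric
// itself does not define that cut-off.
func partialCredit(discrepancies int, structureRight bool) float64 {
	switch {
	case !structureRight:
		return 0 // completely wrong or missing
	case discrepancies == 0:
		return 1.0 // exact match
	case discrepancies <= 2:
		return 0.75 // minor discrepancies
	case discrepancies <= 5:
		return 0.50 // moderate discrepancies
	default:
		return 0.25 // major discrepancies, structure still right
	}
}

func main() {
	// "Architecture matches reality" is worth 20 points in the checklist.
	const maxPoints = 20.0
	for _, n := range []int{0, 2, 4, 9} {
		fmt.Printf("%d discrepancies: %.0f/%.0f\n", n, partialCredit(n, true)*maxPoints, maxPoints)
	}
}
```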
### Calculating Overall Score
```
Overall = (Doc × 0.25) + (Linter × 0.20) + (Obs × 0.15) + (Eval × 0.20) + (Quality × 0.10) + (Integration × 0.10)
```
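A minimal sketch of the weighted sum follows. The advanced profile note at the top of this reference says core audits should not lose score when eval and observability are absent, but it does not say how to compute that; renormalizing the remaining weights, as done here, is one possible reading and is an assumption. The name `overallScore` is hypothetical.

```go
package main

import "fmt"

// weights mirrors the formula above.
var weights = map[string]float64{
	"Documentation": 0.25,
	"Linters":       0.20,
	"Observability": 0.15,
	"Evals":         0.20,
	"Quality":       0.10,
	"Integration":   0.10,
}

// overallScore combines per-dimension percentages (0-100).
// Assumption: a dimension absent from percents is "not applicable",
// and the remaining weights are renormalized to sum to 1.
func overallScore(percents map[string]float64) float64 {
	var sum, weightSum float64
	for dim, w := range weights {
		p, ok := percents[dim]
		if !ok {
			continue
		}
		sum += p * w
		weightSum += w
	}
	if weightSum == 0 {
		return 0
	}
	return sum / weightSum
}

func main() {
	full := map[string]float64{
		"Documentation": 80, "Linters": 60, "Observability": 40,
		"Evals": 50, "Quality": 70, "Integration": 90,
	}
	core := map[string]float64{
		"Documentation": 80, "Linters": 60, "Quality": 70, "Integration": 90,
	}
	fmt.Printf("advanced profile: %.1f%%\n", overallScore(full)) // 64.0%
	fmt.Printf("core profile:     %.1f%%\n", overallScore(core)) // 73.8%
}
```

With all six dimensions present the result is identical to the formula above; with only the four core dimensions the weights 0.25, 0.20, 0.10, and 0.10 are divided by their sum, 0.65.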
### Score Interpretation
| Score | Status | Action |
|-------|--------|--------|
| 0-20% | Critical | Use Create Mode — build from scratch |
| 21-40% | Poor | Major gaps — extensive improvement needed |
| 41-60% | Fair | Multiple gaps — targeted improvement |
| 61-80% | Good | Minor gaps — polish and expand |
| 81-100% | Excellent | Maintenance mode — keep current |
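The interpretation table can be expressed as a lookup. The sketch below is illustrative and the function name `status` is hypothetical. The table lists whole-number bands only, so treating each upper bound as inclusive (a score of 20.5 falls into "Poor") is an assumption.

```go
package main

import "fmt"

// status maps an overall audit score (0-100) to the labels in the table.
// Assumption: upper bounds are inclusive, so fractional scores between
// two bands land in the higher band.
func status(score float64) string {
	switch {
	case score <= 20:
		return "Critical"
	case score <= 40:
		return "Poor"
	case score <= 60:
		return "Fair"
	case score <= 80:
		return "Good"
	default:
		return "Excellent"
	}
}

func main() {
	for _, s := range []float64{12, 40, 64, 85} {
		fmt.Printf("%.0f%% -> %s\n", s, status(s))
	}
}
```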
---
## Gap Analysis Templates
### Documentation Drift Report
````markdown
## Documentation Drift Analysis
### ARCHITECTURE.md Layer Hierarchy
**Documented Layers:**
```
[Copy from ARCHITECTURE.md]
```
**Actual Package Structure:**
```bash
go list ./... | grep -v vendor
```
**Discrepancies:**
| Documented | Actual | Issue |
|------------|--------|-------|
| core/types | core/types | ✓ Match |
| core/agent | core/agent | ✓ Match |
| - | core/newpkg | Missing from docs |
### Tool Catalog
**Documented Tools:** [count]
**Actual Tools:** [count]
**Missing from docs:**
- ToolA (added in commit abc123)
- ToolB (added in commit def456)
### Error Codes
**Documented Codes:** [count]
**Actual Codes:** [count]
**Missing from docs:**
- 300105 NotFoundError (added in PR #123)
````
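The Discrepancies table in the drift report is essentially a two-way set difference between documented and actual packages. A minimal sketch, assuming both lists have already been reduced to the same form (`go list ./...` prints full import paths, so the module prefix would need trimming first); the function name `drift` is hypothetical:

```go
package main

import (
	"fmt"
	"sort"
)

// drift compares documented package paths with the actual ones and
// returns what the docs are missing and what the docs still list
// although it no longer exists.
func drift(documented, actual []string) (missingFromDocs, staleInDocs []string) {
	docSet := make(map[string]bool, len(documented))
	for _, p := range documented {
		docSet[p] = true
	}
	actSet := make(map[string]bool, len(actual))
	for _, p := range actual {
		actSet[p] = true
	}
	for _, p := range actual {
		if !docSet[p] {
			missingFromDocs = append(missingFromDocs, p)
		}
	}
	for _, p := range documented {
		if !actSet[p] {
			staleInDocs = append(staleInDocs, p)
		}
	}
	sort.Strings(missingFromDocs)
	sort.Strings(staleInDocs)
	return missingFromDocs, staleInDocs
}

func main() {
	documented := []string{"core/types", "core/agent"}
	actual := []string{"core/types", "core/agent", "core/newpkg"}
	missing, stale := drift(documented, actual)
	fmt.Println("Missing from docs:", missing)
	fmt.Println("Stale in docs:", stale)
}
```

The same comparison applies to the layer map in the linter gap report, with the layer map's keys in place of the documented packages.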
### Linter Gap Report
```markdown
## Linter Gap Analysis
### Layer Map Coverage
**Packages in layer map:** [count]
**Packages in codebase:** [count]
**Missing from layer map:**
| Package | Suggested Layer | Reason |
|---------|-----------------|--------|
| core/newpkg | Layer 2 | Depends only on core/types |
| api/v2 | Layer 4 | New API version |
### Violation Test Results
| Test | Expected | Actual | Status |
|------|----------|--------|--------|
| Bad import in core/types | Fail | Fail | ✓ Pass |
| Bad import in core/agent | Fail | Fail | ✓ Pass |
| Bad import in api/v2 | Fail | Pass | ✗ Gap |
### Quality Rules Coverage
**Rules in QUALITY.md:** [count]
**Rules in lint-quality.go:** [count]
**Missing enforcement:**
- Rule 5: "No hardcoded timeouts" — not checked by linter
```
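The Violation Test Results table records whether the linter fails once a bad import has been introduced. One way to check the "Actual" column is to run the lint command and look only at its exit status. This is a sketch, not part of the harness: the helper name `commandFails` is hypothetical, and adding and reverting the bad import is left to the auditor.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// commandFails runs a command and reports whether it exited unsuccessfully.
// Problems starting the command (for example, binary not found) are
// returned as an error rather than counted as a lint failure.
func commandFails(name string, args ...string) (bool, error) {
	err := exec.Command(name, args...).Run()
	if err == nil {
		return false, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return true, nil
	}
	return false, err
}

func main() {
	// Run this after adding a deliberately bad import to one package.
	failed, err := commandFails("make", "lint-arch")
	switch {
	case err != nil:
		fmt.Println("could not run lint:", err)
	case failed:
		fmt.Println("Pass: the violation was caught")
	default:
		fmt.Println("Gap: lint passed despite the violation")
	}
}
```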
### Eval Coverage Report
```markdown
## Eval Coverage Analysis
### Tasks per Category
| Category | Count | Target | Status |
|----------|-------|--------|--------|
| file_ops | 3 | 5+ | ✗ Below target |
| code_gen | 2 | 5+ | ✗ Below target |
| debugging | 5 | 5+ | ✓ Meets target |
| refactoring | 4 | 5+ | ✗ Below target |
### Feature Coverage
| Feature | Has Eval | Priority |
|---------|----------|----------|
| File write | ✓ | - |
| File read | ✓ | - |
| JSON parsing | ✗ | P1 |
| Error handling | ✓ | - |
| New auth module | ✗ | P0 |
### Task Health
| Task ID | Status | Issue |
|---------|--------|-------|
| file_ops_001 | ✓ Works | - |
| code_gen_001 | ✗ Broken | Uses removed API |
| debug_001 | ✓ Works | - |
```
---
## Improvement Plan Template
```markdown
## Harness Improvement Plan
**Project:** [Name]
**Audit Date:** YYYY-MM-DD
**Audit Score:** XX%
**Target Score:** 80%+
### Priority Gaps
#### P0 — Critical (Fix Immediately)
1. [Gap description]
   - Impact: [Why this matters]
   - Fix: [Specific action]
   - Effort: [Hours estimate]
#### P1 — High (Fix This Sprint)
1. [Gap description]
   - Impact: [Why this matters]
   - Fix: [Specific action]
   - Effort: [Hours estimate]
#### P2 — Medium (Fix Next Sprint)
1. [Gap description]
   - Impact: [Why this matters]
   - Fix: [Specific action]
   - Effort: [Hours estimate]
#### P3 — Low (Backlog)
1. [Gap description]
   - Impact: [Why this matters]
   - Fix: [Specific action]
   - Effort: [Hours estimate]
### Improvement Timeline
| Week | Focus | Expected Score |
|------|-------|----------------|
| 1 | P0 gaps | 45% → 55% |
| 2 | P1 gaps | 55% → 70% |
| 3 | P2 gaps | 70% → 80% |
| 4 | P3 gaps + polish | 80% → 85% |
### Success Metrics
- [ ] Audit score ≥ 80%
- [ ] No P0 or P1 gaps remaining
- [ ] `make lint-arch` passes
- [ ] All eval categories have 5+ tasks
- [ ] Quality score trend is positive
```
---
## Before/After Comparison Template
```markdown
## Improvement Results
**Project:** [Name]
**Improvement Period:** YYYY-MM-DD to YYYY-MM-DD
### Score Comparison
| Dimension | Before | After | Delta |
|-----------|--------|-------|-------|
| Documentation | XX% | XX% | +XX% |
| Linters | XX% | XX% | +XX% |
| Observability | XX% | XX% | +XX% |
| Evals | XX% | XX% | +XX% |
| Quality | XX% | XX% | +XX% |
| Integration | XX% | XX% | +XX% |
| **Overall** | **XX%** | **XX%** | **+XX%** |
### Changes Made
#### Documentation
- Updated ARCHITECTURE.md with [changes]
- Created design doc for [component]
- Added [N] entries to tool catalog
#### Linters
- Added [N] packages to layer map
- Created new linter for [pattern]
- Fixed [N] false positives
#### Evals
- Added [N] new eval tasks
- Removed [N] obsolete tasks
- Updated [N] broken tasks
#### Quality
- Added cleanup task for [pattern]
- Updated quality score weights
- Fixed [N] golden principle violations
### Remaining Gaps
[List any P2/P3 items not yet addressed]
### Recommendations
[Next steps for maintaining/improving harness]
```
---
## Automated Audit Script
```go
// scripts/audit-harness.go
//
// Automated harness audit. Run: go run scripts/audit-harness.go
//
// Outputs JSON with scores per dimension.
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// AuditResult holds the score for one audit dimension.
type AuditResult struct {
	Dimension string      `json:"dimension"`
	Score     float64     `json:"score"`
	MaxScore  float64     `json:"max_score"`
	Percent   float64     `json:"percent"`
	Items     []AuditItem `json:"items"`
}

// AuditItem is a single scored check within a dimension.
type AuditItem struct {
	Name  string  `json:"name"`
	Score float64 `json:"score"`
	Max   float64 `json:"max"`
	Notes string  `json:"notes,omitempty"`
}

func main() {
	results := []AuditResult{
		auditDocumentation(),
		auditLinters(),
		auditObservability(),
		auditEvals(),
		auditQuality(),
		auditIntegration(),
	}

	// Calculate the weighted overall score.
	weights := map[string]float64{
		"Documentation": 0.25,
		"Linters":       0.20,
		"Observability": 0.15,
		"Evals":         0.20,
		"Quality":       0.10,
		"Integration":   0.10,
	}
	var overall float64
	for _, r := range results {
		overall += r.Percent * weights[r.Dimension]
	}

	// Output
	output := map[string]interface{}{
		"results": results,
		"overall": overall,
	}
	data, err := json.MarshalIndent(output, "", "  ")
	if err != nil {
		fmt.Fprintln(os.Stderr, "marshal audit output:", err)
		os.Exit(1)
	}
	fmt.Println(string(data))
}

func auditDocumentation() AuditResult {
	r := AuditResult{Dimension: "Documentation", MaxScore: 100}

	// Check files exist
	files := map[string]float64{
		"AGENTS.md":            10,
		"docs/ARCHITECTURE.md": 10,
		"docs/DEVELOPMENT.md":  10,
		"docs/QUALITY.md":      10,
	}
	for file, points := range files {
		if _, err := os.Stat(file); err == nil {
			r.Score += points
			r.Items = append(r.Items, AuditItem{Name: file, Score: points, Max: points})
		} else {
			r.Items = append(r.Items, AuditItem{Name: file, Score: 0, Max: points, Notes: "missing"})
		}
	}

	// Check docs/design-docs/ has files (not just the index)
	if matches, _ := filepath.Glob("docs/design-docs/*.md"); len(matches) > 0 {
		// Exclude index.md from count
		actualDocs := 0
		for _, m := range matches {
			if !strings.HasSuffix(m, "index.md") {
				actualDocs++
			}
		}
		score := min(float64(actualDocs)*5, 20)
		r.Score += score
		r.Items = append(r.Items, AuditItem{Name: "docs/design-docs/", Score: score, Max: 20, Notes: fmt.Sprintf("%d design docs (excluding index)", actualDocs)})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "docs/design-docs/", Score: 0, Max: 20, Notes: "empty or missing"})
	}

	// Check docs/references/ has files
	if matches, _ := filepath.Glob("docs/references/*.md"); len(matches) > 0 {
		score := min(float64(len(matches))*5, 20)
		r.Score += score
		r.Items = append(r.Items, AuditItem{Name: "docs/references/", Score: score, Max: 20, Notes: fmt.Sprintf("%d files", len(matches))})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "docs/references/", Score: 0, Max: 20, Notes: "empty or missing"})
	}

	// Remaining 20 points for AGENTS.md line count
	if data, err := os.ReadFile("AGENTS.md"); err == nil {
		lines := len(strings.Split(string(data), "\n"))
		if lines >= 80 && lines <= 150 {
			r.Score += 20
			r.Items = append(r.Items, AuditItem{Name: "AGENTS.md size", Score: 20, Max: 20, Notes: fmt.Sprintf("%d lines", lines)})
		} else if lines < 80 {
			r.Items = append(r.Items, AuditItem{Name: "AGENTS.md size", Score: 10, Max: 20, Notes: fmt.Sprintf("%d lines (too short)", lines)})
			r.Score += 10
		} else {
			r.Items = append(r.Items, AuditItem{Name: "AGENTS.md size", Score: 5, Max: 20, Notes: fmt.Sprintf("%d lines (too long, should be map)", lines)})
			r.Score += 5
		}
	}

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

func auditLinters() AuditResult {
	r := AuditResult{Dimension: "Linters", MaxScore: 100}

	linters := []string{"scripts/lint-deps.go", "scripts/lint-quality.go"}
	for _, l := range linters {
		if _, err := os.Stat(l); err == nil {
			r.Score += 25
			r.Items = append(r.Items, AuditItem{Name: l, Score: 25, Max: 25})
		} else {
			r.Items = append(r.Items, AuditItem{Name: l, Score: 0, Max: 25, Notes: "missing"})
		}
	}

	// Check Makefile
	if data, err := os.ReadFile("Makefile"); err == nil {
		if strings.Contains(string(data), "lint-arch") {
			r.Score += 25
			r.Items = append(r.Items, AuditItem{Name: "Makefile lint-arch", Score: 25, Max: 25})
		} else {
			r.Items = append(r.Items, AuditItem{Name: "Makefile lint-arch", Score: 0, Max: 25, Notes: "target missing"})
		}
	}

	// Remaining 25 for additional linters
	if matches, _ := filepath.Glob("scripts/lint-*.go"); len(matches) > 2 {
		r.Score += 25
		r.Items = append(r.Items, AuditItem{Name: "additional linters", Score: 25, Max: 25, Notes: fmt.Sprintf("%d total", len(matches))})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "additional linters", Score: 0, Max: 25, Notes: "only core linters"})
	}

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

func auditObservability() AuditResult {
	r := AuditResult{Dimension: "Observability", MaxScore: 100}

	dirs := map[string]float64{
		"harness/trace":    35,
		"harness/selftest": 35,
	}
	for dir, points := range dirs {
		if info, err := os.Stat(dir); err == nil && info.IsDir() {
			r.Score += points
			r.Items = append(r.Items, AuditItem{Name: dir, Score: points, Max: points})
		} else {
			r.Items = append(r.Items, AuditItem{Name: dir, Score: 0, Max: points, Notes: "missing"})
		}
	}

	// Check for an observability hook anywhere in the tree.
	// filepath.Glob does not support "**", so walk the directories instead.
	if hasFileMatching(".", "observability*.go") {
		r.Score += 30
		r.Items = append(r.Items, AuditItem{Name: "observability hook", Score: 30, Max: 30})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "observability hook", Score: 0, Max: 30, Notes: "not found"})
	}

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

func auditEvals() AuditResult {
	r := AuditResult{Dimension: "Evals", MaxScore: 100}

	// Framework files (40 points)
	files := []string{
		"harness/eval/framework.go",
		"harness/eval/runner.go",
		"harness/eval/scorer.go",
		"harness/eval/reporter.go",
	}
	for _, f := range files {
		if _, err := os.Stat(f); err == nil {
			r.Score += 10
			r.Items = append(r.Items, AuditItem{Name: f, Score: 10, Max: 10})
		} else {
			r.Items = append(r.Items, AuditItem{Name: f, Score: 0, Max: 10, Notes: "missing"})
		}
	}

	// Dataset categories (60 points, 15 each)
	categories := []string{"file_ops", "code_gen", "debugging", "refactoring"}
	for _, cat := range categories {
		pattern := fmt.Sprintf("harness/eval/datasets/%s/*.json", cat)
		matches, _ := filepath.Glob(pattern)
		if len(matches) >= 5 {
			r.Score += 15
			r.Items = append(r.Items, AuditItem{Name: cat, Score: 15, Max: 15, Notes: fmt.Sprintf("%d tasks", len(matches))})
		} else if len(matches) > 0 {
			// Partial credit: 3 points per task below the 5-task target.
			score := float64(len(matches)) * 3
			r.Score += score
			r.Items = append(r.Items, AuditItem{Name: cat, Score: score, Max: 15, Notes: fmt.Sprintf("%d tasks (need 5+)", len(matches))})
		} else {
			r.Items = append(r.Items, AuditItem{Name: cat, Score: 0, Max: 15, Notes: "no tasks"})
		}
	}

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

func auditQuality() AuditResult {
	r := AuditResult{Dimension: "Quality", MaxScore: 100}

	items := map[string]float64{
		"harness/quality/score.go": 35,
		"harness/cleanup/tasks.go": 35,
		"docs/QUALITY.md":          30,
	}
	for item, points := range items {
		if _, err := os.Stat(item); err == nil {
			r.Score += points
			r.Items = append(r.Items, AuditItem{Name: item, Score: points, Max: points})
		} else {
			r.Items = append(r.Items, AuditItem{Name: item, Score: 0, Max: points, Notes: "missing"})
		}
	}

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

func auditIntegration() AuditResult {
	r := AuditResult{Dimension: "Integration", MaxScore: 100}

	// Check go.mod exists (a prerequisite for the build; the build itself is not run)
	if _, err := os.Stat("go.mod"); err == nil {
		r.Score += 40
		r.Items = append(r.Items, AuditItem{Name: "go.mod", Score: 40, Max: 40})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "go.mod", Score: 0, Max: 40, Notes: "missing"})
	}

	// Check Makefile exists
	if _, err := os.Stat("Makefile"); err == nil {
		r.Score += 30
		r.Items = append(r.Items, AuditItem{Name: "Makefile", Score: 30, Max: 30})
	} else {
		r.Items = append(r.Items, AuditItem{Name: "Makefile", Score: 0, Max: 30, Notes: "missing"})
	}

	// Check for CI config
	ciConfigs := []string{".github/workflows", ".gitlab-ci.yml", "Jenkinsfile", ".circleci"}
	found := false
	for _, ci := range ciConfigs {
		if _, err := os.Stat(ci); err == nil {
			found = true
			r.Score += 30
			r.Items = append(r.Items, AuditItem{Name: "CI config", Score: 30, Max: 30, Notes: ci})
			break
		}
	}
	if !found {
		r.Items = append(r.Items, AuditItem{Name: "CI config", Score: 0, Max: 30, Notes: "not found"})
	}

	r.Percent = (r.Score / r.MaxScore) * 100
	return r
}

// errFound stops the directory walk as soon as a match is seen.
var errFound = errors.New("found")

// hasFileMatching reports whether any file under root has a base name
// matching pattern. The .git and vendor directories are skipped.
func hasFileMatching(root, pattern string) bool {
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // skip unreadable entries
		}
		if d.IsDir() {
			if name := d.Name(); name == ".git" || name == "vendor" {
				return filepath.SkipDir
			}
			return nil
		}
		if ok, _ := filepath.Match(pattern, d.Name()); ok {
			return errFound
		}
		return nil
	})
	return errors.Is(err, errFound)
}

func min(a, b float64) float64 {
	if a < b {
		return a
	}
	return b
}
```