Garbage Collection Templates
"Documentation entropy compounds. A single outdated comment erodes trust in all docs."
Templates for automated documentation hygiene — detecting stale docs, broken links, interface drift, and other forms of documentation rot before they erode agent trust.
Table of Contents
- Why Garbage Collection
- Stale Doc Detector
- Broken Link Checker
- Interface Drift Detector
- Unified GC Runner
- CI Integration
- Scheduling Guide
Why Garbage Collection
Every outdated doc is a trap for agents. When an agent reads stale architecture docs, it makes decisions based on fiction — then validation catches the mismatch 30 tool calls later, wasting tokens and time. Proactive garbage collection prevents this by catching documentation rot at the source.
| Problem | Impact on Agents | Detection Method |
|---|---|---|
| Stale doc (code changed, doc didn't) | Makes decisions on outdated architecture | Compare timestamps |
| Broken link (target moved/deleted) | Dead-end navigation, wasted context | Crawl internal links |
| Interface drift (doc says X, code does Y) | Writes code against wrong interface | Parse code vs doc |
| Orphan doc (no code references it) | Clutters context, wastes tokens | Reverse reference check |
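The last row of the table has no template in this document. A minimal sketch of a reverse reference check follows. It assumes an orphan is a markdown file under `docs/` whose file name is not mentioned in any other markdown file; both that definition and the name `find_orphan_docs` are assumptions made for illustration.

```python
import re
from pathlib import Path
from typing import List


def find_orphan_docs(project_root: Path) -> List[str]:
    """Return docs/ markdown files that no other markdown file mentions by name.

    Assumption: any mention of the file name in another .md file
    (link target or plain text) counts as a reference.
    """
    md_files = sorted(
        set(project_root.glob("*.md")) | set(project_root.glob("docs/**/*.md"))
    )
    contents = {f: f.read_text(encoding="utf-8", errors="replace") for f in md_files}
    docs_dir = project_root / "docs"
    orphans = []
    for doc in md_files:
        if docs_dir not in doc.parents:
            continue  # root-level files are treated as entry points, never orphans
        # The lookbehind keeps "a.md" from matching inside "data.md"
        name_pattern = re.compile(r"(?<![\w-])" + re.escape(doc.name))
        referenced = any(
            name_pattern.search(text)
            for other, text in contents.items()
            if other != doc
        )
        if not referenced:
            orphans.append(str(doc.relative_to(project_root)))
    return orphans


if __name__ == "__main__":
    for path in find_orphan_docs(Path(".")):
        print(f"orphan: {path}")
```

Matching on the bare file name keeps the check simple but can miss orphans when two docs in different directories share a name; resolving each link target to a full path would be stricter.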
1. Stale Doc Detector
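Compares the last git commit date of each doc against the last commit date of the code files it references. If a referenced file was committed more than the threshold (14 days by default) after the doc's last update, or no longer exists, the doc is flagged as potentially stale.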
#!/usr/bin/env python3
"""
scripts/gc-stale-docs.py
Detects documentation that may be stale relative to the code it describes.

Usage:
    python3 scripts/gc-stale-docs.py .
    python3 scripts/gc-stale-docs.py . --json
    python3 scripts/gc-stale-docs.py . --threshold 30  # days
"""
import argparse
import json
import os
import re
import subprocess
import sys
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def get_git_last_modified(filepath: str, cwd: Optional[Path] = None) -> Optional[datetime]:
    """Get the last git commit date for a file.

    Relative paths are resolved against cwd (the project root), not the
    directory the script was launched from.
    """
    try:
        result = subprocess.run(
            ["git", "log", "-1", "--format=%aI", "--", filepath],
            capture_output=True, text=True, timeout=10, cwd=cwd
        )
        if result.returncode == 0 and result.stdout.strip():
            return datetime.fromisoformat(result.stdout.strip())
    except Exception:
        pass
    return None


def extract_code_references(doc_path: str) -> List[str]:
    """Extract file paths referenced in a markdown document.

    Looks for patterns like:
    - `internal/core/service.go`
    - `internal/core/service.go:25-48`
    - [link text](../internal/core/service.go)
    - Sources: [`internal/types/token.go:10-15`]()
    """
    references = []
    try:
        content = Path(doc_path).read_text()
    except Exception:
        return references

    # Match backtick-quoted file paths
    backtick_pattern = r'`([a-zA-Z0-9_./\-]+\.(go|ts|tsx|js|jsx|py|rs))(?::\d+(?:-\d+)?)?`'
    references.extend(m[0] for m in re.findall(backtick_pattern, content))

    # Match markdown link targets (leading ./ and ../ segments are dropped,
    # so targets are treated as paths relative to the project root)
    link_pattern = r'\]\((?:\.\.?/)*([a-zA-Z0-9_./\-]+\.(go|ts|tsx|js|jsx|py|rs))(?::\d+(?:-\d+)?)?\)'
    references.extend(m[0] for m in re.findall(link_pattern, content))

    # Deduplicate, strip line numbers
    cleaned = set()
    for ref in references:
        clean = re.sub(r':\d+(-\d+)?$', '', ref)
        cleaned.add(clean)
    return list(cleaned)


def find_doc_to_code_mapping(project_root: Path) -> Dict[str, List[str]]:
    """Map each doc file to the code files it references."""
    mapping = {}
    doc_dirs = ["docs", "docs/design-docs", "docs/references"]
    doc_files = []
    for d in doc_dirs:
        doc_dir = project_root / d
        if doc_dir.exists():
            for f in doc_dir.glob("*.md"):
                doc_files.append(f)

    # Also check AGENTS.md
    agents_md = project_root / "AGENTS.md"
    if agents_md.exists():
        doc_files.append(agents_md)

    for doc_file in doc_files:
        refs = extract_code_references(str(doc_file))
        if refs:
            rel_doc = str(doc_file.relative_to(project_root))
            mapping[rel_doc] = refs
    return mapping


def check_staleness(
    project_root: Path,
    threshold_days: int = 14
) -> List[Dict]:
    """Check all docs for staleness relative to referenced code."""
    mapping = find_doc_to_code_mapping(project_root)
    threshold = timedelta(days=threshold_days)
    findings = []

    for doc_path, code_refs in mapping.items():
        # Paths in the mapping are relative to the project root, so git must
        # run there regardless of the caller's working directory
        doc_modified = get_git_last_modified(doc_path, cwd=project_root)
        if not doc_modified:
            continue

        stale_refs = []
        for code_ref in code_refs:
            code_path = str(project_root / code_ref)
            if not os.path.exists(code_path):
                stale_refs.append({
                    "file": code_ref,
                    "reason": "file_deleted",
                    "detail": "Referenced file no longer exists"
                })
                continue

            code_modified = get_git_last_modified(code_ref, cwd=project_root)
            if code_modified and code_modified > doc_modified + threshold:
                days_behind = (code_modified - doc_modified).days
                stale_refs.append({
                    "file": code_ref,
                    "reason": "code_newer",
                    "detail": f"Code updated {days_behind} days after doc",
                    "code_date": code_modified.isoformat(),
                    "doc_date": doc_modified.isoformat()
                })

        if stale_refs:
            findings.append({
                "doc": doc_path,
                "doc_last_modified": doc_modified.isoformat(),
                "stale_references": stale_refs,
                "severity": "high" if any(r["reason"] == "file_deleted" for r in stale_refs) else "medium"
            })
    return findings


def main():
    parser = argparse.ArgumentParser(description="Detect stale documentation")
    parser.add_argument("project_root", help="Project root directory")
    parser.add_argument("--threshold", type=int, default=14,
                        help="Days threshold for staleness (default: 14)")
    parser.add_argument("--json", action="store_true", help="Output as JSON")
    args = parser.parse_args()

    project_root = Path(args.project_root).resolve()
    findings = check_staleness(project_root, args.threshold)

    if args.json:
        print(json.dumps({"stale_docs": findings, "threshold_days": args.threshold}, indent=2))
    else:
        if not findings:
            print("✓ No stale documentation detected")
            return
        print(f"⚠ Found {len(findings)} potentially stale document(s):\n")
        for f in findings:
            severity_icon = "🔴" if f["severity"] == "high" else "🟡"
            print(f"{severity_icon} {f['doc']} (last updated: {f['doc_last_modified'][:10]})")
            for ref in f["stale_references"]:
                if ref["reason"] == "file_deleted":
                    print(f" ✗ {ref['file']}: DELETED (reference is broken)")
                else:
                    # detail already reads "Code updated N days after doc"
                    print(f" ✗ {ref['file']}: {ref['detail']}")
            print()

    # Non-zero exit lets CI fail the job when stale docs are found
    sys.exit(1 if findings else 0)


if __name__ == "__main__":
    main()
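The comparison at the heart of `check_staleness` can be isolated as a pure function, which makes the threshold semantics easy to test. A minimal sketch; `is_stale` is a hypothetical helper name and is not part of the script above.

```python
from datetime import datetime, timedelta, timezone


def is_stale(doc_date: datetime, code_date: datetime, threshold_days: int = 14) -> bool:
    """True when the code was committed more than threshold_days after the doc."""
    return code_date > doc_date + timedelta(days=threshold_days)


doc = datetime(2024, 1, 1, tzinfo=timezone.utc)

# 9 days after the doc: within the threshold
print(is_stale(doc, datetime(2024, 1, 10, tzinfo=timezone.utc)))  # False
# Exactly 14 days after: the comparison is strict, so still not stale
print(is_stale(doc, datetime(2024, 1, 15, tzinfo=timezone.utc)))  # False
# 31 days after: flagged
print(is_stale(doc, datetime(2024, 2, 1, tzinfo=timezone.utc)))   # True
```

Because git's `%aI` format carries a UTC offset, both dates parsed by the script are timezone-aware and can be compared directly, as in this sketch.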
2. Broken Link Checker
Validates that all internal markdown links (to other docs, to code files, and to heading anchors) still resolve.
#!/usr/bin/env python3
"""
scripts/gc-broken-links.py
Checks all markdown files for broken internal links.

Usage:
    python3 scripts/gc-broken-links.py .
    python3 scripts/gc-broken-links.py . --json
"""
import argparse
import json
import re
import sys
from pathlib import Path
from typing import Dict, List


def find_markdown_files(root: Path) -> List[Path]:
    """Find all markdown files in the project."""
    md_files = []
    for pattern in ["*.md", "docs/**/*.md"]:
        md_files.extend(root.glob(pattern))
    return sorted(set(md_files))


def extract_links(md_path: Path) -> List[Dict]:
    """Extract all internal links from a markdown file."""
    content = md_path.read_text()
    links = []

    # [text](target); external URLs are skipped
    link_pattern = r'\[([^\]]*)\]\(([^)]+)\)'
    for match in re.finditer(link_pattern, content):
        target = match.group(2)
        if target.startswith(("http://", "https://", "mailto:")):
            continue
        line_num = content[:match.start()].count('\n') + 1
        links.append({
            "text": match.group(1),
            "target": target,
            "line": line_num
        })
    return links


def resolve_link(md_path: Path, target: str, project_root: Path) -> Dict:
    """Check if a link target is valid. Returns status dict."""
    # Split anchor from path
    if "#" in target:
        path_part, anchor = target.rsplit("#", 1)
    else:
        path_part, anchor = target, None

    if not path_part:
        # Pure anchor link (#section): check the current file
        if anchor:
            return check_anchor(md_path, anchor)
        return {"valid": True}

    # Resolve relative path from the markdown file's directory
    resolved = (md_path.parent / path_part).resolve()
    if not resolved.exists():
        try:
            shown_path = str(resolved.relative_to(project_root))
        except ValueError:
            # Target points outside the project root; relative_to() would
            # raise, so report the absolute path instead
            shown_path = str(resolved)
        return {
            "valid": False,
            "reason": "file_not_found",
            "resolved_path": shown_path
        }

    if anchor and resolved.suffix == ".md":
        return check_anchor(resolved, anchor)
    return {"valid": True}


def check_anchor(md_path: Path, anchor: str) -> Dict:
    """Check if an anchor exists in a markdown file."""
    try:
        content = md_path.read_text()
    except Exception:
        return {"valid": False, "reason": "cannot_read_file"}

    # Generate anchors from headings (GitHub-style)
    headings = re.findall(r'^#{1,6}\s+(.+)$', content, re.MULTILINE)
    anchors = set()
    for h in headings:
        slug = re.sub(r'[^\w\s-]', '', h.lower())
        slug = re.sub(r'[\s]+', '-', slug).strip('-')
        anchors.add(slug)

    if anchor.lower() in anchors:
        return {"valid": True}
    return {
        "valid": False,
        "reason": "anchor_not_found",
        "anchor": anchor,
        "available_anchors": sorted(anchors)[:10]
    }


def check_all_links(project_root: Path) -> List[Dict]:
    """Check all internal links in all markdown files."""
    findings = []
    md_files = find_markdown_files(project_root)
    for md_file in md_files:
        links = extract_links(md_file)
        broken = []
        for link in links:
            result = resolve_link(md_file, link["target"], project_root)
            if not result["valid"]:
                broken.append({**link, **result})
        if broken:
            findings.append({
                "file": str(md_file.relative_to(project_root)),
                "broken_links": broken
            })
    return findings


def main():
    parser = argparse.ArgumentParser(description="Check for broken internal links")
    parser.add_argument("project_root", help="Project root directory")
    parser.add_argument("--json", action="store_true", help="Output as JSON")
    args = parser.parse_args()

    project_root = Path(args.project_root).resolve()
    findings = check_all_links(project_root)

    if args.json:
        print(json.dumps({"broken_links": findings}, indent=2))
    else:
        if not findings:
            print("✓ No broken internal links found")
            return
        total = sum(len(f["broken_links"]) for f in findings)
        print(f"✗ Found {total} broken link(s) in {len(findings)} file(s):\n")
        for f in findings:
            print(f" {f['file']}:")
            for link in f["broken_links"]:
                print(f" Line {link['line']}: [{link['text']}]({link['target']})")
                print(f" → {link['reason']}")
            print()

    sys.exit(1 if findings else 0)


if __name__ == "__main__":
    main()
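To see which anchors `check_anchor` will accept, the slug logic can be run on its own. The sketch below reuses the same two substitutions; `github_slug` is a hypothetical name chosen for illustration.

```python
import re


def github_slug(heading: str) -> str:
    """Approximate a GitHub-style anchor with the same rules as check_anchor."""
    # Drop everything except word characters, whitespace, and hyphens
    slug = re.sub(r'[^\w\s-]', '', heading.lower())
    # Collapse each run of whitespace into a single hyphen
    return re.sub(r'[\s]+', '-', slug).strip('-')


print(github_slug("1. Stale Doc Detector"))   # 1-stale-doc-detector
print(github_slug("Why Garbage Collection"))  # why-garbage-collection
print(github_slug("CI / CD Integration"))     # ci-cd-integration
```

This is an approximation. GitHub is generally understood to replace each space individually rather than collapsing runs, so a heading like "CI / CD Integration" would likely get the anchor `ci--cd-integration` there, and repeated headings receive numeric suffixes. Links to such headings may be reported as broken by this template even though they work on GitHub.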
3. Interface Drift Detector
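Detects when documented interfaces no longer match the actual code. This is the most insidious form of documentation rot because the doc looks legitimate but leads agents astray. The template covers the simplest case: a type name that appears in the docs but has no definition in the code.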
#!/usr/bin/env python3
"""
scripts/gc-interface-drift.py
Detects drift between documented interfaces and actual code definitions.

Usage:
    python3 scripts/gc-interface-drift.py .
    python3 scripts/gc-interface-drift.py . --json
    python3 scripts/gc-interface-drift.py . --doc docs/ARCHITECTURE.md
"""
import argparse
import json
import re
import subprocess
import sys
from pathlib import Path
from typing import Dict, List, Optional, Set


def extract_documented_interfaces(doc_path: Path) -> List[Dict]:
    """Extract interface/type names mentioned in documentation."""
    content = doc_path.read_text()
    interfaces = []

    # Match backtick-quoted identifiers that start with an uppercase letter.
    # The suffix group (Interface, Service, ...) is optional, so any
    # capitalized name in backticks matches; narrow the pattern if this
    # produces too many false positives.
    type_pattern = r'`([A-Z][a-zA-Z0-9]+(?:Interface|Service|Provider|Handler|Repository|Store|Manager|Client)?)`'
    for match in re.finditer(type_pattern, content):
        name = match.group(1)
        line_num = content[:match.start()].count('\n') + 1
        interfaces.append({
            "name": name,
            "doc_file": str(doc_path),
            "doc_line": line_num
        })
    return interfaces


def find_type_in_code_go(project_root: Path, type_name: str) -> Optional[Dict]:
    """Find a Go type/interface definition in code."""
    try:
        result = subprocess.run(
            ["grep", "-rn", f"type {type_name} ", "--include=*.go",
             str(project_root)],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0 and result.stdout.strip():
            first_match = result.stdout.strip().split('\n')[0]
            parts = first_match.split(':', 2)
            if len(parts) >= 2:
                return {
                    "file": str(Path(parts[0]).relative_to(project_root)),
                    "line": int(parts[1]),
                    "definition": parts[2].strip() if len(parts) > 2 else ""
                }
    except Exception:
        pass
    return None


def find_type_in_code_ts(project_root: Path, type_name: str) -> Optional[Dict]:
    """Find a TypeScript interface/class/type definition in code."""
    try:
        patterns = [
            f"(export )?interface {type_name}",
            f"(export )?class {type_name}",
            f"(export )?type {type_name}"
        ]
        for pattern in patterns:
            result = subprocess.run(
                ["grep", "-rn", "-E", pattern, "--include=*.ts", "--include=*.tsx",
                 str(project_root)],
                capture_output=True, text=True, timeout=10
            )
            if result.returncode == 0 and result.stdout.strip():
                first_match = result.stdout.strip().split('\n')[0]
                parts = first_match.split(':', 2)
                if len(parts) >= 2:
                    return {
                        "file": str(Path(parts[0]).relative_to(project_root)),
                        "line": int(parts[1]),
                        "definition": parts[2].strip() if len(parts) > 2 else ""
                    }
    except Exception:
        pass
    return None


def find_type_in_code_py(project_root: Path, type_name: str) -> Optional[Dict]:
    """Find a Python class definition in code."""
    try:
        result = subprocess.run(
            ["grep", "-rn", f"class {type_name}", "--include=*.py",
             str(project_root)],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0 and result.stdout.strip():
            first_match = result.stdout.strip().split('\n')[0]
            parts = first_match.split(':', 2)
            if len(parts) >= 2:
                return {
                    "file": str(Path(parts[0]).relative_to(project_root)),
                    "line": int(parts[1]),
                    "definition": parts[2].strip() if len(parts) > 2 else ""
                }
    except Exception:
        pass
    return None


def detect_project_lang(project_root: Path) -> str:
    """Auto-detect project language."""
    if (project_root / "go.mod").exists():
        return "go"
    if (project_root / "package.json").exists():
        return "ts"
    if (project_root / "pyproject.toml").exists() or (project_root / "requirements.txt").exists():
        return "py"
    return "go"  # default


def check_interface_drift(project_root: Path, doc_path: Optional[str] = None) -> List[Dict]:
    """Check for interface drift between docs and code."""
    lang = detect_project_lang(project_root)
    finder = {"go": find_type_in_code_go, "ts": find_type_in_code_ts, "py": find_type_in_code_py}[lang]

    # Collect docs to check
    if doc_path:
        doc_files = [project_root / doc_path]
    else:
        doc_files = []
        for pattern in ["docs/*.md", "docs/**/*.md", "AGENTS.md"]:
            doc_files.extend(project_root.glob(pattern))

    findings = []
    seen_types: Set[str] = set()
    for df in doc_files:
        if not df.exists():
            continue
        documented = extract_documented_interfaces(df)
        for iface in documented:
            if iface["name"] in seen_types:
                continue
            seen_types.add(iface["name"])

            code_loc = finder(project_root, iface["name"])
            if not code_loc:
                findings.append({
                    "type_name": iface["name"],
                    "documented_in": str(df.relative_to(project_root)),
                    "doc_line": iface["doc_line"],
                    "status": "not_found_in_code",
                    "severity": "high",
                    "suggestion": f"Type '{iface['name']}' referenced in docs but not found in code. "
                                  f"It may have been renamed, deleted, or the doc is outdated."
                })
    return findings


def main():
    parser = argparse.ArgumentParser(description="Detect interface drift between docs and code")
    parser.add_argument("project_root", help="Project root directory")
    parser.add_argument("--doc", help="Check a specific doc file only")
    parser.add_argument("--json", action="store_true", help="Output as JSON")
    args = parser.parse_args()

    project_root = Path(args.project_root).resolve()
    findings = check_interface_drift(project_root, args.doc)

    if args.json:
        print(json.dumps({"interface_drift": findings}, indent=2))
    else:
        if not findings:
            print("✓ No interface drift detected")
            return
        print(f"⚠ Found {len(findings)} potential interface drift issue(s):\n")
        for f in findings:
            icon = "🔴" if f["severity"] == "high" else "🟡"
            print(f"{icon} {f['type_name']}")
            print(f" Documented in: {f['documented_in']}:{f['doc_line']}")
            print(f" Status: {f['status']}")
            print(f" → {f['suggestion']}")
            print()

    sys.exit(1 if findings else 0)


if __name__ == "__main__":
    main()
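The finders above shell out to `grep`, which may be missing on some systems, and their patterns match prefixes (searching for `Foo` also hits `class FooBar`). As one possible alternative, here is a pure-Python sketch for the Python case that returns the same dict shape and adds a word boundary. The name `find_class_py_portable` is hypothetical.

```python
import re
from pathlib import Path
from typing import Dict, Optional


def find_class_py_portable(project_root: Path, type_name: str) -> Optional[Dict]:
    """Find `class <type_name>` in *.py files without shelling out to grep."""
    # \b stops "Foo" from matching "class FooBar"
    pattern = re.compile(r"^\s*class\s+" + re.escape(type_name) + r"\b")
    for path in sorted(project_root.rglob("*.py")):
        try:
            lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for line_no, line in enumerate(lines, start=1):
            if pattern.match(line):
                return {
                    "file": str(path.relative_to(project_root)),
                    "line": line_no,
                    "definition": line.strip(),
                }
    return None
```

Like the grep-based version, this returns only the first match and does not parse the code, so a class name inside a string or comment that starts a line with `class` would still count. Walking the `ast` module's tree would be stricter at the cost of failing on files with syntax errors.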
4. Unified GC Runner
Runs all garbage collection checks in one pass. Produces a combined report.
#!/usr/bin/env python3
"""
scripts/gc-docs.py
Unified documentation garbage collection runner.
Runs stale doc detection, broken link checking, and interface drift detection.

Usage:
    python3 scripts/gc-docs.py .
    python3 scripts/gc-docs.py . --json
    python3 scripts/gc-docs.py . --checks stale,links  # selective checks
"""
import argparse
import json
import sys
from pathlib import Path
from datetime import datetime, timezone

# The individual checkers are imported lazily inside run_gc().
# Note: the checker scripts are named with hyphens (gc-stale-docs.py), which
# Python cannot import directly. The imports below assume importable copies
# with underscores (gc_stale_docs.py, gc_broken_links.py,
# gc_interface_drift.py) sit next to this file in scripts/.


def run_gc(project_root: Path, checks: list, threshold_days: int = 14) -> dict:
    """Run selected garbage collection checks."""
    report = {
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "project_root": str(project_root),
        "checks_run": checks,
        "results": {},
        "summary": {"total_issues": 0, "high": 0, "medium": 0, "low": 0}
    }

    if "stale" in checks:
        # Run stale doc detection
        from gc_stale_docs import check_staleness
        stale = check_staleness(project_root, threshold_days)
        report["results"]["stale_docs"] = stale
        for item in stale:
            report["summary"]["total_issues"] += 1
            report["summary"][item.get("severity", "medium")] += 1

    if "links" in checks:
        # Run broken link detection
        from gc_broken_links import check_all_links
        broken = check_all_links(project_root)
        report["results"]["broken_links"] = broken
        for item in broken:
            count = len(item.get("broken_links", []))
            report["summary"]["total_issues"] += count
            report["summary"]["high"] += count  # broken links are always high severity

    if "drift" in checks:
        # Run interface drift detection
        from gc_interface_drift import check_interface_drift
        drift = check_interface_drift(project_root)
        report["results"]["interface_drift"] = drift
        for item in drift:
            report["summary"]["total_issues"] += 1
            report["summary"][item.get("severity", "medium")] += 1

    return report


def main():
    parser = argparse.ArgumentParser(description="Documentation garbage collection")
    parser.add_argument("project_root", help="Project root directory")
    parser.add_argument("--checks", default="stale,links,drift",
                        help="Comma-separated checks to run (default: all)")
    parser.add_argument("--threshold", type=int, default=14,
                        help="Staleness threshold in days")
    parser.add_argument("--json", action="store_true")
    parser.add_argument("-o", "--output", help="Write report to file")
    args = parser.parse_args()

    project_root = Path(args.project_root).resolve()
    checks = [c.strip() for c in args.checks.split(",")]
    report = run_gc(project_root, checks, args.threshold)

    output = json.dumps(report, indent=2) if args.json else format_report(report)
    if args.output:
        Path(args.output).write_text(output)
        print(f"Report written to {args.output}")
    else:
        print(output)

    sys.exit(1 if report["summary"]["total_issues"] > 0 else 0)


def format_report(report: dict) -> str:
    """Human-readable report format."""
    lines = []
    s = report["summary"]
    total = s["total_issues"]
    if total == 0:
        return "✓ Documentation is clean: no issues found"

    lines.append(f"Documentation GC Report: {report['timestamp'][:10]}")
    lines.append(f"{'=' * 50}")
    lines.append(f"Total issues: {total} (🔴 {s['high']} high, 🟡 {s['medium']} medium, 🟢 {s['low']} low)")
    lines.append("")
    for check_name, results in report["results"].items():
        if results:
            lines.append(f"## {check_name.replace('_', ' ').title()}")
            lines.append(f" {len(results)} issue(s) found")
            lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    main()
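The runner imports `gc_stale_docs`, `gc_broken_links`, and `gc_interface_drift`, but the checker files are named with hyphens, which Python's `import` statement cannot resolve. One way to bridge the gap without renaming files is to load them by path with `importlib`. A minimal sketch; `load_script` is a hypothetical helper and not part of the runner above.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_script(path: Path, module_name: str) -> ModuleType:
    """Load a Python file as a module, even if its file name contains hyphens."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    # Executes the file's top-level code; the scripts' __main__ guards do not
    # fire because the module name is not "__main__"
    spec.loader.exec_module(module)
    return module


# Possible use inside run_gc (paths assume the checkers sit next to gc-docs.py):
#   scripts_dir = Path(__file__).parent
#   stale_mod = load_script(scripts_dir / "gc-stale-docs.py", "gc_stale_docs")
#   stale = stale_mod.check_staleness(project_root, threshold_days)
```

Renaming the checker files to use underscores is the simpler option if nothing else depends on the hyphenated names.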
5. CI Integration
GitHub Actions
# .github/workflows/doc-gc.yml
name: Documentation GC
on:
  push:
    branches: [main]
    paths: ['docs/**', 'AGENTS.md', '*.md']
  schedule:
    - cron: '0 9 * * 1'  # Weekly, Mondays at 09:00 UTC
jobs:
  doc-gc:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for git log dates
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Run documentation GC
        run: python3 scripts/gc-docs.py . --json -o gc-report.json
      - name: Upload report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: doc-gc-report
          path: gc-report.json
Makefile Target
.PHONY: gc-docs
gc-docs:
	@echo "Running documentation garbage collection..."
	@python3 scripts/gc-docs.py . || true
	@echo ""
	@echo "To fix: review stale docs and update or delete them"
6. Scheduling Guide
| Check | Frequency | Trigger | Rationale |
|---|---|---|---|
| Broken links | Per-commit (CI) | Push to main | Catches renames/deletes immediately |
| Stale docs | Weekly | Scheduled CI | Balances signal vs noise |
| Interface drift | Per-PR | PR check | Prevents drift from entering main |
| Full GC | Weekly + on-demand | Manual or scheduled | Comprehensive health check |
Running from Harness Executor
After completing a task that creates or modifies code, the executor can run a quick GC check:
# Quick post-task check: broken links only (the script scans all project docs, not just changed ones)
python3 scripts/gc-broken-links.py . --json | python3 -c "
import sys, json
data = json.load(sys.stdin)
if data['broken_links']:
    print('⚠ Task may have introduced broken doc links; review before committing')
"
Running from ECL Harness Engineer (Improve Mode)
During a harness audit, run the full GC suite to assess documentation health:
python3 scripts/gc-docs.py . --json -o harness/trace/gc-report.json
The GC report feeds into the audit score (Documentation dimension) and helps prioritize which docs to fix first.