Garbage Collection Templates

"Documentation entropy compounds. A single outdated comment erodes trust in all docs."

Templates for automated documentation hygiene — detecting stale docs, broken links, interface drift, and other forms of documentation rot before they erode agent trust.

Why Garbage Collection
Stale Doc Detector
Broken Link Checker
Interface Drift Detector
Unified GC Runner
CI Integration
Scheduling Guide

Why Garbage Collection

Every outdated doc is a trap for agents. When an agent reads stale architecture docs, it makes decisions based on fiction — then validation catches the mismatch 30 tool calls later, wasting tokens and time. Proactive garbage collection prevents this by catching documentation rot at the source.

Problem	Impact on Agents	Detection Method
Stale doc (code changed, doc didn't)	Makes decisions on outdated architecture	Compare timestamps
Broken link (target moved/deleted)	Dead-end navigation, wasted context	Crawl internal links
Interface drift (doc says X, code does Y)	Writes code against wrong interface	Parse code vs doc
Orphan doc (no code references it)	Clutters context, wastes tokens	Reverse reference check
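
The orphan row is the only one in this table without a template in this document. A minimal sketch of a reverse reference check could look like the following. The function name `find_orphan_docs`, the `docs` default, and the list of file extensions scanned are all assumptions for illustration, and matching on bare filenames is a deliberately crude heuristic.

```python
from pathlib import Path
from typing import Dict, List


def find_orphan_docs(root: Path, doc_dir: str = "docs") -> List[str]:
    """Return docs under doc_dir whose filename is mentioned in no other file."""
    docs = sorted((root / doc_dir).rglob("*.md"))

    # Files that may refer to a doc. The extension list is an assumption;
    # a real project would also want to skip .git, node_modules, and similar.
    exts = {".md", ".py", ".go", ".ts", ".js"}
    texts: Dict[Path, str] = {}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in exts:
            try:
                texts[path] = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue  # unreadable file: ignore it as a referrer

    orphans = []
    for doc in docs:
        # A doc mentioning its own name does not count as a reference.
        referenced = any(doc.name in text for p, text in texts.items() if p != doc)
        if not referenced:
            orphans.append(str(doc.relative_to(root)))
    return orphans
```

Calling `find_orphan_docs(Path("."))` from the project root returns the relative paths of docs that nothing else mentions by filename. Two docs that share a filename in different directories will mask each other under this heuristic.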

1. Stale Doc Detector

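Compares each doc's last-commit timestamp against those of the code files it references. If a referenced file was committed more than the threshold (14 days by default) after the doc was last updated, or no longer exists, the doc is flagged as potentially stale.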

#!/usr/bin/env python3
"""
scripts/gc-stale-docs.py

Detects documentation that may be stale relative to the code it describes.

Usage:
    python3 scripts/gc-stale-docs.py .
    python3 scripts/gc-stale-docs.py . --json
    python3 scripts/gc-stale-docs.py . --threshold 30  # days
"""

import argparse
import json
import os
import re
import subprocess
import sys
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Optional, Tuple


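def get_git_last_modified(filepath: str) -> Optional[datetime]:
    """Get the author date of the last git commit that touched a file."""
    abs_path = os.path.abspath(filepath)
    try:
        # Run git from the file's own directory so the lookup also works
        # when the script is launched from outside the repository.
        result = subprocess.run(
            ["git", "log", "-1", "--format=%aI", "--", abs_path],
            capture_output=True, text=True, timeout=10,
            cwd=os.path.dirname(abs_path)
        )
        if result.returncode == 0 and result.stdout.strip():
            return datetime.fromisoformat(result.stdout.strip())
    except (OSError, subprocess.SubprocessError, ValueError):
        # git missing, timeout, or unparseable date: treat the date as unknown
        pass
    return None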


def extract_code_references(doc_path: str) -> List[str]:
    """Extract file paths referenced in a markdown document.

    Looks for patterns like:
    - `internal/core/service.go`
    - `internal/core/service.go:25-48`
    - [link text](../internal/core/service.go)
    - Sources: [`internal/types/token.go:10-15`]()
    """
    references = []
    try:
        content = Path(doc_path).read_text()
    except Exception:
        return references

    # Match backtick-quoted file paths
    backtick_pattern = r'`([a-zA-Z0-9_./\-]+\.(go|ts|tsx|js|jsx|py|rs))(?::\d+(?:-\d+)?)?`'
    references.extend(m[0] for m in re.findall(backtick_pattern, content))

    # Match markdown link targets
    link_pattern = r'\]\((?:\.\.?/)*([a-zA-Z0-9_./\-]+\.(go|ts|tsx|js|jsx|py|rs))(?::\d+(?:-\d+)?)?\)'
    references.extend(m[0] for m in re.findall(link_pattern, content))

    # Deduplicate, strip line numbers
    cleaned = set()
    for ref in references:
        clean = re.sub(r':\d+(-\d+)?$', '', ref)
        cleaned.add(clean)
    return list(cleaned)


def find_doc_to_code_mapping(project_root: Path) -> Dict[str, List[str]]:
    """Map each doc file to the code files it references."""
    mapping = {}
    doc_dirs = ["docs", "docs/design-docs", "docs/references"]
    doc_files = []

    for d in doc_dirs:
        doc_dir = project_root / d
        if doc_dir.exists():
            for f in doc_dir.glob("*.md"):
                doc_files.append(f)

    # Also check AGENTS.md
    agents_md = project_root / "AGENTS.md"
    if agents_md.exists():
        doc_files.append(agents_md)

    for doc_file in doc_files:
        refs = extract_code_references(str(doc_file))
        if refs:
            rel_doc = str(doc_file.relative_to(project_root))
            mapping[rel_doc] = refs

    return mapping


def check_staleness(
    project_root: Path,
    threshold_days: int = 14
) -> List[Dict]:
    """Check all docs for staleness relative to referenced code."""
    mapping = find_doc_to_code_mapping(project_root)
    threshold = timedelta(days=threshold_days)
    findings = []

    for doc_path, code_refs in mapping.items():
        # doc_path is relative to project_root, which may differ from the cwd
        doc_modified = get_git_last_modified(str(project_root / doc_path))
        if not doc_modified:
            continue

        stale_refs = []
        for code_ref in code_refs:
            code_path = str(project_root / code_ref)
            if not os.path.exists(code_path):
                stale_refs.append({
                    "file": code_ref,
                    "reason": "file_deleted",
                    "detail": "Referenced file no longer exists"
                })
                continue

            code_modified = get_git_last_modified(code_path)
            if code_modified and code_modified > doc_modified + threshold:
                days_behind = (code_modified - doc_modified).days
                stale_refs.append({
                    "file": code_ref,
                    "reason": "code_newer",
                    "detail": f"Code updated {days_behind} days after doc",
                    "code_date": code_modified.isoformat(),
                    "doc_date": doc_modified.isoformat()
                })

        if stale_refs:
            findings.append({
                "doc": doc_path,
                "doc_last_modified": doc_modified.isoformat(),
                "stale_references": stale_refs,
                "severity": "high" if any(r["reason"] == "file_deleted" for r in stale_refs) else "medium"
            })

    return findings


def main():
    parser = argparse.ArgumentParser(description="Detect stale documentation")
    parser.add_argument("project_root", help="Project root directory")
    parser.add_argument("--threshold", type=int, default=14,
                        help="Days threshold for staleness (default: 14)")
    parser.add_argument("--json", action="store_true", help="Output as JSON")
    args = parser.parse_args()

    project_root = Path(args.project_root).resolve()
    findings = check_staleness(project_root, args.threshold)

    if args.json:
        print(json.dumps({"stale_docs": findings, "threshold_days": args.threshold}, indent=2))
    else:
        if not findings:
            print("✓ No stale documentation detected")
            return

        print(f"⚠ Found {len(findings)} potentially stale document(s):\n")
        for f in findings:
            severity_icon = "🔴" if f["severity"] == "high" else "🟡"
            print(f"{severity_icon} {f['doc']} (last updated: {f['doc_last_modified'][:10]})")
            for ref in f["stale_references"]:
                if ref["reason"] == "file_deleted":
                    print(f"   ✗ {ref['file']} — DELETED (reference is broken)")
                else:
                    # detail already reads "Code updated N days after doc"
                    print(f"   ✗ {ref['file']} — {ref['detail']}")
            print()

    sys.exit(1 if findings else 0)


if __name__ == "__main__":
    main()
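
To make the reference extraction concrete, here is a small standalone sketch of the backtick pattern applied to a sample sentence. The helper name `backtick_refs` and the sample text are illustrative only; the regex mirrors the one in the script above, with the extension group made non-capturing so `re.findall` returns plain strings.

```python
import re
from typing import List

# Same shape as the script's backtick pattern: a path ending in a known
# source extension, optionally followed by :line or :start-end.
BACKTICK = r'`([a-zA-Z0-9_./\-]+\.(?:go|ts|tsx|js|jsx|py|rs))(?::\d+(?:-\d+)?)?`'


def backtick_refs(text: str) -> List[str]:
    """Return the unique code paths quoted in backticks, sorted."""
    return sorted(set(re.findall(BACKTICK, text)))


sample = (
    "See `internal/core/service.go:25-48` and `internal/types/token.go`; "
    "`README.md` is not a code file and is skipped."
)
print(backtick_refs(sample))
# ['internal/core/service.go', 'internal/types/token.go']
```

Line suffixes are dropped by the capture group, so a doc that cites several line ranges in one file yields a single reference to that file.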

2. Broken Link Checker

Validates that all internal markdown links (to other docs, code files, and anchors) still resolve.

#!/usr/bin/env python3
"""
scripts/gc-broken-links.py

Checks all markdown files for broken internal links.

Usage:
    python3 scripts/gc-broken-links.py .
    python3 scripts/gc-broken-links.py . --json
"""

import argparse
import json
import re
import sys
from pathlib import Path
from typing import Dict, List


def find_markdown_files(root: Path) -> List[Path]:
    """Find all markdown files in the project."""
    md_files = []
    for pattern in ["*.md", "docs/**/*.md"]:
        md_files.extend(root.glob(pattern))
    return sorted(set(md_files))


def extract_links(md_path: Path) -> List[Dict]:
    """Extract all internal links from a markdown file."""
    content = md_path.read_text()
    links = []

    # [text](target) — skip external URLs
    link_pattern = r'\[([^\]]*)\]\(([^)]+)\)'
    for match in re.finditer(link_pattern, content):
        target = match.group(2)
        if target.startswith(("http://", "https://", "mailto:")):
            continue
        line_num = content[:match.start()].count('\n') + 1
        links.append({
            "text": match.group(1),
            "target": target,
            "line": line_num
        })

    return links


def resolve_link(md_path: Path, target: str, project_root: Path) -> Dict:
    """Check if a link target is valid. Returns status dict."""
    # Split anchor from path
    if "#" in target:
        path_part, anchor = target.rsplit("#", 1)
    else:
        path_part, anchor = target, None

    if not path_part:
        # Pure anchor link (#section) — check current file
        if anchor:
            return check_anchor(md_path, anchor)
        return {"valid": True}

    # Resolve relative path from the markdown file's directory
    resolved = (md_path.parent / path_part).resolve()

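    if not resolved.exists():
        # Targets outside the project root (e.g. "/abs/path" or "../../x")
        # cannot be made relative to it, so fall back to the absolute path.
        try:
            shown = str(resolved.relative_to(project_root))
        except ValueError:
            shown = str(resolved)
        return {
            "valid": False,
            "reason": "file_not_found",
            "resolved_path": shown
        }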

    if anchor and resolved.suffix == ".md":
        return check_anchor(resolved, anchor)

    return {"valid": True}


def check_anchor(md_path: Path, anchor: str) -> Dict:
    """Check if an anchor exists in a markdown file."""
    try:
        content = md_path.read_text()
    except Exception:
        return {"valid": False, "reason": "cannot_read_file"}

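    # Generate anchors from headings (GitHub-style): lowercase, drop
    # punctuation, and turn each whitespace character into its own hyphen,
    # so "A & B" becomes "a--b". Duplicate-heading suffixes are not handled.
    headings = re.findall(r'^#{1,6}\s+(.+)$', content, re.MULTILINE)
    anchors = set()
    for h in headings:
        slug = re.sub(r'[^\w\s-]', '', h.strip().lower())
        slug = re.sub(r'\s', '-', slug).strip('-')
        anchors.add(slug)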

    if anchor.lower() in anchors:
        return {"valid": True}
    return {
        "valid": False,
        "reason": "anchor_not_found",
        "anchor": anchor,
        "available_anchors": sorted(anchors)[:10]
    }


def check_all_links(project_root: Path) -> List[Dict]:
    """Check all internal links in all markdown files."""
    findings = []
    md_files = find_markdown_files(project_root)

    for md_file in md_files:
        links = extract_links(md_file)
        broken = []
        for link in links:
            result = resolve_link(md_file, link["target"], project_root)
            if not result["valid"]:
                broken.append({**link, **result})

        if broken:
            findings.append({
                "file": str(md_file.relative_to(project_root)),
                "broken_links": broken
            })

    return findings


def main():
    parser = argparse.ArgumentParser(description="Check for broken internal links")
    parser.add_argument("project_root", help="Project root directory")
    parser.add_argument("--json", action="store_true", help="Output as JSON")
    args = parser.parse_args()

    project_root = Path(args.project_root).resolve()
    findings = check_all_links(project_root)

    if args.json:
        print(json.dumps({"broken_links": findings}, indent=2))
    else:
        if not findings:
            print("✓ No broken internal links found")
            return

        total = sum(len(f["broken_links"]) for f in findings)
        print(f"✗ Found {total} broken link(s) in {len(findings)} file(s):\n")
        for f in findings:
            print(f"  {f['file']}:")
            for link in f["broken_links"]:
                print(f"    Line {link['line']}: [{link['text']}]({link['target']})")
                print(f"      → {link['reason']}")
            print()

    sys.exit(1 if findings else 0)


if __name__ == "__main__":
    main()
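
The anchor check depends on how heading text becomes a slug. The sketch below shows one approximation of GitHub-style slugs, including the numeric suffix GitHub appends to repeated headings, which the checker above does not model. The function name `github_slugs` is hypothetical, and edge cases such as emoji or inline HTML in headings are not covered.

```python
import re
from collections import Counter
from typing import List


def github_slugs(headings: List[str]) -> List[str]:
    """Approximate GitHub heading anchors for a list of heading texts."""
    seen: Counter = Counter()
    slugs = []
    for heading in headings:
        # Lowercase, drop punctuation, keep word characters, spaces, hyphens.
        base = re.sub(r'[^\w\s-]', '', heading.strip().lower())
        # Each whitespace character becomes its own hyphen.
        base = re.sub(r'\s', '-', base)
        # Repeated headings get -1, -2, ... appended in order of appearance.
        count = seen[base]
        seen[base] += 1
        slugs.append(base if count == 0 else f"{base}-{count}")
    return slugs


print(github_slugs(["CI Integration", "Usage", "Usage", "Stale & Broken"]))
# ['ci-integration', 'usage', 'usage-1', 'stale--broken']
```

A link to `#usage-1` is valid on GitHub when a page has two "Usage" headings, so a checker that ignores the suffix would report it as broken.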

3. Interface Drift Detector

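Detects when documented interfaces no longer match the actual code. This is the most insidious form of documentation rot because the doc looks legitimate but leads agents astray. As written, the template checks only that each documented type name still has a definition in the code; it does not compare method signatures or fields.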

#!/usr/bin/env python3
"""
scripts/gc-interface-drift.py

Detects drift between documented interfaces and actual code definitions.

Usage:
    python3 scripts/gc-interface-drift.py .
    python3 scripts/gc-interface-drift.py . --json
    python3 scripts/gc-interface-drift.py . --doc docs/ARCHITECTURE.md
"""

import argparse
import json
import re
import subprocess
import sys
from pathlib import Path
from typing import Dict, List, Optional, Set


def extract_documented_interfaces(doc_path: Path) -> List[Dict]:
    """Extract interface/type names mentioned in documentation."""
    content = doc_path.read_text()
    interfaces = []

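    # Match backtick-quoted, capitalized identifiers such as `UserService`.
    # The suffix group is optional, so any capitalized identifier in backticks
    # (including non-types like `README`) is treated as a candidate type name.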
    type_pattern = r'`([A-Z][a-zA-Z0-9]+(?:Interface|Service|Provider|Handler|Repository|Store|Manager|Client)?)`'
    for match in re.finditer(type_pattern, content):
        name = match.group(1)
        line_num = content[:match.start()].count('\n') + 1
        interfaces.append({
            "name": name,
            "doc_file": str(doc_path),
            "doc_line": line_num
        })

    return interfaces


def find_type_in_code_go(project_root: Path, type_name: str) -> Optional[Dict]:
    """Find a Go type/interface definition in code."""
    try:
        result = subprocess.run(
            ["grep", "-rn", f"type {type_name} ", "--include=*.go",
             str(project_root)],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0 and result.stdout.strip():
            first_match = result.stdout.strip().split('\n')[0]
            parts = first_match.split(':', 2)
            if len(parts) >= 2:
                return {
                    "file": str(Path(parts[0]).relative_to(project_root)),
                    "line": int(parts[1]),
                    "definition": parts[2].strip() if len(parts) > 2 else ""
                }
    except Exception:
        pass
    return None


def find_type_in_code_ts(project_root: Path, type_name: str) -> Optional[Dict]:
    """Find a TypeScript interface/class/type definition in code."""
    try:
        patterns = [
            f"(export )?interface {type_name}",
            f"(export )?class {type_name}",
            f"(export )?type {type_name}"
        ]
        for pattern in patterns:
            result = subprocess.run(
                # -w: whole-word match, so "Foo" does not match "FooBar"
                ["grep", "-rn", "-w", "-E", pattern, "--include=*.ts", "--include=*.tsx",
                 str(project_root)],
                capture_output=True, text=True, timeout=10
            )
            if result.returncode == 0 and result.stdout.strip():
                first_match = result.stdout.strip().split('\n')[0]
                parts = first_match.split(':', 2)
                if len(parts) >= 2:
                    return {
                        "file": str(Path(parts[0]).relative_to(project_root)),
                        "line": int(parts[1]),
                        "definition": parts[2].strip() if len(parts) > 2 else ""
                    }
    except Exception:
        pass
    return None


def find_type_in_code_py(project_root: Path, type_name: str) -> Optional[Dict]:
    """Find a Python class definition in code."""
    try:
        result = subprocess.run(
            # -w: whole-word match, so "class Foo" does not match "class FooBar"
            ["grep", "-rn", "-w", f"class {type_name}", "--include=*.py",
             str(project_root)],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0 and result.stdout.strip():
            first_match = result.stdout.strip().split('\n')[0]
            parts = first_match.split(':', 2)
            if len(parts) >= 2:
                return {
                    "file": str(Path(parts[0]).relative_to(project_root)),
                    "line": int(parts[1]),
                    "definition": parts[2].strip() if len(parts) > 2 else ""
                }
    except Exception:
        pass
    return None


def detect_project_lang(project_root: Path) -> str:
    """Auto-detect project language."""
    if (project_root / "go.mod").exists():
        return "go"
    if (project_root / "package.json").exists():
        return "ts"
    if (project_root / "pyproject.toml").exists() or (project_root / "requirements.txt").exists():
        return "py"
    return "go"  # default


def check_interface_drift(project_root: Path, doc_path: Optional[str] = None) -> List[Dict]:
    """Check for interface drift between docs and code."""
    lang = detect_project_lang(project_root)
    finder = {"go": find_type_in_code_go, "ts": find_type_in_code_ts, "py": find_type_in_code_py}[lang]

    # Collect docs to check
    if doc_path:
        doc_files = [project_root / doc_path]
    else:
        # "docs/**/*.md" already includes docs/*.md; use a set so that no
        # file is scanned twice.
        found = set()
        for pattern in ["docs/**/*.md", "AGENTS.md"]:
            found.update(project_root.glob(pattern))
        doc_files = sorted(found)

    findings = []
    seen_types: Set[str] = set()

    for df in doc_files:
        if not df.exists():
            continue
        documented = extract_documented_interfaces(df)

        for iface in documented:
            if iface["name"] in seen_types:
                continue
            seen_types.add(iface["name"])

            code_loc = finder(project_root, iface["name"])
            if not code_loc:
                findings.append({
                    "type_name": iface["name"],
                    "documented_in": str(df.relative_to(project_root)),
                    "doc_line": iface["doc_line"],
                    "status": "not_found_in_code",
                    "severity": "high",
                    "suggestion": f"Type '{iface['name']}' referenced in docs but not found in code. "
                                  f"It may have been renamed, deleted, or the doc is outdated."
                })

    return findings


def main():
    parser = argparse.ArgumentParser(description="Detect interface drift between docs and code")
    parser.add_argument("project_root", help="Project root directory")
    parser.add_argument("--doc", help="Check a specific doc file only")
    parser.add_argument("--json", action="store_true", help="Output as JSON")
    args = parser.parse_args()

    project_root = Path(args.project_root).resolve()
    findings = check_interface_drift(project_root, args.doc)

    if args.json:
        print(json.dumps({"interface_drift": findings}, indent=2))
    else:
        if not findings:
            print("✓ No interface drift detected")
            return

        print(f"⚠ Found {len(findings)} potential interface drift issue(s):\n")
        for f in findings:
            icon = "🔴" if f["severity"] == "high" else "🟡"
            print(f"{icon} {f['type_name']}")
            print(f"   Documented in: {f['documented_in']}:{f['doc_line']}")
            print(f"   Status: {f['status']}")
            print(f"   → {f['suggestion']}")
            print()

    sys.exit(1 if findings else 0)


if __name__ == "__main__":
    main()

4. Unified GC Runner

Runs all garbage collection checks in one pass. Produces a combined report.

#!/usr/bin/env python3
"""
scripts/gc-docs.py

Unified documentation garbage collection runner.
Runs stale doc detection, broken link checking, and interface drift detection.

Usage:
    python3 scripts/gc-docs.py .
    python3 scripts/gc-docs.py . --json
    python3 scripts/gc-docs.py . --checks stale,links  # selective checks
"""

import argparse
import importlib.util
import json
import sys
from pathlib import Path
from datetime import datetime, timezone

# The individual checkers live next to this file in scripts/. Their filenames
# contain hyphens (e.g. gc-stale-docs.py), so a plain `import` statement
# cannot load them; each one is loaded by file path instead.
SCRIPTS_DIR = Path(__file__).resolve().parent


def load_checker(filename: str, func_name: str):
    """Load func_name from a sibling script such as 'gc-stale-docs.py'."""
    module_name = Path(filename).stem.replace("-", "_")
    spec = importlib.util.spec_from_file_location(module_name, SCRIPTS_DIR / filename)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, func_name)


def run_gc(project_root: Path, checks: list, threshold_days: int = 14) -> dict:
    """Run selected garbage collection checks."""
    report = {
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated in 3.12)
        "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "project_root": str(project_root),
        "checks_run": checks,
        "results": {},
        "summary": {"total_issues": 0, "high": 0, "medium": 0, "low": 0}
    }

    if "stale" in checks:
        # Run stale doc detection
        check_staleness = load_checker("gc-stale-docs.py", "check_staleness")
        stale = check_staleness(project_root, threshold_days)
        report["results"]["stale_docs"] = stale
        for item in stale:
            report["summary"]["total_issues"] += 1
            report["summary"][item.get("severity", "medium")] += 1

    if "links" in checks:
        # Run broken link detection
        check_all_links = load_checker("gc-broken-links.py", "check_all_links")
        broken = check_all_links(project_root)
        report["results"]["broken_links"] = broken
        for item in broken:
            count = len(item.get("broken_links", []))
            report["summary"]["total_issues"] += count
            report["summary"]["high"] += count  # broken links are always high severity

    if "drift" in checks:
        # Run interface drift detection
        check_interface_drift = load_checker("gc-interface-drift.py", "check_interface_drift")
        drift = check_interface_drift(project_root)
        report["results"]["interface_drift"] = drift
        for item in drift:
            report["summary"]["total_issues"] += 1
            report["summary"][item.get("severity", "medium")] += 1

    return report


def main():
    parser = argparse.ArgumentParser(description="Documentation garbage collection")
    parser.add_argument("project_root", help="Project root directory")
    parser.add_argument("--checks", default="stale,links,drift",
                        help="Comma-separated checks to run (default: all)")
    parser.add_argument("--threshold", type=int, default=14,
                        help="Staleness threshold in days")
    parser.add_argument("--json", action="store_true")
    parser.add_argument("-o", "--output", help="Write report to file")
    args = parser.parse_args()

    project_root = Path(args.project_root).resolve()
    checks = [c.strip() for c in args.checks.split(",")]
    report = run_gc(project_root, checks, args.threshold)

    output = json.dumps(report, indent=2) if args.json else format_report(report)

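    if args.output:
        out_path = Path(args.output)
        # Create missing parent directories (e.g. harness/trace/) first
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(output, encoding="utf-8")
        print(f"Report written to {args.output}")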
    else:
        print(output)

    sys.exit(1 if report["summary"]["total_issues"] > 0 else 0)


def format_report(report: dict) -> str:
    """Human-readable report format."""
    lines = []
    s = report["summary"]
    total = s["total_issues"]

    if total == 0:
        return "✓ Documentation is clean — no issues found"

    lines.append(f"Documentation GC Report — {report['timestamp'][:10]}")
    lines.append(f"{'=' * 50}")
    lines.append(f"Total issues: {total} (🔴 {s['high']} high, 🟡 {s['medium']} medium, 🟢 {s['low']} low)")
    lines.append("")

    for check_name, results in report["results"].items():
        if results:
            lines.append(f"## {check_name.replace('_', ' ').title()}")
            lines.append(f"   {len(results)} issue(s) found")
            lines.append("")

    return "\n".join(lines)


if __name__ == "__main__":
    main()

5. CI Integration

GitHub Actions

# .github/workflows/doc-gc.yml
name: Documentation GC
on:
  push:
    branches: [main]
    paths: ['docs/**', 'AGENTS.md', '*.md']
  schedule:
    - cron: '0 9 * * 1'  # Weekly, Monday 09:00 UTC

jobs:
  doc-gc:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for git log dates

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Run documentation GC
        run: python3 scripts/gc-docs.py . --json -o gc-report.json

      - name: Upload report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: doc-gc-report
          path: gc-report.json

Makefile Target

.PHONY: gc-docs
gc-docs:
	@echo "Running documentation garbage collection..."
	@python3 scripts/gc-docs.py . || true
	@echo ""
	@echo "To fix: review stale docs and update or delete them"

6. Scheduling Guide

Check	Frequency	Trigger	Rationale
Broken links	Per-commit (CI)	Push to main	Catches renames/deletes immediately
Stale docs	Weekly	Scheduled CI	Balances signal vs noise
Interface drift	Per-PR	PR check	Prevents drift from entering main
Full GC	Weekly + on-demand	Manual or scheduled	Comprehensive health check
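
One way to apply this schedule with a single workflow is to pick the `--checks` value from the event that triggered the run. The sketch below is an assumption about how that mapping could be wired, not part of the templates above: the names `CHECKS_BY_EVENT` and `checks_for_event` are hypothetical, while `push`, `pull_request`, `schedule`, and `workflow_dispatch` are standard GitHub Actions event names exposed through the `GITHUB_EVENT_NAME` environment variable.

```python
import os
from typing import Dict, List

ALL_CHECKS = ["stale", "links", "drift"]

# Hypothetical mapping that follows the scheduling table: links on push,
# drift on pull requests, everything on scheduled or manual runs.
CHECKS_BY_EVENT: Dict[str, List[str]] = {
    "push": ["links"],
    "pull_request": ["drift"],
    "schedule": ALL_CHECKS,
    "workflow_dispatch": ALL_CHECKS,
}


def checks_for_event(event_name: str) -> str:
    """Return a comma-separated --checks value for a CI event name."""
    # Unknown events fall back to the full suite.
    return ",".join(CHECKS_BY_EVENT.get(event_name, ALL_CHECKS))


if __name__ == "__main__":
    print(checks_for_event(os.environ.get("GITHUB_EVENT_NAME", "")))
```

The printed value can be passed straight to the unified runner, for example as the argument to `--checks`. Note that the workflow shown earlier only triggers on `push` and `schedule`, so a `pull_request` trigger would have to be added for the per-PR row to take effect.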

Running from Harness Executor

After completing a task that creates or modifies code, the executor can run a quick GC check:

# Quick post-task check: broken links only (scans all docs, not just changed ones)
python3 scripts/gc-broken-links.py . --json | python3 -c "
import sys, json
data = json.load(sys.stdin)
if data['broken_links']:
    print('⚠ Task may have introduced broken doc links — review before committing')
"

Running from ECL Harness Engineer (Improve Mode)

During a harness audit, run the full GC suite to assess documentation health:

python3 scripts/gc-docs.py . --json -o harness/trace/gc-report.json

The GC report feeds into the audit score (Documentation dimension) and helps prioritize which docs to fix first.
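
As a sketch of the prioritization step, the report written by the unified runner can be reduced to a per-document issue count. The ranking rule here (plain issue count, highest first) and the function name `rank_docs` are assumptions; how the audit score itself is computed is not described in this document and is not modeled.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_docs(report: Dict) -> List[Tuple[str, int]]:
    """Rank documents by the number of GC issues attributed to them."""
    counts: Counter = Counter()
    results = report.get("results", {})

    # Stale docs: one issue per stale or deleted code reference.
    for item in results.get("stale_docs", []):
        counts[item["doc"]] += len(item["stale_references"])

    # Broken links: one issue per broken link in the file.
    for item in results.get("broken_links", []):
        counts[item["file"]] += len(item["broken_links"])

    # Interface drift: one issue per missing type, charged to the citing doc.
    for item in results.get("interface_drift", []):
        counts[item["documented_in"]] += 1

    return counts.most_common()
```

Loading `harness/trace/gc-report.json` with `json.load` and passing the result to `rank_docs` gives a worst-first list of docs to review.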
