---
name: efficient-web-research
risk: safe
description: >
  Protocol for token-efficient web research. Use when accessing URLs, GitHub repos, or running search queries. Prevents full-page fetching waste.
---

# Efficient Web Research Skill

A protocol for accessing web content in the most token-efficient, accurate, and structured way —
using the right tool at the right depth, and stopping as soon as the question is answerable.

---

## Core Principle

> **Fetch the minimum needed to answer. Skim before you dive. Stop when you can answer.**

Every unnecessary fetch wastes tokens and adds noise. This skill enforces a layered approach
where you escalate fetch depth only when shallower layers fail.

---

## Step 1 — Classify the Input

Before fetching anything, identify what kind of input you received:

| Input Type | Example | Go To |
|---|---|---|
| GitHub repo URL | `github.com/user/repo` | [GitHub Protocol](#github-protocol) |
| Specific page URL | `docs.python.org/3/library/os` | [URL Protocol](#url-protocol) |
| Topic / query (no URL) | "how does RAFT consensus work" | [Search Protocol](#search-protocol) |
| Multiple URLs | List of links | [Multi-URL Protocol](#multi-url-protocol) |
| PDF / file link | `.pdf`, `.txt`, `.md` URL | [File Protocol](#file-protocol) |

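The routing in the table can be sketched as a small classifier. This is a minimal illustration, not part of the skill's tooling: `classify_input`, `_looks_like_url`, and the returned protocol labels are hypothetical names, and the bare-host check (a dot before the first slash) is an assumption about how URLs without a scheme show up in user input.

```python
from urllib.parse import urlparse

# File extensions mentioned in the table and in the File Protocol section.
FILE_EXTENSIONS = (".pdf", ".txt", ".md", ".csv", ".json", ".yaml")


def _looks_like_url(token: str) -> bool:
    """Heuristic: explicit scheme, or a bare 'host.tld/path' token."""
    if token.startswith(("http://", "https://")):
        return True
    head = token.split("/", 1)[0]
    return "/" in token and "." in head


def classify_input(user_input: str) -> str:
    """Return a protocol label: github, url, search, multi-url, or file."""
    urls = [t for t in user_input.split() if _looks_like_url(t)]
    if len(urls) > 1:
        return "multi-url"
    if not urls:
        return "search"
    url = urls[0] if "://" in urls[0] else "https://" + urls[0]
    parsed = urlparse(url)
    if parsed.netloc.lower() in ("github.com", "www.github.com"):
        return "github"
    if parsed.path.lower().endswith(FILE_EXTENSIONS):
        return "file"
    return "url"
```
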
---

## GitHub Protocol

Use when the input is a GitHub URL (repo, file, PR, issue, etc.).

### Step 1 — Parse the URL

```
github.com/{owner}/{repo}                → Repo root
github.com/{owner}/{repo}/tree/{branch}/{path} → Directory (omit {path} for the branch root)
github.com/{owner}/{repo}/blob/{branch}/{path} → Single file
github.com/{owner}/{repo}/issues/{n}     → Issue
github.com/{owner}/{repo}/pull/{n}       → Pull request
```
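
A minimal sketch of this parsing step, assuming a regular expression is acceptable for the five URL shapes listed. `parse_github_url` and the `kind` labels are hypothetical names. Branch names may themselves contain slashes, so the sketch leaves everything after `tree/` or `blob/` unsplit in `rest` rather than guessing where the branch ends.

```python
import re
from typing import Optional

_GITHUB_URL = re.compile(
    r"^(?:https?://)?(?:www\.)?github\.com/"
    r"(?P<owner>[^/]+)/(?P<repo>[^/#?]+)"
    r"(?:/(?P<kind>tree|blob|issues|pull)/(?P<rest>[^#?]+))?"
)

_KIND_NAMES = {
    "tree": "directory",
    "blob": "file",
    "issues": "issue",
    "pull": "pull_request",
}


def parse_github_url(url: str) -> Optional[dict]:
    """Return owner, repo, and kind for a GitHub URL, or None if it is not one."""
    match = _GITHUB_URL.match(url.strip())
    if match is None:
        return None
    info = {"owner": match["owner"], "repo": match["repo"], "kind": "repo"}
    if match["kind"]:
        info["kind"] = _KIND_NAMES[match["kind"]]
        # For tree/blob this is "{branch}/{path}"; for issues/pull it is the number.
        info["rest"] = match["rest"].rstrip("/")
    return info
```

For example, `parse_github_url("github.com/user/repo/issues/42")` yields kind `issue` with `rest` set to `"42"`.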

### Step 2 — Use GitHub API (preferred over scraping)

Always prefer the GitHub API. It returns clean JSON — no HTML parsing needed.

```
# Repo metadata (name, description, language, stars, topics)
GET https://api.github.com/repos/{owner}/{repo}

# File tree (see what files exist; cheap for small repos, but check the "truncated" field on very large ones)
GET https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}?recursive=1

# Single file content (base64 encoded)
GET https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={ref}

# README only (usually enough to understand the repo)
GET https://api.github.com/repos/{owner}/{repo}/readme
```
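
The contents and README endpoints return the file body base64-encoded inside a JSON object. The sketch below builds the contents URL and decodes such a response. The helper names `contents_url` and `decode_contents` are hypothetical, and the sketch assumes the response body has already been fetched as a string by whatever HTTP tool is available.

```python
import base64
import json
from urllib.parse import quote

API_ROOT = "https://api.github.com"


def contents_url(owner: str, repo: str, path: str, ref: str) -> str:
    """Build the contents endpoint URL for one file at a given ref."""
    # quote() keeps "/" in the path; the ref is fully escaped for the query string.
    return f"{API_ROOT}/repos/{owner}/{repo}/contents/{quote(path)}?ref={quote(ref, safe='')}"


def decode_contents(payload: str) -> str:
    """Decode the JSON body of a contents or readme response into text."""
    data = json.loads(payload)
    if data.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {data.get('encoding')!r}")
    # The base64 text is wrapped across lines; b64decode skips the newlines by default.
    return base64.b64decode(data["content"]).decode("utf-8", errors="replace")
```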

### Step 3 — Layered Fetch for Repos

```
Layer 1 (always do first):
  → Fetch repo metadata + README only
  → Can you answer the user's question now? YES → STOP. NO → continue.

Layer 2 (only if needed):
  → Fetch file tree to understand structure
  → Identify the 1-3 most relevant files based on the question
  → Can you answer now? YES → STOP. NO → continue.

Layer 3 (last resort):
  → Fetch specific relevant files only (never fetch all files)
  → Prioritize: main entry point, config files, key modules
```
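
The escalation logic above amounts to running fetches in order and stopping at the first layer that makes the question answerable. A minimal sketch, assuming each layer is a zero-argument callable and that the caller supplies its own `can_answer` check; both are hypothetical placeholders for whatever fetch tools and judgment the agent actually uses.

```python
from typing import Callable, Iterable, List, Tuple

Layer = Tuple[str, Callable[[], str]]


def run_layers(
    layers: Iterable[Layer],
    can_answer: Callable[[List[str]], bool],
) -> Tuple[str, List[str]]:
    """Run layers in order; return the last layer used and everything gathered."""
    gathered: List[str] = []
    last_name = "none"
    for name, fetch in layers:
        gathered.append(fetch())  # deeper layers are never called once we stop
        last_name = name
        if can_answer(gathered):
            break
    return last_name, gathered
```

Because the layers are callables, a deeper fetch costs nothing unless the loop actually reaches it.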

### Token Rules for GitHub

- README alone answers ~70% of "what does this repo do" questions — always try it first
- Never fetch more than 3 files in a single research turn
- If a file exceeds ~300 lines, read only the top (imports + class/function signatures)
- Decode the base64 content returned by the API before adding it to context

---

## URL Protocol

Use when the user gives a specific non-GitHub URL (docs, articles, blogs, etc.).

### Step 1 — Assess the URL type

| Site type | Likely works with | Notes |
|---|---|---|
| Static docs / MDN / ReadTheDocs | `read_url_content` | Fast, clean, cheap |
| News articles / blogs | `read_url_content` | Usually fine |
| SPAs / React/Next.js apps | `browser_subagent` | JS-rendered |
| Auth-gated pages | `browser_subagent` | Needs login |
| Raw GitHub files (raw.githubusercontent) | `read_url_content` | Direct text |

### Step 2 — Layered Fetch

```
Layer 1 — Skim
  → Fetch the URL with read_url_content
  → Read only headings (H1, H2, H3) and first paragraph
  → Does this page contain what the user needs? NO → try a different URL or search. YES → continue.

Layer 2 — Targeted Extract
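  → If the page has anchor links (e.g. /docs/page#section), use the anchor to locate that section in the fetched text
    (the #fragment is never sent to the server, so the response is still the whole page)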
  → Extract only the relevant section (200–500 tokens max)
  → Can you answer? YES → STOP.

Layer 3 — Full Fetch
  → Fetch full page, strip boilerplate (nav, footer, ads, cookie banners, sidebars)
  → Cap at 2000 tokens. Summarize before passing to answer.

Layer 4 — Browser Subagent (last resort only)
  → Use ONLY if read_url_content returns empty, garbled, or JS-placeholder content
  → Instruct subagent: "Navigate to [URL], wait for content to load, extract [specific section]"
  → Do NOT use browser_subagent for static pages — it's expensive
```
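
Layers 1, 2, and 4 can be sketched as plain text operations, assuming the fetch tool returns Markdown-like text with `#` headings (an assumption about `read_url_content` output, not a documented guarantee). `skim`, `extract_section`, and `needs_browser` are hypothetical helper names, and the "enable javascript" check is only one example of a placeholder marker. Lines starting with `#` inside code blocks would be misread as headings by this sketch.

```python
import re

_H1_TO_H3 = re.compile(r"^(#{1,3})[ \t]+(.*\S)[ \t]*$", re.MULTILINE)
_ANY_HEADING = re.compile(r"^(#{1,6})[ \t]+(.*\S)[ \t]*$")


def skim(markdown: str) -> list:
    """Layer 1: return (level, title) pairs for H1-H3 headings only."""
    return [(len(m.group(1)), m.group(2)) for m in _H1_TO_H3.finditer(markdown)]


def extract_section(markdown: str, title: str) -> str:
    """Layer 2: text under `title`, up to the next heading of equal or higher level."""
    out, level = [], None
    for line in markdown.splitlines():
        m = _ANY_HEADING.match(line)
        if level is None:
            if m and m.group(2).lower() == title.lower():
                level = len(m.group(1))
                out.append(line)
        elif m and len(m.group(1)) <= level:
            break
        else:
            out.append(line)
    return "\n".join(out).strip()


def needs_browser(text: str) -> bool:
    """Layer 4 trigger: empty content or an obvious JS placeholder."""
    stripped = text.strip()
    return not stripped or "enable javascript" in stripped.lower()
```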

### What to Strip from Fetched Pages

Always remove before using fetched content:
- Navigation menus and breadcrumbs
- Cookie banners and GDPR notices
- "Related articles" / "You might also like" blocks
- Footer content (copyright, links)
- Social share buttons
- Ads and sponsored content

Extract and keep:
- Main article / documentation body
- Code blocks
- Tables with data
- Numbered steps or procedures

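When only raw HTML is available, the strip list above can be approximated with the standard-library `html.parser`. This is a rough sketch: `MainTextExtractor` and `strip_boilerplate` are hypothetical names, the set of skipped tags is an assumption, and cookie banners or ad blocks that are identified only by class names are not caught. It also flattens code blocks and tables to plain text lines.

```python
from html.parser import HTMLParser

# Elements whose text is treated as boilerplate (an assumed, incomplete list).
SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_boilerplate(html: str) -> str:
    """Return the remaining text, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```
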
---

## Search Protocol

Use when the user gives a topic, question, or query — not a specific URL.

### Step 1 — Sharpen the Query Before Searching

Do NOT search the raw user query. Transform it first:

```
Raw: "how to deploy fastapi on aws"
Sharpened: "fastapi AWS deployment tutorial 2024"

Raw: "python async vs threads"
Sharpened: "Python asyncio vs threading performance comparison"

Raw: "best way to structure react project"
Sharpened: "React project folder structure best practices"
```

**Query sharpening rules:**
- Add specificity: version numbers, technology names, "tutorial" / "guide" / "comparison"
- Add recency if relevant: current year
- Remove filler words: "how do I", "what is the", "can you explain"
- For code questions: add the language + framework name explicitly

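The mechanical part of these rules (dropping filler, appending specific terms and a year) can be sketched as below. `sharpen_query` is a hypothetical helper, and the filler list contains only the three phrases named above. Rewording, such as turning "deploy" into "deployment", still needs judgment and is not attempted here.

```python
import re

# Leading filler phrases from the rules above.
_FILLER = re.compile(
    r"^\s*(how do i|what is the|can you explain)\b\s*",
    re.IGNORECASE,
)


def sharpen_query(raw: str, extra_terms=(), year=None) -> str:
    """Strip leading filler, then append specificity terms and an optional year."""
    core = _FILLER.sub("", raw).strip().rstrip("?")
    parts = [core, *extra_terms]
    if year is not None:
        parts.append(str(year))
    return " ".join(p for p in parts if p)
```
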
### Step 2 — Search and Select

```
1. Run search_web with the sharpened query
2. Get results (titles + snippets)
3. Scan titles + snippets ONLY — do not fetch yet
4. Pick the TOP 1-2 most relevant results (max 3 in complex cases)
5. Skip results from: forums (if docs exist), aggregator blogs, paywalled sites
6. Prefer: official docs, GitHub repos, well-known tech blogs, academic sources
```
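
Steps 3 to 6 can be sketched as a scoring pass over titles and snippets only. The result shape (dicts with `title`, `url`, `snippet`), the `select_results` name, the +2 bonus for preferred hosts, and the example host names are all assumptions for illustration; the actual `search_web` output format may differ.

```python
from urllib.parse import urlparse


def select_results(results, query_terms, limit=2, preferred=(), skipped=()):
    """Pick up to `limit` results (hard cap of 3) without fetching any page."""
    scored = []
    for position, result in enumerate(results):
        host = urlparse(result["url"]).netloc.lower()
        if host in skipped:
            continue
        text = (result["title"] + " " + result["snippet"]).lower()
        score = sum(term.lower() in text for term in query_terms)
        if host in preferred:
            score += 2  # assumed bonus for official docs and similar sources
        scored.append((-score, position, result))
    # Highest score first; original search rank breaks ties.
    scored.sort(key=lambda item: item[:2])
    return [result for _, _, result in scored[: min(limit, 3)]]
```

A caller would pass its own `preferred` and `skipped` host lists, for example an official documentation host and known aggregator sites.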

### Step 3 — Fetch Selected Results

Apply the URL Protocol (above) to each selected URL.
Process results one at a time — only fetch the second URL if the first didn't answer the question.

### Token Rules for Search

- Never read more than 3 URLs per search query
- If the snippet already contains the answer → do NOT fetch the full page, use the snippet
- For factual questions (dates, names, simple facts) → snippet is usually enough
- For procedural questions (how to do X) → fetch 1 relevant page, targeted section only

---

## Multi-URL Protocol

Use when the user provides a list of URLs to compare or summarize.

```
1. Skim all URLs first (Layer 1 fetch for each)
2. Group by relevance to the user's question
3. Deep-fetch only the most relevant 1-3 URLs
4. Summarize each in 3-5 sentences before combining
5. Never dump raw content from multiple pages — always summarize per-source first
```
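
These five steps can be sketched as one function with the tools injected as callables. Every parameter here (`skim`, `deep_fetch`, `summarize`, `relevance`) is a hypothetical stand-in for whatever fetch and summarization steps the agent really uses.

```python
def research_many(urls, skim, deep_fetch, summarize, relevance, top_k=3):
    """Skim every URL, deep-fetch only the top_k most relevant, summarize per source."""
    urls = list(dict.fromkeys(urls))  # drop duplicate URLs, keep original order
    skims = {url: skim(url) for url in urls}
    ranked = sorted(urls, key=lambda url: relevance(skims[url]), reverse=True)
    summaries = []
    for url in ranked[:top_k]:
        # Each source is summarized on its own before anything is combined.
        summaries.append({"source": url, "summary": summarize(deep_fetch(url))})
    return summaries
```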

---

## File Protocol

Use when the URL points directly to a file (PDF, .txt, .md, .csv, etc.).

- `.md` / `.txt` / `.csv` → `read_url_content` works directly, read full content
- `.pdf` → Use browser_subagent or a PDF extraction tool; extract text only
- `.json` / `.yaml` → `read_url_content`, parse structure, summarize schema + key values
- Large files (>500 lines) → Read first 100 lines + last 20 lines + search for relevant sections

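The large-file rule can be sketched as a line sampler. `sample_large_file` is a hypothetical helper; the defaults mirror the numbers above, and "search for relevant sections" is reduced to a case-insensitive keyword match, which is an assumption. Skipped ranges are marked with `[...]`.

```python
def sample_large_file(text, keyword=None, head=100, tail=20, threshold=500):
    """Return text unchanged if small; otherwise head + keyword hits + tail."""
    lines = text.splitlines()
    if len(lines) <= threshold:
        return text
    keep = set(range(head)) | set(range(len(lines) - tail, len(lines)))
    if keyword:
        needle = keyword.lower()
        keep |= {i for i, line in enumerate(lines) if needle in line.lower()}
    out, previous = [], -1
    for i in sorted(keep):
        if i != previous + 1:
            out.append("[...]")  # marks a gap of skipped lines
        out.append(lines[i])
        previous = i
    return "\n".join(out)
```
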
---

## Anti-Patterns (Never Do These)

| Anti-pattern | Why it's bad | Do this instead |
|---|---|---|
| Fetching full page for a simple fact | Wastes 1000s of tokens | Use snippet or targeted anchor |
| Using browser_subagent for static sites | Very expensive | Use read_url_content first |
| Searching with the raw user query | Vague results | Sharpen query first |
| Fetching 5+ search results | Token explosion | Max 3, stop when answered |
| Dumping raw HTML into context | Noisy, wasteful | Always strip to Markdown |
| Fetching "just in case" | Unnecessary tokens | Only fetch what's needed to answer |
| Re-fetching the same URL | Redundant | Cache result in context, reuse |
| Fetching entire GitHub repo | Extremely wasteful | README + targeted files only |

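The "re-fetching the same URL" row can be handled with a small cache around whatever fetch function is in use. `CachedFetcher` is a hypothetical wrapper. It keys on the URL without its `#fragment`, since the fragment does not change what the server returns.

```python
class CachedFetcher:
    """Wrap a fetch callable so each distinct page is fetched at most once."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}
        self.network_calls = 0

    def get(self, url: str) -> str:
        key = url.split("#", 1)[0]  # same page regardless of fragment
        if key not in self._cache:
            self._cache[key] = self._fetch(key)
            self.network_calls += 1
        return self._cache[key]
```
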
---

## Decision Flowchart (Quick Reference)

```
Input received
│
├─ GitHub URL?
│   ├─ Fetch README + metadata via API
│   ├─ Answered? → STOP
│   ├─ Need more? → Fetch file tree, pick 1-3 files
│   └─ Still need more? → Fetch specific files only
│
├─ Specific URL?
│   ├─ Try read_url_content → skim headings
│   ├─ Answered? → STOP
│   ├─ Need more? → Targeted section fetch
│   ├─ Still need more? → Full fetch, stripped
│   └─ JS-rendered / broken? → browser_subagent (last resort)
│
├─ Topic/query?
│   ├─ Sharpen query
│   ├─ search_web → scan snippets
│   ├─ Snippet enough? → Answer from snippet, STOP
│   ├─ Need more? → Fetch top 1 result (targeted)
│   └─ Still need more? → Fetch the second result (targeted)
│
└─ List of URLs?
    ├─ Skim all (Layer 1 each)
    ├─ Deep fetch top 1-3 relevant ones
    └─ Summarize per-source, then combine
```

---

## Output Format Rules

After fetching, structure your response as:

```
Source: [URL or "Web search for: query"]
Summary: [2-5 sentences of what was found]
Answer: [Direct answer to user's question]
Confidence: [High / Medium / Low — based on source quality]
```

For multiple sources:
```
Source 1: ...
Source 2: ...
Combined Answer: ...
```
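
A small formatter keeps responses in this shape. `format_answer` is a hypothetical helper. The multi-source template above elides its details, so the per-source `Summary N:` lines and the trailing `Confidence:` line in that branch are assumptions rather than part of the specified format.

```python
def format_answer(sources, answer, confidence):
    """sources is a list of (url_or_query, summary) pairs."""
    if confidence not in ("High", "Medium", "Low"):
        raise ValueError("confidence must be High, Medium, or Low")
    if not sources:
        raise ValueError("at least one source is required")
    if len(sources) == 1:
        source, summary = sources[0]
        lines = [f"Source: {source}", f"Summary: {summary}", f"Answer: {answer}"]
    else:
        lines = []
        for number, (source, summary) in enumerate(sources, 1):
            lines.append(f"Source {number}: {source}")
            lines.append(f"Summary {number}: {summary}")
        lines.append(f"Combined Answer: {answer}")
    lines.append(f"Confidence: {confidence}")
    return "\n".join(lines)
```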

Never output:
- Raw HTML fragments
- Full page dumps
- Unattributed information
- More than needed to answer the question

---

## Token Budget Guide

| Operation | Approximate token cost | When to use |
|---|---|---|
| GitHub README fetch | ~300–800 tokens | Always first for repos |
| GitHub API metadata | ~200 tokens | Always for repos |
| Skim (headings only) | ~100–200 tokens | Always first for URLs |
| Targeted section fetch | ~300–600 tokens | When skim isn't enough |
| Full page fetch (stripped) | ~1000–2000 tokens | Only when targeted fails |
| browser_subagent | ~2000–5000 tokens | Last resort only |
| Search snippet scan | ~300–500 tokens | Always before fetching |

**Rule of thumb:** If you're about to spend >2000 tokens on a fetch, ask yourself if there's a cheaper path first.

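To apply the budget before content reaches the context, a rough estimate is usually enough. The sketch below assumes about four characters per token, which is a coarse heuristic for English text and not a real tokenizer; `estimate_tokens` and `cap_to_budget` are hypothetical helpers.

```python
CHARS_PER_TOKEN = 4  # coarse assumption; real tokenizers vary by model and language


def estimate_tokens(text: str) -> int:
    """Approximate token count, rounded up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def cap_to_budget(text: str, max_tokens: int = 2000) -> str:
    """Truncate text to roughly max_tokens, preferring to cut at a line break."""
    limit = max_tokens * CHARS_PER_TOKEN
    if len(text) <= limit:
        return text
    cut = text.rfind("\n", 0, limit)
    return text[: cut if cut > 0 else limit]
```
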
---

## Limitations

- **JavaScript Reliance**: Standard fetching may not fully render Single Page Applications (SPAs). You must fall back to `browser_subagent` for these, which is slower and more expensive.
- **Paywalls & Protections**: This skill cannot bypass CAPTCHAs, bot protections (e.g., strict Cloudflare rules), or hard paywalls.
- **GitHub API Limits**: Unauthenticated GitHub REST API requests are limited to 60 per hour per IP address; authenticate with a token if research needs more.
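
For the rate-limit case, GitHub's REST API reports the remaining quota in the `x-ratelimit-remaining` response header and the reset time (UTC epoch seconds) in `x-ratelimit-reset`. A minimal sketch of reading them, assuming the response headers are available as a plain dict; `seconds_until_reset` is a hypothetical helper.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return 0 if requests remain, else seconds until the limit window resets."""
    lowered = {key.lower(): value for key, value in headers.items()}
    # If the header is missing, assume quota remains rather than blocking.
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0
    reset_at = int(lowered.get("x-ratelimit-reset", "0"))
    current = time.time() if now is None else now
    return max(0, int(reset_at - current))
```

A non-zero result is a signal to stop issuing API calls for that long, or to answer from what has already been fetched.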