---
name: efficient-web-research
risk: safe
description: >
  Protocol for token-efficient web research. Use when accessing URLs, GitHub repos, or running search queries. Prevents full-page fetching waste.
---

# Efficient Web Research Skill

A protocol for accessing web content in the most token-efficient, accurate, and structured way,
using the right tool at the right depth, and stopping as soon as the question is answerable.

---

## Core Principle

> **Fetch the minimum needed to answer. Skim before you dive. Stop when you can answer.**

Every unnecessary fetch wastes tokens and adds noise. This skill enforces a layered approach
where you escalate fetch depth only when shallower layers fail.

---

## Step 1 — Classify the Input

Before fetching anything, identify what kind of input you received:

| Input Type | Example | Go To |
|---|---|---|
| GitHub repo URL | `github.com/user/repo` | [GitHub Protocol](#github-protocol) |
| Specific page URL | `docs.python.org/3/library/os` | [URL Protocol](#url-protocol) |
| Topic / query (no URL) | "how does RAFT consensus work" | [Search Protocol](#search-protocol) |
| Multiple URLs | List of links | [Multi-URL Protocol](#multi-url-protocol) |
| PDF / file link | `.pdf`, `.txt`, `.md` URL | [File Protocol](#file-protocol) |

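The classification table can be approximated in code. The following is a minimal sketch, not part of the skill itself: `classify_input` and `_looks_like_url` are hypothetical helper names, the URL detection is a deliberately crude heuristic (a bare host without a path, such as `example.com`, is treated as a query), and GitHub links are checked before file extensions.

```python
from urllib.parse import urlparse

# Extensions treated as direct file links (taken from the File Protocol section).
FILE_EXTENSIONS = (".pdf", ".txt", ".md", ".csv", ".json", ".yaml")


def _looks_like_url(token: str) -> bool:
    """Heuristic: explicit scheme, or a dotted host followed by a path."""
    if token.startswith(("http://", "https://")):
        return True
    host, slash, _ = token.partition("/")
    return bool(slash) and "." in host


def classify_input(text: str) -> str:
    """Return one of: 'github', 'url', 'search', 'multi-url', 'file'."""
    tokens = [t.strip(",;") for t in text.split()]
    urls = [t for t in tokens if _looks_like_url(t)]
    if not urls:
        return "search"
    if len(urls) > 1:
        return "multi-url"
    url = urls[0]
    parsed = urlparse(url if "://" in url else "https://" + url)
    if parsed.netloc.lower() in ("github.com", "www.github.com"):
        return "github"
    if parsed.path.lower().endswith(FILE_EXTENSIONS):
        return "file"
    return "url"
```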

---

## GitHub Protocol

Use when the input is a GitHub URL (repo, file, PR, issue, etc.).

### Step 1 — Parse the URL

```
github.com/{owner}/{repo}                       → Repo root
github.com/{owner}/{repo}/tree/{branch}         → Directory
github.com/{owner}/{repo}/blob/{branch}/{path}  → Single file
github.com/{owner}/{repo}/issues/{n}            → Issue
github.com/{owner}/{repo}/pull/{n}              → Pull request
```

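As an illustration of the URL shapes above, here is a small parsing sketch. `parse_github_url` is a hypothetical helper, not something the skill defines. It assumes branch names contain no `/` (a branch such as `feature/x` would be split incorrectly) and returns `None` for anything it does not recognize.

```python
import re

GITHUB_URL = re.compile(
    r"^(?:https?://)?(?:www\.)?github\.com/"
    r"(?P<owner>[^/]+)/(?P<repo>[^/]+)"
    r"(?:/(?P<kind>tree|blob|issues|pull)/(?P<rest>.+))?/?$"
)


def parse_github_url(url: str):
    """Split a GitHub URL into owner, repo, and the kind of resource it names."""
    match = GITHUB_URL.match(url.strip())
    if match is None:
        return None
    owner, repo, kind, rest = match.group("owner", "repo", "kind", "rest")
    if kind is None:
        return {"owner": owner, "repo": repo, "kind": "repo"}
    head, _, tail = rest.strip("/").partition("/")
    if kind in ("tree", "blob"):
        # Assumption: the first path segment is the branch name.
        return {
            "owner": owner,
            "repo": repo,
            "kind": "directory" if kind == "tree" else "file",
            "branch": head,
            "path": tail,
        }
    if not head.isdigit():
        return None  # e.g. /issues/new is not a specific issue
    return {
        "owner": owner,
        "repo": repo,
        "kind": "issue" if kind == "issues" else "pull_request",
        "number": int(head),
    }
```

The returned dictionary carries enough information to choose an API endpoint without fetching the HTML page.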
### Step 2 — Use GitHub API (preferred over scraping)

Always prefer the GitHub API. It returns clean JSON, so no HTML parsing is needed.

```
# Repo metadata (name, description, language, stars, topics)
GET https://api.github.com/repos/{owner}/{repo}

# File tree (see what files exist, very cheap)
GET https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}?recursive=1

# Single file content (base64 encoded)
GET https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={ref}

# README only (usually enough to understand the repo)
GET https://api.github.com/repos/{owner}/{repo}/readme

# Issue or pull request (title, body, state)
GET https://api.github.com/repos/{owner}/{repo}/issues/{n}
GET https://api.github.com/repos/{owner}/{repo}/pulls/{n}
```

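A short sketch of how these endpoints might be assembled and how a file response can be decoded. `api_urls` and `decode_contents` are hypothetical helper names. The sketch makes no network calls, and it relies on the `content` and `encoding` fields that the contents and README endpoints return.

```python
import base64

API_ROOT = "https://api.github.com"


def api_urls(owner: str, repo: str, ref: str, path: str = None) -> dict:
    """Build the endpoint URLs listed above for one repository."""
    base = f"{API_ROOT}/repos/{owner}/{repo}"
    urls = {
        "metadata": base,
        "readme": f"{base}/readme",
        "tree": f"{base}/git/trees/{ref}?recursive=1",
    }
    if path is not None:
        urls["file"] = f"{base}/contents/{path}?ref={ref}"
    return urls


def decode_contents(payload: dict) -> str:
    """Decode the base64 `content` field of a contents/readme response."""
    if payload.get("encoding") != "base64":
        raise ValueError("expected a base64-encoded payload")
    # b64decode ignores the newlines the API inserts into the encoded text.
    return base64.b64decode(payload["content"]).decode("utf-8")
```

Decoding before the text reaches the context matters because base64 inflates the payload by roughly a third and is unreadable to the model.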
### Step 3 — Layered Fetch for Repos

```
Layer 1 (always do first):
  → Fetch repo metadata + README only
  → Can you answer the user's question now? YES → STOP. NO → continue.

Layer 2 (only if needed):
  → Fetch file tree to understand structure
  → Identify the 1-3 most relevant files based on the question
  → Can you answer now? YES → STOP. NO → continue.

Layer 3 (last resort):
  → Fetch specific relevant files only (never fetch all files)
  → Prioritize: main entry point, config files, key modules
```

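The escalation logic above can be expressed as a small loop. This is a sketch under assumptions: `layered_fetch` is a hypothetical name, each layer is supplied by the caller as a zero-argument function returning text, and `can_answer` stands in for the judgment of whether the accumulated context answers the question.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Layer = Tuple[str, Callable[[], str]]


def layered_fetch(
    layers: Sequence[Layer],
    can_answer: Callable[[List[str]], bool],
) -> Tuple[Optional[str], List[str]]:
    """Run layers in order and stop at the first one that makes the question answerable.

    Returns the name of the layer that sufficed (or None) and the fetched context.
    """
    context: List[str] = []
    for name, fetch in layers:
        context.append(fetch())  # deeper layers run only if we get this far
        if can_answer(context):
            return name, context
    return None, context
```

Because each fetch is a callable, later layers cost nothing unless the earlier ones fail to answer the question.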
### Token Rules for GitHub

- README alone answers ~70% of "what does this repo do" questions, so always try it first
- Never fetch more than 3 files in a single research turn
- If a file exceeds ~300 lines, read only the top (imports + class/function signatures)
- Decode base64 content from the API before adding it to context

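For the ~300-line rule, one possible way to reduce a long file to its imports and signatures is sketched below. It assumes Python source and uses a simple line-based pattern, so multi-line signatures and other languages are not handled. `outline` is a hypothetical helper name.

```python
import re

# Lines that start an import, class, or function definition (Python only).
SIGNATURE = re.compile(r"^\s*(?:import |from \S+ import |class |def |async def )")


def outline(source: str, max_lines: int = 300) -> str:
    """Return short files unchanged; reduce long ones to imports and signatures."""
    lines = source.splitlines()
    if len(lines) <= max_lines:
        return source
    return "\n".join(line for line in lines if SIGNATURE.match(line))
```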

---

## URL Protocol

Use when the user gives a specific non-GitHub URL (docs, articles, blogs, etc.).

### Step 1 — Assess the URL type

| Site type | Likely works with | Notes |
|---|---|---|
| Static docs / MDN / ReadTheDocs | `read_url_content` | Fast, clean, cheap |
| News articles / blogs | `read_url_content` | Usually fine |
| SPAs / React/Next.js apps | `browser_subagent` | JS-rendered |
| Auth-gated pages | `browser_subagent` | Needs login |
| Raw GitHub files (`raw.githubusercontent.com`) | `read_url_content` | Direct text |

### Step 2 — Layered Fetch

```
Layer 1 — Skim
  → Fetch the URL with read_url_content
  → Read only headings (H1, H2, H3) and first paragraph
  → Does this page contain what the user needs? NO → try a different URL or search. YES → continue.

Layer 2 — Targeted Extract
  → If the page has anchor links (e.g. /docs/page#section), use the anchor to locate the section
    (servers ignore the #fragment, so the fetch itself usually still returns the full page)
  → Extract only the relevant section (200–500 tokens max)
  → Can you answer? YES → STOP.

Layer 3 — Full Fetch
  → Fetch full page, strip boilerplate (nav, footer, ads, cookie banners, sidebars)
  → Cap at 2000 tokens. Summarize before passing to answer.

Layer 4 — Browser Subagent (last resort only)
  → Use ONLY if read_url_content returns empty, garbled, or JS-placeholder content
  → Instruct subagent: "Navigate to [URL], wait for content to load, extract [specific section]"
  → Do NOT use browser_subagent for static pages, it's expensive
```

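The Layer 1 skim can be illustrated with the standard-library HTML parser. This sketch assumes the fetched content is raw HTML (what `read_url_content` actually returns is not specified here and may already be Markdown). `HeadingSkimmer` and `skim` are hypothetical names.

```python
from html.parser import HTMLParser


class HeadingSkimmer(HTMLParser):
    """Collect H1-H3 text and the first non-empty paragraph of a page."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.first_paragraph = None
        self._tag = None      # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        wants_paragraph = tag == "p" and self.first_paragraph is None
        if tag in ("h1", "h2", "h3") or wants_paragraph:
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "p":
            if text:
                self.first_paragraph = text
        else:
            self.headings.append((tag, text))
        self._tag = None


def skim(html: str):
    """Return ([(tag, heading text), ...], first paragraph or None)."""
    parser = HeadingSkimmer()
    parser.feed(html)
    parser.close()
    return parser.headings, parser.first_paragraph
```

The heading list is usually enough to decide whether the page is worth a targeted extract.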
### What to Strip from Fetched Pages

Always remove before using fetched content:
- Navigation menus and breadcrumbs
- Cookie banners and GDPR notices
- "Related articles" / "You might also like" blocks
- Footer content (copyright, links)
- Social share buttons
- Ads and sponsored content

Extract and keep:
- Main article / documentation body
- Code blocks
- Tables with data
- Numbered steps or procedures

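A rough sketch of tag-based stripping follows. It is an approximation: it drops only content inside a fixed set of structural tags (`SKIP_TAGS`, an assumed list), so cookie banners, ads, and share buttons implemented as plain `<div>` elements would need site-specific rules, and whitespace inside code blocks is not preserved. `BodyExtractor` and `extract_body` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed set of tags whose content is treated as boilerplate.
SKIP_TAGS = {"nav", "footer", "aside", "script", "style", "form"}


class BodyExtractor(HTMLParser):
    """Keep text outside boilerplate tags, one chunk per text node."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_body(html: str) -> str:
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```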

---

## Search Protocol

Use when the user gives a topic, question, or query, not a specific URL.

### Step 1 — Sharpen the Query Before Searching

Do NOT search the raw user query. Transform it first:

```
Raw:       "how to deploy fastapi on aws"
Sharpened: "fastapi AWS deployment tutorial 2024"

Raw:       "python async vs threads"
Sharpened: "Python asyncio vs threading performance comparison"

Raw:       "best way to structure react project"
Sharpened: "React project folder structure best practices"
```

**Query sharpening rules:**
- Add specificity: version numbers, technology names, "tutorial" / "guide" / "comparison"
- Add recency if relevant: current year
- Remove filler words: "how do I", "what is the", "can you explain"
- For code questions: add the language + framework name explicitly

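The mechanical part of these rules (dropping filler, appending specific terms) can be sketched as follows. Choosing which terms to add still needs judgment. `sharpen_query` is a hypothetical helper, and `FILLER_PHRASES` contains only the three phrases named in the rules above.

```python
import re

FILLER_PHRASES = ("how do i", "what is the", "can you explain")


def sharpen_query(raw: str, extra_terms=()) -> str:
    """Strip filler phrases and append specificity terms not already present."""
    query = raw.strip().rstrip("?")
    for phrase in FILLER_PHRASES:
        pattern = rf"\b{re.escape(phrase)}\b"
        query = re.sub(pattern, " ", query, flags=re.IGNORECASE)
    words = query.split()
    for term in extra_terms:
        if term.lower() not in (w.lower() for w in words):
            words.append(term)
    return " ".join(words)
```

For example, `sharpen_query("How do I deploy FastAPI on AWS?", ["tutorial"])` returns `"deploy FastAPI on AWS tutorial"`.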
### Step 2 — Search and Select

```
1. Run search_web with the sharpened query
2. Get results (titles + snippets)
3. Scan titles + snippets ONLY — do not fetch yet
4. Pick the TOP 1-2 most relevant results (max 3 in complex cases)
5. Skip results from: forums (if docs exist), aggregator blogs, paywalled sites
6. Prefer: official docs, GitHub repos, well-known tech blogs, academic sources
```

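Steps 4 to 6 amount to a filter followed by a stable re-ranking. The sketch below assumes each search result is a dictionary with `title`, `url`, and `snippet` keys. That shape is an assumption, not the documented output of `search_web`, and the host lists are supplied by the caller.

```python
from urllib.parse import urlparse


def _host_matches(host: str, domains) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def select_results(results, preferred_hosts, skipped_hosts, limit=2):
    """Drop skipped hosts, move preferred hosts first, keep search order otherwise."""
    ranked = []
    for rank, result in enumerate(results):
        host = urlparse(result["url"]).netloc.lower()
        if _host_matches(host, skipped_hosts):
            continue
        preferred = _host_matches(host, preferred_hosts)
        ranked.append((not preferred, rank, result))
    ranked.sort(key=lambda item: item[:2])  # preferred first, then original rank
    return [result for _, _, result in ranked[:limit]]
```

The default `limit=2` mirrors the "top 1-2" guidance; raising it to 3 covers the complex cases.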
### Step 3 — Fetch Selected Results

Apply the URL Protocol (above) to each selected URL.
Process results one at a time: only fetch the second URL if the first didn't answer the question.

### Token Rules for Search

- Never read more than 3 URLs per search query
- If the snippet already contains the answer → do NOT fetch the full page, use the snippet
- For factual questions (dates, names, simple facts) → snippet is usually enough
- For procedural questions (how to do X) → fetch 1 relevant page, targeted section only


---

## Multi-URL Protocol

Use when the user provides a list of URLs to compare or summarize.

```
1. Skim all URLs first (Layer 1 fetch for each)
2. Group by relevance to the user's question
3. Deep-fetch only the most relevant 1-3 URLs
4. Summarize each in 3-5 sentences before combining
5. Never dump raw content from multiple pages; always summarize per-source first
```


---

## File Protocol

Use when the URL points directly to a file (PDF, .txt, .md, .csv, etc.).

- `.md` / `.txt` / `.csv` → `read_url_content` works directly, read full content
- `.pdf` → Use `browser_subagent` or a PDF extraction tool; extract text only
- `.json` / `.yaml` → `read_url_content`, parse structure, summarize schema + key values
- Large files (>500 lines) → Read first 100 lines + last 20 lines + search for relevant sections

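The large-file rule can be sketched as a head, tail, and keyword selection. `excerpt_large_file` is a hypothetical helper, and the keyword match is a plain case-insensitive substring search standing in for "search for relevant sections".

```python
def excerpt_large_file(text, keyword=None, threshold=500, head=100, tail=20):
    """Return all lines of a small file, or head + tail + keyword hits of a large one."""
    lines = text.splitlines()
    if len(lines) <= threshold:
        return lines
    selected = set(range(head)) | set(range(len(lines) - tail, len(lines)))
    if keyword:
        needle = keyword.lower()
        selected |= {i for i, line in enumerate(lines) if needle in line.lower()}
    return [lines[i] for i in sorted(selected)]  # original order, no duplicates
```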

---

## Anti-Patterns (Never Do These)

| Anti-pattern | Why it's bad | Do this instead |
|---|---|---|
| Fetching full page for a simple fact | Wastes thousands of tokens | Use snippet or targeted anchor |
| Using browser_subagent for static sites | Very expensive | Use read_url_content first |
| Searching with the raw user query | Vague results | Sharpen query first |
| Fetching 5+ search results | Token explosion | Max 3, stop when answered |
| Dumping raw HTML into context | Noisy, wasteful | Always strip to Markdown |
| Fetching "just in case" | Unnecessary tokens | Only fetch what's needed to answer |
| Re-fetching the same URL | Redundant | Cache result in context, reuse |
| Fetching entire GitHub repo | Extremely wasteful | README + targeted files only |


---

## Decision Flowchart (Quick Reference)

```
Input received
│
├─ GitHub URL?
│   ├─ Fetch README + metadata via API
│   ├─ Answered? → STOP
│   ├─ Need more? → Fetch file tree, pick 1-3 files
│   └─ Still need more? → Fetch specific files only
│
├─ Specific URL?
│   ├─ Try read_url_content → skim headings
│   ├─ Answered? → STOP
│   ├─ Need more? → Targeted section fetch
│   ├─ Still need more? → Full fetch, stripped
│   └─ JS-rendered / broken? → browser_subagent (last resort)
│
├─ Topic/query?
│   ├─ Sharpen query
│   ├─ search_web → scan snippets
│   ├─ Snippet enough? → Answer from snippet, STOP
│   ├─ Need more? → Fetch top result (targeted)
│   └─ Still need more? → Fetch second result (targeted)
│
└─ List of URLs?
    ├─ Skim all (Layer 1 each)
    ├─ Deep fetch top 1-3 relevant ones
    └─ Summarize per-source, then combine
```


---

## Output Format Rules

After fetching, structure your response as:

```
Source: [URL or "Web search for: query"]
Summary: [2-5 sentences of what was found]
Answer: [Direct answer to user's question]
Confidence: [High / Medium / Low, based on source quality]
```

For multiple sources:
```
Source 1: ...
Source 2: ...
Combined Answer: ...
```

Never output:
- Raw HTML fragments
- Full page dumps
- Unattributed information
- More than needed to answer the question


---

## Token Budget Guide

| Operation | Approximate token cost | When to use |
|---|---|---|
| GitHub README fetch | ~300–800 tokens | Always first for repos |
| GitHub API metadata | ~200 tokens | Always for repos |
| Skim (headings only) | ~100–200 tokens | Always first for URLs |
| Targeted section fetch | ~300–600 tokens | When skim isn't enough |
| Full page fetch (stripped) | ~1000–2000 tokens | Only when targeted fails |
| browser_subagent | ~2000–5000 tokens | Last resort only |
| Search snippet scan | ~300–500 tokens | Always before fetching |

**Rule of thumb:** If you're about to spend >2000 tokens on a fetch, ask yourself if there's a cheaper path first.

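The budget table and rule of thumb can be encoded as a simple lookup. In this sketch the operation keys are hypothetical names, the limits are the upper bounds from the table, and `estimate_tokens` uses a rough 4-characters-per-token heuristic, which is an assumption; real tokenizers vary by model and by content.

```python
# Upper bounds from the Token Budget Guide table (tokens).
BUDGET = {
    "github_readme": 800,
    "github_metadata": 200,
    "skim": 200,
    "targeted": 600,
    "full_page": 2000,
    "browser_subagent": 5000,
    "search_snippets": 500,
}


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough estimate (assumed ratio); rounds up so short text is not zero."""
    return -(-len(text) // chars_per_token)


def needs_cheaper_path(operation: str, threshold: int = 2000) -> bool:
    """Rule of thumb: pause before any operation budgeted above the threshold."""
    return BUDGET[operation] > threshold


def fits_budget(text: str, operation: str) -> bool:
    """Check fetched text against the upper bound for its operation."""
    return estimate_tokens(text) <= BUDGET[operation]
```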

---

## Limitations

- **JavaScript Reliance**: Standard fetching may not fully render Single Page Applications (SPAs). You must fall back to `browser_subagent` for these, which is slower and more expensive.
- **Paywalls & Protections**: This skill cannot bypass CAPTCHAs, bot protections (e.g., strict Cloudflare rules), or hard paywalls.
- **GitHub API Limits**: Frequent GitHub API requests without authentication may hit rate limits (unauthenticated requests are limited to 60 per hour per IP address).