---
name: youtube-notetaker
description: "Turn YouTube talks into local study notes with slides, transcripts, editable annotations, and a markdown-backed viewer."
category: "video"
risk: "safe"
source: "official"
source_repo: "dair-ai/dair-academy-plugins"
source_type: "official"
date_added: "2026-06-19"
author: "DAIR.AI"
license: "MIT"
license_source: "https://github.com/dair-ai/dair-academy-plugins/blob/main/README.md#license"
tags:
  - dair-academy
  - ai
  - workflow
tools:
  - claude-code
  - codex-cli
  - cursor
---

# YouTube Notetaker

## When to Use

Use when the user wants to turn a YouTube talk into local study notes: slide snapshots, a
timestamped transcript, and editable annotations.

_Source: [dair-ai/dair-academy-plugins](https://github.com/dair-ai/dair-academy-plugins) (MIT)._

Build a personal library of YouTube talks you study with. Each video becomes one **plain
markdown file**: slide snapshots at their timestamps, a full timestamped transcript, and
editable notes. A small bundled server renders the library as an interactive deep-dive in the
browser. No database, no cloud service. Everything is files on disk you fully own.

## Architecture (read this first)

The **markdown library is the single source of truth**. The artifact is a thin HTML shell that
fetches from the server and writes notes back. Never hardcode video data into the HTML.

- **Library:** a plain folder, set by `VIDEO_LIBRARY_DIR` (default `~/video-deepdives/`).
  - One markdown file per video, **filename slug = YouTube id** (e.g. `RtywqDFBYnQ.md`).
  - Frontmatter holds video metadata + a `slides` array.
  - Body holds the full transcript as `[HH:MM:SS] text` lines.
  - `_media/` holds slide images, **namespaced per video** as `<youtube_id>-slide-NN.jpg`
    to avoid collisions between videos.
- **Server:** `scripts/serve.py`, a single stdlib + PyYAML file. Start it with:
  ```
  python3 scripts/serve.py --dir ~/video-deepdives --port 8000
  ```
  It serves the artifact at `/` and a small API the artifact talks to:
  - `GET /api/video-deepdives` (front page fetches this) lists every video.
  - `GET /api/video-deepdives/<id>` returns one video `{meta, body}`.
  - `GET /api/video-deepdives/_media/<file>` serves a slide image.
  - `PATCH /api/video-deepdives/<id>` with `{fields:{slides:[...]}}` writes notes back.
  - **It picks up new videos automatically** the moment a markdown file exists. Adding a video
    means writing a markdown file + media; you almost never touch the HTML.
  - The `/api/video-deepdives` URL namespace is local to the bundled server.
- **Artifact:** `reference/artifact.html`, served by `serve.py` at `/`. A clean reference copy;
  only rewrite it if the user wants a UI change. For new videos, leave it alone.

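The routes listed above can be exercised from a script as well as from the artifact. The following is a minimal sketch, not the artifact's own client: it assumes the server is running on port 8000 and that the PATCH body is sent as JSON with a `Content-Type: application/json` header. The helper names (`item_url`, `build_patch_request`, `fetch_video`) are hypothetical.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8000"  # assumption: serve.py was started with --port 8000


def item_url(video_id: str, base: str = BASE) -> str:
    """URL for one video, following the routes listed in this section."""
    return f"{base}/api/video-deepdives/{video_id}"


def build_patch_request(video_id: str, slides: list, base: str = BASE) -> urllib.request.Request:
    """Build (but do not send) the PATCH that writes the whole slides array back."""
    payload = json.dumps({"fields": {"slides": slides}}).encode("utf-8")
    return urllib.request.Request(
        item_url(video_id, base),
        data=payload,
        method="PATCH",
        headers={"Content-Type": "application/json"},  # assumption: JSON body
    )


def fetch_video(video_id: str, base: str = BASE) -> dict:
    """GET one video; this document describes the response shape as {meta, body}."""
    with urllib.request.urlopen(item_url(video_id, base)) as resp:
        return json.load(resp)
```

To send the write-back, pass the built request to `urllib.request.urlopen`. Because PATCH replaces the whole `slides` array, fetch the current slides first and send the full edited list rather than a single entry.
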
## Requirements

- `yt-dlp` and `ffmpeg` on PATH (download + frame/scene extraction).
- Python 3 with `Pillow` (contact sheet) and `PyYAML` (markdown file + server).
  ```
  pip install yt-dlp pillow pyyaml   # ffmpeg via your package manager
  ```

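A quick preflight check can confirm these requirements before starting the pipeline. This is a minimal sketch using only the standard library; `missing_requirements` is a hypothetical helper, not one of the bundled scripts. Note that Pillow and PyYAML are imported as `PIL` and `yaml`.

```python
import importlib.util
import shutil

REQUIRED_TOOLS = ("yt-dlp", "ffmpeg")   # must be on PATH
REQUIRED_MODULES = ("PIL", "yaml")      # import names for Pillow and PyYAML


def missing_requirements(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES) -> list:
    """Return the names of command-line tools and Python modules that cannot be found."""
    missing = [tool for tool in tools if shutil.which(tool) is None]
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing
```

An empty list means everything was found; otherwise the list names what to install.
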
## Adding a video: the pipeline

All helper scripts are in `scripts/`. Work in a scratch dir (e.g. `/tmp/ytnote-<id>/`), then
copy final assets into the library. Set `VIDEO_LIBRARY_DIR` once per shell if you don't want the
default. **Do not use em dashes (—) or arrows (→) in notes/titles.**

### 1. Resolve the id and check embeddability
```
scripts/setup.sh "<youtube_url_or_id>"
```
Prints the 11-char `YTID`, the scratch dir, the target library path, and whether YouTube
**embedding is allowed** (oembed 200) or **blocked** (oembed 401, e.g. some university talks).
If blocked, inline playback won't work but the artifact degrades gracefully to an "open at this
moment on YouTube" link, so proceed normally.

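For readers who want to see what this step amounts to, here is a rough sketch of resolving the id and probing YouTube's oEmbed endpoint. It is not the contents of `setup.sh`, which this document does not show; the function names and the URL patterns handled by the regex are assumptions.

```python
import re
import urllib.error
import urllib.parse
import urllib.request

_ID = r"[A-Za-z0-9_-]{11}"  # the 11-char YouTube id


def extract_youtube_id(url_or_id: str) -> str:
    """Accept a bare id or a common YouTube URL form and return the id."""
    if re.fullmatch(_ID, url_or_id):
        return url_or_id
    match = re.search(rf"(?:v=|youtu\.be/|/embed/|/shorts/)({_ID})", url_or_id)
    if not match:
        raise ValueError(f"no YouTube id found in {url_or_id!r}")
    return match.group(1)


def oembed_url(video_id: str) -> str:
    """Build the oEmbed lookup URL for a video."""
    watch = f"https://www.youtube.com/watch?v={video_id}"
    query = urllib.parse.urlencode({"url": watch, "format": "json"})
    return "https://www.youtube.com/oembed?" + query


def oembed_status(video_id: str) -> int:
    """Return the HTTP status of the oEmbed lookup (200 allowed, 401 blocked)."""
    try:
        with urllib.request.urlopen(oembed_url(video_id)) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

`oembed_status` needs network access; the other two helpers are pure string handling.
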
### 2. Download video + subtitles
```
scripts/download.sh "<YTID>" /tmp/ytnote-<YTID>
```
Uses `yt-dlp` to grab the video (≤720p is plenty for slide frames) and the best available
subtitles (manual if present, else auto-captions) as `.vtt`. Also fetches title/uploader.

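As an illustration of the kind of `yt-dlp` invocation this step describes, the sketch below builds an argument list that caps video height at 720p and requests English subtitles in VTT. Whether `download.sh` uses these exact options is an assumption; the format selector, the `en` language choice, and the `ytdlp_args` helper are illustrative only.

```python
from pathlib import Path


def ytdlp_args(video_id: str, scratch: str) -> list:
    """Build a yt-dlp command line for one video (illustrative, not download.sh itself)."""
    output_template = str(Path(scratch) / "video.%(ext)s")
    return [
        "yt-dlp",
        "-f", "bestvideo[height<=720]+bestaudio/best[height<=720]",  # cap at 720p
        "--merge-output-format", "mp4",
        "--write-subs",        # manual subtitles, if the uploader provided any
        "--write-auto-subs",   # auto-generated captions
        "--sub-langs", "en",   # assumption: English talk
        "--sub-format", "vtt",
        "-o", output_template,
        f"https://www.youtube.com/watch?v={video_id}",
    ]
```

The list can be passed to `subprocess.run(..., check=True)`, which avoids shell quoting issues with the format selector.
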
### 3. Detect candidate slide timestamps
```
scripts/detect_slides.sh /tmp/ytnote-<YTID>/video.mp4 /tmp/ytnote-<YTID>
```
Runs ffmpeg scene detection (`select='gt(scene,0.3)'`) and writes `scene_times.txt` (seconds).
0.3 is a good default; lower it (0.2) for subtle slide decks, raise it (0.4) for busy video.

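One common way to turn the `select` filter into timestamps is to chain ffmpeg's `showinfo` filter and read `pts_time` values from the log that ffmpeg writes to stderr. The sketch below assumes that approach; `detect_slides.sh` may do it differently, and the helper names are hypothetical.

```python
import re


def scene_filter(threshold: float = 0.3) -> str:
    """Filter string: keep frames whose scene-change score exceeds the threshold."""
    return f"select='gt(scene,{threshold})',showinfo"


def ffmpeg_scene_args(video: str, threshold: float = 0.3) -> list:
    """ffmpeg command that decodes the video and logs selected frames, writing no output file."""
    return ["ffmpeg", "-hide_banner", "-i", video,
            "-vf", scene_filter(threshold), "-f", "null", "-"]


_PTS = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_scene_times(ffmpeg_stderr: str) -> list:
    """Pull the pts_time value (seconds) out of every showinfo log line."""
    return [float(m.group(1)) for m in _PTS.finditer(ffmpeg_stderr)]
```

Run the command with `subprocess.run(args, capture_output=True, text=True)` and pass its `stderr` to `parse_scene_times`; writing one value per line gives a file in the same shape as `scene_times.txt`.
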
### 4. Build a contact sheet and CURATE
```
python3 scripts/contact_sheet.py /tmp/ytnote-<YTID>/video.mp4 /tmp/ytnote-<YTID>/scene_times.txt /tmp/ytnote-<YTID>/contact.jpg
```
Read `contact.jpg` (labeled with index + timestamp). **This is the human-judgment step:** keep
frames that are real content slides; **drop talking-head shots, transitions, duplicates, and
blurry mid-animation frames.** Save the kept timestamps (seconds) to `/tmp/ytnote-<YTID>/keep.txt`,
one per line. A typical talk yields 15-25 slides.

### 5. Extract the curated slides at full quality and install to _media
```
python3 scripts/extract_slides.py <YTID> /tmp/ytnote-<YTID>/video.mp4 /tmp/ytnote-<YTID>/keep.txt > /tmp/ytnote-<YTID>/slides.json
```
Extracts each kept timestamp at 1280px wide, JPEG, and copies them into
`$VIDEO_LIBRARY_DIR/_media/` as `<YTID>-slide-01.jpg`, `-02.jpg`, … (numbered in time order).
Progress goes to stderr; a clean `slides.json` scaffold prints to **stdout**, so redirect it to a
file as shown, then fill in `title` and `note`.

Tip: talks are often a slide + speaker-cam composite, and speakers flip back and forth, so the
same slide appears at several timestamps. Keep the cleanest instance of each, and re-anchor each
slide's `t` to where it is actually discussed in the transcript (better "play from here" UX).

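The scaffold's shape follows from the slide fields documented in the reference section of this file (`idx`, `t`, `mmss`, `title`, `note`, `img`). The sketch below shows one way such a scaffold could be built from the kept timestamps; it does not extract any images, and the exact output of `extract_slides.py` may differ. The `mmss` rule for talks longer than an hour (minutes simply run past 59) is an assumption.

```python
def mmss(seconds: float) -> str:
    """Display label such as 00:55; minutes run past 59 for long talks (assumption)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def scaffold(video_id: str, keep_times: list) -> list:
    """One entry per kept timestamp, numbered in time order, with empty title and note."""
    slides = []
    for number, t in enumerate(sorted(keep_times), start=1):
        slides.append({
            "idx": number,
            "t": t,
            "mmss": mmss(t),
            "title": "",
            "note": "",
            "img": f"/api/video-deepdives/_media/{video_id}-slide-{number:02d}.jpg",
        })
    return slides
```

Serializing the result with `json.dumps(scaffold(...), indent=2)` gives a file that can be edited by hand before assembly.
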
### 6. Build the transcript
```
python3 scripts/vtt_to_transcript.py /tmp/ytnote-<YTID>/*.vtt /tmp/ytnote-<YTID>/transcript.txt
```
Parses the VTT into clean, de-duplicated `[HH:MM:SS] text` lines (YouTube auto-captions repeat
rolling text; the script collapses it). This becomes the markdown body.

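To make the rolling-caption problem concrete: in YouTube auto-captions, each cue typically repeats the previous line and then adds a new one, with inline timing tags between words. A simplified sketch of the collapse is shown below. It is not the bundled parser; it assumes `HH:MM:SS.mmm` cue timings and no cue identifier lines, and it only drops a line when it is identical to the last line emitted.

```python
import re

_CUE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.\d{3}\s+-->")  # cue timing line
_TAG = re.compile(r"<[^>]+>")                               # inline tags like <c> or <00:00:08.500>


def vtt_to_lines(vtt: str) -> list:
    """Convert VTT text to de-duplicated '[HH:MM:SS] text' lines."""
    out, stamp, last = [], None, None
    for raw in vtt.splitlines():
        cue = _CUE.match(raw.strip())
        if cue:
            stamp = f"[{cue.group(1)}:{cue.group(2)}:{cue.group(3)}]"
            continue
        text = _TAG.sub("", raw).strip()
        if stamp is None or not text:
            continue  # header lines before the first cue, or blank separators
        if text != last:  # skip the repeated line carried over from the previous cue
            out.append(f"{stamp} {text}")
            last = text
    return out
```

Each surviving line keeps the start time of the cue in which its text first appeared.
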
### 7. Write notes and assemble the markdown file
For each kept slide, write a 1-3 sentence `note` grounded in the transcript around that timestamp
(don't invent claims). Then assemble:
```
python3 scripts/write_library_item.py \
  --id <YTID> \
  --title "Talk title" \
  --speaker "Name, Role, Org" \
  --tags tag1,tag2,tag3 \
  --slides /tmp/ytnote-<YTID>/slides.json \
  --transcript /tmp/ytnote-<YTID>/transcript.txt
```
Writes `$VIDEO_LIBRARY_DIR/<YTID>.md` with correct frontmatter + body.

### 8. Serve and verify (always do this)
```
python3 scripts/serve.py --dir "$VIDEO_LIBRARY_DIR" --port 8000 &
scripts/verify.sh <YTID>     # defaults to http://127.0.0.1:8000
```
`verify.sh` curls the collection list, the item, the first slide image, and the artifact,
asserting HTTP 200 and that the new id appears in the index. Then open
`http://127.0.0.1:8000/#/<YTID>` in a browser to confirm slides + transcript + notes render.

## Markdown file shape (reference)

```markdown
---
id: RtywqDFBYnQ
title: Memory and dreaming for self-learning agents
youtube_id: RtywqDFBYnQ
speaker: Mahesh, Product Manager, Platform team at Anthropic
source_url: https://www.youtube.com/watch?v=RtywqDFBYnQ
slide_count: 19
created: '2026-05-25'
tags: [anthropic, memory, agents]
slides:
  - idx: 1
    t: 55.7          # seconds (float ok), used for seeking
    mmss: 00:55      # display label
    title: Agent primitives have evolved
    note: One to three sentences grounded in the transcript at this timestamp.
    img: /api/video-deepdives/_media/RtywqDFBYnQ-slide-01.jpg
  # ... more slides
---
## Transcript
[00:00:08] Hello, everyone...
[00:00:11] ...
```

Notes:
- `idx` can be sparse/non-contiguous; the artifact sorts slides by `t`, so ordering is by
  timestamp, not idx.
- `img` is always a `/api/video-deepdives/_media/<file>` URL (served by serve.py),
  never base64.
- Slide `note` is what the user edits in the UI; PATCH writes the whole `slides` array back.

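Because the file is plain markdown, it can be read without the server. The sketch below splits the frontmatter from the body, turns `[HH:MM:SS] text` lines into `(seconds, text)` pairs, and orders slides by `t` as the notes above describe. It uses only the standard library and leaves the YAML text unparsed (pass it to PyYAML's `yaml.safe_load` to get the metadata). All three function names are hypothetical.

```python
import re

_LINE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]\s*(.*)")


def split_frontmatter(text: str) -> tuple:
    """Return (yaml_text, body) for a file that starts with a --- delimited block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with frontmatter")
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        raise ValueError("frontmatter is not closed")
    return "\n".join(lines[1:end]), "\n".join(lines[end + 1:])


def transcript_entries(body: str) -> list:
    """Parse '[HH:MM:SS] text' lines into (seconds, text); other lines are ignored."""
    entries = []
    for line in body.splitlines():
        match = _LINE.match(line.strip())
        if match:
            hours, minutes, seconds, text = match.groups()
            entries.append((int(hours) * 3600 + int(minutes) * 60 + int(seconds), text))
    return entries


def in_play_order(slides: list) -> list:
    """Order slides by timestamp; idx may be sparse and is not used for ordering."""
    return sorted(slides, key=lambda slide: slide["t"])
```
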
## Gotchas
- **Embedding disabled** (oembed 401): inline player is blocked by the video owner. Not a bug;
  the artifact shows an "open at this moment on YouTube" link instead. Mention it to the user.
- **Image collisions:** always namespace media `<YTID>-slide-NN.jpg`. Never reuse bare
  `slide-NN.jpg` for a new video.
- **Auto-caption noise:** rolling YouTube captions duplicate text across cues; use the provided
  VTT parser, don't dump raw VTT into the body.
- **Don't touch existing videos** when adding a new one. Each video is an independent file.
- **Server not picking up a video:** confirm the `.md` file is directly inside `--dir` (not a
  subfolder) and the filename is `<YTID>.md`.

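The last gotcha can be checked mechanically. The sketch below walks a library folder and reports markdown files that sit in a subfolder or whose name is not an 11-character id followed by `.md`. It is a hypothetical diagnostic, not part of the bundled scripts, and it assumes `_media/` should be skipped.

```python
import re
from pathlib import Path

_ID_FILE = re.compile(r"[A-Za-z0-9_-]{11}\.md")  # <YTID>.md


def misplaced_items(library_dir: str) -> list:
    """List markdown files the server would likely not pick up, with the reason."""
    root = Path(library_dir)
    problems = []
    for path in sorted(root.rglob("*.md")):
        rel = path.relative_to(root)
        if rel.parts[0] == "_media":
            continue  # media folder is not scanned for videos (assumption)
        if len(rel.parts) > 1:
            problems.append(f"{rel.as_posix()}: nested in a subfolder")
        elif not _ID_FILE.fullmatch(path.name):
            problems.append(f"{rel.as_posix()}: filename is not <YTID>.md")
    return problems
```

An empty result means every markdown file is at the top level of the library and named after its id.
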
## What makes this portable
- **No orchestrator / no database.** Storage is a plain folder of markdown + images.
- **One env var** (`VIDEO_LIBRARY_DIR`) controls where the library lives.
- **One small server file** (`serve.py`, stdlib + PyYAML) renders everything and handles
  note write-back. Drop it anywhere Python runs.
- The markdown files are portable: readable in Obsidian or any editor, and the frontmatter is
  standard YAML.

## Limitations

- Requires the upstream tool, account, API key, or local setup when the workflow names one.
- Does not authorize destructive, production, paid, or external-message actions without explicit user approval.
- Validate generated artifacts or recommendations against the user's real sources before treating them as final.