---
name: lookdev-auto
description: "Automated visual tuning: a vision or video model rates rendered variants in a loop. Render several labeled variants into one artifact, ask the model to rate them and suggest better values, render the suggestions, ask it to pick the best, repeat until good — the model is the eye, you run the loop."
risk: safe
source: community
source_type: community
source_repo: connerkward/lookdev-auto-skill
date_added: "2026-06-16"
author: Conner K Ward
license: MIT
tags:
- visual-eval
- vision-model
- tuning
- automation
- render-loop
tools:
- claude-code
- antigravity
- cursor
- gemini-cli
- codex-cli
---
## When to Use
Use whenever "looks/feels right" is the success criterion and there's no cheap numeric metric: animation easing/timing, zoom/camera feel, color grade, layout/spacing, design params, render/encoder settings, prompt params. This is the automated counterpart to lookdev, for when there's no human to sit in the loop.
_Source: [connerkward/lookdev-auto-skill](https://github.com/connerkward/lookdev-auto-skill) (MIT)._
# Visual eval loop — let a vision/video model tune what only an eye can judge
When the target is "does this LOOK/FEEL right" (not a number you can minimize), a
vision model (image) or video-understanding model (motion/timing) can be the judge in
a tight optimization loop. Worked reference: `iteration.py` in the `screenstudio-alternative`
skill, which tuned zoom-animation feel via `fal-ai/video-understanding`.
## The loop
1. **Render N labeled variants into ONE artifact.** Vary the parameter(s) across a
small spread. **Annotate each variant's params ON the artifact** (burn the label in:
"A · 2.2Hz · ζ0.5"). Images → a labeled grid/contact sheet. Video/motion → a
labeled *sequence* (label card or burned-in overlay before/over each clip) so the
model can compare temporally.
2. **One model call, structured output.** Send the single artifact with an explicit
rubric (define what "good" means — and what "too much"/"too little" look like).
Ask for **per-variant ratings + concrete suggested new values as JSON**:
`{"ratings":{"A":n,...},"best_so_far":"X","suggest":[[p1,p2],...]}`.
3. **Coarse → fine.** Round 1 = wide spread to locate the region. Round 2 = render the
model's suggestions (+ carry the current best) into one artifact; ask it to **pick
the single best**. Usually converges in **2 rounds**.
4. **Stop when sufficient** — best rates high and suggestions cluster. Apply the winner.
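
The four steps above can be sketched as a small driver. This is a minimal illustration, not the code from the reference skill: `render_artifact` and `ask_model` are hypothetical callables you would supply (the first renders all labeled variants into one file and returns its path, the second sends that file plus the rubric to your judge model and returns the parsed JSON reply).

```python
import string
from typing import Callable, Dict, Optional, Sequence, Tuple

Params = Tuple[float, ...]


def label_variants(variants: Sequence[Params]) -> Dict[str, Params]:
    """Assign letters A, B, C... so each variant's label can be burned onto the artifact."""
    return {string.ascii_uppercase[i]: tuple(v) for i, v in enumerate(variants)}


def tune(
    initial: Sequence[Params],
    render_artifact: Callable[[Dict[str, Params]], str],  # hypothetical renderer
    ask_model: Callable[[str, str], dict],                # hypothetical judge call
    rubric: str,
    max_rounds: int = 2,
) -> Optional[Params]:
    """Run coarse-to-fine rounds and return the params of the last round's best variant."""
    variants = list(initial)
    best: Optional[Params] = None
    for _ in range(max_rounds):
        labeled = label_variants(variants)
        artifact = render_artifact(labeled)   # one artifact per round
        reply = ask_model(artifact, rubric)   # one model call per round
        best = labeled[reply["best_so_far"]]
        suggestions = [tuple(s) for s in reply.get("suggest", [])]
        if not suggestions:
            break  # nothing new to try
        # Next round: carry the current best alongside the model's suggestions.
        variants = [best] + [s for s in suggestions if s != best]
    return best
```

Injecting the renderer and the judge as callables keeps the loop testable with fakes, and makes it easy to swap an image model for a video model without touching the control flow.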
## Token / quality / step reductions (do these)
- **One artifact per round, not one call per variant.** The biggest saver — a 6-variant
round is 1 upload + 1 inference, not 6. Montage/grid beats a loop of single calls.
- **Burn params onto the artifact.** The model sees label+result together → no separate
"variant A used X" context to carry → fewer tokens, fewer mistakes.
- **Structured JSON out + parse.** No re-asking, no free-text wrangling. Prompt "return
ONLY JSON"; regex the first `{...}`.
- **Short representative sample.** Tune on a 3-5s clip / one frame / one component, not
the whole asset. Cheaper render, smaller upload, faster inference. Apply the found
params to the full render once.
- **Cap variants at ~5-6.** More doesn't improve the model's discrimination and multiplies
render + token cost. Wide-but-sparse round 1, narrow round 2.
- **Calibration anchors.** Include one deliberately-bad and one safe-default variant as
fixed anchors each round — gives the model a reference scale and exposes when its
"best" is worse than the safe default (catch a bad recommendation early).
- **Independent rubric, stated up front.** Define "good" concretely in the prompt
(smooth, subtle settle, not bouncy, not sluggish). Don't ask "which do you like" —
that lets it echo your framing. A held-out criterion keeps the judge honest
(see verify-outputs-rule: the check must be independent of what you tuned).
- **Reuse renders across rounds.** Carry the round-1 winner's clip into round 2 instead
of re-rendering it.
- **Early-exit.** If the round-1 top rating is ≥9/10 and the suggestions are all within a
small delta of each other, skip round 2.
- **Cheapest judge that can see the failure.** Frames fed through an image VLM can judge
spatial things (layout, color, crop); only reach for a true *video* model when the
thing being judged is **temporal** (easing, timing, motion smoothness), since those are
invisible in stills.
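
Three of the reductions above (JSON parsing, early-exit, and the safe-default anchor check) are mechanical enough to sketch in code. The function names, the 9/10 threshold default, and the 10% relative-spread default are illustrative assumptions, not values taken from the reference skill.

```python
import json
import re


def parse_reply(text: str) -> dict:
    """Pull the JSON object out of a model reply that may be wrapped in prose."""
    # Greedy match: first "{" through last "}", so nested objects stay intact.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    reply = json.loads(match.group(0))
    for key in ("ratings", "best_so_far", "suggest"):
        if key not in reply:
            raise ValueError(f"model reply is missing key: {key}")
    return reply


def should_skip_round_two(reply: dict, top_threshold: float = 9.0,
                          max_rel_spread: float = 0.1) -> bool:
    """Early-exit: top rating is high AND the suggestions cluster tightly."""
    top = max(reply["ratings"].values())
    suggestions = reply["suggest"]
    if top < top_threshold or len(suggestions) < 2:
        return False
    # Check each parameter column: relative spread must stay under the limit.
    for column in zip(*suggestions):
        lo, hi = min(column), max(column)
        scale = max(abs(lo), abs(hi)) or 1.0
        if (hi - lo) / scale > max_rel_spread:
            return False
    return True


def beats_safe_default(reply: dict, safe_label: str) -> bool:
    """Anchor check: the model's best must rate strictly above the safe-default anchor."""
    ratings = reply["ratings"]
    return ratings[reply["best_so_far"]] > ratings[safe_label]
```

If `beats_safe_default` returns `False`, treat the round as a bad recommendation and keep the safe default rather than applying the model's pick.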
## When NOT to use it
- A real numeric metric exists and correlates with quality → optimize that directly;
don't pay a model per step.
- The judgment is subjective-to-the-user (their taste, brand) → show them the variants
and let them pick; a model's "best" isn't their best. (This is why the screen-studio
spring auto-tune was dropped — the model's pick didn't match the owner's eye.)
- One or two variants → just look yourself.
## Caveats (learned)
- The model's pick is an *opinion*, not ground truth — anchor it, and sanity-check the
winner against the safe default yourself before committing.
- Vision/video models perceive gross differences well, fine ones poorly — keep variant
spacing perceptible; near-identical variants get noise-rated.
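
One way to keep variant spacing perceptible is to generate round-1 values on a geometric scale and to drop candidates that sit too close to a neighbor before rendering. The sketch below assumes a single positive-valued parameter; the helper names and the 10% minimum relative step are hypothetical and should be tuned to what your judge model can actually tell apart.

```python
from typing import List, Sequence


def geometric_spread(low: float, high: float, count: int) -> List[float]:
    """Wide-but-sparse round-1 values: each step multiplies by a constant ratio."""
    if count < 2 or low <= 0 or high <= low:
        raise ValueError("need count >= 2 and 0 < low < high")
    ratio = (high / low) ** (1.0 / (count - 1))
    return [low * ratio ** i for i in range(count)]


def drop_near_duplicates(values: Sequence[float], min_rel_step: float = 0.1) -> List[float]:
    """Keep only values at least min_rel_step (relative) above the last kept value."""
    kept: List[float] = []
    for value in sorted(values):
        if not kept:
            kept.append(value)
            continue
        scale = max(abs(value), abs(kept[-1])) or 1.0
        if (value - kept[-1]) / scale >= min_rel_step:
            kept.append(value)
    return kept
```

For example, `drop_near_duplicates([2.0, 2.05, 2.5, 2.52, 3.0])` keeps `2.0`, `2.5`, and `3.0`, which avoids spending a render slot on pairs the model would likely rate as noise.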
## Limitations
- Model ratings are probabilistic aesthetic judgments, not objective truth; keep a human review step for brand-critical or subjective work.
- Automated rounds can become expensive or slow when renders are heavy or many variants are explored.
- This skill needs screenshots, frames, or clips that expose the quality difference; it is weak for subtle motion, audio, copy nuance, or user-preference calls.