---
name: lookdev-auto
description: "Automated visual tuning: a vision or video model rates rendered variants in a loop. Render several labeled variants into one artifact, ask the model to rate them and suggest better values, render the suggestions, ask it to pick the best, repeat until good. The model is the eye; you run the loop."
risk: safe
source: community
source_type: community
source_repo: connerkward/lookdev-auto-skill
date_added: "2026-06-16"
author: Conner K Ward
license: MIT
tags:
  - visual-eval
  - vision-model
  - tuning
  - automation
  - render-loop
tools:
  - claude-code
  - antigravity
  - cursor
  - gemini-cli
  - codex-cli
---

## When to Use

Use whenever "looks/feels right" is the success criterion and there's no cheap numeric metric: animation easing/timing, zoom/camera feel, color grade, layout/spacing, design params, render/encoder settings, prompt params. This is the automated counterpart to lookdev; use it when there's no human to sit in the loop.

_Source: [connerkward/lookdev-auto-skill](https://github.com/connerkward/lookdev-auto-skill) (MIT)._

# Visual eval loop: let a vision/video model tune what only an eye can judge

When the target is "does this LOOK/FEEL right" (not a number you can minimize), a vision model (image) or video-understanding model (motion/timing) can be the judge in a tight optimize loop. Worked reference: the `screenstudio-alternative` skill (`iteration.py`), which tuned zoom-animation feel via `fal-ai/video-understanding`.

## The loop

1. **Render N labeled variants into ONE artifact.** Vary the parameter(s) across a small spread. **Annotate each variant's params ON the artifact** (burn the label in: "A · 2.2Hz · ζ0.5"). Images → a labeled grid/contact sheet. Video/motion → a labeled *sequence* (label card or burned-in overlay before/over each clip) so the model can compare temporally.
2. **One model call, structured output.** Send the single artifact with an explicit rubric (define what "good" means, and what "too much"/"too little" look like).
Ask for **per-variant ratings + concrete suggested new values as JSON**: `{"ratings":{"A":n,...},"best_so_far":"X","suggest":[[p1,p2],...]}`.
3. **Coarse → fine.** Round 1 = wide spread to locate the region. Round 2 = render the model's suggestions (+ carry the current best) into one artifact; ask it to **pick the single best**. Usually converges in **2 rounds**.
4. **Stop when sufficient:** the best rates high and the suggestions cluster. Apply the winner.

## Token / quality / step reductions (do these)

- **One artifact per round, not one call per variant.** The biggest saver: a 6-variant round is 1 upload + 1 inference, not 6. Montage/grid beats a loop of single calls.
- **Burn params onto the artifact.** The model sees label+result together → no separate "variant A used X" context to carry → fewer tokens, fewer mistakes.
- **Structured JSON out + parse.** No re-asking, no free-text wrangling. Prompt "return ONLY JSON"; regex the first `{...}`.
- **Short representative sample.** Tune on a 3-5s clip / one frame / one component, not the whole asset. Cheaper render, smaller upload, faster inference. Apply the found params to the full render once.
- **Cap variants at ~5-6.** More doesn't improve the model's discrimination and multiplies render + token cost. Wide-but-sparse round 1, narrow round 2.
- **Calibration anchors.** Include one deliberately-bad and one safe-default variant as fixed anchors each round. This gives the model a reference scale and exposes when its "best" is worse than the safe default (catching a bad recommendation early).
- **Independent rubric, stated up front.** Define "good" concretely in the prompt (smooth, subtle settle, not bouncy, not sluggish). Don't ask "which do you like"; that lets it echo your framing. A held-out criterion keeps the judge honest (see verify-outputs-rule: the check must be independent of what you tuned).
- **Reuse renders across rounds.** Carry the round-1 winner's clip into round 2 instead of re-rendering it.
- **Early-exit.** If the round-1 top rating is ≥9/10 and the suggestions are within a small delta of each other, skip round 2.
- **Cheapest judge that can see the failure.** Frames passed through an image VLM can judge spatial things (layout, color, crop); only reach for a true *video* model when the thing being judged is **temporal** (easing, timing, motion smoothness), since those are invisible in stills.

## When NOT to use it

- A real numeric metric exists and correlates with quality → optimize that directly; don't pay a model per step.
- The judgment is subjective to the user (their taste, brand) → show them the variants and let them pick; a model's "best" isn't their best. (This is why the screen-studio spring auto-tune was dropped: the model's pick didn't match the owner's eye.)
- One or two variants → just look yourself.

## Caveats (learned)

- The model's pick is an *opinion*, not ground truth. Anchor it, and sanity-check the winner against the safe default yourself before committing.
- Vision/video models perceive gross differences well, fine ones poorly. Keep variant spacing perceptible; near-identical variants get noise-rated.

## Limitations

- Model ratings are probabilistic aesthetic judgments, not objective truth; keep a human review step for brand-critical or subjective work.
- Automated rounds can become expensive or slow when renders are heavy or many variants are explored.
- This skill needs screenshots, frames, or clips that expose the quality difference; it is weak for subtle motion, audio, copy nuance, or user-preference calls.
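
A minimal sketch of the reply handling described above: parsing the structured JSON, the early-exit test, and the calibration-anchor check. It assumes the reply follows the `{"ratings":..., "best_so_far":..., "suggest":...}` shape from the loop section. The function names, the 10% relative-spread threshold, and the use of `json.JSONDecoder.raw_decode` from the first `{` (in place of a plain regex, so nested braces and trailing prose are tolerated) are illustrative assumptions, not part of the original skill. Rendering and the model call itself are left out.

```python
import json


def parse_judge_reply(text):
    """Parse the first JSON object found in the model's reply text."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    # raw_decode reads one JSON value and ignores whatever follows it.
    obj, _end = json.JSONDecoder().raw_decode(text[start:])
    return obj


def should_stop(reply, min_score=9, max_rel_delta=0.1):
    """Early-exit: top rating is high AND suggested values cluster.

    max_rel_delta is a hypothetical threshold: the allowed spread of each
    parameter across suggestions, relative to its largest magnitude.
    """
    top = max(reply["ratings"].values())
    suggestions = reply["suggest"]
    if top < min_score or not suggestions:
        return False
    # zip(*suggestions) yields one tuple per parameter (column-wise).
    for column in zip(*suggestions):
        spread = max(column) - min(column)
        scale = max(abs(v) for v in column) or 1.0
        if spread / scale > max_rel_delta:
            return False
    return True


def best_is_worse_than_default(reply, safe_label):
    """Anchor check: flag a 'best' pick that rates below the safe default."""
    ratings = reply["ratings"]
    return ratings[reply["best_so_far"]] < ratings[safe_label]
```

If `should_stop` returns true and `best_is_worse_than_default` returns false, the winner can be applied to the full render; otherwise render the suggestions (plus the current best) as the next round's artifact.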