---
name: lookdev-auto
description: "Automated visual tuning: a vision or video model rates rendered variants in a loop. Render several labeled variants into one artifact, ask the model to rate them and suggest better values, render the suggestions, ask it to pick the best, repeat until good: the model is the eye, you run the loop."
risk: safe
source: community
source_type: community
source_repo: connerkward/lookdev-auto-skill
date_added: "2026-06-16"
author: Conner K Ward
license: MIT
tags:
- visual-eval
- vision-model
- tuning
- automation
- render-loop
tools:
- claude-code
- antigravity
- cursor
- gemini-cli
- codex-cli
---
## When to Use

Use whenever "looks/feels right" is the success criterion and there's no cheap numeric metric: animation easing/timing, zoom/camera feel, color grade, layout/spacing, design params, render/encoder settings, prompt params. It is the automated counterpart to lookdev, for when there is no human to sit in the loop.

_Source: [connerkward/lookdev-auto-skill](https://github.com/connerkward/lookdev-auto-skill) (MIT)._

# Visual eval loop: let a vision/video model tune what only an eye can judge

When the target is "does this LOOK/FEEL right" (not a number you can minimize), a
vision model (image) or video-understanding model (motion/timing) can be the judge in
a tight optimize loop. Worked reference: `iteration.py` in the `screenstudio-alternative`
skill, which tuned zoom-animation feel via `fal-ai/video-understanding`.

## The loop

1. **Render N labeled variants into ONE artifact.** Vary the parameter(s) across a
   small spread. **Annotate each variant's params ON the artifact** (burn the label in:
   "A · 2.2Hz · ζ0.5"). Images → a labeled grid/contact sheet. Video/motion → a
   labeled *sequence* (label card or burned-in overlay before/over each clip) so the
   model can compare temporally.
2. **One model call, structured output.** Send the single artifact with an explicit
   rubric (define what "good" means, and what "too much"/"too little" look like).
   Ask for **per-variant ratings + concrete suggested new values as JSON**:
   `{"ratings":{"A":n,...},"best_so_far":"X","suggest":[[p1,p2],...]}`.
3. **Coarse → fine.** Round 1 = wide spread to locate the region. Round 2 = render the
   model's suggestions (+ carry the current best) into one artifact; ask it to **pick
   the single best**. Usually converges in **2 rounds**.
4. **Stop when sufficient**: best rates high and suggestions cluster. Apply the winner.
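
A minimal sketch of the loop above, assuming two hypothetical callables that this skill does not define: `render(labeled)` returns one artifact containing every labeled variant, and `judge(artifact, prompt)` calls whatever vision/video model you use and returns the reply already parsed into a dict. `label_variants`, `build_prompt`, and `tune` are illustrative names, not taken from the referenced `iteration.py`.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Params = Tuple[float, ...]
Labeled = Dict[str, Params]


def label_variants(variants: Sequence[Params]) -> Labeled:
    """Assign labels A, B, C, ... so ratings map back to parameter sets."""
    return {chr(ord("A") + i): tuple(p) for i, p in enumerate(variants)}


def build_prompt(rubric: str, labels: Sequence[str], n_suggest: int = 3) -> str:
    """State the rubric first, then the exact JSON shape the reply must follow."""
    return (
        f"Rubric: {rubric}\n"
        f"The artifact shows variants {', '.join(labels)}, "
        "each with its params burned in.\n"
        "Rate each variant 1-10 against the rubric and suggest "
        f"{n_suggest} better parameter sets.\n"
        "Return ONLY JSON: "
        '{"ratings":{"A":n,...},"best_so_far":"X","suggest":[[p1,p2],...]}'
    )


def tune(
    initial: Sequence[Params],
    render: Callable[[Labeled], Any],            # hypothetical renderer
    judge: Callable[[Any, str], Dict[str, Any]],  # hypothetical model call
    rubric: str,
    max_rounds: int = 2,
) -> Params:
    """Coarse-to-fine: judge a spread, then judge the suggestions plus the best."""
    variants: List[Params] = [tuple(p) for p in initial]
    best = variants[0]
    for _ in range(max_rounds):
        labeled = label_variants(variants)
        artifact = render(labeled)  # one artifact per round, all variants in it
        reply = judge(artifact, build_prompt(rubric, list(labeled)))
        best = labeled[reply["best_so_far"]]
        suggestions = [tuple(s) for s in reply["suggest"]]
        # Next round: carry the current best, then the model's suggestions.
        variants = [best] + [s for s in suggestions if s != best]
    return best
```

Here every round reuses the same rating prompt and reads `best_so_far`; swapping in a shorter "pick the single best" prompt for the final round would be a small change to `tune`.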

## Token / quality / step reductions (do these)

- **One artifact per round, not one call per variant.** The biggest saver: a 6-variant
  round is 1 upload + 1 inference, not 6. Montage/grid beats a loop of single calls.
- **Burn params onto the artifact.** The model sees label+result together → no separate
  "variant A used X" context to carry → fewer tokens, fewer mistakes.
- **Structured JSON out + parse.** No re-asking, no free-text wrangling. Prompt "return
  ONLY JSON"; regex the first `{...}`.
- **Short representative sample.** Tune on a 3-5s clip / one frame / one component, not
  the whole asset. Cheaper render, smaller upload, faster inference. Apply the found
  params to the full render once.
- **Cap variants at ~5-6.** More doesn't improve the model's discrimination and multiplies
  render + token cost. Wide-but-sparse round 1, narrow round 2.
- **Calibration anchors.** Include one deliberately-bad and one safe-default variant as
  fixed anchors each round. They give the model a reference scale and expose when its
  "best" is worse than the safe default (catch a bad recommendation early).
- **Independent rubric, stated up front.** Define "good" concretely in the prompt
  (smooth, subtle settle, not bouncy, not sluggish). Don't ask "which do you like";
  that lets it echo your framing. A held-out criterion keeps the judge honest
  (see verify-outputs-rule: the check must be independent of what you tuned).
- **Reuse renders across rounds.** Carry the round-1 winner's clip into round 2 instead
  of re-rendering it.
- **Early-exit.** If round-1 top ≥9/10 and the three suggestions are within a small delta,
  skip round 2.
- **Cheapest judge that can see the failure.** Frames through an image VLM can judge
  spatial things (layout, color, crop); only reach for a true *video* model when the
  thing being judged is **temporal** (easing, timing, motion smoothness); those are
  invisible in stills.
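
The parse, early-exit, and anchor bullets above can be sketched as follows. The text suggests a regex for the first `{...}`; this sketch swaps in `json.JSONDecoder.raw_decode` from the standard library, which copes with the nested `ratings` object and with trailing prose without a hand-written pattern. The function names, the default 9/10 threshold, and the 10% relative delta are illustrative assumptions, not values defined by this skill.

```python
import json
from typing import Any, Dict


def parse_reply(text: str) -> Dict[str, Any]:
    """Decode the first JSON object in a model reply, ignoring text around it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object in reply")
    # raw_decode stops at the end of the first complete JSON value.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


def should_skip_round_two(
    reply: Dict[str, Any],
    top_threshold: float = 9.0,  # assumed cut-off for "rates high"
    max_delta: float = 0.10,     # assumed: suggestions within 10% per parameter
) -> bool:
    """Early exit: the top rating is high and the suggestions cluster."""
    top = max(reply["ratings"].values())
    suggestions = reply["suggest"]
    if top < top_threshold or len(suggestions) < 2:
        return False
    for column in zip(*suggestions):  # one column per tuned parameter
        low, high = min(column), max(column)
        scale = max(abs(low), abs(high), 1e-9)
        if (high - low) / scale > max_delta:
            return False
    return True


def anchors_look_sane(
    ratings: Dict[str, float], bad: str, safe: str, best: str
) -> bool:
    """The bad anchor should rate below the safe default, and the model's
    pick should rate at least as high as the safe default."""
    return ratings[bad] < ratings[safe] <= ratings[best]
```

If `anchors_look_sane` returns `False`, treat the round as unreliable rather than applying the model's pick.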

## When NOT to use it

- A real numeric metric exists and correlates with quality → optimize that directly;
  don't pay a model per step.
- The judgment is subjective-to-the-user (their taste, brand) → show them the variants
  and let them pick; a model's "best" isn't their best. (This is why the screen-studio
  spring auto-tune was dropped: the model's pick didn't match the owner's eye.)
- One or two variants → just look yourself.

## Caveats (learned)

- The model's pick is an *opinion*, not ground truth. Anchor it, and sanity-check the
  winner against the safe default yourself before committing.
- Vision/video models perceive gross differences well, fine ones poorly. Keep variant
  spacing perceptible; near-identical variants get ratings that are mostly noise.
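
One way to keep spacing perceptible is to space variants by a constant ratio rather than a constant difference, and to check the smallest relative step before rendering. This is a sketch under assumptions: `geometric_spread` and `min_relative_step` are hypothetical helpers, parameters are assumed to be positive, and what counts as a perceptible step has to be calibrated per parameter.

```python
from typing import List, Sequence


def geometric_spread(low: float, high: float, count: int) -> List[float]:
    """Return `count` values from low to high, each a fixed ratio above the last."""
    if count < 2 or low <= 0 or high <= low:
        raise ValueError("need count >= 2 and 0 < low < high")
    ratio = (high / low) ** (1 / (count - 1))
    return [low * ratio**i for i in range(count)]


def min_relative_step(values: Sequence[float]) -> float:
    """Smallest relative gap between neighbouring values (positive values assumed)."""
    ordered = sorted(values)
    return min((b - a) / a for a, b in zip(ordered, ordered[1:]))


# Round 1: five variants from 1.0 to 4.0, each about 41% above the previous one.
round_one = geometric_spread(1.0, 4.0, 5)

# A spread like this differs by only about 1% per step and is likely too fine
# for the judge to separate.
too_fine = min_relative_step([2.0, 2.02, 2.04])
```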

## Limitations

- Model ratings are probabilistic aesthetic judgments, not objective truth; keep a human review step for brand-critical or subjective work.
- Automated rounds can become expensive or slow when renders are heavy or many variants are explored.
- This skill needs screenshots, frames, or clips that expose the quality difference; it is weak for subtle motion, audio, copy nuance, or user-preference calls.