nl-video-score

Installation
SKILL.md

NL Video Score

Claude cannot play video, but it CAN read image frames. This skill extracts frames at tactical timestamps, compares them against reference frames (if provided), and produces a weighted scorecard so the user can decide: ship, re-roll, or switch model.

Prerequisites

ffmpeg must be installed. Verify: ffmpeg -version

Workflow

1. Get video metadata

ffprobe -v error \
  -show_entries stream=codec_name,width,height,r_frame_rate,duration,nb_frames \
  -show_entries format=duration \
  -of default=noprint_wrappers=1 <video>

2. Extract frames — three-tier strategy

Scale to ~942x550 for fast reading while preserving detail.

2a. Overview (1 fps, full duration)

Overall flow, pose transitions, BG stability.

ffmpeg -y -i <video> -vf "fps=1,scale=942:550" <out_dir>/overview_%02d.png

2b. Critical-motion window (8 fps, ~1s window)

Zoom into the moment that matters most — a dialogue beat, gesture, or expression change. Center the window on the scripted action timestamp.

# Example: action at t=1.0s, sample 0.7s-1.7s
ffmpeg -y -ss 0.7 -i <video> -t 1.0 -vf "fps=8,scale=942:550" <out_dir>/detail_%02d.png

For shots with no dialogue, use the critical gesture window (eye movement, head turn, hand action).

2c. End-stage stability (4 fps, last ~2.5s)

AI video models often degrade in the final seconds — morphing, finger drift, face collapse.

# For 10s video: sample 7.5s-10s
ffmpeg -y -ss 7.5 -i <video> -t 2.5 -vf "fps=4,scale=942:550" <out_dir>/end_%02d.png

Adjust timestamps proportionally for non-10s shots.

3. Read frames

Read all extracted frames in parallel (single message, many Read tool calls). Also read reference start/end frames if available. Preserve temporal order.

4. Score — 5 dimensions, 55 points

Dimension Weight Max What to look for
Face consistency x3 15 Identity stability. Same eyes, nose, hair, blush. Penalize: morph, identity drift, eye shape change, jaw migration.
Micro-expression fidelity x3 15 Does performance match the prompt's emotional intent? Penalize: model softening, wrong valence, missed beats, generic "pleasant" default.
Drift / morph control x2 10 Structural integrity of hands, fingers, props, clothing. Penalize: extra fingers, hand warping, prop morphing, button migration.
Full-duration stability x2 10 Does the shot hold together for the full duration? Penalize: late-stage collapse, accumulating drift, motion that starts fine then derails.
Color / style match x1 5 Target style preserved. Penalize: style drift (e.g. 2D to 3D), photorealism creep, color grading mismatch, oversaturation.

Each dimension: provide 1-2 concrete frame-specific callouts (e.g. "end_07: left index finger merging with middle finger").

5. Decision matrix

Score Verdict Action
>= 45 Golden Lock model + seed. Ship it.
40-44 Viable Ship, but consider one comparison with a stronger model for critical shots.
30-39 Marginal Re-roll with tightened prompt, or A/B test a different model.
< 30 Fail Do not use. Switch model, split shot, or rework prompt/references.

6. Report format

# <Model> Scorecard — <ShotID>

## TL;DR
<2-3 sentence verdict with score and the single most important finding>

## Frame-by-frame observations
### Overview (0-Ns)
<table: timestamp | observation>

### Critical motion window
<did the scripted action happen? pass/fail with frame references>

### End stage
<late-stage stability verdict>

## Scorecard
| Dimension | Weight | Raw | Weighted | Notes |
|---|---|---|---|---|
| Face consistency | x3 | X/15 | X | ... |
| Micro-expression | x3 | X/15 | X | ... |
| Drift / morph | x2 | X/10 | X | ... |
| Stability | x2 | X/10 | X | ... |
| Style | x1 | X/5 | X | ... |
| **Total** | | | **X/55** | |

## Critical notes
<actionable issues: missing props, BG mismatch, lip sync needed, seed to preserve>

## Next moves
<ship / re-roll / switch model / A/B test>

Tips

  • Score the model against the brief, not the brief itself. If reference frames have a BG mismatch, that is a brief problem — flag it but don't penalize the model.
  • Props missing from output that aren't in reference frames: the model cannot create what it wasn't shown. Score the prompt/reference setup.
  • Mouth never moves for scripted dialogue: expected for models without native audio. Note that lip-sync post will be needed, but don't penalize drift.
  • Mouth moves continuously when it should be still: this IS a model failure ("liveliness bias"). Penalize under Micro-expression.
  • Read 20+ frames in parallel: single message with many Read calls. Never one at a time.
  • Scoring calibration: 55/55 is near-impossible. 45-50 is excellent. Most first-run outputs land 35-42.
  • The scorecard is advisory: the user's eye is the final judge. Present numbers with reasoning so they can override.
Related skills
Installs
1
GitHub Stars
1
First Seen
3 days ago