
Google MediaPipe Usage (Web / Pose Landmarker)

Quick Start

  1. Install @mediapipe/tasks-vision, resolve WASM from CDN
  2. Create PoseLandmarker with createFromOptions
  3. Use detect() for single image, or detectForVideo() in a throttled requestAnimationFrame loop

Setup

Install (prefer pnpm):

pnpm add @mediapipe/tasks-vision

WASM root: Resolve the WASM assets for vision tasks from a CDN when creating the task:

const vision = await FilesetResolver.forVisionTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm",
);

Create the Pose Landmarker Task

Use PoseLandmarker.createFromOptions(vision, options):

import { PoseLandmarker, FilesetResolver } from "@mediapipe/tasks-vision";

const vision = await FilesetResolver.forVisionTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm",
);

const poseLandmarker = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: modelUrl, // see reference.md for lite/full/heavy URLs
    delegate: "GPU", // falls back to CPU if unavailable
  },
  runningMode: "VIDEO", // or "IMAGE" for single image
  numPoses: 1,
  minPoseDetectionConfidence: 0.5,
  minPosePresenceConfidence: 0.5,
  minTrackingConfidence: 0.5,
});
  • runningMode: IMAGE for single image → use detect(image). VIDEO for stream → use detectForVideo(video, timestamp).
  • baseOptions.modelAssetPath: URL to a .task model (lite / full / heavy). See reference.md for URLs.
  • delegate: "GPU" preferred; some environments fall back to CPU.

Run the Task

Single image (runningMode IMAGE):

const result = poseLandmarker.detect(imageElement);

Video / webcam (runningMode VIDEO):

Call detectForVideo(video, timestamp) inside a requestAnimationFrame loop. Throttle by time (e.g. ~33 ms between frames) to avoid excessive work:

let lastFrameTime = 0;
function detectLoop() {
  const now = performance.now();
  if (video.readyState >= 2 && now - lastFrameTime > 33) {
    lastFrameTime = now;
    const result = poseLandmarker.detectForVideo(video, now);
    if (result.landmarks?.length) {
      const landmarks = result.landmarks[0]; // first person
      // use landmarks
    }
  }
  requestAnimationFrame(detectLoop);
}
requestAnimationFrame(detectLoop);

Result Shape

  • result.landmarks: Array of poses; each pose is NormalizedLandmark[] (33 points). Each landmark has x and y normalized to [0, 1] by image width and height, z (depth relative to the hip midpoint, roughly on the same scale as x; smaller means closer to the camera), and visibility (0–1).
  • result.worldLandmarks: Real-world 3D coordinates in meters with the hip midpoint as origin (same 33 indices).
  • Single person: use result.landmarks[0].
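
Because x and y are normalized, they map directly to canvas pixels. A minimal sketch of reading the first pose (the index constants follow the standard 33-point topology; toPixels and shoulderMidpoint are illustrative helpers, not library API):

```typescript
// Indices from the 33-point pose topology (11/12 = left/right shoulder).
const LEFT_SHOULDER = 11;
const RIGHT_SHOULDER = 12;

// Minimal stand-in for NormalizedLandmark from @mediapipe/tasks-vision.
interface Landmark { x: number; y: number; z: number; visibility?: number; }

// Convert a normalized landmark to pixel coordinates, e.g. for canvas drawing.
function toPixels(lm: Landmark, width: number, height: number) {
  return { x: lm.x * width, y: lm.y * height };
}

// Shoulder midpoint in pixels, e.g. to anchor an overlay on the torso.
function shoulderMidpoint(pose: Landmark[], width: number, height: number) {
  const l = toPixels(pose[LEFT_SHOULDER], width, height);
  const r = toPixels(pose[RIGHT_SHOULDER], width, height);
  return { x: (l.x + r.x) / 2, y: (l.y + r.y) / 2 };
}
```

With a video frame, you would call this as `shoulderMidpoint(result.landmarks[0], video.videoWidth, video.videoHeight)`.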

Practical Patterns (Know-how)

  • State machine: idle → loading (load model) → ready (can start) → active (webcam + detection) → error. When switching model variant, close the old PoseLandmarker instance and create a new one.
  • Throttle: Run detectForVideo only when performance.now() - lastFrameTime > 33 (≈30 fps) to avoid blocking the main thread.
  • Smoothing: Apply a smoothing factor (e.g. 0.3) to derived values (pitch, bank) to reduce jitter; use a dead zone (in degrees) to ignore small movements.
  • Confidence: Use each landmark’s visibility; ignore or downweight points below a threshold. Helper: getLandmark(landmarks, index, minConfidence) returning the point only if visibility >= minConfidence.
  • Gestures: e.g. “hands forward” = compare shoulder vs wrist z; “hands overhead” = compare wrist y to shoulder y. Use consecutive-frame counters for toggles (e.g. require N frames in pose before firing an action).
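
The smoothing, confidence, and frame-counter patterns above can be sketched as plain functions. The thresholds (factor 0.3, 2° dead zone, 10-frame hold) are illustrative defaults, and smoothAngle, getLandmark, and PoseToggle are hypothetical helpers, not part of @mediapipe/tasks-vision:

```typescript
// Minimal stand-in for NormalizedLandmark from @mediapipe/tasks-vision.
interface Landmark { x: number; y: number; z: number; visibility?: number; }

// Exponential smoothing with a dead zone: ignore changes below deadZoneDeg,
// otherwise move `factor` of the way toward the new value each frame.
function smoothAngle(prev: number, next: number, factor = 0.3, deadZoneDeg = 2): number {
  if (Math.abs(next - prev) < deadZoneDeg) return prev;
  return prev + (next - prev) * factor;
}

// Return a landmark only when its visibility clears the threshold.
function getLandmark(landmarks: Landmark[], index: number, minConfidence = 0.5): Landmark | null {
  const lm = landmarks[index];
  if (!lm || (lm.visibility ?? 0) < minConfidence) return null;
  return lm;
}

// Consecutive-frame counter: fire a toggle once the pose has been held
// for requiredFrames frames in a row, then stay silent until it is re-entered.
class PoseToggle {
  private count = 0;
  constructor(private requiredFrames = 10) {}
  update(inPose: boolean): boolean {
    this.count = inPose ? this.count + 1 : 0;
    return this.count === this.requiredFrames; // true exactly once per hold
  }
}
```

In the detect loop, you would feed each frame's derived angle through smoothAngle, gate landmarks through getLandmark, and call PoseToggle.update once per frame with the gesture predicate.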

Cleanup

  • Stop webcam: stream.getTracks().forEach(t => t.stop()).
  • Release task: poseLandmarker.close() when done or before creating a new instance.
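
Both steps can be combined into one teardown sketch. The minimal Closable/Stream interfaces below stand in for PoseLandmarker and MediaStream, and teardown is a hypothetical helper:

```typescript
// Structural stand-ins for PoseLandmarker and MediaStream.
interface Closable { close(): void; }
interface Track { stop(): void; }
interface Stream { getTracks(): Track[]; }

// Stop camera tracks first, then release the task's WASM/GPU resources.
function teardown(stream: Stream | null, landmarker: Closable | null): void {
  stream?.getTracks().forEach((t) => t.stop());
  landmarker?.close();
}
```

Call this on component unmount and before creating a new PoseLandmarker when switching model variants.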
