human-extractor
Human Extractor Skill
Description
GPU-accelerated pipeline for detecting, tracking, and classifying humans in dashcam footage. Processes MP4 videos to extract human crops with optional CLIP-based head covering classification, saving all outputs to a unified directory with comprehensive indexing.
Purpose
Extract visual evidence of human presence from dashcam recordings for investigative analysis. Optimized for high throughput using NVDEC decoding, batched YOLOv8 detection, ByteTrack multi-object tracking, and optional CLIP classification.
Usage
Basic Invocation
Extract humans from Park_R videos on October 6, 2025
Advanced Invocation
Scan Park_R\20251006 and 20251007, keep only frames with people,
save all outputs in one folder, add one full-frame per timestamp with boxes,
use my GPU at max, filter for head-covered individuals at 80% confidence
Input Parameters
Required
- roots (list[str]): One or more source directories containing MP4 files
- Example: ["G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251006"]
Core Detection
- confidence (float, default: 0.35): YOLOv8 detection confidence threshold (0.0-1.0)
- iou (float, default: 0.50): IoU threshold for NMS (0.0-1.0)
- yolo_batch (int, default: 64): YOLOv8 batch size (32-128 depending on VRAM)
CLIP Filtering (Optional)
- clip_filter.enabled (bool, default: false): Enable head covering classification
- clip_filter.threshold (float, default: 0.80): CLIP confidence threshold
- clip_filter.batch (int, default: 384): CLIP batch size (256-512)
Hardware Acceleration
- nvdec (bool, default: true): Use NVIDIA hardware video decoding
- gpu_id (int, default: 0): CUDA device ID
Output Control
- single_output_dir (str, default: "parsed\ALL_CROPS"): Unified output directory
- save_full_frame (bool, default: false): Save one annotated full-frame per timestamp
- full_frame_maxw (int, default: 1280): Max width for full-frame saves
- draw_boxes (bool, default: true): Annotate boxes on full-frames
Filename Convention
- filename_version (str, default: "v1"): Version tag for output filenames
Deduplication
- dedup.enabled (bool, default: true): Enable similarity deduplication
- dedup.ssim (float, default: 0.92): SSIM threshold (0.0-1.0)
- dedup.rate_cap_per_track_per_min (int, default: 12): Max crops per track per minute
Parallel Processing
- parallel.dates (list[str], optional): Process multiple dates concurrently
- parallel.max_workers (int, default: 3): Max parallel date workers
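The parameters above can be collected into one typed object so that defaults and range checks live in a single place. The following is a minimal sketch, not the skill's actual code: the class names `ExtractorConfig`, `ClipFilter`, and `Dedup` and the `validate` method are hypothetical, and only the field names and default values are taken from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class ClipFilter:
    enabled: bool = False
    threshold: float = 0.80
    batch: int = 384


@dataclass
class Dedup:
    enabled: bool = True
    ssim: float = 0.92
    rate_cap_per_track_per_min: int = 12


@dataclass
class ExtractorConfig:
    """Hypothetical container mirroring the documented parameters and defaults."""
    roots: list[str]
    confidence: float = 0.35
    iou: float = 0.50
    yolo_batch: int = 64
    clip_filter: ClipFilter = field(default_factory=ClipFilter)
    dedup: Dedup = field(default_factory=Dedup)
    nvdec: bool = True
    gpu_id: int = 0
    single_output_dir: str = r"parsed\ALL_CROPS"
    save_full_frame: bool = False
    full_frame_maxw: int = 1280
    draw_boxes: bool = True
    filename_version: str = "v1"
    max_workers: int = 3

    def validate(self) -> None:
        """Raise ValueError if a value falls outside its documented range."""
        if not self.roots:
            raise ValueError("roots must list at least one directory")
        unit_interval = {
            "confidence": self.confidence,
            "iou": self.iou,
            "clip_filter.threshold": self.clip_filter.threshold,
            "dedup.ssim": self.dedup.ssim,
        }
        for name, value in unit_interval.items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        for name, value in {"yolo_batch": self.yolo_batch,
                            "clip_filter.batch": self.clip_filter.batch,
                            "max_workers": self.max_workers}.items():
            if value < 1:
                raise ValueError(f"{name} must be >= 1, got {value}")
```

Calling `ExtractorConfig(roots=[...]).validate()` before a run would catch a threshold such as `confidence=1.5` early, before any GPU work starts.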
Output Format
Success Response
{
"status": "ok",
"summary": {
"videos_processed": 142,
"crops_saved": 4414,
"frames_saved": 728,
"gpu_util_avg": 0.85,
"processing_time_sec": 2847,
"errors": 0
},
"artifacts": {
"index_csv": "G:\\My Drive\\PROJECTS\\APPS\\Human_Detection\\parsed\\ALL_CROPS\\INDEX.csv",
"output_dir": "G:\\My Drive\\PROJECTS\\APPS\\Human_Detection\\parsed\\ALL_CROPS",
"log_file": "G:\\My Drive\\PROJECTS\\APPS\\Human_Detection\\parsed\\ALL_CROPS\\run_20251006_143022.log"
},
"performance": {
"nvdec_active": true,
"yolo_batch": 64,
"clip_batch": 384,
"avg_fps": 48.3,
"vram_peak_gb": 9.2
},
"notes": [
"NVDEC hardware decoding active",
"Batched YOLO=64, CLIP=384",
"GPU utilization: 85%"
]
}
Error Response
{
"status": "error",
"error": "CUDA out of memory",
"suggestion": "Reduce batch sizes: yolo_batch=48, clip_batch=256",
"partial_results": {
"videos_processed": 67,
"crops_saved": 2103
}
}
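A caller will typically branch on `status` and surface either the summary or the suggestion. The sketch below shows one way to do that; `summarize_response` is a hypothetical helper, and it assumes only the keys shown in the two sample responses above.

```python
def summarize_response(resp: dict) -> str:
    """Turn a success or error response into a one-line, human-readable summary."""
    if resp.get("status") == "ok":
        s = resp["summary"]
        return (f"ok: {s['videos_processed']} videos, "
                f"{s['crops_saved']} crops, {s['errors']} errors")
    # Error responses may still carry partial results worth reporting.
    partial = resp.get("partial_results", {})
    return (f"error: {resp.get('error', 'unknown')}; "
            f"suggestion: {resp.get('suggestion', 'none')}; "
            f"partial: {partial.get('videos_processed', 0)} videos, "
            f"{partial.get('crops_saved', 0)} crops")
```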
Output Structure
Directory Layout
parsed\ALL_CROPS\
├── INDEX.csv # Global master index
├── INDEX.20251006_pid1234.csv # Shard (pre-merge)
├── run_20251006_143022.log # Execution log
│
# Crop files (per person detection)
├── 20251006__20251006142644_070785B__t15234__f365__trk017__x1014y46w266h659__c85__v1.webp
├── 20251006__20251006143844_070787B__t8420__f202__trk003__x234y567w180h420__c92__v1.webp
│
# Full-frame files (optional, one per timestamp)
├── 20251006__20251006142644_070785B__t15234__FRAME__v1.webp
└── 20251006__20251006143844_070787B__t8420__FRAME__v1.webp
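The layout above shows per-process shards (`INDEX.<date>_pid<pid>.csv`) next to the merged `INDEX.csv`. A minimal sketch of such a merge follows. It is an assumption about how the merge could work, not the pipeline's actual code: `merge_shards` is a hypothetical name, rows are de-duplicated on `crop_file`, and the merged file is rebuilt from the shards on every call.

```python
import csv
from pathlib import Path


def merge_shards(out_dir: Path) -> int:
    """Merge INDEX.*.csv shards into INDEX.csv; return the number of rows written."""
    header = None
    rows = []
    seen = set()
    # "INDEX.*.csv" matches the shards but not INDEX.csv itself.
    for shard in sorted(out_dir.glob("INDEX.*.csv")):
        with shard.open(newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            header = header or reader.fieldnames
            for row in reader:
                if row["crop_file"] in seen:
                    continue  # same file indexed by an earlier shard
                seen.add(row["crop_file"])
                rows.append(row)
    if header is None:
        return 0  # no shards found, nothing to write
    with (out_dir / "INDEX.csv").open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Because the merged file is rebuilt from the shards, this version only stays correct while the shards are kept on disk after the merge.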
Filename Convention
Crop Format:
<date>__<video_stem>__t<ts_ms>__f<frame_idx>__trk<track_id>__x<x1>y<y1>w<w>h<h>__c<covered_0to100>__v<ver>.webp
Example:
20251006__20251006142644_070785B__t15234__f365__trk017__x1014y46w266h659__c85__v1.webp
Decoded:
- Date: 2025-10-06
- Video: 20251006142644_070785B.MP4
- Timestamp: 15234 ms
- Frame: 365
- Track: 17
- BBox: x=1014, y=46, w=266, h=659
- CLIP confidence: 85% (head covering)
- Version: v1
Full-Frame Format:
<date>__<video_stem>__t<ts_ms>__FRAME__v<ver>.webp
Example:
20251006__20251006142644_070785B__t15234__FRAME__v1.webp
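Since every field is encoded in the filename, a crop or full-frame name can be decoded without opening INDEX.csv. The sketch below is one possible parser written against the two formats above. `parse_output_name` and both patterns are hypothetical, and they assume non-negative box coordinates and a `c` field that is always present on crops.

```python
import re

# Patterns derived from the documented crop and full-frame filename formats.
CROP_RE = re.compile(
    r"^(?P<date>\d{8})__(?P<video_stem>.+?)__t(?P<ts_ms>\d+)__f(?P<frame_idx>\d+)"
    r"__trk(?P<track_id>\d+)__x(?P<x1>\d+)y(?P<y1>\d+)w(?P<w>\d+)h(?P<h>\d+)"
    r"__c(?P<covered>\d+)__(?P<ver>v\w+)\.webp$"
)
FRAME_RE = re.compile(
    r"^(?P<date>\d{8})__(?P<video_stem>.+?)__t(?P<ts_ms>\d+)__FRAME__(?P<ver>v\w+)\.webp$"
)
INT_FIELDS = ("ts_ms", "frame_idx", "track_id", "x1", "y1", "w", "h", "covered")


def parse_output_name(name: str) -> dict:
    """Decode a crop or full-frame filename into its fields."""
    for file_type, pattern in (("crop", CROP_RE), ("frame", FRAME_RE)):
        match = pattern.match(name)
        if match:
            fields = match.groupdict()
            for key in INT_FIELDS:
                if key in fields:
                    fields[key] = int(fields[key])
            fields["file_type"] = file_type
            return fields
    raise ValueError(f"not a recognized output filename: {name}")
```

Applied to the crop example above, this yields `track_id == 17`, `x1 == 1014`, and `covered == 85`, matching the decoded list.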
INDEX.csv Schema
dataset,date,video_rel,video_stem,frame_idx,ts_ms,track_id,x1,y1,w,h,person_conf,covered_conf,file_type,crop_file,sha1,bboxes_json,annotated,pipeline_ver,yolo_batch,clip_batch,nvdec,created_utc
# Example rows:
Park_R,20251006,20251006\20251006142644_070785B.MP4,20251006142644_070785B,365,15234,17,1014,46,266,659,0.92,0.85,crop,20251006__20251006142644_070785B__t15234__f365__trk017__x1014y46w266h659__c85__v1.webp,a3f2c8b9...,,,v1,64,384,1,2025-10-06T14:30:22Z
Park_R,20251006,20251006\20251006142644_070785B.MP4,20251006142644_070785B,365,15234,,,,,,,,frame,20251006__20251006142644_070785B__t15234__FRAME__v1.webp,d4e1a2c7...,"[{""x1"":1014,""y1"":46,""w"":266,""h"":659,""conf"":0.92,""track"":17}]",1,v1,64,384,1,2025-10-06T14:30:22Z
Column Definitions:
- dataset: Source camera (Park_R, Park_F, Movie_F, Movie_R)
- date: YYYYMMDD
- video_rel: Relative path from dataset root
- video_stem: Filename without .MP4 extension
- frame_idx: Frame number in video
- ts_ms: Timestamp in milliseconds
- track_id: ByteTrack ID (empty for FRAME rows)
- x1,y1,w,h: Bounding box (empty for FRAME rows)
- person_conf: YOLOv8 detection confidence
- covered_conf: CLIP head covering confidence (0-1 scale, as in the example row; the filename's c field stores the same value as 0-100; empty if disabled)
- file_type: "crop" or "frame"
- crop_file: Relative filename
- sha1: File hash for integrity
- bboxes_json: All detections in frame (FRAME rows only)
- annotated: 1 if boxes drawn on frame, 0 otherwise
- pipeline_ver: Pipeline version tag (e.g. v1)
- yolo_batch: YOLO batch size used
- clip_batch: CLIP batch size used (0 if disabled)
- nvdec: 1 if NVDEC used, 0 otherwise
- created_utc: ISO 8601 timestamp
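Crop rows and FRAME rows share one schema but fill different columns, so a reader usually splits them first. A small sketch using only the standard library is shown below; `load_index` is a hypothetical helper that relies only on the `file_type` and `bboxes_json` columns defined above.

```python
import csv
import json
from typing import IO


def load_index(fileobj: IO[str]) -> tuple[list[dict], list[dict]]:
    """Split INDEX.csv rows into (crops, frames) and decode bboxes_json on frames."""
    crops, frames = [], []
    for row in csv.DictReader(fileobj):
        if row["file_type"] == "frame":
            # FRAME rows carry every detection in the frame as a JSON list.
            row["bboxes"] = json.loads(row["bboxes_json"]) if row["bboxes_json"] else []
            frames.append(row)
        else:
            crops.append(row)
    return crops, frames
```

Open the file with `newline=""` (for example `open(path, newline="", encoding="utf-8")`) so the `csv` module handles the quoted JSON field correctly.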
Implementation Details
Processing Pipeline
[MP4 Videos]
│
▼
[NVDEC Decoder (GPU)]
RGB tensor → CUDA Stream A
│
▼
[YOLOv8s Detection]
Batched (64 frames)
FP16, conf=0.35
│
▼
[ByteTrack Tracking]
IoU=0.5, max_age=10
│
├──────────────────────► [Full-Frame Saver]
│ (optional, downscaled, annotated)
▼
[ROI Align (GPU)]
Extract crops on GPU
│
▼
[CLIP Classification] ◄────── (optional)
Batched (384 crops)
FP16, threshold=0.80
│
▼
[Deduplication Filter]
SSIM ≥ 0.92
Rate cap: 12/min/track
│
▼
[Async I/O Thread Pool]
WebP encode (q=85)
Shard INDEX writes
│
▼
[Final Merge]
INDEX.csv
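The deduplication stage in the diagram combines an SSIM test with a per-track rate cap. The rate cap is simple enough to sketch on its own. The version below is an assumption, not the pipeline's code: `RateCap` is a hypothetical class, and it interprets "12/min/track" as a sliding 60-second window over kept crops (fixed one-minute buckets would be an equally plausible reading).

```python
from collections import defaultdict, deque


class RateCap:
    """Allow at most `cap` kept crops per track within any sliding time window."""

    def __init__(self, cap: int = 12, window_ms: int = 60_000) -> None:
        self.cap = cap
        self.window_ms = window_ms
        self._kept = defaultdict(deque)  # track_id -> timestamps of kept crops

    def allow(self, track_id: int, ts_ms: int) -> bool:
        """Return True and record the crop if the track is under its cap."""
        kept = self._kept[track_id]
        # Forget crops that have fallen out of the window.
        while kept and ts_ms - kept[0] >= self.window_ms:
            kept.popleft()
        if len(kept) >= self.cap:
            return False
        kept.append(ts_ms)
        return True
```

The cap would run after the SSIM check, so that near-duplicate crops are dropped first and the remaining budget is spent on visually distinct ones.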
GPU Optimization Strategy
Dual CUDA Streams:
- Stream A: YOLOv8 detection
- Stream B: CLIP classification
- Overlap compute + memory transfers
Dynamic Batching:
- Accumulate frames until batch size reached
- Process immediately on timeout (100ms)
- Keep GPU pipeline full (80-90% utilization)
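The dynamic batching rule above (fill the batch, or flush after 100 ms) can be sketched with a plain queue. This is an illustrative version only: `next_batch` is a hypothetical function, and it assumes the timeout is measured from the moment the consumer starts waiting rather than from the first frame's arrival.

```python
import queue
import time


def next_batch(frames: queue.Queue, batch_size: int = 64, timeout_s: float = 0.100) -> list:
    """Collect up to batch_size items, returning early once timeout_s has elapsed."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout reached: process whatever has accumulated
        try:
            batch.append(frames.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

A partial batch wastes some GPU capacity, but flushing on timeout keeps the tail of a video from stalling behind a batch that will never fill.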
Memory Management:
- Pinned memory for faster CPU↔GPU transfers
- Pre-allocated tensor buffers
- Stream-ordered operations
Decoder Priority:
- NVDEC (GPU hardware decoder) - 5-10x faster
- CPU fallback (OpenCV) if NVDEC unavailable
- Multi-threaded DataLoader (8-12 workers)
Performance Targets (RTX 4080 16GB)
| Metric | Target | Notes |
|---|---|---|
| GPU Utilization | 80-90% | NVDEC + dual streams + large batches |
| Throughput | 3-4 videos/min | Parking videos (2 FPS sampling) |
| VRAM Usage | 6-10 GB | YOLO=64, CLIP=384 |
| Latency | <30s per video | Including decode, detect, track, classify |
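As a quick sanity check, the throughput target can be compared with the figures in the sample success response (142 videos in 2847 s). The arithmetic below uses only those two numbers; it measures wall-clock time per video, which with parallel workers is not the same as single-video latency.

```python
videos_processed = 142        # from the sample success response
processing_time_sec = 2847    # from the sample success response

videos_per_min = videos_processed / processing_time_sec * 60
sec_per_video = processing_time_sec / videos_processed

print(f"{videos_per_min:.2f} videos/min, {sec_per_video:.1f} s/video")
# prints "2.99 videos/min, 20.0 s/video"
```

That sample run sits just under the lower end of the 3-4 videos/min target and within the 30 s per-video bound.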
Configuration Tuning
If GPU util < 70%:
- Increase batch sizes: yolo_batch=80, clip_batch=448
- Verify NVDEC active (check nvdec_active in response)
- Increase parallel workers: max_workers=4
If CUDA OOM:
- Reduce CLIP batch first: clip_batch=256
- Then reduce YOLO batch: yolo_batch=48
- Disable full-frame saves: save_full_frame=false
If disk I/O bottleneck:
- Disable full-frame: save_full_frame=false
- Reduce quality: full_frame_maxw=960, WebP q=75
- Use faster storage (NVMe SSD)
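The OOM guidance (reduce the CLIP batch first, then the YOLO batch) can be expressed as a small back-off function. The sketch below is hypothetical: `reduce_after_oom` and the two ladders are illustrative, with step values taken from the batch sizes this document mentions (64, 48, 32 for YOLO and 384, 256, 128 for CLIP), and it alternates between the two so that neither batch collapses before the other moves.

```python
from typing import Optional

YOLO_STEPS = [64, 48, 32]
CLIP_STEPS = [384, 256, 128]


def step_down(current: int, ladder: list) -> Optional[int]:
    """Largest ladder value strictly below current, or None at the bottom."""
    smaller = [v for v in ladder if v < current]
    return max(smaller) if smaller else None


def level(current: int, ladder: list) -> int:
    """How many ladder steps have already been taken (0 = top of the ladder)."""
    return sum(1 for v in ladder if v > current)


def reduce_after_oom(yolo_batch: int, clip_batch: int,
                     clip_enabled: bool = True) -> Optional[tuple]:
    """Return the next (yolo_batch, clip_batch) to try, or None if out of options."""
    clip_next = step_down(clip_batch, CLIP_STEPS) if clip_enabled else None
    yolo_next = step_down(yolo_batch, YOLO_STEPS)
    # CLIP goes first whenever it has not been reduced more often than YOLO.
    clip_turn = level(clip_batch, CLIP_STEPS) <= level(yolo_batch, YOLO_STEPS)
    if clip_next is not None and (yolo_next is None or clip_turn):
        return yolo_batch, clip_next
    if yolo_next is not None:
        return yolo_next, clip_batch
    return None
```

Starting from the defaults, repeated calls walk through (64, 256), (48, 256), (48, 128), (32, 128) and then return `None`, at which point the remaining option is to disable full-frame saves or CLIP filtering.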
CLI Equivalent
# Basic usage
python -m src.cli.run_multi_dates \
--root "G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Park_R" \
--out "parsed\ALL_CROPS" \
--dates 20251006 20251007 \
--use-nvdec --conf 0.35 --iou 0.5
# Advanced usage with CLIP filtering
python -m src.cli.run_multi_dates \
--root "G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Park_R" \
--out "parsed\ALL_CROPS" \
--dates 20251006 20251007 20251008 \
--use-nvdec \
--yolo-batch 64 \
--clip-batch 384 \
--clip-threshold 0.80 \
--conf 0.35 \
--iou 0.5 \
--save-full-frame \
--draw-boxes \
--parallel 3
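The JSON parameters and the CLI flags above describe the same run, so one can be derived from the other. The sketch below shows one possible mapping. `build_argv` is hypothetical and the parameter-to-flag correspondence is inferred from the two command lines above, not confirmed by the CLI itself; it also assumes every entry in `roots` is a date folder under a single dataset directory.

```python
from pathlib import PureWindowsPath


def build_argv(params: dict) -> list:
    """Translate skill parameters into an argument list for run_multi_dates."""
    roots = [PureWindowsPath(r) for r in params["roots"]]
    parents = {str(r.parent) for r in roots}
    if len(parents) != 1:
        raise ValueError("this sketch expects all roots to share one dataset directory")
    argv = [
        "python", "-m", "src.cli.run_multi_dates",
        "--root", parents.pop(),
        "--out", params.get("single_output_dir", r"parsed\ALL_CROPS"),
        "--dates", *[r.name for r in roots],
        "--conf", str(params.get("confidence", 0.35)),
        "--iou", str(params.get("iou", 0.50)),
    ]
    if params.get("nvdec", True):
        argv.append("--use-nvdec")
    if "yolo_batch" in params:
        argv += ["--yolo-batch", str(params["yolo_batch"])]
    clip = params.get("clip_filter", {})
    if clip.get("enabled"):
        argv += ["--clip-batch", str(clip.get("batch", 384)),
                 "--clip-threshold", str(clip.get("threshold", 0.80))]
    if params.get("save_full_frame"):
        argv.append("--save-full-frame")
        if params.get("draw_boxes", True):
            argv.append("--draw-boxes")
    workers = params.get("parallel", {}).get("max_workers")
    if workers:
        argv += ["--parallel", str(workers)]
    return argv
```

Passing the resulting list to `subprocess.run(argv)` avoids shell quoting problems with the space in `My Drive` and the backslashes in Windows paths.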
Example Interactions
Example 1: Basic Detection
User: "Extract all humans from Park_R videos on October 6"
Skill invokes:
{
"mode": "extract_humans",
"roots": ["G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251006"],
"confidence": 0.35,
"single_output_dir": "parsed\\ALL_CROPS",
"nvdec": true
}
Example 2: Advanced with CLIP
User: "Scan Park_R for October 6-8, filter for people with head coverings at 80% confidence, save annotated frames, max GPU usage"
Skill invokes:
{
"mode": "extract_humans",
"roots": [
"G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251006",
"G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251007",
"G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251008"
],
"confidence": 0.35,
"iou": 0.50,
"yolo_batch": 64,
"clip_filter": {
"enabled": true,
"threshold": 0.80,
"batch": 384
},
"nvdec": true,
"save_full_frame": true,
"draw_boxes": true,
"single_output_dir": "parsed\\ALL_CROPS",
"parallel": {
"max_workers": 3
}
}
Example 3: Low-Resource Mode
User: "Process Park_R October 6 with minimal GPU memory"
Skill invokes:
{
"mode": "extract_humans",
"roots": ["G:\\My Drive\\PROJECTS\\INVESTIGATION\\DASHCAM\\Park_R\\20251006"],
"confidence": 0.35,
"yolo_batch": 32,
"clip_filter": {
"enabled": false
},
"nvdec": false,
"save_full_frame": false,
"single_output_dir": "parsed\\ALL_CROPS"
}
Safety & Guardrails
Do Not
- ❌ Move or delete source MP4 files
- ❌ Infer gender unless explicitly enabled (sensitive, noisy)
- ❌ Process videos without user consent
- ❌ Share outputs containing identifiable persons
Do
- ✅ Verify GPU availability before processing
- ✅ Enforce longitude sign corrections for GPS overlays
- ✅ Maintain audit trail in INDEX.csv
- ✅ Log versions, batches, NVDEC usage
- ✅ Handle OOM gracefully with suggestions
Resume Safety
- Idempotent: skip already-processed crops by filename
- Shard-based: partial runs can resume
- Index integrity: SHA1 hashes verify file correctness
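The resume rules above amount to a check made before each crop is written: skip it if the file already exists and, when a hash is on record, still matches. A minimal sketch follows; `sha1_of` and `needs_processing` are hypothetical helpers, and treating a hash mismatch as "reprocess" is an assumption about the intended behavior.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha1_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-1 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_processing(out_dir: Path, crop_file: str,
                     expected_sha1: Optional[str] = None) -> bool:
    """True if the crop is missing or fails its recorded hash."""
    target = out_dir / crop_file
    if not target.exists():
        return True
    if expected_sha1 is not None and sha1_of(target) != expected_sha1:
        return True  # file exists but is truncated or corrupted
    return False
```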
Testing & Verification
Pre-Run Checks
import os
from pathlib import Path
import torch
# GPU availability
assert torch.cuda.is_available(), "CUDA required"
assert torch.cuda.device_count() > 0, "No GPU found"
# Model files
assert Path("models/yolov8s.pt").exists(), "YOLOv8 model missing"
# Output directory writable
output_dir = Path("parsed/ALL_CROPS")
output_dir.mkdir(parents=True, exist_ok=True)
assert os.access(output_dir, os.W_OK), "Output dir not writable"
Post-Run Verification
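from pathlib import Path
import pandas as pd
# Check outputs exist
assert Path("parsed/ALL_CROPS/INDEX.csv").exists()
assert len(list(Path("parsed/ALL_CROPS").glob("*.webp"))) > 0
# Validate INDEX.csv
df = pd.read_csv("parsed/ALL_CROPS/INDEX.csv")
assert df['crop_file'].notna().all()
# person_conf is empty on FRAME rows, so check crop rows only
assert df.loc[df['file_type'] == 'crop', 'person_conf'].between(0, 1).all()
# Sample roundtrip
sample = df.sample(1).iloc[0]
assert Path(f"parsed/ALL_CROPS/{sample['crop_file']}").exists()
# GPU utilization check; `response` is the parsed success response of the run
gpu_util_avg = response["summary"]["gpu_util_avg"]
assert gpu_util_avg > 0.70, f"Low GPU util: {gpu_util_avg}"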
Dependencies
Required
- Python 3.10+
- PyTorch 2.0+ with CUDA 11.8+
- ultralytics (YOLOv8)
- transformers (CLIP)
- opencv-python
- pillow
- pandas
- numpy
Optional (Performance)
- NVIDIA Video Codec SDK (NVDEC)
- TensorRT (future optimization)
- nvJPEG (GPU JPEG encoding)
Installation
cd "G:\My Drive\PROJECTS\APPS\Human_Detection"
pip install -r requirements.txt
Troubleshooting
Common Issues
1. CUDA Out of Memory
Error: CUDA out of memory. Tried to allocate 2.50 GiB
Solution: Reduce batch sizes
yolo_batch: 64 → 48 → 32
clip_batch: 384 → 256 → 128
2. NVDEC Not Available
Warning: NVDEC unavailable, falling back to CPU decode
Solution: Check NVIDIA driver version (≥525.60)
GPU must support Video Codec SDK
Verify with: nvidia-smi --query-gpu=name --format=csv
3. Low GPU Utilization
Warning: GPU util only 45%
Solutions:
1. Increase batch sizes (if VRAM allows)
2. Enable NVDEC: nvdec=true
3. Increase parallel workers: max_workers=4
4. Check CPU bottleneck (use more DataLoader workers)
4. Slow Processing
Performance: 0.8 videos/min (expected 3-4)
Diagnostics:
1. Check disk I/O (use SSD)
2. Verify NVDEC active (5-10x faster than CPU)
3. Profile with: python -m torch.utils.bottleneck script.py
Future Enhancements
Planned
- TensorRT optimization (2-4x CLIP speedup)
- Multi-GPU sharding (process different dates on different GPUs)
- GPU JPEG encoding (nvJPEG); nvJPEG does not encode WebP, so GPU WebP output would need a different encoder
- Real-time streaming mode
Under Consideration
- Face recognition integration
- Gender classification (opt-in only, with warnings)
- Action recognition (walking, standing, etc.)
- Multi-camera fusion (correlate detections across cameras)
Version History
v1.0 (Current)
- Initial release
- YOLOv8s + ByteTrack + CLIP
- NVDEC support
- Unified output directory
- Global INDEX.csv
Contact & Support
For issues or questions:
- Check parsed/ALL_CROPS/run_*.log for error details
- Review GPU diagnostics: nvidia-smi
- Validate input paths exist and are readable
- Verify CUDA/PyTorch installation: python -c "import torch; print(torch.cuda.is_available())"