Author
Aditya Sundar - Waseda University
Abstract
This project creates an automated pipeline for preprocessing video and audio data, extracting key information for training video generation models. The pipeline handles face detection, emotion classification, pose estimation, and audio processing.
Key Features:
- Automatic unique face detection and isolation
- Head pose estimation (yaw, pitch, roll)
- Emotion classification using deep learning
- Audio classification and speech isolation
- Clip generation (3-10 second segments)
1. Introduction
The rise of video-based generative models has created a need for robust, preprocessed video datasets. This project automates the preprocessing of video and audio data to:
- Automatically classify and recognize unique faces
- Detect facial emotions and head poses across time
- Classify audio for background music and isolate speech
- Trim and refine videos for use in generative models
2. Methodology
Pipeline Overview

Video preprocessing pipeline workflow
| Stage | Function |
|---|---|
| Audio Classification | Identify speech and isolate from background noise |
| Face Detection | Detect and identify unique faces in video |
| Face Cropping | Generate face-focused clips (3-10 seconds) |
| Pose Estimation | Estimate head orientation (yaw, pitch, roll) |
| Emotion Classification | Detect emotions in each frame |
2.1 Video Processing
Videos are downloaded using yt-dlp and processed frame by frame, as sketched below.
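A minimal sketch of this step, assuming yt-dlp is invoked from the command line and frames are decoded with OpenCV (the output file name and the frame-iteration helper are illustrative, not the project's exact scripts):

```python
import subprocess
import cv2

def download_video(url: str, out_path: str = "video.mp4") -> str:
    """Download a video with yt-dlp (assumes yt-dlp is on PATH)."""
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", out_path, url], check=True)
    return out_path

def iterate_frames(video_path: str):
    """Yield (frame_index, BGR frame) pairs for frame-by-frame processing."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield idx, frame
        idx += 1
    cap.release()
```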
2.2 Audio Classification
Audio is classified using the Audio Spectrogram Transformer (AST) model, as sketched after this list:
- Converts audio to a spectrogram
- Applies a Vision Transformer for classification
- Applies a threshold of ~20% to detect background noise
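A minimal sketch of this step, assuming the MIT/ast-finetuned-audioset-10-10-0.4593 checkpoint (the AST model in the references) and the Hugging Face audio-classification pipeline; the helper name and the way the ~20% threshold is applied are illustrative:

```python
from transformers import pipeline

# AST checkpoint fine-tuned on AudioSet (see references).
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

def classify_audio(wav_path: str, threshold: float = 0.20):
    """Return top AudioSet scores and flag speech/music above the ~20% threshold."""
    results = classifier(wav_path, top_k=10)
    scores = {r["label"]: r["score"] for r in results}
    has_speech = scores.get("Speech", 0.0) > threshold
    has_music = scores.get("Music", 0.0) > threshold
    return scores, has_speech, has_music
```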
| Video Type | Speech % | Music % |
|---|---|---|
| Commentary with music | 50.28% | 37.20% |
| Live performance | 1.50% | 46.54% |
| News interview | 82.64% | 0% |
2.3 Face Detection and Cropping
Using the YuNet face detection model (see the sketch after this list):
- Detect all faces in each frame
- Select largest face as subject
- Crop and resize to consistent dimensions
- Generate 3-10 second clips
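A minimal sketch of the detection and cropping step, assuming OpenCV's FaceDetectorYN wrapper for YuNet; the ONNX weight file name and the 256×256 output size are illustrative, and assembly of the crops into 3-10 second clips is omitted:

```python
import cv2
import numpy as np

# Assumes the YuNet ONNX weights have been downloaded separately
# (file name is illustrative).
detector = cv2.FaceDetectorYN.create(
    "face_detection_yunet_2023mar.onnx", "", (320, 320)
)

def crop_largest_face(frame, out_size=(256, 256)):
    """Detect faces with YuNet and return the largest one, resized."""
    h, w = frame.shape[:2]
    detector.setInputSize((w, h))
    _, faces = detector.detect(frame)
    if faces is None:
        return None
    # Each row is [x, y, w, h, 5 landmark pairs..., score]; pick the largest box.
    x, y, bw, bh = faces[np.argmax(faces[:, 2] * faces[:, 3])][:4].astype(int)
    x, y = max(x, 0), max(y, 0)
    crop = frame[y:y + bh, x:x + bw]
    return cv2.resize(crop, out_size)
```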
Optional background removal using rembg can further isolate subjects.
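A minimal sketch of the optional rembg step for a single saved frame (whether the pipeline applies it per frame or per clip is not specified above; paths are illustrative):

```python
from rembg import remove
from PIL import Image

def remove_background(in_path: str, out_path: str) -> None:
    """Strip the background from one frame image using rembg."""
    with Image.open(in_path) as img:
        result = remove(img)   # returns a PIL image with an alpha channel
        result.save(out_path)  # save as PNG to keep the transparency
```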

Face cropped clip example

Background removed clip
2.4 Pose Estimation
Head pose is estimated using 68 facial landmarks (see the sketch after this list):
- Yaw: Left-right rotation (>10° = looking right/left)
- Pitch: Up-down rotation (>10° = looking up/down)
- Roll: Head tilt
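A minimal sketch, assuming 68-point landmarks (e.g. from the face-alignment library listed in the references) and the common solvePnP approach with a generic 3D face model; the 3D reference coordinates, camera intrinsics, and angle conventions are rough illustrative values rather than the project's exact parameters:

```python
import cv2
import numpy as np

# Generic 3D reference points for 6 of the 68 landmarks (nose tip, chin,
# eye corners, mouth corners); values are approximate and illustrative.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),           # 30: nose tip
    (0.0, -330.0, -65.0),      # 8:  chin
    (-225.0, 170.0, -135.0),   # 36: left eye, outer corner
    (225.0, 170.0, -135.0),    # 45: right eye, outer corner
    (-150.0, -150.0, -125.0),  # 48: left mouth corner
    (150.0, -150.0, -125.0),   # 54: right mouth corner
], dtype=np.float64)

def head_pose(landmarks_68, frame_size):
    """Return (yaw, pitch, roll) in degrees from 68 2D facial landmarks."""
    h, w = frame_size
    image_points = np.array(
        [landmarks_68[i] for i in (30, 8, 36, 45, 48, 54)], dtype=np.float64
    )
    # Simple pinhole camera: focal length ~ image width, centre at midpoint.
    camera = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]], dtype=np.float64)
    _, rvec, _ = cv2.solvePnP(MODEL_POINTS, image_points, camera, np.zeros((4, 1)))
    rot, _ = cv2.Rodrigues(rvec)
    # Decompose the rotation matrix into Euler angles (degrees).
    angles, *_ = cv2.RQDecomp3x3(rot)
    pitch, yaw, roll = angles
    return yaw, pitch, roll

def direction(yaw, pitch, threshold=10.0):
    """Map angles to the coarse directions described above (>10° thresholds)."""
    if abs(yaw) > threshold:
        return "Right" if yaw > 0 else "Left"
    if abs(pitch) > threshold:
        return "Up" if pitch > 0 else "Down"
    return "Forward"
```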
2.5 Emotion Classification
Using facial_emotions_image_detection from Hugging Face (see the sketch after this list):
- Detects: happy, sad, angry, neutral, fear, disgust, surprise
- Scores normalized to sum to 100%
- Averaged across entire video for summary
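A minimal sketch of per-frame classification and the whole-video average, assuming the dima806/facial_emotions_image_detection checkpoint with the Hugging Face image-classification pipeline; the averaging helper is illustrative:

```python
from collections import defaultdict
from transformers import pipeline
from PIL import Image

emotion_classifier = pipeline(
    "image-classification",
    model="dima806/facial_emotions_image_detection",
)

def frame_emotions(face_crop: Image.Image) -> dict:
    """Return emotion scores for one face crop, normalised to sum to 1."""
    results = emotion_classifier(face_crop, top_k=7)  # all seven emotion classes
    total = sum(r["score"] for r in results)
    return {r["label"]: r["score"] / total for r in results}

def video_emotion_summary(face_crops) -> dict:
    """Average per-frame emotion scores across the whole video."""
    sums, n = defaultdict(float), 0
    for crop in face_crops:
        for label, score in frame_emotions(crop).items():
            sums[label] += score
        n += 1
    return {label: s / n for label, s in sums.items()} if n else {}
```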
3. Results
Example Video Analysis
Test video: “Hacksaw Ridge Interview - Andrew Garfield” (4 min 11 sec)
| Metric | Value |
|---|---|
| Total frames | 6,024 |
| FPS | 23.97 |
| Face detection rate | 98.26% |
| Average faces per frame | 1.0 |
| Clips generated | 26 |
Audio Classification
Pose Estimation Examples
Forward-facing clip:
- Yaw: 0.65°, Pitch: 4.07°
- Direction: “Forward”

Forward-facing pose detection
Right-facing clip:
- Yaw: 10.36°, Pitch: -1.22°
- Direction: “Right”

Right-facing pose detection
Emotion Classification
The model showed uncertainty across emotions, with the highest probability around 25%. This indicates the complexity of emotion detection from static facial images.
4. Future Directions
- Additional classifications: Lip reading, gesture detection
- GPU acceleration: Currently CPU-only due to resource limits
- Fine-tuned models: Custom models for specific tasks
- Advanced emotion detection: Multi-modal approaches beyond static images
References
- 1adrianb/face-alignment - 2D and 3D Face alignment library
- ageitgey/face_recognition - Face recognition API for Python
- CelebV-HQ - Large-Scale Video Facial Attributes Dataset
- danielgatis/rembg - Background removal tool
- dima806/facial_emotions_image_detection - Facial emotion classification model (Hugging Face)
- facebookresearch/demucs - Music source separation
- MIT/ast-finetuned-audioset - Audio Spectrogram Transformer
