
Author

Aditya Sundar - Waseda University

Abstract

This project creates an automated pipeline for preprocessing video and audio data, extracting key information for training video generation models. The pipeline handles face detection, emotion classification, pose estimation, and audio processing. Key Features:
  • Automatic unique face detection and isolation
  • Head pose estimation (yaw, pitch, roll)
  • Emotion classification using deep learning
  • Audio classification and speech isolation
  • Clip generation (3-10 second segments)

1. Introduction

The rise of video-based generative models has created a need for robust preprocessed video datasets. This project automates the preprocessing of video and audio data to:
  • Automatically classify and recognize unique faces
  • Detect facial emotions and head poses across time
  • Classify audio for background music and isolate speech
  • Trim and refine videos for use in generative models

2. Methodology

Pipeline Overview

[Figure: Video preprocessing pipeline workflow]

The preprocessing pipeline consists of five main stages:

| Stage | Function |
| --- | --- |
| Audio Classification | Identify speech and isolate from background noise |
| Face Detection | Detect and identify unique faces in video |
| Face Cropping | Generate face-focused clips (3-10 seconds) |
| Pose Estimation | Estimate head orientation (yaw, pitch, roll) |
| Emotion Classification | Detect emotions in each frame |

2.1 Video Processing

Videos are downloaded using yt-dlp and processed frame by frame:
import os
import yt_dlp

def download_video(url, output_dir):
    """Download a video with yt-dlp, naming the file by title and ID."""
    ydl_opts = {
        'format': 'best',  # best available single-file format
        'outtmpl': os.path.join(output_dir, '%(title)s-%(id)s.%(ext)s'),
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])
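Once downloaded, the video is decoded and its frames are passed through the later stages. As a minimal sketch of the bookkeeping involved, the helper below picks which frame indices to process if the stream is sampled at a reduced rate (subsampling for efficiency is an assumption here, not something the project states; the function name is illustrative):

```python
def sample_indices(total_frames, fps, target_fps):
    """Indices of frames to process when sampling a video at target_fps.

    Running per-frame models on every frame of a ~24 fps video is
    expensive; sampling a subset is a common trade-off. Returns all
    indices when target_fps >= fps.
    """
    if target_fps >= fps:
        return list(range(total_frames))
    step = fps / target_fps
    return [int(round(i * step)) for i in range(int(total_frames / step))]
```

For example, sampling a 30 fps stream at 10 fps keeps every third frame index.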

2.2 Audio Classification

Audio is classified using the Audio Spectrogram Transformer model:
  • Converts audio to spectrogram
  • Applies Vision Transformer for classification
  • A score threshold of ~20% on non-speech classes is used to detect significant background noise or music
Example Results:

| Video Type | Speech % | Music % |
| --- | --- | --- |
| Commentary with music | 50.28% | 37.20% |
| Live performance | 1.50% | 46.54% |
| News interview | 82.64% | 0% |
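The ~20% threshold can be expressed as a small decision rule. A sketch, assuming the classifier returns label-to-percentage scores and that any non-speech class above the threshold triggers source separation (e.g. with demucs); the function name and the exact rule are assumptions, not taken from the project:

```python
def needs_speech_isolation(scores, music_threshold=20.0):
    """Decide whether speech should be separated from background audio.

    `scores` maps class labels (as returned by an audio classifier such
    as the Audio Spectrogram Transformer) to percentages. If any
    non-speech label exceeds the threshold, the clip is flagged for
    source separation before further processing.
    """
    return any(score > music_threshold
               for label, score in scores.items()
               if label.lower() != 'speech')
```

Applied to the example results above, both music-heavy videos are flagged while the news interview is not.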

2.3 Face Detection and Cropping

Using the YuNet face detection model:
  1. Detect all faces in each frame
  2. Select largest face as subject
  3. Crop and resize to consistent dimensions
  4. Generate 3-10 second clips
Optional background removal using rembg can further isolate subjects.
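Steps 2 and 3 above can be sketched as two small helpers: picking the largest detection as the subject and expanding its box into a crop region. This is a sketch under the assumption that detections arrive as `(x, y, w, h)` boxes, as YuNet-style detectors return; the margin value and function names are illustrative:

```python
def largest_face(boxes):
    """Select the detection with the largest area as the subject.

    Each box is (x, y, w, h) in pixels.
    """
    return max(boxes, key=lambda b: b[2] * b[3])

def crop_region(box, frame_w, frame_h, margin=0.2):
    """Expand a face box by a relative margin and clamp it to the frame,
    giving (x0, y0, x1, y1) pixel bounds for cropping."""
    x, y, w, h = box
    dx, dy = int(w * margin), int(h * margin)
    x0 = max(0, x - dx)
    y0 = max(0, y - dy)
    x1 = min(frame_w, x + w + dx)
    y1 = min(frame_h, y + h + dy)
    return x0, y0, x1, y1
```

The resulting region would then be cropped and resized to the pipeline's consistent dimensions.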
[Figure: Face-cropped clip example]

[Figure: Background-removed clip]

2.4 Pose Estimation

Head pose is estimated using 68 facial landmarks:
  • Yaw: Left-right rotation (>10° = looking right/left)
  • Pitch: Up-down rotation (>10° = looking up/down)
  • Roll: Head tilt
Pose values are smoothed using a buffer of consecutive frames.
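The buffer-based smoothing and the 10° direction rule can be sketched as follows. Assumptions not taken from the project: the window size, the moving-average choice, and the sign conventions for pitch (positive yaw = right follows the example results in Section 3):

```python
from collections import deque

class PoseSmoother:
    """Smooth per-frame yaw/pitch with a moving average over a small
    buffer of consecutive frames."""

    def __init__(self, window=5):
        self.yaw = deque(maxlen=window)
        self.pitch = deque(maxlen=window)

    def update(self, yaw_deg, pitch_deg):
        """Add one frame's angles; return the smoothed (yaw, pitch)."""
        self.yaw.append(yaw_deg)
        self.pitch.append(pitch_deg)
        return (sum(self.yaw) / len(self.yaw),
                sum(self.pitch) / len(self.pitch))

def direction(yaw_deg, pitch_deg, threshold=10.0):
    """Map smoothed angles to a coarse label; within threshold = Forward."""
    if yaw_deg > threshold:
        return 'Right'
    if yaw_deg < -threshold:
        return 'Left'
    if pitch_deg > threshold:
        return 'Up'
    if pitch_deg < -threshold:
        return 'Down'
    return 'Forward'
```

With the clip values reported in Section 3, a yaw of 10.36° maps to "Right" and a yaw of 0.65° with pitch 4.07° maps to "Forward".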

2.5 Emotion Classification

Using facial_emotions_image_detection from Hugging Face:
  • Detects: happy, sad, angry, neutral, fear, disgust, surprise
  • Scores normalized to sum to 100%
  • Averaged across entire video for summary
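The normalization and averaging steps amount to simple score arithmetic. A minimal sketch, assuming per-frame scores arrive as label-to-score dicts (the helper names are illustrative, not the model's API):

```python
def normalize(scores):
    """Rescale raw emotion scores so they sum to 100%."""
    total = sum(scores.values())
    return {label: 100.0 * s / total for label, s in scores.items()}

def average_emotions(per_frame):
    """Average normalized per-frame scores into a video-level summary."""
    summary = {}
    for frame_scores in per_frame:
        for label, score in normalize(frame_scores).items():
            summary[label] = summary.get(label, 0.0) + score
    n = len(per_frame)
    return {label: total / n for label, total in summary.items()}
```

Normalizing before averaging keeps frames with different raw-score magnitudes from dominating the summary.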

3. Results

Example Video Analysis

Test video: “Hacksaw Ridge Interview - Andrew Garfield” (4 min 11 sec)

| Metric | Value |
| --- | --- |
| Total frames | 6,024 |
| FPS | 23.97 |
| Face detection rate | 98.26% |
| Average faces per frame | 1.0 |
| Clips generated | 26 |

Audio Classification

  • Speech: 88.89%
  • Rustling leaves: 1.12%
  • Rustle: 0.74%
No speech isolation needed due to high speech confidence.

Pose Estimation Examples

Forward-facing clip:
  • Yaw: 0.65°, Pitch: 4.07°
  • Direction: “Forward”
[Figure: Forward-facing pose detection]

Right-facing clip:
  • Yaw: 10.36°, Pitch: -1.22°
  • Direction: “Right”
[Figure: Right-facing pose detection]

Emotion Classification

The model showed uncertainty across emotions, with the highest per-class probability around 25%. This reflects the difficulty of inferring emotion from static facial images alone.

4. Future Directions

  1. Additional classifications: Lip reading, gesture detection
  2. GPU acceleration: Currently CPU-only due to resource limits
  3. Fine-tuned models: Custom models for specific tasks
  4. Advanced emotion detection: Multi-modal approaches beyond static images

References

  1. 1adrianb/face-alignment - 2D and 3D Face alignment library
  2. ageitgey/face_recognition - Face recognition API for Python
  3. CelebV-HQ - Large-Scale Video Facial Attributes Dataset
  4. danielgatis/rembg - Background removal tool
  5. dima806/facial_emotions_image_detection - Hugging Face
  6. facebookresearch/demucs - Music source separation
  7. MIT/ast-finetuned-audioset - Audio Spectrogram Transformer