Author
Aditya Sundar - Waseda University
Abstract
This project creates an automated pipeline for preprocessing video and audio data, extracting key information for training video generation models. The pipeline handles face detection, emotion classification, pose estimation, and audio processing.
Key Features:
- Automatic unique face detection and isolation
- Head pose estimation (yaw, pitch, roll)
- Emotion classification using deep learning
- Audio classification and speech isolation
- Clip generation (3-10 second segments)
1. Introduction
The rise of video-based generative models has created a need for robust, preprocessed video datasets. This project automates the preprocessing of video and audio data to:
- Automatically classify and recognize unique faces
- Detect facial emotions and head poses across time
- Classify audio for background music and isolate speech
- Trim and refine videos for use in generative models
2. Methodology
Pipeline Overview

Video preprocessing pipeline workflow
| Stage | Function |
|---|---|
| Audio Classification | Identify speech and isolate from background noise |
| Face Detection | Detect and identify unique faces in video |
| Face Cropping | Generate face-focused clips (3-10 seconds) |
| Pose Estimation | Estimate head orientation (yaw, pitch, roll) |
| Emotion Classification | Detect emotions in each frame |
2.1 Video Processing
Videos are downloaded using yt-dlp and processed frame by frame, as sketched below.
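A minimal sketch of this step, assuming yt-dlp is invoked from the command line and frames are decoded with OpenCV (the output file name and the frame-iteration helper are illustrative, not the project's exact scripts):

```python
import subprocess
import cv2

def download_video(url: str, out_path: str = "video.mp4") -> str:
    """Download a video with yt-dlp (assumes yt-dlp is on PATH)."""
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", out_path, url], check=True)
    return out_path

def iterate_frames(video_path: str):
    """Yield (frame_index, BGR frame) pairs for frame-by-frame processing."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield idx, frame
        idx += 1
    cap.release()
```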
2.2 Audio Classification
Audio is classified using the Audio Spectrogram Transformer (AST) model, as sketched after this list:
- Converts audio to a spectrogram
- Applies a Vision Transformer for classification
- Applies a threshold of ~20% to detect background noise
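A minimal sketch of this step, assuming the MIT/ast-finetuned-audioset-10-10-0.4593 checkpoint (the AST model in the references) and the Hugging Face audio-classification pipeline; the helper name and the way the ~20% threshold is applied are illustrative:

```python
from transformers import pipeline

# AST checkpoint fine-tuned on AudioSet (see references).
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

def classify_audio(wav_path: str, threshold: float = 0.20):
    """Return top AudioSet scores and flag speech/music above the ~20% threshold."""
    results = classifier(wav_path, top_k=10)
    scores = {r["label"]: r["score"] for r in results}
    has_speech = scores.get("Speech", 0.0) > threshold
    has_music = scores.get("Music", 0.0) > threshold
    return scores, has_speech, has_music
```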
| Video Type | Speech % | Music % |
|---|---|---|
| Commentary with music | 50.28% | 37.20% |
| Live performance | 1.50% | 46.54% |
| News interview | 82.64% | 0% |
2.3 Face Detection and Cropping
Using the YuNet face detection model (see the sketch after this list):
- Detect all faces in each frame
- Select largest face as subject
- Crop and resize to consistent dimensions
- Generate 3-10 second clips
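A minimal sketch of the detection and cropping step, assuming OpenCV's FaceDetectorYN wrapper for YuNet; the ONNX weight file name and the 256×256 output size are illustrative, and assembly of the crops into 3-10 second clips is omitted:

```python
import cv2
import numpy as np

# Assumes the YuNet ONNX weights have been downloaded separately
# (file name is illustrative).
detector = cv2.FaceDetectorYN.create(
    "face_detection_yunet_2023mar.onnx", "", (320, 320)
)

def crop_largest_face(frame, out_size=(256, 256)):
    """Detect faces with YuNet and return the largest one, resized."""
    h, w = frame.shape[:2]
    detector.setInputSize((w, h))
    _, faces = detector.detect(frame)
    if faces is None:
        return None
    # Each row is [x, y, w, h, 5 landmark pairs..., score]; pick the largest box.
    x, y, bw, bh = faces[np.argmax(faces[:, 2] * faces[:, 3])][:4].astype(int)
    x, y = max(x, 0), max(y, 0)
    crop = frame[y:y + bh, x:x + bw]
    return cv2.resize(crop, out_size)
```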
Optional background removal using rembg can further isolate subjects.
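A minimal sketch of the optional rembg step for a single saved frame (whether the pipeline applies it per frame or per clip is not specified above; paths are illustrative):

```python
from rembg import remove
from PIL import Image

def remove_background(in_path: str, out_path: str) -> None:
    """Strip the background from one frame image using rembg."""
    with Image.open(in_path) as img:
        result = remove(img)   # returns a PIL image with an alpha channel
        result.save(out_path)  # save as PNG to keep the transparency
```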

Face cropped clip example

Background removed clip
2.4 Pose Estimation
Head pose is estimated using 68 facial landmarks (see the sketch after this list):
- Yaw: Left-right rotation (>10° = looking right/left)
- Pitch: Up-down rotation (>10° = looking up/down)
- Roll: Head tilt
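A minimal sketch, assuming 68-point landmarks (e.g. from the face-alignment library listed in the references) and the common solvePnP approach with a generic 3D face model; the 3D reference coordinates, camera intrinsics, and angle conventions are rough illustrative values rather than the project's exact parameters:

```python
import cv2
import numpy as np

# Generic 3D reference points for 6 of the 68 landmarks (nose tip, chin,
# eye corners, mouth corners); values are approximate and illustrative.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),           # 30: nose tip
    (0.0, -330.0, -65.0),      # 8:  chin
    (-225.0, 170.0, -135.0),   # 36: left eye, outer corner
    (225.0, 170.0, -135.0),    # 45: right eye, outer corner
    (-150.0, -150.0, -125.0),  # 48: left mouth corner
    (150.0, -150.0, -125.0),   # 54: right mouth corner
], dtype=np.float64)

def head_pose(landmarks_68, frame_size):
    """Return (yaw, pitch, roll) in degrees from 68 2D facial landmarks."""
    h, w = frame_size
    image_points = np.array(
        [landmarks_68[i] for i in (30, 8, 36, 45, 48, 54)], dtype=np.float64
    )
    # Simple pinhole camera: focal length ~ image width, centre at midpoint.
    camera = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]], dtype=np.float64)
    _, rvec, _ = cv2.solvePnP(MODEL_POINTS, image_points, camera, np.zeros((4, 1)))
    rot, _ = cv2.Rodrigues(rvec)
    # Decompose the rotation matrix into Euler angles (degrees).
    angles, *_ = cv2.RQDecomp3x3(rot)
    pitch, yaw, roll = angles
    return yaw, pitch, roll

def direction(yaw, pitch, threshold=10.0):
    """Map angles to the coarse directions described above (>10° thresholds)."""
    if abs(yaw) > threshold:
        return "Right" if yaw > 0 else "Left"
    if abs(pitch) > threshold:
        return "Up" if pitch > 0 else "Down"
    return "Forward"
```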
2.5 Emotion Classification
Using facial_emotions_image_detection from Hugging Face (see the sketch after this list):
- Detects: happy, sad, angry, neutral, fear, disgust, surprise
- Scores normalized to sum to 100%
- Averaged across entire video for summary
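A minimal sketch of per-frame classification and the whole-video average, assuming the dima806/facial_emotions_image_detection checkpoint with the Hugging Face image-classification pipeline; the averaging helper is illustrative:

```python
from collections import defaultdict
from transformers import pipeline
from PIL import Image

emotion_classifier = pipeline(
    "image-classification",
    model="dima806/facial_emotions_image_detection",
)

def frame_emotions(face_crop: Image.Image) -> dict:
    """Return emotion scores for one face crop, normalised to sum to 1."""
    results = emotion_classifier(face_crop, top_k=7)  # all seven emotion classes
    total = sum(r["score"] for r in results)
    return {r["label"]: r["score"] / total for r in results}

def video_emotion_summary(face_crops) -> dict:
    """Average per-frame emotion scores across the whole video."""
    sums, n = defaultdict(float), 0
    for crop in face_crops:
        for label, score in frame_emotions(crop).items():
            sums[label] += score
        n += 1
    return {label: s / n for label, s in sums.items()} if n else {}
```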
3. Results
Example Video Analysis
Test video: “Hacksaw Ridge Interview - Andrew Garfield” (4 min 11 sec)
| Metric | Value |
|---|---|
| Total frames | 6,024 |
| FPS | 23.97 |
| Face detection rate | 98.26% |
| Average faces per frame | 1.0 |
| Clips generated | 26 |
Audio Classification
Pose Estimation Examples
Forward-facing clip:
- Yaw: 0.65°, Pitch: 4.07°
- Direction: “Forward”

Forward-facing pose detection
Right-facing clip:
- Yaw: 10.36°, Pitch: -1.22°
- Direction: “Right”

Right-facing pose detection
Emotion Classification
The model showed uncertainty across emotions, with the highest probability around 25%. This indicates the complexity of emotion detection from static facial images.
4. Future Directions
- Additional classifications: Lip reading, gesture detection
- GPU acceleration: Currently CPU-only due to resource limits
- Fine-tuned models: Custom models for specific tasks
- Advanced emotion detection: Multi-modal approaches beyond static images
References
- 1adrianb/face-alignment - 2D and 3D Face alignment library
- ageitgey/face_recognition - Face recognition API for Python
- CelebV-HQ - Large-Scale Video Facial Attributes Dataset
- danielgatis/rembg - Background removal tool
- dima806/facial_emotions_image_detection - Facial emotion classification model (Hugging Face)
- facebookresearch/demucs - Music source separation
- MIT/ast-finetuned-audioset - Audio Spectrogram Transformer
