Movie Content Analysis, Indexing and Skimming
김덕주 (Duck Ju Kim)
Problems
- What is the objective of content-based video analysis?
- Why is supervised identification limited?
- Why should integrated media data be used?
Slide 3
Introduction
- Analysis
  - Structured organization
  - Embedded semantics
- Indexing
  - Tagging semantic units
  - Limited machine perception
- Skimming
  - Abstraction & presentation
  - Video browsing
Slide 4
Event Detection Approach
- Shot detection
  - Low-level structure
  - Does not correspond directly to video semantics
- Scene extraction
  - Higher-level context
  - Many unimportant contents
- Event extraction
  - Higher semantic level
  - Better at revealing, representing, and abstracting content
Slide 5
Speaker Identification Approach
- Standard speech databases
  - YOHO, HUB4, SWITCHBOARD
- Integration of media cues
  - Speaker recognition + facial analysis
  - Speech cues + visual cues
- Supervised identification
  - Fixed speaker models
  - Insufficient training data
  - Data collection before processing
Slide 6
Video Skimming Approach
- Pre-developed schemes
  - Discontinuous semantic flow
  - Ignored embedded audio cues
- Computation of six types of features
- Importance evaluation
- Assembling important events
Slide 7
Content Pre-analysis
- Shot detection
  - Color histogram-based approach
- Extract keyframes
  - The first and last frames
- Audio content
  - Classification: silence, speech, music, environmental sounds
- Visual content
  - Detect human faces
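The shot detection and keyframe steps above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the frame representation (a list of RGB tuples), the bin count, the halved L1 histogram distance, and the threshold value are all assumptions chosen for brevity.

```python
def color_histogram(frame, bins=4):
    """Normalized histogram of quantized colors.

    `frame` is assumed to be a non-empty list of (r, g, b) tuples in 0..255.
    """
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in frame:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    total = float(len(frame))
    return [h / total for h in hist]


def histogram_distance(h1, h2):
    """Halved L1 distance, so the result lies in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(frames, threshold=0.5):
    """Indices i at which frame i starts a new shot (hypothetical threshold)."""
    hists = [color_histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if histogram_distance(hists[i - 1], hists[i]) > threshold]


def keyframes(boundaries, n_frames):
    """(first, last) frame index of every shot, used as its keyframes."""
    starts = [0] + boundaries
    ends = [b - 1 for b in boundaries] + [n_frames - 1]
    return list(zip(starts, ends))
```

A real system would read decoded video frames and tune the threshold on data; only the histogram comparison idea is taken from the slide.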
Slide 8
Movie Event Extraction
- Develop thematic topics
  - Through actions or dialogs
- What to extract?
  - Two-speaker dialogs
  - Multiple-speaker dialogs
  - Hybrid events
Slide 9
Movie Event Extraction
- How to extract?
  - Shot sink computation
    - Grouping close and similar shots
  - Sink clustering and characterization
    - Periodic, partly-periodic, non-periodic
  - Event extraction and classification
  - Post-processing
Slide 10
Shot Sink Computation
- Pool of close and similar shots
- Using visual information
- Window-based sweep algorithm
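One way to read the window-based sweep is sketched below. The slide gives only the name of the algorithm, so the details here are assumptions: each unassigned shot seeds a sink, shots within `window` positions that are visually similar to the seed join it, and the sweep restarts from the last shot that joined. The `distance` callable, `window`, and `threshold` are hypothetical parameters.

```python
def compute_shot_sinks(features, distance, window=5, threshold=0.3):
    """Group temporally close and visually similar shots into sinks.

    `features[i]` is the visual feature of shot i (for example a keyframe
    histogram) and `distance` compares two such features.
    """
    n = len(features)
    assigned = [False] * n
    sinks = []
    for seed in range(n):
        if assigned[seed]:
            continue
        sink = [seed]
        assigned[seed] = True
        anchor = seed
        while True:
            last_added = None
            # Look only at the next `window` shots after the anchor.
            for j in range(anchor + 1, min(anchor + 1 + window, n)):
                if not assigned[j] and distance(features[seed], features[j]) <= threshold:
                    sink.append(j)
                    assigned[j] = True
                    last_added = j
            if last_added is None:
                break
            anchor = last_added  # continue the sweep from the newest member
        sinks.append(sink)
    return sinks
```

For alternating dialog shots A, B, A, B, A this yields one sink per speaker, which is the repetition pattern the later clustering step looks for.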
Slide 11
Shot Sink Clustering
- Clustering & characterizing
  - Periodic, partly-periodic, non-periodic
  - Degree of shot repetition
- Determining the sink periodicity
  - Calculate relative temporal distance
  - Compute mean, standard deviation
  - Grouping with K-means algorithm
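The periodicity steps above might look like this in code. It is a sketch under assumptions: a sink is a sorted list of shot indices, the relative temporal distance is taken to be the gap between consecutive shots in a sink, and the K-means routine is a plain textbook version with caller-supplied initial centroids. How the resulting clusters map to periodic, partly-periodic, and non-periodic is not shown on the slide and is left out here.

```python
import math


def temporal_stats(sink):
    """Mean and standard deviation of gaps between consecutive shots in a sink."""
    gaps = [b - a for a, b in zip(sink, sink[1:])]
    if not gaps:
        return (0.0, 0.0)
    mean = sum(gaps) / len(gaps)
    variance = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return (mean, math.sqrt(variance))


def kmeans(points, centroids, iterations=20):
    """Plain K-means on tuples; returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: mean of the members, or keep an empty cluster's centroid.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids
```

A sink such as `[0, 2, 4, 6, 8]` has zero gap deviation and would sit at the periodic end; feeding the `(mean, std)` pairs of all sinks to `kmeans` with three initial centroids gives the three groups.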
Slide 12
Slide 13
Integrating Speech & Face Information
- False alarms
  - Montage presentation -> spoken dialog
  - Multiple-speaker dialog -> two-speaker dialog
- Solutions for reducing false alarms
  - Embedded audio information integration
    - Speech shot ratio calculation
  - Facial cue inclusion
    - Face detection
Face Detection & Mouth Tracking
- Detection & recognition of talking faces
- Distance between eyes and mouth: dist
- Eye positions: (x1, y1), (x2, y2)
- Mouth center: (x, y)
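The slide names the quantities but its formula for dist did not survive. A natural reading, offered here purely as an assumption, is the Euclidean distance from the mouth center to the midpoint between the two eyes:

```python
import math


def eye_mouth_distance(left_eye, right_eye, mouth):
    """Distance from the mouth center (x, y) to the midpoint of the eyes.

    Assumed definition; the original formula is not available.
    """
    mid_x = (left_eye[0] + right_eye[0]) / 2
    mid_y = (left_eye[1] + right_eye[1]) / 2
    return math.hypot(mouth[0] - mid_x, mouth[1] - mid_y)
```

For eyes at (0, 0) and (4, 0) and a mouth center at (2, 3), the function returns 3.0.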
Slide 17
Speech Segmentation
Slide 18
Speech Clustering
- Two separate segments X1, X2
- Joined segment X = {X1, X2}
- Cluster C has n homogeneous speech segments
- Dist(X, C): negative value -> considered to be from the same speaker
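The distance formula is missing from the slide. The sketch below uses a standard BIC-style comparison (one Gaussian for the joined data versus one Gaussian per segment) as a stand-in, with the sign chosen so that a negative value favors a single speaker, matching the bullet above. Diagonal covariances, the variance floor, and `penalty_weight` are assumptions made to keep the example dependency-free.

```python
import math


def _diag_logdet(segment, floor=1e-6):
    """Log-determinant of the diagonal covariance of a list of feature vectors."""
    n = len(segment)
    total = 0.0
    for column in zip(*segment):
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n
        total += math.log(max(variance, floor))  # floor avoids log(0)
    return total


def bic_distance(x1, x2, penalty_weight=1.0):
    """Delta-BIC between two segments; negative suggests the same speaker."""
    n1, n2 = len(x1), len(x2)
    n = n1 + n2
    d = len(x1[0])
    gain = 0.5 * (n * _diag_logdet(x1 + x2)
                  - n1 * _diag_logdet(x1)
                  - n2 * _diag_logdet(x2))
    # Splitting adds one extra diagonal Gaussian: d means + d variances.
    penalty = 0.5 * penalty_weight * (2 * d) * math.log(n)
    return gain - penalty
```

To compare a segment X against a cluster C, one would pass the pooled frames of C as the second argument.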
Slide 19
Initial Speaker Modeling
- Required for the identification process
- Exploits the inter-relations between facial and speech cues
- For each target cast member A
  - Find a speech shot where A is talking
  - Collect all the speech segments
  - Build the initial model
- Gaussian Mixture Model (GMM)
Slide 20
Likelihood-based Speaker Identification
- GMM model notation: mixture components $j = 1, 2, \ldots, m$
- Model $M_i$ for the $i$th enrolled speaker
- The log likelihood between $X$ and $M_i$
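A likelihood-based decision of this kind can be sketched as below: score the speech segment X against every enrolled model and pick the highest total log likelihood. The model layout (a list of `(weight, mean, variance)` tuples with diagonal covariance) and the function names are assumptions for illustration; the slide's own equation is not available.

```python
import math


def gmm_log_likelihood(frames, model):
    """Total log likelihood of feature vectors under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        component_logs = []
        for weight, mean, var in model:
            log_pdf = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                          for xi, m, v in zip(x, mean, var))
            component_logs.append(math.log(weight) + log_pdf)
        # Log-sum-exp over components for numerical stability.
        peak = max(component_logs)
        total += peak + math.log(sum(math.exp(c - peak) for c in component_logs))
    return total


def identify_speaker(frames, models):
    """Name of the enrolled model with the highest log likelihood."""
    return max(models, key=lambda name: gmm_log_likelihood(frames, models[name]))
```

Usage: `identify_speaker(segment_frames, {"A": model_a, "B": model_b})` returns the key of the best-matching model.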
Slide 21
Audiovisual Integration for Speaker Identification
- Finalizing the speaker identification task
- Integration of audio and video cues
  - Examine the existence of temporal overlap
  - Overlap ratio > threshold: assign face vector to cluster
  - Otherwise, set face vector to null
- Speaker identity
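The overlap test above can be sketched as follows. The slide does not define the ratio, so this example assumes it is the overlapping time divided by the speech segment's duration; intervals are `(start, end)` pairs, and `face_tracks` and the threshold value are hypothetical.

```python
def overlap_ratio(speech, face):
    """Fraction of the speech segment during which the face is on screen."""
    start = max(speech[0], face[0])
    end = min(speech[1], face[1])
    overlap = max(0.0, end - start)
    duration = speech[1] - speech[0]
    return overlap / duration if duration > 0 else 0.0


def assign_face(speech, face_tracks, threshold=0.5):
    """Face id with the largest overlap above the threshold, else None (null)."""
    best_id, best_ratio = None, threshold
    for face_id, interval in face_tracks.items():
        ratio = overlap_ratio(speech, interval)
        if ratio > best_ratio:
            best_id, best_ratio = face_id, ratio
    return best_id
```

Returning `None` corresponds to setting the face vector to null, in which case the decision would rest on the audio cue alone.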
Slide 22
Unsupervised Speaker Model Adaptation
- Updating the speaker model
- Three approaches
  - Average-based model adaptation
  - MAP-based model adaptation
  - Viterbi-based model adaptation
Slide 23
Average-based Model Adaptation
- Compute BIC distances
- Compare $d_{\min}$ with threshold $T$
  - $d_{\min} < T$:
  - $d_{\min} > T$: initialize new mixture component
- Update the weight for each component
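A rough sketch of this update loop follows. Two things are assumptions, not taken from the slide: the action for the case where the minimum distance is below T (here the nearest component's mean becomes a count-weighted average with the new segment), and the distance itself, where plain Euclidean distance between means is swapped in for the BIC distance to keep the example short. The component dictionary layout is hypothetical.

```python
import math


def segment_mean(segment):
    """Per-dimension mean of a list of feature vectors."""
    n = len(segment)
    return [sum(col) / n for col in zip(*segment)]


def average_adapt(components, segment, threshold):
    """Update a non-empty list of {'mean', 'count', 'weight'} components in place."""
    seg_mean = segment_mean(segment)
    n_new = len(segment)
    dists = [math.dist(c["mean"], seg_mean) for c in components]
    nearest = min(range(len(components)), key=dists.__getitem__)
    if dists[nearest] < threshold:
        # Assumed action: fold the segment into the nearest component.
        c = components[nearest]
        total = c["count"] + n_new
        c["mean"] = [(c["count"] * m + n_new * s) / total
                     for m, s in zip(c["mean"], seg_mean)]
        c["count"] = total
    else:
        # Far from every component: start a new one.
        components.append({"mean": seg_mean, "count": n_new, "weight": 0.0})
    # Re-derive every weight from the frame counts.
    all_frames = sum(c["count"] for c in components)
    for c in components:
        c["weight"] = c["count"] / all_frames
    return components
```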
Slide 24
MAP-based Model Adaptation
- $\mu_i$: mean of $b_i$
- $L_i$: occupation likelihood of the adaptation data
- $\bar{\mu}$: mean of the observed adaptation data
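The slide's equation is lost, but the three symbols match the common relevance-factor form of MAP mean adaptation, $\hat{\mu}_i = \frac{L_i}{L_i + \tau}\bar{\mu} + \frac{\tau}{L_i + \tau}\mu_i$. The sketch below implements that form as an assumption; the prior weight $\tau$ (`tau`) does not appear in the source and its default value is arbitrary.

```python
def map_adapt_mean(prior_mean, observed_mean, occupation, tau=10.0):
    """Interpolate between the prior mean and the observed adaptation mean.

    `occupation` plays the role of L_i; `tau` is an assumed prior weight.
    """
    w = occupation / (occupation + tau)
    return [w * o + (1.0 - w) * p for p, o in zip(prior_mean, observed_mean)]
```

With little adaptation data (small `occupation`) the result stays near the prior mean; with a lot of data it moves toward the observed mean.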
Slide 25
Viterbi-based Model Adaptation
- Allows different feature vectors from different components
- Hard decision
  - Any vector either occupies a component or does not
  - Indicator function instead of probability function
- Mixture component
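The hard-decision idea can be illustrated as below: each feature vector is assigned to the single best-scoring mixture component (an indicator of 0 or 1 instead of a posterior probability), and each component is then re-estimated from its own vectors only. Replacing the mean outright, the diagonal covariance, and the `(weight, mean, variance)` layout are assumptions; the slide does not give the update equation.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def hard_assign(frames, model):
    """Index of the best-scoring component for every frame."""
    return [max(range(len(model)),
                key=lambda j: math.log(model[j][0])
                + log_gaussian(x, model[j][1], model[j][2]))
            for x in frames]


def viterbi_adapt_means(frames, model):
    """Re-estimate each component mean from the frames assigned to it."""
    labels = hard_assign(frames, model)
    adapted = []
    for j, (weight, mean, var) in enumerate(model):
        members = [x for x, lab in zip(frames, labels) if lab == j]
        if members:
            mean = [sum(col) / len(members) for col in zip(*members)]
        adapted.append((weight, list(mean), var))
    return adapted
```

Components that receive no vectors keep their previous mean.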
Slide 26
Event-based Movie Skimming
- Event feature extraction
  - Six types of mid- to high-level features
- Evaluation of importance
- Movie skim generation
  - Assemble major events -> final skim
Slide 27
Event Feature Extraction
- Music ratio
- Speech ratio
- Sound loudness
- Action level
  - Normalized by dividing by the largest value
- Present cast
- Theme topic
Slide 28
Event Feature Extraction
- $M$: number of features extracted
- $N$: number of events
- $a_{i,j}$: value of the $j$th feature in the $i$th event
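The N-by-M feature matrix can be held as a list of rows, one per event. The sketch below applies the divide-by-largest-value normalization to every column; whether the original system normalizes all features or only some of them is not clear from the slides, so treating all columns alike is an assumption.

```python
def normalize_features(matrix):
    """Scale each feature column of an N x M matrix by its largest value."""
    n_features = len(matrix[0])
    maxima = [max(row[j] for row in matrix) for j in range(n_features)]
    return [[row[j] / maxima[j] if maxima[j] > 0 else 0.0
             for j in range(n_features)]
            for row in matrix]
```

For two events with raw features `[2.0, 10.0]` and `[4.0, 5.0]`, the normalized rows are `[0.5, 1.0]` and `[1.0, 0.5]`.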
Slide 29
Movie Skim Generation
- Choosing important events
  - User's feature preference
  - Event importance vector
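One plausible way to turn a user's feature preference into an event importance vector is a weighted sum of each event's features, followed by a greedy pick of the highest-scoring events that fit a target skim length. Both the weighted-sum scoring and the greedy selection under a duration `budget` are assumptions made for illustration.

```python
def event_importance(features, preference):
    """Importance of each event as the preference-weighted sum of its features."""
    return [sum(a * w for a, w in zip(row, preference)) for row in features]


def select_events(importance, durations, budget):
    """Greedily keep the most important events that fit in the skim budget."""
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    chosen, used = [], 0.0
    for i in ranked:
        if used + durations[i] <= budget:
            chosen.append(i)
            used += durations[i]
    return sorted(chosen)  # restore temporal order for the final skim
```

Sorting the chosen indices keeps the skim in story order, which matters for semantic continuity.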
Slide 30
Event Detection Results
- Correctness of the event classification
- System performance evaluation
- Hybrid class excluded
Slide 31
Slide 32
Speaker Identification Results
- Evaluation of the adaptive speaker identification system
  - False acceptance (FA)
  - False rejection (FR)
  - Identification accuracy (IA)
Slide 33
Slide 34
Average-based, MAP-based, Viterbi-based
Slide 35
Slide 36
Movie Skimming Results
- Difficulties of qualitative evaluation
- Quantitative measure based on user study
  - 5-point scale: 1~5
  - Visual comprehension
  - Audio comprehension
  - Semantic continuity
  - Good abstraction
  - Quick browsing
  - Video skipping