
A Discriminative CNN Video Representation for Event Detection

Zhongwen Xu†, Yi Yang† and Alexander G. Hauptmann‡

†QCIS, University of Technology, Sydney

‡SCS, Carnegie Mellon University


Multimedia Event Detection

• Detect user-defined events by analyzing the visual, acoustic, and textual information in web videos

• An event is a phrase such as “Birthday party”, “Wedding ceremony”, “Making a sandwich”, or “Changing a vehicle tire”.

• Part of the TRECVID 2011-2015 benchmark, the largest video analysis competition in the world.


MED data

• Source:
  – Collected from YouTube
  – Uploaded by different users

• MEDEval 14 dataset
  – Number of events: 20
  – Training data: 32k videos, duration ~1,200 hours
  – Test data: 200k videos, duration ~8,000 hours, size ~5 TB
  – Carefully labeled by human annotators


Video analysis costs a lot

• Dense Trajectories and its enhanced version, improved Dense Trajectories (IDT), have dominated complex event detection

– superior performance over other features such as the motion feature STIP and the static appearance feature Dense SIFT

Credits: Heng Wang


Video analysis costs a lot

• Parallelized over 1,000 cores, it takes about one week to extract the IDT features for the 200,000 videos (8,000 hours of video) in the TRECVID MEDEval 14 collection

“Blacklight”, with 4,096 CPU cores and 32 TB of shared memory, at the Pittsburgh Supercomputing Center


Video analysis costs a lot

• Moreover, the I/O is a real headache: one thousand threads reading videos and writing generated features at the same time put very heavy pressure on storage.

• The whole system slows down dramatically if the I/O is not coordinated well.


Video analysis costs a lot

• Because of this prohibitive computational cost (a cluster with 1,000 cores), it is extremely difficult for smaller research groups with limited computational resources to process large-scale video datasets.

• It is therefore important to propose an efficient representation for complex event detection that needs only affordable computational resources, e.g., a single machine.


Turn to CNNs?

• One natural idea is to utilize deep learning, especially Convolutional Neural Networks (CNNs), given their
  – overwhelming accuracy in image analysis
  – fast processing speed, achieved by leveraging the massive parallel processing power of GPUs


Turn to CNNs?

• However, it has been reported that the event detection performance of CNN-based video representations is WORSE than improved Dense Trajectories in TRECVID MED 2013.


Average Pooling for Videos

Winning solution for the TRECVID MED 2013 competition


Average Pooling of CNN frame features

• Convolutional Neural Networks (CNNs) with the standard approach (average pooling) to generate a video representation from frame-level features (a minimal sketch of this baseline follows the table below)

• What’s wrong with CNN video representation?

Feature                       MEDTest 13   MEDTest 14
Improved Dense Trajectories   34.0         27.6
CNN in CMU@MED 2013           29.0         N.A.
CNN from VGG-16               32.7         24.8

(mAP in percentage)
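
As a quick illustration of the average-pooling baseline above, here is a minimal NumPy sketch, assuming per-frame CNN features (e.g., fc7 activations) have already been extracted; the function name is ours, not from the paper.

import numpy as np

def average_pool_video(frame_features):
    """Average (num_frames, D) frame-level CNN features into one D-dim video vector."""
    video_vec = np.asarray(frame_features, dtype=np.float64).mean(axis=0)
    # L2-normalize so videos of different lengths yield comparable vectors.
    return video_vec / (np.linalg.norm(video_vec) + 1e-12)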


Video Pooling on CNN Descriptors

• Video pooling computes video representation over the whole video by pooling all the descriptors from all the frames in a video.

• For local descriptors like HOG, HOF, MBH in improved Dense Trajectories, the Fisher vector (FV) or Vector of Locally Aggregated Descriptors (VLAD) is applied to generate the video representation.

• To our knowledge, this is the first work on video pooling of CNN descriptors; we extend encoding methods such as FV and VLAD from local descriptors to CNN descriptors in video analysis (a minimal VLAD sketch follows).
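
Below is a minimal sketch of VLAD encoding over per-frame CNN descriptors, assuming the descriptors are already extracted and a k-means vocabulary is trained offline; the vocabulary size and helper names are illustrative rather than the paper's exact settings.

import numpy as np
from sklearn.cluster import KMeans

def train_codebook(sampled_descriptors, k=256, seed=0):
    """Learn the K VLAD cluster centers from a sample of descriptors."""
    return KMeans(n_clusters=k, random_state=seed).fit(sampled_descriptors).cluster_centers_

def vlad_encode(descriptors, centers):
    """Encode an (N, D) set of frame descriptors into a (K*D,) VLAD vector."""
    k, d = centers.shape
    # Hard-assign each descriptor to its nearest center.
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    vlad = np.zeros((k, d))
    for i in range(k):
        if np.any(assignments == i):
            # Accumulate residuals between descriptors and their assigned center.
            vlad[i] = (descriptors[assignments == i] - centers[i]).sum(axis=0)
    vlad = vlad.ravel()
    # Power (signed square-root) normalization, then L2 normalization.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    return vlad / (np.linalg.norm(vlad) + 1e-12)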


Illustration of VLAD encoding

Credits: Prateek Joshi


Discriminative Ability Analysis on Training Set of TRECVID MEDTest 14


Results

                  fc6     fc6_relu   fc7     fc7_relu
Average pooling   19.8    24.8       18.8    23.8
Fisher vector     28.3    28.4       27.4    29.1
VLAD              33.1    32.6       33.2    31.5

Table: Performance comparison (mAP in percentage) on MEDTest 14 100Ex

Figure: Performance comparisons on MEDTest 13 and MEDTest 14, both 100Ex and 10Ex


Results

• For further reference, we provide the performance of a number of widely used features on MEDTest 14 for comparison

• MEDTest 14, 100Ex:
  – MoSIFT with Fisher vector achieves mAP 18.1%
  – STIP with Fisher vector achieves mAP 15.0%
  – CSIFT with Fisher vector achieves mAP 14.7%
  – IDT with Fisher vector achieves mAP 27.6%
  – Our single layer achieves mAP 33.2%

• Note that with VLAD-encoded CNN descriptors, we achieve better performance (mAP 20.8%) with 10Ex than relatively weaker features such as MoSIFT, STIP, and CSIFT with 100Ex!


Features from Convolutional Layers

Credits: Matthew Zeiler


Latent Concept Descriptors (LCD)

• Convolutional filters can be regarded as generalized linear classifiers on the underlying data patches, and each convolutional filter corresponds to a latent concept.

• From this interpretation, a pool5 layer of size a×a×M can be converted into a² latent concept descriptors of M dimensions each. Each latent concept descriptor represents the responses of the M filters at a specific pooling location (a minimal sketch follows).
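
A minimal sketch of this conversion, assuming the pool5 feature map is given as a channels-first NumPy array of shape (M, a, a), as Caffe produces (e.g., roughly 512×7×7 for VGG-16 at its default input size):

import numpy as np

def pool5_to_lcd(pool5_map):
    """Turn an (M, a, a) pool5 map into a*a latent concept descriptors of dimension M."""
    m, a, a2 = pool5_map.shape
    assert a == a2, "expected a square spatial grid"
    # Each spatial location contributes one M-dimensional latent concept descriptor.
    return pool5_map.reshape(m, a * a).T  # shape: (a*a, M)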




LCD with SPP

• LCD can be combined with a Spatial Pyramid Pooling (SPP) layer to enrich the visual information at only a marginal increase in computation cost.

• The last convolutional layer is pooled into 6×6, 3×3, 2×2, and 1×1 grids, each cell described by the responses of the M filters (see the sketch below).
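
The sketch below illustrates this pyramid pooling, assuming the last convolutional layer is an (M, H, W) NumPy array with H and W at least as large as the biggest grid, and max pooling within each cell; treat it as an illustration rather than the exact SPP layer used in the paper.

import numpy as np

def spp_descriptors(conv_map, levels=(6, 3, 2, 1)):
    """Pool an (M, H, W) conv map over each pyramid grid; every cell yields one M-dim descriptor."""
    m, h, w = conv_map.shape
    descriptors = []
    for n in levels:
        # Split the spatial extent into an n x n grid of (possibly uneven) cells.
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                cell = conv_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                descriptors.append(cell.max(axis=(1, 2)))  # max pooling within the cell
    return np.stack(descriptors)  # shape: (36 + 9 + 4 + 1, M) = (50, M)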


LCD Results on pool5

                  100Ex   10Ex
Average pooling   31.2    18.8
LCD_VLAD          38.2    25.0
LCD_VLAD + SPP    40.3    25.6

Table 1: Performance comparisons for pool5 on MEDTest 13

                  100Ex   10Ex
Average pooling   24.6    15.3
LCD_VLAD          33.9    22.8
LCD_VLAD + SPP    35.7    23.2

Table 2: Performance comparisons for pool5 on MEDTest 14


LCD for image analysis

• Deep filter banks for texture recognition and segmentation, M. Cimpoi, S. Maji and A. Vedaldi, in CVPR, 2015 (Oral)

• Deep Spatial Pyramid: The Devil is Once Again in the Details, B. Gao, X. Wei, J. Wu and W. Lin, arXiv, 2015


Comparison with the previous best feature (IDT)

                     Ours   IDT    Relative improvement
MEDTest 13, 100Ex    44.6   34.0   31.2%
MEDTest 13, 10Ex     29.8   18.0   65.6%
MEDTest 14, 100Ex    36.8   27.6   33.3%
MEDTest 14, 10Ex     24.5   13.9   76.3%


Comparison to the state-of-the-art Systems on MEDTest 13

• Natarajan et al. report mAP 38.5% on 100Ex and 17.9% on 10Ex for their whole visual system, which combines all their low-level visual features.

• Lan et al. report mAP 39.3% on 100Ex for their whole system, including non-visual features.

• Our results achieve 44.6% mAP on 100Ex and 29.8% mAP on 10Ex
• Ours + IDT + MFCC achieves 48.6% mAP on 100Ex and 32.2% mAP on 10Ex
  – Our single feature beats state-of-the-art MED systems that combine more than 20 features
  – Our lightweight system (static + motion + acoustic features) sets a high standard for MED


Notes

• The proposed representation is extensible: performance can be further improved by better CNN models, appropriate fine-tuning, or better descriptor encoding techniques.

• The proposed representation is very general for video analysis and not limited to multimedia event detection. We tested on the MED datasets since they are the largest available video analysis datasets in the world.

• The proposed representation is simple but very effective; it is easy to generate using the Caffe/cxxnet/cuda-convnet toolkits (for the CNN features) and the vlfeat/Yael toolkits (for the encoding). A minimal end-to-end sketch follows.
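
To make the "simple to generate" claim concrete, here is a minimal end-to-end sketch under stated assumptions: extract_frames is a hypothetical frame sampler, extract_cnn_descriptors stands in for a Caffe/cxxnet forward pass (fc or LCD descriptors), and the encoder can be the vlad_encode function from the earlier VLAD sketch.

import numpy as np

def video_representation(video_path, extract_frames, extract_cnn_descriptors, encode):
    """Frames -> per-frame CNN descriptors -> one encoded video vector (e.g., VLAD)."""
    descriptors = [extract_cnn_descriptors(frame)                    # hypothetical CNN feature extractor
                   for frame in extract_frames(video_path, fps=1)]   # hypothetical frame sampler
    return encode(np.vstack(descriptors))                            # e.g., lambda X: vlad_encode(X, centers)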


THUMOS’ 15 Action Recognition ChallengeAction Recognition in Temporally Untrimmed Videos!

A new forward-looking dataset containing over 430 hours of video data and 45 million frames (70% larger than THUMOS‘14) with the following components is made available under this challenge:Training Set: over 13,000 temporally trimmed videos from 101 action classes.Validation Set: Over 2100 temporally untrimmed videos with temporal annotations of actions.Background Set: Approximately 3000 relevant videos guaranteed to not include any instance of the 101 actions.Test Set: Over 5600 temporally untrimmed videos with withheld ground truth.


Results for THUMOS'15 validation set

• Setting:
  – Training data: training part only (UCF-101)
  – Testing data: validation part in 2015
• Linear SVM with C = 100, using the LIBSVM toolkit (a minimal sketch follows the table below)
• Metric: mean Average Precision (mAP)
• Comparison between average pooling and VLAD encoding:

                   fc6     fc7
Average pooling*   0.521   0.493
VLAD encoding      0.589   0.566

* In average pooling, we utilize the layer after ReLU since it shows better performance
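
A minimal sketch of the classification setup above, assuming the video features and per-class binary labels are already computed; scikit-learn's LinearSVC is used here as a stand-in for the LIBSVM linear SVM mentioned on the slide.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

def evaluate_map(train_x, train_y, test_x, test_y, C=100.0):
    """Train one-vs-rest linear SVMs (C = 100) and report mean Average Precision."""
    aps = []
    for cls in range(train_y.shape[1]):                   # train_y: (n_train, n_classes), 0/1 labels
        clf = LinearSVC(C=C).fit(train_x, train_y[:, cls])
        scores = clf.decision_function(test_x)            # continuous scores used for ranking
        aps.append(average_precision_score(test_y[:, cls], scores))
    return float(np.mean(aps))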


Results for THUMOS’15 validation set

Performance from VLAD encoded CNN features:

       fc6     fc7     LCD
mAP    0.589   0.566   0.619


Results for THUMOS'15 validation set

• LCD with a better CNN model: GoogLeNet with Batch Normalization (Inception v2)
  – Batch Normalization: Ioffe and Szegedy, ICML 2015
  – Trained with the cxxnet toolkit* on 4 NVIDIA K20 GPUs
  – Timing: about 4.5 days (~40 epochs); for reference, VGG-16 takes 2-3 weeks to train on 4 GPUs
  – Achieves the same performance as the single network of Google's ILSVRC 2014 submission

* with good multi-GPU training support

       LCD from VGG-16   LCD from Inception v2
mAP    0.619             0.628


Results for THUMOS'15 validation set

       LCD with Inception v2   Multi-skip IDT   FlowNet
mAP    0.628                   0.529            0.416

With late fusion of the prediction scores, we achieve mAP 0.689 (a minimal fusion sketch follows).
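
The slides do not specify the fusion scheme, so the following is only a generic sketch of late fusion: per-feature prediction scores are normalized and combined with a weighted average; the normalization and equal weights are assumptions, not the submission's actual settings.

import numpy as np

def late_fuse(score_matrices, weights=None):
    """Fuse (num_videos, num_classes) score matrices from different features by weighted averaging."""
    # Z-score each feature's scores per class so their scales are comparable.
    normalized = [(s - s.mean(axis=0)) / (s.std(axis=0) + 1e-12) for s in score_matrices]
    if weights is None:
        weights = [1.0 / len(normalized)] * len(normalized)
    return sum(w * s for w, s in zip(weights, normalized))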


THUMOS’15 Ranking

Rank   Entry                      Best Result
1      UTS & CMU                  0.7384
2      MSR Asia (MSM)             0.6897
3      Zhejiang University*       0.6876
4      INRIA_LEAR*                0.6814
5      CUHK & SIAT                0.6803
6      University of Amsterdam    0.6798

* Utilized our CVPR 2015 paper as the main system component


Shared features

• We share all the features for the MED and THUMOS 2015 datasets

• You can download the features via Dropbox / Baidu Yun. Links are on my homepage.

• The features can be used for machine learning / pattern recognition tasks


Thanks