
A Discriminative CNN Video Representation for Event Detection

Zhongwen Xu†, Yi Yang† and Alexander G. Hauptmann‡

†QCIS, University of Technology, Sydney

‡SCS, Carnegie Mellon University


Multimedia Event Detection

• Detect user-defined events by analyzing the visual, acoustic, and textual information in web videos

• An event is a phrase such as “Birthday party”, “Wedding ceremony”, “Making a sandwich”, or “Changing a vehicle tire”.

• Part of the TRECVID 2011-2015 benchmark, the largest video analysis competition in the world.


MED data

• Source:
  – Collected from YouTube
  – Uploaded by different users

• MEDEval 14 dataset
  – Number of events: 20
  – Training data: 32k videos, duration ~1,200 hours
  – Test data: 200k videos, duration ~8,000 hours, size ~5 TB
  – Carefully labeled by human annotators


Video analysis costs a lot

• Dense Trajectories and its enhanced version, improved Dense Trajectories (IDT), have dominated complex event detection

– superior performance over other features such as the motion feature STIP and the static appearance feature Dense SIFT

Credits: Heng Wang


Video analysis costs a lot

• Parallelized over 1,000 cores, it takes about one week to extract the IDT features for the 200,000 videos (8,000 hours of video) in the TRECVID MEDEval 14 collection

“Blacklight”, with 4,096 CPU cores and 32 TB of shared memory, at the Pittsburgh Supercomputing Center


Video analysis costs a lot

• Moreover, the I/O is a real headache: one thousand threads reading videos and writing generated features at the same time put very heavy pressure on storage.

• The whole system slows down dramatically if the I/O is not coordinated well.


Video analysis costs a lot

• Because of this prohibitive computational cost (a cluster with 1,000 cores), it is extremely difficult for smaller research groups with limited computational resources to process large-scale video datasets.

• It is therefore important to propose an efficient representation for complex event detection that needs only affordable computational resources, e.g., a single machine.


Turn to CNNs?

• One natural idea is to utilize deep learning, especially Convolutional Neural Networks (CNNs), given their
  – overwhelming accuracy in image analysis
  – fast processing speed, achieved by leveraging the massive parallel processing power of GPUs


Turn to CNNs?

• However, it has been reported that the event detection performance of CNN-based video representations is WORSE than improved Dense Trajectories in TRECVID MED 2013.


Average Pooling for Videos

Winning solution for the TRECVID MED 2013 competition


Average Pooling of CNN frame features

• Convolutional Neural Networks (CNNs) with the standard approach (average pooling) to generate a video representation from frame-level features (a minimal sketch of this baseline follows the table below)

• What’s wrong with CNN video representation?

Feature                       MEDTest 13   MEDTest 14
Improved Dense Trajectories   34.0         27.6
CNN in CMU@MED 2013           29.0         N.A.
CNN from VGG-16               32.7         24.8

(mAP in percentage)
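
As a quick illustration of the average-pooling baseline above, here is a minimal NumPy sketch, assuming per-frame CNN features (e.g., fc7 activations) have already been extracted; the function name is ours, not from the paper.

import numpy as np

def average_pool_video(frame_features):
    """Average (num_frames, D) frame-level CNN features into one D-dim video vector."""
    video_vec = np.asarray(frame_features, dtype=np.float64).mean(axis=0)
    # L2-normalize so videos of different lengths yield comparable vectors.
    return video_vec / (np.linalg.norm(video_vec) + 1e-12)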


Video Pooling on CNN Descriptors

• Video pooling computes video representation over the whole video by pooling all the descriptors from all the frames in a video.

• For local descriptors like HOG, HOF, MBH in improved Dense Trajectories, the Fisher vector (FV) or Vector of Locally Aggregated Descriptors (VLAD) is applied to generate the video representation.

• To our knowledge, this is the first work on video pooling of CNN descriptors; we extend encoding methods such as FV and VLAD from local descriptors to CNN descriptors in video analysis (a minimal VLAD sketch follows).
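
Below is a minimal sketch of VLAD encoding over per-frame CNN descriptors, assuming the descriptors are already extracted and a k-means vocabulary is trained offline; the vocabulary size and helper names are illustrative rather than the paper's exact settings.

import numpy as np
from sklearn.cluster import KMeans

def train_codebook(sampled_descriptors, k=256, seed=0):
    """Learn the K VLAD cluster centers from a sample of descriptors."""
    return KMeans(n_clusters=k, random_state=seed).fit(sampled_descriptors).cluster_centers_

def vlad_encode(descriptors, centers):
    """Encode an (N, D) set of frame descriptors into a (K*D,) VLAD vector."""
    k, d = centers.shape
    # Hard-assign each descriptor to its nearest center.
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    vlad = np.zeros((k, d))
    for i in range(k):
        if np.any(assignments == i):
            # Accumulate residuals between descriptors and their assigned center.
            vlad[i] = (descriptors[assignments == i] - centers[i]).sum(axis=0)
    vlad = vlad.ravel()
    # Power (signed square-root) normalization, then L2 normalization.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    return vlad / (np.linalg.norm(vlad) + 1e-12)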


Illustration of VLAD encoding

Credits: Prateek Joshi


Discriminative Ability Analysis on Training Set of TRECVID MEDTest 14


Results

                  fc6     fc6_relu   fc7     fc7_relu
Average pooling   19.8    24.8       18.8    23.8
Fisher vector     28.3    28.4       27.4    29.1
VLAD              33.1    32.6       33.2    31.5

Table: Performance comparison (mAP in percentage) on MEDTest 14 100Ex

Figure: Performance comparisons on MEDTest 13 and MEDTest 14, both 100Ex and 10Ex


Results

• For further reference, we provide the performance of a number of widely used features on MEDTest 14 for comparison

• MEDTest 14, 100Ex:
  – MoSIFT with Fisher vector achieves mAP 18.1%
  – STIP with Fisher vector achieves mAP 15.0%
  – CSIFT with Fisher vector achieves mAP 14.7%
  – IDT with Fisher vector achieves mAP 27.6%
  – Our single layer achieves mAP 33.2%

• Note that with VLAD-encoded CNN descriptors, we achieve better performance (mAP 20.8%) with 10Ex than relatively weaker features such as MoSIFT, STIP, and CSIFT with 100Ex!


Features from Convolutional Layers

Credits: Matthew Zeiler


Latent Concept Descriptors (LCD)

• Convolutional filters can be regarded as generalized linear classifiers on the underlying data patches, and each convolutional filter corresponds to a latent concept.

• From this interpretation, a pool5 layer of size a×a×M can be converted into a² latent concept descriptors of M dimensions each. Each latent concept descriptor represents the responses of the M filters at a specific pooling location (a minimal sketch follows).
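
A minimal sketch of this conversion, assuming the pool5 feature map is given as a channels-first NumPy array of shape (M, a, a), as Caffe produces (e.g., roughly 512×7×7 for VGG-16 at its default input size):

import numpy as np

def pool5_to_lcd(pool5_map):
    """Turn an (M, a, a) pool5 map into a*a latent concept descriptors of dimension M."""
    m, a, a2 = pool5_map.shape
    assert a == a2, "expected a square spatial grid"
    # Each spatial location contributes one M-dimensional latent concept descriptor.
    return pool5_map.reshape(m, a * a).T  # shape: (a*a, M)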




LCD with SPP

• LCD can be combined with a Spatial Pyramid Pooling (SPP) layer to enrich the visual information at only a marginal increase in computation cost.

• The last convolutional layer is pooled into 6×6, 3×3, 2×2, and 1×1 grids, each cell described by the responses of the M filters (see the sketch below).
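
The sketch below illustrates this pyramid pooling, assuming the last convolutional layer is an (M, H, W) NumPy array with H and W at least as large as the biggest grid, and max pooling within each cell; treat it as an illustration rather than the exact SPP layer used in the paper.

import numpy as np

def spp_descriptors(conv_map, levels=(6, 3, 2, 1)):
    """Pool an (M, H, W) conv map over each pyramid grid; every cell yields one M-dim descriptor."""
    m, h, w = conv_map.shape
    descriptors = []
    for n in levels:
        # Split the spatial extent into an n x n grid of (possibly uneven) cells.
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                cell = conv_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                descriptors.append(cell.max(axis=(1, 2)))  # max pooling within the cell
    return np.stack(descriptors)  # shape: (36 + 9 + 4 + 1, M) = (50, M)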


LCD Results on pool5

                  100Ex   10Ex
Average pooling   31.2    18.8
LCD_VLAD          38.2    25.0
LCD_VLAD + SPP    40.3    25.6

Table 1: Performance comparisons for pool5 on MEDTest 13

                  100Ex   10Ex
Average pooling   24.6    15.3
LCD_VLAD          33.9    22.8
LCD_VLAD + SPP    35.7    23.2

Table 2: Performance comparisons for pool5 on MEDTest 14


LCD for image analysis

• Deep filter banks for texture recognition and segmentation, M. Cimpoi, S. Maji and A. Vedaldi, in CVPR, 2015 (Oral)

• Deep Spatial Pyramid: The Devil is Once Again in the Details, B. Gao, X. Wei, J. Wu and W. Lin, arXiv, 2015


Comparison with the previous best feature (IDT)

                     Ours   IDT    Relative improvement
MEDTest 13, 100Ex    44.6   34.0   31.2%
MEDTest 13, 10Ex     29.8   18.0   65.6%
MEDTest 14, 100Ex    36.8   27.6   33.3%
MEDTest 14, 10Ex     24.5   13.9   76.3%


Comparison to the state-of-the-art Systems on MEDTest 13

• Natarajan et al. report mAP 38.5% on 100Ex and 17.9% on 10Ex for their whole visual system, which combines all their low-level visual features.

• Lan et al. report mAP 39.3% on 100Ex for their whole system, including non-visual features.

• Our results achieve 44.6% mAP on 100Ex and 29.8% mAP on 10Ex
• Ours + IDT + MFCC achieves 48.6% mAP on 100Ex and 32.2% mAP on 10Ex
  – Our single feature beats state-of-the-art MED systems that combine more than 20 features
  – Our lightweight system (static + motion + acoustic features) sets a high standard for MED


Notes

• The proposed representation is extensible: performance can be further improved by better CNN models, appropriate fine-tuning, or better descriptor encoding techniques.

• The proposed representation is very general for video analysis and not limited to multimedia event detection. We tested on the MED datasets since they are the largest available video analysis datasets in the world.

• The proposed representation is simple but very effective; it is easy to generate using the Caffe/cxxnet/cuda-convnet toolkits (for the CNN features) and the vlfeat/Yael toolkits (for the encoding). A minimal end-to-end sketch follows.
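
To make the "simple to generate" claim concrete, here is a minimal end-to-end sketch under stated assumptions: extract_frames is a hypothetical frame sampler, extract_cnn_descriptors stands in for a Caffe/cxxnet forward pass (fc or LCD descriptors), and the encoder can be the vlad_encode function from the earlier VLAD sketch.

import numpy as np

def video_representation(video_path, extract_frames, extract_cnn_descriptors, encode):
    """Frames -> per-frame CNN descriptors -> one encoded video vector (e.g., VLAD)."""
    descriptors = [extract_cnn_descriptors(frame)                    # hypothetical CNN feature extractor
                   for frame in extract_frames(video_path, fps=1)]   # hypothetical frame sampler
    return encode(np.vstack(descriptors))                            # e.g., lambda X: vlad_encode(X, centers)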


THUMOS’ 15 Action Recognition ChallengeAction Recognition in Temporally Untrimmed Videos!

A new forward-looking dataset containing over 430 hours of video data and 45 million frames (70% larger than THUMOS‘14) with the following components is made available under this challenge:Training Set: over 13,000 temporally trimmed videos from 101 action classes.Validation Set: Over 2100 temporally untrimmed videos with temporal annotations of actions.Background Set: Approximately 3000 relevant videos guaranteed to not include any instance of the 101 actions.Test Set: Over 5600 temporally untrimmed videos with withheld ground truth.


Results for THUMOS'15 validation set

• Setting:
  – Training data: training part only (UCF-101)
  – Testing data: validation part in 2015
• Linear SVM with C = 100, using the LIBSVM toolkit (a minimal sketch follows the table below)
• Metric: mean Average Precision (mAP)
• Comparison between average pooling and VLAD encoding:

                   fc6     fc7
Average pooling*   0.521   0.493
VLAD encoding      0.589   0.566

* In average pooling, we utilize the layer after ReLU since it shows better performance
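
A minimal sketch of the classification setup above, assuming the video features and per-class binary labels are already computed; scikit-learn's LinearSVC is used here as a stand-in for the LIBSVM linear SVM mentioned on the slide.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

def evaluate_map(train_x, train_y, test_x, test_y, C=100.0):
    """Train one-vs-rest linear SVMs (C = 100) and report mean Average Precision."""
    aps = []
    for cls in range(train_y.shape[1]):                   # train_y: (n_train, n_classes), 0/1 labels
        clf = LinearSVC(C=C).fit(train_x, train_y[:, cls])
        scores = clf.decision_function(test_x)            # continuous scores used for ranking
        aps.append(average_precision_score(test_y[:, cls], scores))
    return float(np.mean(aps))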


Results for THUMOS’15 validation set

Performance from VLAD encoded CNN features:

       fc6     fc7     LCD
mAP    0.589   0.566   0.619


Results for THUMOS'15 validation set

• LCD with a better CNN model: GoogLeNet with Batch Normalization (Inception v2)
  – Batch Normalization: Ioffe and Szegedy, ICML 2015
  – Trained with the cxxnet toolkit* on 4 NVIDIA K20 GPUs
  – Timing: about 4.5 days (~40 epochs); for reference, VGG-16 takes 2-3 weeks to train on 4 GPUs
  – Achieves the same performance as the single network of Google's ILSVRC 2014 submission

* with good multi-GPU training support

       LCD from VGG-16   LCD from Inception v2
mAP    0.619             0.628


Results for THUMOS'15 validation set

       LCD with Inception v2   Multi-skip IDT   FlowNet
mAP    0.628                   0.529            0.416

With late fusion of the prediction scores, we achieve mAP 0.689 (a minimal fusion sketch follows).
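
The slides do not specify the fusion scheme, so the following is only a generic sketch of late fusion: per-feature prediction scores are normalized and combined with a weighted average; the normalization and equal weights are assumptions, not the submission's actual settings.

import numpy as np

def late_fuse(score_matrices, weights=None):
    """Fuse (num_videos, num_classes) score matrices from different features by weighted averaging."""
    # Z-score each feature's scores per class so their scales are comparable.
    normalized = [(s - s.mean(axis=0)) / (s.std(axis=0) + 1e-12) for s in score_matrices]
    if weights is None:
        weights = [1.0 / len(normalized)] * len(normalized)
    return sum(w * s for w, s in zip(weights, normalized))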


THUMOS’15 Ranking

Rank   Entry                      Best Result
1      UTS & CMU                  0.7384
2      MSR Asia (MSM)             0.6897
3      Zhejiang University*       0.6876
4      INRIA_LEAR*                0.6814
5      CUHK & SIAT                0.6803
6      University of Amsterdam    0.6798

* Utilized our CVPR 2015 paper as the main system component


Shared features

• We share all the features for the MED and THUMOS 2015 datasets

• You can download the features via Dropbox / Baidu Yun. Links are on my homepage.

• The features can be used for machine learning / pattern recognition tasks


Thanks