Understanding and Predicting Interestingness of Videos

Yu-Gang Jiang, Yanran Wang, Rui Feng, Hanfang Yang, Yingbin Zheng, Xiangyang Xue

School of Computer Science, Fudan University, Shanghai, China
AAAI 2013, Bellevue, USA

Applications:
• Web Video Search
• Video Recommendation System

Related Work:
• There are a few studies on predicting the aesthetics and interestingness of images

Key Idea: build a computational model that, given two videos, predicts which one is more interesting.

Contributions:
• Conducted a pilot study on video interestingness
• Built two new datasets to support this study
• Evaluated a large number of features and obtained interesting observations

The problem:
Can a computational model automatically analyze video contents and predict the interestingness of videos?

We conduct a pilot study on this problem and demonstrate a simple method to identify more interesting videos.

[Figure: a pair of videos compared (VS.)]

Two New Datasets
Flickr Dataset:
• Source: Flickr.com
• Video Type: Consumer Videos
• Video Number: 1200
• Categories: 15 (basketball, beach…)
• Duration: 20 hrs in total
• Label: Top 10% as interesting videos; Bottom 10% as uninteresting
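The Flickr labeling rule can be sketched in a few lines. This is only an illustration, assuming the videos are already sorted from most to least interesting; the function name `label_by_rank` and the choice to leave the middle of the list unlabeled are assumptions, not details stated on the poster.

```python
def label_by_rank(ranked_ids, fraction=0.10):
    """Label the top `fraction` of a ranked list 1 (interesting) and the
    bottom `fraction` 0 (uninteresting); videos in between get no label."""
    n = len(ranked_ids)
    k = int(n * fraction)  # number of videos taken from each end
    labels = {}
    for vid in ranked_ids[:k]:
        labels[vid] = 1
    for vid in ranked_ids[n - k:]:
        labels[vid] = 0
    return labels


# Hypothetical ranked list of 20 video ids, most interesting first.
example_labels = label_by_rank(list(range(20)))
```

With 20 ranked videos, the two at the top are labeled interesting and the two at the bottom uninteresting.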

YouTube Dataset:
• Source: YouTube.com
• Video Type: Advertisements
• Video Number: 420
• Categories: 14 (food, drink…)
• Duration: 4.2 hrs in total
• Label: 10 human assessors compare video pairs

Prediction & Evaluation
Computational Framework:
• Aim: train a model to compare the interestingness of two videos

Features:
• Visual, audio, and high-level attribute features

Prediction:
• Adopt Joachims’ Ranking SVM (Joachims 2003) to train prediction models
• For both datasets, use 2/3 of the videos for training and 1/3 for testing
• Fuse multiple features by kernel-level fusion with equal weights
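The pairwise idea behind a ranking SVM can be illustrated with a small sketch. It is not Joachims’ implementation and does not use kernels: it assumes a linear scoring function trained by plain subgradient descent on a pairwise hinge loss. The function names, hyperparameters, and toy feature vectors are all hypothetical.

```python
def train_linear_rank_svm(features, pairs, epochs=200, lr=0.1, reg=0.01):
    """Learn w so that w.x_i > w.x_j for each (i, j) in `pairs`,
    where video i is preferred over video j.

    Minimizes reg/2 * ||w||^2 + mean(max(0, 1 - w.(x_i - x_j)))
    by subgradient descent (a stand-in for a real SVM solver).
    """
    dim = len(features[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [reg * wk for wk in w]  # gradient of the regularizer
        for i, j in pairs:
            diff = [a - b for a, b in zip(features[i], features[j])]
            margin = sum(wk * dk for wk, dk in zip(w, diff))
            if margin < 1.0:  # this pair violates the margin
                for k in range(dim):
                    grad[k] -= diff[k] / len(pairs)
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


def score(w, x):
    """Predicted interestingness score; higher means more interesting."""
    return sum(wk * xk for wk, xk in zip(w, x))


# Toy data: three videos with 2-D features; video 0 > video 1 > video 2.
toy_features = [[1.0, 0.0], [0.5, 0.2], [0.0, 1.0]]
toy_pairs = [(0, 1), (1, 2), (0, 2)]
toy_w = train_linear_rank_svm(toy_features, toy_pairs)
toy_scores = [score(toy_w, x) for x in toy_features]
```

To compare two test videos, score both and predict that the one with the higher score is more interesting.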

Evaluation:
• Accuracy (the percentage of correctly ranked test video pairs)
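The accuracy measure can be written down directly. The sketch below assumes each test video has a predicted score and a ground-truth label where a higher label means more interesting, and that pairs with equal labels carry no preference and are skipped; the function name and these conventions are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_accuracy(scores, labels):
    """Fraction of test pairs with a ground-truth preference whose
    predicted order agrees with that preference."""
    correct = 0
    total = 0
    for i, j in combinations(range(len(scores)), 2):
        if labels[i] == labels[j]:
            continue  # no ground-truth preference for this pair
        total += 1
        # Same sign means predicted and true orderings agree.
        if (scores[i] - scores[j]) * (labels[i] - labels[j]) > 0:
            correct += 1
    return correct / total if total else 0.0


# Hypothetical scores for three test videos; only video 0 is interesting.
example_accuracy = pairwise_accuracy([0.8, 0.9, 0.1], [1, 0, 0])
```

In the example, video 0 is correctly ranked above video 2 but wrongly ranked below video 1, so one of the two comparable pairs is correct.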

[Figure: computational framework. Two videos (VS.) pass through multi-modal feature extraction (visual, audio, and high-level attribute features), multi-modal fusion, and Ranking SVM to produce results.]

Multi-modal feature extraction:
• Visual features: Color Histogram, SIFT, HOG, SSIM, GIST
• Audio features: MFCC, Spectrogram SIFT, Audio-Six
• High-level attribute features: Classemes, Objectbank, Style
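Kernel-level fusion with equal weights, as named under Prediction, can be illustrated as follows. The sketch assumes one precomputed kernel (Gram) matrix per feature, all over the same videos, and interprets "equal weights" as a uniform average; the function name and the toy matrices are hypothetical.

```python
def fuse_kernels(kernels, weights=None):
    """Combine same-sized kernel matrices by a weighted sum.
    With no weights given, every kernel gets weight 1/len(kernels)."""
    count = len(kernels)
    if weights is None:
        weights = [1.0 / count] * count
    n = len(kernels[0])
    fused = [[0.0] * n for _ in range(n)]
    for kernel, weight in zip(kernels, weights):
        for r in range(n):
            for c in range(n):
                fused[r][c] += weight * kernel[r][c]
    return fused


# Toy Gram matrices for two features over the same two videos.
k_visual = [[1.0, 0.25], [0.25, 1.0]]
k_audio = [[1.0, 0.75], [0.75, 1.0]]
k_fused = fuse_kernels([k_visual, k_audio])
```

The fused matrix can then be passed to any kernel method that accepts a precomputed kernel.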

Results
Visual Feature Results:

• Overall, the visual features perform strongly on both datasets
• Among the five features, SIFT and HOG are the most effective, and their combination performs best

Audio Feature Results:

• The three audio features are effective and complementary; combining them gives the best performance

Attribute Feature Results:

• Attribute features do not work as well as we expected, and Style in particular performs poorly. This is an interesting observation, since style is claimed to be effective for predicting image interestingness

Visual+Audio+Attribute Fusion Results:

• Fusing visual and audio features leads to substantial performance gains, with a 2.6% increase on Flickr and a 5.4% increase on YouTube. Adding attribute features is less effective

[Figure: result bar charts for the Flickr and YouTube datasets (axis ticks from 50 to 80). Visual feature panels: SIFT, HOG, SSIM, GIST, Color Histogram, SIFT+HOG, SIFT+HOG+SSIM, SIFT+HOG+GIST, SIFT+HOG+Color. Attribute feature panels: Style, Classemes, Objectbank, Style+Classemes, Classemes+Objectbank. Fusion panels: Visual (SIFT+HOG), Audio (MFCC+SS+Audio-Six), Attribute (Objectbank+Classeme), Visual+Audio, Visual+Audio+Attribute. Two bar values, 74.2 and 76.4, survive without a bar assignment.]

Datasets are available at: www.yugangjiang.info/research/interestingness

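[Bar value labels recovered from the result charts, in extraction order and not matched to individual bars: 76.6, 68.0, 74.5, 67.0, 67.1, 65.7, 64.8, 74.7, 64.5, 56.8, 71.7, 78.6, 76.6, 68.0. Annotated fusion gains: 2.6% (Flickr), 5.4% (YouTube).]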

Conclusion
We conducted a study on predicting video interestingness and built two new datasets. A large number of features were evaluated, leading to interesting observations:
• Visual and audio features are effective in predicting video interestingness
• A few features useful for image interestingness do not extend to the video domain (Style…)
