
Understanding and Predicting Interestingness of Videos


Yu-Gang Jiang, Yanran Wang, Rui Feng, Hanfang Yang, Yingbin Zheng, Xiangyang Xue

School of Computer Science, Fudan University, Shanghai, China

AAAI 2013, Bellevue, USA

The problem

Can a computational model automatically analyze video content and predict the interestingness of videos?

We conduct a pilot study on this problem and demonstrate a simple method for identifying the more interesting videos.

Applications:
• Web video search
• Video recommendation systems

Related Work:
• There are a few studies on predicting the aesthetics and interestingness of images.

Key Idea

Build a computational model that, given two videos, predicts which one is more interesting.

[Figure: two example videos compared against each other (VS.)]

Contributions:
• Conducted a pilot study on video interestingness
• Built two new datasets to support this study
• Evaluated a large number of features and obtained interesting observations

Two New Datasets

Flickr Dataset:
• Source: Flickr.com
• Video type: consumer videos
• Number of videos: 1200
• Categories: 15 (basketball, beach…)
• Duration: 20 hrs in total
• Label: top 10% as interesting videos; bottom 10% as uninteresting

YouTube Dataset:
• Source: YouTube.com
• Video type: advertisements
• Number of videos: 420
• Categories: 14 (food, drink…)
• Duration: 4.2 hrs in total
• Label: 10 human assessors compare video pairs
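The Flickr labeling rule above (top 10% interesting, bottom 10% uninteresting) can be illustrated with a small sketch. It assumes each list of video IDs is already sorted from most to least interesting; the poster does not say whether the cut is taken per category or over the whole dataset, and the function names `label_by_rank` and `make_training_pairs` are hypothetical.

```python
def label_by_rank(ranked_ids, percent=10):
    """Split a ranked list into labeled extremes.

    ranked_ids: video IDs ordered from most to least interesting (assumed input).
    percent:    size of each labeled extreme, as an integer percentage.
    Returns (interesting, uninteresting); the videos in between stay unlabeled.
    """
    k = len(ranked_ids) * percent // 100  # integer arithmetic avoids float rounding
    if k == 0:
        return [], []
    return ranked_ids[:k], ranked_ids[-k:]


def make_training_pairs(interesting, uninteresting):
    """Build ordered pairs (a, b) meaning 'a is more interesting than b'."""
    return [(a, b) for a in interesting for b in uninteresting]


ranked = [f"v{i}" for i in range(20)]          # toy ranking of 20 videos
top, bottom = label_by_rank(ranked)            # 2 videos at each extreme
pairs = make_training_pairs(top, bottom)       # 2 x 2 ordered pairs
```

Pairs of this form are the kind of input a pairwise ranking model consumes; the YouTube labels, by contrast, come directly from human comparisons of video pairs.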

Prediction & Evaluation

Computational Framework:
• Aim: train a model to compare the interestingness of two videos

Features: multi-modal (visual, audio, and high-level attribute features)

Prediction:
• Adopt Joachims' Ranking SVM (Joachims 2003) to train prediction models
• For both datasets, use 2/3 of the videos for training and 1/3 for testing
• Use kernel-level fusion with equal weights to fuse multiple features
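For reference, the standard Ranking SVM formulation can be written as below. This is the textbook form, not a statement of the exact settings used in this study; the symbols are assumptions introduced here: $P$ is the set of training pairs $(i, j)$ in which video $i$ is more interesting than video $j$, $\phi$ is the feature map induced by the kernel, $\xi_{ij}$ are slack variables, and $C$ is the regularization constant.

```latex
\begin{aligned}
\min_{w,\;\xi}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} + C \sum_{(i,j)\in P} \xi_{ij} \\
\text{s.t.}\quad & w^{\top}\bigl(\phi(x_i) - \phi(x_j)\bigr) \;\ge\; 1 - \xi_{ij},
\qquad \xi_{ij} \ge 0 \qquad \forall\,(i,j)\in P
\end{aligned}
```

At test time, video $a$ is predicted to be more interesting than video $b$ when $w^{\top}\phi(x_a) > w^{\top}\phi(x_b)$.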

Evaluation:
• Accuracy (the percentage of correctly ranked test video pairs)
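The fusion and evaluation steps above can be sketched in a few lines. This is a minimal illustration, assuming that "equal weights" means an element-wise average of per-feature kernel matrices and that a pair counts as correct only when the more interesting video gets a strictly higher score; `fuse_kernels` and `pairwise_accuracy` are hypothetical names, not code from the study.

```python
def fuse_kernels(kernels):
    """Equal-weight kernel-level fusion: element-wise average of kernel matrices.

    kernels: list of matrices (lists of rows), one per feature, all the same shape.
    """
    n_kernels = len(kernels)
    n_rows, n_cols = len(kernels[0]), len(kernels[0][0])
    return [
        [sum(K[i][j] for K in kernels) / n_kernels for j in range(n_cols)]
        for i in range(n_rows)
    ]


def pairwise_accuracy(scores, test_pairs):
    """Percentage of test pairs ranked correctly.

    scores:     dict mapping video ID to predicted interestingness score.
    test_pairs: list of (a, b) where a is labeled more interesting than b.
    """
    correct = sum(1 for a, b in test_pairs if scores[a] > scores[b])
    return 100.0 * correct / len(test_pairs)


# Toy kernels for two features over two videos.
k_visual = [[1.0, 0.0], [0.0, 1.0]]
k_audio = [[1.0, 0.5], [0.5, 1.0]]
fused = fuse_kernels([k_visual, k_audio])

# Toy scores and ground-truth pairs.
scores = {"a": 0.9, "b": 0.2, "c": 0.5}
test_pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "b")]
accuracy = pairwise_accuracy(scores, test_pairs)
```

Averaging keeps the fused kernel on the same scale as its inputs; summing the kernels instead would differ only by a constant factor.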

[Figure: computational framework. Panels: multi-modal feature extraction, multi-modal fusion, Ranking SVM, results]

Multi-modal feature extraction:
• Visual features: Color Histogram, SIFT, HOG, SSIM, GIST
• Audio features: MFCC, Spectrogram SIFT, Audio-Six
• High-level attribute features: Classemes, Objectbank, Style

Results

Visual Feature Results:
• Overall, the visual features achieve strong performance on both datasets.
• Among the five features, SIFT and HOG are very effective, and their combination performs best.

Audio Feature Results:
• The three audio features are effective and complementary; combining them gives the best performance.

Attribute Feature Results:
• Attribute features do not work as well as expected, and Style in particular performs poorly. This is an interesting observation, since Style has been reported to be effective for predicting image interestingness.

Visual+Audio+Attribute Fusion Results:
• Fusing visual and audio features leads to substantial performance gains: a 2.6% increase on Flickr and a 5.4% increase on YouTube. Adding attribute features on top is not as effective.
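The 2.6% and 5.4% figures appear to be relative gains rather than differences in percentage points. The sketch below checks this against accuracy values printed on the result charts, assuming that 76.6 and 68.0 are the visual-only accuracies on Flickr and YouTube and that 78.6 and 71.7 are the corresponding visual+audio accuracies. That assignment of numbers to bars is an assumption; the extracted text does not tie each value to a bar.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


# Assumed bar assignment: visual-only vs. visual+audio accuracy (%).
flickr_gain = relative_gain(76.6, 78.6)
youtube_gain = relative_gain(68.0, 71.7)

print(round(flickr_gain, 1), round(youtube_gain, 1))  # prints: 2.6 5.4
```

Under this reading, the absolute gains are about 2.0 points on Flickr and 3.7 points on YouTube.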

[Figure: bar charts of prediction accuracy (%) on the Flickr and YouTube datasets, one pair of panels per feature group, with axis ticks from 50 to 80. Bar labels:
Visual: SIFT, HOG, SSIM, GIST, Color Histogram, SIFT+HOG, SIFT+HOG+SSIM, SIFT+HOG+GIST, SIFT+HOG+Color
Audio: MFCC, Spectrogram SIFT, Audio-Six, MFCC+SS, MFCC+SS+Audio-Six
Attribute: Style, Classemes, Objectbank, Style+Classemes, Classemes+Objectbank
Fusion: Visual (SIFT+HOG), Audio (MFCC+SS+Audio-Six), Attribute (Objectbank+Classemes), Visual+Audio, Visual+Audio+Attribute
Two value labels survive next to the visual and audio panels: 74.2 and 76.4.]

Datasets are available at: www.yugangjiang.info/research/interestingness

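Accuracy values (%) printed on the result charts, in extraction order (the bar each value belongs to is not recoverable from this copy): 76.6, 68.0, 74.5, 67.0, 67.1, 65.7, 64.8, 74.7, 64.5, 56.8, 71.7, 78.6, 76.6, 68.0.

Gains marked on the fusion charts: 2.6% (Flickr), 5.4% (YouTube).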

Conclusion

We conducted a study on predicting video interestingness and built two new datasets. A large number of features were evaluated, leading to interesting observations:
• Visual and audio features are effective in predicting video interestingness
• A few features that are useful for image interestingness do not extend to the video domain (Style…)