
Understanding and Predicting Interestingness of Videos






Understanding and Predicting Interestingness of Videos
Yu-Gang Jiang, Yanran Wang, Rui Feng, Hanfang Yang, Yingbin Zheng, Xiangyang Xue
School of Computer Science, Fudan University, Shanghai, China

AAAI 2013, Bellevue, USA

Applications:
- Web video search
- Video recommendation systems

Related Work: There are a few prior studies on predicting the aesthetics and interestingness of images.

The key idea is to build a computational model that, given two videos, predicts which one is more interesting.

Contributions:
- Conducted a pilot study on video interestingness
- Built two new datasets to support this study
- Evaluated a large number of features, yielding interesting observations

Can a computational model automatically analyze video content and predict the interestingness of videos?

We conduct a pilot study on this problem and demonstrate a simple method to identify more interesting videos.

The Problem
Key Idea

Two New Datasets

Flickr Dataset:
- Source: Flickr.com
- Video type: consumer videos
- Number of videos: 1,200
- Categories: 15 (e.g., basketball, beach)
- Duration: 20 hrs in total
- Labels: top 10% as interesting videos; bottom 10% as uninteresting

YouTube Dataset:
- Source: YouTube.com
- Video type: advertisements
- Number of videos: 420
- Categories: 14 (e.g., food, drink)
- Duration: 4.2 hrs in total
- Labels: 10 human assessors compared video pairs

Prediction & Evaluation

Computational framework: train a model to compare the interestingness of two videos.


Prediction:
- Adopt Joachims' Ranking SVM (Joachims 2003) to train prediction models
- For both datasets, 2/3 of the videos are used for training and 1/3 for testing
- Multiple features are fused at the kernel level with equal weights
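Under the hood, Ranking SVM reduces pairwise preferences to binary classification on feature-difference vectors. Below is a minimal sketch of that reduction on made-up toy features, using a plain hinge-loss subgradient descent in place of Joachims' SVM-rank solver (the features, margin, and learning rate here are all assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for video feature vectors: in this sketch, "interestingness"
# is assumed to grow with the first feature dimension.
X = rng.normal(size=(40, 5))
scores_true = X[:, 0]

# Build preference pairs (i, j): video i is more interesting than video j.
pairs = [(i, j) for i in range(40) for j in range(40)
         if scores_true[i] > scores_true[j] + 0.5]

# Ranking SVM trains a linear scorer w so that w @ (x_i - x_j) > 0 for
# every preferred pair; here we optimize the hinge loss on the difference
# vectors with a simple subgradient descent.
w = np.zeros(5)
lr, C = 0.05, 1.0
for _ in range(200):
    for i, j in pairs:
        d = X[i] - X[j]
        if w @ d < 1.0:           # margin violated: take a hinge step
            w += lr * C * d
    w *= (1 - lr * 0.01)          # mild weight decay ~ regularization

# Fraction of training pairs the learned scorer ranks correctly
correct = sum((w @ (X[i] - X[j])) > 0 for i, j in pairs)
print(round(correct / len(pairs), 2))
```

The learned weight vector then scores unseen videos directly; ranking two videos reduces to comparing their scalar scores.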

Evaluation: accuracy, i.e., the percentage of correctly ranked test video pairs.

[Framework diagram: visual features, audio features, and high-level attribute features → multi-modal fusion → Ranking SVM → results]
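The evaluation metric above can be sketched in a few lines (the toy scores and preference pairs are made up for illustration):

```python
# Pairwise ranking accuracy: the fraction of preference pairs (i, j),
# where video i is labeled more interesting than video j, that the
# predicted scores rank in the right order.
def pairwise_accuracy(scores, pairs):
    correct = sum(scores[i] > scores[j] for i, j in pairs)
    return correct / len(pairs)

# Toy example: 3 videos with predicted scores and ground-truth preferences.
print(pairwise_accuracy([0.9, 0.2, 0.5], [(0, 1), (0, 2), (2, 1)]))  # → 1.0
```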

Multi-modal Feature Extraction

Visual features: Color Histogram, SIFT, HOG, SSIM, GIST
Audio features: MFCC, Spectrogram SIFT, Audio-Six
High-level attribute features: Classemes, ObjectBank, Style

Results

Visual Feature Results:

Overall, the visual features achieve very impressive performance on both datasets. Among the five features, SIFT and HOG are very effective, and their combination performs best.
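As a concrete illustration of the simplest visual feature listed above, a per-frame color histogram might be computed as follows. The frame shape, bin count, and normalization are assumptions of this sketch, not the paper's exact recipe:

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Concatenate per-channel histograms of an H x W x 3 uint8 frame,
    L1-normalized so frames of different sizes are comparable."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

# Toy all-black frame: every pixel falls into the first bin of each channel.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
h = color_histogram(frame)
print(len(h))  # 3 channels x 8 bins = 24 dimensions
```

A video-level descriptor would then aggregate (e.g., average) these per-frame histograms before they reach the Ranking SVM.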

Audio Feature Results:

The three audio features are effective and complementary; combining them gives the best performance.
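Complementary features are combined with the kernel-level, equal-weight fusion mentioned under Prediction. A minimal sketch, using toy 2x2 Gram matrices rather than kernels from real features:

```python
import numpy as np

def fuse_kernels(kernels):
    """Equal-weight average of N x N kernel (Gram) matrices; the fused
    kernel is then used to train a single Ranking SVM."""
    return sum(kernels) / len(kernels)

# Toy per-feature kernels over the same two videos.
K_visual = np.array([[1.0, 0.25], [0.25, 1.0]])
K_audio  = np.array([[1.0, 0.75], [0.75, 1.0]])
K = fuse_kernels([K_visual, K_audio])
print(K[0, 1])  # → 0.5
```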

Attribute Feature Results:

Attribute features do not work as well as we expected; in particular, Style performs poorly. This is an interesting observation, since Style has been claimed effective for predicting image interestingness.

Visual+Audio+Attribute Fusion Results:

Fusing visual and audio features leads to substantial performance gains: a 2.6% increase on Flickr and a 5.4% increase on YouTube. Adding attribute features is not as effective.

Datasets are available at:

Conclusion

We conducted a study on predicting video interestingness and built two new datasets. A large number of features were evaluated, leading to interesting observations:
- Visual and audio features are effective in predicting video interestingness
- A few features useful for image interestingness do not extend to the video domain (e.g., Style)