3
Semantic Indexing Using Deep CNN and GMM Supervectors Nakamasa Inoue and Koichi Shinoda, Tokyo Institute of Technology System Overview ! A hybrid system of Gaussian-mixture-model (GMM) supervectors and deep convolutional neural networks. ! Our best result was 0.299 (Mean InfAP), which is ranked 3rd among participating teams. ! Future work: audio and motion analysis using deep neural networks. Results & Conclusion Gaussian- Mixuture -Model Supervectors ! Each video shot is modeled by a GMM. Maximum a poste- riori adaptation is used to estimate parameters. ! 6 types of low-level features: Harris SIFT, Hessian SIFT, Dense SIFT, Dense HOG, Dense LBP, and MFCC. Prior GMM MAP adaptation GMM supervector Convolutional Neural Network ! The convolutional network with 16 layers in [1] is used to extract 4096 dimensional features. ! Parameters of the CNN are trained on ImageNET 2012. Video shot Score fusion Deep Convolutional Neural Network (CNN) SVM GMM Supervectors SVMs Audio & Visual Low-level features GMM TokyoTech at TRECVID 2015 CNN CNN CNN ! Features are extracted from multiple frames in each video shot by using convolutional neural networks. SVM max pooling ! video shot [1] K. Simonyan, and A. Zisserman, Very Deep Convolutional Networks for Large- Scale Image Recognition In Proc. of ICLR, 2015. Method Mean InfAP Deep CNN 0.274 GMM Supervector 0.226 Fusion 0.299 TokyoTech Runs InfAP " : Highest score in 2015 # : Ours $ : Median Input Convolution Max pooling Full connection 4096 dim. Low-level Features Mean vector of a GMM

Semantic Indexing Using Deep CNN and GMM … " : Highest score in 2015 # : Ours ... M4.-))616-4?.,-?C2E.?2.-7?*+1?.OAA 012*-0.) ... 11/25/2015 12:28:18 AM

  • Upload
    tranque

  • View
    213

  • Download
    1

Embed Size (px)

Citation preview

Semantic Indexing Using Deep CNN and GMM Supervectors Nakamasa Inoue and Koichi Shinoda, Tokyo Institute of Technology

System Overview

!  A hybrid system of Gaussian-mixture-model (GMM) supervectors and deep convolutional neural networks.

!  Our best result was 0.299 (Mean InfAP), which is ranked 3rd among participating teams. !  Future work: audio and motion analysis using deep neural networks.

Results & Conclusion

Gaussian-Mixuture-Model Supervectors

!  Each video shot is modeled by a GMM. Maximum a poste- riori adaptation is used to estimate parameters. !  6 types of low-level features: Harris SIFT, Hessian SIFT, Dense SIFT, Dense HOG, Dense LBP, and MFCC.

Prior GMM

MAP adaptation

GMM supervector

Convolutional Neural Network

!  The convolutional network with 16 layers in [1] is used to extract 4096 dimensional features. !  Parameters of the CNN are trained on ImageNET 2012.

Video shot

Score fusion

Deep Convolutional Neural Network (CNN)

SVM

GMM Supervectors

SVMs

Audio & Visual Low-level features GMM

TokyoTech at TRECVID 2015

CNN

CNN

CNN

!  Features are extracted from multiple frames in each video shot by using convolutional neural networks.

SVM

max pooling

!

video shot

[1] K. Simonyan, and A. Zisserman, Very Deep Convolutional Networks for Large- Scale Image Recognition In Proc. of ICLR, 2015.

Method Mean InfAP

Deep CNN 0.274GMM Supervector 0.226Fusion 0.299

TokyoTech Runs

InfA

P

" : Highest score in 2015 # : Ours $ : Median

Input Convolution Max pooling Full connection

4096 dim.

Low-level Features Mean vector of a GMM

!

!"#

!"$

!"%

!"&

'()*+,-./(012*- 3-+4.567-8./(012*- 9+*,2461.,-+4

!"#$"!%&'()*(!+,-./0(1234

5"&)678)*7"9(:7*'(;<)*7"=!%><"?)6;%6%&*7@%(;%)?&'()9A(;BB9%*

:;20<=-.>+,+,[email protected]+=+,+0+.'42<[email protected]+!"#$"%&'()*)+),%"-%!,./'"0"1$

CD?(;$E*%>

+%ED6*E(F(-"9&6DE7"9G(HD*D?%(I"?#E

J"@%6*$(3

;<)*7"=!%><"?)6(+%K7"9(B?"<"E)6E

J"@%6*$(1

LD6*7=H?)>%(;&"?%(HDE7"9

J"@%6*$(M

J%7K'N"?(H?)>%(;&"?%(O""E*79K

! ;%6%&*7@%(;%)?&'FGH.I6?C.?-,52*+8.E6,-40624+8.-7?-4E-E.*-J624.5*2520+80! KC60.I688.5*2E<1-.+.8+*J-.4<,L-*.2).*%><"?)66$(&"9*79D"DE.*-J624.5*2520+80

! M0.D-8-1?6N-.D-+*1C@.:-0<8?0.124?+64.+.82?.2).DE%6%EE(<?"<"E)6E

! M4.-))616-4?.,-?C2E.?2.-7?*+1?.OAA012*-0.)*2,.+.8+*J-.4<,L-*.2).2LP-1?*-J6240.2).+.6,+J-

! M1C6-N-E.+.0?+?-(2)(?C-(+*?.*-0<8?.64.2LP-1?.821+86Q+?624

! KC-.D-8-1?6N-.D-+*1C.*-0<8?0.+*-.<0-E.+0.*-J624.5*2520+8

! K2.+N26E.9"7E%.2*."NP%&*(A%Q"?>)*7"9@.)<0-.)-+?<*-.,+50.+,24J.E%@%?)6(Q?)>%E

! R718<E-.<0-8-00.5*2520+80

FGH.S".:".:"[email protected]".R".M"[email protected]".U-N-*[email protected]".V".3".D,-<8E-*[email protected]?6N-.0-+*1C.)2*.2LP-1?.*-12J46?624".'4.'[email protected]"[email protected]"GX$(GYG@.#!GZ

F#H.B".9-@.[".\[email protected]".:[email protected]".D<[email protected]+?6+8.5;*+,6E.522864J.64.E--5.124N28<?624+8.4-?I2*=0.)2*.N60<+8.*-12J46?624".'4.'RRR.K*+40+1?6240.24.]+??-*4.M4+8;060.+4E.3+1C64-.'[email protected]"G^!$(G^G%@.#!GX

! ').0;0?-,.)+680.?2.821+86Q-.64.02,-.)*+,[email protected]*./*+,-.D12*-._220?64J.I688.*-12N-*.<064J.4-6JCL2*0

_220?_220?

9+*,".3-+4.2)./(012*-0

3-?C2E W+8 K-0?

D-8-1?6N-.D-+*1C.`.D]]4-? !"$$&G !"X%X%

`.DK(:-J624.]*[email protected]<8?6(/*+,-.D12*-./<0624

!"$XG& !"XYG%

`.A-6JCL2<*(/*+,-.D12*-._220? 2RS4TU 2R4V42

! 3<8?6(/*+,-.D12*-./<0624.+4E.A-6JCL2*(/*+,-.D12*-._220?64J.6,5*2N-E.?C-.012*-

! V-.+*1C6N-E.Z*E.58+1-.+,24J.+88.?-+,0.I6?CC+*,2461.,-+42)./(012*-0

! /<?<*-.I2*=a.KC-.E-?-1?624.*-0<8?0.0?*24J8;.E-5-4E.24.b<+86?;.2).DK(:-J624.]*2520+80" ',5*2N-.DK(:-J624.]*2520+80.b<+86?;" c21+86Q+?624.I6?C2<?.*-J624.1+4E6E+?-0

! U-4-*+?-.*-J6240.)*2,.)-+?<*-.,+50 D;0?-,.2<?5<?

U*2<4E.?*<?C

!"# !"#

!$%&'

()*+%, ()*+%,

()*+%,

-.(&/0'-.(&/0' 1.(&/0'1.(&/0' 1.(&/0'1.(&/0' 1.(&/0'1.(&/0'

!$%&'!$%&'

!1123/4'&!1123/4'&(5

6'78

(5

6'78

!"#

!$%&'

599

(5

6'78

(5

6'78

!"#

!$%&'

6':+%,2;&%;%*/3*2<4

!'3'$=+>'2!'/&$?@AB

3-E6+3688._-0?

OOA>._-0?

CD?(O%E*

K*6,50._-0?

]61Dd3._-0?

eOT._

-0?

K6,-

!11,'=

!11,'=

!11,'=

!11,'=

KC-.6,+J-.)*2,.FGH

! D-8-1?6N-.D-+*1C.5*2E<1-0.+.8+*J-.4<,L-*.2).2LP-1?.*-J624.5*2520+80.)*2,.+.0?688.6,+J-

! M4.6,+J-.60.0-J,-4?-E.C6-*+*1C61+88;.I6?C.0-N-*+8.0-J,-4?+?624.0?*+?-J6-0.6418<E64J.DE%6%EE("9%E

B?%@7"DE(:"?#W(;%6%&*7@%(;%)?&'FGH.`.;<)*7)6(B$?)>7A(B""679K(9%*F#H

! !

!"#$%&"'"($&#)$*+,

!"#$%&'"()")&*#+,&'"*,&"&-#,*.#&/0

12344"'+%&,5&.#(,'")(,"/&6'&"783"9:783;<2""""""""""""""=""""""""""""""")(,">3?@ABCD"9ABCD;E2""""""""""""""=""""""""""""""")(,"/&6'&"#,*F&.#(,$"),(G"783H""""""""""""""""""""""""""""""""78CH"*6/"4?7"9:D;

!2IJ/&(A#(,$",&%,&'&6#*#J(6'"9K&L"M;

4*-JG+G" *" %('#&,J(,J" 94NO;" */*%#*#J(6" *6/" P6J5@&,'*Q" ?*.RS,(+6/" 4(/&Q" 9P?4;" *,&" +'&/" #(" G*R&"344"'+%&,5&.#(,'

-+./+-")0'#$'-123456'7%8$*9":*#'2;",$'6"$")$*+,'<=>?

3+9@*,#$*+,'+A'4*:"+B$+&/'#,:'C77'D%E"&;")$+&DD,*6"7*J":*6SH"K*R*G*'*"B6(+&H"*6/"T(J.UJ"AUJ6(/*

!"#$"%&'()*)+),%"-%!,./'"0"1$

B"$$*,F *,AGH<==IJK

VJ#U(+#"IJ/&(A#(,$ 1E2WW

VJ#U"IJ/&(A#(,$ >LMNO

IJ/&(

344"'+%&,5&.#(,'

344"'+%&,5&.#(,")(,"

:D@4?7"x

D&,G"5&.#(,"y"

IJ'+*Q"

%,(F&.#J(6"BD&-#+*Q"

%,(F&.#J(6"A

X

Y

1X

1Y

<X

<Y

EX

EY

!XOA"1XXZ-

Z5*QC+QQ

Z5*QA+[

J6)N

O<XX"9\

;

AI4

AI4'

C+'J(6

drivesafeelectriccar

IJ/&(A#(,$",&%,&'&6#*#J(6

B/D$"9'+;"&;*"P

V&".(G[J6&"IJ/&(A#(,$",&%,&'&6#*#J(6"LJ#U"#U&"344"'+%&,5&.#(,"'$'#&G

4*:"+B$+&/Q>R

]1^"!"#$%&''(#)*+,-#-#,).*/%&",'*0()'#)1.*2(('*34*04*5)&(14*6#7(&58&$9:*!*;(<*0=>8#"(7#,*?"-(77#)@*A&$*B(<C?D,"E>(*F(G&@)#8#&)*,)7*/$,)'>,8#&)*&A*?H()8'4*!20*0=>8#"(7#,.*IJKL

1"D%8$D 3+,)8%D*+,D! IJ/&(A#(,$" 'U(L'"&))&.#J5&6&''" J6"&5@&6#'" '+.U" *'" _>(.R" .QJG[J6S`H" _CJ-J6S"G+'J.*Q"" J6'#,+G&6#'`H""_O*,RJ6S""*"5&@UJ.Q&`H"_D+6J6S""G+'J.*Q"J6'#,+G&6#'`

! B#"J'"6&&/&/"#("J6.,&*'&"#U&"*G(+6#""()"#,*J6J6S"/*#*")(,"JG%,(5J6S"#U&"%&,)(,G@*6.&

344"'+%&,5&.#(,")(,"

:D@4?7"x

IJ'+*Q"%,(F&.#J(6"*6/"#&-#+*Q"%,(F&.#J(6"*,&"#,*J6&/"F(J6#Q$"+'J6S"5J/&('"*6/"#U&J,"#J#Q&'"),(G"#U&"IJ/&(@A#(,$!aT"/*#*'&#

86Q$" #U&" 5J'+*Q" %,(F&.#J(6" J'" +'&/" #(".(G%+#&"IJ/&(A#(,$",&%,&'&6#*#J(6'"()"D>ZbIB:"5J/&('

E,/"*G(6S"c"#&*G'

"!

"

!#

$$$#

O,&@#,*J6J6S"(6"#U&"IJ/&(A#(,$!aT"/*#*'&#

"!

"

!#

$$$#

IJ/&(A#(,$",&%,&'&6#*#J(6'"()"D>ZbIB:"5J/&('

IJ/&(A#(,$"

,&%,&'&6#*#J(6"s

O,&@#,*J6&/"5J'+*Q"

%,(F&.#J(6"B

IJ/&(A#(,$"

,&%,&'&6#*#J(6"ss=Bxs=Bx ŷ=As

AI4

b(G%*,J'(6"()"(+,"/J))&,&6#"'&##J6S'"J6"#U&".(6/J#J(6"()

OA"1XZ-"Z5*QA+[