Localization using Faster R-CNN and Multi-Frame Fusion
Ryosuke Yamamoto, Nakamasa Inoue, Koichi Shinoda
Tokyo Institute of Technology

Page 1: Localization using Faster R-CNN and Multi-Frame Fusion

Ryosuke Yamamoto, Nakamasa Inoue, Koichi Shinoda
Tokyo Institute of Technology

Page 2: Outline

・Motivation: detect an action concept, "Sitting Down"
・Our method: Faster R-CNN + LSTM + Re-scoring
・Annotation: frame-wise annotation for Sitting Down, key-frame annotation for other concepts
・Results: 2nd among 3 teams, best result at Sitting Down

[Figure: bar chart of F-score (axis 0 to 0.5); series: iframe_fscore, mean_pixel_fscore]

Page 3: Motivation

・The localization task focuses not only on static objects, but also on action concepts
・We focus on Sitting Down, one of the action concepts
・How to distinguish between Sitting and Sitting Down?
  → Dynamic information is important for precise detection

[Figure: example images labeled "Sitting" and "Sitting Down"]

Page 4: Our Method

・Faster R-CNN (Ren 2015)
  - Efficient object localization
・LSTM (Donahue 2015)
  - Precise action localization
  - Applied to Sitting Down
・Re-scoring (Yamamoto 2015)
  - Multi-Frame Score Fusion
  - Multi-Shot Score Boosting

[Figure: pipeline diagram along a time sequence, with blocks labeled Faster R-CNN, LSTM, Prediction, Fusion, and Boost]

Page 5: Faster R-CNN (Ren 2015)

Efficient end-to-end object localization
1. Generate region proposals by a network
2. Predict scores for each region by using CNN features

Example CNNs:
- ZFNet (Zeiler 2014) (we use)
- VGG-16 (Simonyan 2014)
- GoogLeNet (Szegedy 2015)
- ResNet (He 2016)

[Figure: Faster R-CNN architecture, with blocks labeled CNN, Region proposals, ROI Pooling, and DNN]
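
Step 2 relies on turning each proposed region into a fixed-size feature before scoring it; in Faster R-CNN this is the job of ROI pooling. The following is only a toy sketch of max-based ROI pooling on a single-channel feature map, not the authors' code. The function name `roi_max_pool`, the box convention (integer feature-map cells, exclusive end), and the 2x2 output grid are assumptions made for illustration, and the region is assumed to be at least as large as the output grid.

```python
import numpy as np

def roi_max_pool(feature_map, box, out_size=2):
    """Max-pool one region of a 2-D feature map into an out_size x out_size grid.

    box = (x0, y0, x1, y1) in feature-map cells, end-exclusive (assumed convention).
    Assumes the region has at least out_size rows and columns.
    """
    x0, y0, x1, y1 = box
    region = feature_map[y0:y1, x0:x1]
    # Bin edges that split the region into out_size roughly equal parts per axis.
    ys = np.linspace(0, region.shape[0], out_size + 1).astype(int)
    xs = np.linspace(0, region.shape[1], out_size + 1).astype(int)
    pooled = np.empty((out_size, out_size), dtype=feature_map.dtype)
    for r in range(out_size):
        for c in range(out_size):
            pooled[r, c] = region[ys[r]:ys[r + 1], xs[c]:xs[c + 1]].max()
    return pooled

feature_map = np.arange(16).reshape(4, 4)
pooled = roi_max_pool(feature_map, (0, 0, 4, 4))
```

Whatever the size of the proposal, the pooled output has the same shape, which is what lets one scoring network handle every region.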

Page 6: Long Short-Term Memory (LSTM)

An LSTM layer is introduced to Faster R-CNN
- memorizes long- and short-term information
- applied only to Sitting Down

[Figure: three steps along a time sequence, each with blocks labeled Faster R-CNN, LSTM, and Prediction]
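
To make the role of the LSTM layer concrete, here is a minimal NumPy sketch of a single LSTM layer run over a sequence of per-frame feature vectors. It is a generic textbook LSTM, not the authors' network: the gate ordering, the 16-dimensional input, the random weights, and the name `lstm_forward` are all assumptions for illustration. The hidden size of 64 simply echoes the smaller configuration mentioned on the results slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(features, W, U, b):
    """Run one LSTM layer over a (T, D) sequence and return (T, H) hidden states.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias.
    Gate order inside the stacked weights (assumed): input, forget, output, candidate.
    """
    H = U.shape[1]
    h = np.zeros(H)  # hidden state, carries short-term information
    c = np.zeros(H)  # cell state, carries long-term information
    outputs = []
    for x in features:
        z = W @ x + U @ h + b
        i = sigmoid(z[:H])
        f = sigmoid(z[H:2 * H])
        o = sigmoid(z[2 * H:3 * H])
        g = np.tanh(z[3 * H:])
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, D, H = 5, 16, 64  # 5 frames, hypothetical 16-dim features, 64 LSTM units
frames = rng.normal(size=(T, D))
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
hidden = lstm_forward(frames, W, U, b)
```

Each row of `hidden` depends on all earlier frames, which is the property needed to separate the motion of sitting down from a static sitting pose.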

Page 7: Multi-Frame and Multi-Shot (Yamamoto 2015)

・Multi-Frame Score Fusion
  Average pooling of scores over 5 frames in a shot
・Multi-Shot Score Boosting
  Add adjacent shot scores

[Figure: frames of a shot around the key-frame (I-frame), with their scores combined by "Average"]
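
The two re-scoring steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the `weight` parameter are hypothetical (the slide says only that adjacent shot scores are added), and a shot at either end of the video is treated as having a missing neighbor with score zero.

```python
import numpy as np

def fuse_frames(frame_scores):
    """Multi-frame score fusion: average the scores of the frames in one shot."""
    return float(np.mean(frame_scores))

def boost_shots(shot_scores, weight=1.0):
    """Multi-shot score boosting: add the scores of the previous and next shot.

    `weight` scales the neighbors' contribution (hypothetical parameter).
    """
    s = np.asarray(shot_scores, dtype=float)
    padded = np.pad(s, 1)  # zero score for the missing neighbor at each end
    return s + weight * (padded[:-2] + padded[2:])

# Three consecutive shots, five sampled frames each (made-up scores).
shots = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [0.9, 0.9, 0.9, 0.9, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
fused = [fuse_frames(frames) for frames in shots]
boosted = boost_shots(fused)
```

Averaging smooths out a single noisy frame, while boosting raises the score of a shot whose neighbors also look positive, on the assumption that a concept tends to persist across adjacent shots.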

Page 8: Key-Frame Annotations

Bounding-box annotation on the representative key-frame for each shot labeled as positive in collaborative annotation

Concept          #frames   #boxes
Animal            11,545    9,155
Bicycling            599    1,355
Boy                1,848    2,492
Dancing            2,118    5,199
Explosion Fire     2,483    2,402
Inst. Musician     4,923    7,229
Running              945    1,394
Sitting Down           -        -
Baby                 898      895
Skier                320      521

Page 9: I-Frame Annotations for Sitting Down

・I-frame annotation for Sitting Down to train the LSTM
・Annotation results
  #shots = 92
  #frames = 481
  #bounding-boxes = 515

* We found Sitting Down in only 92 shots among the 3K shots labeled as positive in collaborative annotation

Page 10: Results

[Figure: bar chart of F-score per run (axis 0 to 0.5); series: iframe_fscore, mean_pixel_fscore]

TokyoTech Runs

ID   Method                                     Run ID
1*   Faster R-CNN + Multi-Frame Score Fusion    fusion
2*   1 + Multi-Shot Score Boosting              boost
3*   1 + LSTM (4096 units) for Sitting Down     fusion.lstm
4*   2 + LSTM (4096 units) for Sitting Down     boost.lstm
5    2 + LSTM (64 units) for Sitting Down       (post exp.)

・2nd among 3 teams

Page 11: Results for Sitting Down

ID   Method                   I-Frame F-score   Pixel F-score
2*   Fusion + Boosting                   0.63            0.22
4*   2 + LSTM (4096 units)               0.00            0.00
5    2 + LSTM (64 units)                11.96            4.51

Best result for Sitting Down with run #2
LSTM with 4096 units (run #4) did not work
→ LSTM with 64 units (run #5) avoided over-fitting and worked in the post-submission experiment
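
The slides report F-scores without spelling out how they are computed. As a rough guide, the sketch below computes a standard F-score (harmonic mean of precision and recall) over the pixels of one predicted box and one ground-truth box. The exact evaluation protocol is not given in the slides, so the single-box setup, the end-exclusive integer box coordinates, and the names `box_mask` and `pixel_fscore` are assumptions for illustration only.

```python
import numpy as np

def box_mask(box, height, width):
    """Boolean mask of the pixels covered by box = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def pixel_fscore(pred_box, true_box, height, width):
    """F-score between the pixel sets of a predicted and a ground-truth box."""
    pred = box_mask(pred_box, height, width)
    true = box_mask(true_box, height, width)
    overlap = np.logical_and(pred, true).sum()
    if overlap == 0:
        return 0.0  # no shared pixels: precision and recall are both zero
    precision = overlap / pred.sum()
    recall = overlap / true.sum()
    return float(2 * precision * recall / (precision + recall))

# Two 4x4 boxes on an 8x8 frame that share a 2x2 corner.
score = pixel_fscore((0, 0, 4, 4), (2, 2, 6, 6), height=8, width=8)
```

A frame-level (I-frame) F-score follows the same formula, with frames counted as true or false positives instead of pixels.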

Page 12: Sitting Down

[Figure: good cases and bad cases, with system output and ground truth boxes; image captions: "Sitting down", "Moving but not sitting down", "Moving around a chair"]

Re-trained network with LSTM 64 units

Page 13: Animal, Good Results

[Figure: system output and ground truth for Faster R-CNN, Score Fusion, and Score Boosting; examples: Cat (no movement), Dog (walking)]

Page 14: Animal, Bad Results

[Figure: system output and ground truth for Faster R-CNN, Score Fusion, and Score Boosting; examples: Many animals, Bird (flying fast)]

Page 15: Others

[Figure: system output and ground truth for Faster R-CNN, Score Fusion, and Score Boosting; concepts: Bicycling, Boy]

Page 16: Others

[Figure: system output and ground truth for Faster R-CNN, Score Fusion, and Score Boosting; concepts: Dancing, Explosion Fire]

Page 17: Others

[Figure: system output and ground truth for Faster R-CNN, Score Fusion, and Score Boosting; concepts: Instrumental Musician, Running]

Page 18: Others

[Figure: system output and ground truth for Faster R-CNN, Score Fusion, and Score Boosting; concepts: Baby, Skier]

Page 19: Conclusion & Future Work

・We proposed a localization system
  - Faster R-CNN + LSTM + Re-scoring
・Manual annotation
  - 31K bounding boxes
・Results
  - 2nd among 3 teams, best result at Sitting Down
  - LSTM with 64 units was effective for Sitting Down
・Future work
  - Find a better way to localize actions