
Localization using Faster R-CNN and Multi-Frame Fusion

Ryosuke Yamamoto, Nakamasa Inoue, Koichi Shinoda
Tokyo Institute of Technology

Outline

・Motivation: detect an action concept "Sitting Down"
・Our method: Faster R-CNN + LSTM + Re-scoring
・Annotation: frame-wise annotation for Sitting Down, key-frame annotation for other concepts
・Results: 2nd among 3 teams, best result at Sitting Down

[Figure: bar chart of F-score (axis from 0 to 0.5); series: iframe_fscore and mean_pixel_fscore]

Motivation

・The localization task focuses not only on static objects, but also on action concepts
・We focus on Sitting Down, one of the action concepts
・How to distinguish between Sitting and Sitting Down? → Dynamic information is important for precise detection

[Figure: example images labeled Sitting and Sitting Down]

Our Method

・Faster R-CNN (Ren 2015)
  - Efficient object localization
・LSTM (Donahue 2015)
  - Precise action localization
  - Applied to Sitting Down
・Re-scoring (Yamamoto 2015)
  - Multi-Frame Score Fusion
  - Multi-Shot Score Boosting

[Figure: system overview along a time sequence, with blocks labeled Faster R-CNN, LSTM, Prediction, Fusion, and Boost]

Faster R-CNN (Ren 2015)

Efficient end-to-end object localization
1. Generate region proposals by a network
2. Predict scores for each region by using CNN features

Example CNNs:
- ZFNet (Zeiler 2014) ← we use
- VGG-16 (Simonyan 2014)
- GoogLeNet (Szegedy 2015)
- ResNet (He 2016)
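Step 2 relies on pooling CNN features inside each proposed region (the ROI pooling stage of Faster R-CNN). The following is a minimal pure-Python sketch of ROI max pooling on a toy 2-D feature map. The function name `roi_max_pool`, the integer binning, and the toy values are illustrative assumptions, not the authors' implementation.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of a 2-D feature map into an out_h x out_w grid.

    roi = (x0, y0, x1, y1) in feature-map cells, x1 and y1 exclusive.
    Simplified sketch: integer bin edges, one channel, no batching.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of bin i; each bin covers at least one row.
        r0 = y0 + (i * h) // out_h
        r1 = max(y0 + ((i + 1) * h) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            # Column range of bin j; each bin covers at least one column.
            c0 = x0 + (j * w) // out_w
            c1 = max(x0 + ((j + 1) * w) // out_w, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4x4 feature map (hypothetical values).
fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
whole = roi_max_pool(fmap, (0, 0, 4, 4))   # pools the full map into 2x2
inner = roi_max_pool(fmap, (1, 1, 3, 3))   # pools the central 2x2 region
```

Whatever the size of the proposal, the pooled output has a fixed shape, which is what lets a single classifier score every region.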

[Figure: Faster R-CNN architecture, with blocks labeled CNN, Region proposals, ROI Pooling, and DNN]

[Figure: Faster R-CNN → LSTM → Prediction, repeated for each frame along the time sequence]

Long Short-Term Memory (LSTM)

An LSTM layer is introduced to Faster R-CNN
- memorizes long- and short-term information
- applied only to Sitting Down
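To illustrate how an LSTM carries information across frames, here is a minimal sketch of one step of a single-unit LSTM cell in the standard textbook formulation. The slides do not give the exact formulation or weights used, so the gate parameters in `params` and the per-frame inputs are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell (textbook form).

    p maps each gate name to (w_x, w_h, bias); all values are hypothetical.
    Returns the new hidden state h and the new cell state c.
    """
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))    # how much new information to write
    f = sigmoid(pre("forget"))   # how much old memory to keep
    o = sigmoid(pre("output"))   # how much memory to expose
    g = math.tanh(pre("cell"))   # candidate memory content
    c = f * c_prev + i * g       # cell state: longer-term memory
    h = o * math.tanh(c)         # hidden state: short-term output
    return h, c


# Hypothetical parameters and a toy sequence of per-frame inputs.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.0, 0.0, 1.0),
    "output": (0.3, 0.0, 0.0),
    "cell": (1.0, 0.2, 0.0),
}
h, c = 0.0, 0.0
for frame_feature in [0.1, 0.4, 0.9]:
    h, c = lstm_step(frame_feature, h, c, params)
```

The cell state `c` is what lets the model tell a transition such as sitting down apart from a static pose, since it accumulates evidence over several frames.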

Multi-Frame and Multi-Shot (Yamamoto 2015)

・Multi-Frame Score Fusion
  Average pooling of scores over 5 frames in a shot
・Multi-Shot Score Boosting
  Add adjacent shot scores

[Figure: frames around the key-frame (I-frame) combined by averaging]
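The two re-scoring steps can be sketched as follows. This is a minimal illustration, assuming one detection score per frame and per concept. The function names and the `weight` parameter are hypothetical, since the slides only say that adjacent shot scores are added.

```python
def fuse_frames(frame_scores):
    """Multi-Frame Score Fusion: average the scores of the frames in a shot."""
    return sum(frame_scores) / len(frame_scores)


def boost_shots(shot_scores, weight=1.0):
    """Multi-Shot Score Boosting: add the scores of adjacent shots.

    weight scales the neighbours' contribution (an assumption; the slides
    do not specify a weighting). Shots at either end have one neighbour.
    """
    boosted = []
    for i, score in enumerate(shot_scores):
        left = shot_scores[i - 1] if i > 0 else 0.0
        right = shot_scores[i + 1] if i + 1 < len(shot_scores) else 0.0
        boosted.append(score + weight * (left + right))
    return boosted


# Toy example: three consecutive shots, five frame scores each.
shots = [
    [0.1, 0.1, 0.1, 0.1, 0.1],
    [0.2, 0.4, 0.6, 0.8, 1.0],
    [0.2, 0.2, 0.2, 0.2, 0.2],
]
fused = [fuse_frames(s) for s in shots]
boosted = boost_shots(fused)
```

Fusion smooths out single-frame detector noise within a shot, while boosting raises the score of a shot whose neighbours also show the concept.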

Key-Frame Annotations

Bounding-box annotation on the representative key-frame for each shot labeled as positive in collaborative annotation

Concept          #frames   #boxes
Animal            11,545    9,155
Bicycling            599    1,355
Boy                1,848    2,492
Dancing            2,118    5,199
Explosion Fire     2,483    2,402
Inst. Musician     4,923    7,229
Running              945    1,394
Sitting Down           -        -
Baby                 898      895
Skier                320      521

I-Frame Annotations for Sitting Down

・I-frame annotation for Sitting Down to train LSTM
・Annotation results
  #shots = 92
  #frames = 481
  #bounding-boxes = 515

* We found Sitting Down in only 92 shots among the 3K shots labeled as positive in collaborative annotation
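As a quick sanity check, the box counts from the two annotation slides can be totalled. The per-concept numbers below are read from the key-frame annotation table (the run-together digits in the extracted slide were split by hand, which is an assumption). The result appears consistent with the "31K bounding boxes" figure quoted in the conclusion.

```python
# Bounding boxes per concept from the key-frame annotation slide.
key_frame_boxes = {
    "Animal": 9155,
    "Bicycling": 1355,
    "Boy": 2492,
    "Dancing": 5199,
    "Explosion Fire": 2402,
    "Inst. Musician": 7229,
    "Running": 1394,
    "Baby": 895,
    "Skier": 521,
}
# Bounding boxes from the I-frame annotation for Sitting Down.
sitting_down_boxes = 515

key_frame_total = sum(key_frame_boxes.values())
total_boxes = key_frame_total + sitting_down_boxes
print(key_frame_total, total_boxes)  # prints "30642 31157"
```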

Results

[Figure: bar chart of F-score (axis from 0 to 0.5); series: iframe_fscore and mean_pixel_fscore]

TokyoTech Runs

ID   Method                                      Run ID
1*   Faster R-CNN + Multi-Frame Score Fusion     fusion
2*   1 + Multi-Shot Score Boosting               boost
3*   1 + LSTM (4096 units) for Sitting Down      fusion.lstm
4*   2 + LSTM (4096 units) for Sitting Down      boost.lstm
5    2 + LSTM (64 units) for Sitting Down        (post exp.)

・2nd among 3 teams

Results for Sitting Down

ID   Method                    I-Frame F-score   Pixel F-score
2*   Fusion + Boosting                    0.63            0.22
4*   2 + LSTM (4096 units)                0.00            0.00
5    2 + LSTM (64 units)                 11.96            4.51

・Best result for Sitting Down with run #2
・LSTM with 4096 units (run #4) did not work
  → LSTM with 64 units (run #5) avoided over-fitting and worked in the post-submission experiment
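The slides report F-scores without defining them. As a reminder of the general form, the sketch below computes an F-score from true-positive, false-positive, and false-negative counts, and applies it to the pixels covered by a predicted and a ground-truth box. Treating the pixel F-score as a pixel-overlap measure is an assumption here, and the official evaluation may differ in detail.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when nothing is correct."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def box_pixels(box):
    """Set of pixel coordinates covered by box = (x0, y0, x1, y1), x1/y1 exclusive."""
    x0, y0, x1, y1 = box
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def pixel_f_score(pred_box, truth_box):
    """F-score over pixels: overlap counts as true positives (assumed definition)."""
    pred, truth = box_pixels(pred_box), box_pixels(truth_box)
    tp = len(pred & truth)
    return f_score(tp, len(pred - truth), len(truth - pred))


# Toy boxes: two 4x4 boxes, the prediction shifted 2 pixels to the left.
half_overlap = pixel_f_score((0, 0, 4, 4), (2, 0, 6, 4))
```

With half of each box overlapping, precision and recall are both 0.5, so the score is 0.5.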

Sitting Down

[Figure: system output vs. ground truth, showing good cases and bad cases; panel labels: Sitting down, Moving but not sitting down, Moving around a chair]

Re-trained network with LSTM 64 units

Animal, Good Results

[Figure: system output vs. ground truth for Faster R-CNN, Score Fusion, and Score Boosting; examples: Cat (no movement), Dog (walking)]

Animal, Bad Results

[Figure: system output vs. ground truth for Faster R-CNN, Score Fusion, and Score Boosting; examples: Many animals, Bird (flying fast)]

Others

[Figure: system output vs. ground truth for Faster R-CNN, Score Fusion, and Score Boosting; concepts: Bicycling, Boy]

Others

[Figure: system output vs. ground truth for Faster R-CNN, Score Fusion, and Score Boosting; concepts: Dancing, Explosion Fire]

Others

[Figure: system output vs. ground truth for Faster R-CNN, Score Fusion, and Score Boosting; concepts: Instrumental Musician, Running]

Others

[Figure: system output vs. ground truth for Faster R-CNN, Score Fusion, and Score Boosting; concepts: Baby, Skier]

Conclusion & Future Work

・We proposed a localization system
  - Faster R-CNN + LSTM + Re-scoring
・Manual annotation
  - 31K bounding boxes
・Results
  - 2nd among 3 teams, best result at Sitting Down
  - LSTM with 64 units was effective for Sitting Down
・Future work
  - Find a better way to localize action
