Paper reading group slides: "Multi-view Face Detection Using Deep Convolutional Neural Networks"

Paper Reading Group

Multi-view Face Detection Using Deep Convolutional Neural Networks

那須野薫

June 19, 2015

Matsuo Laboratory, The University of Tokyo

About the paper and the motivation for introducing it

•  Paper
  –  Title: Multi-view Face Detection Using Deep Convolutional Neural Networks
  –  Authors: Sachin Sudhakar Farfade et al. (Yahoo)
  –  Citation count: 2
  –  Published: 20 Apr 2015
  –  Link: http://arxiv.org/abs/1502.02766

•  Motivation
  –  I am interested in face image processing with deep learning
  –  The method seems fairly easy to implement
  –  It achieves nearly state-of-the-art performance


Abstract

•  They consider the problem of multi-view face detection.
•  Current state-of-the-art approaches for this task
  –  require annotation of facial landmarks or annotation of face poses.
  –  require training dozens of models to fully capture faces in all orientations, e.g. 22 models in the HeadHunter method.
•  Proposed method: Deep Dense Face Detector (DDFD)
  –  does not require pose/landmark annotation and is able to detect faces in a wide range of orientations using a single model based on deep convolutional neural networks.
  –  does not require additional components such as segmentation, bounding-box regression, or SVM classifiers.
  –  achieves similar or better performance compared to previous methods, which are more complex and require annotations.
•  Key findings
  –  1) their algorithm is able to detect faces from different angles and can handle occlusion to some extent.
  –  2) there seems to be a correlation between the distribution of positive examples in the training set and the scores of the proposed face detector.
    •  using better sampling strategies and more sophisticated data augmentation techniques might result in better performance.


What kind of task is multi-view face detection?

•  Occlusion
•  Illumination difference
•  Pose difference


Agenda

•  Introduction
•  Proposed Method
•  Experiments
•  Conclusion and Future Work





Face detection is more important these days

•  Millions of photos are uploaded every day
  –  Dropbox, Facebook, Instagram, Google+, Flickr
•  When users search for photos, they usually search by location, time, or friends
  –  the first two are available from the timestamp and GPS data recorded by the camera
  –  the last one is not easily available
•  This has made low-complexity, rapid, and accurate face detection an essential component of cloud-based photo sharing/storage platforms.


Face detection has been an active research area: three major types of approach

•  Cascade-based
  –  the state of the art uses 22 models
    •  requires orientation annotations
    •  computationally complex
  –  uses boosting
•  DPM-based
  –  Deformable Part Models technique
    •  a face is defined as a collection of its parts
    •  parts are defined via unsupervised or supervised training
    •  a latent SVM is trained to find the parts and their geometric relationships
•  Neural-network-based


Key challenge in multi-view face detection

•  Learning algorithms such as Boosting or SVM and image features such as HOG or Haar wavelets are not strong enough to capture faces of different poses, and thus the resulting classifiers are hopelessly inaccurate.

•  However, with recent advances in deep learning and GPU computation, it is possible to utilize the high capacity of deep convolutional neural networks for feature extraction/classification, and train a single model for the task of multi-view face detection.


Proposed method: Deep Dense Face Detector (DDFD)

•  DDFD
  –  does not require pose/landmark annotation and is able to detect faces in a wide range of orientations using a single model based on deep convolutional neural networks.
  –  does not require additional components such as segmentation, bounding-box regression, or SVM classifiers.
  –  achieves similar or better performance compared to previous methods, which are more complex and require annotations.
•  Key findings
  –  1) their algorithm is able to detect faces from different angles and can handle occlusion to some extent.
  –  2) there seems to be a correlation between the distribution of positive examples in the training set and the scores of the proposed face detector.
    •  using better sampling strategies and more sophisticated data augmentation techniques might result in better performance.





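Proposed method

•  How the face detector is trained
•  Ideas
  1.  To perform multi-view face detection with a single classifier, use a high-capacity deep CNN for both feature extraction and classification.
  2.  Keep the detector architecture simple to minimize computational complexity.
•  Outline
  –  how the detector is built
  –  analysis of the resulting detector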


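Fine-tune an AlexNet pre-trained on ImageNet

•  Training data: AFLW (about 24K faces)
  –  To increase the number of positives, sub-windows were sampled at random, and those with at least 50% IOU against an annotated face were also treated as positives, giving about 200K positive examples in total.
  –  About 20 million negative examples in total.
  –  All examples are resized to 227×227.
•  Iterations: 50K
•  Batch size: 128
  –  32 positives and 96 negatives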
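
The positive-example augmentation on this slide can be sketched as rejection sampling: jitter a window around an annotated face and keep it only if its IOU with the face is at least 0.5. This is a minimal sketch, not the authors' code. The function names, the jitter ranges (shift up to half the face size, scale between 0.7 and 1.4), and the `max_tries` cap are assumptions made for illustration.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_positive_windows(face, n, min_iou=0.5, max_tries=10000, rng=None):
    """Rejection-sample up to n windows around `face` with IOU >= min_iou."""
    rng = rng or random.Random(0)
    x1, y1, x2, y2 = face
    w, h = x2 - x1, y2 - y1
    windows = []
    for _ in range(max_tries):
        if len(windows) == n:
            break
        # Assumed jitter ranges (not taken from the paper).
        scale = rng.uniform(0.7, 1.4)
        cx = x1 + w / 2 + rng.uniform(-0.5, 0.5) * w
        cy = y1 + h / 2 + rng.uniform(-0.5, 0.5) * h
        cand = (cx - scale * w / 2, cy - scale * h / 2,
                cx + scale * w / 2, cy + scale * h / 2)
        if iou(face, cand) >= min_iou:
            windows.append(cand)
    return windows
```

Each accepted window would then be cropped and resized to 227×227 before being fed to the network.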


How regions of the input image are selected

•  Region-based approach
  –  requires an additional module
•  Sliding-window approach
  –  simple (and in this paper it is also the more accurate of the two)

* Figure: an example of a region-based approach


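Modify the pre-trained AlexNet

•  Like AlexNet, the network has 8 layers in total
  –  AlexNet's fully connected layers are converted into convolutional layers
  –  this makes inference more efficient and lets the network output a heat map
•  Detected boxes are filtered with non-maximum suppression
•  Unlike R-CNN and similar methods, no SVM is used on top of the final layer

* Figure from the AlexNet paper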
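
Why converting fully connected layers into convolutional layers yields a heat map can be checked on a toy example. A fully connected layer over a C × k × k feature block is the same computation as a k × k convolution using the same weights, so on a larger input the converted layer produces one score per window position. The sketch below uses plain Python and tiny, made-up sizes (2 channels, k = 3); in AlexNet the first fully connected layer sees a 256 × 6 × 6 block. All names are hypothetical.

```python
import random

def fc_forward(weights, bias, features):
    """Fully connected layer on a C x k x k block, flattened in (c, y, x) order."""
    flat = [v for plane in features for row in plane for v in row]
    return [sum(w * v for w, v in zip(w_row, flat)) + b
            for w_row, b in zip(weights, bias)]

def conv_forward(weights, bias, features, k):
    """The same weights applied as k x k kernels (stride 1, no padding)."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    heat = []
    for w_row, b in zip(weights, bias):
        plane = []
        for i in range(height - k + 1):
            row = []
            for j in range(width - k + 1):
                total = b
                for c in range(channels):
                    for dy in range(k):
                        for dx in range(k):
                            # Index of (c, dy, dx) in the flattened FC weights.
                            w = w_row[(c * k + dy) * k + dx]
                            total += w * features[c][i + dy][j + dx]
                row.append(total)
            plane.append(row)
        heat.append(plane)
    return heat

rng = random.Random(0)
channels, k, n_out = 2, 3, 2  # toy sizes chosen for illustration

def rand_features(h, w):
    return [[[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)]
            for _ in range(channels)]

weights = [[rng.uniform(-1, 1) for _ in range(channels * k * k)]
           for _ in range(n_out)]
bias = [rng.uniform(-1, 1) for _ in range(n_out)]

small = rand_features(k, k)   # exactly one window: conv output is 1 x 1
large = rand_features(5, 4)   # larger input: conv output is a 3 x 2 map
heat = conv_forward(weights, bias, large, k)
```

On `small` the two functions agree up to floating-point rounding, and each cell of `heat` equals the fully connected output for the corresponding k × k patch of `large`.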


Bounding-box regression is not used

•  Bounding-box regression
  –  fine adjustment of the detected box
•  It lowered accuracy in this study, so it is not used.

* Figure from "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks"


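Analysis of the face detector

•  Analysis of DDFD
•  Examine the relationship between the positive examples in the training data and the confidence scores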


Analysis of the face detector

•  The only faces that are missed are two that are almost completely occluded


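Analysis of the face detector: distribution of confidence scores

•  Performance is good
  –  non-face regions get a confidence of almost 0
  –  detected faces all get a confidence very close to 1
  –  so no post-processing such as an SVM is needed
•  Tilted or upside-down faces get somewhat lower confidence. Is this an effect of dataset bias rather than a weakness to rotation? (The confidence is computed with a sigmoid function.)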


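Analysis of the face detector: effect of out-of-plane rotation and occlusion on confidence

•  AFLW dataset
•  As in the previous example, in-plane rotation, out-of-plane rotation, and occlusion each lower the confidence somewhat.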


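Analysis of the face detector: testing the hypothesis about the dataset distribution

•  Because the training data is biased, it is not surprising that the fine-tuned CNN is more confident on upright faces.
•  The CNN is trained to minimize the softmax loss, so the sampling method has a considerable influence.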


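Analysis of the face detector: sampling method

•  Batch size = 128.
•  Positives are about 100 times fewer than negatives, so under uniform sampling a batch would contain only about two positives, which is undesirable.
•  Therefore, a quarter of each batch is filled with positives sampled uniformly from the positive set.
•  In particular, occluded faces are not detected. Could this be because the training data does not contain such examples in the first place?
•  A more sophisticated data augmentation method might improve accuracy
  –  however, naively adding artificial noise is not a good idea
  –  because the network would end up learning that noise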
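
The batch composition described on this slide (a quarter of each batch drawn from the positives) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and sampling with replacement via `random.choices` is an assumption, since the slides do not say how duplicates are handled. The second helper computes how many positives a uniformly sampled batch would contain on average.

```python
import random

def sample_batch(positives, negatives, batch_size=128, pos_fraction=0.25, rng=None):
    """Build one batch with a fixed share of positives (label 1) and negatives (label 0)."""
    rng = rng or random.Random(0)
    n_pos = int(batch_size * pos_fraction)   # 32 when batch_size is 128
    n_neg = batch_size - n_pos               # 96 when batch_size is 128
    # Assumption: draw uniformly with replacement inside each class.
    batch = [(x, 1) for x in rng.choices(positives, k=n_pos)]
    batch += [(x, 0) for x in rng.choices(negatives, k=n_neg)]
    rng.shuffle(batch)
    return batch

def expected_uniform_positives(batch_size, n_pos, n_neg):
    """Expected number of positives if the batch were drawn uniformly from all examples."""
    return batch_size * n_pos / (n_pos + n_neg)
```

Calling `expected_uniform_positives(128, 200_000, 20_000_000)` with the dataset sizes quoted earlier gives a value between one and two, which is why a fixed 32/96 split is used instead.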





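Experiments

•  Implemented in Caffe
  –  a pre-trained AlexNet model is also available on the web
•  The fully connected layers are replaced with convolutional layers so that the network outputs a heat map
  –  227×227 window with a 32 px stride
•  The image is scaled up and down so that faces whose size differs from 227×227 are also detected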
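
With a 227×227 window and a 32 px stride, each heat-map cell corresponds to one window in the scaled image, which can be mapped back to the original image by dividing by the scale. The sketch below shows that bookkeeping under simplifying assumptions: padding and rounding inside the network are ignored, the score threshold of 0.5 is a placeholder, and the function names are hypothetical.

```python
def heatmap_size(height, width, window=227, stride=32):
    """Approximate number of window positions (rows, cols) for a scaled image."""
    return ((height - window) // stride + 1, (width - window) // stride + 1)

def heatmap_to_boxes(heatmap, scale, threshold=0.5, window=227, stride=32):
    """Turn heat-map cells scoring >= threshold into boxes in original-image coordinates.

    heatmap: list of rows of face scores computed on the image resized by `scale`.
    Returns a list of (x1, y1, x2, y2, score).
    """
    boxes = []
    size = window / scale
    for i, row in enumerate(heatmap):
        for j, score in enumerate(row):
            if score >= threshold:
                x1 = j * stride / scale
                y1 = i * stride / scale
                boxes.append((x1, y1, x1 + size, y1 + size, score))
    return boxes
```

For example, a cell in row 0, column 1 of a heat map computed at scale 0.5 maps to a 454 px box whose left edge is at x = 64 in the original image.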


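Datasets

•  Pascal Face
  –  1,341 face annotations in 851 images
  –  minimum face size: 35 px
•  AFW
  –  473 face annotations in 205 images
  –  images from Flickr
•  FDDB
  –  5,172 face annotations in 2,846 images
•  Parameter tuning
  –  done on Pascal Face
•  An evaluation tool was apparently used for scoring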


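Image scales

•  Upscale the image by a factor of 5
  –  so that faces as small as 227/5 ≈ 45 px can be detected
•  Then repeatedly scale the image by a factor fs until its smallest dimension is 227 px or less
  –  fs ∈ {0.5^(1/2), 0.5^(1/3), 0.5^(1/5), 0.5^(1/7)}
  –  is a smaller fs better?
  –  fs = 0.5^(1/3) was chosen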
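
The scale schedule on this slide can be written as a short loop: start from the 5× upscaled image and multiply the scale by fs until a 227×227 window no longer fits. This is a sketch under one assumption about the stopping rule: only scales at which the smallest image dimension is still at least 227 px are kept. The function name is hypothetical.

```python
def pyramid_scales(height, width, fs=0.5 ** (1 / 3), upscale=5.0, window=227):
    """Scales at which the image is scanned, from largest to smallest."""
    scales = []
    s = upscale
    # Assumed stopping rule: keep a scale only while the window still fits.
    while min(height, width) * s >= window:
        scales.append(s)
        s *= fs
    return scales

scales = pyramid_scales(300, 400)
```

With fs = 0.5^(1/3) the scale halves every three steps, so a smaller fs gives a coarser pyramid with fewer network passes, while an fs closer to 1 scans more densely at a higher cost.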


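Non-maximum suppression (NMS) module

•  NMS-max
  –  among boxes whose IOU is at or above the threshold, remove all but the box with the highest score (≈ confidence)
•  NMS-avg
  –  remove all boxes with confidence below 0.2
  –  cluster the remaining boxes with OpenCV according to the overlap threshold
  –  within each cluster, remove boxes whose score is below 90% of the cluster's maximum score
  –  average the positions of the remaining boxes to obtain the new box position
  –  finally, use the maximum score in the cluster as the score of the new box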
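
The two strategies above can be sketched in a few lines each. This is not the authors' implementation. In particular, the slide says the clustering is done with OpenCV, whereas the sketch substitutes a simple greedy grouping by IOU against each cluster's top-scoring box; that substitution, the function names, and the default thresholds (taken from the values quoted in these slides) should be read as assumptions.

```python
def iou(a, b):
    """Intersection over union; boxes are (x1, y1, x2, y2, ...) with extra fields ignored."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms_max(dets, overlap=0.3):
    """Greedy NMS: keep a box only if it does not overlap a higher-scoring kept box."""
    keep = []
    for d in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(d, k) < overlap for k in keep):
            keep.append(d)
    return keep

def nms_avg(dets, overlap=0.2, min_score=0.2, keep_ratio=0.9):
    """Cluster boxes, average the strong ones in each cluster, keep the cluster's top score."""
    dets = [d for d in dets if d[4] >= min_score]
    clusters = []
    # Assumption: greedy IOU grouping stands in for the OpenCV-based clustering.
    for d in sorted(dets, key=lambda d: d[4], reverse=True):
        for c in clusters:
            if iou(d, c[0]) >= overlap:
                c.append(d)
                break
        else:
            clusters.append([d])
    merged = []
    for c in clusters:
        top = c[0][4]  # clusters are filled in descending score order
        strong = [d for d in c if d[4] >= keep_ratio * top]
        box = tuple(sum(d[i] for d in strong) / len(strong) for i in range(4))
        merged.append(box + (top,))
    return merged
```

Both functions take detections as `(x1, y1, x2, y2, score)` tuples. NMS-max returns a subset of the input boxes, while NMS-avg returns new boxes whose coordinates are averages of the strong boxes in each cluster.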


Non-maximum suppression (NMS) module

•  Results depend strongly on the overlap threshold.
  –  NMS-max → 0.3, NMS-avg → 0.2
•  On average, NMS-avg seems to work better


With and without bounding-box regression


Is the reason bounding-box regression does not work well a mismatch between the training data and the test data?


Comparison with R-CNN

•  R-CNN: one of the state-of-the-art methods
•  The proposed method performs better
  –  perhaps because R-CNN's selective search performs poorly?


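Comparison with other state-of-the-art methods

•  The comparison is not entirely fair
  –  because some of the other methods use information other than the face itself
•  Performance is better than or comparable to the other state-of-the-art methods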





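Summary

•  Proposed the Deep Dense Face Detector.
  –  requires no annotations
  –  robust to tilt and rotation
  –  a single model
  –  simple: no bounding-box regression or SVM
  –  performance better than or comparable to other recent methods
•  Future work
  –  improve how the training data is generated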

Appendix


Appendix Results Example


