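Advanced Topics in Machine Learning - kanamori lab · Foundations of machine learning: problem setting, prediction error, training error, Bayes rule, model selection · Statistical learning theory: VC dimension, Sauer's lemma, Rademacher complexity, surrogate losses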

Advanced Topics in Machine Learning

• Instructor: Takafumi Kanamori (e-mail: [email protected])

http://www.kana-lab.c.titech.ac.jp/

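(Course information is posted on the Japanese-language page.)

• Course materials and grading

∗ Grading: based on the combined results of the report assignments set during the lectures.

• Lecture schedule: see the course site for details.

∗ Review of probability theory, probability inequalities
∗ Foundations of machine learning: problem setting, prediction error, training error, Bayes rule, model selection
∗ Statistical learning theory: VC dimension, Sauer's lemma, Rademacher complexity, surrogate losses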

• References (PDFs of the papers can be found by searching for the title, e.g. on Google)

∗ For the course as a whole:

・ [Book] Shai Shalev-Shwartz, et al., Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014. (Available for download on the web.)

・ [Survey paper] O. Bousquet, S. Boucheron, G. Lugosi, "Introduction to Statistical Learning Theory", 2004.

・ [Book] Mohri, et al., Foundations of Machine Learning, MIT Press, 2012.

• Other literature that may be useful

∗ [Book] Kanamori, Statistical Learning Theory (統計的学習理論, Machine Learning Professional Series), Kodansha, 2015.

— The Machine Learning Framework —

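Observed data $\Longrightarrow$ extract useful information

• This course mainly covers regression and classification (discriminant analysis): from data $(x_1, y_1), \ldots, (x_n, y_n)$, estimate the relationship between $x$ and $y$.

• Other problem settings

∗ Dimensionality reduction: compress high-dimensional data to a low dimension while preserving information.
∗ Clustering: partition the data into several groups.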

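Statistical data analysis and probability theory

• Observed data are complex (e.g. affected by noise)

• Probabilistic modeling is effective

Modeling = [deterministic structure] + [random structure]

• Data are analyzed on the basis of probability theory

∗ However, this course does not use a rigorous measure-theoretic treatment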

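Problem setting

Given an input $x$, an output $y$ is produced:

Example: $x \longrightarrow$ ?? $\longrightarrow y$

• Data $(x_1, y_1), \ldots, (x_n, y_n)$ have been observed.

• Predict the output $y$ for a new input $x$.

Classification: $y$ takes values in a finite set.

Regression: $y$ takes real values.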

[Figure: plot with axes X1 and X2, each ranging from −4 to 4; panel title "C=1".]

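• Binary classification

∗ $x_i \in \mathbb{R}^d$, $y_i \in \{+1, -1\}$
∗ Estimate the boundary between $+1$ and $-1$ from the data.
∗ Use the boundary to predict the value of $y$ for a new input $x$.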
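As a rough illustration of "estimate a boundary, then predict", the sketch below fits a nearest-centroid rule to made-up two-dimensional data. The nearest-centroid rule, the function names, and the toy data are all assumptions chosen for brevity; they are not taken from the lecture.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    d = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(d)]

def fit_nearest_centroid(xs, ys):
    """Return (w, b) such that the boundary is {x : w.x + b = 0}.

    The boundary is the perpendicular bisector of the two class centroids,
    so w.x + b >= 0 exactly when x is at least as close to the +1 centroid.
    """
    pos = centroid([x for x, y in zip(xs, ys) if y == +1])
    neg = centroid([x for x, y in zip(xs, ys) if y == -1])
    w = [p - n for p, n in zip(pos, neg)]
    b = (sum(n * n for n in neg) - sum(p * p for p in pos)) / 2.0
    return w, b

def predict(w, b, x):
    """Predict +1 or -1 for a new input x."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return +1 if score >= 0 else -1

# Hypothetical training data (x_i in R^2, y_i in {+1, -1}).
xs = [(2.0, 2.0), (3.0, 1.0), (2.5, 3.0), (-2.0, -1.0), (-3.0, -2.0), (-1.0, -2.5)]
ys = [+1, +1, +1, -1, -1, -1]

w, b = fit_nearest_centroid(xs, ys)
print(predict(w, b, (3.0, 3.0)), predict(w, b, (-3.0, -3.0)))
```

Any other rule that outputs a boundary (for example a linear classifier trained by minimizing a loss) fits the same two-step pattern of fitting and then predicting.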

— Applications of Machine Learning —

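Machine learning: discover patterns in data and use them for tasks such as predicting the future.

Examples: character recognition, image recognition, speech recognition, etc.

• Spam (junk e-mail) filter

∗ $x$: an e-mail (text data), $y \in \{\text{spam}, \text{non-spam}\}$

∗ Sample: a large number of (past) spam e-mails and ordinary e-mails

∗ Decide whether a future e-mail is spam, and sort it accordingly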

• Character recognition: identify characters from images. Example: $x \in \mathbb{R}^{64}$, $y \in \{0, 1, 2, \ldots, 9\}$.

[Figure: three digit images (axis ticks 2 to 8 in each direction), labeled "→ 0", "→ 3", and "→ 7 or 9?".]

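In practical use for reading postal codes, among other things.

• Face detection: detect where faces are located. Built into digital cameras and similar devices.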

[Figure: page excerpts from Viola and Jones. Figure 8: example of frontal upright face images used for training. Figure 10: output of the face detector on a number of test images from the MIT + CMU test set.]

Training data (positive examples) and detection results. Source: P. Viola & M. J. Jones, "Robust Real-Time Face Detection", International Journal of Computer Vision, 2004.

— Review of Probability Theory —

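• Random variable (r.v.): a variable $X$ that takes random values (usually written with a capital letter)

• Definition of the probability $\Pr(\cdot)$ for a random variable $X$ taking values in the sample space $\Omega$:

1. For any event $A \subset \Omega$, $0 \le \Pr(X \in A) \le 1$.
2. The whole sample space $\Omega$ has probability 1: $\Pr(X \in \Omega) = 1$.
3. For mutually exclusive events $A_i$, $i = 1, 2, 3, \ldots$,

$$\Pr\Bigl(X \in \bigcup_i A_i\Bigr) = \sum_i \Pr(X \in A_i).$$

(Mutually exclusive: $A_i \cap A_j = \emptyset$ for $i \neq j$.)

[Figure: mutually exclusive events $A_1$ and $A_2$.]

(For simplicity, $\Pr(X \in A)$ is also written $\Pr(A)$ or $P(A)$.)

Example 1 (dice). $\Omega = \{1, 2, 3, 4, 5, 6\}$, $X$ = the number shown by the die. With $A = \{2, 4, 6\}$, $\Pr(X \in A)$ is the probability of rolling an even number. For a fair die, $\Pr(X \in A) = 1/2$.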
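Example 1 can be reproduced numerically. The sketch below computes $\Pr(X \in A)$ exactly for a fair die and compares it with the relative frequency in simulated rolls; the function names, the seed, and the number of rolls are illustrative choices, not part of the lecture.

```python
from fractions import Fraction
import random

OMEGA = [1, 2, 3, 4, 5, 6]  # sample space of one die roll

def prob(event):
    """Exact Pr(X in event) for a fair die (uniform on OMEGA)."""
    return Fraction(sum(1 for w in OMEGA if w in event), len(OMEGA))

def empirical_prob(event, n_rolls, seed=0):
    """Fraction of n_rolls simulated fair rolls that land in event."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_rolls) if rng.choice(OMEGA) in event)
    return hits / n_rolls

A = {2, 4, 6}  # "the roll is even"
print(prob(A))                     # exact value
print(empirical_prob(A, 100_000))  # simulated relative frequency
```

The simulated value fluctuates around the exact one; how fast it concentrates is the subject of the law of large numbers and the probability inequalities later in these slides.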


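Computing probabilities

The following can be verified from the axioms alone:

• $\Pr(A) + \Pr(A^c) = 1$, where $A^c$ is the complement of $A$.

• Monotonicity: $A \subset B \subset \Omega \Longrightarrow \Pr(A) \le \Pr(B)$. When $A \subset B$, $B = A \cup (B \cap A^c)$ (a union of mutually exclusive events). $\therefore \Pr(B) = \Pr(A) + \Pr(B \cap A^c) \ge \Pr(A)$.

• Addition theorem: $\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$,

$$\Pr\Bigl(\bigcup_i A_i\Bigr) \le \sum_i \Pr(A_i).$$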
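These identities do not depend on the distribution being uniform. As a small check, the sketch below evaluates them exactly on a hypothetical loaded die; the weights and the events A, B, C are made up for illustration.

```python
from fractions import Fraction

# Hypothetical loaded die: face 6 comes up half of the time.
WEIGHTS = {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
           4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2)}
OMEGA = frozenset(WEIGHTS)

def pr(event):
    """Exact probability of an event (a subset of OMEGA)."""
    return sum((WEIGHTS[w] for w in event), Fraction(0))

A = frozenset({2, 4, 6})
B = frozenset({4, 5, 6})
C = frozenset({6})  # C is a subset of A

checks = {
    "complement": pr(A) + pr(OMEGA - A) == 1,
    "monotonicity": C <= A and pr(C) <= pr(A),
    "addition": pr(A | B) == pr(A) + pr(B) - pr(A & B),
    "union bound": pr(A | B | C) <= pr(A) + pr(B) + pr(C),
}
for name, ok in checks.items():
    print(name, ok)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the equalities can be tested with `==` rather than up to rounding error.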


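Continuous random variables (example: the normal distribution)

The random variable $X$ follows a one-dimensional normal distribution:

$$\Pr(a \le X \le b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx, \qquad (\Omega = \mathbb{R})$$

In this case we write $X \sim N(\mu, \sigma^2)$.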

[Figure: normal densities. One panel shows a density on $[-3, 3]$ whose shaded area equals $\Pr(1 \le X \le 2)$. The other, titled "normal density", plots $p(x)$ against $x$ for $N(0,1)$ and $N(2,4)$.]
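The shaded area $\Pr(1 \le X \le 2)$ can be evaluated numerically. The sketch below does it in two ways: through the error function `math.erf`, and by a midpoint-rule integration of the density. The helper names and the number of integration steps are assumptions made for this example.

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def normal_cdf(x, mu=0.0, sigma2=1.0):
    """Pr(X <= x) for X ~ N(mu, sigma2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * sigma2)))

def normal_interval_prob(a, b, mu=0.0, sigma2=1.0):
    """Pr(a <= X <= b) for X ~ N(mu, sigma2)."""
    return normal_cdf(b, mu, sigma2) - normal_cdf(a, mu, sigma2)

def midpoint_integral(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

print(normal_interval_prob(1.0, 2.0))             # N(0, 1)
print(midpoint_integral(normal_pdf, 1.0, 2.0))    # same area, by integration
print(normal_interval_prob(1.0, 2.0, 2.0, 4.0))   # N(2, 4)
```

For $N(0,1)$ both computations give roughly 0.136.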

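Probability density functions

$n$ random variables: $X_1, X_2, \ldots, X_n$. A set $A \subset \mathbb{R}^n$. Suppose the probability that $X = (X_1, X_2, \ldots, X_n)$ lies in $A$ is given by

$$\Pr(X \in A) = \int_A p(x_1, \ldots, x_n)\, dx_1 \cdots dx_n.$$

• (Joint) probability density function: $p(x) = p(x_1, \ldots, x_n) \ge 0$, $\int_{\mathbb{R}^n} p(x)\, dx = 1$. (Called the probability density, or simply the density, for short.)

• Marginal density: $p_1(x_1) = \int_{\mathbb{R}^{n-1}} p(x_1, \ldots, x_n)\, dx_2 \cdots dx_n$, and so on.

For a discrete random variable ($A$ a countable set), replace $\int_A \cdots dx$ by $\sum_{x \in A} \cdots$.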

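Expectation and variance of random variables

Let $p(x_1, \ldots, x_d)$ be the density of $X = (X_1, \ldots, X_d)$.

• Expectation of $X_i$: represents the "center of the spread".

$$E[X_i] = \int_{\mathbb{R}^d} x_i\, p(x_1, \ldots, x_d)\, dx_1 \cdots dx_d = \int_{\mathbb{R}^d} x_i\, p(x)\, dx \in \mathbb{R}$$

• Expectation of $X = (X_1, \ldots, X_d)^T$:

$$E[X] = \begin{pmatrix} E[X_1] \\ \vdots \\ E[X_d] \end{pmatrix} \in \mathbb{R}^d$$

• Variance of a one-dimensional random variable $X_i$: represents the "size of the spread".

$$V[X_i] \overset{\text{def}}{=} E[(X_i - E[X_i])^2] = E[X_i^2] - E[X_i]^2$$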
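For a discrete distribution the integrals become sums, which makes the definitions easy to check by hand or by machine. The sketch below uses a hypothetical joint distribution of $X = (X_1, X_2)$ on $\{0,1\}^2$ and computes the mean vector and the variance in both forms; the probability table and the function names are invented for this example.

```python
from fractions import Fraction

# Hypothetical joint distribution of X = (X1, X2) on {0, 1}^2.
JOINT = {
    (0, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8),
    (1, 1): Fraction(3, 8),
}

def expectation(g):
    """E[g(X)] = sum over x of g(x) p(x)."""
    return sum((g(x) * p for x, p in JOINT.items()), Fraction(0))

def mean_vector():
    """E[X] as the vector of coordinate-wise expectations."""
    return [expectation(lambda x, i=i: x[i]) for i in range(2)]

def variance(i):
    """V[X_i] by the definition and by the formula E[X_i^2] - E[X_i]^2."""
    m = expectation(lambda x: x[i])
    by_definition = expectation(lambda x: (x[i] - m) ** 2)
    by_shortcut = expectation(lambda x: x[i] ** 2) - m ** 2
    return by_definition, by_shortcut

print(mean_vector())
print(variance(0), variance(1))
```

Summing $x_i\,p(x)$ over the full joint table is the same as first forming the marginal of $X_i$ and then taking its mean, which is why no explicit marginalization step appears.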

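Independence; independent and identically distributed

$n$ random variables: $X_1, X_2, \ldots, X_n$.

• $X_1, \ldots, X_n$ are independent $\iff$ the joint density factorizes into a product:

$$p(x_1, \ldots, x_n) = p_1(x_1)\, p_2(x_2) \cdots p_n(x_n)$$

• $X_1, \ldots, X_n$ are independent and identically distributed:

$$p(x_1, \ldots, x_n) = q(x_1)\, q(x_2) \cdots q(x_n), \qquad (q = p_1 = \cdots = p_n)$$

When $X, Y$ are independent, the following equalities hold:

$$E[XY] = E[X]\,E[Y], \qquad V[X + Y] = V[X] + V[Y].$$

note: $E[X + Y] = E[X] + E[Y]$ always holds.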

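• When $X_1, \ldots, X_n$ are independent and follow the same distribution $P$, we write

$X_1, \ldots, X_n \sim_{\text{i.i.d.}} P$ or $(X_1, \ldots, X_n) \sim P^n$.

(For example, we write $X_1, \ldots, X_n \sim N(0, 1)$.)

In this case the expectations and variances of $X_1, \ldots, X_n$ are all equal:

$$E[X_1] = \cdots = E[X_n], \qquad V[X_1] = \cdots = V[X_n].$$

• When $X_1, \ldots, X_n \sim_{\text{i.i.d.}} P$ with $\mu = E[X_i]$ and $\sigma^2 = V[X_i]$:

$$Y = \frac{1}{n} \sum_{i=1}^n X_i \;\Longrightarrow\; E[Y] = \mu, \quad V[Y] = \frac{\sigma^2}{n}$$

note: $E[aX + b] = aE[X] + b$, $V[aX + b] = a^2 V[X]$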
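The claim $V[Y] = \sigma^2/n$ can be confirmed exactly for small $n$ by enumerating every outcome. The sketch below does this for the average of $n$ fair dice; the choice of dice as the distribution $P$ and the helper names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

def moments(dist):
    """Mean and variance of a list of (value, probability) pairs."""
    mean = sum(p * v for v, p in dist)
    var = sum(p * (v - mean) ** 2 for v, p in dist)
    return mean, var

def sample_mean_dist(n):
    """All (value of Y, probability) pairs for Y = average of n fair dice."""
    p = Fraction(1, 6) ** n  # independence: the joint probability is a product
    return [(Fraction(sum(rolls), n), p) for rolls in product(FACES, repeat=n)]

mu, sigma2 = moments([(Fraction(f), Fraction(1, 6)) for f in FACES])
for n in (1, 2, 3, 4):
    e_y, v_y = moments(sample_mean_dist(n))
    print(n, e_y == mu, v_y == sigma2 / n)
```

The enumeration grows as $6^n$, so this is only a sanity check for small $n$; the general statement follows from $V[X+Y] = V[X] + V[Y]$ for independent variables together with $V[aX] = a^2 V[X]$.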

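Conditional probability and conditional density

• Given the probability $\Pr(X \in A, Y \in B)$ for $X, Y$, the probability of "$Y \in B$ under the condition $X \in A$" is

$$\Pr(Y \in B \mid X \in A) = \frac{\Pr(X \in A, Y \in B)}{\Pr(X \in A)}$$

[Figure: the events $X \in A$ and $Y \in B$.]

• Under the density $p(x, y)$, the conditional density of $y$ given $x$ is

$$p(y \mid x) := \frac{p(x, y)}{\int p(x, y)\, dy} = \frac{p(x, y)}{p_X(x)}. \qquad (p_X(x)\text{: the marginal density of } x)$$

Properties: $\forall x, y,\ p(y \mid x) \ge 0$, $\int p(y \mid x)\, dy = 1$.

"The probability that $Y \in [y, y + dy]$ under the condition $X \in [x, x + dx]$"

$$= \frac{\Pr(X \in [x, x+dx],\, Y \in [y, y+dy])}{\Pr(X \in [x, x+dx])} = \frac{p(x, y)\, dx\, dy}{p_X(x)\, dx} = p(y \mid x)\, dy$$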

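Bayes' theorem

$$\Pr(X \in A \mid Y \in B) = \frac{\Pr(Y \in B \mid X \in A)\, \Pr(X \in A)}{\Pr(Y \in B)}$$

Proof:

$$\Pr(X \in A \mid Y \in B)\, \Pr(Y \in B) = \Pr(X \in A, Y \in B) = \Pr(Y \in B \mid X \in A)\, \Pr(X \in A).$$

Interpretation: regarding $X$ as the cause and $Y$ as the effect,

• $\Pr(Y \mid X)$: the relation from the cause $X$ to the effect $Y$

• $\Pr(X \mid Y)$: inference about the cause $X$ after observing the effect $Y$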
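A small numeric instance of the cause-and-effect reading, phrased in terms of the spam-filter application: the cause is "the e-mail is spam" and the effect is "the e-mail contains a given word". All three input probabilities below are hypothetical numbers, and the function name is an assumption.

```python
from fractions import Fraction

def bayes_posterior(prior_a, lik_b_given_a, lik_b_given_not_a):
    """Pr(A | B) from Pr(A), Pr(B | A) and Pr(B | not A)."""
    joint_a = lik_b_given_a * prior_a                # Pr(A, B)
    joint_not_a = lik_b_given_not_a * (1 - prior_a)  # Pr(not A, B)
    evidence = joint_a + joint_not_a                 # Pr(B)
    return joint_a / evidence

# Hypothetical numbers: 30% of mail is spam; the word appears in 60% of spam
# and in 5% of ordinary mail.
posterior = bayes_posterior(Fraction(3, 10), Fraction(3, 5), Fraction(1, 20))
print(posterior, float(posterior))
```

The denominator $\Pr(B)$ is obtained by summing the joint probabilities over the two possible causes, so only "forward" quantities (cause to effect) are needed as input.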

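Mixture distributions: an example of conditional densities

$X$ is a random variable on $\mathbb{R}$, and $Y$ is a random variable on $\{0, 1\}$.

$$\Pr(Y = 0) = q, \qquad \Pr(Y = 1) = 1 - q$$

Conditional densities of $X$: $p(x \mid Y = 0) = p_0(x)$, $p(x \mid Y = 1) = p_1(x)$

Then the marginal density $p(x)$ of $X$ is

$$p(x) = q \cdot p_0(x) + (1 - q) \cdot p_1(x). \qquad (\text{a mixture of } p_0 \text{ and } p_1)$$

note: $p(x) = \int p(x \mid y)\, p(y)\, dy$. In this example $Y$ is discrete, so the integral becomes a sum.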

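By Bayes' theorem,

$$\Pr(Y = 1 \mid X \in [x, x+dx]) = \frac{\Pr(X \in [x, x+dx] \mid Y = 1)\, \Pr(Y = 1)}{\Pr(X \in [x, x+dx])} = \frac{(1-q)\, p_1(x)\, dx}{\bigl(q\, p_0(x) + (1-q)\, p_1(x)\bigr)\, dx} = \frac{1}{\dfrac{q}{1-q} \cdot \dfrac{p_0(x)}{p_1(x)} + 1}$$

For simplicity, write $\Pr(Y = 1 \mid X \in [x, x+dx])$ as $p(1 \mid x)$.

Setting $r(x) = \dfrac{q}{1-q} \cdot \dfrac{p_0(x)}{p_1(x)}$, we obtain

$$p(1 \mid x) = \frac{1}{r(x) + 1}, \qquad p(0 \mid x) = \frac{r(x)}{r(x) + 1}$$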
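The posterior formula can be tried out on a concrete mixture. The sketch below assumes, purely for illustration, that $p_0$ is the $N(0,1)$ density, $p_1$ is the $N(2,4)$ density, and $q = 0.5$; it evaluates $p(1 \mid x)$ through $r(x)$ and, as a cross-check, directly from Bayes' theorem.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def posterior_y1(x, q, p0, p1):
    """p(1 | x) = 1 / (r(x) + 1) with r(x) = q/(1-q) * p0(x)/p1(x)."""
    r = q / (1.0 - q) * p0(x) / p1(x)
    return 1.0 / (r + 1.0)

def posterior_y1_direct(x, q, p0, p1):
    """p(1 | x) = (1-q) p1(x) / (q p0(x) + (1-q) p1(x))."""
    numerator = (1.0 - q) * p1(x)
    return numerator / (q * p0(x) + numerator)

# Assumed components and mixing weight (not specified in the slides).
p0 = lambda x: normal_pdf(x, 0.0, 1.0)
p1 = lambda x: normal_pdf(x, 2.0, 4.0)
q = 0.5

for x in (-1.0, 0.5, 3.0):
    print(x, posterior_y1(x, q, p0, p1), posterior_y1_direct(x, q, p0, p1))
```

Predicting $y = 1$ whenever $p(1 \mid x) > 1/2$ is equivalent to checking $r(x) < 1$, so the ratio $r(x)$ alone determines the decision.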

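Asymptotic theory: the law of large numbers

Let $X_1, \ldots, X_n \sim_{\text{i.i.d.}} P$ and set $E[X_i] = \mu$.

• Law of large numbers: with $\bar{X}_n \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n X_i$,

$$\forall \varepsilon > 0, \quad \lim_{n \to \infty} \Pr(|\bar{X}_n - \mu| > \varepsilon) = 0$$

∗ When $n$ is sufficiently large, $\bar{X}_n$ is nearly equal to $\mu$ with high probability.

Definition: a sequence of random variables $\{Z_n\}_{n \in \mathbb{N}}$ converges in probability to $a \in \mathbb{R}$ if

$$\forall \varepsilon > 0, \quad \Pr(|Z_n - a| > \varepsilon) \to 0 \quad (n \to \infty)$$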

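• By the law of large numbers, $\bar{X}_n \xrightarrow{p} \mu$.

• When $f(x)$ is a continuous function, a relation analogous to ordinary limits holds:

if $\bar{X}_n \xrightarrow{p} \mu$, then $f(\bar{X}_n) \xrightarrow{p} f(\mu)$.

Example 2 (law of large numbers). A coin with probability 0.7 of heads. Plot the proportion of heads in $n$ tosses.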

[Figure: "Law of Large Numbers": the running average (vertical axis "average", 0.5 to 1.0) plotted against $n$ (0 to 800).]
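The data behind a plot like the one in Example 2 can be generated with a few lines of simulation. This is a minimal sketch using the standard `random` module; the seed, the number of tosses, and the function name are arbitrary choices for the example.

```python
import random

def running_head_fraction(n_tosses, p_heads=0.7, seed=0):
    """Fraction of heads after 1, 2, ..., n_tosses tosses of a biased coin."""
    rng = random.Random(seed)
    heads = 0
    fractions = []
    for n in range(1, n_tosses + 1):
        if rng.random() < p_heads:  # heads with probability p_heads
            heads += 1
        fractions.append(heads / n)
    return fractions

fractions = running_head_fraction(20_000)
for n in (10, 100, 1_000, 20_000):
    print(n, fractions[n - 1])
```

Early values can be far from 0.7, while later ones settle close to it; passing the list to any plotting tool gives a curve of the same kind as the figure.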

Probability inequalities

The main probability inequalities used in statistical learning theory:

• Jensen's inequality

• Markov's inequality

• Hoeffding's lemma, Hoeffding's inequality

• Massart's lemma

• Azuma's inequality, McDiarmid's inequality: generalizations of Hoeffding's

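Jensen's inequality

Definition of a convex function:

$$f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y), \qquad x, y \in \mathbb{R}^d,\ \alpha \in [0, 1].$$

Theorem 1 (Jensen's inequality). For a convex function $f: \mathbb{R}^d \to \mathbb{R}$ and a $d$-dimensional random variable $X$, the following holds:

$$f(E[X]) \le E[f(X)]$$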
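A quick numerical check of Theorem 1 in one dimension, taking $X$ to be a fair die roll (an assumption made for this example) and two convex functions, $f(x) = x^2$ and $f(x) = e^x$. The helper name `jensen_gap` is hypothetical.

```python
from fractions import Fraction
import math

FACES = [Fraction(k) for k in range(1, 7)]
P = Fraction(1, 6)  # probability of each face of a fair die

def jensen_gap(f):
    """E[f(X)] - f(E[X]); nonnegative whenever f is convex."""
    mean = sum(P * x for x in FACES)
    return sum(P * f(x) for x in FACES) - f(mean)

print(jensen_gap(lambda x: x * x))  # exact rational arithmetic
print(jensen_gap(math.exp))         # floating point
```

For $f(x) = x^2$ the gap $E[X^2] - E[X]^2$ is exactly the variance $V[X]$, so Jensen's inequality contains the fact that variances are nonnegative.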

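Markov/Chebyshev inequalities

• For an r.v. $X \ge 0$ and any $\varepsilon > 0$,

$$\Pr(X \ge \varepsilon) \le \frac{E[X]}{\varepsilon}$$

• When $\Pr(0 \le X \le 1) = 1$, for any $\varepsilon$ satisfying $0 \le \varepsilon < E[X]$,

$$\Pr(X \ge \varepsilon) \ge \frac{E[X] - \varepsilon}{1 - \varepsilon}$$

• Chebyshev's inequality

$$\Pr(|X - E[X]| \ge \varepsilon) \le \frac{V[X]}{\varepsilon^2}$$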
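A short derivation sketch for the three bounds, assuming the expectations involved are finite. The indicator notation $\mathbf{1}\{\cdot\}$ is introduced here for the derivation and does not appear in the slides.

```latex
\begin{align*}
% Markov: X >= 0 and eps > 0
E[X] &\ge E\bigl[X\,\mathbf{1}\{X \ge \varepsilon\}\bigr]
      \ge \varepsilon \Pr(X \ge \varepsilon). \\
% Lower bound: 0 <= X <= 1 with probability 1
E[X] &= E\bigl[X\,\mathbf{1}\{X < \varepsilon\}\bigr] + E\bigl[X\,\mathbf{1}\{X \ge \varepsilon\}\bigr]
      \le \varepsilon\bigl(1 - \Pr(X \ge \varepsilon)\bigr) + \Pr(X \ge \varepsilon)
      = \varepsilon + (1 - \varepsilon)\Pr(X \ge \varepsilon). \\
% Chebyshev: apply Markov to (X - E[X])^2 with threshold eps^2
\Pr(|X - E[X]| \ge \varepsilon) &= \Pr\bigl((X - E[X])^2 \ge \varepsilon^2\bigr)
      \le \frac{E[(X - E[X])^2]}{\varepsilon^2} = \frac{V[X]}{\varepsilon^2}.
\end{align*}
```

Dividing the first line by $\varepsilon$ gives Markov's inequality, and rearranging the second line (with $\varepsilon < 1$) gives the lower bound.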

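Lemma 1 (Hoeffding's lemma). If a random variable $X$ satisfies $E[X] = 0$ and $a \le X \le b$, then for any $t > 0$,

$$E[e^{tX}] \le e^{t^2 (b-a)^2 / 8}.$$

Lemma 2 (Hoeffding's inequality). Let the random variables $X_1, \ldots, X_n$ be independent and identically distributed, with $X_i$ taking values in the bounded interval $[a_i, b_i]$ with probability 1. Let $S = \sum_{i=1}^n X_i$. Then for any $\varepsilon > 0$,

$$\Pr(S - E[S] \ge \varepsilon) \le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right),$$

$$\Pr(S - E[S] \le -\varepsilon) \le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$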
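To get a feel for how tight Hoeffding's inequality is, the sketch below compares it with an exact tail probability. It assumes Bernoulli variables (so $[a_i, b_i] = [0, 1]$ and $S$ is binomial); the parameters $n = 100$, $p = 0.7$ and the function names are illustrative.

```python
import math

def binomial_upper_tail(n, p, threshold):
    """Exact Pr(S >= threshold) for S ~ Binomial(n, p)."""
    # Small tolerance so that thresholds like 75.0 are not pushed up by rounding.
    k_min = max(0, math.ceil(threshold - 1e-9))
    return sum(math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(k_min, n + 1))

def hoeffding_bound(eps, widths):
    """exp(-2 eps^2 / sum_i (b_i - a_i)^2), with widths[i] = b_i - a_i."""
    return math.exp(-2.0 * eps ** 2 / sum(w * w for w in widths))

n, p = 100, 0.7
for eps in (5, 10, 15):
    exact = binomial_upper_tail(n, p, n * p + eps)  # Pr(S - E[S] >= eps)
    bound = hoeffding_bound(eps, [1.0] * n)
    print(eps, exact, bound)
```

The bound holds for every distribution supported on the given intervals, so for a specific distribution such as this one it is typically conservative.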

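Lemma 3 (Massart's lemma). Let $A \subset \mathbb{R}^m$ with $|A| < \infty$, let $\|\cdot\|$ be the 2-norm on $\mathbb{R}^m$, and let $r = \max_{x \in A} \|x\|$. Let $\sigma_1, \ldots, \sigma_m \sim_{\text{i.i.d.}} P$ with $P(\sigma = +1) = P(\sigma = -1) = 1/2$. Then

$$E_\sigma\left[\frac{1}{m} \sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\right] \le \frac{r \sqrt{2 \log |A|}}{m}, \qquad x = (x_1, \ldots, x_m)$$

holds.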
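For small $m$ the expectation over $\sigma$ in Massart's lemma can be computed exactly by enumerating all $2^m$ sign vectors. The sketch below does so for a hypothetical four-point set $A \subset \mathbb{R}^4$ and compares the result with the bound; the set and the function names are made up for this example.

```python
import math
from itertools import product

def rademacher_average(points):
    """E_sigma[(1/m) max_{x in A} sum_i sigma_i x_i], by full enumeration."""
    m = len(points[0])
    total = 0.0
    for signs in product((+1, -1), repeat=m):  # all 2^m equally likely sign vectors
        total += max(sum(s * xi for s, xi in zip(signs, x)) for x in points)
    return total / (2 ** m) / m

def massart_bound(points):
    """r * sqrt(2 log|A|) / m with r the largest 2-norm in A."""
    m = len(points[0])
    r = max(math.sqrt(sum(xi * xi for xi in x)) for x in points)
    return r * math.sqrt(2.0 * math.log(len(points))) / m

# Hypothetical finite set A in R^4.
A = [(1, 0, 1, -1), (0, 2, -1, 1), (-1, 1, 1, 0), (2, -1, 0, 1)]
print(rademacher_average(A), massart_bound(A))
```

Because $A$ is finite, the supremum in the lemma is a maximum, which is what the code computes.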

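Lemma 4 (McDiarmid's inequality). Let $X_1, \ldots, X_n$ be independent random variables taking values in a set $\mathcal{X}$. Suppose that for a function $f: \mathcal{X}^n \to \mathbb{R}$ there exist constants $c_1, \ldots, c_n$ such that, for any $x_1, \ldots, x_n, x_i' \in \mathcal{X}$,

$$|f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)| \le c_i, \qquad i = 1, \ldots, n.$$

Then the following holds:

$$\Pr\bigl(f(X_1, \ldots, X_n) - E[f(X_1, \ldots, X_n)] \ge \varepsilon\bigr) \le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^n c_i^2}\right),$$

$$\Pr\bigl(f(X_1, \ldots, X_n) - E[f(X_1, \ldots, X_n)] \le -\varepsilon\bigr) \le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^n c_i^2}\right).$$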
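One way to see McDiarmid's inequality as a generalization of Hoeffding's inequality is to take $f$ to be the sum. The following is a sketch under the assumption that each $X_i$ is treated as taking values in its own interval $[a_i, b_i]$, so that the bounded-difference condition only needs to be checked there.

```latex
\begin{align*}
f(x_1, \ldots, x_n) &= \sum_{i=1}^n x_i, \qquad x_i \in [a_i, b_i], \\
|f(\ldots, x_i, \ldots) - f(\ldots, x_i', \ldots)| &= |x_i - x_i'| \le b_i - a_i =: c_i, \\
\Pr(S - E[S] \ge \varepsilon) &\le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right),
\qquad S = f(X_1, \ldots, X_n).
\end{align*}
```

The last line is the first bound of Lemma 2; the lower-tail bound follows in the same way from the second bound of Lemma 4.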
