Semi-Supervised Learning
D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Schölkopf
Presented by: Tal Babaioff




Semi-Supervised Learning

Use a small number of labeled examples to label a large amount of cheap unlabeled data.

Basic idea: similar examples should be given the same classification.

Typical example: web page classification, where unlabeled data is unlimited and cheap while labeling is expensive.


The Cluster Assumption

The basic assumption of most semi-supervised learning algorithms:

Two points that are connected by a path going through high-density regions should have the same label.


Example


Basic Approaches

Using a weighted graph with weights representing point similarity:

K nearest neighbors: the most naive approach.

Random walk on the graph: a particle starts from an unlabeled node i and moves to node j with probability P_ij; the walk continues until the particle hits a labeled node. Node i is then classified by the label it is most likely to hit first (see the sketch below).
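To make the random-walk idea concrete, here is a minimal numpy sketch (my own illustration, not code from the slides): it treats labeled nodes as absorbing states and computes, for every unlabeled node, the probability of being absorbed at each labeled node, then pools those probabilities per class.

```python
import numpy as np

def random_walk_classify(W, labels):
    """Classify unlabeled nodes by absorption probabilities of a random walk.

    W:      (n, n) symmetric non-negative similarity matrix.
    labels: length-n integer array; class in {0, ..., c-1} for labeled
            nodes, -1 for unlabeled nodes.
    """
    P = W / W.sum(axis=1, keepdims=True)        # transition probabilities P_ij
    u = np.flatnonzero(labels < 0)              # transient (unlabeled) nodes
    a = np.flatnonzero(labels >= 0)             # absorbing (labeled) nodes
    # B[i, j] = probability that a walk started at unlabeled node u[i]
    # is first absorbed at labeled node a[j]: B = (I - P_uu)^(-1) P_ua.
    B = np.linalg.solve(np.eye(len(u)) - P[np.ix_(u, u)], P[np.ix_(u, a)])
    c = labels.max() + 1
    class_prob = np.zeros((len(u), c))
    for j, node in enumerate(a):                # pool probabilities per class
        class_prob[:, labels[node]] += B[:, j]
    pred = labels.copy()
    pred[u] = class_prob.argmax(axis=1)         # most likely class to hit
    return pred
```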


Basic Approaches

An electrical network: connect all the points labeled 1 to a positive voltage source, and all the points labeled 0 to a negative one. The graph edges are resistors with conductances W.

Each unlabeled point is then classified according to its voltage in the resulting electrical network (see the sketch below).
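A sketch of this electrical-network view under the same conventions (binary labels 0/1, with -1 marking unlabeled points; my own illustration): the labeled nodes are clamped, and the free voltages follow from current conservation, i.e. the discrete Laplace equation.

```python
import numpy as np

def network_voltages(W, labels):
    """Voltages when label-1 nodes are clamped to +1 and label-0 nodes
    to -1; every edge (i, j) is a resistor with conductance W_ij.

    The free voltages satisfy L_uu v_u = -L_ua v_a, where L = D - W is
    the graph Laplacian (zero net current at each unlabeled node).
    """
    L = np.diag(W.sum(axis=1)) - W
    u = np.flatnonzero(labels < 0)              # unlabeled nodes
    a = np.flatnonzero(labels >= 0)             # labeled nodes
    v = np.zeros(len(labels))
    v[labels == 1] = 1.0                        # positive source
    v[labels == 0] = -1.0                       # negative source
    v[u] = np.linalg.solve(L[np.ix_(u, u)], -L[np.ix_(u, a)] @ v[a])
    return (v >= 0).astype(int)                 # classify by voltage sign
```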


Other Approaches

Harmonic energy minimization: use a Gaussian field over a continuous state space, with weights given by a similarity function between points.


The Consistency Assumption

Points in the same local high-density region are more similar to each other (and thus likely to have the same label) than to points outside this region (local consistency).

Points on the same global structure (a cluster or a manifold) are more similar to each other than to points outside of this structure (global consistency).


Consistency Assumption Example


Formal Representation

X = {x_1, …, x_l, x_{l+1}, …, x_n} ⊂ R^m

Label set L = {1, …, c}. The first l points have labels y_i ∈ {1, …, c}.

For points with i > l, y_i is unknown.

The error is evaluated on the unlabeled examples only.


Basic Ideas for the Algorithm

Define a similarity function that changes slowly locally within high-density regions and respects the global structure of the manifold on which the data points lie.

Define an activation network, represented as a graph whose edge weights are determined by the similarity between each pair of points.


Basic Ideas for the Algorithm

Use the labeled points as sources that pump the class labels through the graph, and use the newly labeled points as additional sources, until a stable state is reached.

Each unlabeled point is assigned the class from which it has received the most information during the iteration process.


Algorithm: Data Structure

Given a set of points X = {x_1, …, x_l, x_{l+1}, …, x_n}.

The first l points have labels y_i ∈ {1, …, c}; the rest are unlabeled.

The classification is represented by an n × c non-negative matrix F.

The classification of point x_i is y_i = argmax_{j≤c} F_ij.

Let Y be an n × c matrix with Y_ij = 1 if point i has label y_i = j, and Y_ij = 0 otherwise (a small helper below illustrates building Y).
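A tiny helper for building Y under these conventions (illustrative; -1 marks an unlabeled point):

```python
import numpy as np

def build_label_matrix(labels, c):
    """Y[i, j] = 1 iff point i carries label j; unlabeled rows stay all-zero."""
    Y = np.zeros((len(labels), c))
    for i, y in enumerate(labels):
        if y >= 0:          # -1 marks an unlabeled point
            Y[i, y] = 1.0
    return Y
```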


The Consistency Algorithm

1. Form the affinity matrix W defined by W_ij = exp(−‖x_i − x_j‖² / 2σ²) if i ≠ j, and W_ii = 0.

2. Compute the matrix S = D^(−1/2) W D^(−1/2), where D is the diagonal matrix whose (i, i) element equals the sum of the i-th row of W. The eigenvectors of S capture the spectral clustering structure of the data.


The Consistency Algorithm

3. Iterate F(t+1) = αSF(t) + (1−α)Y until convergence, with α ∈ (0, 1).

4. Let F* denote the limit of the sequence {F(t)}. Label each unlabeled point x_i by y_i = argmax_{j≤c} F*_ij.
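Putting the four steps together, a minimal dense-numpy sketch (my own implementation of the steps above, not the authors' code; sigma and alpha are the algorithm's free parameters, and the convergence tolerance is an arbitrary choice):

```python
import numpy as np

def consistency_algorithm(X, labels, c, sigma=1.0, alpha=0.99, max_iter=1000):
    """Steps 1-4 of the consistency algorithm.

    X:      (n, m) data matrix.
    labels: length-n array; class in {0, ..., c-1} or -1 if unlabeled.
    """
    # Step 1: Gaussian affinities with a zeroed diagonal.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dist / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Step 2: symmetric normalization S = D^(-1/2) W D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Initial label matrix Y (one-hot rows for the labeled points).
    Y = np.zeros((len(labels), c))
    Y[labels >= 0, labels[labels >= 0]] = 1.0

    # Step 3: iterate F <- alpha*S*F + (1 - alpha)*Y until convergence.
    F = Y.copy()
    for _ in range(max_iter):
        F_new = alpha * (S @ F) + (1 - alpha) * Y
        if np.abs(F_new - F).max() < 1e-9:
            F = F_new
            break
        F = F_new

    # Step 4: label each point by the largest entry in its row of F*.
    return F.argmax(axis=1)
```

The paper's experiments set α = 0.99; σ is a data-dependent choice.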


Consistency Algorithm – Convergence

Show that the algorithm converges to F* = (1−α)(I − αS)⁻¹Y.

Without loss of generality, let F(0) = Y. Since F(t+1) = αSF(t) + (1−α)Y, induction gives

F(t) = (αS)^t Y + (1−α) Σ_{i=0}^{t−1} (αS)^i Y.


Consistency Algorithm – Convergence

Claim: F* = (1−α)(I − αS)⁻¹Y.

From the previous slide, F(t) = (αS)^t Y + (1−α) Σ_{i=0}^{t−1} (αS)^i Y.

Since 0 < α < 1 and the eigenvalues of S lie in [−1, 1]:

lim_{t→∞} (αS)^t = 0  and  lim_{t→∞} Σ_{i=0}^{t−1} (αS)^i = (I − αS)⁻¹.

Hence F* = lim_{t→∞} F(t) = (1−α)(I − αS)⁻¹Y.
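Because the limit has this closed form, F* can be computed with a single linear solve instead of iterating. A short sketch, continuing with the hypothetical S, Y, and alpha from the implementation above:

```python
import numpy as np

def consistency_closed_form(S, Y, alpha=0.99):
    """F* = (1 - alpha) * (I - alpha*S)^(-1) Y, computed via a linear
    solve rather than an explicit matrix inverse."""
    n = S.shape[0]
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)
```

For large n the iterative form can still be preferable, since it only needs matrix products with the (typically sparse) S.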


Regularization Framework

Define a cost function Q(F) for the iteration stage:

Smoothness constraint: a good classifying function should not change too much between nearby points.

The classifying function is the minimizer of this cost, F* = argmin_F Q(F).


Regularization Framework

Fitting constraint: a good classifying function should not change too much from the initial label assignment.

μ > 0: the trade-off between the two constraints.


Regularization Framework
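The formula on this slide did not survive extraction; for reference, the cost function and its minimizer as given in the Zhou et al. paper are:

```latex
Q(F) = \frac{1}{2}\left[
    \sum_{i,j=1}^{n} W_{ij}
      \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2
    + \mu \sum_{i=1}^{n} \left\| F_i - Y_i \right\|^2
  \right]
% Setting the derivative with respect to F to zero at the minimizer F^*:
F^* - S F^* + \mu \left( F^* - Y \right) = 0
\;\Longrightarrow\;
F^* = (1-\alpha)\,(I - \alpha S)^{-1} Y,
\qquad \alpha = \tfrac{1}{1+\mu}.
```

This recovers exactly the limit of the iteration, so the algorithm can be read as regularized risk minimization: the first term enforces smoothness over the graph, the second fidelity to the given labels.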

Results: Two-Moon Toy Problem

Results: Digit Recognition

Run the algorithm on the USPS data set with digits 1, 2, 3, and 4.

Class sizes are 1269, 929, 824, and 852 (3874 in total).

The test errors are averaged over 30 trials.

The samples were chosen so that each trial contains at least one labeled point of each class.


Results: Digit Recognition

Number of labeled data    4             8             12            16
k-nearest neighbor        25.45 (1.30)  20.23 (1.17)  15.52 (0.87)  13.87 (0.70)
Normal SVM                25.89 (1.30)  16.89 (1.14)  11.15 (0.53)  10.10 (0.52)
Cluster kernel            16.24 (1.40)  10.94 (1.03)  7.52 (0.29)   7.20 (0.33)
LP (label propagation)    62.24 (1.60)  59.12 (2.07)  52.54 (2.15)  46.78 (2.07)
Consistency algorithm     8.02 (1.51)   4.11 (0.42)   2.76 (0.15)   2.73 (0.27)


Results: Digit Recognition

Number of labeled data    20            24            28            32
k-nearest neighbor        12.08 (0.51)  10.93 (0.50)  10.03 (0.37)  9.30 (0.68)
Normal SVM                8.71 (0.35)   7.91 (0.29)   7.52 (0.38)   7.37 (0.31)
Cluster kernel            6.55 (0.22)   6.13 (0.20)   6.04 (0.25)   5.97 (0.24)
LP (label propagation)    44.88 (2.33)  39.28 (2.12)  35.18 (1.55)  30.67 (1.67)
Consistency algorithm     2.19 (0.11)   2.04 (0.10)   1.97 (0.11)   1.79 (0.27)


Results: Digit Recognition

Results averaged over 100 trials (figure).


Results: Text Classification

Use the Mac and Windows subsets of the 20 Newsgroups data set.

There are 961 and 985 examples in the two classes, with 7511 dimensions.


Results: Text Classification

Number of labeled data    2             4             8             16
k-nearest neighbor        37.40 (0.0)   35.97 (0.50)  32.80 (0.62)  29.92 (0.68)
Normal SVM                50.17 (0.06)  35.66 (1.12)  32.17 (1.43)  28.92 (1.39)
Cluster kernel            45.28 (1.47)  23.89 (2.06)  15.13 (0.99)  11.04 (0.45)
LP (label propagation)    49.11 (1.10)  49.08 (0.11)  49.29 (0.08)  49.04 (0.07)
Consistency algorithm     24.05 (1.96)  20.83 (1.60)  15.56 (0.79)  13.52 (0.64)


Results: Text Classification 2

Use the topic "rec", which contains the autos, motorcycles, baseball, and hockey subsets.

Preprocessing (a toy sketch follows below):
- Remove endings from all words (ing, ed, …).
- Drop words that appear on the SMART stop list (the, of, …).
- Ignore the headers.
- Use only words that appear in 5 or more articles.

Data set size: 3970 document vectors in an 8014-dimensional space.
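A toy sketch of this preprocessing pipeline (illustrative only: the real SMART stop list and a proper stemmer are replaced by tiny stand-ins, and header removal is omitted):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "and", "to", "in"}   # stand-in for the SMART list
SUFFIXES = ("ing", "ed", "es", "s")                  # crude ending removal

def build_vectors(docs, min_df=5):
    """Tokenize, strip common endings, drop stop words, and keep only
    words that appear in at least min_df documents."""
    def tokenize(text):
        words = []
        for w in re.findall(r"[a-z]+", text.lower()):
            for suf in SUFFIXES:
                if w.endswith(suf) and len(w) > len(suf) + 2:
                    w = w[: -len(suf)]
                    break
            if w not in STOP_WORDS:
                words.append(w)
        return words

    token_lists = [tokenize(d) for d in docs]
    doc_freq = Counter(w for toks in token_lists for w in set(toks))
    vocab = sorted(w for w, k in doc_freq.items() if k >= min_df)
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])   # bag-of-words row
    return vectors, vocab
```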


References:

Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, Bernhard Schölkopf. Learning with Local and Global Consistency. http://www.kyb.mpg.de/publications/pdfs/pdf2333.pdf

Xiaojin Zhu, Zoubin Ghahramani, John Lafferty. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. http://www.hpl.hp.com/conferences/icml2003/papers/132.pdf


The End