Semi-Supervised Learning
D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Schölkopf
Presented by: Tal Babaioff




Semi-Supervised Learning

Use a small number of labeled examples to label a large amount of cheap unlabeled data.

Basic idea: similar examples should be given the same classification.

Typical example: web page classification, where unlabeled data is unlimited and cheap while labeling is expensive.


The Cluster Assumption

The basic assumption of most semi-supervised learning algorithms:

Two points that are connected by a path going through high-density regions should have the same label.


Example


Basic Approaches

Using a weighted graph with weights representing point similarity:

K nearest neighbors: the most naive approach.

Random walk on the graph: a particle starts from an unlabeled node i and moves to node j with probability P_ij; the walk continues until the particle hits a labeled node. Node i is then classified by the label it is most likely to hit first (see the sketch below).
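To make the random-walk idea concrete, here is a minimal numpy sketch (my own illustration, not code from the slides): it treats labeled nodes as absorbing states and computes, for every unlabeled node, the probability of being absorbed at each labeled node, then pools those probabilities per class.

```python
import numpy as np

def random_walk_classify(W, labels):
    """Classify unlabeled nodes by absorption probabilities of a random walk.

    W:      (n, n) symmetric non-negative similarity matrix.
    labels: length-n integer array; class in {0, ..., c-1} for labeled
            nodes, -1 for unlabeled nodes.
    """
    P = W / W.sum(axis=1, keepdims=True)        # transition probabilities P_ij
    u = np.flatnonzero(labels < 0)              # transient (unlabeled) nodes
    a = np.flatnonzero(labels >= 0)             # absorbing (labeled) nodes
    # B[i, j] = probability that a walk started at unlabeled node u[i]
    # is first absorbed at labeled node a[j]: B = (I - P_uu)^(-1) P_ua.
    B = np.linalg.solve(np.eye(len(u)) - P[np.ix_(u, u)], P[np.ix_(u, a)])
    c = labels.max() + 1
    class_prob = np.zeros((len(u), c))
    for j, node in enumerate(a):                # pool probabilities per class
        class_prob[:, labels[node]] += B[:, j]
    pred = labels.copy()
    pred[u] = class_prob.argmax(axis=1)         # most likely class to hit
    return pred
```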


Basic Approaches

An electrical network: connect all the points labeled 1 to a positive voltage source, and all the points labeled 0 to a negative one. The graph edges are resistors with conductances W.

Each unlabeled point is then classified according to its voltage in the resulting electrical network (see the sketch below).
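A sketch of this electrical-network view under the same conventions (binary labels 0/1, with -1 marking unlabeled points; my own illustration): the labeled nodes are clamped, and the free voltages follow from current conservation, i.e. the discrete Laplace equation.

```python
import numpy as np

def network_voltages(W, labels):
    """Voltages when label-1 nodes are clamped to +1 and label-0 nodes
    to -1; every edge (i, j) is a resistor with conductance W_ij.

    The free voltages satisfy L_uu v_u = -L_ua v_a, where L = D - W is
    the graph Laplacian (zero net current at each unlabeled node).
    """
    L = np.diag(W.sum(axis=1)) - W
    u = np.flatnonzero(labels < 0)              # unlabeled nodes
    a = np.flatnonzero(labels >= 0)             # labeled nodes
    v = np.zeros(len(labels))
    v[labels == 1] = 1.0                        # positive source
    v[labels == 0] = -1.0                       # negative source
    v[u] = np.linalg.solve(L[np.ix_(u, u)], -L[np.ix_(u, a)] @ v[a])
    return (v >= 0).astype(int)                 # classify by voltage sign
```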


Other Approaches

Harmonic energy minimization: use a Gaussian field over a continuous state space, with weights given by a similarity function between points.


The Consistency Assumption

Points in the same local high-density region are more similar to each other (and thus likely to have the same label) than to points outside this region (local consistency).

Points on the same global structure (a cluster or a manifold) are more similar to each other than to points outside of this structure (global consistency).


Consistency Assumption Example


Formal Representation

X = {x_1, …, x_l, x_{l+1}, …, x_n} ⊂ R^m

Label set L = {1, …, c}. The first l points have labels y_i ∈ {1, …, c}.

For points with i > l, y_i is unknown.

The error is evaluated on the unlabeled examples only.


Basic Ideas for the Algorithm

Define a similarity function that changes slowly locally within high-density regions and respects the global structure of the manifold on which the data points lie.

Define an activation network, represented as a graph whose edge weights are determined by the similarity between each pair of points.


Basic Ideas for the Algorithm

Use the labeled points as sources that pump the class labels through the graph, and use the newly labeled points as additional sources, until a stable state is reached.

Each unlabeled point is assigned the class from which it has received the most information during the iteration process.


Algorithm: Data Structure

Given a set of points X = {x_1, …, x_l, x_{l+1}, …, x_n}.

The first l points have labels y_i ∈ {1, …, c}; the rest are unlabeled.

The classification is represented by an n × c non-negative matrix F.

The classification of point x_i is y_i = argmax_{j≤c} F_ij.

Let Y be an n × c matrix with Y_ij = 1 if point i has label y_i = j, and Y_ij = 0 otherwise (a small helper below illustrates building Y).
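A tiny helper for building Y under these conventions (illustrative; -1 marks an unlabeled point):

```python
import numpy as np

def build_label_matrix(labels, c):
    """Y[i, j] = 1 iff point i carries label j; unlabeled rows stay all-zero."""
    Y = np.zeros((len(labels), c))
    for i, y in enumerate(labels):
        if y >= 0:          # -1 marks an unlabeled point
            Y[i, y] = 1.0
    return Y
```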


The Consistency Algorithm

1. Form the affinity matrix W defined by W_ij = exp(−‖x_i − x_j‖² / 2σ²) if i ≠ j, and W_ii = 0.

2. Compute the matrix S = D^(−1/2) W D^(−1/2), where D is the diagonal matrix whose (i, i) element equals the sum of the i-th row of W. The eigenvectors of S capture the spectral clustering structure of the data.


The Consistency Algorithm

3. Iterate F(t+1) = αSF(t) + (1−α)Y until convergence, with α ∈ (0, 1).

4. Let F* denote the limit of the sequence {F(t)}. Label each unlabeled point x_i by y_i = argmax_{j≤c} F*_ij.
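Putting the four steps together, a minimal dense-numpy sketch (my own implementation of the steps above, not the authors' code; sigma and alpha are the algorithm's free parameters, and the convergence tolerance is an arbitrary choice):

```python
import numpy as np

def consistency_algorithm(X, labels, c, sigma=1.0, alpha=0.99, max_iter=1000):
    """Steps 1-4 of the consistency algorithm.

    X:      (n, m) data matrix.
    labels: length-n array; class in {0, ..., c-1} or -1 if unlabeled.
    """
    # Step 1: Gaussian affinities with a zeroed diagonal.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dist / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Step 2: symmetric normalization S = D^(-1/2) W D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Initial label matrix Y (one-hot rows for the labeled points).
    Y = np.zeros((len(labels), c))
    Y[labels >= 0, labels[labels >= 0]] = 1.0

    # Step 3: iterate F <- alpha*S*F + (1 - alpha)*Y until convergence.
    F = Y.copy()
    for _ in range(max_iter):
        F_new = alpha * (S @ F) + (1 - alpha) * Y
        if np.abs(F_new - F).max() < 1e-9:
            F = F_new
            break
        F = F_new

    # Step 4: label each point by the largest entry in its row of F*.
    return F.argmax(axis=1)
```

The paper's experiments set α = 0.99; σ is a data-dependent choice.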


Consistency Algorithm – Convergence

Show that the algorithm converges to F* = (1−α)(I − αS)⁻¹Y.

Without loss of generality, let F(0) = Y. Since F(t+1) = αSF(t) + (1−α)Y, induction gives

F(t) = (αS)^t Y + (1−α) Σ_{i=0}^{t−1} (αS)^i Y.


Consistency Algorithm – Convergence

Claim: F* = (1−α)(I − αS)⁻¹Y.

From the previous slide, F(t) = (αS)^t Y + (1−α) Σ_{i=0}^{t−1} (αS)^i Y.

Since 0 < α < 1 and the eigenvalues of S lie in [−1, 1]:

lim_{t→∞} (αS)^t = 0  and  lim_{t→∞} Σ_{i=0}^{t−1} (αS)^i = (I − αS)⁻¹.

Hence F* = lim_{t→∞} F(t) = (1−α)(I − αS)⁻¹Y.
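Because the limit has this closed form, F* can be computed with a single linear solve instead of iterating. A short sketch, continuing with the hypothetical S, Y, and alpha from the implementation above:

```python
import numpy as np

def consistency_closed_form(S, Y, alpha=0.99):
    """F* = (1 - alpha) * (I - alpha*S)^(-1) Y, computed via a linear
    solve rather than an explicit matrix inverse."""
    n = S.shape[0]
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)
```

For large n the iterative form can still be preferable, since it only needs matrix products with the (typically sparse) S.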


Regularization Framework

Define a cost function Q(F) for the iteration stage:

Smoothness constraint: a good classifying function should not change too much between nearby points.

The classifying function is the minimizer of this cost, F* = argmin_F Q(F).


Regularization Framework

Fitting constraint: a good classifying function should not change too much from the initial label assignment.

μ > 0: the trade-off between the two constraints.


Regularization Framework
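The formula on this slide did not survive extraction; for reference, the cost function and its minimizer as given in the Zhou et al. paper are:

```latex
Q(F) = \frac{1}{2}\left[
    \sum_{i,j=1}^{n} W_{ij}
      \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2
    + \mu \sum_{i=1}^{n} \left\| F_i - Y_i \right\|^2
  \right]
% Setting the derivative with respect to F to zero at the minimizer F^*:
F^* - S F^* + \mu \left( F^* - Y \right) = 0
\;\Longrightarrow\;
F^* = (1-\alpha)\,(I - \alpha S)^{-1} Y,
\qquad \alpha = \tfrac{1}{1+\mu}.
```

This recovers exactly the limit of the iteration, so the algorithm can be read as regularized risk minimization: the first term enforces smoothness over the graph, the second fidelity to the given labels.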

Results: Two-Moon Toy Problem

Results: Digit Recognition

Run the algorithm on the USPS data set with digits 1, 2, 3, and 4.

Class sizes are 1269, 929, 824, and 852 (3874 in total).

The test errors are averaged over 30 trials.

The samples were chosen so that each trial contains at least one labeled point of each class.


Results: Digit Recognition

Number of labeled data    4             8             12            16
k-nearest neighbor        25.45 (1.30)  20.23 (1.17)  15.52 (0.87)  13.87 (0.70)
Normal SVM                25.89 (1.30)  16.89 (1.14)  11.15 (0.53)  10.10 (0.52)
Cluster kernel            16.24 (1.40)  10.94 (1.03)  7.52 (0.29)   7.20 (0.33)
LP (label propagation)    62.24 (1.60)  59.12 (2.07)  52.54 (2.15)  46.78 (2.07)
Consistency algorithm     8.02 (1.51)   4.11 (0.42)   2.76 (0.15)   2.73 (0.27)


Results: Digit Recognition

Number of labeled data    20            24            28            32
k-nearest neighbor        12.08 (0.51)  10.93 (0.50)  10.03 (0.37)  9.30 (0.68)
Normal SVM                8.71 (0.35)   7.91 (0.29)   7.52 (0.38)   7.37 (0.31)
Cluster kernel            6.55 (0.22)   6.13 (0.20)   6.04 (0.25)   5.97 (0.24)
LP (label propagation)    44.88 (2.33)  39.28 (2.12)  35.18 (1.55)  30.67 (1.67)
Consistency algorithm     2.19 (0.11)   2.04 (0.10)   1.97 (0.11)   1.79 (0.27)


Results: Digit Recognition

Results averaged over 100 trials (figure).


Results: Text Classification

Use the Mac and Windows subsets of the 20 Newsgroups data set.

There are 961 and 985 examples in the two classes, with 7511 dimensions.


Results: Text Classification

Number of labeled data    2             4             8             16
k-nearest neighbor        37.40 (0.0)   35.97 (0.50)  32.80 (0.62)  29.92 (0.68)
Normal SVM                50.17 (0.06)  35.66 (1.12)  32.17 (1.43)  28.92 (1.39)
Cluster kernel            45.28 (1.47)  23.89 (2.06)  15.13 (0.99)  11.04 (0.45)
LP (label propagation)    49.11 (1.10)  49.08 (0.11)  49.29 (0.08)  49.04 (0.07)
Consistency algorithm     24.05 (1.96)  20.83 (1.60)  15.56 (0.79)  13.52 (0.64)


Results: Text Classification 2

Use the topic "rec", which contains the autos, motorcycles, baseball, and hockey subsets.

Preprocessing (a toy sketch follows below):
- Remove endings from all words (ing, ed, …).
- Drop words that appear on the SMART stop list (the, of, …).
- Ignore the headers.
- Use only words that appear in 5 or more articles.

Data set size: 3970 document vectors in an 8014-dimensional space.
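A toy sketch of this preprocessing pipeline (illustrative only: the real SMART stop list and a proper stemmer are replaced by tiny stand-ins, and header removal is omitted):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "and", "to", "in"}   # stand-in for the SMART list
SUFFIXES = ("ing", "ed", "es", "s")                  # crude ending removal

def build_vectors(docs, min_df=5):
    """Tokenize, strip common endings, drop stop words, and keep only
    words that appear in at least min_df documents."""
    def tokenize(text):
        words = []
        for w in re.findall(r"[a-z]+", text.lower()):
            for suf in SUFFIXES:
                if w.endswith(suf) and len(w) > len(suf) + 2:
                    w = w[: -len(suf)]
                    break
            if w not in STOP_WORDS:
                words.append(w)
        return words

    token_lists = [tokenize(d) for d in docs]
    doc_freq = Counter(w for toks in token_lists for w in set(toks))
    vocab = sorted(w for w, k in doc_freq.items() if k >= min_df)
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])   # bag-of-words row
    return vectors, vocab
```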


References:

Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, Bernhard Schölkopf. Learning with Local and Global Consistency. http://www.kyb.mpg.de/publications/pdfs/pdf2333.pdf

Xiaojin Zhu, Zoubin Ghahramani, John Lafferty. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. http://www.hpl.hp.com/conferences/icml2003/papers/132.pdf


The End