Semi-Supervised Learning

D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Schölkopf
Presented by: Tal Babaioff
Semi-Supervised Learning

Use a small number of labeled examples to label a large amount of cheap unlabeled data.
Basic idea: similar examples should be given the same classification.
Typical example: web page classification, where there is an unlimited amount of cheap unlabeled data while labeling is expensive.
The Cluster Assumption
The basic assumption of most semi-supervised learning algorithms:
Two points that are connected by a path going through high density regions should have the same label.
Example

[Figure omitted.]
Basic Approaches

Using a weighted graph with weights representing point similarity:
K-nearest neighbors: the most naive approach.
Random walk on the graph (see the sketch below): a particle starts from an unlabeled node i and moves to a node j with probability P_ij; the walk continues until the particle hits a labeled node.
The classification of node i is the label it hits with maximum probability.
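A minimal sketch of this random-walk classifier, assuming a precomputed similarity matrix (the function name and the -1 encoding for unlabeled points are illustrative, not from the slides):

```python
import numpy as np

def random_walk_classify(W, labels):
    """Absorbing random walk on a similarity graph.

    W      : (n, n) symmetric non-negative similarity matrix.
    labels : length-n integer array; class index for labeled nodes, -1 otherwise.
    """
    P = W / W.sum(axis=1, keepdims=True)      # P_ij = probability of moving i -> j
    u = np.where(labels < 0)[0]               # unlabeled nodes (transient states)
    l = np.where(labels >= 0)[0]              # labeled nodes (absorbing states)
    # Absorption probabilities: B = (I - P_uu)^(-1) P_ul
    B = np.linalg.solve(np.eye(len(u)) - P[np.ix_(u, u)], P[np.ix_(u, l)])
    classes = np.unique(labels[l])
    # Total probability of being absorbed in each class; classify by the maximum.
    scores = np.stack([B[:, labels[l] == c].sum(axis=1) for c in classes], axis=1)
    out = labels.copy()
    out[u] = classes[scores.argmax(axis=1)]
    return out
```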
Basic Approaches

An electrical network: connect all the points labeled 1 to a positive voltage source and all the points labeled 0 to a negative one. The graph edges are resistors with conductances W.
The classification of each unlabeled point is determined by its voltage in the complete electrical network.
Other Approaches

Harmonic energy minimization (sketched below): use a Gaussian field over a continuous state space, with weights given by a similarity function between points.
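A sketch of the harmonic solution for the binary case, following the Zhu, Ghahramani and Lafferty paper cited at the end (variable names are mine). The same linear system also gives the voltages of the electrical network above:

```python
import numpy as np

def harmonic_solution(W, f_l, labeled_mask):
    """Harmonic energy minimization for binary labels f_l in {0, 1}.

    W            : (n, n) symmetric similarity matrix.
    f_l          : label values at the labeled points.
    labeled_mask : boolean array, True where a point is labeled.
    """
    L = np.diag(W.sum(axis=1)) - W            # combinatorial graph Laplacian D - W
    ui = np.where(~labeled_mask)[0]           # unlabeled indices
    li = np.where(labeled_mask)[0]            # labeled indices
    # The harmonic function satisfies L_uu f_u = W_ul f_l, i.e. every unlabeled
    # value is the weighted average of its neighbors' values.
    f_u = np.linalg.solve(L[np.ix_(ui, ui)],
                          W[np.ix_(ui, li)] @ np.asarray(f_l, dtype=float))
    return f_u
```

Thresholding f_u at 1/2 then yields the binary classification.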
The Consistency Assumption
Points in the same local high-density region are more similar to each other (and thus likely to have the same label) than to points outside this region (local consistency).
Points on the same global structure (a cluster or a manifold) are more similar to each other than to points outside of this structure (global consistency).
Consistency Assumption Example

[Figures omitted.]
Formal Representation

X = {x_1, ..., x_l, x_{l+1}, ..., x_n} ⊂ R^m
Label set L = {1, ..., c}.
The first l points have labels y_i ∈ {1, ..., c}.
For points with i > l, y_i is unknown.
The error is checked on the unlabeled examples only.
Basic Ideas for the Algorithm

Define a similarity function that changes slowly locally, within high-density regions, and globally along the manifold on which the data points lie.
Define an activation network, represented as a graph whose weights are determined by the similarity of each pair of points.
Basic Ideas for the Algorithm

Use the labeled points as sources that pump the class labels through the graph, and use the newly labeled points as additional sources, until a stable state is reached.
The label of each unlabeled point is set to the class from which it has received the most information during the iteration process.
Algorithm: Data Structures

Given a set of points X = {x_1, ..., x_l, x_{l+1}, ..., x_n}.
The first l points have labels y_i ∈ {1, ..., c}; the rest are unlabeled.
The classification is represented by an n × c non-negative matrix F.
The classification of point x_i is y_i = argmax_{j ≤ c} F_ij.
Let Y be an n × c matrix with elements Y_ij = 1 if point i has label y_i = j, and 0 otherwise.
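A small sketch of these two data structures (using -1 to mark unlabeled points, which is my convention, not the slides'):

```python
import numpy as np

def build_Y(y, c):
    """The n x c matrix Y: Y_ij = 1 if point i carries label j, 0 otherwise.
    Rows of unlabeled points (y_i = -1) stay all zero."""
    Y = np.zeros((len(y), c))
    labeled = np.where(y >= 0)[0]
    Y[labeled, y[labeled]] = 1.0
    return Y

def classify(F):
    """Read the classification out of F: y_i = argmax_j F_ij."""
    return F.argmax(axis=1)
```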
The Consistency Algorithm
1. Form the affinity matrix W defined by W_ij = exp(-||x_i - x_j||² / 2σ²) if i ≠ j, and W_ii = 0.
2. Compute the matrix S = D^(-1/2) W D^(-1/2), where D is the diagonal matrix whose (i, i) element is the sum of the i-th row of W. The eigenvectors of S capture the spectral cluster structure of the data.
The Consistency Algorithm
3. Iterate F(t+1) = αSF(t) + (1 - α)Y until convergence, where α ∈ (0, 1).
4. Let F* denote the limit of the sequence {F(t)}. Label each unlabeled point x_i by y_i = argmax_{j ≤ c} F*_ij.
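Putting the four steps together, a minimal NumPy sketch (σ is a free parameter, α = 0.99 follows the paper, and the function name is mine):

```python
import numpy as np

def consistency_algorithm(X, y, sigma=1.0, alpha=0.99, n_iter=1000):
    """Zhou et al.'s consistency algorithm; y uses -1 for unlabeled points."""
    n, c = len(X), y.max() + 1
    # Step 1: affinity matrix with an RBF kernel and zero diagonal.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: symmetric normalization S = D^(-1/2) W D^(-1/2).
    d_isqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_isqrt[:, None] * d_isqrt[None, :]
    # Step 3: iterate F(t+1) = alpha * S F(t) + (1 - alpha) * Y.
    Y = np.zeros((n, c))
    labeled = np.where(y >= 0)[0]
    Y[labeled, y[labeled]] = 1.0
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    # Step 4: label each point by the largest entry of its row of F.
    return F.argmax(axis=1)
```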
The Consistency Algorithm: Convergence

We show that the algorithm converges to F* = (1 - α)(I - αS)^(-1) Y.
Without loss of generality, let F(0) = Y. Then F(t+1) = αSF(t) + (1 - α)Y, and therefore
F(t) = (αS)^t Y + (1 - α) Σ_{i=0}^{t-1} (αS)^i Y.
The Consistency Algorithm: Convergence

F(t) = (αS)^t Y + (1 - α) Σ_{i=0}^{t-1} (αS)^i Y.
Since 0 < α < 1 and the eigenvalues of S are in [-1, 1]:
lim_{t→∞} (αS)^t = 0 and lim_{t→∞} Σ_{i=0}^{t-1} (αS)^i = (I - αS)^(-1).
Hence F* = lim_{t→∞} F(t) = (1 - α)(I - αS)^(-1) Y.
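Because of this closed form, the loop can be replaced by a single linear solve. A quick numerical check that the two agree (random data, all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, alpha = 50, 3, 0.9
A = rng.random((n, n))
W = (A + A.T) / 2                              # symmetric random affinities
np.fill_diagonal(W, 0.0)
d = 1.0 / np.sqrt(W.sum(axis=1))
S = W * d[:, None] * d[None, :]                # eigenvalues of S lie in [-1, 1]
Y = np.eye(n, c)                               # first c points labeled, one per class

F = Y.copy()
for _ in range(2000):                          # F(t+1) = aSF(t) + (1-a)Y
    F = alpha * (S @ F) + (1 - alpha) * Y
F_star = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)
print(np.abs(F - F_star).max())                # ~1e-16: the iteration reached its limit
```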
Regularization Framework

Define a cost function Q(F) for the iteration stage.
Smoothness constraint: a good classifying function should not change too much between nearby points.
The classifying function is the minimizer of this cost, F* = argmin_F Q(F).
Regularization Framework

Fitting constraint: a good classifying function should not change too much from the initial label assignment.
μ > 0: trade-off between the two constraints.
Regularization Framework
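The cost function and its minimizer, written out as in the cited Zhou et al. paper (this combines the two constraints from the previous slides):

```latex
Q(F) = \frac{1}{2}\left(
   \sum_{i,j=1}^{n} W_{ij}
   \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2
   + \mu \sum_{i=1}^{n} \left\| F_i - Y_i \right\|^2 \right)

% Setting the derivative to zero at the minimizer F^*:
\frac{\partial Q}{\partial F}\Big|_{F=F^*} = F^* - S F^* + \mu (F^* - Y) = 0

% Solving for F^*, with \alpha = 1/(1+\mu), so that 1-\alpha = \mu/(1+\mu):
F^* = \frac{\mu}{1+\mu}\left(I - \frac{1}{1+\mu} S\right)^{-1} Y
    = (1-\alpha)\,(I - \alpha S)^{-1}\, Y
```

So the regularized minimizer is exactly the limit F* that the iteration converges to.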
Results: Two-Moon Toy Problem

[Figures omitted: a sequence of plots of the two-moon toy data set.]
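A way to reproduce this kind of experiment, using scikit-learn's make_moons as a stand-in for the paper's data and the closed form from the convergence slides (σ here is a hand-picked guess):

```python
import numpy as np
from sklearn.datasets import make_moons

X, y_true = make_moons(n_samples=200, noise=0.1, random_state=0)
y = np.full(len(X), -1)                        # -1 marks unlabeled points
y[np.where(y_true == 0)[0][0]] = 0             # one labeled point per moon
y[np.where(y_true == 1)[0][0]] = 1

sigma, alpha = 0.15, 0.99
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * sigma ** 2))             # step 1: affinity matrix
np.fill_diagonal(W, 0.0)
d = 1.0 / np.sqrt(W.sum(axis=1))
S = W * d[:, None] * d[None, :]                # step 2: normalize
Y = np.zeros((len(X), 2))
Y[y >= 0, y[y >= 0]] = 1.0
F = (1 - alpha) * np.linalg.solve(np.eye(len(X)) - alpha * S, Y)
print("accuracy:", (F.argmax(axis=1) == y_true).mean())
```

With a suitable σ the two moons separate cleanly from a single labeled point per class.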
Results: Digit Recognition

Run the algorithm on the USPS database with digits 1, 2, 3 and 4.
Class sizes are 1269, 929, 824 and 852 (3874 in total).
The test errors are averaged over 30 trials.
The samples were chosen so that they contain at least one labeled point of each class.
Results: Digit Recognition

Mean test error % (std) by number of labeled points:

Method                    4              8              12             16
k-nearest neighbor        25.45 (1.30)   20.23 (1.17)   15.52 (0.87)   13.87 (0.70)
Normal SVM                25.89 (1.30)   16.89 (1.14)   11.15 (0.53)   10.10 (0.52)
Cluster Kernel            16.24 (1.40)   10.94 (1.03)   7.52 (0.29)    7.20 (0.33)
LP (label propagation)    62.24 (1.60)   59.12 (2.07)   52.54 (2.15)   46.78 (2.07)
Consistency algorithm     8.02 (1.51)    4.11 (0.42)    2.76 (0.15)    2.73 (0.27)
Results: Digit Recognition

Mean test error % (std) by number of labeled points:

Method                    20             24             28             32
k-nearest neighbor        12.08 (0.51)   10.93 (0.50)   10.03 (0.37)   9.30 (0.68)
Normal SVM                8.71 (0.35)    7.91 (0.29)    7.52 (0.38)    7.37 (0.31)
Cluster Kernel            6.55 (0.22)    6.13 (0.20)    6.04 (0.25)    5.97 (0.24)
LP (label propagation)    44.88 (2.33)   39.28 (2.12)   35.18 (1.55)   30.67 (1.67)
Consistency algorithm     2.19 (0.11)    2.04 (0.10)    1.97 (0.11)    1.79 (0.27)
Results: Digit Recognition

[Figure omitted: results averaged over 100 trials.]
Results: Text Classification

Use the Mac and Windows subsets of the 20-newsgroups data set.
There are 961 and 985 examples in the two classes, with 7511 dimensions.
Results: Text Classification

Mean test error % (std) by number of labeled points:

Method                    2              4              8              16
k-nearest neighbor        37.40 (0.0)    35.97 (0.50)   32.80 (0.62)   29.92 (0.68)
Normal SVM                50.17 (0.06)   35.66 (1.12)   32.17 (1.43)   28.92 (1.39)
Cluster Kernel            45.28 (1.47)   23.89 (2.06)   15.13 (0.99)   11.04 (0.45)
LP (label propagation)    49.11 (1.10)   49.08 (0.11)   49.29 (0.08)   49.04 (0.07)
Consistency algorithm     24.05 (1.96)   20.83 (1.60)   15.56 (0.79)   13.52 (0.64)
Results: Text Classification 2

Use the topic "rec", which contains the autos, motorcycles, baseball and hockey subsets.
Preprocessing (a rough sketch follows this list):
Remove endings from all words (e.g. -ing, -ed, ...).
Drop words on the SMART stop list (the, of, ...).
Ignore the headers.
Use only words that appear in 5 or more articles.
Database size: 3970 document vectors in an 8014-dimensional space.
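A rough scikit-learn equivalent of this preprocessing (the suffix stripping is a crude stand-in for a real stemmer, and the built-in "english" list stands in for the SMART stop list; both are my simplifications):

```python
from sklearn.feature_extraction.text import CountVectorizer

def strip_endings(doc):
    """Crude stand-in for stemming: drop common English endings."""
    words = []
    for w in doc.lower().split():
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        words.append(w)
    return " ".join(words)

vectorizer = CountVectorizer(
    preprocessor=strip_endings,   # remove word endings
    stop_words="english",         # stand-in for the SMART stop list
    min_df=5,                     # keep only words appearing in 5+ articles
)
# With the article bodies (headers already stripped) in a list `docs`:
# X = vectorizer.fit_transform(docs)   # document vectors, one row per article
```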
Results: Text Classification 2

[Figure omitted.]
References

Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, Bernhard Schölkopf. Learning with Local and Global Consistency. http://www.kyb.mpg.de/publications/pdfs/pdf2333.pdf
Xiaojin Zhu, Zoubin Ghahramani, John Lafferty. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. http://www.hpl.hp.com/conferences/icml2003/papers/132.pdf
The End