
Special Topic on Image Retrieval


Page 1: Special Topic on Image Retrieval

1

Special Topic on Image Retrieval

2014-02

Slides download: http://staff.ustc.edu.cn/~zhwg/SpecialTopic_IR.html

Page 2: Special Topic on Image Retrieval

2

Text-based Image Retrieval

Page 3: Special Topic on Image Retrieval

3

Page 4: Special Topic on Image Retrieval

4

Retrieval Model Overview

• Occurrence based models
  – Boolean Retrieval Model
  – Vector Space Model
• Probability models
  – BM25
  – Language Model
• Hyperlink-based models
  – PageRank
• Machine learning based models
  – Learning to Rank

Page 5: Special Topic on Image Retrieval

5

Boolean Retrieval

• The Boolean retrieval model lets the user ask a query that is a Boolean expression:
  – Boolean queries use AND, OR and NOT to join query terms
  – Views each document as a set of words
  – Is precise: a document matches the condition or not
  – Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades
• Many search systems you still use are Boolean:
  – Email, library catalog, Mac OS X Spotlight

Page 6: Special Topic on Image Retrieval

6

Page 7: Special Topic on Image Retrieval

7

Boolean Retrieval

• Rather than linearly scanning the texts for each query, index the documents
  – Term-document incidence matrix

Page 8: Special Topic on Image Retrieval

8

Boolean Retrieval: Term-document incidence

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0

Caesar 1 1 0 1 1 1

Calpurnia 0 1 0 0 0 0

Cleopatra 1 0 0 0 0 0

mercy 1 0 1 1 1 1

worser 1 0 1 1 1 0

1 if doc contains word, 0 otherwise

Query: Brutus AND Caesar BUT NOT Calpurnia

http://www.rhymezone.com/shakespeare/

Page 9: Special Topic on Image Retrieval

Boolean Retrieval: Incidence vectors

• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
• 110100 AND 110111 AND 101111 = 100100.
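A minimal Python sketch (not part of the slides) of this bitwise-AND query over the incidence vectors above; the document names and vectors are taken from the matrix on the previous slide.

    # Term-document incidence vectors from the matrix above.
    incidence = {
        "Brutus":    [1, 1, 0, 1, 0, 0],
        "Caesar":    [1, 1, 0, 1, 1, 1],
        "Calpurnia": [0, 1, 0, 0, 0, 0],
    }
    docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
            "Hamlet", "Othello", "Macbeth"]

    # Brutus AND Caesar AND NOT Calpurnia: bitwise AND with the complement.
    answer = [b & c & (1 - p) for b, c, p in zip(incidence["Brutus"],
                                                 incidence["Caesar"],
                                                 incidence["Calpurnia"])]
    print([d for d, hit in zip(docs, answer) if hit])
    # ['Antony and Cleopatra', 'Hamlet']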

9

Page 10: Special Topic on Image Retrieval

10

Answers to query

• Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.

• Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Page 11: Special Topic on Image Retrieval

11

Bigger collections

• Consider N = 1 million documents, each with about 1000 words.

• Avg 6 bytes/word including spaces/punctuation – 6GB of data in the documents.

• Say there are M = 500K distinct terms among these.

Page 12: Special Topic on Image Retrieval

12

Can’t build the matrix

• 500K x 1M matrix has half-a-trillion 0's and 1's. Why?
• But it has no more than one billion 1's.
  – The matrix is extremely sparse.
• What's a better representation?
  – We only record the 1 positions.

Page 13: Special Topic on Image Retrieval

Inverted index

• For each term t, we must store a list of all documents that contain t.
  – Identify each by a docID, a document serial number

13

Dictionary → Postings (postings sorted by docID)

  Brutus    → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
  Caesar    → 1 → 2 → 4 → 5 → 6 → 16 → 57 → 132
  Calpurnia → 2 → 31 → 54 → 101

Page 14: Special Topic on Image Retrieval

14

Inverted index construction

Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:
  friend     → 2 → 4
  roman      → 1 → 2
  countryman → 13 → 16

Page 15: Special Topic on Image Retrieval

15

Indexer steps: Token sequence

• Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Page 16: Special Topic on Image Retrieval

16

Indexer steps: Sort

• Sort by terms
  – And then by docID

Core indexing step

Page 17: Special Topic on Image Retrieval

17

Indexer steps: Dictionary & Postings

• Multiple term entries in a single document are merged.

• Split into Dictionary and Postings

• Doc. frequency information is added.

Why frequency? Will discuss later.
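The steps above can be sketched in a few lines of Python (an illustrative sketch, not the slides' code): collect (term, docID) pairs, sort them, merge duplicates into postings lists, and keep document frequencies alongside.

    from collections import defaultdict

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
        2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
    }

    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(token.lower(), doc_id)
             for doc_id, text in docs.items()
             for token in text.split()]

    # Step 2: sort by term, then by docID.
    pairs.sort()

    # Step 3: merge duplicates into postings lists; df = length of each list.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    df = {term: len(plist) for term, plist in postings.items()}

    print(df["brutus"], postings["brutus"])     # 2 [1, 2]
    print(df["capitol"], postings["capitol"])   # 1 [1]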

Page 18: Special Topic on Image Retrieval

18

Where do we pay in storage?

• Dictionary: terms and counts
• Pointers from each dictionary entry to its postings list
• Postings: lists of docIDs

Page 19: Special Topic on Image Retrieval

19

Query processing: AND

• Consider processing the query: Brutus AND Caesar
  – Locate Brutus in the Dictionary; retrieve its postings.
  – Locate Caesar in the Dictionary; retrieve its postings.
  – "Merge" the two postings:

  Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34

Page 20: Special Topic on Image Retrieval

20

Intersecting two postings lists (a "merge" algorithm)

Page 21: Special Topic on Image Retrieval

21

The merge

• Walk through the two postings simultaneously, in time linear in the total number of postings entries

  Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
  Result: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
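A sketch of this merge in Python (own code, assuming both postings lists are already sorted by docID):

    def intersect(p1, p2):
        """Linear-time intersection of two sorted postings lists."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [2, 4, 8, 16, 32, 64, 128]
    caesar = [1, 2, 3, 5, 8, 13, 21, 34]
    print(intersect(brutus, caesar))   # [2, 8]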

Page 22: Special Topic on Image Retrieval

22

Ranked retrieval

• Thus far, our queries have all been Boolean.
  – Documents either match or don't.
• Good for expert users with precise understanding of their needs and the collection.
  – Also good for applications: applications can easily consume 1000s of results.
• Not good for the majority of users.
  – Most users incapable of writing Boolean queries (or they are, but they think it's too much work).
  – Most users don't want to wade through 1000s of results.
    • This is particularly true of web search.

Page 23: Special Topic on Image Retrieval

23

Problem with Boolean search: feast or famine

• Boolean queries often result in either too few (=0) or too many (1000s) results.

• Query 1: “standard user dlink 650” → 200,000 hits

• Query 2: “standard user dlink 650 no card found”: 0 hits

• It takes a lot of skill to come up with a query that produces a manageable number of hits.– AND gives too few; OR gives too many

Page 24: Special Topic on Image Retrieval

24

Ranked retrieval models

• Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query

• Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language

• In principle, there are two separate choices here, but in practice, ranked retrieval has normally been associated with free text queries and vice versa

Page 25: Special Topic on Image Retrieval

25

Feast or famine: not a problem in ranked retrieval

• When a system produces a ranked result set, large result sets are not an issue
  – Indeed, the size of the result set is not an issue
  – We just show the top k results
  – We don't overwhelm the user

– Premise: the ranking algorithm works

Page 26: Special Topic on Image Retrieval

26

Scoring as the basis of ranked retrieval

• We wish to return in order the documents most likely to be useful to the searcher

• How can we rank-order the documents in the collection with respect to a query?

• Assign a score – say in [0, 1] – to each document

• This score measures how well document and query “match”.

Page 27: Special Topic on Image Retrieval

27

Query-document matching scores

• We need a way of assigning a score to a query/document pair

• Let's start with a one-term query
• If the query term does not occur in the document: the score should be 0
• The more frequent the query term in the document, the higher the score (should be)
• We will look at a number of alternatives for this.

Page 28: Special Topic on Image Retrieval

28

Binary term-document incidence matrix

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0

Caesar 1 1 0 1 1 1

Calpurnia 0 1 0 0 0 0

Cleopatra 1 0 0 0 0 0

mercy 1 0 1 1 1 1

worser 1 0 1 1 1 0

Each document is represented by a binary vector ∈ {0,1}^|V|

Page 29: Special Topic on Image Retrieval

29

Term-document count matrices

• Consider the number of occurrences of a term in a document:
  – Each document is a count vector in ℕ^|V|: a column below

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 157 73 0 0 0 0

Brutus 4 157 0 1 0 0

Caesar 232 227 0 2 1 1

Calpurnia 0 10 0 0 0 0

Cleopatra 57 0 0 0 0 0

mercy 2 0 3 5 5 1

worser 2 0 1 1 1 0

Page 30: Special Topic on Image Retrieval

30

Bag of words model

• Vector representation doesn’t consider the ordering of words in a document

• John is quicker than Mary and Mary is quicker than John have the same vectors

• This is called the bag of words model.

Page 31: Special Topic on Image Retrieval

31

Term frequency tf

• The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d.

• We want to use tf when computing query-document match scores. But how?

• Raw term frequency is not what we want:
  – A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  – But not 10 times more relevant.

• Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR

Page 32: Special Topic on Image Retrieval

32

Log-frequency weighting

• The log frequency weight of term t in d is

  $$w_{t,d} = \begin{cases} 1 + \log_{10}\mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$$

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:

  $$\mathrm{score}(q,d) = \sum_{t \in q \cap d} \left(1 + \log_{10}\mathrm{tf}_{t,d}\right)$$

• The score is 0 if none of the query terms is present in the document.

Page 33: Special Topic on Image Retrieval

33

Document frequency

• Rare terms are more informative than frequent terms
  – Recall stop words
• Consider a term in the query that is rare in the collection (e.g., arachnocentric)
• A document containing this term is very likely to be relevant to the query arachnocentric
• We want a high weight for rare terms like arachnocentric.
  – We will use document frequency (df) to capture this.

Page 34: Special Topic on Image Retrieval

34

Document frequency, continued

• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the collection (e.g., high, increase, line)
• A document containing such a term is more likely to be relevant than a document that doesn't
• But it's not a sure indicator of relevance.
• → For frequent terms, we want high positive weights for words like high, increase, and line
• But lower weights than for rare terms.
• We will use document frequency (df) to capture this.

Page 35: Special Topic on Image Retrieval

35

idf weight

• df_t is the document frequency of t: the number of documents that contain t
  – df_t is an inverse measure of the informativeness of t
  – df_t ≤ N
• We define the idf (inverse document frequency) of t by

  $$\mathrm{idf}_t = \log_{10}\left(N/\mathrm{df}_t\right)$$

  – We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.

Page 36: Special Topic on Image Retrieval

36

idf example, suppose N = 1 million

term dft idft

calpurnia 1 6

animal 100 4

sunday 1,000 3

fly 10,000 2

under 100,000 1

the 1,000,000 0

There is one idf value for each term t in a collection: idf_t = log10(N/df_t).

Page 37: Special Topic on Image Retrieval

37

tf-idf weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight:

  $$w_{t,d} = \left(1 + \log_{10}\mathrm{tf}_{t,d}\right) \times \log_{10}\left(N/\mathrm{df}_t\right)$$

• Best known weighting scheme in information retrieval
  – Note: the "-" in tf-idf is a hyphen, not a minus sign!
  – Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
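As a quick illustration (a sketch using the formula above, not code from the slides), the weight can be computed directly; the numbers reuse N = 1 million from the idf example.

    import math

    def tf_idf_weight(tf, df, N=1_000_000):
        """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
        if tf == 0 or df == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    print(tf_idf_weight(tf=10, df=100))         # 2 * 4 = 8.0
    print(tf_idf_weight(tf=1, df=1_000_000))    # 0.0: a term in every document gets idf 0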

Page 38: Special Topic on Image Retrieval

38

Score for a document given a query

• There are many variants
  – How "tf" is computed (with/without logs)
  – Whether the terms in the query are also weighted
  – …

  $$\mathrm{Score}(q,d) = \sum_{t \in q \cap d} \text{tf-idf}_{t,d}$$

Page 39: Special Topic on Image Retrieval

39

Effect of idf on ranking

• Does idf have an effect on ranking for one-term queries, like
  – iPhone
• idf has no effect on ranking one-term queries
  – idf affects the ranking of documents for queries with at least two terms
  – For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Page 40: Special Topic on Image Retrieval

40

Collection vs. Document frequency

• The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.

• Example:

• Which word is a better search term (and should get a higher weight)?

Word Collection frequency Document frequency

insurance 10440 3997

try 10422 8760

Page 41: Special Topic on Image Retrieval

41

Binary → count → weight matrix

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 5.25 3.18 0 0 0 0.35

Brutus 1.21 6.1 0 1 0 0

Caesar 8.59 2.54 0 1.51 0.25 0

Calpurnia 0 1.54 0 0 0 0

Cleopatra 2.85 0 0 0 0 0

mercy 1.51 0 1.9 0.12 5.25 0.88

worser 1.37 0 0.11 4.15 0.25 1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R|V|

Page 42: Special Topic on Image Retrieval

42

Vector Space Model

• Documents as vectors
  – So we have a |V|-dimensional vector space
  – Terms are axes of the space
  – Documents are points or vectors in this space
• Queries as vectors
  – Do the same for queries: represent them as vectors in the space
• Rank documents according to their proximity to the query in this space
  – proximity = similarity of vectors
  – proximity ≈ inverse of distance

Page 43: Special Topic on Image Retrieval

43

cosine(query,document)

$$\cos(\vec q, \vec d) = \frac{\vec q \cdot \vec d}{|\vec q|\,|\vec d|} = \frac{\vec q}{|\vec q|} \cdot \frac{\vec d}{|\vec d|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\ \sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

(dot product of the unit vectors)

q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document.

cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

Page 44: Special Topic on Image Retrieval

44

Cosine similarity illustrated

Page 45: Special Topic on Image Retrieval

45

Cosine similarity amongst 3 documents

term SaS PaP WH

affection 115 58 20

jealous 10 7 11

gossip 2 0 6

wuthering 0 0 38

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts)

Note: To simplify this example, we don’t do idf weighting.

Page 46: Special Topic on Image Retrieval

46

3 documents example contd.

Log frequency weighting

term SaS PaP WH

affection 3.06 2.76 2.30

jealous 2.00 1.85 2.04

gossip 1.30 0 1.78

wuthering 0 0 2.58

After length normalization

term SaS PaP WH

affection 0.789 0.832 0.524

jealous 0.515 0.555 0.465

gossip 0.335 0 0.405

wuthering 0 0 0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
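The whole example can be reproduced with a short Python sketch (own code): log-frequency weighting, length normalization, then dot products; as noted above, idf is omitted.

    import math

    counts = {
        "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
        "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
        "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
    }

    def log_weight(c):
        return 1 + math.log10(c) if c > 0 else 0.0

    def normalize(vec):
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()}

    vectors = {doc: normalize({t: log_weight(c) for t, c in tf.items()})
               for doc, tf in counts.items()}

    def cos(a, b):
        return sum(vectors[a][t] * vectors[b][t] for t in vectors[a])

    print(round(cos("SaS", "PaP"), 2))   # 0.94
    print(round(cos("SaS", "WH"), 2))    # 0.79
    print(round(cos("PaP", "WH"), 2))    # 0.69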

Page 47: Special Topic on Image Retrieval

48

Summary – vector space ranking

• Represent the query as a weighted tf-idf vector
• Represent each document as a weighted tf-idf vector
• Compute the cosine similarity score for the query vector and each document vector
• Rank documents with respect to the query by score
• Return the top K to the user

Page 48: Special Topic on Image Retrieval

49

Probability Models

• Retrieval as classification

Page 49: Special Topic on Image Retrieval

50

Bayes Classifier

• Bayes Decision Rule
  – A document D is relevant if P(R|D) > P(NR|D)
• Estimating probabilities
  – use Bayes' Rule: P(R|D) = P(D|R)P(R)/P(D)
  – classify a document as relevant if the likelihood ratio P(D|R)/P(D|NR) > P(NR)/P(R)

Page 50: Special Topic on Image Retrieval

51

Estimating P(D|R)

• Assume independence

• Binary independence model
  – document represented by a vector of binary features indicating term occurrence (or non-occurrence)
  – p_i is the probability that term i occurs (i.e., has value 1) in a relevant document; s_i is the probability of occurrence in a non-relevant document

Page 51: Special Topic on Image Retrieval

52

Binary Independence Model

Page 52: Special Topic on Image Retrieval

53

Binary Independence Model

• Scoring function is

• Query provides information about relevant documents

• If we assume p_i constant and s_i approximated by the entire collection, we get an idf-like weight

Page 53: Special Topic on Image Retrieval

54

Contingency Table

Gives scoring function:

Page 54: Special Topic on Image Retrieval

55

BM25

• Popular and effective ranking algorithm based on the binary independence model
  – adds document and query term weights
  – k1, k2 and K are parameters whose values are set empirically
  – dl is the document length
  – Typical value for k1 is 1.2, k2 varies from 0 to 1000, b = 0.75

Page 55: Special Topic on Image Retrieval

56

BM25 Example

• Query with two terms, "president lincoln" (qf = 1)
• No relevance information (r and R are zero)
• N = 500,000 documents
• "president" occurs in 40,000 documents (n1 = 40,000)
• "lincoln" occurs in 300 documents (n2 = 300)
• "president" occurs 15 times in the document (f1 = 15)
• "lincoln" occurs 25 times (f2 = 25)
• document length is 90% of the average length: dl/avdl = 0.9
• k1 = 1.2, b = 0.75, and k2 = 100
• K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
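The scoring formula itself appears only as an image on the slide; the sketch below assumes the BM25 variant with relevance counts r and R from Croft, Metzler and Strohman (which matches the parameters listed above) and plugs in this example's numbers.

    import math

    def bm25(N, terms, k1=1.2, k2=100, b=0.75, dl_avdl=0.9, R=0):
        """terms: list of (n_i = df, f_i = tf in doc, qf_i = tf in query, r_i = rel count)."""
        K = k1 * ((1 - b) + b * dl_avdl)          # K = 1.2 * (0.25 + 0.75 * 0.9) = 1.11
        score = 0.0
        for n_i, f_i, qf_i, r_i in terms:
            idf_part = math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                                ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))
            tf_part = ((k1 + 1) * f_i) / (K + f_i)
            qtf_part = ((k2 + 1) * qf_i) / (k2 + qf_i)
            score += idf_part * tf_part * qtf_part
        return score

    # "president lincoln": (n1 = 40000, f1 = 15, qf = 1), (n2 = 300, f2 = 25, qf = 1)
    print(round(bm25(N=500_000, terms=[(40_000, 15, 1, 0), (300, 25, 1, 0)]), 2))
    # ≈ 20.6; "lincoln", the much rarer term, contributes most of the score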

Page 56: Special Topic on Image Retrieval

57

BM25 Example

Page 57: Special Topic on Image Retrieval

58

BM25 Example

• Effect of term frequencies

Page 58: Special Topic on Image Retrieval

59

Retrieval Model Overview

• Occurrence based models
  – Boolean Retrieval
  – Vector Space Model
• Probability models
  – BM25
  – Language Model
• Hyperlink-based models
  – PageRank
• Machine learning based models
  – Learning to Rank

Page 59: Special Topic on Image Retrieval

60

PageRank

• Links can be viewed as information about the popularity of a web page

Larry Page and Sergey Brin

Page 60: Special Topic on Image Retrieval

61

PageRank

• PageRank (PR) of page C = PR(A)/2 + PR(B)/1
• More generally,

  $$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$$

  – where B_u is the set of pages that point to u, and L_v is the number of outgoing links from page v.

Page 61: Special Topic on Image Retrieval

62

Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

Three pages: Yahoo (y), Amazon (a), M'soft (m)

Transition matrix M (columns y, a, m):

      y    a    m
  y [ 1/2  1/2  0 ]
  a [ 1/2   0   1 ]
  m [  0   1/2  0 ]

Flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2

In matrix form: r = M r

Page 62: Special Topic on Image Retrieval

63

Power Iteration

• Simple iterative scheme (aka relaxation)
• Suppose there are N web pages
• Initialize: r_0 = [1, …, 1]^T
• Iterate: r_{k+1} = M r_k
• Stop when |r_{k+1} − r_k|_1 < ε
  – |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm
  – Can use any other vector norm, e.g., Euclidean
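A minimal Python sketch (own code) of this scheme, run on the three-page example from the previous slide:

    def power_iteration(M, eps=1e-9, max_iter=1000):
        n = len(M)
        r = [1.0] * n                                    # r_0 = [1, ..., 1]^T
        for _ in range(max_iter):
            r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
            if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:   # L1 stopping test
                return r_next
            r = r_next
        return r

    #        y    a    m
    M = [[0.5, 0.5, 0.0],    # y = y/2 + a/2
         [0.5, 0.0, 1.0],    # a = y/2 + m
         [0.0, 0.5, 0.0]]    # m = a/2
    print(power_iteration(M))   # converges to about [6/5, 6/5, 3/5]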

Page 63: Special Topic on Image Retrieval

64

Power Iteration Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

Yahoo (y), Amazon (a), M'soft (m), with the transition matrix M from before (columns y, a, m):

      y    a    m
  y [ 1/2  1/2  0 ]
  a [ 1/2   0   1 ]
  m [  0   1/2  0 ]

Iterating r = M r:

  y:  1    1    5/4   9/8  …  6/5
  a:  1   3/2    1   11/8  …  6/5
  m:  1   1/2   3/4   1/2  …  3/5

Page 64: Special Topic on Image Retrieval

Matrix Notation

• r = B r
• PageRank is embedded in the eigenvector of B (normalized adjacency matrix) associated with the eigenvalue 1.
• Conditional probability:

  $$b_{uv} = \begin{cases} \dfrac{a_{uv}}{\sum_{w} a_{uw}}, & \text{if } a_{uv} \neq 0 \\[4pt] 0, & \text{otherwise} \end{cases}$$

66

Page 65: Special Topic on Image Retrieval

67

Spider traps

• A group of pages is a spider trap if there are no links from within the group to outside the group
  – The random surfer gets trapped

• Spider traps violate the conditions needed for the random walk theorem

Page 66: Special Topic on Image Retrieval

68

Microsoft becomes a spider trap (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

Yahoo (y), Amazon (a), M'soft (m); M'soft now links only to itself:

      y    a    m
  y [ 1/2  1/2  0 ]
  a [ 1/2   0   0 ]
  m [  0   1/2  1 ]

Iterating r = M r:

  y:  1    1    3/4   5/8  …  0
  a:  1   1/2   1/2   3/8  …  0
  m:  1   3/2   7/4    2   …  3

Page 67: Special Topic on Image Retrieval

69

PageRank with Random Walk

• Taking the random page jump into account, 1/3 chance of going to any page
  – PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1)
• More generally,

  $$PR(u) = \frac{\lambda}{N} + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$$

  – where N is the number of pages, λ typically 0.15
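The power-iteration sketch from before needs only a small change to add the random jump (own code; here λ = 0.2, so 1 − λ = 0.8, and each page receives λ times an equal share of the total rank):

    def pagerank(M, lam=0.2, iters=100):
        n = len(M)
        r = [1.0] * n
        for _ in range(iters):
            # teleport term + (1 - lam) times the usual link contribution
            r = [lam * sum(r) / n +
                 (1 - lam) * sum(M[i][j] * r[j] for j in range(n))
                 for i in range(n)]
        return r

    M_trap = [[0.5, 0.5, 0.0],   # the spider-trap example: M'soft links only to itself
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]]
    print(pagerank(M_trap))      # ≈ [7/11, 5/11, 21/11] instead of [0, 0, 3]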

Page 68: Special Topic on Image Retrieval

70

Previous example with 1 − λ = 0.8 (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

Yahoo (y), Amazon (a), M'soft (m), spider-trap graph:

          [ 1/2  1/2  0 ]         [ 1/3  1/3  1/3 ]     [ 7/15  7/15   1/15 ]
  A = 0.8 [ 1/2   0   0 ]  + 0.2  [ 1/3  1/3  1/3 ]  =  [ 7/15  1/15   1/15 ]
          [  0   1/2  1 ]         [ 1/3  1/3  1/3 ]     [ 1/15  7/15  13/15 ]

Iterating r = A r:

  y:  1   1.00   0.84   0.776  …   7/11
  a:  1   0.60   0.60   0.536  …   5/11
  m:  1   1.40   1.56   1.688  …  21/11

Page 69: Special Topic on Image Retrieval

71

Microsoft becomes a dead end (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

          [ 1/2  1/2  0 ]         [ 1/3  1/3  1/3 ]     [ 7/15  7/15  1/15 ]
  A = 0.8 [ 1/2   0   0 ]  + 0.2  [ 1/3  1/3  1/3 ]  =  [ 7/15  1/15  1/15 ]
          [  0   1/2  0 ]         [ 1/3  1/3  1/3 ]     [ 1/15  7/15  1/15 ]

  Non-stochastic! (the M'soft column sums to less than 1)

Iterating r = A r:

  y:  1    1     0.787   0.648  …  0
  a:  1   0.6    0.547   0.430  …  0
  m:  1   0.6    0.387   0.333  …  0

Page 70: Special Topic on Image Retrieval

72

Dead ends

• Pages with no outlinks are "dead ends" for the random surfer
  – Nowhere to go on the next step
• Prune and propagate
  – Preprocess the graph to eliminate dead ends
  – Might require multiple passes
  – Compute PageRank on the reduced graph
  – Approximate values for dead ends by propagating values from the reduced graph

Page 71: Special Topic on Image Retrieval

73

Learning to Rank

• Machine learning is an effective tool
  – To automatically tune parameters.
  – To combine multiple sources of evidence.
  – To avoid over-fitting (by means of regularization, etc.)
• "Learning to Rank"
  – Use machine learning technologies to train the ranking model
  – A hot research topic in recent years.

Page 72: Special Topic on Image Retrieval

74

Typical Framework of Learning to Rank

(Figure) Training data: queries q^(1), …, q^(m), each with documents d_1^(i), …, d_n^(i) labelled with multiple-level relevance ratings (e.g., 4, 2, 1). The Learning System minimizes a loss over this training data to produce a ranking model f(q, d, w). The Ranking System then applies f(q, d, w) to score and rank the documents d_1, d_2, …, d_n of a new test query.

Page 73: Special Topic on Image Retrieval

75

Sample Data from LETOR

(Figure: each line of the LETOR sample shows a relevance label, a document identifier, and its feature values.)

Page 74: Special Topic on Image Retrieval

76

Learning to Rank

• Features
  – tf
  – idf
  – tf-idf
  – Boolean model
  – Vector space model
  – BM25
  – ……

Page 75: Special Topic on Image Retrieval

77

Connection with Conventional Machine Learning

• With certain assumptions, learning to rank can be transformed into a kind of conventional machine learning problem
  – Regression
  – Classification
  – Preference learning

• Ranking SVM

Page 76: Special Topic on Image Retrieval

78

Ranking SVM

• Training data is a set of queries, each with partial rank information r over its documents
  – r is partial rank information
    • if document d_a should be ranked higher than d_b, then (d_a, d_b) ∈ r
  – partial rank information comes from relevance judgments (allows multiple levels of relevance) or click data
    • e.g., d1, d2 and d3 are the documents in the first, second and third rank of the search output; if only d3 is clicked on → (d3, d1) and (d3, d2) will be in the desired ranking for this query

Page 77: Special Topic on Image Retrieval

79

Ranking SVM

• To break the assumption of absolute relevance, for each query, transform its corresponding ranked list into document pairs
• Formalizing ranking as classification over document pairs

Transform: a query q^(i) with documents d_1^(i), d_2^(i), …, d_n^(i) rated, say, 5, 3, …, 2 becomes the set of preference pairs (d_1^(i), d_2^(i)), …, (d_1^(i), d_n^(i)), …, (d_2^(i), d_n^(i)), i.e. 5 > 3, 5 > 2, 3 > 2.
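A small Python sketch (own code, with hypothetical document ids) of this transformation from a rated list to preference pairs:

    from itertools import combinations

    def to_pairs(rated_docs):
        """rated_docs: list of (doc_id, rating). Returns (preferred, less_preferred) pairs."""
        pairs = []
        for (da, ra), (db, rb) in combinations(rated_docs, 2):
            if ra > rb:
                pairs.append((da, db))
            elif rb > ra:
                pairs.append((db, da))
            # equal ratings produce no pair
        return pairs

    print(to_pairs([("d1", 5), ("d2", 3), ("d3", 2)]))
    # [('d1', 'd2'), ('d1', 'd3'), ('d2', 'd3')]   i.e. 5 > 3, 5 > 2, 3 > 2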

Page 78: Special Topic on Image Retrieval

80

Ranking SVM

• Learning a linear ranking function w · d_a
  – where w is a weight vector that is adjusted by learning
  – d_a is the vector representation of the features of document a
  – non-linear functions are also possible
• Weights represent the importance of features
  – learned using training data
  – e.g.,

Page 79: Special Topic on Image Retrieval

81

Ranking SVM

• Learn a weight vector w that satisfies as many of the following conditions as possible:
  – w · d_a > w · d_b for every preference pair (d_a, d_b) ∈ r
• Can be formulated as an optimization problem

Ref: Thorsten Joachims. Optimizing search engines using clickthrough data. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2002.

Page 80: Special Topic on Image Retrieval

82

Ranking SVM

– ξ, known as a slack variable, allows for misclassification of difficult or noisy training examples

– C is a parameter that is used to prevent overfitting

Page 81: Special Topic on Image Retrieval

83

Support Vector Machines

Page 82: Special Topic on Image Retrieval

84

History

• SVMs introduced in COLT-92 by Boser, Guyon & Vapnik. Became rather popular since.

• Theoretically well motivated algorithm: developed from Statistical Learning Theory (Vapnik & Chervonenkis) since the 60s.

• Empirically good performance: successful applications in many fields (bioinformatics, text, image recognition, . . . )

Page 83: Special Topic on Image Retrieval

85

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
  – E.g., perceptron
• Support Vector Machine (SVM) finds an optimal solution.
  – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions

This line represents the decision boundary: ax + by − c = 0

Page 84: Special Topic on Image Retrieval

86

Another intuition

• If you have to place a fat separator between the classes, you have fewer choices, and so the capacity of the model has been decreased

Page 85: Special Topic on Image Retrieval

87

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• A.k.a. large margin classifiers
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem
• Currently widely seen as the best text classification method.

(Figure: the support vectors lie on the maximized margin; a narrower margin gives a worse separator.)

Page 86: Special Topic on Image Retrieval

88

• w: decision hyperplane normal vector
• x_i: data point i
• y_i: class of data point i (+1 or −1)   Note: not 1/0
• Classifier is: f(x_i) = sign(w^T x_i + b)
• Functional margin of x_i is: y_i (w^T x_i + b)
  – But note that we can increase this margin simply by scaling w, b…
• Functional margin of the dataset is twice the minimum functional margin of any point
  – The factor of 2 comes from measuring the whole width of the margin

Maximum Margin: Formalization

Page 87: Special Topic on Image Retrieval

89

Geometric Margin

• Distance from an example to the separator is

  $$r = \frac{y\,(w^T x + b)}{|w|}$$

• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the width of separation between the support vectors of the classes.

Derivation of finding r: the dotted line x′−x is perpendicular to the decision boundary, so parallel to w. The unit vector is w/|w|, so the line is r·w/|w|. Then x′ = x − y·r·w/|w|, and x′ satisfies w^T x′ + b = 0. So w^T(x − y·r·w/|w|) + b = 0. Recall that |w| = sqrt(w^T w). Solving for r gives r = y(w^T x + b)/|w|.

Page 88: Special Topic on Image Retrieval

90

Linear SVM Mathematically: the linearly separable case

• Assuming all the data has a functional margin of at least 1, the following two constraints follow for a training set {(x_i, y_i)}:

  w^T x_i + b ≥ 1   if y_i = 1
  w^T x_i + b ≤ −1  if y_i = −1

• For support vectors, the inequality becomes an equality
• Then, since each example's distance from the hyperplane is r = y(w^T x + b)/|w|
• The margin is: ρ = 2/|w|

Page 89: Special Topic on Image Retrieval

91

Linear Support Vector Machine (SVM)

• Hyperplane: w^T x + b = 0
• Extra scale constraint: min_{i=1,…,n} |w^T x_i + b| = 1
• This implies: w^T(x_a − x_b) = 2, so ρ = 2/‖w‖_2
  (x_a and x_b lie on the margin hyperplanes w^T x_a + b = 1 and w^T x_b + b = −1)

Page 90: Special Topic on Image Retrieval

92

Linear SVMs Mathematically (cont.)

• Then we can formulate the quadratic optimization problem:

  Find w and b such that ρ = 2/‖w‖ is maximized, and for all {(x_i, y_i)}:
  w^T x_i + b ≥ 1 if y_i = 1;  w^T x_i + b ≤ −1 if y_i = −1

• A better formulation (min ‖w‖ = max 1/‖w‖):

  Find w and b such that Φ(w) = ½ w^T w is minimized, and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1

  Equivalently:

  $$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w^T x_i + b) \ge 1,\ \ i = 1,\dots,n$$

Page 91: Special Topic on Image Retrieval

94

Solving the Optimization Problem

• This is now optimizing a quadratic function subject to linear constraints
• Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs)
• The solution involves constructing a dual problem where a Lagrange multiplier α_i is associated with every constraint in the primal problem:

  Primal: Find w and b such that Φ(w) = ½ w^T w is minimized, and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1

  Dual: Find α_1 … α_N such that Q(α) = Σ α_i − ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized, and
  (1) Σ α_i y_i = 0
  (2) α_i ≥ 0 for all α_i

Page 92: Special Topic on Image Retrieval

95

• By using Lagrange theory:

  $$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[\, y_i\,(w^T x_i + b) - 1 \,\right]$$

  $$\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{n} \alpha_i y_i x_i
  \qquad\quad
  \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0$$

  Substituting back:

  $$L(w, b, \alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$

Solving the Optimization Problem

Page 93: Special Topic on Image Retrieval

96

• We get the equivalent primal and dual problems:

  Primal:
  $$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w^T x_i + b) \ge 1,\ \ i = 1,\dots,n$$

  Dual:
  $$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
  \quad \text{s.t.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0,\ \ \alpha_i \ge 0,\ \ i = 1,\dots,n$$

Solving the Optimization Problem

Page 94: Special Topic on Image Retrieval

97

The Optimization Problem Solution

• The solution has the form:

  w = Σ α_i y_i x_i      b = y_k − w^T x_k   for any x_k such that α_k ≠ 0

• Each non-zero α_i indicates that the corresponding x_i is a support vector.
• Then the classifying function will have the form:

  f(x) = Σ α_i y_i x_i^T x + b

• Notice that it relies on an inner product between the test point x and the support vectors x_i – we will return to this later.
• Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all pairs of training points.

Page 95: Special Topic on Image Retrieval

98

Soft Margin Classification

• If the training data is not linearly separable, slack variables ξ_i can be added to allow misclassification of difficult or noisy examples.
• Allow some errors
  – Let some points be moved to where they belong, at a cost ξ_i
• Still, try to minimize training set errors, and to place the hyperplane "far" from each class (large margin)

Page 96: Special Topic on Image Retrieval

99

Soft Margin Classification Mathematically

• The old formulation:

  Find w and b such that Φ(w) = ½ w^T w is minimized, and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1

• The new formulation incorporating slack variables:

  Find w and b such that Φ(w) = ½ w^T w + C Σ ξ_i is minimized, and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1 − ξ_i, with ξ_i ≥ 0 for all i

• Parameter C can be viewed as a way to control overfitting – a regularization term

Page 97: Special Topic on Image Retrieval

100

Soft Margin Classification – Solution

• The dual problem for soft margin classification:

  Find α_1 … α_N such that Q(α) = Σ α_i − ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized, and
  (1) Σ α_i y_i = 0
  (2) 0 ≤ α_i ≤ C for all α_i

• Neither the slack variables ξ_i nor their Lagrange multipliers appear in the dual problem!
• Again, x_i with non-zero α_i will be support vectors.
• Solution to the dual problem is:

  w = Σ α_i y_i x_i
  b = y_k − w^T x_k   for any k s.t. 0 < α_k < C
  f(x) = Σ α_i y_i x_i^T x + b

  (w is not needed explicitly for classification!)

Page 98: Special Topic on Image Retrieval

101

Classification with SVMs

• Given a new point x, we can score its projection onto the hyperplane normal:
  – i.e., compute the score: w^T x + b = Σ α_i y_i x_i^T x + b
  – Can set a confidence threshold t:
    • Score > t: yes
    • Score < −t: no
    • Else: don't know

Page 99: Special Topic on Image Retrieval

102

Linear SVMs: Summary

• The classifier is a separating hyperplane.
• The most "important" training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. have non-zero Lagrange multipliers α_i.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

  Find α_1 … α_N such that Q(α) = Σ α_i − ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized, and
  (1) Σ α_i y_i = 0
  (2) 0 ≤ α_i ≤ C for all α_i

  f(x) = Σ α_i y_i x_i^T x + b

Page 100: Special Topic on Image Retrieval

103

Non-linear SVMs

• Datasets that are linearly separable (with some noise) work out great
• But what are we going to do if the dataset is just too hard?
• How about … mapping the data to a higher-dimensional space?
  (Figure: 1-D data on the x axis that is not linearly separable becomes separable after mapping x → (x, x²).)

Page 101: Special Topic on Image Retrieval

104

Non-linear SVMs: Feature spaces

• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

  Φ: x → ϕ(x)

A mapping of the features can make the classification task easier.

Page 103: Special Topic on Image Retrieval

106

The “Kernel Trick”

• The linear classifier relies on an inner product between vectors K(x_i, x_j) = x_i^T x_j
• If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:

  K(x_i, x_j) = φ(x_i)^T φ(x_j)

• A kernel function is some function that corresponds to an inner product in some expanded feature space.
• Example: 2-dimensional vectors x = [x_1 x_2]^T; let K(x_i, x_j) = (1 + x_i^T x_j)².
  Need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):

  K(x_i, x_j) = (1 + x_i^T x_j)² = (1 + x_{i1} x_{j1} + x_{i2} x_{j2})²
  = 1 + x_{i1}² x_{j1}² + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}² x_{j2}² + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
  = [1  x_{i1}²  √2 x_{i1} x_{i2}  x_{i2}²  √2 x_{i1}  √2 x_{i2}]^T [1  x_{j1}²  √2 x_{j1} x_{j2}  x_{j2}²  √2 x_{j1}  √2 x_{j2}]
  = φ(x_i)^T φ(x_j),   where φ(x) = [1  x_1²  √2 x_1 x_2  x_2²  √2 x_1  √2 x_2]^T
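A quick numerical check of this identity in Python (own sketch):

    import math

    def K(x, z):
        return (1 + x[0] * z[0] + x[1] * z[1]) ** 2

    def phi(x):
        s = math.sqrt(2)
        return [1, x[0] ** 2, s * x[0] * x[1], x[1] ** 2, s * x[0], s * x[1]]

    x, z = [1.0, 2.0], [3.0, -1.0]
    print(K(x, z))                                                 # 4.0
    print(round(sum(a * b for a, b in zip(phi(x), phi(z))), 6))    # 4.0 as well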

Page 104: Special Topic on Image Retrieval

107

Support Vector Machine

• Given samples (x_1, y_1), …, (x_n, y_n), the optimal solution is obtained by solving the following optimization problem:

  $$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w^T x_i + b) \ge 1,\ \ i = 1,\dots,n$$

• By mapping the source features to a high-dimensional space with a non-linear mapping x → φ(x), we can solve the SVM in the new space:

  $$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w^T \varphi(x_i) + b) \ge 1,\ \ i = 1,\dots,n$$

Page 105: Special Topic on Image Retrieval

108

Support Vector Machine

• By using Lagrange theory:

  $$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[\, y_i\,(w^T \varphi(x_i) + b) - 1 \,\right]$$

  $$\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{n} \alpha_i y_i\, \varphi(x_i)
  \qquad\quad
  \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0$$

  Substituting back:

  $$L(w, b, \alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, \varphi(x_i)^T \varphi(x_j)
  = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$

Page 106: Special Topic on Image Retrieval

109

Support vector machine

• In the decision function f(x) = w^T φ(x) + b, with w = Σ_i α_i y_i φ(x_i), we can just replace the dot products with kernels K(x_i, x):

  $$f(x) = \sum_{i=1}^{n} \alpha_i y_i\, \varphi(x_i)^T \varphi(x) + b = \sum_{i=1}^{n} \alpha_i y_i\, K(x_i, x) + b$$

Page 107: Special Topic on Image Retrieval

Kernels

• Why use kernels?
  – Make a non-separable problem separable.
  – Map data into a better representational space
• Common kernels
  – Linear, Polynomial, Gaussian/Radial basis function

110

Page 108: Special Topic on Image Retrieval

Non-Binary Classification with SVMs

• One against rest
  – Train a "class c vs. not class c" SVM for each class
  – If there are K classes, must train K classifiers
• One against one
  – Train a binary classifier for every pair of classes
  – Must train K(K−1)/2 classifiers
  – Computationally expensive for large values of K

111

Page 109: Special Topic on Image Retrieval

112

1-a-r (1-against-rest)

Classes A, B, C → 3 classifiers: A vs B∪C, B vs A∪C, C vs A∪B

For the i-th class, solve:

  $$\min_{w^i, b^i, \xi^i}\ \frac{1}{2}(w^i)^T w^i + C \sum_t \xi^i_t$$
  $$\text{s.t.}\quad (w^i)^T \varphi(x_t) + b^i \ge 1 - \xi^i_t,\ \ \text{if } x_t \text{ is in the } i\text{-th class},$$
  $$\qquad\ \ (w^i)^T \varphi(x_t) + b^i \le -1 + \xi^i_t,\ \ \text{if } x_t \text{ is not in the } i\text{-th class},$$
  $$\qquad\ \ \xi^i_t \ge 0.$$

Decision function: each one-vs-rest classifier gives a score

  $$f^i(x) = \sum_{t=1}^{l} \alpha^i_t y_t\, K(x_t, x) + b^i$$

(its sign, sgn f^i(x), gives the binary decision); assign x to the class whose classifier gives the largest score.

Page 110: Special Topic on Image Retrieval

113

1-a-1 (1-against-1)

Classes A, B, C → 3 classifiers: A vs B, B vs C, C vs A

For each pair of classes (i, j), solve:

  $$\min_{w^{ij}, b^{ij}, \xi^{ij}}\ \frac{1}{2}(w^{ij})^T w^{ij} + C \sum_t \xi^{ij}_t$$
  $$\text{s.t.}\quad (w^{ij})^T \varphi(x_t) + b^{ij} \ge 1 - \xi^{ij}_t,\ \ \text{if } x_t \text{ is in the } i\text{-th class},$$
  $$\qquad\ \ (w^{ij})^T \varphi(x_t) + b^{ij} \le -1 + \xi^{ij}_t,\ \ \text{if } x_t \text{ is in the } j\text{-th class},$$
  $$\qquad\ \ \xi^{ij}_t \ge 0.$$

Decision function: combine the pairwise classifiers by voting.

Page 111: Special Topic on Image Retrieval

SVM Tools

• Solving the SVM optimization problem is not straightforward
• Many good software packages exist
  – Matlab SVM Toolbox
  – LIBSVM
  – SVM-Light
  – …
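For illustration only (this library is not mentioned on the slide): scikit-learn's SVC wraps LIBSVM, so a soft-margin kernel SVM can be trained in a few lines.

    from sklearn.svm import SVC

    X = [[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [2, 3], [3, 2], [3, 3]]
    y = [-1, -1, -1, -1, 1, 1, 1, 1]

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # C is the slack penalty from the soft-margin formulation
    clf.fit(X, y)
    print(clf.predict([[0.5, 0.5], [2.5, 2.5]]))    # expected: [-1  1]
    print(len(clf.support_vectors_))                # number of support vectors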

114

Page 112: Special Topic on Image Retrieval

115

Ranking SVM

– ξ, known as a slack variable, allows for misclassification of difficult or noisy training examples

– C is a parameter that is used to prevent overfitting

Page 113: Special Topic on Image Retrieval

116

Ranking classification

For a query q with documents d_1, …, d_n, the preference pairs (here (d_3, d_2), (d_7, d_2), (d_7, d_4), (d_7, d_5), (d_7, d_6)) become classification constraints on score differences, since

  w^T d_i − w^T d_j = w^T (d_i − d_j)

Ranking SVM optimization:

  $$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{k} \xi_k$$
  $$\text{s.t.}\quad w^T d_i \ge w^T d_j + 1 - \xi_k \ \ \text{for each preference pair } (d_i, d_j) \in r_q,$$
  $$\qquad\ \ \xi_k \ge 0,\ \ k = 1, \dots, |r_q|.$$

Ranking SVM
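A rough end-to-end sketch (own construction, with toy feature vectors) of treating Ranking SVM as classification over document pairs: each preference d_i > d_j yields the difference vector d_i − d_j as a positive example (and the reverse as a negative one), a linear SVM learns w, and documents are then ranked by the score w^T d.

    import numpy as np
    from sklearn.svm import LinearSVC

    docs = np.array([[3.0, 1.0], [2.0, 0.5], [1.0, 2.0], [0.5, 0.2]])  # toy feature vectors
    prefs = [(0, 1), (0, 2), (1, 3), (2, 3)]                           # (better, worse) index pairs

    X, y = [], []
    for i, j in prefs:
        X.append(docs[i] - docs[j]); y.append(+1)
        X.append(docs[j] - docs[i]); y.append(-1)

    svm = LinearSVC(C=1.0)
    svm.fit(np.array(X), np.array(y))
    w = svm.coef_[0]

    scores = docs @ w
    print(np.argsort(-scores))   # document indices from best to worst under the learned w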