
Special Topic on Image Retrieval


Page 1: Special Topic on Image Retrieval

1

Special Topic on Image Retrieval

2014-02

Slides download: http://staff.ustc.edu.cn/~zhwg/SpecialTopic_IR.html

Page 2: Special Topic on Image Retrieval

2

Text-based Image Retrieval

Page 3: Special Topic on Image Retrieval

3

Page 4: Special Topic on Image Retrieval

4

Retrieval Model Overview

• Occurrence based models
  – Boolean Retrieval Model
  – Vector Space Model
• Probability models
  – BM25
  – Language Model
• Hyperlink-based models
  – PageRank
• Machine learning based models
  – Learning to Rank

Page 5: Special Topic on Image Retrieval

5

Boolean Retrieval

• The Boolean retrieval model lets the user ask a query that is a Boolean expression:
  – Boolean queries use AND, OR and NOT to join query terms
  – Views each document as a set of words
  – Is precise: a document matches the condition or not
  – Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades
• Many search systems you still use are Boolean:
  – Email, library catalog, Mac OS X Spotlight

Page 6: Special Topic on Image Retrieval

6

Page 7: Special Topic on Image Retrieval

7

Boolean Retrieval

• Rather than linearly scanning the texts for each query, index the documents
  – Term-document incidence matrix

Page 8: Special Topic on Image Retrieval

8

Boolean Retrieval: Term-document incidence

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0

Caesar 1 1 0 1 1 1

Calpurnia 0 1 0 0 0 0

Cleopatra 1 0 0 0 0 0

mercy 1 0 1 1 1 1

worser 1 0 1 1 1 0

1 if doc contains word, 0 otherwise

Query: Brutus AND Caesar BUT NOT Calpurnia

http://www.rhymezone.com/shakespeare/

Page 9: Special Topic on Image Retrieval

Boolean Retrieval: Incidence vectors

• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
• 110100 AND 110111 AND 101111 = 100100.
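A minimal Python sketch (not part of the slides) of this bitwise-AND query over the incidence vectors above; the document names and vectors are taken from the matrix on the previous slide.

    # Term-document incidence vectors from the matrix above.
    incidence = {
        "Brutus":    [1, 1, 0, 1, 0, 0],
        "Caesar":    [1, 1, 0, 1, 1, 1],
        "Calpurnia": [0, 1, 0, 0, 0, 0],
    }
    docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
            "Hamlet", "Othello", "Macbeth"]

    # Brutus AND Caesar AND NOT Calpurnia: bitwise AND with the complement.
    answer = [b & c & (1 - p) for b, c, p in zip(incidence["Brutus"],
                                                 incidence["Caesar"],
                                                 incidence["Calpurnia"])]
    print([d for d, hit in zip(docs, answer) if hit])
    # ['Antony and Cleopatra', 'Hamlet']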

9

Page 10: Special Topic on Image Retrieval

10

Answers to query

• Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.

• Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Page 11: Special Topic on Image Retrieval

11

Bigger collections

• Consider N = 1 million documents, each with about 1000 words.

• Avg 6 bytes/word including spaces/punctuation – 6GB of data in the documents.

• Say there are M = 500K distinct terms among these.

Page 12: Special Topic on Image Retrieval

12

Can’t build the matrix

• 500K x 1M matrix has half-a-trillion 0's and 1's. Why?
• But it has no more than one billion 1's.
  – The matrix is extremely sparse.
• What's a better representation?
  – We only record the 1 positions.

Page 13: Special Topic on Image Retrieval

Inverted index

• For each term t, we must store a list of all documents that contain t.
  – Identify each by a docID, a document serial number

13

Dictionary → Postings (postings sorted by docID)

  Brutus    → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
  Caesar    → 1 → 2 → 4 → 5 → 6 → 16 → 57 → 132
  Calpurnia → 2 → 31 → 54 → 101

Page 14: Special Topic on Image Retrieval

14

Inverted index construction

Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:
  friend     → 2 → 4
  roman      → 1 → 2
  countryman → 13 → 16

Page 15: Special Topic on Image Retrieval

15

Indexer steps: Token sequence

• Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Page 16: Special Topic on Image Retrieval

16

Indexer steps: Sort

• Sort by terms
  – And then by docID

Core indexing step

Page 17: Special Topic on Image Retrieval

17

Indexer steps: Dictionary & Postings

• Multiple term entries in a single document are merged.

• Split into Dictionary and Postings

• Doc. frequency information is added.

Why frequency? Will discuss later.
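The steps above can be sketched in a few lines of Python (an illustrative sketch, not the slides' code): collect (term, docID) pairs, sort them, merge duplicates into postings lists, and keep document frequencies alongside.

    from collections import defaultdict

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
        2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
    }

    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(token.lower(), doc_id)
             for doc_id, text in docs.items()
             for token in text.split()]

    # Step 2: sort by term, then by docID.
    pairs.sort()

    # Step 3: merge duplicates into postings lists; df = length of each list.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    df = {term: len(plist) for term, plist in postings.items()}

    print(df["brutus"], postings["brutus"])     # 2 [1, 2]
    print(df["capitol"], postings["capitol"])   # 1 [1]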

Page 18: Special Topic on Image Retrieval

18

Where do we pay in storage?

• Dictionary: terms and counts
• Pointers from each dictionary entry to its postings list
• Postings: lists of docIDs

Page 19: Special Topic on Image Retrieval

19

Query processing: AND

• Consider processing the query: Brutus AND Caesar
  – Locate Brutus in the Dictionary; retrieve its postings.
  – Locate Caesar in the Dictionary; retrieve its postings.
  – "Merge" the two postings:

  Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34

Page 20: Special Topic on Image Retrieval

20

Intersecting two postings lists (a "merge" algorithm)

Page 21: Special Topic on Image Retrieval

21

The merge

• Walk through the two postings simultaneously, in time linear in the total number of postings entries

  Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
  Result: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
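A sketch of this merge in Python (own code, assuming both postings lists are already sorted by docID):

    def intersect(p1, p2):
        """Linear-time intersection of two sorted postings lists."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [2, 4, 8, 16, 32, 64, 128]
    caesar = [1, 2, 3, 5, 8, 13, 21, 34]
    print(intersect(brutus, caesar))   # [2, 8]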

Page 22: Special Topic on Image Retrieval

22

Ranked retrieval

• Thus far, our queries have all been Boolean.
  – Documents either match or don't.
• Good for expert users with precise understanding of their needs and the collection.
  – Also good for applications: applications can easily consume 1000s of results.
• Not good for the majority of users.
  – Most users incapable of writing Boolean queries (or they are, but they think it's too much work).
  – Most users don't want to wade through 1000s of results.
    • This is particularly true of web search.

Page 23: Special Topic on Image Retrieval

23

Problem with Boolean search: feast or famine

• Boolean queries often result in either too few (=0) or too many (1000s) results.

• Query 1: “standard user dlink 650” → 200,000 hits

• Query 2: “standard user dlink 650 no card found”: 0 hits

• It takes a lot of skill to come up with a query that produces a manageable number of hits.– AND gives too few; OR gives too many

Page 24: Special Topic on Image Retrieval

24

Ranked retrieval models

• Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query

• Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language

• In principle, there are two separate choices here, but in practice, ranked retrieval has normally been associated with free text queries and vice versa

Page 25: Special Topic on Image Retrieval

25

Feast or famine: not a problem in ranked retrieval

• When a system produces a ranked result set, large result sets are not an issue
  – Indeed, the size of the result set is not an issue
  – We just show the top k results
  – We don't overwhelm the user

– Premise: the ranking algorithm works

Page 26: Special Topic on Image Retrieval

26

Scoring as the basis of ranked retrieval

• We wish to return in order the documents most likely to be useful to the searcher

• How can we rank-order the documents in the collection with respect to a query?

• Assign a score – say in [0, 1] – to each document

• This score measures how well document and query “match”.

Page 27: Special Topic on Image Retrieval

27

Query-document matching scores

• We need a way of assigning a score to a query/document pair

• Let's start with a one-term query
• If the query term does not occur in the document: the score should be 0
• The more frequent the query term in the document, the higher the score (should be)
• We will look at a number of alternatives for this.

Page 28: Special Topic on Image Retrieval

28

Binary term-document incidence matrix

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0

Caesar 1 1 0 1 1 1

Calpurnia 0 1 0 0 0 0

Cleopatra 1 0 0 0 0 0

mercy 1 0 1 1 1 1

worser 1 0 1 1 1 0

Each document is represented by a binary vector ∈ {0,1}^|V|

Page 29: Special Topic on Image Retrieval

29

Term-document count matrices

• Consider the number of occurrences of a term in a document:
  – Each document is a count vector in ℕ^|V|: a column below

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 157 73 0 0 0 0

Brutus 4 157 0 1 0 0

Caesar 232 227 0 2 1 1

Calpurnia 0 10 0 0 0 0

Cleopatra 57 0 0 0 0 0

mercy 2 0 3 5 5 1

worser 2 0 1 1 1 0

Page 30: Special Topic on Image Retrieval

30

Bag of words model

• Vector representation doesn’t consider the ordering of words in a document

• John is quicker than Mary and Mary is quicker than John have the same vectors

• This is called the bag of words model.

Page 31: Special Topic on Image Retrieval

31

Term frequency tf

• The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d.

• We want to use tf when computing query-document match scores. But how?

• Raw term frequency is not what we want:
  – A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  – But not 10 times more relevant.

• Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR

Page 32: Special Topic on Image Retrieval

32

Log-frequency weighting

• The log frequency weight of term t in d is

  $$w_{t,d} = \begin{cases} 1 + \log_{10}\mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$$

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:

  $$\mathrm{score}(q,d) = \sum_{t \in q \cap d} \left(1 + \log_{10}\mathrm{tf}_{t,d}\right)$$

• The score is 0 if none of the query terms is present in the document.

Page 33: Special Topic on Image Retrieval

33

Document frequency

• Rare terms are more informative than frequent terms
  – Recall stop words
• Consider a term in the query that is rare in the collection (e.g., arachnocentric)
• A document containing this term is very likely to be relevant to the query arachnocentric
• We want a high weight for rare terms like arachnocentric.
  – We will use document frequency (df) to capture this.

Page 34: Special Topic on Image Retrieval

34

Document frequency, continued

• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the collection (e.g., high, increase, line)
• A document containing such a term is more likely to be relevant than a document that doesn't
• But it's not a sure indicator of relevance.
• → For frequent terms, we want high positive weights for words like high, increase, and line
• But lower weights than for rare terms.
• We will use document frequency (df) to capture this.

Page 35: Special Topic on Image Retrieval

35

idf weight

• df_t is the document frequency of t: the number of documents that contain t
  – df_t is an inverse measure of the informativeness of t
  – df_t ≤ N
• We define the idf (inverse document frequency) of t by

  $$\mathrm{idf}_t = \log_{10}\left(N/\mathrm{df}_t\right)$$

  – We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.

Page 36: Special Topic on Image Retrieval

36

idf example, suppose N = 1 million

term dft idft

calpurnia 1 6

animal 100 4

sunday 1,000 3

fly 10,000 2

under 100,000 1

the 1,000,000 0

There is one idf value for each term t in a collection: idf_t = log10(N/df_t).

Page 37: Special Topic on Image Retrieval

37

tf-idf weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight:

  $$w_{t,d} = \left(1 + \log_{10}\mathrm{tf}_{t,d}\right) \times \log_{10}\left(N/\mathrm{df}_t\right)$$

• Best known weighting scheme in information retrieval
  – Note: the "-" in tf-idf is a hyphen, not a minus sign!
  – Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
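As a quick illustration (a sketch using the formula above, not code from the slides), the weight can be computed directly; the numbers reuse N = 1 million from the idf example.

    import math

    def tf_idf_weight(tf, df, N=1_000_000):
        """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
        if tf == 0 or df == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    print(tf_idf_weight(tf=10, df=100))         # 2 * 4 = 8.0
    print(tf_idf_weight(tf=1, df=1_000_000))    # 0.0: a term in every document gets idf 0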

Page 38: Special Topic on Image Retrieval

38

Score for a document given a query

• There are many variants
  – How "tf" is computed (with/without logs)
  – Whether the terms in the query are also weighted
  – …

  $$\mathrm{Score}(q,d) = \sum_{t \in q \cap d} \text{tf-idf}_{t,d}$$

Page 39: Special Topic on Image Retrieval

39

Effect of idf on ranking

• Does idf have an effect on ranking for one-term queries, like
  – iPhone
• idf has no effect on ranking one-term queries
  – idf affects the ranking of documents for queries with at least two terms
  – For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Page 40: Special Topic on Image Retrieval

40

Collection vs. Document frequency

• The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.

• Example:

• Which word is a better search term (and should get a higher weight)?

Word Collection frequency Document frequency

insurance 10440 3997

try 10422 8760

Page 41: Special Topic on Image Retrieval

41

Binary → count → weight matrix

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 5.25 3.18 0 0 0 0.35

Brutus 1.21 6.1 0 1 0 0

Caesar 8.59 2.54 0 1.51 0.25 0

Calpurnia 0 1.54 0 0 0 0

Cleopatra 2.85 0 0 0 0 0

mercy 1.51 0 1.9 0.12 5.25 0.88

worser 1.37 0 0.11 4.15 0.25 1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R|V|

Page 42: Special Topic on Image Retrieval

42

Vector Space Model

• Documents as vectors
  – So we have a |V|-dimensional vector space
  – Terms are axes of the space
  – Documents are points or vectors in this space
• Queries as vectors
  – Do the same for queries: represent them as vectors in the space
• Rank documents according to their proximity to the query in this space
  – proximity = similarity of vectors
  – proximity ≈ inverse of distance

Page 43: Special Topic on Image Retrieval

43

cosine(query,document)

$$\cos(\vec q, \vec d) = \frac{\vec q \cdot \vec d}{|\vec q|\,|\vec d|} = \frac{\vec q}{|\vec q|} \cdot \frac{\vec d}{|\vec d|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\ \sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

(dot product of the unit vectors)

q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document.

cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

Page 44: Special Topic on Image Retrieval

44

Cosine similarity illustrated

Page 45: Special Topic on Image Retrieval

45

Cosine similarity amongst 3 documents

term SaS PaP WH

affection 115 58 20

jealous 10 7 11

gossip 2 0 6

wuthering 0 0 38

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts)

Note: To simplify this example, we don’t do idf weighting.

Page 46: Special Topic on Image Retrieval

46

3 documents example contd.

Log frequency weighting

term SaS PaP WH

affection 3.06 2.76 2.30

jealous 2.00 1.85 2.04

gossip 1.30 0 1.78

wuthering 0 0 2.58

After length normalization

term SaS PaP WH

affection 0.789 0.832 0.524

jealous 0.515 0.555 0.465

gossip 0.335 0 0.405

wuthering 0 0 0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
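The whole example can be reproduced with a short Python sketch (own code): log-frequency weighting, length normalization, then dot products; as noted above, idf is omitted.

    import math

    counts = {
        "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
        "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
        "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
    }

    def log_weight(c):
        return 1 + math.log10(c) if c > 0 else 0.0

    def normalize(vec):
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()}

    vectors = {doc: normalize({t: log_weight(c) for t, c in tf.items()})
               for doc, tf in counts.items()}

    def cos(a, b):
        return sum(vectors[a][t] * vectors[b][t] for t in vectors[a])

    print(round(cos("SaS", "PaP"), 2))   # 0.94
    print(round(cos("SaS", "WH"), 2))    # 0.79
    print(round(cos("PaP", "WH"), 2))    # 0.69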

Page 47: Special Topic on Image Retrieval

48

Summary – vector space ranking

• Represent the query as a weighted tf-idf vector
• Represent each document as a weighted tf-idf vector
• Compute the cosine similarity score for the query vector and each document vector
• Rank documents with respect to the query by score
• Return the top K to the user

Page 48: Special Topic on Image Retrieval

49

Probability Models

• Retrieval as classification

Page 49: Special Topic on Image Retrieval

50

Bayes Classifier

• Bayes Decision Rule
  – A document D is relevant if P(R|D) > P(NR|D)
• Estimating probabilities
  – use Bayes' Rule: P(R|D) = P(D|R)P(R)/P(D)
  – classify a document as relevant if the likelihood ratio P(D|R)/P(D|NR) > P(NR)/P(R)

Page 50: Special Topic on Image Retrieval

51

Estimating P(D|R)

• Assume independence

• Binary independence model
  – document represented by a vector of binary features indicating term occurrence (or non-occurrence)
  – p_i is the probability that term i occurs (i.e., has value 1) in a relevant document; s_i is the probability of occurrence in a non-relevant document

Page 51: Special Topic on Image Retrieval

52

Binary Independence Model

Page 52: Special Topic on Image Retrieval

53

Binary Independence Model

• Scoring function is

• Query provides information about relevant documents

• If we assume p_i constant and s_i approximated by the entire collection, we get an idf-like weight

Page 53: Special Topic on Image Retrieval

54

Contingency Table

Gives scoring function:

Page 54: Special Topic on Image Retrieval

55

BM25

• Popular and effective ranking algorithm based on the binary independence model
  – adds document and query term weights
  – k1, k2 and K are parameters whose values are set empirically
  – dl is the document length
  – Typical value for k1 is 1.2, k2 varies from 0 to 1000, b = 0.75

Page 55: Special Topic on Image Retrieval

56

BM25 Example

• Query with two terms, "president lincoln" (qf = 1)
• No relevance information (r and R are zero)
• N = 500,000 documents
• "president" occurs in 40,000 documents (n1 = 40,000)
• "lincoln" occurs in 300 documents (n2 = 300)
• "president" occurs 15 times in the document (f1 = 15)
• "lincoln" occurs 25 times (f2 = 25)
• document length is 90% of the average length: dl/avdl = 0.9
• k1 = 1.2, b = 0.75, and k2 = 100
• K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
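The scoring formula itself appears only as an image on the slide; the sketch below assumes the BM25 variant with relevance counts r and R from Croft, Metzler and Strohman (which matches the parameters listed above) and plugs in this example's numbers.

    import math

    def bm25(N, terms, k1=1.2, k2=100, b=0.75, dl_avdl=0.9, R=0):
        """terms: list of (n_i = df, f_i = tf in doc, qf_i = tf in query, r_i = rel count)."""
        K = k1 * ((1 - b) + b * dl_avdl)          # K = 1.2 * (0.25 + 0.75 * 0.9) = 1.11
        score = 0.0
        for n_i, f_i, qf_i, r_i in terms:
            idf_part = math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                                ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))
            tf_part = ((k1 + 1) * f_i) / (K + f_i)
            qtf_part = ((k2 + 1) * qf_i) / (k2 + qf_i)
            score += idf_part * tf_part * qtf_part
        return score

    # "president lincoln": (n1 = 40000, f1 = 15, qf = 1), (n2 = 300, f2 = 25, qf = 1)
    print(round(bm25(N=500_000, terms=[(40_000, 15, 1, 0), (300, 25, 1, 0)]), 2))
    # ≈ 20.6; "lincoln", the much rarer term, contributes most of the score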

Page 56: Special Topic on Image Retrieval

57

BM25 Example

Page 57: Special Topic on Image Retrieval

58

BM25 Example

• Effect of term frequencies

Page 58: Special Topic on Image Retrieval

59

Retrieval Model Overview

• Occurrence based models
  – Boolean Retrieval
  – Vector Space Model
• Probability models
  – BM25
  – Language Model
• Hyperlink-based models
  – PageRank
• Machine learning based models
  – Learning to Rank

Page 59: Special Topic on Image Retrieval

60

PageRank

• Links can be viewed as information about the popularity of a web page

Larry Page and Sergey Brin

Page 60: Special Topic on Image Retrieval

61

PageRank

• PageRank (PR) of page C = PR(A)/2 + PR(B)/1
• More generally,

  $$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$$

  – where B_u is the set of pages that point to u, and L_v is the number of outgoing links from page v.

Page 61: Special Topic on Image Retrieval

62

Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

Three pages: Yahoo (y), Amazon (a), M'soft (m)

Transition matrix M (columns y, a, m):

      y    a    m
  y [ 1/2  1/2  0 ]
  a [ 1/2   0   1 ]
  m [  0   1/2  0 ]

Flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2

In matrix form: r = M r

Page 62: Special Topic on Image Retrieval

63

Power Iteration

• Simple iterative scheme (aka relaxation)
• Suppose there are N web pages
• Initialize: r_0 = [1, …, 1]^T
• Iterate: r_{k+1} = M r_k
• Stop when |r_{k+1} − r_k|_1 < ε
  – |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm
  – Can use any other vector norm, e.g., Euclidean
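A minimal Python sketch (own code) of this scheme, run on the three-page example from the previous slide:

    def power_iteration(M, eps=1e-9, max_iter=1000):
        n = len(M)
        r = [1.0] * n                                    # r_0 = [1, ..., 1]^T
        for _ in range(max_iter):
            r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
            if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:   # L1 stopping test
                return r_next
            r = r_next
        return r

    #        y    a    m
    M = [[0.5, 0.5, 0.0],    # y = y/2 + a/2
         [0.5, 0.0, 1.0],    # a = y/2 + m
         [0.0, 0.5, 0.0]]    # m = a/2
    print(power_iteration(M))   # converges to about [6/5, 6/5, 3/5]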

Page 63: Special Topic on Image Retrieval

64

Power Iteration Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

Yahoo (y), Amazon (a), M'soft (m), with the transition matrix M from before (columns y, a, m):

      y    a    m
  y [ 1/2  1/2  0 ]
  a [ 1/2   0   1 ]
  m [  0   1/2  0 ]

Iterating r = M r:

  y:  1    1    5/4   9/8  …  6/5
  a:  1   3/2    1   11/8  …  6/5
  m:  1   1/2   3/4   1/2  …  3/5

Page 64: Special Topic on Image Retrieval

Matrix Notation

• r = B r
• PageRank is embedded in the eigenvector of B (normalized adjacency matrix) associated with the eigenvalue 1.
• Conditional probability:

  $$b_{uv} = \begin{cases} \dfrac{a_{uv}}{\sum_{w} a_{uw}}, & \text{if } a_{uv} \neq 0 \\[4pt] 0, & \text{otherwise} \end{cases}$$

66

Page 65: Special Topic on Image Retrieval

67

Spider traps

• A group of pages is a spider trap if there are no links from within the group to outside the group
  – The random surfer gets trapped

• Spider traps violate the conditions needed for the random walk theorem

Page 66: Special Topic on Image Retrieval

68

Microsoft becomes a spider trap (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

Yahoo (y), Amazon (a), M'soft (m); M'soft now links only to itself:

      y    a    m
  y [ 1/2  1/2  0 ]
  a [ 1/2   0   0 ]
  m [  0   1/2  1 ]

Iterating r = M r:

  y:  1    1    3/4   5/8  …  0
  a:  1   1/2   1/2   3/8  …  0
  m:  1   3/2   7/4    2   …  3

Page 67: Special Topic on Image Retrieval

69

PageRank with Random Walk

• Taking the random page jump into account, 1/3 chance of going to any page
  – PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1)
• More generally,

  $$PR(u) = \frac{\lambda}{N} + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$$

  – where N is the number of pages, λ typically 0.15
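The power-iteration sketch from before needs only a small change to add the random jump (own code; here λ = 0.2, so 1 − λ = 0.8, and each page receives λ times an equal share of the total rank):

    def pagerank(M, lam=0.2, iters=100):
        n = len(M)
        r = [1.0] * n
        for _ in range(iters):
            # teleport term + (1 - lam) times the usual link contribution
            r = [lam * sum(r) / n +
                 (1 - lam) * sum(M[i][j] * r[j] for j in range(n))
                 for i in range(n)]
        return r

    M_trap = [[0.5, 0.5, 0.0],   # the spider-trap example: M'soft links only to itself
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]]
    print(pagerank(M_trap))      # ≈ [7/11, 5/11, 21/11] instead of [0, 0, 3]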

Page 68: Special Topic on Image Retrieval

70

Previous example with 1 − λ = 0.8 (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

Yahoo (y), Amazon (a), M'soft (m), spider-trap graph:

          [ 1/2  1/2  0 ]         [ 1/3  1/3  1/3 ]     [ 7/15  7/15   1/15 ]
  A = 0.8 [ 1/2   0   0 ]  + 0.2  [ 1/3  1/3  1/3 ]  =  [ 7/15  1/15   1/15 ]
          [  0   1/2  1 ]         [ 1/3  1/3  1/3 ]     [ 1/15  7/15  13/15 ]

Iterating r = A r:

  y:  1   1.00   0.84   0.776  …   7/11
  a:  1   0.60   0.60   0.536  …   5/11
  m:  1   1.40   1.56   1.688  …  21/11

Page 69: Special Topic on Image Retrieval

71

Microsoft becomes a dead end (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

          [ 1/2  1/2  0 ]         [ 1/3  1/3  1/3 ]     [ 7/15  7/15  1/15 ]
  A = 0.8 [ 1/2   0   0 ]  + 0.2  [ 1/3  1/3  1/3 ]  =  [ 7/15  1/15  1/15 ]
          [  0   1/2  0 ]         [ 1/3  1/3  1/3 ]     [ 1/15  7/15  1/15 ]

  Non-stochastic! (the M'soft column sums to less than 1)

Iterating r = A r:

  y:  1    1     0.787   0.648  …  0
  a:  1   0.6    0.547   0.430  …  0
  m:  1   0.6    0.387   0.333  …  0

Page 70: Special Topic on Image Retrieval

72

Dead ends

• Pages with no outlinks are "dead ends" for the random surfer
  – Nowhere to go on the next step
• Prune and propagate
  – Preprocess the graph to eliminate dead ends
  – Might require multiple passes
  – Compute PageRank on the reduced graph
  – Approximate values for dead ends by propagating values from the reduced graph

Page 71: Special Topic on Image Retrieval

73

Learning to Rank

• Machine learning is an effective tool
  – To automatically tune parameters.
  – To combine multiple sources of evidence.
  – To avoid over-fitting (by means of regularization, etc.)
• "Learning to Rank"
  – Use machine learning technologies to train the ranking model
  – A hot research topic in recent years.

Page 72: Special Topic on Image Retrieval

74

Typical Framework of Learning to Rank

(Figure) Training data: queries q^(1), …, q^(m), each with documents d_1^(i), …, d_n^(i) labelled with multiple-level relevance ratings (e.g., 4, 2, 1). The Learning System minimizes a loss over this training data to produce a ranking model f(q, d, w). The Ranking System then applies f(q, d, w) to score and rank the documents d_1, d_2, …, d_n of a new test query.

Page 73: Special Topic on Image Retrieval

75

Sample Data from LETOR

(Figure: each line of the LETOR sample shows a relevance label, a document identifier, and its feature values.)

Page 74: Special Topic on Image Retrieval

76

Learning to Rank

• Features
  – tf
  – idf
  – tf-idf
  – Boolean model
  – Vector space model
  – BM25
  – ……

Page 75: Special Topic on Image Retrieval

77

Connection with Conventional Machine Learning

• With certain assumptions, learning to rank can be transformed into a kind of conventional machine learning problem
  – Regression
  – Classification
  – Preference learning

• Ranking SVM

Page 76: Special Topic on Image Retrieval

78

Ranking SVM

• Training data is a set of queries, each with partial rank information r over its documents
  – r is partial rank information
    • if document d_a should be ranked higher than d_b, then (d_a, d_b) ∈ r
  – partial rank information comes from relevance judgments (allows multiple levels of relevance) or click data
    • e.g., d1, d2 and d3 are the documents in the first, second and third rank of the search output; if only d3 is clicked on → (d3, d1) and (d3, d2) will be in the desired ranking for this query

Page 77: Special Topic on Image Retrieval

79

Ranking SVM

• To break the assumption of absolute relevance, for each query, transform its corresponding ranked list into document pairs
• Formalizing ranking as classification over document pairs

Transform: a query q^(i) with documents d_1^(i), d_2^(i), …, d_n^(i) rated, say, 5, 3, …, 2 becomes the set of preference pairs (d_1^(i), d_2^(i)), …, (d_1^(i), d_n^(i)), …, (d_2^(i), d_n^(i)), i.e. 5 > 3, 5 > 2, 3 > 2.
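A small Python sketch (own code, with hypothetical document ids) of this transformation from a rated list to preference pairs:

    from itertools import combinations

    def to_pairs(rated_docs):
        """rated_docs: list of (doc_id, rating). Returns (preferred, less_preferred) pairs."""
        pairs = []
        for (da, ra), (db, rb) in combinations(rated_docs, 2):
            if ra > rb:
                pairs.append((da, db))
            elif rb > ra:
                pairs.append((db, da))
            # equal ratings produce no pair
        return pairs

    print(to_pairs([("d1", 5), ("d2", 3), ("d3", 2)]))
    # [('d1', 'd2'), ('d1', 'd3'), ('d2', 'd3')]   i.e. 5 > 3, 5 > 2, 3 > 2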

Page 78: Special Topic on Image Retrieval

80

Ranking SVM

• Learning a linear ranking function w · d_a
  – where w is a weight vector that is adjusted by learning
  – d_a is the vector representation of the features of document a
  – non-linear functions are also possible
• Weights represent the importance of features
  – learned using training data
  – e.g.,

Page 79: Special Topic on Image Retrieval

81

Ranking SVM

• Learn a weight vector w that satisfies as many of the following conditions as possible:
  – w · d_a > w · d_b for every preference pair (d_a, d_b) ∈ r
• Can be formulated as an optimization problem

Ref: Thorsten Joachims. Optimizing search engines using clickthrough data. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2002.

Page 80: Special Topic on Image Retrieval

82

Ranking SVM

– ξ, known as a slack variable, allows for misclassification of difficult or noisy training examples

– C is a parameter that is used to prevent overfitting

Page 81: Special Topic on Image Retrieval

83

Support Vector Machines

Page 82: Special Topic on Image Retrieval

84

History

• SVMs introduced in COLT-92 by Boser, Guyon & Vapnik. Became rather popular since.

• Theoretically well motivated algorithm: developed from Statistical Learning Theory (Vapnik & Chervonenkis) since the 60s.

• Empirically good performance: successful applications in many fields (bioinformatics, text, image recognition, . . . )

Page 83: Special Topic on Image Retrieval

85

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
  – E.g., perceptron
• Support Vector Machine (SVM) finds an optimal solution.
  – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions

This line represents the decision boundary: ax + by − c = 0

Page 84: Special Topic on Image Retrieval

86

Another intuition

• If you have to place a fat separator between the classes, you have fewer choices, and so the capacity of the model has been decreased

Page 85: Special Topic on Image Retrieval

87

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• A.k.a. large margin classifiers
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem
• Currently widely seen as the best text classification method.

(Figure: the support vectors lie on the maximized margin; a narrower margin gives a worse separator.)

Page 86: Special Topic on Image Retrieval

88

• w: decision hyperplane normal vector
• x_i: data point i
• y_i: class of data point i (+1 or −1)   Note: not 1/0
• Classifier is: f(x_i) = sign(w^T x_i + b)
• Functional margin of x_i is: y_i (w^T x_i + b)
  – But note that we can increase this margin simply by scaling w, b…
• Functional margin of the dataset is twice the minimum functional margin of any point
  – The factor of 2 comes from measuring the whole width of the margin

Maximum Margin: Formalization

Page 87: Special Topic on Image Retrieval

89

Geometric Margin

• Distance from an example to the separator is

  $$r = \frac{y\,(w^T x + b)}{|w|}$$

• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the width of separation between the support vectors of the classes.

Derivation of finding r: the dotted line x′−x is perpendicular to the decision boundary, so parallel to w. The unit vector is w/|w|, so the line is r·w/|w|. Then x′ = x − y·r·w/|w|, and x′ satisfies w^T x′ + b = 0. So w^T(x − y·r·w/|w|) + b = 0. Recall that |w| = sqrt(w^T w). Solving for r gives r = y(w^T x + b)/|w|.

Page 88: Special Topic on Image Retrieval

90

Linear SVM Mathematically: the linearly separable case

• Assuming all the data has a functional margin of at least 1, the following two constraints follow for a training set {(x_i, y_i)}:

  w^T x_i + b ≥ 1   if y_i = 1
  w^T x_i + b ≤ −1  if y_i = −1

• For support vectors, the inequality becomes an equality
• Then, since each example's distance from the hyperplane is r = y(w^T x + b)/|w|
• The margin is: ρ = 2/|w|

Page 89: Special Topic on Image Retrieval

91

Linear Support Vector Machine (SVM)

• Hyperplane: w^T x + b = 0
• Extra scale constraint: min_{i=1,…,n} |w^T x_i + b| = 1
• This implies: w^T(x_a − x_b) = 2, so ρ = 2/‖w‖_2
  (x_a and x_b lie on the margin hyperplanes w^T x_a + b = 1 and w^T x_b + b = −1)

Page 90: Special Topic on Image Retrieval

92

Linear SVMs Mathematically (cont.)

• Then we can formulate the quadratic optimization problem:

  Find w and b such that ρ = 2/‖w‖ is maximized, and for all {(x_i, y_i)}:
  w^T x_i + b ≥ 1 if y_i = 1;  w^T x_i + b ≤ −1 if y_i = −1

• A better formulation (min ‖w‖ = max 1/‖w‖):

  Find w and b such that Φ(w) = ½ w^T w is minimized, and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1

  Equivalently:

  $$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w^T x_i + b) \ge 1,\ \ i = 1,\dots,n$$

Page 91: Special Topic on Image Retrieval

94

Solving the Optimization Problem

• This is now optimizing a quadratic function subject to linear constraints
• Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs)
• The solution involves constructing a dual problem where a Lagrange multiplier α_i is associated with every constraint in the primal problem:

  Primal: Find w and b such that Φ(w) = ½ w^T w is minimized, and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1

  Dual: Find α_1 … α_N such that Q(α) = Σ α_i − ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized, and
  (1) Σ α_i y_i = 0
  (2) α_i ≥ 0 for all α_i

Page 92: Special Topic on Image Retrieval

95

• By using Lagrange theory:

  $$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[\, y_i\,(w^T x_i + b) - 1 \,\right]$$

  $$\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{n} \alpha_i y_i x_i
  \qquad\quad
  \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0$$

  Substituting back:

  $$L(w, b, \alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$

Solving the Optimization Problem

Page 93: Special Topic on Image Retrieval

96

• We get the equivalent primal and dual problems:

  Primal:
  $$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w^T x_i + b) \ge 1,\ \ i = 1,\dots,n$$

  Dual:
  $$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
  \quad \text{s.t.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0,\ \ \alpha_i \ge 0,\ \ i = 1,\dots,n$$

Solving the Optimization Problem

Page 94: Special Topic on Image Retrieval

97

The Optimization Problem Solution

• The solution has the form:

  w = Σ α_i y_i x_i      b = y_k − w^T x_k   for any x_k such that α_k ≠ 0

• Each non-zero α_i indicates that the corresponding x_i is a support vector.
• Then the classifying function will have the form:

  f(x) = Σ α_i y_i x_i^T x + b

• Notice that it relies on an inner product between the test point x and the support vectors x_i – we will return to this later.
• Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all pairs of training points.

Page 95: Special Topic on Image Retrieval

98

Soft Margin Classification

• If the training data is not linearly separable, slack variables ξ_i can be added to allow misclassification of difficult or noisy examples.
• Allow some errors
  – Let some points be moved to where they belong, at a cost ξ_i
• Still, try to minimize training set errors, and to place the hyperplane "far" from each class (large margin)

Page 96: Special Topic on Image Retrieval

99

Soft Margin Classification Mathematically

• The old formulation:

  Find w and b such that Φ(w) = ½ w^T w is minimized, and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1

• The new formulation incorporating slack variables:

  Find w and b such that Φ(w) = ½ w^T w + C Σ ξ_i is minimized, and for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1 − ξ_i, with ξ_i ≥ 0 for all i

• Parameter C can be viewed as a way to control overfitting – a regularization term

Page 97: Special Topic on Image Retrieval

100

Soft Margin Classification – Solution

• The dual problem for soft margin classification:

  Find α_1 … α_N such that Q(α) = Σ α_i − ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized, and
  (1) Σ α_i y_i = 0
  (2) 0 ≤ α_i ≤ C for all α_i

• Neither the slack variables ξ_i nor their Lagrange multipliers appear in the dual problem!
• Again, x_i with non-zero α_i will be support vectors.
• Solution to the dual problem is:

  w = Σ α_i y_i x_i
  b = y_k − w^T x_k   for any k s.t. 0 < α_k < C
  f(x) = Σ α_i y_i x_i^T x + b

  (w is not needed explicitly for classification!)

Page 98: Special Topic on Image Retrieval

101

Classification with SVMs

• Given a new point x, we can score its projection onto the hyperplane normal:
  – i.e., compute the score: w^T x + b = Σ α_i y_i x_i^T x + b
  – Can set a confidence threshold t:
    • Score > t: yes
    • Score < −t: no
    • Else: don't know

Page 99: Special Topic on Image Retrieval

102

Linear SVMs: Summary

• The classifier is a separating hyperplane.
• The most "important" training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. have non-zero Lagrange multipliers α_i.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

  Find α_1 … α_N such that Q(α) = Σ α_i − ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized, and
  (1) Σ α_i y_i = 0
  (2) 0 ≤ α_i ≤ C for all α_i

  f(x) = Σ α_i y_i x_i^T x + b

Page 100: Special Topic on Image Retrieval

103

Non-linear SVMs

• Datasets that are linearly separable (with some noise) work out great
• But what are we going to do if the dataset is just too hard?
• How about … mapping the data to a higher-dimensional space?
  (Figure: 1-D data on the x axis that is not linearly separable becomes separable after mapping x → (x, x²).)

Page 101: Special Topic on Image Retrieval

104

Non-linear SVMs: Feature spaces

• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

  Φ: x → ϕ(x)

A mapping of the features can make the classification task easier.

Page 103: Special Topic on Image Retrieval

106

The “Kernel Trick”

• The linear classifier relies on an inner product between vectors K(x_i, x_j) = x_i^T x_j
• If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:

  K(x_i, x_j) = φ(x_i)^T φ(x_j)

• A kernel function is some function that corresponds to an inner product in some expanded feature space.
• Example: 2-dimensional vectors x = [x_1 x_2]^T; let K(x_i, x_j) = (1 + x_i^T x_j)².
  Need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):

  K(x_i, x_j) = (1 + x_i^T x_j)² = (1 + x_{i1} x_{j1} + x_{i2} x_{j2})²
  = 1 + x_{i1}² x_{j1}² + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}² x_{j2}² + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
  = [1  x_{i1}²  √2 x_{i1} x_{i2}  x_{i2}²  √2 x_{i1}  √2 x_{i2}]^T [1  x_{j1}²  √2 x_{j1} x_{j2}  x_{j2}²  √2 x_{j1}  √2 x_{j2}]
  = φ(x_i)^T φ(x_j),   where φ(x) = [1  x_1²  √2 x_1 x_2  x_2²  √2 x_1  √2 x_2]^T
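A quick numerical check of this identity in Python (own sketch):

    import math

    def K(x, z):
        return (1 + x[0] * z[0] + x[1] * z[1]) ** 2

    def phi(x):
        s = math.sqrt(2)
        return [1, x[0] ** 2, s * x[0] * x[1], x[1] ** 2, s * x[0], s * x[1]]

    x, z = [1.0, 2.0], [3.0, -1.0]
    print(K(x, z))                                                 # 4.0
    print(round(sum(a * b for a, b in zip(phi(x), phi(z))), 6))    # 4.0 as well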

Page 104: Special Topic on Image Retrieval

107

Support Vector Machine

• Given samples (x_1, y_1), …, (x_n, y_n), the optimal solution is obtained by solving the following optimization problem:

  $$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w^T x_i + b) \ge 1,\ \ i = 1,\dots,n$$

• By mapping the source features to a high-dimensional space with a non-linear mapping x → φ(x), we can solve the SVM in the new space:

  $$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w^T \varphi(x_i) + b) \ge 1,\ \ i = 1,\dots,n$$

Page 105: Special Topic on Image Retrieval

108

Support Vector Machine

• By using Lagrange theory:

  $$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[\, y_i\,(w^T \varphi(x_i) + b) - 1 \,\right]$$

  $$\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{n} \alpha_i y_i\, \varphi(x_i)
  \qquad\quad
  \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0$$

  Substituting back:

  $$L(w, b, \alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, \varphi(x_i)^T \varphi(x_j)
  = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$

Page 106: Special Topic on Image Retrieval

109

Support vector machine

• In the decision function f(x) = w^T φ(x) + b, with w = Σ_i α_i y_i φ(x_i), we can just replace the dot products with kernels K(x_i, x):

  $$f(x) = \sum_{i=1}^{n} \alpha_i y_i\, \varphi(x_i)^T \varphi(x) + b = \sum_{i=1}^{n} \alpha_i y_i\, K(x_i, x) + b$$

Page 107: Special Topic on Image Retrieval

Kernels

• Why use kernels?
  – Make a non-separable problem separable.
  – Map data into a better representational space
• Common kernels
  – Linear, Polynomial, Gaussian/Radial basis function

110

Page 108: Special Topic on Image Retrieval

Non-Binary Classification with SVMs

• One against rest
  – Train a "class c vs. not class c" SVM for each class
  – If there are K classes, must train K classifiers
• One against one
  – Train a binary classifier for every pair of classes
  – Must train K(K−1)/2 classifiers
  – Computationally expensive for large values of K

111

Page 109: Special Topic on Image Retrieval

112

1-a-r (1-against-rest)

Classes A, B, C → 3 classifiers: A vs B∪C, B vs A∪C, C vs A∪B

For the i-th class, solve:

  $$\min_{w^i, b^i, \xi^i}\ \frac{1}{2}(w^i)^T w^i + C \sum_t \xi^i_t$$
  $$\text{s.t.}\quad (w^i)^T \varphi(x_t) + b^i \ge 1 - \xi^i_t,\ \ \text{if } x_t \text{ is in the } i\text{-th class},$$
  $$\qquad\ \ (w^i)^T \varphi(x_t) + b^i \le -1 + \xi^i_t,\ \ \text{if } x_t \text{ is not in the } i\text{-th class},$$
  $$\qquad\ \ \xi^i_t \ge 0.$$

Decision function: each one-vs-rest classifier gives a score

  $$f^i(x) = \sum_{t=1}^{l} \alpha^i_t y_t\, K(x_t, x) + b^i$$

(its sign, sgn f^i(x), gives the binary decision); assign x to the class whose classifier gives the largest score.

Page 110: Special Topic on Image Retrieval

113

1-a-1 (1-against-1)

Classes A, B, C → 3 classifiers: A vs B, B vs C, C vs A

For each pair of classes (i, j), solve:

  $$\min_{w^{ij}, b^{ij}, \xi^{ij}}\ \frac{1}{2}(w^{ij})^T w^{ij} + C \sum_t \xi^{ij}_t$$
  $$\text{s.t.}\quad (w^{ij})^T \varphi(x_t) + b^{ij} \ge 1 - \xi^{ij}_t,\ \ \text{if } x_t \text{ is in the } i\text{-th class},$$
  $$\qquad\ \ (w^{ij})^T \varphi(x_t) + b^{ij} \le -1 + \xi^{ij}_t,\ \ \text{if } x_t \text{ is in the } j\text{-th class},$$
  $$\qquad\ \ \xi^{ij}_t \ge 0.$$

Decision function: combine the pairwise classifiers by voting.

Page 111: Special Topic on Image Retrieval

SVM Tools

• Solving the SVM optimization problem is not straightforward
• Many good software packages exist
  – Matlab SVM Toolbox
  – LIBSVM
  – SVM-Light
  – …
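For illustration only (this library is not mentioned on the slide): scikit-learn's SVC wraps LIBSVM, so a soft-margin kernel SVM can be trained in a few lines.

    from sklearn.svm import SVC

    X = [[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [2, 3], [3, 2], [3, 3]]
    y = [-1, -1, -1, -1, 1, 1, 1, 1]

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # C is the slack penalty from the soft-margin formulation
    clf.fit(X, y)
    print(clf.predict([[0.5, 0.5], [2.5, 2.5]]))    # expected: [-1  1]
    print(len(clf.support_vectors_))                # number of support vectors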

114

Page 112: Special Topic on Image Retrieval

115

Ranking SVM

– ξ, known as a slack variable, allows for misclassification of difficult or noisy training examples

– C is a parameter that is used to prevent overfitting

Page 113: Special Topic on Image Retrieval

116

Ranking classification

For a query q with documents d_1, …, d_n, the preference pairs (here (d_3, d_2), (d_7, d_2), (d_7, d_4), (d_7, d_5), (d_7, d_6)) become classification constraints on score differences, since

  w^T d_i − w^T d_j = w^T (d_i − d_j)

Ranking SVM optimization:

  $$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{k} \xi_k$$
  $$\text{s.t.}\quad w^T d_i \ge w^T d_j + 1 - \xi_k \ \ \text{for each preference pair } (d_i, d_j) \in r_q,$$
  $$\qquad\ \ \xi_k \ge 0,\ \ k = 1, \dots, |r_q|.$$

Ranking SVM
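A rough end-to-end sketch (own construction, with toy feature vectors) of treating Ranking SVM as classification over document pairs: each preference d_i > d_j yields the difference vector d_i − d_j as a positive example (and the reverse as a negative one), a linear SVM learns w, and documents are then ranked by the score w^T d.

    import numpy as np
    from sklearn.svm import LinearSVC

    docs = np.array([[3.0, 1.0], [2.0, 0.5], [1.0, 2.0], [0.5, 0.2]])  # toy feature vectors
    prefs = [(0, 1), (0, 2), (1, 3), (2, 3)]                           # (better, worse) index pairs

    X, y = [], []
    for i, j in prefs:
        X.append(docs[i] - docs[j]); y.append(+1)
        X.append(docs[j] - docs[i]); y.append(-1)

    svm = LinearSVC(C=1.0)
    svm.fit(np.array(X), np.array(y))
    w = svm.coef_[0]

    scores = docs @ w
    print(np.argsort(-scores))   # document indices from best to worst under the learned w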