1
Iterative residual rescaling: An analysis and generalization of LSI
Rie Kubota Ando & Lillian Lee. Iterative residual rescaling: An analysis and generalization of LSI. In the 24th Annual International ACM SIGIR Conference (SIGIR 2001), 2001.
Presenter: 游斯涵
2
Introduction
• The disadvantage of VSM:
– Documents that do not share terms are mapped to orthogonal vectors even if they are clearly related, e.g. d_i = (0, 1, 0) and d_j = (1, 0, 1) have cosine similarity 0.
• LSI attempts to overcome this shortcoming by projecting the term-document matrix onto a lower-dimensional subspace.
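The orthogonality problem on the slide's two example vectors can be checked directly (a minimal sketch using NumPy):

```python
import numpy as np

# Two documents that share no terms: their VSM vectors are orthogonal,
# so cosine similarity is 0 even if the documents are topically related.
d_i = np.array([0.0, 1.0, 0.0])
d_j = np.array([1.0, 0.0, 1.0])

cos = d_i @ d_j / (np.linalg.norm(d_i) * np.linalg.norm(d_j))
print(cos)  # 0.0
```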
3
Introduction of IRR
• LSI: weight the term-document matrix A, apply the SVD A = U Σ V^T (eigenvalues / eigenvectors), and keep the leading left singular vectors.
• IRR: like LSI, but the residual vectors are rescaled before each new basis vector is extracted.
(Diagram: term-document matrix A with weighting → SVD → U, Σ, V^T; IRR adds a rescaling step.)
4
Frobenius norm and matrix 2-norm
• Frobenius norm: for X in R^{m×n} with h = rank(X),

  ||X||_F ≝ sqrt( Σ_{i=1}^m Σ_{j=1}^n X[i,j]^2 )

• 2-norm:

  ||X||_2 ≝ max_{||y||_2 = 1} ||X y||_2
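The two norms can be computed and compared numerically; a small sketch on an arbitrary example matrix:

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])

# Frobenius norm: square root of the sum of squared entries.
fro = np.sqrt((X ** 2).sum())

# 2-norm: max of ||X y||_2 over unit vectors y = the largest singular value.
two = np.linalg.svd(X, compute_uv=False)[0]

assert np.isclose(fro, np.linalg.norm(X, 'fro'))
assert np.isclose(two, np.linalg.norm(X, 2))
assert two <= fro  # always holds: ||X||_2 <= ||X||_F
```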
5
Analyzing LSI
• Topic-based similarities– C: an n-document collection– D: m-by-n term-document matrix – k: underlying topics (k<n)– Relevance score:
for each document and each topic:
for each document:
True topic-based similarity between and
then we can get a n-by-n matrix S
),( dtrel
)(
2 1),(Ctopicst
dtrel
d d
)(
),(),(),(ctopicst
dtreldtrelddsim
),(],[ ddsimddS
doc
topic
Sdoc
topic
topic
doc
doc
doc
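The construction of S from the relevance scores can be sketched in a few lines; the relevance values below are hypothetical, chosen only to illustrate the normalization and the product:

```python
import numpy as np

# Hypothetical relevance scores rel(t, d): k=2 topics x n=3 documents.
rel = np.array([[0.9, 0.7, 0.1],
                [0.2, 0.4, 0.8]])

# Normalize so that sum_t rel(t, d)^2 = 1 for each document (column).
rel = rel / np.linalg.norm(rel, axis=0)

# True topic-based similarity: S[i, j] = sum_t rel(t, d_i) * rel(t, d_j).
S = rel.T @ rel

assert S.shape == (3, 3)
assert np.allclose(np.diag(S), 1.0)  # each document is fully similar to itself
```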
6
The optimum subspace
• Given a subspace χ of R^m, let B form an orthonormal basis of χ. The projection onto χ is

  P_χ(x) ≝ B B^T x

• Properties:
  – for x ∈ χ: P_χ(x) = x
  – for x ⊥ χ: P_χ(x) = 0
  – B preserves angles: cos(B x, B y) ≝ (B x)^T (B y) / (||B x|| · ||B y||) = cos(x, y)
7
The optimum subspace
• We have the m-by-n term-document matrix D = [d_1 d_2 ... d_n].
• The projection of D onto χ is P_χ(D) = [P(d_1) ... P(d_n)].
• Cosine similarity after projection:

  cos(P(d_i), P(d_j)) = P(d_i)^T P(d_j) / ( ||P(d_i)|| · ||P(d_j)|| )
8
The optimum subspace
• Deviation matrix:

  diff_{S,D}(P) ≝ S − P(D)^T P(D)

  We want a subspace for which the entries of this matrix are small.
• The optimum subspace:

  χ_opt ≝ argmin_{χ ⊆ range(D)} ||diff_{S,D}(P_χ)||_2

• Optimum error: ||diff_{S,D}(P_{χ_opt})||_2
  If the optimum error is high, we cannot expect the optimum subspace to fully reveal the topic dominances.
9
The singular value decomposition and LSI
• SVD: Z = U Σ V^T
  The left singular vectors span Z's range, and σ_1 = ||Z||_2.
• Intuition for the left singular vectors, gained by the following observation:
  let proj^{(i)}(d_j) be the projection of d_j onto the span of u_1, u_2, ..., u_i,
  and let r_j^{(i)} be the residual vector:

  r_j^{(i)} = d_j − proj^{(i−1)}(d_j)
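The projection/residual decomposition above can be verified numerically: the residuals are orthogonal to the leading left singular vectors, and their total squared size equals the sum of the discarded squared singular values. A sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((5, 4))          # m-by-n term-document matrix (toy data)
U, s, Vt = np.linalg.svd(D, full_matrices=False)

i = 2
# proj^{(i)}(d_j): projection of d_j onto span{u_1, ..., u_i}.
B = U[:, :i]
proj = B @ (B.T @ D)

# Residual vectors r_j^{(i+1)} = d_j - proj^{(i)}(d_j), collected as columns.
R = D - proj

# Residuals live in the orthogonal complement of span{u_1, ..., u_i},
# and ||R||_F^2 equals the sum of the remaining squared singular values.
assert np.allclose(B.T @ R, 0.0)
assert np.isclose((R ** 2).sum(), (s[i:] ** 2).sum())
```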
10
Analysis of LSI
11
Non-uniformity and LSI
• A crucial quantity in our analysis is the dominance δ_t of a given topic t:

  δ_t ≝ sqrt( Σ_{d ∈ C} rel(t, d)^2 )
12
Non-uniformity and LSI
• Topic mingling:

  μ(C) ≝ Σ_{t, t' ∈ topics(C), t ≠ t'} Σ_{d ∈ C} ( rel(t, d) · rel(t', d) )^2

• If the topic mingling is high, each document has high similarity with several different topics, so the topics will be fairly difficult to distinguish.
• In the single-topic-documents case, μ(C) = 0.
13
Non-uniformity and LSI
• Let σ_i be the ith largest singular value of P_{χ_opt}(D). Then σ_i^2 approximates the squared topic dominance δ_i^2, with error governed by the optimum error and the topic mingling μ(C).
(Recall: δ_t^2 = Σ_{d ∈ C} rel(t, d)^2; diff_{S,D}(P) = S − P(D)^T P(D); optimum error ||diff_{S,D}(P_{χ_opt})||_2; μ(C) = Σ_{t ≠ t'} Σ_{d ∈ C} (rel(t, d) rel(t', d))^2.)
14
Non-uniformity and LSI
• Define δ_max ≝ δ_1 and δ_min ≝ δ_h, where h is the dimension of χ_opt.
• We can then study the ratio δ_max / δ_min:
  the more the largest topic dominates the collection, the higher this ratio will tend to be.
15
Non-uniformity and LSI
• Original error:
  Let χ_VSM denote the VSM space, so that P_{χ_VSM}(Ã) = Ã. Then

  E_VSM ≝ ||S − Ã^T Ã||_2

• Root original error (the input error): ε̂_vsm ≝ sqrt(E_VSM)
16
Non-uniformity and LSI
• Let χ_LSI be the h-dimensional LSI subspace spanned by the first h left singular vectors of D.
• If δ_min > ε̂_vsm, then

  tan(∠(χ_LSI, χ_opt)) ≤ ( ε̂_vsm · δ_max ) / ( δ_min^2 · (1 − (ε̂_vsm / δ_min)^2) )

• So χ_LSI must be close to χ_opt when the topic-document distribution is relatively uniform (δ_max / δ_min small) and the input error is small.
17
Notation for related values
• μ is the topic mingling.
• For x_1 ≥ x_2 ≥ ... ≥ x_n ≥ 0 and y_1 ≥ y_2 ≥ ... ≥ y_n ≥ 0, we write x_i ≈ y_i if

  max_i |x_i^2 − y_i^2| ≤ E_max^opt  and  (1/n) Σ_{i=1}^n |x_i^2 − y_i^2| ≤ E_avg^opt

  where E_max^opt ≝ ||diff_{S,D}(P_{χ_opt})||_2 and E_avg^opt ≝ ||diff_{S,D}(P_{χ_opt})||_F / n.
• The approximation becomes closer as the optimum error E_max^opt (or the average error E_avg^opt) becomes smaller.
18
Ando’s IRR algorithm
• IRR algorithm: starting from R^{(1)} = D̃, repeatedly extract a basis vector and remove it from the residuals:

  u_i = argmax_{||x|| = 1} Σ_{j=1}^n ( ||r_j^{(i)}|| · cos(x, r_j^{(i)}) )^2

  R^{(i+1)} = R^{(i)} − proj(R^{(i)}, u_i)
19
Introduction of IRR
20
Ando’s IRR algorithm
• IRR(R^{(1)}, q) with R^{(1)} = D̃:

  u_i = argmax_{||x|| = 1} Σ_{j=1}^n pow(r_j^{(i)}, q) · ( ||r_j^{(i)}|| · cos(x, r_j^{(i)}) )^2,
  where pow(r, q) ≝ ||r||_2^q  (the rescaling step)

  R^{(i+1)} = R^{(i)} − proj(R^{(i)}, u_i)

• That is, find the unit vector x that best approximates the rescaled residuals R; the u_i span χ_IRR.
21
Ando’s IRR algorithm
IRR(q, l):
  R := D̃, with residual columns r_1, r_2, ..., r_n
  for j := 1, 2, ..., l:
    for i := 1, 2, ..., n:
      r̂_i := rescaled r_i (scale by ||r_i||^q)
    b_j := argmax_{||x|| = 1} Σ_{i=1}^n ( ||r̂_i|| · cos(x, r̂_i) )^2
    for i := 1, 2, ..., n:
      subtract from r_i its projection onto b_j
  χ_IRR := span(B), and the IRR representation of D is B B^T D, where B = [b_1 b_2 ... b_l]
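The loop above can be sketched in NumPy. This is a reconstruction, not the authors' code: the maximizing unit vector is the top eigenvector of the rescaled scatter matrix Σ_i ||r_i||^q · r_i r_i^T, and with q = 0 the procedure reduces to plain LSI (the leading left singular vectors):

```python
import numpy as np

def irr(D, q, l):
    """Sketch of IRR: at each step, take the direction that best fits the
    current residuals after rescaling each residual by its length^q, then
    subtract every residual's projection onto that direction."""
    R = D.copy()                      # R^{(1)} = D (residual matrix)
    basis = []
    for _ in range(l):
        scale = np.linalg.norm(R, axis=0) ** q   # rescaling factors ||r_i||^q
        # argmax_{||x||=1} sum_i scale_i * (x . r_i)^2
        # = top eigenvector of sum_i scale_i * r_i r_i^T.
        M = (R * scale) @ R.T
        w, V = np.linalg.eigh(M)
        b = V[:, -1]                  # eigenvector of the largest eigenvalue
        basis.append(b)
        R = R - np.outer(b, b @ R)    # subtract projection onto b
    return np.column_stack(basis)

rng = np.random.default_rng(1)
D = rng.standard_normal((6, 5))       # toy term-document matrix
B = irr(D, q=0.0, l=3)                # q = 0: no rescaling, i.e. plain LSI

U = np.linalg.svd(D)[0][:, :3]
assert np.allclose(B.T @ B, np.eye(3), atol=1e-8)   # orthonormal basis
assert np.allclose(U @ (U.T @ B), B, atol=1e-6)     # spans the LSI subspace
```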
22
Auto-scale method
• Automatic scaling-factor determination:

  f(D) ≝ ||D̃^T D̃||_F^2 / n

• Why this measures non-uniformity: since D̃^T D̃ ≈ S,

  ||D̃^T D̃||_F^2 ≈ ||S||_F^2 = Σ_d Σ_{d'} ( Σ_t rel(t, d) · rel(t, d') )^2

  and when the documents are approximately single-topic this reduces to

  ||S||_F^2 ≈ Σ_{t=1}^k ( Σ_d rel(t, d)^2 )^2 = Σ_{t=1}^k δ_t^4

  so f(D) grows with the dominance of the largest topics.
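The behaviour of f(D) can be illustrated on toy data (the two collections below are hypothetical, and the exact normalization of f is a reconstruction from the slide; only the relative comparison matters here):

```python
import numpy as np

def normalize(D):
    """Length-normalize the document (column) vectors."""
    return D / np.linalg.norm(D, axis=0)

def f(D):
    # Reconstructed from the slide: f(D) = ||D^T D||_F^2 / n, which
    # approximates sum_t delta_t^4 for single-topic collections.
    return np.linalg.norm(D.T @ D, 'fro') ** 2 / D.shape[1]

# Skewed collection: all four documents on one topic (high non-uniformity).
skewed = normalize(np.tile([[1.0], [0.1]], (1, 4)))
# Balanced collection: two topics, two documents each.
balanced = normalize(np.array([[1.0, 1.0, 0.1, 0.1],
                               [0.1, 0.1, 1.0, 1.0]]))

assert f(skewed) > f(balanced)   # more dominance -> larger f -> larger q
```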
23
Auto-scale method
• Implementing auto-scale:
  We set q to a linear function of f(D): q = α · f(D) + β
24
Dimension selection
• Stopping criterion: the residual ratio ||R^{(j)}||_F^2 / n (effective for both LSI and IRR).
25
Evaluation Metrics
• Kappa average precision:
– Pair-wise average precision: the measured similarity for any two intra-topic documents (sharing at least one topic) should be higher than for any two cross-topic documents, which have no topics in common.
– Let p_j denote the document pair with the jth largest measured cosine. Then

  prec(p_i) = #{ intra-topic pairs p_j such that j ≤ i } / i

– Correcting for chance:

  prec_κ(p_j) = ( prec(p_j) − chance ) / ( 1 − chance )

  where chance ≝ #intra-topic pairs / #document pairs (the probability that a random pair is intra-topic).
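A sketch of the metric, averaging the chance-corrected precision over the intra-topic pairs (the pair data below is hypothetical):

```python
import numpy as np

def kappa_avg_precision(cosines, intra):
    """Pair-wise average precision corrected for chance.
    cosines[j]: measured similarity of document pair j;
    intra[j]:   True if the pair shares at least one topic."""
    order = np.argsort(cosines)[::-1]          # pairs by descending cosine
    intra = np.asarray(intra)[order]
    ranks = np.arange(1, len(intra) + 1)
    prec = np.cumsum(intra) / ranks            # prec(p_j) at each cutoff j
    chance = intra.mean()                      # #intra-topic pairs / #pairs
    kappa = (prec - chance) / (1 - chance)
    return kappa[intra].mean()                 # average over intra-topic pairs

# Perfect ranking: every intra-topic pair outranks every cross-topic pair.
perfect = kappa_avg_precision([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
assert np.isclose(perfect, 1.0)
```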
26
Evaluation Metrics
• Clustering:
  Let C be a cluster-topic contingency table, where C[i, j] is the number of documents in cluster i that are relevant to topic j. Define:

  S(C) = Σ_{i,j} N_{i,j} / n

  where N_{i,j} = C[i, j] if C[i, j] is the unique maximum in both its row and its column, and N_{i,j} = 0 otherwise.
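The definition of S(C) translates directly into code; the contingency table below is a hypothetical example (two clusters, two topics, n = 20 documents):

```python
import numpy as np

def clustering_score(C):
    """S(C) = sum_{i,j} N_{i,j} / n, where N_{i,j} = C[i,j] if C[i,j] is the
    unique maximum of both its row and its column, and 0 otherwise."""
    C = np.asarray(C)
    n = C.sum()
    total = 0
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            row, col = C[i, :], C[:, j]
            unique_row_max = row.size == 1 or C[i, j] > np.delete(row, j).max()
            unique_col_max = col.size == 1 or C[i, j] > np.delete(col, i).max()
            if unique_row_max and unique_col_max:
                total += C[i, j]
    return total / n

# Hypothetical contingency table: clusters (rows) x topics (columns).
C = [[8, 1],
     [2, 9]]
assert clustering_score(C) == (8 + 9) / 20   # only the diagonal cells count
```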
27
Experimental setting
• (1) Choose two TREC topics (more than two can be chosen).
• (2) Specify seven distribution types:
  – (25,25), (30,20), (35,15), (40,10), (43,7), (45,5), (46,4)
  – Each document was relevant to exactly one of the pre-selected topics.
• (3) Extract single-word stemmed terms using TALENT and remove stop-words.
• (4) Create the term-document matrix and length-normalize the document vectors.
• (5) Implement AUTO-SCALE: set q = α · f(D) + β, where α = 3.5 and β = 0.
28
Controlled-distribution results
• The chosen scaling factor increases on average as the non-uniformity δ_max / δ_min goes up.
29
Controlled-distribution results
(Figure: example clusterings with the highest and the lowest S(C).)
30
Controlled-distribution results
31
Conclusion
• Provided a new theoretical analysis of LSI.
• Showed a precise relationship between LSI's performance and the uniformity of the underlying topic-document distribution.
• Extended Ando's IRR algorithm.
• IRR delivers very good performance in comparison to LSI.
32
IRR on summarization
(Diagram: turn the term-document matrix into a term-by-sentence matrix, apply IRR to obtain U, Σ, V^T, then use the whole document as a query to compute similarity with each sentence.)