Iterative residual rescaling: An analysis and generalization of LSI

Rie Kubota Ando & Lillian Lee. Iterative residual rescaling: An analysis and generalization of LSI. In the 24th Annual International ACM SIGIR Conference (SIGIR'2001), 2001. Presenter: 游斯涵.

Page 1: Iterative residual rescaling: An analysis and generalization of LSI

Iterative residual rescaling: An analysis and generalization of LSI

Rie Kubota Ando & Lillian Lee. Iterative residual rescaling: An analysis and generalization of LSI. In the 24th Annual International ACM SIGIR Conference (SIGIR'2001), 2001.

Presenter: 游斯涵

Page 2: Iterative residual rescaling: An analysis and generalization of LSI

Introduction

• The disadvantage of VSM:
  – Documents that do not share terms are mapped to orthogonal vectors even if they are clearly related. For example, d_i = {0, 1, 0} and d_j = {1, 0, 1} share no terms, so their cosine similarity is 0 regardless of their topics.

• LSI attempts to overcome this shortcoming by projecting the term-document matrix onto a lower-dimensional subspace.

Page 3: Iterative residual rescaling: An analysis and generalization of LSI

Introduction of IRR

• LSI: term-document matrix A (with term weighting) → SVD → U, Σ (eigenvalues), Vᵀ (eigenvectors).
• IRR: like LSI, but the residual vectors are rescaled before each new basis vector (eigenvector) is extracted.

[Figure: flowchart contrasting LSI (weight → SVD → U, Σ, Vᵀ) with IRR, which adds a rescaling step.]

Page 4: Iterative residual rescaling: An analysis and generalization of LSI

Frobenius norm and matrix 2-norm

• Frobenius norm (for X ∈ R^{m×n}, h = rank(X)):

        ‖X‖_F  def=  sqrt( Σ_{i=1..m} Σ_{j=1..n} X[i,j]² )

• 2-norm:

        ‖X‖_2  def=  max_{‖y‖_2 = 1} ‖Xy‖_2
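As a quick numerical check, both norms are easy to compute with NumPy; the matrix below is an illustrative example, not taken from the paper.

```python
import numpy as np

X = np.array([[3.0, 0.0],
              [4.0, 0.0]])

# Frobenius norm: square root of the sum of squared entries.
fro = np.sqrt((X ** 2).sum())

# 2-norm: max of ||X y||_2 over unit vectors y, i.e. the largest singular value.
two = np.linalg.svd(X, compute_uv=False)[0]

print(fro, two)   # both equal 5.0 for this X
```

For this rank-1 matrix the two norms coincide; in general ‖X‖_2 ≤ ‖X‖_F.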

Page 5: Iterative residual rescaling: An analysis and generalization of LSI

Analyzing LSI

• Topic-based similarities
  – C: an n-document collection
  – D: m-by-n term-document matrix
  – k: underlying topics (k < n)
  – Relevance score rel(t, d) for each document d and each topic t, normalized so that for each document:

        Σ_{t ∈ topics(C)} rel(t, d)² = 1

• True topic-based similarity between d and d':

        sim(d, d') = Σ_{t ∈ topics(C)} rel(t, d) · rel(t, d')

  so we get an n-by-n matrix S with S[d, d'] = sim(d, d').

[Figure: S (doc × doc) expressed as a product of doc × topic relevance matrices.]
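A minimal sketch of the similarity matrix S, assuming a hypothetical relevance matrix rel[t, d] (k = 2 topics, n = 3 documents) that satisfies the normalization above:

```python
import numpy as np

# Hypothetical relevance scores rel[t, d], normalized so that
# sum_t rel(t, d)^2 = 1 for every document d.
rel = np.array([[1.0, 0.0, 0.6],
                [0.0, 1.0, 0.8]])
assert np.allclose((rel ** 2).sum(axis=0), 1.0)

# True topic-based similarity: sim(d, d') = sum_t rel(t, d) * rel(t, d'),
# i.e. the n-by-n matrix S = rel^T rel.
S = rel.T @ rel
print(S)
```

Document 3 mixes both topics, so it is similar to documents 1 and 2 even though those two are completely dissimilar to each other.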

Page 6: Iterative residual rescaling: An analysis and generalization of LSI

The optimum subspace

• Given a subspace Φ of R^m, let the columns of B form an orthonormal basis of Φ. The projection onto Φ is

        P_Φ(x)  def=  B Bᵀ x

• For x with P_Φ(x) ≠ 0:  cos(P_Φ(x), P_Φ(x')) = cos(Bᵀx, Bᵀx')
  For x with P_Φ(x) = 0, the cosine is taken to be 0.

Page 7: Iterative residual rescaling: An analysis and generalization of LSI

The optimum subspace

• We have the m-by-n term-document matrix D = [d_1 d_2 … d_n].

• The projection of D onto Φ is P_Φ(D) = [P_Φ(d_1) … P_Φ(d_n)], and

        cos(P_Φ(d_i), P_Φ(d_j)) = P_Φ(d_i)ᵀ P_Φ(d_j) / ( ‖P_Φ(d_i)‖ ‖P_Φ(d_j)‖ )
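The projection P_Φ(x) = BBᵀx and the cosine between projected documents can be sketched as follows; the basis B and the document vectors are illustrative examples.

```python
import numpy as np

# Orthonormal basis B of a 2-dimensional subspace of R^3 (assumed example).
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

def project(x):
    # P_phi(x) = B B^T x: projection onto the subspace spanned by B's columns.
    return B @ (B.T @ x)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

d1 = np.array([1.0, 0.0, 1.0])
d2 = np.array([1.0, 0.0, -1.0])
# d1 and d2 are orthogonal, but their projections onto the subspace coincide:
print(cos(d1, d2), cos(project(d1), project(d2)))
```

This illustrates how the choice of subspace changes the measured similarities: two orthogonal vectors can become identical after projection.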

Page 8: Iterative residual rescaling: An analysis and generalization of LSI

The optimum subspace

• Deviation matrix:

        diff(S, Φ, D)  def=  S − P_Φ(D)ᵀ P_Φ(D)

  We want a subspace Φ for which the entries of this matrix are small.

• The optimum subspace:

        Φ_opt = argmin_{Φ ⊆ range(D)} ‖diff(S, Φ, D)‖_2

• Optimum error:

        ε_opt = ‖diff(S, Φ_opt, D)‖_2

  If the optimum error is high, then we cannot expect even the optimum subspace to fully reveal the topic dominances.
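A toy computation of the deviation matrix and its 2-norm; the matrices are illustrative, not from the paper's experiments.

```python
import numpy as np

# Two single-topic documents on different topics: S is the identity.
D = np.array([[1.0, 0.0],
              [0.0, 1.0]])
S = np.eye(2)

# Candidate subspace: span of the first coordinate axis only.
B = np.array([[1.0],
              [0.0]])
PD = B @ (B.T @ D)            # projection of each document column

diff = S - PD.T @ PD          # deviation matrix diff(S, phi, D)
err = np.linalg.norm(diff, 2) # matrix 2-norm of the deviation
print(err)
```

Here the one-dimensional subspace discards the second document entirely, so the deviation has 2-norm 1; choosing Φ = range(D) would drive the deviation to zero.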

Page 9: Iterative residual rescaling: An analysis and generalization of LSI

The singular value decomposition and LSI

• SVD:

        Z = U Σ Vᵀ

  The left singular vectors span Z's range, and σ_1 = ‖Z‖_2.

• The left singular vectors are obtained by the following observation: let proj^(i)(d_j) be the projection of d_j onto the span of u_1, u_2, …, u_i, and let r_j^(i) be the residual vector

        r_j^(i) = d_j − proj^(i−1)(d_j)
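These SVD facts can be verified numerically on a random matrix (fixed seed); the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 4))

U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# sigma_1 equals the matrix 2-norm of Z.
assert np.isclose(s[0], np.linalg.norm(Z, 2))

# proj^(i)(d_j): projection of column d_j onto span{u_1, ..., u_i}.
# Subtracting it leaves the residual r_j^(i+1), which shrinks as i grows.
Ui = U[:, :2]
residual = Z - Ui @ (Ui.T @ Z)
print(np.linalg.norm(Z), np.linalg.norm(residual))
```

The Frobenius norm of the residual matrix after removing the top-i singular directions is sqrt(σ_{i+1}² + … ), which is exactly what LSI truncation discards.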

Page 10: Iterative residual rescaling: An analysis and generalization of LSI


Analysis of LSI

Page 11: Iterative residual rescaling: An analysis and generalization of LSI

Non-uniformity and LSI

• A crucial quantity in our analysis is the dominance δ_t of a given topic t:

        δ_t = sqrt( Σ_{d ∈ C} rel(t, d)² )

Page 12: Iterative residual rescaling: An analysis and generalization of LSI

Non-uniformity and LSI

• Topic mingling:

        μ(C) = Σ_{t ≠ t', t, t' ∈ topics(C)} Σ_{d ∈ C} ( rel(t, d) · rel(t', d) )²

• If the topic mingling is high, individual documents are strongly relevant to several different topics, and the topics will be fairly difficult to distinguish.

  In the single-topic-documents case, μ(C) = 0.

[Figure: S (doc × doc) expressed via doc × topic relevance matrices.]

Page 13: Iterative residual rescaling: An analysis and generalization of LSI

Non-uniformity and LSI

• Let λ_i be the ith largest singular value of P_opt(D). Then

        λ_i² ≈ δ_i²

  with approximation error governed by the optimum error ε_opt = ‖diff(S, Φ_opt, D)‖_2 and the topic mingling μ(C).

  (Recall: δ_t = sqrt(Σ_{d ∈ C} rel(t, d)²), diff(S, Φ, D) = S − P_Φ(D)ᵀ P_Φ(D), μ(C) = Σ_{t ≠ t'} Σ_{d ∈ C} (rel(t, d) · rel(t', d))², and Σ_t rel(t, d)² = 1 for each d.)

Page 14: Iterative residual rescaling: An analysis and generalization of LSI

Non-uniformity and LSI

• Define δ_max = δ_1 and δ_min = δ_h, where h is the dimension of Φ_opt.

• We can then form the non-uniformity ratio:

        δ_max / δ_min ≥ 1

  The more the largest topic dominates the collection, the higher this ratio will tend to be.

Page 15: Iterative residual rescaling: An analysis and generalization of LSI

Non-uniformity and LSI

• Original error: let Φ_VSM denote the VSM space. Since P_{Φ_VSM}(Ã) = Ã,

        E_VSM = ‖diff(S, Φ_VSM, Ã)‖_2 = ‖S − Ãᵀ Ã‖_2

• Root original error ("input error"):

        sqrt(E_VSM)

Page 16: Iterative residual rescaling: An analysis and generalization of LSI

Non-uniformity and LSI

• Let Φ_LSI be the h-dimensional LSI subspace spanned by the first h left singular vectors of D. If ε̂_vsm < δ̂_min, then

        tan( ∠(Φ_LSI, Φ_opt) ) ≤ ε̂_vsm / ( δ̂_min ( 1 − (ε̂_vsm / δ̂_min)² ) )

  so Φ_LSI must be close to Φ_opt when the topic-document distribution is relatively uniform.

Page 17: Iterative residual rescaling: An analysis and generalization of LSI

Notation for related values

• μ(C) is the topic mingling.
• For x_1 ≥ x_2 ≥ … ≥ x_n ≥ 0 and y_1 ≥ y_2 ≥ … ≥ y_n ≥ 0, we write x_i ≈_max y_i when max_i |x_i² − y_i²| is bounded in terms of E_max, and x_i ≈_avg y_i when the average of |x_i² − y_i²| is bounded in terms of E_avg, where

        E_max = ‖diff(S, Φ_opt, D)‖_2        E_avg = ‖diff(S, Φ_opt, D)‖_F / sqrt(n)

• The approximation becomes closer as the corresponding error (maximum or average) becomes smaller.

Page 18: Iterative residual rescaling: An analysis and generalization of LSI

Ando's IRR algorithm

• IRR algorithm:

        u_i = argmax_{‖x‖_2 = 1} Σ_{j=1}^n ( ‖r_j^(i)‖ · cos(x, r_j^(i)) )²

        R^(1) = D̃
        R^(i+1) = R^(i) − proj(R^(i), u_i)

Page 19: Iterative residual rescaling: An analysis and generalization of LSI


Introduction of IRR

Page 20: Iterative residual rescaling: An analysis and generalization of LSI

Ando's IRR algorithm

• Rescaling the residuals with a scaling factor q:

        pow(r, q)  def=  ‖r‖^q · r

        u_i = argmax_{‖x‖_2 = 1} Σ_{j=1}^n ( ‖pow(r_j^(i), q)‖ · cos(x, pow(r_j^(i), q)) )²

        R^(1) = D̃
        R^(i+1) = R^(i) − proj(R^(i), u_i),        Φ_IRR = span({u_i})

  i.e., find the unit vector x that best fits the rescaled residuals, then subtract its projection to approximate R.

Page 21: Iterative residual rescaling: An analysis and generalization of LSI

Ando's IRR algorithm

        IRR(q, l):
            R := D̃
            for j := 1, 2, …, l:
                for i := 1, 2, …, n:
                    r̂_i := ‖r_i‖^q · r_i
                b_j := argmax_{‖x‖_2 = 1} Σ_{i=1}^n ( ‖r̂_i‖ · cos(x, r̂_i) )²
                for i := 1, 2, …, n:
                    Subtract from r_i its projection onto b_j
            D̂ := B Bᵀ D, where B = [b_1 b_2 … b_l]
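The IRR(q, l) procedure can be sketched in Python with NumPy. The argmax step is computed here as the top left singular vector of the rescaled residual matrix, which maximizes the stated objective; this is a sketch under that reading, not the authors' implementation, and the matrix D below is an illustrative example.

```python
import numpy as np

def irr(D, q, l):
    """Sketch of IRR(q, l): at each of l steps, rescale every residual
    r_i by ||r_i||^q, take the unit direction that best fits the rescaled
    residuals, and subtract the projection onto it from each residual."""
    R = D.astype(float).copy()           # residual matrix, R := D~
    basis = []
    for _ in range(l):
        # Rescale: r_hat_i = ||r_i||^q * r_i (columns are residuals).
        norms = np.linalg.norm(R, axis=0)
        R_hat = R * norms ** q
        # argmax_{||x||=1} sum_i (||r_hat_i|| cos(x, r_hat_i))^2
        #   = sum_i (x . r_hat_i)^2, maximized by the top left singular vector.
        U, _, _ = np.linalg.svd(R_hat, full_matrices=False)
        b = U[:, 0]
        basis.append(b)
        # Subtract from each residual its projection onto b.
        R = R - np.outer(b, b @ R)
    B = np.column_stack(basis)
    return B, B @ (B.T @ D)              # basis B and D_hat = B B^T D

# Toy collection with unit-length document columns.
D = np.array([[1.0, 0.0, 0.7],
              [0.0, 1.0, 0.7],
              [0.0, 0.0, 0.14]])
D /= np.linalg.norm(D, axis=0)
B, D_hat = irr(D, q=2.0, l=2)
print(B.T @ B)   # approximately the identity: B's columns are orthonormal
```

With q = 0 the rescaling is a no-op and the procedure recovers the LSI (truncated SVD) subspace; larger q makes long residuals count more, counteracting dominant topics.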

Page 22: Iterative residual rescaling: An analysis and generalization of LSI

Auto-scale method

• Automatic scaling factor determination:

        f(D)  def=  ( ‖D̃ᵀ D̃‖_F / n )²

• When the documents are approximately single-topic, D̃ᵀ D̃ ≈ S, and

        ‖S‖_F² = Σ_{d=1..n} Σ_{d'=1..n} ( Σ_{t=1}^k rel(t, d) · rel(t, d') )²
               = Σ_{t=1}^k ( Σ_{d ∈ C} rel(t, d)² )²
               = Σ_{t=1}^k δ_t⁴

  (using Σ_t rel(t, d)² = 1 for each d), so f(D) grows with the non-uniformity of the topic dominances.

Page 23: Iterative residual rescaling: An analysis and generalization of LSI

Auto-scale method

• To implement auto-scale, we set q to a linear function of f(D):

        q = c₁ · f(D) + c₂
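A small sketch of the scaling-factor statistic, assuming the normalized form f(D) = (‖D̃ᵀD̃‖_F / n)²; the two toy collections below are illustrative extremes.

```python
import numpy as np

def f(D):
    # Non-uniformity estimate (assumed form): (||D^T D||_F / n)^2
    # for an m-by-n matrix D with unit-length document columns.
    n = D.shape[1]
    return (np.linalg.norm(D.T @ D, "fro") / n) ** 2

# Two extreme toy collections (columns = unit document vectors):
one_topic = np.tile(np.array([[1.0], [0.0]]), (1, 4))      # one dominant topic
two_topics = np.array([[1.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 1.0]])              # two balanced topics

print(f(one_topic), f(two_topics))
```

The single-topic collection scores 1.0 and the balanced two-topic collection scores 0.5, so a larger f(D) (hence a larger q) goes with a more dominated, less uniform collection.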

Page 24: Iterative residual rescaling: An analysis and generalization of LSI

Dimension selection

• Stopping criterion: the residual ratio (effective for both LSI and IRR)

        ‖R^(j)‖_F² / n

  Since the document vectors are length-normalized, ‖R^(1)‖_F² = ‖D̃‖_F² = n, so this ratio is the fraction of the collection's squared length not yet captured by the first j basis vectors.

Page 25: Iterative residual rescaling: An analysis and generalization of LSI

Evaluation metrics

• Kappa average precision
  – Pair-wise average precision: the measured similarity for any two intra-topic documents (sharing at least one topic) should be higher than for any two cross-topic documents (which have no topics in common).
  – Let p_j denote the document pair with the jth largest measured cosine. Then

        prec(p_i) = #{ intra-topic pairs p_j such that j ≤ i } / i

  – Correcting for chance, the probability that a random pair is intra-topic:

        κ(p_j) = ( prec(p_j) − chance ) / ( 1 − chance )
        chance = (# of intra-topic pairs) / (# of document pairs)
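A sketch of kappa average precision under the definitions above; the cosine values and pair labels are illustrative.

```python
import numpy as np

def kappa_avg_precision(cosines, intra):
    """Pair-wise average precision with chance correction (a sketch).

    cosines: measured similarity for each document pair.
    intra:   True where the pair shares at least one topic (intra-topic).
    """
    order = np.argsort(-np.asarray(cosines))      # pairs by descending cosine
    intra = np.asarray(intra)[order]
    ranks = np.flatnonzero(intra) + 1             # 1-based ranks of intra pairs
    prec = np.arange(1, len(ranks) + 1) / ranks   # prec(p_i) at each intra pair
    chance = intra.mean()                         # #intra-topic / #pairs
    return float(((prec - chance) / (1 - chance)).mean())

# Perfect ranking: every intra-topic pair ranked above every cross-topic pair.
print(kappa_avg_precision([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))
```

A perfect ranking scores 1.0, while a ranking no better than chance scores about 0, which is the point of the kappa correction.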

Page 26: Iterative residual rescaling: An analysis and generalization of LSI

Evaluation metrics

• Clustering: let C be a cluster-topic contingency table, where C[i, j] is the number of documents in cluster i that are relevant to topic j. Define:

        N_{i,j} = C[i, j]   if C[i, j] is the unique maximum in both its row and its column
        N_{i,j} = 0         otherwise

        S(C) = Σ_{i,j} N_{i,j} / n
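A direct implementation sketch of S(C); the contingency table is an illustrative example.

```python
import numpy as np

def clustering_score(C):
    """S(C) = sum_{i,j} N[i,j] / n, where N[i,j] = C[i,j] when C[i,j] is the
    unique maximum of both its row and its column, and 0 otherwise."""
    C = np.asarray(C)
    n = C.sum()
    total = 0
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            v = C[i, j]
            # "unique maximum" in a line: no other entry is >= v.
            if (C[i, :] >= v).sum() == 1 and (C[:, j] >= v).sum() == 1:
                total += v
    return total / n

# Cluster-topic contingency table: C[i, j] = docs in cluster i relevant to topic j.
C = [[8, 1],
     [2, 9]]
print(clustering_score(C))   # (8 + 9) / 20 = 0.85
```

The score rewards clusterings where each cluster is dominated by one topic and each topic by one cluster; a perfectly ambiguous table scores 0.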

Page 27: Iterative residual rescaling: An analysis and generalization of LSI

Experimental setting

• (1) Choose two TREC topics (can choose more than two).
• (2) Specify seven distribution types:
  – (25,25), (30,20), (35,15), (40,10), (43,7), (45,5), (46,4)
  – Each document was relevant to exactly one of the pre-selected topics.
• (3) Extract single-word stemmed terms using TALENT and remove stop-words.
• (4) Create the term-document matrix and length-normalize the document vectors.
• (5) Implement AUTO-SCALE: set q = c₁ · f(D) + c₂, where c₁ = 3.5 and c₂ = 0.

Page 28: Iterative residual rescaling: An analysis and generalization of LSI

Controlled-distribution results

• The chosen scaling factor increases on average as the non-uniformity δ_max / δ_min goes up.

Page 29: Iterative residual rescaling: An analysis and generalization of LSI

Controlled-distribution results

[Figure: clustering results, with the highest and lowest S(C) marked.]

Page 30: Iterative residual rescaling: An analysis and generalization of LSI


Controlled-distribution results

Page 31: Iterative residual rescaling: An analysis and generalization of LSI

Conclusion

• Provided a new theoretical analysis of LSI.
• Showed a precise relationship between LSI's performance and the uniformity of the underlying topic-document distribution.
• Extended Ando's IRR algorithm.
• IRR provides very good performance in comparison to LSI.

Page 32: Iterative residual rescaling: An analysis and generalization of LSI

IRR on summarization

• Turn the term × document matrix into a term × sentence matrix.
• Apply IRR to obtain U, Σ, Vᵀ.
• Use the whole document collection as a query to compute the similarity to each sentence.

[Figure: term × doc matrix converted to a term × sentence matrix and fed to IRR.]