Principles of Information Retrieval
Lecture 7: Vector (cont.)

IS 240 – Spring 2011
Prof. Ray Larson, School of Information, University of California, Berkeley
Mini-TREC
• Need to start thinking about groups
  – Today
• Systems
  – SMART (not recommended…)
    • ftp://ftp.cs.cornell.edu/pub/smart
  – MG (We have a special version if interested)
    • http://www.mds.rmit.edu.au/mg/welcome.html
  – Cheshire II & 3
    • II = http://cheshire.berkeley.edu
    • 3 = http://cheshire3.sourceforge.org
  – Zprise (older search system from NIST)
    • http://www.itl.nist.gov/iaui/894.02/works/zp2/zp2.html
  – IRF (new Java-based IR framework from NIST)
    • http://www.itl.nist.gov/iaui/894.02/projects/irf/irf.html
  – Lemur
    • http://www-2.cs.cmu.edu/~lemur
  – Lucene (Java-based text search engine)
    • http://jakarta.apache.org/lucene/docs/index.html
  – Others?? (See http://searchtools.com)
Mini-TREC
• Proposed Schedule
  – February 9 – Database and previous Queries
  – March 2 – Report on system acquisition and setup
  – March 9 – New Queries for testing…
  – April 18 – Results due
  – April 20 – Results and system rankings
  – April 27 – Group reports and discussion
IR Models
• Set Theoretic Models
  – Boolean
  – Fuzzy
  – Extended Boolean
• Vector Models (Algebraic)
• Probabilistic Models
Vector Space Model
• Documents are represented as vectors in term space
  – Terms are usually stems
  – Documents are represented by binary or weighted vectors of terms
• Queries are represented the same way as documents
• Query and document weights are based on the length and direction of their vectors
• A vector distance measure between the query and documents is used to rank retrieved documents
Document Vectors + Frequency
Terms: nova, galaxy, heat, h’wood, film, role, diet, fur

A: 10 5 3
B: 5 10
C: 10 8 7
D: 9 10 5
E: 10 10
F: 9 10
G: 5 7 9
H: 6 10 2 8
I: 7 5 1 3

“Nova” occurs 10 times in text A
“Galaxy” occurs 5 times in text A
“Heat” occurs 3 times in text A
(Terms not listed for a document occur 0 times.)
We Can Plot the Vectors
[Figure: documents plotted in a two-dimensional term space with axes “Star” and “Diet”: a doc about astronomy, a doc about movie stars, and a doc about mammal behavior.]
Documents in 3D Space
Primary assumption of the Vector Space Model: Documents that are “close together” in space are similar in meaning
Document Space has High Dimensionality
• What happens beyond 2 or 3 dimensions?
• Similarity still has to do with how many tokens are shared in common.
• More terms -> harder to understand which subsets of words are shared among similar documents.
• We will look in detail at ranking methods
• Approaches to handling high dimensionality: Clustering and LSI (later)
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf*idf
  – Recall the Zipf distribution
  – Want to weight terms highly if they are:
    • Frequent in relevant documents … BUT
    • Infrequent in the collection as a whole
• Automatically derived thesaurus terms
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector
docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1
Raw Term Weights
• The frequency of occurrence for the term in each document is included in the vector
docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1
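The contrast between the two weighting schemes is easy to see in code. A minimal sketch (the toy corpus and whitespace tokenization are illustrative assumptions, not from the slides) that reproduces row D1 of both tables:

```python
# Build raw term-frequency and binary vectors for toy documents.
from collections import Counter

terms = ["t1", "t2", "t3"]
docs = {
    "D1": "t1 t1 t3 t3 t3",  # yields raw (2, 0, 3), binary (1, 0, 1)
    "D2": "t1",              # yields raw (1, 0, 0), binary (1, 0, 0)
}

for doc_id, text in docs.items():
    counts = Counter(text.split())
    raw = [counts[t] for t in terms]           # raw term frequency
    binary = [1 if c > 0 else 0 for c in raw]  # presence/absence only
    print(doc_id, "raw:", raw, "binary:", binary)
```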
Assigning Weights
• tf*idf measure:
  – Term frequency (tf)
  – Inverse document frequency (idf)
• A way to deal with some of the problems of the Zipf distribution
• Goal: Assign a tf*idf weight to each term in each document
Simple tf*idf
$w_{ik} = tf_{ik} \cdot \log(N / n_k)$

where:
  $T_k$ = term $k$ in document $D_i$
  $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
  $idf_k$ = inverse document frequency of term $T_k$ in collection $C$
  $N$ = total number of documents in collection $C$
  $n_k$ = the number of documents in $C$ that contain $T_k$

$idf_k = \log\left(\frac{N}{n_k}\right)$
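A small sketch of this weighting, assuming a toy three-document collection and base-10 logarithms:

```python
# Compute w_ik = tf_ik * log10(N / n_k) for every term in a document.
import math
from collections import Counter

docs = {
    "A": "nova galaxy heat nova",
    "B": "nova galaxy",
    "C": "film role film",
}
N = len(docs)
# n_k: the number of documents containing term k
n = Counter(term for text in docs.values() for term in set(text.split()))

def tfidf(doc_id):
    tf = Counter(docs[doc_id].split())
    return {t: tf[t] * math.log10(N / n[t]) for t in tf}

print(tfidf("A"))  # 'nova' gets tf=2 times idf=log10(3/2)
```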
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
For a collection of 10000 documents (N = 10000):

$\log\left(\frac{10000}{1}\right) = 4$

$\log\left(\frac{10000}{20}\right) = 2.698$

$\log\left(\frac{10000}{5000}\right) = 0.301$

$\log\left(\frac{10000}{10000}\right) = 0$
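These values are easy to check:

```python
# Verify the four IDF examples for a 10000-document collection.
import math

N = 10_000
for n_k in (1, 20, 5_000, 10_000):
    print(f"log({N}/{n_k}) = {math.log10(N / n_k):.3f}")
# prints 4.000, 2.699, 0.301, 0.000
```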
Word Frequency vs. Resolving Power
The most frequent words are not the most descriptive.
(from van Rijsbergen 79)
Non-Boolean IR
• Need to measure some similarity between the query and the document
• The basic notion is that documents that are somehow similar to a query are likely to be relevant responses for that query
• We will revisit this notion and see how the Language Modelling approach to IR has taken it to a new level
Non-Boolean?
• To measure similarity we…
  – Need to consider the characteristics of the document and the query
  – Make the assumption that similarity of language use between the query and the document implies similarity of topic and, hence, potential relevance
Similarity Measures (Set-based)
Assuming that Q and D are the sets of terms associated with a query and a document:

Simple matching (coordination level match): $|Q \cap D|$

Dice’s Coefficient: $\frac{2|Q \cap D|}{|Q| + |D|}$

Jaccard’s Coefficient: $\frac{|Q \cap D|}{|Q \cup D|}$

Cosine Coefficient: $\frac{|Q \cap D|}{|Q|^{1/2} \cdot |D|^{1/2}}$

Overlap Coefficient: $\frac{|Q \cap D|}{\min(|Q|, |D|)}$
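A sketch of these set-based measures, treating the query and document as plain Python sets of terms:

```python
import math

def simple(q, d):   # coordination level match
    return len(q & d)

def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):
    return len(q & d) / len(q | d)

def cosine(q, d):
    return len(q & d) / math.sqrt(len(q) * len(d))

def overlap(q, d):
    return len(q & d) / min(len(q), len(d))

q = {"nova", "galaxy", "heat"}
d = {"nova", "galaxy", "film", "role"}
print(simple(q, d), dice(q, d), jaccard(q, d), cosine(q, d), overlap(q, d))
```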
tf x idf normalization
• Normalize the term weights (so longer documents are not unfairly given more weight)
  – Normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive

$w_{ik} = \frac{tf_{ik} \cdot \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \, [\log(N / n_k)]^2}}$
Vector space similarity
• Use the weights to compare the documents
Now, the similarity of two documents is:

$sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \cdot w_{jk}$

This is also called the cosine, or normalized inner product. (Normalization was done when weighting the terms.)
Vector Space Similarity Measure
• combine tf x idf into a measure
$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$
$Q = (w_{q_1}, w_{q_2}, \ldots, w_{q_t})$
where $w = 0$ if a term is absent

If term weights are normalized:

$sim(Q, D_i) = \sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}$

Otherwise we could normalize in the similarity comparison:

$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2} \cdot \sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$
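A sketch of the second (self-normalizing) form, using dictionaries as sparse vectors; the example weights are arbitrary:

```python
import math

def cosine_sim(q, d):
    """Normalized inner product of two sparse weight vectors."""
    num = sum(w * d.get(t, 0.0) for t, w in q.items())
    den = math.sqrt(sum(w * w for w in q.values())) * \
          math.sqrt(sum(w * w for w in d.values()))
    return num / den if den else 0.0

q = {"nova": 1.0, "galaxy": 2.0}
d = {"galaxy": 3.0, "heat": 4.0}
print(round(cosine_sim(q, d), 3))  # 0.537
```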
Weighting schemes
• We have seen something of:
  – Binary
  – Raw term weights
  – TF*IDF
• There are many other possibilities:
  – IDF alone
  – Normalized term frequency
Term Weights in SMART
• SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell
• Designed for laboratory experiments in IR
  – Easy to mix and match different weighting methods
  – Really terrible user interface
  – Intended for use by code hackers
Term Weights in SMART
• In SMART weights are decomposed into three factors:
$w_{dk} = \frac{freq_{dk} \cdot collect_k}{norm_d}$
SMART Freq Components
$freq_{dk} = \begin{cases} \{0, 1\} & \text{Binary} \\ \frac{freq_{dk}}{\max(freq_d)} & \text{maxnorm} \\ \frac{1}{2} + \frac{1}{2} \cdot \frac{freq_{dk}}{\max(freq_d)} & \text{augmented} \\ \ln(freq_{dk}) + 1 & \text{log} \end{cases}$
Collection Weighting in SMART
$collect_k = \begin{cases} \log\left(\frac{NDoc}{Doc_k}\right) & \text{Inverse} \\ \left[\log\left(\frac{NDoc}{Doc_k}\right)\right]^2 & \text{squared} \\ \log\left(\frac{NDoc - Doc_k}{Doc_k}\right) & \text{probabilistic} \\ 1 & \text{frequency} \end{cases}$
Term Normalization in SMART
$norm_d = \begin{cases} \sum_{vector} w_j & \text{sum} \\ \sqrt{\sum_{vector} w_j^2} & \text{cosine} \\ \sqrt[4]{\sum_{vector} w_j^4} & \text{fourth} \\ \max_{vector}(w_j) & \text{max} \end{cases}$
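Putting the three components together, a hedged sketch of how a decomposed SMART-style weight might be assembled; the component names follow the slides, but the helper functions and example values are illustrative assumptions, not SMART code:

```python
import math

def freq_component(freq, max_freq, kind):
    if kind == "binary":    return 1.0 if freq > 0 else 0.0
    if kind == "maxnorm":   return freq / max_freq
    if kind == "augmented": return 0.5 + 0.5 * freq / max_freq
    if kind == "log":       return math.log(freq) + 1.0 if freq > 0 else 0.0
    raise ValueError(kind)

def collect_component(n_doc, doc_k, kind):
    if kind == "inverse":       return math.log(n_doc / doc_k)
    if kind == "squared":       return math.log(n_doc / doc_k) ** 2
    if kind == "probabilistic": return math.log((n_doc - doc_k) / doc_k)
    if kind == "frequency":     return 1.0
    raise ValueError(kind)

def norm_component(weights, kind):
    if kind == "sum":    return sum(weights)
    if kind == "cosine": return math.sqrt(sum(w * w for w in weights))
    if kind == "fourth": return sum(w ** 4 for w in weights) ** 0.25
    if kind == "max":    return max(weights)
    raise ValueError(kind)

# w_dk = freq * collect / norm, e.g. augmented tf, inverse df, cosine norm
raw = [3, 1, 1]                      # term frequencies in one document
ws = [freq_component(f, max(raw), "augmented") *
      collect_component(1000, 10, "inverse") for f in raw]
norm = norm_component(ws, "cosine")
print([round(w / norm, 3) for w in ws])
```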
Lucene Algorithm
• The open-source Lucene system is a vector-based system that differs from SMART-like systems in the way the TF*IDF measures are normalized
Lucene
• The basic Lucene algorithm is:

$Score(q, d) = \sum_{t \in q} \frac{tf_{t,q} \cdot idf_t}{norm_q} \cdot \frac{tf_{t,d} \cdot idf_t}{norm_{d,t}} \cdot boost_t \cdot overlap(q, d)$

• Where $\frac{tf_{t,q} \cdot idf_t}{norm_q}$ is the length-normalized query weight
  – and $norm_{d,t}$ is the term normalization (square root of the number of tokens in the same document field as t)
  – $overlap(q, d)$ is the proportion of query terms matched in the document
  – $boost_t$ is a user-specified term weight enhancement
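A sketch of this scoring formula as written on the slide (a paraphrase, not Lucene's actual implementation); a single field length stands in for the per-term norm_{d,t}:

```python
import math

def lucene_score(query_tf, doc_tf, idf, doc_field_len, boost=None):
    """Score(q,d) = sum_t (tf_qt*idf_t/norm_q) * (tf_dt*idf_t/norm_dt)
                    * boost_t * overlap(q,d)."""
    boost = boost or {}
    norm_q = math.sqrt(sum((tf * idf[t]) ** 2 for t, tf in query_tf.items()))
    norm_d = math.sqrt(doc_field_len)     # sqrt of token count of the field
    matched = [t for t in query_tf if t in doc_tf]
    coord = len(matched) / len(query_tf)  # proportion of query terms matched
    return coord * sum((query_tf[t] * idf[t] / norm_q) *
                       (doc_tf[t] * idf[t] / norm_d) *
                       boost.get(t, 1.0)
                       for t in matched)

idf = {"nova": 2.0, "galaxy": 1.0}
print(lucene_score({"nova": 1, "galaxy": 1},
                   {"nova": 3, "galaxy": 1}, idf, doc_field_len=25))
```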
How To Process a Vector Query
• Assume that the database contains an inverted file like the one we discussed earlier…
  – Why an inverted file?
  – Why not a REAL vector file?
• What information should be stored about each document/term pair?
  – As we have seen, SMART gives you choices about this…
Simple Example System
• Collection frequency is stored in the dictionary
• Raw term frequency is stored in the inverted file postings list
• Formula for term ranking
$sim(Q, D_i) = \sum_{k=1}^{M} w_{q_k} \cdot w_{ik}$

$w_{ik} = tf_{ik} \cdot \log\left(\frac{N}{n_k}\right)$
Processing a Query
• For each term in the query:
  – Count the number of times the term occurs – this is the tf for the query term
  – Find the term in the inverted dictionary file and get:
    • nk: the number of documents in the collection with this term
    • Loc: the location of the postings list in the inverted file
  – Calculate the query weight: wqk
  – Retrieve nk entries starting at Loc in the postings file
Processing a Query
• Alternative strategies…
  – First retrieve all of the dictionary entries before getting any postings information
    • Why?
  – Just process each term in sequence
• How can we tell how many results there will be?
  – It is possible to put a limitation on the number of items returned
    • How might this be done?
Processing a Query
• Like Hashed Boolean OR:
  – Put each document ID from each postings list into a hash table
    • If it matches, increment a counter (optional)
    • If it is the first doc, set a WeightSUM variable to 0
  – Calculate the document weight wik for the current term
  – Multiply the query weight and the document weight and add the product to WeightSUM
• Scan the hash table contents and add to a new list, including document ID and WeightSUM
• Sort by WeightSUM and present in sorted order
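A sketch of this accumulator approach against a toy inverted file; the postings structure and the tf*idf weighting are illustrative stand-ins for a real index:

```python
# Term-at-a-time scoring with a hash table of WeightSUM accumulators.
import math
from collections import Counter, defaultdict

N = 4                                 # documents in the collection
postings = {                          # term -> [(doc_id, raw tf), ...]
    "nova":   [("A", 10), ("B", 5)],
    "galaxy": [("A", 5), ("B", 10)],
}

def run_query(query_terms):
    acc = defaultdict(float)          # doc_id -> WeightSUM
    for term, tf_q in Counter(query_terms).items():
        if term not in postings:
            continue
        idf = math.log10(N / len(postings[term]))
        w_q = tf_q * idf              # query weight for this term
        for doc_id, tf_d in postings[term]:
            acc[doc_id] += w_q * (tf_d * idf)   # add to WeightSUM
    return sorted(acc.items(), key=lambda pair: -pair[1])

print(run_query(["nova", "nova", "galaxy"]))  # A ranks above B
```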
Computing Cosine Similarity Scores
[Figure: D1, D2, and Q plotted as vectors in a two-dimensional term space with axes from 0 to 1.0; α1 is the angle between Q and D1, α2 the angle between Q and D2.]

$D_1 = (0.8, 0.3)$
$D_2 = (0.2, 0.7)$
$Q = (0.4, 0.8)$

$\cos \alpha_1 = 0.74$
$\cos \alpha_2 = 0.98$
What’s Cosine anyway?
One of the basic trigonometric functions encountered in trigonometry. Let theta be an angle measured counterclockwise from the x-axis along the arc of the unit circle. Then cos(theta) is the horizontal coordinate of the arc endpoint. As a result of this definition, the cosine function is periodic with period $2\pi$.
From http://mathworld.wolfram.com/Cosine.html
Computing a similarity score
Say we have the query vector $Q = (0.4, 0.8)$ and the document $D_2 = (0.2, 0.7)$. What does their similarity comparison yield?

$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} = 0.98$
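The arithmetic checks out:

```python
# Verify sim(Q, D2) from the slide.
import math

Q, D2 = (0.4, 0.8), (0.2, 0.7)
num = Q[0] * D2[0] + Q[1] * D2[1]                             # 0.64
den = math.sqrt((Q[0]**2 + Q[1]**2) * (D2[0]**2 + D2[1]**2))  # sqrt(0.424)
print(round(num / den, 2))                                    # 0.98
```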
Vector Space with Term Weights and Cosine Matching
[Figure: Term A (x-axis) vs. Term B (y-axis), with Q, D1, and D2 plotted as vectors; α1 is the angle between Q and D1, α2 the angle between Q and D2.]

$D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; \ldots; d_{it}, w_{d_{it}})$
$Q = (q_{i1}, w_{q_{i1}}; q_{i2}, w_{q_{i2}}; \ldots; q_{it}, w_{q_{it}})$

$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2} \cdot \sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$

$Q = (0.4, 0.8)$, $D_1 = (0.8, 0.3)$, $D_2 = (0.2, 0.7)$

$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} = 0.98$

$sim(Q, D_1) = \frac{0.56}{\sqrt{0.58}} = 0.74$
Problems with Vector Space
• There is no real theoretical basis for the assumption of a term space
  – It is more for visualization than having any real basis
  – Most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
  – Terms are not independent of all other terms
Vector Space Refinements
• As we saw earlier, the SMART system included a variety of weighting methods that could be combined into a single vector model algorithm
• Vector space has proven very effective in most IR evaluations (or used to be)
• Salton in a short article in SIGIR Forum (Fall 1981) outlined a “Blueprint” for automatic indexing and retrieval using vector space that has been, to a large extent, followed by everyone doing vector IR
Vector Space Refinements
• Amit Singhal (one of Salton’s students) found that the normalization of document length usually performed in the “standard tfidf” tended to overemphasize short documents
• He and Chris Buckley came up with the idea of adjusting the normalization document length to better correspond to observed relevance patterns
• The “Pivoted Document Length Normalization” provided a valuable enhancement to the performance of vector space systems
Pivoted Normalization
[Figure: probability plotted against document length; the “probability of relevance” and “probability of retrieval” curves cross at the pivot point.]
Pivoted Normalization
[Figure: final normalization factor plotted against the old normalization factor; the “new normalization” line crosses the “old normalization” line at the pivot.]
Pivoted Normalization
• Using pivoted normalization the new tfidf weight for a document can be written as:
$w = \frac{tf \cdot idf}{(1.0 - slope) \times pivot + slope \times oldnormalization}$

Multiplying by a constant, $(1.0 - slope) \times pivot$:

$w = \frac{tf \cdot idf \times (1.0 - slope) \times pivot}{(1.0 - slope) \times pivot + slope \times oldnormalization}$

or

$w = \frac{tf \cdot idf}{1.0 + \frac{slope}{(1.0 - slope) \times pivot} \times oldnormalization}$
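A sketch of the first form; the slope and pivot values here are arbitrary illustrations, not trained ones:

```python
def pivoted_weight(tf_idf, old_norm, slope=0.25, pivot=100.0):
    """tf*idf divided by the pivoted normalization factor."""
    return tf_idf / ((1.0 - slope) * pivot + slope * old_norm)

# A document longer than the pivot is normalized less aggressively
# than under the old normalization alone, and vice versa.
print(pivoted_weight(2.0, old_norm=300.0))
print(pivoted_weight(2.0, old_norm=50.0))
```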
Pivoted Normalization
• Training from past relevance data, and assuming that the slope is going to be consistent with new results, we can adjust to better fit the relevance curve for document size normalization
Today
• Clustering
• Automatic Classification
• Cluster-enhanced search
Overview
• Introduction to Automatic Classification and Clustering
• Classification of Classification Methods
• Classification Clusters and Information Retrieval in Cheshire II
• DARPA Unfamiliar Metadata Project
Classification
• The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated.
• In document classification the items are grouped together because they are likely to be wanted together– For example, items about the same topic.
Automatic Indexing and Classification
• Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.
• More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.
• Automatic classification attempts to automatically group similar documents using either:
  – A fully automatic clustering method
  – An established classification scheme and a set of documents already indexed by that scheme
Background and Origins
• Early suggestion by Fairthorne
  – “The Mathematics of Classification”
• Early experiments by Maron (1961) and Borko and Bernick (1963)
• Work in Numerical Taxonomy and its application to information retrieval: Jardine, Sibson, van Rijsbergen, Salton (1970s)
• Early IR clustering work was more concerned with efficiency issues than semantic issues
Document Space has High Dimensionality
• What happens beyond three dimensions?
• Similarity still has to do with how many tokens are shared in common.
• More terms -> harder to understand which subsets of words are shared among similar documents.
• One approach to handling high dimensionality: Clustering
Cluster Hypothesis
• The basic notion behind the use of classification and clustering methods:
• “Closely associated documents tend to be relevant to the same requests.”– C.J. van Rijsbergen
Classification of Classification Methods
• Class Structure
  – Intellectually Formulated
    • Manual assignment (e.g. Library classification)
    • Automatic assignment (e.g. Cheshire Classification Mapping)
  – Automatically derived from collection of items
    • Hierarchic Clustering Methods (e.g. Single Link)
    • Agglomerative Clustering Methods (e.g. Dattola)
    • Hybrid Methods (e.g. Query Clustering)
Classification of Classification Methods
• Relationship between properties and classes
  – monothetic
  – polythetic
• Relation between objects and classes
  – exclusive
  – overlapping
• Relation between classes and classes
  – ordered
  – unordered
Adapted from Sparck Jones
Properties and Classes
• Monothetic
  – Class defined by a set of properties that are both necessary and sufficient for membership in the class
• Polythetic
  – Class defined by a set of properties such that, to be a member of the class, an individual must possess some number (usually large) of those properties; a large number of individuals in the class possess each of those properties; and no individual need possess all of the properties
Monothetic vs. Polythetic
   A B C D E F G H
1  + + +
2  + + +
3  + + +
4  + + +
5  + + +
6  + + +
7  + + +
8  + + +
Polythetic
Monothetic
Adapted from van Rijsbergen, ‘79
Exclusive Vs. Overlapping
• Item can either belong exclusively to a single class
• Items can belong to many classes, sometimes with a “membership weight”
Ordered Vs. Unordered
• Ordered classes have some sort of structure imposed on them– Hierarchies are typical of ordered classes
• Unordered classes have no imposed precedence or structure and each class is considered on the same “level”– Typical in agglomerative methods
Text Clustering
Clustering is “The art of finding groups in data.” – Kaufman and Rousseeuw

[Figure: documents scattered in a two-dimensional space with axes Term 1 and Term 2.]
Text Clustering
[Figure: the same Term 1 × Term 2 space, with the documents now grouped into clusters.]

Clustering is “The art of finding groups in data.” – Kaufman and Rousseeuw
Text Clustering
• Finds overall similarities among groups of documents
• Finds overall similarities among groups of tokens
• Picks out some themes, ignores others
Coefficients of Association
• Simple: $|A \cap B|$
• Dice’s coefficient: $\frac{2|A \cap B|}{|A| + |B|}$
• Jaccard’s coefficient: $\frac{|A \cap B|}{|A \cup B|}$
• Cosine coefficient: $\frac{|A \cap B|}{|A|^{1/2} \cdot |B|^{1/2}}$
• Overlap coefficient: $\frac{|A \cap B|}{\min(|A|, |B|)}$
Pair-wise Document Similarity
How to compute document similarity?
DOCID  nova  galaxy  heat  hollywood  film  role  diet  fur
A        1      3      1
B        5      2
C                              2        1     5
D                              4        1
Pair-wise Document Similarity (no normalization for simplicity)

$D_1 = w_{11}, w_{12}, \ldots, w_{1t}$
$D_2 = w_{21}, w_{22}, \ldots, w_{2t}$

$sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i}$

$sim(A, B) = (1 \cdot 5) + (3 \cdot 2) = 11$
$sim(A, C) = 0$
$sim(A, D) = 0$
$sim(B, C) = 0$
$sim(B, D) = 0$
$sim(C, D) = (2 \cdot 4) + (1 \cdot 1) = 9$

DOCID  nova  galaxy  heat  hollywood  film  role  diet  fur
A        1      3      1
B        5      2
C                              2        1     5
D                              4        1
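Re-computing the table's pairwise similarities (the term-to-column assignment for C and D is inferred from the products above):

```python
from itertools import combinations

vectors = {
    "A": {"nova": 1, "galaxy": 3, "heat": 1},
    "B": {"nova": 5, "galaxy": 2},
    "C": {"hollywood": 2, "film": 1, "role": 5},
    "D": {"hollywood": 4, "film": 1},
}

def dot(d1, d2):
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

for a, b in combinations(vectors, 2):
    print(f"sim({a},{b}) = {dot(vectors[a], vectors[b])}")
# sim(A,B) = 11, sim(C,D) = 9, all other pairs 0
```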
Document/Document Matrix
$\begin{array}{c|cccc} & D_1 & D_2 & \cdots & D_n \\ \hline D_1 & d_{11} & d_{12} & \cdots & d_{1n} \\ D_2 & d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ D_n & d_{n1} & d_{n2} & \cdots & d_{nn} \end{array}$

where $d_{ij}$ = similarity of $D_i$ to $D_j$
Clustering Methods
• Hierarchical
• Agglomerative
• Hybrid
• Automatic Class Assignment
Hierarchic Agglomerative Clustering
• Basic method:
  1) Calculate all of the interdocument similarity coefficients
  2) Assign each document to its own cluster
  3) Fuse the most similar pair of current clusters
  4) Update the similarity matrix by deleting the rows for the fused clusters and calculating entries for the row and column representing the new cluster (centroid)
  5) Return to step 3 if there is more than one cluster left
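A compact sketch of these five steps; for brevity it uses single-link similarity between clusters instead of recomputing a centroid in step 4, and the similarity matrix is a toy example:

```python
import itertools

def hac(sim):
    clusters = [frozenset([d]) for d in sim]          # 2) one cluster per doc
    def cluster_sim(c1, c2):                          # single-link similarity
        return max(sim[a][b] for a in c1 for b in c2)
    merges = []
    while len(clusters) > 1:                          # 5) loop until one left
        c1, c2 = max(itertools.combinations(clusters, 2),
                     key=lambda pair: cluster_sim(*pair))  # 3) fuse best pair
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((set(c1), set(c2)))             # 4) matrix is implicit
    return merges

sim = {  # 1) precomputed interdocument similarity coefficients
    "A": {"A": 1.0, "B": 0.9, "C": 0.1, "D": 0.0},
    "B": {"A": 0.9, "B": 1.0, "C": 0.2, "D": 0.1},
    "C": {"A": 0.1, "B": 0.2, "C": 1.0, "D": 0.8},
    "D": {"A": 0.0, "B": 0.1, "C": 0.8, "D": 1.0},
}
print(hac(sim))  # merges A+B, then C+D, then the two pairs
```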