Model of Web Clustering Engine
Enrichment with a Taxonomy,
Ontologies and User Information
Carlos Cobos-Lozada MSc. Ph.D. (c) [email protected] / [email protected]
Advisor: Elizabeth León Ph.D. [email protected]
Visiting scholar of the Modern Heuristic Research Group
LISI-MIDAS: Universidad Nacional de Colombia Sede Bogotá
GTI: Universidad del Cauca
Idaho Falls, October 5, 2011
Agenda
Preliminaries
Latent Semantic Indexing
Web Clustering Engines
Proposed Model
Preliminaries
User
Retrieval Process
Documents
Results
Query
Feedback
Visualization and browsing
Information Retrieval System
Indexes
Indexing Process
Extended Query
Auto complete
Preliminaries
Information Retrieval Models (Retrieval task)
• Classic Models: Boolean, Vector Space, Probabilistic
• Set Theoretic: Fuzzy, Extended Boolean
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
• Probabilistic: Inference Network, Belief Network
• Structured Models: Non-Overlapping Lists, Proximal Nodes
Preliminaries
Classic Models – Basic Concepts
Each document is represented by a set of representative keywords or index terms
An index term is a document word useful for recalling the document's main themes
Usually, index terms are nouns because nouns have meaning by themselves
However, some search engines assume that all words are index terms (full text representation)
Not all terms are equally useful for representing the document contents, e.g. less frequent terms allow identifying a narrower set of documents
The importance of the index terms is represented by weights associated with them
Preliminaries
Indexing Process
Document
Recognition of structure
Structure
Tokenization
Filters
Stop words removal
Noun groups removal
Stemming
Vocabulary restriction
Key words
Full text representation
Preliminaries
Indexing Process – Sample

Original: WASHINGTON - The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again.
Tokens: WASHINGTON The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again
Filters: washington the house of representatives on tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again
Stop: washington house representatives tuesday passed bill puts government stable financial footing weeks resolve battle spending flare
Stem: washington hous repres tuesdai pass bill put govern stabl financi foot week resolv battl spend flare
Preliminaries
Indexing Process – Sample

Original: TRENTON, New Jersey - New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race, in a move that sets up a battle between Mitt Romney and Rick Perry.
Tokens: TRENTON New Jersey New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race in a move that sets up a battle between Mitt Romney and Rick Perry
Filters: trenton new jersey new jersey governor chris christie dashed hopes on tuesday he might make a late leap into the 2012 republican presidential race in a move that sets up a battle between mitt romney and rick perry
Stop: trenton jersey jersey governor chris christie dashed hopes tuesday make late leap 2012 republican presidential race move sets battle mitt romney rick perry
Stem: trenton jersei jersei governor chri christi dash hope tuesdai make late leap 2012 republican presidenti race move set battl mitt romnei rick perri
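A minimal Python sketch of this pipeline (tokenization, lower-case filter, stop-word removal, stemming). The tiny stop-word list is illustrative only, and the use of NLTK's PorterStemmer is an assumption, although it does reproduce the stems shown above (e.g. "tuesday" → "tuesdai", "government" → "govern").

import re
from nltk.stem import PorterStemmer  # pip install nltk

STOP_WORDS = {"the", "of", "on", "a", "that", "for", "six", "but", "does",
              "nothing", "to", "over", "is", "likely", "again"}

def index_terms(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)              # tokenization
    lowered = [t.lower() for t in tokens]                    # filters: lower case
    content = [t for t in lowered if t not in STOP_WORDS]    # stop-word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in content]                # stemming

print(index_terms("The House of Representatives on Tuesday passed a bill ..."))
# ['hous', 'repres', 'tuesdai', 'pass', 'bill']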
Preliminaries
Term-Document Matrix (TDM)
w_{i,j} = \frac{f_{i,j}}{\max_l(f_{i,l})} \cdot \log\left(\frac{N}{n_j} + 1\right)
Example TDM of observed frequencies: rows are documents d1 … dN, columns are terms t1 … tF; cell f_{i,j} holds the observed frequency of term tj in document di. The illustrative row for d1 shows frequencies 1, 3, 4, 2, so max(f1) = 4, and a term tj that appears in 2 documents has n_j = 2.
TF-IDF or Term-Document Matrix
Stored in an Inverted Index
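A small Python sketch of this weighting, assuming the formula as reconstructed above; note that the worked vector-space examples later in the deck compute the idf factor as log(N/n_j), without the +1.

import math

def tfidf_weights(freq_rows):
    N = len(freq_rows)                                   # number of documents
    F = len(freq_rows[0])                                # number of index terms
    n = [sum(1 for row in freq_rows if row[j] > 0) for j in range(F)]  # document frequency n_j
    weights = []
    for row in freq_rows:
        max_f = max(row) or 1                            # largest observed frequency in this document
        weights.append([(f / max_f) * math.log(N / n[j] + 1) if n[j] else 0.0
                        for j, f in enumerate(row)])
    return weights

tdm_of = [[1, 3, 4, 2],                                  # toy observed-frequency TDM (documents x terms)
          [2, 0, 0, 0],
          [0, 1, 0, 2]]
for row in tfidf_weights(tdm_of):
    print([round(w, 4) for w in row])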
Preliminaries
Cosine Similarity:

Sim(d_i, q) = \frac{\sum_{j=1}^{M} w_{i,j} \cdot w_{q,j}}{\sqrt{\sum_{j=1}^{M} w_{i,j}^2} \cdot \sqrt{\sum_{j=1}^{M} w_{q,j}^2}}
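A minimal numpy sketch of this similarity between a document weight vector and a query weight vector of the same length M (the toy vectors below are arbitrary).

import numpy as np

def cosine_sim(d_weights, q_weights):
    d, q = np.asarray(d_weights, float), np.asarray(q_weights, float)
    denom = np.linalg.norm(d) * np.linalg.norm(q)
    return float(d @ q / denom) if denom else 0.0

print(round(cosine_sim([0.34, 0.0, 0.85], [0.50, 0.84, 1.27]), 4))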
Preliminaries
[Figure: documents d1–d7 and the query q plotted as vectors in the term space spanned by t1, t2 and t3]
      t1 t2 t3  max freq_l,j    t1         t2         t3          |dj|         sim(dj,q)    ranking
d1     1  0  1  1               0.3364722  0          0.8472979   0.9116618    0.85224481   3
d2     1  0  0  1               0.3364722  0          0           0.33647224   0.31454287   6
d3     0  1  1  1               0          0.5596158  0.8472979   1.01542282   0.94924327   2
d4     1  0  0  1               0.3364722  0          0           0.33647224   0.31454287   7
d5     1  1  1  1               0.3364722  0.5596158  0.8472979   1.06971822   1            1
d6     1  1  0  1               0.3364722  0.5596158  0           0.65298039   0.61042281   4
d7     0  1  0  1               0          0.5596158  0           0.55961579   0.52314318   5
ni     5  4  3                  (N = 7)
idfi                            0.3364722  0.5596158  0.8472979
q      1  1  1  1 (max freq_l,q)  0.5047084  0.8394237  1.2709468   |q| = 1.60457732
Sample 1: Vector Space Model
Preliminaries
[Figure: documents d1–d7 and the query q plotted as vectors in the term space spanned by t1, t2 and t3]
      t1 t2 t3  max freq_l,j    t1         t2         t3          |dj|         sim(dj,q)    ranking
d1     1  0  1  1               0.3364722  0          0.8472979   0.9116618    0.88229947   3
d2     1  0  0  1               0.3364722  0          0           0.33647224   0.19256666   6
d3     0  1  1  1               0          0.5596158  0.8472979   1.01542282   0.97544391   2
d4     1  0  0  1               0.3364722  0          0           0.33647224   0.19256666   7
d5     1  1  1  1               0.3364722  0.5596158  0.8472979   1.06971822   0.98650404   1
d6     1  1  0  1               0.3364722  0.5596158  0           0.65298039   0.48349989   4
d7     0  1  0  1               0          0.5596158  0           0.55961579   0.44838373   5
ni     5  4  3                  (N = 7)
idfi                            0.3364722  0.5596158  0.8472979
q      1  2  3  3 (max freq_l,q)  0.2803935  0.6528851  1.2709468   |q| = 1.45608558
Sample 2: Vector Space Model
Preliminaries
Vector Space Model

Advantages:
• Simple model based on linear algebra
• Term weights
• Allows computing a continuous degree of similarity between queries and documents
• Allows ranking documents according to their possible relevance
• Allows partial matching

Limitations:
• Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
• Word substrings might result in a "false positive match"
• Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match"
• The order in which the terms appear in the document is lost in the vector space representation
• Assumes terms are statistically independent
Latent Semantic Indexing
It is an indexing and retrieval method that uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text
SVD can also be used to reduce noise in the data (it projects the data into a space of reduced dimension)
Latent Semantic Indexing
Let A denote an m × n matrix of real-valued data and rank r, where without loss of generality m ≥ n, and therefore r ≤ n.
Where:
◦ The columns of U are called the left singular vectors and form an orthonormal basis for the column space of A; U holds the eigenvectors of AA^T (orthogonal)
◦ The rows of V^T contain the right singular vectors and form an orthonormal basis for the row space of A; V holds the eigenvectors of A^T A (orthogonal)
◦ Σ is a diagonal matrix whose entries are the square roots of the eigenvalues of A^T A (the singular values), sorted so that Σ_{i,i} ≥ Σ_{j,j} for i < j, with Σ_{i,i} = 0 for i > r, and r ≤ n
SVD:

A_{m \times n} = U_{m \times n} \cdot \Sigma_{n \times n} \cdot V^T_{n \times n}
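A minimal numpy sketch of the decomposition. The 10×8 matrix is random toy data (think 10 documents × 8 terms), not the example matrix from the next slide.

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(1, 6, size=(10, 8)).astype(float)       # m = 10, n = 8

U, s, Vt = np.linalg.svd(A, full_matrices=False)          # U: 10x8, s: 8 singular values, Vt: 8x8
Sigma = np.diag(s)                                        # singular values on the diagonal, sorted descending

print(np.allclose(A, U @ Sigma @ Vt))                     # True: the product reconstructs A exactly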
Latent Semantic Indexing
A_{10×8} = U_{10×8} · Σ_{8×8} · V^T_{8×8}   (the figure labels the two dimensions Docs and Terms)

A (10×8):
3 5 5 5 4 3 1 2
3 4 5 4 3 3 4 3
3 4 5 3 3 4 3 2
5 5 3 5 4 5 5 5
3 4 4 4 5 4 3 4
4 5 4 5 4 3 5 2
2 5 4 3 3 3 3 2
5 5 4 5 3 5 5 4
5 5 5 5 4 5 4 4
4 3 4 4 1 2 4 3

U (10×8):
 0.29  0.64 -0.01 -0.29 -0.56 -0.19  0.12  0.15
 0.30  0.06  0.28 -0.02  0.24  0.48  0.12  0.47
 0.28  0.24  0.13 -0.12  0.59 -0.24 -0.42 -0.08
 0.37 -0.44 -0.44  0.09 -0.14 -0.08  0.24 -0.15
 0.31  0.15 -0.52 -0.03  0.08  0.61 -0.09 -0.01
 0.33 -0.01  0.27  0.71 -0.35  0.06 -0.43 -0.07
 0.26  0.28  0.09  0.39  0.33 -0.18  0.66 -0.27
 0.37 -0.35 -0.02 -0.06  0.04 -0.41  0.04  0.62
 0.38 -0.04 -0.12 -0.30  0.04 -0.18 -0.27 -0.40
 0.26 -0.33  0.58 -0.38 -0.16  0.25  0.18 -0.32

Σ (8×8, diagonal): diag(34.89, 4.63, 3.36, 2.33, 2.21, 1.73, 1.22, 0.35)

V^T (8×8):
 0.34  0.41  0.38  0.39  0.31  0.34  0.34  0.29
-0.35  0.28  0.48  0.03  0.38 -0.06 -0.55 -0.35
 0.10  0.05  0.51  0.13 -0.53 -0.42  0.33 -0.38
-0.28  0.36 -0.38 -0.09  0.36 -0.16  0.56 -0.41
-0.30 -0.04  0.38 -0.68 -0.11  0.45  0.29  0.08
-0.29 -0.43  0.26  0.05  0.39 -0.50  0.24  0.44
-0.40  0.60 -0.10  0.02 -0.38 -0.22 -0.10  0.52
-0.58 -0.27 -0.04  0.60  0.20  0.41  0.12 -0.11
Latent Semantic Indexing
Using SVD to reduce noise:
◦ Take r instead of n in matrix Σ
◦ What value of r? e.g. 90% of the Frobenius norm
◦ In this case r = 5, where r < n (n = 8)

A_{10 \times 8} \approx U_{10 \times 5} \cdot \Sigma_{5 \times 5} \cdot V^T_{5 \times 8}
Latent Semantic Indexing
The same A, U, Σ and V^T as above, truncated to the first r = 5 singular values:

A (10×8, Docs × Terms) ≈ U (10×5, first 5 columns) · Σ (5×5, diag(34.89, 4.63, 3.36, 2.33, 2.21)) · V^T (5×8, first 5 rows)

A_{10 \times 8} \approx U_{10 \times 5} \cdot \Sigma_{5 \times 5} \cdot V^T_{5 \times 8}
Latent Semantic Indexing
Value of r?

Sum ← 0
For i ← 1 to n do
    Sum ← Sum + Σ(i, i)
End for
Threshold ← Sum * 0.9   // 90% of the Frobenius norm
r ← 0
Temp ← 0
For i ← 1 to n do
    Temp ← Temp + Σ(i, i)
    r ← r + 1
    If Temp ≥ Threshold then
        break
    End if
End for
Return r
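A direct Python version of the pseudocode, assuming the singular values are already sorted in descending order: return the smallest r whose cumulative sum reaches 90% of the total (what the slides call 90% of the Frobenius norm).

def choose_r(singular_values, fraction=0.9):
    threshold = sum(singular_values) * fraction
    cumulative = 0.0
    for r, value in enumerate(singular_values, start=1):
        cumulative += value
        if cumulative >= threshold:
            return r
    return len(singular_values)

# Singular values from the example above: r = 5 is selected
print(choose_r([34.89, 4.63, 3.36, 2.33, 2.21, 1.73, 1.22, 0.35]))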
Latent Semantic Indexing
Retrieved documents in the latent space:

◦ Documents in the latent space:
D'_{m \times r} = A_{m \times n} \cdot V_{n \times r} \cdot \Sigma^{-1}_{r \times r}

◦ Terms in the latent space:
T'_{n \times r} = (A_{m \times n})^T \cdot U_{m \times r} \cdot \Sigma^{-1}_{r \times r}
Latent Semantic Indexing
Query in the latent space:
q'_{1 \times r} = q_{1 \times n} \cdot V_{n \times r} \cdot \Sigma^{-1}_{r \times r}

Cosine similarity (over the r latent dimensions):
Sim(d_i, q) = \frac{\sum_{j=1}^{r} w_{i,j} \cdot w_{q,j}}{\sqrt{\sum_{j=1}^{r} w_{i,j}^2} \cdot \sqrt{\sum_{j=1}^{r} w_{q,j}^2}}
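A minimal numpy sketch that puts these formulas together: project the documents and the query into the r-dimensional latent space and rank by cosine similarity. It assumes, as the formulas above imply, that A is documents × terms; the data here is random toy data.

import numpy as np

rng = np.random.default_rng(1)
A = rng.random((10, 8))                        # 10 documents x 8 terms
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 5
V_r = Vt[:r, :].T                              # n x r
S_r_inv = np.diag(1.0 / s[:r])                 # r x r

D_latent = A @ V_r @ S_r_inv                   # documents in the latent space (m x r)
q = rng.random(8)                              # query as a term vector (1 x n)
q_latent = q @ V_r @ S_r_inv                   # query in the latent space (1 x r)

sims = (D_latent @ q_latent) / (np.linalg.norm(D_latent, axis=1) * np.linalg.norm(q_latent))
print(np.argsort(-sims))                       # document indices ranked by cosine similarity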
Web Clustering Engines
Web Clustering Engines
The search aspects where WCE can be most useful in complementing the output of plain search engines are:
◦ Fast subtopic retrieval: documents can be accessed in logarithmic rather than linear time
◦ Topic exploration: clusters provide a high-level view of the whole query topic, including terms for query reformulation (particularly useful for informational searches in unknown or dynamic domains)
◦ Alleviating information overlook: users may review hundreds of potentially relevant results without the need to download and scroll to subsequent pages
Web Clustering Engines
WDC poses new requirements and challenges for clustering technology:
◦ Meaningful labels
◦ Computational efficiency (response time)
◦ Short input data description (snippets)
◦ Unknown number of clusters
◦ Work with noisy data
◦ Overlapping clusters
Search results acquisition
Preprocessing
Cluster construction and labeling
Visualization
Query
Snippets
Features
Clusters
General Model
Search results acquisition
Preprocessing
Cluster construction and labeling
Visualization
Query
Snippets
Features
Clusters
Proposed Model
Query Expansion
Concepts instead of Terms
Evolutionary approach: Online and Offline
Feedback
Taxonomy, Ontologies and User Information
Query Expansion Process
1. A registered user submits a query (based on keywords, in a common graphical interface like Google) and receives online help (auto-complete) based on his/her user profile
General Taxonomy of Knowledge
User Profile
Specific Ontology
Query by keywords
1. Pre-processing and semantic relationship
0 … *
Auto-complete dropdown list
2. Related Concepts with user profile
3. External service
Inverted Index of Concepts
User
Query by keywords
1
Query Expansion Process
Extended Query
Concepts, relations (is-a, is-part-of) and instances
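A minimal sketch of the expansion idea: the keyword query is enriched with related concepts before searching. The concept map and the user-profile set below are hypothetical placeholders standing in for the GTK, the specific ontologies and the Inverted Index of Concepts described above.

RELATED_CONCEPTS = {                        # hypothetical: keyword -> related concepts / synonyms
    "jaguar": ["big cat", "panthera onca"],
    "clustering": ["cluster analysis", "grouping"],
}
USER_PROFILE_CONCEPTS = {"clustering"}      # hypothetical: GTK nodes the user has used before

def expand_query(keywords):
    expanded = list(keywords)
    for kw in keywords:
        if kw in USER_PROFILE_CONCEPTS:
            for concept in RELATED_CONCEPTS.get(kw, []):
                if concept not in expanded:
                    expanded.append(concept)
    return expanded

print(expand_query(["clustering", "jaguar"]))
# ['clustering', 'jaguar', 'cluster analysis', 'grouping']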
Query Expansion Process (B)
1. The GTK and the specific ontologies are multilingual (collaborative editing process)
2. The user profile has:
• Nodes from the GTK used by the user
• A relation with the Inverted Index of Concepts (ontologies), to support the rating process: it manages concepts that have been previously evaluated for a specific ontology (good/bad)
General Taxonomy of Knowledge
User
Query by keywords
1
Query Expansion Process
Extended Query
Term-Document Matrix - Observed Frequency - TDM-OF Building Process
Extended query: original keywords + other concepts + selected nodes from the GTK (ontologies)
In parallel, each web search result is processed:
1. Pre-processing
• Tokenization
• Filters (special characters and lower case)
• Stop words removal
• Identify the language
• Stemming (English / Spanish)
2. For each document, accumulate the observed frequency of each term
3. Mark the document as processed
Independent Threads
Term-Document Matrix (Observed Frequency)
2
Google API
Yahoo! API
Bing API
TDM-OF Building Process
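A minimal Python sketch of this step: one worker per search result preprocesses the snippet and accumulates its observed term frequencies. fetch_snippets() is a stand-in for the Google / Yahoo! / Bing API calls, which are not shown here, and the preprocessing is reduced to tokenization plus lower-casing.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def fetch_snippets(extended_query):
    # Hypothetical stub returning snippet texts for the extended query
    return ["Web clustering engines group search results by topic",
            "Clustering of web search results with ontologies"]

def observed_frequencies(snippet):
    tokens = snippet.lower().split()           # tokenization + lower-case filter only
    return Counter(tokens)                     # observed frequency of each term

def build_tdm_of(extended_query):
    snippets = fetch_snippets(extended_query)
    with ThreadPoolExecutor() as pool:         # independent threads, one per document
        return list(pool.map(observed_frequencies, snippets))

for row in build_tdm_of("web clustering"):
    print(dict(row))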
Concept-Document Matrix - Observed Frequency - CDM-OF Building Process
CDM-OF Building Process
Concept-Document Matrix (Observed Frequency)
In parallel, for each document marked as processed:
1. Join terms belonging to the same concept in the selected specific ontologies (from extended query)
2. Accumulate the observed frequencies of the terms joined into the same concept
3. End this process when all web search results are processed - thread synchronization -
Specific Ontology
Thread Synchronization
3
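A minimal sketch of the term-to-concept join: terms that the ontology maps to the same concept are merged and their observed frequencies accumulated. TERM_TO_CONCEPT is a hypothetical stand-in for the selected specific ontology.

from collections import Counter

TERM_TO_CONCEPT = {"car": "automobile", "auto": "automobile", "vehicle": "automobile"}

def concept_frequencies(term_row):
    concept_row = Counter()
    for term, freq in term_row.items():
        concept_row[TERM_TO_CONCEPT.get(term, term)] += freq   # unmapped terms stay as they are
    return concept_row

print(dict(concept_frequencies(Counter({"car": 2, "auto": 1, "engine": 3}))))
# {'automobile': 3, 'engine': 3}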
Concept-Document Matrix (CDM) Building Process
Concept-Document Matrix (CDM)
4
CDM-OF Building Process
1. Calculate the weight (TF-IDF) of each concept in each document
w_{i,j} = \frac{freq_{i,j}}{\max_l(freq_{l,j})} \cdot \log\left(\frac{N}{n_j} + 1\right)
Clustering Process
Three algorithms of our own:
1. A hybridization of Global-Best Harmony Search with the K-means algorithm
2. A memetic algorithm with niching techniques (restricted competition replacement and restrictive mating)
3. A memetic algorithm (roulette wheel, K-means, and replace-the-worst)
All algorithms:
4. Define the number of clusters automatically (BIC)
5. Can use a standard Term-Document Matrix (TDM), Frequent Term-Document Matrix (FTDM), Concept-Document Matrix (CDM) or Frequent Concept-Document Matrix (FCDM)
6. Tested with data sets based on Reuters-21578 and DMOZ
7. Tested by users
5
Clustering Process
Clustered Documents
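The three in-house algorithms above (Global-Best Harmony Search + K-means and the two memetic algorithms) are not reproduced here. The sketch below only illustrates the idea of choosing the number of clusters automatically with BIC, using scikit-learn's Gaussian mixture on toy document vectors as a generic stand-in.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.3, size=(20, 5))      # toy document vectors with 3 hidden groups
               for center in (0.0, 2.0, 4.0)])

best_k, best_bic = None, float("inf")
for k in range(2, 6):
    model = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic = model.bic(X)                                     # lower BIC = better fit/complexity trade-off
    if bic < best_bic:
        best_k, best_bic = k, bic

print("number of clusters selected by BIC:", best_k)       # 3 for this toy data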
Labeling Process
Statistically Representative Terms:
1. Initialize algorithm parameters
2. Building of the "Others" label and cluster
3. Candidate label induction
4. Eliminate repeated terms
5. Visual improving of labels

Frequent Phrases:
6. Conversion of the representation
7. Document concatenation
8. Complete phrase discovery
9. Final selection
10. Building of the "Others" label and cluster
11. Cluster label induction

Overlapping clusters
6
Labeling Process
Clustered and Labeled Documents
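A simplified Python illustration of the "statistically representative terms" idea: label each cluster with the top-weighted terms of its centroid. The full labeling process (the "Others" cluster, repeated-term elimination, frequent phrases) is not reproduced here.

import numpy as np

def cluster_labels(weights, assignments, terms, top_n=2):
    weights, assignments = np.asarray(weights, float), np.asarray(assignments)
    labels = {}
    for c in sorted(set(assignments.tolist())):
        centroid = weights[assignments == c].mean(axis=0)   # average TF-IDF profile of the cluster
        labels[c] = [terms[j] for j in np.argsort(-centroid)[:top_n]]
    return labels

W = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1],      # toy TF-IDF rows (documents x terms)
     [0.0, 0.7, 0.9], [0.1, 0.6, 0.8]]
print(cluster_labels(W, [0, 0, 1, 1], ["java", "clustering", "ontology"]))
# {0: ['java', 'clustering'], 1: ['ontology', 'clustering']}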
Visualization and Rating Process
On experimentation → for each cluster, the user answered:
• (Q1) whether the cluster label is in general representative of the cluster (much, little, or nothing)
• (Q2) whether the cluster is useful, moderately useful or useless.
Then, for each document in each cluster, the user answered:
• (Q3) whether the document matches the cluster (very well matching, moderately matching, or not matching)
• (Q4) whether the relevance (location) of the document in the cluster was adequate (adequate, moderately suitable, or inadequate).
Visualization and Rating Process
Clustered and Labeled Documents
User Profile
Visualization and Rating Process
In production → the user can indicate whether each document is useful (relevant) or not
Visualization and Rating Process
Clustered and Labeled Documents
User Profile
General Taxonomy of Knowledge
User Profile
Specific Ontology
0 … *
Inverted Index of Concepts
Proposed model
Collaborative Editing Process of Ontologies
WordNet
General Taxonomy of Knowledge
User Profile
Specific Ontology
0 … *
3. Supported by general ontologies
Inverted Index of Concepts
Editor
1. Select a node (with its associated ontology)
2. Edit the ontology: concepts, synonyms in different languages, relations, instances
4. Supported by concepts used by the user
Can be done automatically
5. Update the index automatically on save
Model of Web Clustering Engine
Enrichment with a Taxonomy,
Ontologies and User Information
Carlos Cobos-Lozada MSc. Ph.D. (c) [email protected] / [email protected]
Questions?