Model of Web Clustering Engine
Enrichment with a Taxonomy,
Ontologies and User Information
Carlos Cobos-Lozada MSc. Ph.D. (c) [email protected] / [email protected]
Advisor: Elizabeth León Ph.D. [email protected]
Visiting scholar of the Modern Heuristic Research Group
LISI-MIDAS: Universidad Nacional de Colombia Sede Bogotá
GTI: Universidad del Cauca
Idaho Falls, October 5, 2011
Agenda
Preliminaries
Latent Semantic Indexing
Web Clustering Engines
Proposed Model
Preliminaries
User
Retrieval Process
Documents
Results
Query
Feedback
Visualization and browsing
Information Retrieval System
Indexes
Indexing Process
Extended Query
Auto complete
Preliminaries
Information Retrieval Models (Retrieval task)
• Classic Models: Boolean, Vector Space, Probabilistic
• Set Theoretic: Fuzzy, Extended Boolean
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
• Probabilistic: Inference Network, Belief Network
• Structured Models: Non-Overlapping Lists, Proximal Nodes
Preliminaries
Classic Models – Basic Concepts
Each document is represented by a set of representative keywords or index terms
An index term is a document word useful for recalling the document's main themes
Usually, index terms are nouns because nouns have meaning by themselves
However, some search engines assume that all words are index terms (full text representation)
Not all terms are equally useful for representing the document contents, e.g. less frequent terms allow identifying a narrower set of documents
The importance of the index terms is represented by weights associated with them
Preliminaries
Indexing Process
Document
Recognition of structure
Structure
Tokenization
Filters
Stop words removal
Noun groups removal
Stemming
Vocabulary restriction
Key words
Full text representation
Preliminaries
Indexing Process – Sample

Original: WASHINGTON - The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again.
Tokens: WASHINGTON The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again
Filters: washington the house of representatives on tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again
Stop: washington house representatives tuesday passed bill puts government stable financial footing weeks resolve battle spending flare
Stem: washington hous repres tuesdai pass bill put govern stabl financi foot week resolv battl spend flare
Preliminaries
Indexing Process – Sample

Original: TRENTON, New Jersey - New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race, in a move that sets up a battle between Mitt Romney and Rick Perry.
Tokens: TRENTON New Jersey New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race in a move that sets up a battle between Mitt Romney and Rick Perry
Filters: trenton new jersey new jersey governor chris christie dashed hopes on tuesday he might make a late leap into the 2012 republican presidential race in a move that sets up a battle between mitt romney and rick perry
Stop: trenton jersey jersey governor chris christie dashed hopes tuesday make late leap 2012 republican presidential race move sets battle mitt romney rick perry
Stem: trenton jersei jersei governor chri christi dash hope tuesdai make late leap 2012 republican presidenti race move set battl mitt romnei rick perri
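A minimal Python sketch of this pipeline (tokenization, lower-case filter, stop-word removal, stemming). The tiny stop-word list is illustrative only, and the use of NLTK's PorterStemmer is an assumption, although it does reproduce the stems shown above (e.g. "tuesday" → "tuesdai", "government" → "govern").

import re
from nltk.stem import PorterStemmer  # pip install nltk

STOP_WORDS = {"the", "of", "on", "a", "that", "for", "six", "but", "does",
              "nothing", "to", "over", "is", "likely", "again"}

def index_terms(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)              # tokenization
    lowered = [t.lower() for t in tokens]                    # filters: lower case
    content = [t for t in lowered if t not in STOP_WORDS]    # stop-word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in content]                # stemming

print(index_terms("The House of Representatives on Tuesday passed a bill ..."))
# ['hous', 'repres', 'tuesdai', 'pass', 'bill']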
Preliminaries
Term-Document Matrix (TDM)
w_{i,j} = \frac{f_{i,j}}{\max_l(f_{i,l})} \cdot \log\left(\frac{N}{n_j} + 1\right)
Example TDM of observed frequencies: rows are documents d1 … dN, columns are terms t1 … tF; cell f_{i,j} holds the observed frequency of term tj in document di. The illustrative row for d1 shows frequencies 1, 3, 4, 2, so max(f1) = 4, and a term tj that appears in 2 documents has n_j = 2.
TF-IDF or Term-Document Matrix
Stored in an Inverted Index
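A small Python sketch of this weighting, assuming the formula as reconstructed above; note that the worked vector-space examples later in the deck compute the idf factor as log(N/n_j), without the +1.

import math

def tfidf_weights(freq_rows):
    N = len(freq_rows)                                   # number of documents
    F = len(freq_rows[0])                                # number of index terms
    n = [sum(1 for row in freq_rows if row[j] > 0) for j in range(F)]  # document frequency n_j
    weights = []
    for row in freq_rows:
        max_f = max(row) or 1                            # largest observed frequency in this document
        weights.append([(f / max_f) * math.log(N / n[j] + 1) if n[j] else 0.0
                        for j, f in enumerate(row)])
    return weights

tdm_of = [[1, 3, 4, 2],                                  # toy observed-frequency TDM (documents x terms)
          [2, 0, 0, 0],
          [0, 1, 0, 2]]
for row in tfidf_weights(tdm_of):
    print([round(w, 4) for w in row])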
Preliminaries
Cosine Similarity:

Sim(d_i, q) = \frac{\sum_{j=1}^{M} w_{i,j} \cdot w_{q,j}}{\sqrt{\sum_{j=1}^{M} w_{i,j}^2} \cdot \sqrt{\sum_{j=1}^{M} w_{q,j}^2}}
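A minimal numpy sketch of this similarity between a document weight vector and a query weight vector of the same length M (the toy vectors below are arbitrary).

import numpy as np

def cosine_sim(d_weights, q_weights):
    d, q = np.asarray(d_weights, float), np.asarray(q_weights, float)
    denom = np.linalg.norm(d) * np.linalg.norm(q)
    return float(d @ q / denom) if denom else 0.0

print(round(cosine_sim([0.34, 0.0, 0.85], [0.50, 0.84, 1.27]), 4))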
Preliminaries
[Figure: documents d1–d7 and the query q plotted as vectors in the term space spanned by t1, t2 and t3]
      t1 t2 t3  max freq_l,j    t1         t2         t3          |dj|         sim(dj,q)    ranking
d1     1  0  1  1               0.3364722  0          0.8472979   0.9116618    0.85224481   3
d2     1  0  0  1               0.3364722  0          0           0.33647224   0.31454287   6
d3     0  1  1  1               0          0.5596158  0.8472979   1.01542282   0.94924327   2
d4     1  0  0  1               0.3364722  0          0           0.33647224   0.31454287   7
d5     1  1  1  1               0.3364722  0.5596158  0.8472979   1.06971822   1            1
d6     1  1  0  1               0.3364722  0.5596158  0           0.65298039   0.61042281   4
d7     0  1  0  1               0          0.5596158  0           0.55961579   0.52314318   5
ni     5  4  3                  (N = 7)
idfi                            0.3364722  0.5596158  0.8472979
q      1  1  1  1 (max freq_l,q)  0.5047084  0.8394237  1.2709468   |q| = 1.60457732
Sample 1: Vector Space Model
Preliminaries
[Figure: documents d1–d7 and the query q plotted as vectors in the term space spanned by t1, t2 and t3]
      t1 t2 t3  max freq_l,j    t1         t2         t3          |dj|         sim(dj,q)    ranking
d1     1  0  1  1               0.3364722  0          0.8472979   0.9116618    0.88229947   3
d2     1  0  0  1               0.3364722  0          0           0.33647224   0.19256666   6
d3     0  1  1  1               0          0.5596158  0.8472979   1.01542282   0.97544391   2
d4     1  0  0  1               0.3364722  0          0           0.33647224   0.19256666   7
d5     1  1  1  1               0.3364722  0.5596158  0.8472979   1.06971822   0.98650404   1
d6     1  1  0  1               0.3364722  0.5596158  0           0.65298039   0.48349989   4
d7     0  1  0  1               0          0.5596158  0           0.55961579   0.44838373   5
ni     5  4  3                  (N = 7)
idfi                            0.3364722  0.5596158  0.8472979
q      1  2  3  3 (max freq_l,q)  0.2803935  0.6528851  1.2709468   |q| = 1.45608558
Sample 2: Vector Space Model
Preliminaries
Vector Space Model

Advantages:
• Simple model based on linear algebra
• Term weights
• Allows computing a continuous degree of similarity between queries and documents
• Allows ranking documents according to their possible relevance
• Allows partial matching

Limitations:
• Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
• Word substrings might result in a "false positive match"
• Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match"
• The order in which the terms appear in the document is lost in the vector space representation
• Assumes terms are statistically independent
Latent Semantic Indexing
It is an indexing and retrieval method that uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text
SVD can also be used to reduce noise in the data (it projects the data into a space of reduced dimension)
Latent Semantic Indexing
Let A denote an m × n matrix of real-valued data and rank r, where without loss of generality m ≥ n, and therefore r ≤ n.
Where:
◦ The columns of U are called the left singular vectors and form an orthonormal basis for the column space of A; U holds the eigenvectors of AA^T (orthogonal)
◦ The rows of V^T contain the right singular vectors and form an orthonormal basis for the row space of A; V holds the eigenvectors of A^T A (orthogonal)
◦ Σ is a diagonal matrix whose entries are the square roots of the eigenvalues of A^T A (the singular values), sorted so that Σ_{i,i} ≥ Σ_{j,j} for i < j, with Σ_{i,i} = 0 for i > r, and r ≤ n
SVD:

A_{m \times n} = U_{m \times n} \cdot \Sigma_{n \times n} \cdot V^T_{n \times n}
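A minimal numpy sketch of the decomposition. The 10×8 matrix is random toy data (think 10 documents × 8 terms), not the example matrix from the next slide.

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(1, 6, size=(10, 8)).astype(float)       # m = 10, n = 8

U, s, Vt = np.linalg.svd(A, full_matrices=False)          # U: 10x8, s: 8 singular values, Vt: 8x8
Sigma = np.diag(s)                                        # singular values on the diagonal, sorted descending

print(np.allclose(A, U @ Sigma @ Vt))                     # True: the product reconstructs A exactly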
Latent Semantic Indexing
A_{10×8} = U_{10×8} · Σ_{8×8} · V^T_{8×8}   (the figure labels the two dimensions Docs and Terms)

A (10×8):
3 5 5 5 4 3 1 2
3 4 5 4 3 3 4 3
3 4 5 3 3 4 3 2
5 5 3 5 4 5 5 5
3 4 4 4 5 4 3 4
4 5 4 5 4 3 5 2
2 5 4 3 3 3 3 2
5 5 4 5 3 5 5 4
5 5 5 5 4 5 4 4
4 3 4 4 1 2 4 3

U (10×8):
 0.29  0.64 -0.01 -0.29 -0.56 -0.19  0.12  0.15
 0.30  0.06  0.28 -0.02  0.24  0.48  0.12  0.47
 0.28  0.24  0.13 -0.12  0.59 -0.24 -0.42 -0.08
 0.37 -0.44 -0.44  0.09 -0.14 -0.08  0.24 -0.15
 0.31  0.15 -0.52 -0.03  0.08  0.61 -0.09 -0.01
 0.33 -0.01  0.27  0.71 -0.35  0.06 -0.43 -0.07
 0.26  0.28  0.09  0.39  0.33 -0.18  0.66 -0.27
 0.37 -0.35 -0.02 -0.06  0.04 -0.41  0.04  0.62
 0.38 -0.04 -0.12 -0.30  0.04 -0.18 -0.27 -0.40
 0.26 -0.33  0.58 -0.38 -0.16  0.25  0.18 -0.32

Σ (8×8, diagonal): diag(34.89, 4.63, 3.36, 2.33, 2.21, 1.73, 1.22, 0.35)

V^T (8×8):
 0.34  0.41  0.38  0.39  0.31  0.34  0.34  0.29
-0.35  0.28  0.48  0.03  0.38 -0.06 -0.55 -0.35
 0.10  0.05  0.51  0.13 -0.53 -0.42  0.33 -0.38
-0.28  0.36 -0.38 -0.09  0.36 -0.16  0.56 -0.41
-0.30 -0.04  0.38 -0.68 -0.11  0.45  0.29  0.08
-0.29 -0.43  0.26  0.05  0.39 -0.50  0.24  0.44
-0.40  0.60 -0.10  0.02 -0.38 -0.22 -0.10  0.52
-0.58 -0.27 -0.04  0.60  0.20  0.41  0.12 -0.11
Latent Semantic Indexing
Using SVD to reduce noise:
◦ Take r instead of n in matrix Σ
◦ What value of r? e.g. 90% of the Frobenius norm
◦ In this case r = 5, where r < n (n = 8)

A_{10 \times 8} \approx U_{10 \times 5} \cdot \Sigma_{5 \times 5} \cdot V^T_{5 \times 8}
Latent Semantic Indexing
The same A, U, Σ and V^T as above, truncated to the first r = 5 singular values:

A (10×8, Docs × Terms) ≈ U (10×5, first 5 columns) · Σ (5×5, diag(34.89, 4.63, 3.36, 2.33, 2.21)) · V^T (5×8, first 5 rows)

A_{10 \times 8} \approx U_{10 \times 5} \cdot \Sigma_{5 \times 5} \cdot V^T_{5 \times 8}
Latent Semantic Indexing
Value of r?

Sum ← 0
For i ← 1 to n do
    Sum ← Sum + Σ(i, i)
End for
Threshold ← Sum * 0.9   // 90% of the Frobenius norm
r ← 0
Temp ← 0
For i ← 1 to n do
    Temp ← Temp + Σ(i, i)
    r ← r + 1
    If Temp ≥ Threshold then
        break
    End if
End for
Return r
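A direct Python version of the pseudocode, assuming the singular values are already sorted in descending order: return the smallest r whose cumulative sum reaches 90% of the total (what the slides call 90% of the Frobenius norm).

def choose_r(singular_values, fraction=0.9):
    threshold = sum(singular_values) * fraction
    cumulative = 0.0
    for r, value in enumerate(singular_values, start=1):
        cumulative += value
        if cumulative >= threshold:
            return r
    return len(singular_values)

# Singular values from the example above: r = 5 is selected
print(choose_r([34.89, 4.63, 3.36, 2.33, 2.21, 1.73, 1.22, 0.35]))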
Latent Semantic Indexing
Retrieved documents in the latent space:

◦ Documents in the latent space:
D'_{m \times r} = A_{m \times n} \cdot V_{n \times r} \cdot \Sigma^{-1}_{r \times r}

◦ Terms in the latent space:
T'_{n \times r} = (A_{m \times n})^T \cdot U_{m \times r} \cdot \Sigma^{-1}_{r \times r}
Latent Semantic Indexing
Query in the latent space:
q'_{1 \times r} = q_{1 \times n} \cdot V_{n \times r} \cdot \Sigma^{-1}_{r \times r}

Cosine similarity (over the r latent dimensions):
Sim(d_i, q) = \frac{\sum_{j=1}^{r} w_{i,j} \cdot w_{q,j}}{\sqrt{\sum_{j=1}^{r} w_{i,j}^2} \cdot \sqrt{\sum_{j=1}^{r} w_{q,j}^2}}
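A minimal numpy sketch that puts these formulas together: project the documents and the query into the r-dimensional latent space and rank by cosine similarity. It assumes, as the formulas above imply, that A is documents × terms; the data here is random toy data.

import numpy as np

rng = np.random.default_rng(1)
A = rng.random((10, 8))                        # 10 documents x 8 terms
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 5
V_r = Vt[:r, :].T                              # n x r
S_r_inv = np.diag(1.0 / s[:r])                 # r x r

D_latent = A @ V_r @ S_r_inv                   # documents in the latent space (m x r)
q = rng.random(8)                              # query as a term vector (1 x n)
q_latent = q @ V_r @ S_r_inv                   # query in the latent space (1 x r)

sims = (D_latent @ q_latent) / (np.linalg.norm(D_latent, axis=1) * np.linalg.norm(q_latent))
print(np.argsort(-sims))                       # document indices ranked by cosine similarity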
Web Clustering Engines
Web Clustering Engines
The search aspects where WCE can be most useful in complementing the output of plain search engines are:
◦ Fast subtopic retrieval: documents can be accessed in logarithmic rather than linear time
◦ Topic exploration: clusters provide a high-level view of the whole query topic, including terms for query reformulation (particularly useful for informational searches in unknown or dynamic domains)
◦ Alleviating information overlook: users may review hundreds of potentially relevant results without the need to download and scroll to subsequent pages
Web Clustering Engines
WDC poses new requirements and challenges for clustering technology:
◦ Meaningful labels
◦ Computational efficiency (response time)
◦ Short input data description (snippets)
◦ Unknown number of clusters
◦ Work with noisy data
◦ Overlapping clusters
Search results acquisition
Preprocessing
Cluster construction and labeling
Visualization
Query
Snippets
Features
Clusters
General Model
Search results acquisition
Preprocessing
Cluster construction and labeling
Visualization
Query
Snippets
Features
Clusters
Proposed Model
Query Expansion
Concepts instead of Terms
Evolutionary approach: Online and Offline
Feedback
Taxonomy, Ontologies and User Information
Query Expansion Process
1. A registered user submits a query (based on keywords, in a common graphical interface like Google) and receives online help (auto-complete) based on his/her user profile
General Taxonomy of Knowledge
User Profile
Specific Ontology
Query by keywords
1. Pre-processing and semantic relationship
0 … *
Auto-complete dropdown list
2. Related Concepts with user profile
3. External service
Inverted Index of Concepts
User
Query by keywords
1
Query Expansion Process
Extended Query
Concepts, relations (is-a, is-part-of) and instances
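A minimal sketch of the expansion idea: the keyword query is enriched with related concepts before searching. The concept map and the user-profile set below are hypothetical placeholders standing in for the GTK, the specific ontologies and the Inverted Index of Concepts described above.

RELATED_CONCEPTS = {                        # hypothetical: keyword -> related concepts / synonyms
    "jaguar": ["big cat", "panthera onca"],
    "clustering": ["cluster analysis", "grouping"],
}
USER_PROFILE_CONCEPTS = {"clustering"}      # hypothetical: GTK nodes the user has used before

def expand_query(keywords):
    expanded = list(keywords)
    for kw in keywords:
        if kw in USER_PROFILE_CONCEPTS:
            for concept in RELATED_CONCEPTS.get(kw, []):
                if concept not in expanded:
                    expanded.append(concept)
    return expanded

print(expand_query(["clustering", "jaguar"]))
# ['clustering', 'jaguar', 'cluster analysis', 'grouping']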
Query Expansion Process (B)
1. The GTK and the specific ontologies are multilingual (collaborative editing process)
2. The user profile has:
• Nodes from the GTK used by the user
• A relation with the Inverted Index of Concepts (ontologies), to support the rating process: it manages concepts that have been previously evaluated for a specific ontology (good/bad)
General Taxonomy of Knowledge
User
Query by keywords
1
Query Expansion Process
Extended Query
Term-Document Matrix - Observed Frequency - TDM-OF Building Process
Extended query: original keywords + other concepts + selected nodes from the GTK (ontologies)
In parallel, each web search result is processed:
1. Pre-processing
• Tokenization
• Filters (special characters and lower case)
• Stop words removal
• Identify the language
• Stemming (English / Spanish)
2. For each document, accumulate the observed frequency of each term
3. Mark the document as processed
Independent Threads
Term-Document Matrix (Observed Frequency)
2
Google API
Yahoo! API
Bing API
TDM-OF Building Process
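A minimal Python sketch of this step: one worker per search result preprocesses the snippet and accumulates its observed term frequencies. fetch_snippets() is a stand-in for the Google / Yahoo! / Bing API calls, which are not shown here, and the preprocessing is reduced to tokenization plus lower-casing.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def fetch_snippets(extended_query):
    # Hypothetical stub returning snippet texts for the extended query
    return ["Web clustering engines group search results by topic",
            "Clustering of web search results with ontologies"]

def observed_frequencies(snippet):
    tokens = snippet.lower().split()           # tokenization + lower-case filter only
    return Counter(tokens)                     # observed frequency of each term

def build_tdm_of(extended_query):
    snippets = fetch_snippets(extended_query)
    with ThreadPoolExecutor() as pool:         # independent threads, one per document
        return list(pool.map(observed_frequencies, snippets))

for row in build_tdm_of("web clustering"):
    print(dict(row))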
Concept-Document Matrix - Observed Frequency - CDM-OF Building Process
CDM-OF Building Process
Concept-Document Matrix (Observed Frequency)
In parallel, for each document marked as processed:
1. Join terms belonging to the same concept in the selected specific ontologies (from extended query)
2. Accumulate the observed frequencies of the terms joined into the same concept
3. End this process when all web search results are processed - thread synchronization -
Specific Ontology
Thread Synchronization
3
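A minimal sketch of the term-to-concept join: terms that the ontology maps to the same concept are merged and their observed frequencies accumulated. TERM_TO_CONCEPT is a hypothetical stand-in for the selected specific ontology.

from collections import Counter

TERM_TO_CONCEPT = {"car": "automobile", "auto": "automobile", "vehicle": "automobile"}

def concept_frequencies(term_row):
    concept_row = Counter()
    for term, freq in term_row.items():
        concept_row[TERM_TO_CONCEPT.get(term, term)] += freq   # unmapped terms stay as they are
    return concept_row

print(dict(concept_frequencies(Counter({"car": 2, "auto": 1, "engine": 3}))))
# {'automobile': 3, 'engine': 3}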
Concept-Document Matrix (CDM) Building Process
Concept-Document Matrix (CDM)
4
CDM-OF Building Process
1. Calculate the weight (TF-IDF) of each concept in each document
w_{i,j} = \frac{freq_{i,j}}{\max_l(freq_{l,j})} \cdot \log\left(\frac{N}{n_j} + 1\right)
Clustering Process
Three algorithms of our own:
1. A hybridization of Global-Best Harmony Search with the K-means algorithm
2. A memetic algorithm with niching techniques (restricted competition replacement and restrictive mating)
3. A memetic algorithm (roulette wheel, K-means, and replace-the-worst)
All algorithms:
4. Define the number of clusters automatically (BIC)
5. Can use a standard Term-Document Matrix (TDM), Frequent Term-Document Matrix (FTDM), Concept-Document Matrix (CDM) or Frequent Concept-Document Matrix (FCDM)
6. Tested with data sets based on Reuters-21578 and DMOZ
7. Tested by users
5
Clustering Process
Clustered Documents
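The three in-house algorithms above (Global-Best Harmony Search + K-means and the two memetic algorithms) are not reproduced here. The sketch below only illustrates the idea of choosing the number of clusters automatically with BIC, using scikit-learn's Gaussian mixture on toy document vectors as a generic stand-in.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.3, size=(20, 5))      # toy document vectors with 3 hidden groups
               for center in (0.0, 2.0, 4.0)])

best_k, best_bic = None, float("inf")
for k in range(2, 6):
    model = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic = model.bic(X)                                     # lower BIC = better fit/complexity trade-off
    if bic < best_bic:
        best_k, best_bic = k, bic

print("number of clusters selected by BIC:", best_k)       # 3 for this toy data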
Labeling Process
Statistically Representative Terms:
1. Initialize algorithm parameters
2. Building of the "Others" label and cluster
3. Candidate label induction
4. Eliminate repeated terms
5. Visual improving of labels

Frequent Phrases:
6. Conversion of the representation
7. Document concatenation
8. Complete phrase discovery
9. Final selection
10. Building of the "Others" label and cluster
11. Cluster label induction

Overlapping clusters
6
Labeling Process
Clustered and Labeled Documents
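A simplified Python illustration of the "statistically representative terms" idea: label each cluster with the top-weighted terms of its centroid. The full labeling process (the "Others" cluster, repeated-term elimination, frequent phrases) is not reproduced here.

import numpy as np

def cluster_labels(weights, assignments, terms, top_n=2):
    weights, assignments = np.asarray(weights, float), np.asarray(assignments)
    labels = {}
    for c in sorted(set(assignments.tolist())):
        centroid = weights[assignments == c].mean(axis=0)   # average TF-IDF profile of the cluster
        labels[c] = [terms[j] for j in np.argsort(-centroid)[:top_n]]
    return labels

W = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1],      # toy TF-IDF rows (documents x terms)
     [0.0, 0.7, 0.9], [0.1, 0.6, 0.8]]
print(cluster_labels(W, [0, 0, 1, 1], ["java", "clustering", "ontology"]))
# {0: ['java', 'clustering'], 1: ['ontology', 'clustering']}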
Visualization and Rating Process
On experimentation → for each cluster, the user answered:
• (Q1) whether the cluster label is in general representative of the cluster (much, little, or nothing)
• (Q2) whether the cluster is useful, moderately useful or useless.
Then, for each document in each cluster, the user answered:
• (Q3) whether the document matches the cluster (very well matching, moderately matching, or not matching)
• (Q4) whether the relevance (location) of the document in the cluster was adequate (adequate, moderately suitable, or inadequate).
Visualization and Rating Process
Clustered and Labeled Documents
User Profile
Visualization and Rating Process
In production → the user can indicate whether each document is useful (relevant) or not
Visualization and Rating Process
Clustered and Labeled Documents
User Profile
General Taxonomy of Knowledge
User Profile
Specific Ontology
0 … *
Inverted Index of Concepts
Proposed model
Collaborative Editing Process of Ontologies
WordNet
General Taxonomy of Knowledge
User Profile
Specific Ontology
0 … *
3. Supported by general ontologies
Inverted Index of Concepts
Editor
1. Select a node (with its associated ontology)
2. Edit the ontology: concepts, synonyms in different languages, relations, instances
4. Supported by concepts used by the user
Can be done automatically
5. Update the index automatically on save
Model of Web Clustering Engine
Enrichment with a Taxonomy,
Ontologies and User Information
Carlos Cobos-Lozada MSc. Ph.D. (c) [email protected] / [email protected]
Questions?