Intelligent Database Systems Lab
Advisor: Dr. Hsu
Graduate: Chien-Shing Chen
Author: Satoshi Oyama
Takashi Kokubo
Toru Ishida
國立雲林科技大學National Yunlin University of Science and Technology
Domain-Specific Web Search with Keyword Spices
IEEE Transactions on Knowledge and Data Engineering, Jan. 2004
Outline
Motivation
Objective
Introduction
Domain-specific web search with keyword spices
Algorithm for extracting keyword spices
Experiments
Conclusions
Opinion
N.Y.U.S.T.
I.M.
Motivation
Naïve queries may find many irrelevant pages.
Obtaining more relevant pages depends on considerable experience and skill.
Previous domain-specific search engines collect and index relevant pages.
They are constructed manually, which is costly and hard to scale.
Objective
Build domain-specific search engines that return pages relevant to a certain domain and filter out irrelevant web pages.
1-1. Introduction
Domain-specific web search engines: looking for a recipe.
Entering only ‘beef’ finds few recipes. Entering ‘beef pepper’ finds other recipes.
Example queries: "beef" (牛肉) and "beef, pepper" (牛肉、胡椒)
1-2. Introduction
1-3. Introduction
1-4. Introduction
Domain-specific search engines return pages relevant to a certain domain and filter out irrelevant web pages.
Download both relevant and irrelevant pages, then classify them using a decision tree.
2-1. Domain-specific web search with keyword spices
Domain-specific web search is treated as a text classification problem.
Sample web pages are collected under an assumption about the user’s input.
2-1. Domain-specific web search as a text classification
D: the set of all web documents
Dt: the set of documents relevant to a certain domain
Take the set of all keywords in the domain.
The hypothesis space is composed of all Boolean expressions in which each keyword is regarded as a Boolean variable.
A Boolean expression of keywords can be regarded as a function from D to {0, 1}.
A keyword variable is 1 if the keyword is contained in the document and 0 otherwise.
Document   Words in the domain (1 = contained, 0 = not contained)   Output
1          1 1 0 0 0                                                1
2          0 1 0 1 1                                                0
3          0 1 1 0 0                                                1
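As a minimal sketch of the idea above, a Boolean expression of keywords can be written as a function from a document to 0 or 1. The helper names (`has`, `AND`, `OR`, `NOT`), the sample documents, and the expression itself are illustrative assumptions, not taken from the paper.

```python
from typing import Callable, Set

Document = Set[str]                     # a document reduced to the set of words it contains
Hypothesis = Callable[[Document], int]  # a function from documents to {0, 1}


def has(keyword: str) -> Hypothesis:
    """Boolean variable for one keyword: 1 if the document contains it, else 0."""
    return lambda doc: 1 if keyword in doc else 0


def NOT(h: Hypothesis) -> Hypothesis:
    return lambda doc: 1 - h(doc)


def AND(*hs: Hypothesis) -> Hypothesis:
    return lambda doc: int(all(h(doc) for h in hs))


def OR(*hs: Hypothesis) -> Hypothesis:
    return lambda doc: int(any(h(doc) for h in hs))


# Hypothetical expression: "recipe" AND NOT "home"
h = AND(has("recipe"), NOT(has("home")))

# Hypothetical documents, each given as its set of words
docs = [
    {"beef", "recipe", "tablespoon"},
    {"beef", "home", "recipe"},
    {"beef", "price"},
]
outputs = [h(d) for d in docs]
print(outputs)
```

Each document becomes a row of 0/1 keyword values, and the expression yields the output column.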
Finding hypothesis h that minimizes the error rate:
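The formula for this slide did not survive extraction. One common way to write such an error rate is sketched here, assuming a target function $f$ that labels a document 1 when it belongs to $D_t$ and 0 otherwise. The symbol $f$ and this exact form are assumptions, not necessarily what the slide showed.

```latex
\mathrm{error}(h) \;=\; \frac{\bigl|\{\, d \in D : h(d) \neq f(d) \,\}\bigr|}{|D|},
\qquad
f(d) =
\begin{cases}
1 & \text{if } d \in D_t,\\
0 & \text{otherwise.}
\end{cases}
```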
2-2. Collecting sample web pages by user’s input
Random sampling of web pages is difficult.
Assume all candidate keywords have the same probability of occurrence.
In the “recipe domain”, input “beef,” “salmon,” “potato,” etc. as sample keywords and download the same number of web pages for each keyword.
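The sampling step can be sketched as follows, assuming a search function is available that returns page URLs for a keyword. The `search` callable, `fake_search`, and the example URLs are hypothetical stand-ins, not a real search engine API.

```python
from typing import Callable, Dict, List


def collect_samples(
    keywords: List[str],
    search: Callable[[str, int], List[str]],
    per_keyword: int,
) -> Dict[str, List[str]]:
    """Download the same number of result pages for every sample keyword."""
    return {kw: search(kw, per_keyword)[:per_keyword] for kw in keywords}


def fake_search(query: str, limit: int) -> List[str]:
    """Hypothetical stand-in for a general-purpose search engine."""
    return [f"http://example.com/{query}/{i}" for i in range(limit)]


samples = collect_samples(["beef", "salmon", "potato"], fake_search, per_keyword=3)
for keyword, urls in samples.items():
    print(keyword, len(urls))
```

Taking an equal number of pages per keyword reflects the stated assumption that every candidate keyword is equally likely.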
3-1. Identifying keyword spices
Classify the sample pages into two classes, T or F, by hand.
Apply a decision tree learning algorithm to discover keyword spices.
Each node is an attribute, the value of a branch indicates the value of the attribute, and each leaf is a class.
Example path: no “tablespoon”, has “recipe”, no “home”, no “top”, so class T.
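To illustrate how a learned tree turns into a Boolean query, the sketch below collects every root-to-leaf path ending in class T as one conjunction and joins the conjunctions with OR. The tuple-based tree encoding and the tree itself are hypothetical; only the single path named above comes from the slide.

```python
from typing import List, Tuple, Union

# A leaf is "T" or "F"; an inner node is (keyword, subtree_if_absent, subtree_if_present).
Tree = Union[str, Tuple[str, "Tree", "Tree"]]
Literal = Tuple[str, bool]  # (keyword, True if the keyword must be present)


def paths_to_T(tree: Tree, prefix: Tuple[Literal, ...] = ()) -> List[List[Literal]]:
    """Return one conjunction of literals for every path that ends in class T."""
    if isinstance(tree, str):
        return [list(prefix)] if tree == "T" else []
    keyword, absent, present = tree
    return (paths_to_T(absent, prefix + ((keyword, False),))
            + paths_to_T(present, prefix + ((keyword, True),)))


def to_query(dnf: List[List[Literal]]) -> str:
    """Render a disjunction of conjunctions as a readable Boolean string."""
    def lit(keyword: str, present: bool) -> str:
        return keyword if present else f"NOT {keyword}"
    return " OR ".join(
        "(" + " AND ".join(lit(k, p) for k, p in conj) + ")" for conj in dnf
    )


# Hypothetical tree containing the path from the slide
tree = ("tablespoon",
        ("recipe", "F",
         ("home", ("top", "T", "F"), "F")),
        "T")

dnf = paths_to_T(tree)
query = to_query(dnf)
print(query)
```

The result is a Boolean expression in disjunctive normal form.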
3-1. Extracting keyword spices
Web pages collected by the user’s input keywords, with the output classified by humans:
Document   Words in the domain   Output
d1         1 1 0 0 0             1
d2         0 1 0 1 1             0
d3         0 1 1 0 0             1
3-2. Simplifying keyword spices
Induced decision trees are very large.
Search engines cannot accept queries that are too complex.
Large trees also suffer from the overfitting problem.
Simplify the induced Boolean expression h:
1. For each conjunction c in h, remove keywords (Boolean literals) from c to simplify it.
2. Remove conjunctions from the disjunctive normal form h to simplify it.
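The two steps above can be sketched as a greedy pruning loop over a validation set. The acceptance rule used here (keep a removal when the F-measure on the validation data does not decrease), the data, and all function names are assumptions for illustration, not the paper's exact procedure.

```python
from typing import List, Set, Tuple

Literal = Tuple[str, bool]          # (keyword, True if the keyword must be present)
DNF = List[List[Literal]]           # disjunction of conjunctions
Sample = Tuple[Set[str], bool]      # (words in the page, hand-assigned relevance)


def matches(dnf: DNF, doc: Set[str]) -> bool:
    return any(all((k in doc) == present for k, present in conj) for conj in dnf)


def f_measure(dnf: DNF, validation: List[Sample]) -> float:
    """Harmonic mean of precision and recall on the validation samples."""
    tp = sum(1 for doc, rel in validation if rel and matches(dnf, doc))
    if tp == 0:
        return 0.0
    retrieved = sum(1 for doc, _ in validation if matches(dnf, doc))
    relevant = sum(1 for _, rel in validation if rel)
    p, r = tp / retrieved, tp / relevant
    return 2 * p * r / (p + r)


def simplify(dnf: DNF, validation: List[Sample]) -> DNF:
    dnf = [list(conj) for conj in dnf]
    # Step 1: try to drop literals from each conjunction.
    for i in range(len(dnf)):
        for lit in list(dnf[i]):
            if len(dnf[i]) == 1:
                break  # never leave an empty conjunction, it would match everything
            trial = dnf[:i] + [[l for l in dnf[i] if l != lit]] + dnf[i + 1:]
            if f_measure(trial, validation) >= f_measure(dnf, validation):
                dnf = trial
    # Step 2: try to drop whole conjunctions.
    for conj in list(dnf):
        if len(dnf) == 1:
            break
        trial = [c for c in dnf if c is not conj]
        if f_measure(trial, validation) >= f_measure(dnf, validation):
            dnf = trial
    return dnf


# Hypothetical expression and validation data
h = [[("recipe", True), ("home", False), ("top", False)], [("tablespoon", True)]]
validation = [
    ({"recipe", "beef"}, True),
    ({"tablespoon", "salt"}, True),
    ({"recipe", "home"}, False),
    ({"beef", "price"}, False),
    ({"recipe", "top", "salmon"}, True),
]
simplified = simplify(h, validation)
print(simplified)
```

In this toy data the literal NOT "top" excludes a relevant page, so dropping it raises F and the shorter conjunction is kept.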
Precision P and recall R are defined over the validation data.
F is the harmonic mean of P and R.
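The slide's formulas were images and are not in the extracted text. The standard definitions are sketched here, assuming a validation set $V$ (a symbol introduced for illustration) and writing $h(d) = 1$ when hypothesis $h$ retrieves document $d$.

```latex
P = \frac{\bigl|\{\, d \in V : h(d) = 1 \,\} \cap D_t\bigr|}{\bigl|\{\, d \in V : h(d) = 1 \,\}\bigr|},
\qquad
R = \frac{\bigl|\{\, d \in V : h(d) = 1 \,\} \cap D_t\bigr|}{\bigl| V \cap D_t \bigr|},
\qquad
F = \frac{2PR}{P + R}
```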
greater contribution to F
F can be generalized to a weighted harmonic mean of P and R.
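A widely used weighted form is sketched here, assuming a weight parameter $\beta$ (the slide's own notation is not recoverable). Setting $\beta = 1$ gives the plain harmonic mean, $\beta > 1$ favors recall, and $\beta < 1$ favors precision.

```latex
F_\beta = \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R}
```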
4. Experiments
4-1. Experiments: extracting keyword spices
In the recipe domain, the sample pages were split randomly.
Keyword spices discovered for a recipe search engine
Trade-off between precision and recall
When , keyword spices extracted for the domain of …
4-2. Evaluation using a general-purpose search engine
to test queries in each domain
Precision values of the sample queries conjoined with “recipe”
Adding only the keyword “recipe” (for example, “beef recipe”) finds fewer relevant pages than the query with the keyword spice.
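The two query styles compared above can be sketched as string construction: the user's keyword is conjoined either with the single word "recipe" or with a whole keyword spice. The spice shown and the AND/OR/NOT syntax are hypothetical; real search engines differ in the Boolean syntax they accept.

```python
from typing import List, Tuple

Literal = Tuple[str, bool]  # (keyword, True if the keyword must be present)


def spice_to_string(dnf: List[List[Literal]]) -> str:
    """Render a keyword spice (disjunction of conjunctions) as a Boolean string."""
    conjunctions = []
    for conj in dnf:
        terms = [k if present else f"NOT {k}" for k, present in conj]
        conjunctions.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(conjunctions)


def with_spice(user_query: str, dnf: List[List[Literal]]) -> str:
    """Conjoin the user's query with the domain's keyword spice."""
    return f"{user_query} AND ({spice_to_string(dnf)})"


# Hypothetical keyword spice for the recipe domain
spice = [[("tablespoon", True)], [("recipe", True), ("home", False)]]

baseline = "beef AND recipe"          # the "beef recipe" style of query
spiced = with_spice("beef", spice)    # the query extended with the keyword spice
print(baseline)
print(spiced)
```

Because the spice is a disjunction, it can also retrieve relevant pages that never mention the word "recipe".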
4-3. Comparison to the filtering model
precision values of the sample queries in the filtering model
numbers of relevant pages returned by the …
For example, for “shrimp” the filtering model must download 5 pages to obtain one relevant result, so it is quite inefficient.
5. Future Work
Having humans classify the training examples is costly.
1. Using a Web Directory as a Source for Training Examples
Web directories such as Yahoo, Open Directory, …, … estimate bias
2. Learning Classifiers from Partially Labeled Data
A proposed algorithm augments a small set of labeled examples with a huge set of unlabeled ones.
6.Conclusion
keyword spices human
Cost, effective
Opinion
The method depends heavily on human effort.
It assumes all candidate keywords have the same probability of occurrence ……
Pr(TL)?Pr(TL’)?
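$$\Pr(W_i \cap TL) = \Pr(TL)\,\Pr(W_i \mid TL) = \Pr(W_i)\,\Pr(TL \mid W_i)$$

$$\Pr(TL \mid W_i) = \frac{\Pr(W_i \cap TL)}{\Pr(W_i)} = \frac{\Pr(W_i \cap TL)}{\Pr(W_i \cap TL) + \Pr(W_i \cap TL')}$$

$$\Pr(TL' \mid W_i) = \frac{\Pr(W_i \cap TL')}{\Pr(W_i)} = \frac{\Pr(W_i \cap TL')}{\Pr(W_i \cap TL) + \Pr(W_i \cap TL')}$$

$$\frac{\Pr(W_i \mid TL)}{\Pr(W_i \mid TL')} = \frac{\Pr(W_i \cap TL)\,/\,\Pr(TL)}{\Pr(W_i \cap TL')\,/\,\Pr(TL')}$$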
• Posterior Probability Rule

$$\frac{\Pr(W_i \mid TL)}{\Pr(W_i \mid TL')} = \frac{\Pr(TL \mid W_i)}{\Pr(TL' \mid W_i)} \cdot \frac{\Pr(TL')}{\Pr(TL)}$$

Limit terms shown on the slide: $\lim_{x \to 0} f(x)$ and $\lim_{x \to 1} f(x)$
The method assumes all candidate keywords have the same probability of occurrence.
Keyword Spices Modified
Information Retrieval
Machine Learning (clustering, classification)
Content Web Mining
Dictionary that can represent the distance between words