Identify Experts from a Domain of Interest
"Al. I. Cuza" University of Iași, Romania
Faculty of Computer Science
Context
Statistics
CriES 2010
Input data
System components
◦ Questions and answers pre-processing
◦ Pre-processing of interest areas
◦ Getting the list of experts
Results
Conclusions
Yahoo! Answers (Y!A): a multilingual, collaborative community service through which members can ask questions and receive answers
Google Ad Planner traffic statistics for Y!A, December 2009:
◦ 26,000,000 unique visitors (users) (US)
◦ 110,000,000 total visits (US)
Y!A accounts for between 1.03% and 1.7% of Yahoo! traffic
At present, the identification of experts is done semi-automatically
Automatic search for human experts in the multilingual context offered by the Yahoo! Answers network
Participants start from a collection of questions and answers and must identify the expert able to answer a new question
[Figure: system architecture. The initial Yahoo! Answers collections (en, fr, ge, sp) yield domain keywords after stop-word elimination; the initial user questions yield question keywords in the same way. The relevant words for questions and for domains feed a similarity score between questions and domains (Run 0 and Run 1). Run 2 uses the initial digraph.]
Initially we divided the original XML (over 800 MB) into 204 smaller files, one per category (the largest file was "Other – Internet", ~80 MB, and the smallest was "MSN", ~670 bytes)
Examples of categories obtained:
◦ Alergia, Alergias, Allergies
◦ Astronomy
◦ Biology
◦ Mathematics
◦ Monitors
◦ Paranormal
For every question in a category, we process the information in the <title> and <description> tags
First we removed the stop words and punctuation marks:
<topic lang="en">
<title>Do animals have feelings?</title>
<description>can an animal feel regrets , compassion, sad, fear etc?</description>
<category>Zoology</category>
<tokens>animals, feelings, animal, feel, regrets, compassion, sad, fear</tokens>
</topic>
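The token extraction shown above can be sketched in a few lines of Python. This is only an illustration: the stop-word list below is a small assumed one (the slides do not give the list actually used), and `tokenize` is a hypothetical helper name.

```python
import re

# Illustrative stop-word list (an assumption; the real list is not given in the slides).
STOP_WORDS = {"do", "have", "can", "an", "a", "the", "etc", "is", "of", "to", "and"}

def tokenize(title, description=""):
    """Lowercase the text, drop punctuation and stop words, keep word order."""
    text = f"{title} {description}".lower()
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    return [w for w in words if w not in STOP_WORDS]

tokens = tokenize("Do animals have feelings?",
                  "can an animal feel regrets , compassion, sad, fear etc?")
print(tokens)
```

With this assumed stop-word list, the sample question yields the same eight tokens as the <tokens> element above.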
For English topics we used WordNet:
<topic lang="en">
<title>What is the origin of "foobar"?</title>
<description>I want to know the meaning of the word and how to explain to my friends.</description>
<category>Programming&Design</category>
(1) <tokens>origin,foobar,meaning,word,explain,friends</tokens>
(2) <synonyms>descent,extraction,origination,inception,significance,signification,import,substance</synonyms>
</topic>
For other languages we first used the Google Translate service and then the English WordNet:
<topic lang="fr">
<title>ki connaitre l'histoire de l'aspirine?</title>
<description/>
<category>Biologie</category>
<questioner>u8620</questioner>
<answerer>u313460</answerer>
(1) <tokens> connaitre, histoire, aspirine</tokens>
(2.1) <tokens_en>know,history,aspirin</tokens_en>
(2.2) <synonyms_en>account,chronicle,story,acetylsalicylic acid,Bayer,Empirin,St. Joseph</synonyms_en>
(2.3) <synonyms>compte, chronique, l'histoire, l'acide acétylsalicylique, Bayer, Empirin, Saint- Joseph</synonyms>
</topic>
For each new question we calculate a similarity score between it and the existing answered questions from the same topic
The similarity score depends on the words shared in the <tokens> and <synonyms> tags
The solution = the first 10 experts, selected in descending order of similarity score
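A minimal sketch of this ranking step follows. The slides do not give the scoring formula, so the sketch assumes the score is simply the number of distinct shared words, and that each answerer is credited with the best score among the questions they answered. The function names, the dictionary layout and the ids `u2` and `u3` are hypothetical. Setting `use_synonyms=False` gives the tokens-only variant.

```python
from collections import defaultdict

def similarity(new_words, old_words):
    """Count the distinct words two questions share (an assumed scoring rule)."""
    return len(set(new_words) & set(old_words))

def rank_experts(new_question, answered, use_synonyms=True, top_n=10):
    """Return up to top_n answerer ids ordered by their best similarity score."""
    def words(question):
        extra = question.get("synonyms", []) if use_synonyms else []
        return question["tokens"] + extra

    best = defaultdict(int)
    for old in answered:
        if old["category"] != new_question["category"]:
            continue  # only compare questions from the same topic
        score = similarity(words(new_question), words(old))
        best[old["answerer"]] = max(best[old["answerer"]], score)
    # Highest score first; ties broken by user id to keep the order deterministic.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [user for user, score in ranked if score > 0][:top_n]

answered = [
    {"category": "Biology", "answerer": "u313460",
     "tokens": ["know", "history", "aspirin"],
     "synonyms": ["account", "chronicle", "story"]},
    {"category": "Biology", "answerer": "u2",
     "tokens": ["aspirin", "dose"], "synonyms": []},
    {"category": "Zoology", "answerer": "u3",
     "tokens": ["history", "aspirin"], "synonyms": []},
]
new_question = {"category": "Biology",
                "tokens": ["aspirin", "story"], "synonyms": ["history"]}
print(rank_experts(new_question, answered))
```

In this toy data the synonyms raise the score of `u313460` from 1 to 3 shared words, which is the kind of effect the synonym expansion is meant to have.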
Similar to Run 0:
For each new question we calculate a similarity score between it and the existing answered questions from the same topic
The solution = the first 10 experts, selected in descending order of similarity score
Difference: the similarity score depends only on the words shared in the <tokens> tag
In this case we used only the input digraph
<edge source="u765155" target="u52050"> <desc>1592994;Laptops & Notebooks</desc></edge>
For every topic and for every person we calculate the number of questions answered by that person in that topic (using the "target" attribute of each edge)
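The counting step can be sketched as follows. Several details are assumptions: the `<graph>` root element is hypothetical, the <desc> content is read as "question id;topic", and the "&" in the topic name is written as `&amp;` so that the sample is well-formed XML. The ids `u1`, `u2` and `u9` are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical sample; only the first edge comes from the slides.
SAMPLE = """<graph>
  <edge source="u765155" target="u52050"><desc>1592994;Laptops &amp; Notebooks</desc></edge>
  <edge source="u1" target="u52050"><desc>1;Laptops &amp; Notebooks</desc></edge>
  <edge source="u2" target="u9"><desc>2;Laptops &amp; Notebooks</desc></edge>
</graph>"""

def answers_per_topic(xml_text):
    """Map each topic to a Counter of how many questions each target user answered."""
    counts = defaultdict(Counter)
    for edge in ET.fromstring(xml_text).iter("edge"):
        # Assumed layout of <desc>: "<question id>;<topic name>"
        _question_id, topic = edge.findtext("desc").split(";", 1)
        counts[topic][edge.get("target")] += 1
    return counts

def top_experts(counts, topic, top_n=10):
    """Return the top_n users with the most answers in the given topic."""
    return [user for user, _ in counts[topic].most_common(top_n)]

counts = answers_per_topic(SAMPLE)
print(top_experts(counts, "Laptops & Notebooks"))
```

Because this variant needs no text processing at all, it only requires one pass over the edges.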
Run Id | Characteristics | Strict P@10 | Strict MRR | Lenient P@10 | Lenient MRR
0 | We eliminate stop words and we consider relevant keywords and their synonyms (using Google Translate and English WordNet) | 0.52 | 0.80 | 0.82 | 0.94
1 | We eliminate stop words and we consider only relevant keywords | 0.47 | 0.77 | 0.77 | 0.93
2 | We consider only the digraph provided by Yahoo | 0.62 | 0.84 | 0.83 | 0.94
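For readers unfamiliar with the two measures, the sketch below shows the standard definitions of precision at 10 (P@10) and mean reciprocal rank (MRR). It is not the official evaluation script. It assumes that the strict and lenient settings differ only in which experts are counted as relevant, and the user ids are placeholders.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the first k returned experts that are judged relevant."""
    return sum(1 for user in ranked[:k] if user in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant expert, or 0.0 if none is returned."""
    for position, user in enumerate(ranked, start=1):
        if user in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(results):
    """Average the reciprocal rank over (ranked list, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)

# Placeholder data: two questions with their returned experts and relevant sets.
first = (["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"], {"b", "c", "x"})
second = (["x", "y"], {"x"})

print(precision_at_k(*first))                 # two of ten returned experts are relevant
print(mean_reciprocal_rank([first, second]))  # average of 1/2 and 1/1
```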
Runs 2 and 0 obtained good results (expected for Run 0, unexpected for Run 2)
Execution time was a problem: our runs took a few hours
Future work is related to multilinguality:
◦ In our approach Allergies, Allergien, Alergias and Alergia are treated as different topics with different experts
◦ We are still looking for an algorithm to identify the best multilingual expert