Identify Experts from a Domain of Interest


“Al. I. Cuza” University of Iași, Romania

Faculty of Computer Science

Context
Statistics
CriES2010
Input data
System components
◦ Questions and answers pre-processing
◦ Pre-processing of interest areas
◦ Getting the list of experts
Results
Conclusions

Yahoo! Answers is a multilingual, collaborative community service through which members can ask questions and receive answers

Google Ad Planner traffic statistics for Y!A, December 2009:
◦ 26,000,000 unique visitors (users) (US)
◦ 110,000,000 total visits (US)

Y!A represents between 1.03% and 1.7% of Yahoo! traffic

At present, the identification of experts is done semi-automatically

Automatic search for human experts in the multilingual context offered by the Yahoo! Answers network

Participants start from a collection of questions and answers and must identify the experts able to answer a new question

[Figure: system components. Inputs: initial digraph; initial Yahoo! Answers collections (en, fr, ge, sp); initial user questions. Processing: eliminate stop words; domain keywords and question keywords; relevant words for domains and for questions; similarity score between questions and domains. Outputs: Run 0, Run 1, Run 2.]

Initially we divided the original XML file (over 800 MB) into 204 smaller files, one per category (the largest was “Other – Internet”, about 80 MB, and the smallest was “MSN”, about 670 bytes)
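The split by category can be sketched as follows. This is only an illustration: it assumes that every question is a <topic> element carrying a <category> child (as in the examples on the later slides), and the root element name "topics" and both function names are hypothetical. For a file of 800 MB a streaming parser such as `xml.etree.ElementTree.iterparse` would be more appropriate than loading the whole text.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_topics_by_category(xml_text):
    """Group <topic> elements by the text of their <category> child."""
    groups = defaultdict(list)
    for topic in ET.fromstring(xml_text).iter("topic"):
        category = topic.findtext("category", default="Unknown").strip()
        groups[category].append(topic)
    return groups


def category_documents(groups):
    """Build one XML string per category (hypothetical <topics> root)."""
    docs = {}
    for category, topics in groups.items():
        root = ET.Element("topics")
        root.extend(topics)
        docs[category] = ET.tostring(root, encoding="unicode")
    return docs
```

Each string in the returned dictionary could then be written to its own file, which would give one file per category.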

Examples of categories obtained:
◦ Alergia, Alergias, Allergies
◦ Astronomy
◦ Biology
◦ Mathematics
◦ Monitors
◦ Paranormal

For every question in a category, we process the text inside the <title> and <description> tags

First we removed the stop words and punctuation signs:

<topic lang="en">
  <title>Do animals have feelings?</title>
  <description>can an animal feel regrets , compassion, sad, fear etc?</description>
  <category>Zoology</category>
  <tokens>animals, feelings, animal, feel, regrets, compassion, sad, fear</tokens>
</topic>
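A minimal sketch of this step, assuming a simple regular-expression tokenizer. The stop-word list below is a deliberately tiny, hypothetical one chosen to cover the example; the slides do not say which list was actually used, and `extract_tokens` is an illustrative name.

```python
import re

# Hypothetical, very small stop-word list; a real run would need a fuller one.
STOP_WORDS = {
    "a", "an", "the", "do", "does", "can", "have", "has",
    "is", "are", "to", "of", "and", "or", "i", "my", "etc",
}


def extract_tokens(title, description):
    """Lowercase the text, drop punctuation and stop words, keep first occurrences."""
    text = f"{title} {description}".lower()
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    tokens = []
    for word in words:
        if word not in STOP_WORDS and word not in tokens:
            tokens.append(word)
    return tokens


tokens = extract_tokens(
    "Do animals have feelings?",
    "can an animal feel regrets , compassion, sad, fear etc?",
)
print(", ".join(tokens))
```

With this stop-word list, the call reproduces the content of the <tokens> tag in the example above.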

For English topics we used WordNet:

<topic lang="en">
  <title>What is the origin of "foobar"?</title>
  <description>I want to know the meaning of the word and how to explain to my friends.</description>
  <category>Programming&Design</category>
  (1) <tokens>origin,foobar,meaning,word,explain,friends</tokens>
  (2) <synonyms>descent,extraction,origination,inception,significance,signification,import,substance</synonyms>
</topic>
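The synonym expansion can be illustrated with a toy lookup table standing in for WordNet. The table below is an assumption: its entries are copied from the example above, and the grouping of synonyms under "origin" and "meaning" is our guess, not something the slides state. A real run would query WordNet itself.

```python
# Toy stand-in for WordNet; entries and grouping are assumptions for illustration.
TOY_THESAURUS = {
    "origin": ["descent", "extraction", "origination", "inception"],
    "meaning": ["significance", "signification", "import", "substance"],
}


def expand_with_synonyms(tokens, thesaurus=TOY_THESAURUS):
    """Return the synonyms of all tokens, without duplicates and without the tokens themselves."""
    synonyms = []
    for token in tokens:
        for synonym in thesaurus.get(token, []):
            if synonym not in tokens and synonym not in synonyms:
                synonyms.append(synonym)
    return synonyms


tokens = ["origin", "foobar", "meaning", "word", "explain", "friends"]
print(",".join(expand_with_synonyms(tokens)))
```

Tokens without an entry in the table, such as "foobar", simply contribute no synonyms.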

For other languages we first used the Google Translate service and then the English WordNet:

<topic lang="fr">
  <title>ki connaitre l'histoire de l'aspirine?</title>
  <description/>
  <category>Biologie</category>
  <questioner>u8620</questioner>
  <answerer>u313460</answerer>
  (1) <tokens> connaitre, histoire, aspirine</tokens>
  (2.1) <tokens_en>know,history,aspirin</tokens_en>
  (2.2) <synonyms_en>account,chronicle,story,acetylsalicylic acid,Bayer,Empirin,St. Joseph</synonyms_en>
  (2.3) <synonyms>compte, chronique, l'histoire, l'acide acétylsalicylique, Bayer, Empirin, Saint-Joseph</synonyms>
</topic>

For each new question we calculate a similarity score between it and the existing answered questions from the same topic

The similarity score depends on the words shared in the <tokens> and <synonyms> tags

The solution = the first 10 experts, selected in descending order of similarity score
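A hedged sketch of the scoring and ranking. The slides do not give the exact formula, so this version assumes the score is simply the number of shared words, and that an expert is ranked by the best score among the questions he or she answered. The dictionary layout, the function names and the `use_synonyms` switch are all hypothetical.

```python
def similarity(new_q, old_q, use_synonyms=True):
    """Count the words shared by two questions (assumed scoring rule)."""
    new_words = set(new_q["tokens"])
    old_words = set(old_q["tokens"])
    if use_synonyms:
        new_words |= set(new_q.get("synonyms", []))
        old_words |= set(old_q.get("synonyms", []))
    return len(new_words & old_words)


def top_experts(new_q, answered, k=10, use_synonyms=True):
    """Rank answerers of same-topic questions by their best similarity score."""
    best = {}
    for old_q in answered:
        score = similarity(new_q, old_q, use_synonyms)
        if score > 0:
            answerer = old_q["answerer"]
            best[answerer] = max(best.get(answerer, 0), score)
    # Descending score; ties broken by user id so the output is deterministic.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [answerer for answerer, _ in ranked[:k]]
```

Calling `top_experts` with `use_synonyms=False` restricts the comparison to the <tokens> words only.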

Similar to the previous run: for each new question we calculate a similarity score between it and the existing answered questions from the same topic

The solution = the first 10 experts, selected in descending order of similarity score

Difference: the similarity score depends only on the words shared in the <tokens> tag

In this case we used only the input digraph

<edge source="u765155" target="u52050"> <desc>1592994;Laptops & Notebooks</desc></edge>

For every topic and every person we calculate the number of questions answered by that person in that topic (using the “target” attribute)
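The counting step can be sketched as below. Several things are assumptions: the <graph> root element is hypothetical, the <desc> text is assumed to have the form "question id;topic", the ampersand is assumed to be escaped as `&amp;` so that the XML parses, and all edges except the first are made-up sample data with placeholder ids.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# First edge is from the slide; the others are invented for illustration.
SAMPLE = """<graph>
  <edge source="u765155" target="u52050"><desc>1592994;Laptops &amp; Notebooks</desc></edge>
  <edge source="u_demo1" target="u52050"><desc>0000001;Laptops &amp; Notebooks</desc></edge>
  <edge source="u_demo2" target="u_demo3"><desc>0000002;Laptops &amp; Notebooks</desc></edge>
  <edge source="u_demo1" target="u_demo3"><desc>0000003;Astronomy</desc></edge>
</graph>"""


def answers_per_topic(xml_text):
    """Map each topic to a Counter of answerer id -> number of answered questions."""
    counts = defaultdict(Counter)
    for edge in ET.fromstring(xml_text).iter("edge"):
        desc = edge.findtext("desc", default="")
        _, _, topic = desc.partition(";")  # text after the question id
        counts[topic.strip()][edge.get("target")] += 1
    return counts


def top_answerers(counts, topic, k=10):
    """Return up to k users with the most answers in the given topic."""
    return [user for user, _ in counts.get(topic, Counter()).most_common(k)]


counts = answers_per_topic(SAMPLE)
print(top_answerers(counts, "Laptops & Notebooks"))
```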


Run Id  Strict P@10  Strict MRR  Lenient P@10  Lenient MRR  Characteristics
0       0.52         0.80        0.82          0.94         We eliminate stop words and consider relevant keywords and their synonyms (using Google Translate and English WordNet)
1       0.47         0.77        0.77          0.93         We eliminate stop words and consider only relevant keywords
2       0.62         0.84        0.83          0.94         We consider only the digraph provided by Yahoo
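For reference, the two measures in the table can be computed as in the sketch below, which uses the standard definitions of precision at 10 (P@10) and mean reciprocal rank (MRR). The function names are illustrative, and the strict and lenient settings are not defined in these slides, so here they would only differ in which experts are passed in as relevant.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the first k returned experts that are relevant."""
    return sum(1 for expert in ranked[:k] if expert in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant expert, or 0.0 if none is returned."""
    for rank, expert in enumerate(ranked, start=1):
        if expert in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """Average reciprocal rank over (ranked list, relevant set) pairs."""
    if not results:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)
```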

Runs 2 and 0 obtained good results (expected for Run 0, unexpected for Run 2)

Execution time was a problem: our runs took a few hours

Future work is related to multilinguality:
◦ In our approach, Allergies, Allergien, Alergias and Alergia represent different topics with different experts
◦ We are still looking for an algorithm to identify the best multilingual expert
