Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org)

Computer Science and Engineering Div., Waseda Univ. JAPAN

Have you ever experienced this?

Enter a query and click the “Search” button.

Click the “Next” button.

8,870 to 2,070? Where have 6,800 web pages gone?

How can we have reliable hit counts?

We want to make this clear.

Agenda:

PART1: Importance of Search Engines’ hit counts

PART2: Related Work

PART3: Experiment and Trustworthiness of hit counts

PART4: Conclusions

1．IMPORTANCE OF SEARCH ENGINES’ HIT COUNTS

Background: How important is the hit count?

Many studies based on hit counts exist.

Research using hit counts:

- Translation support
Al-Onaizan, Y. and Knight, K.: “Translating named entities using monolingual and bilingual resources”, Proc. of the 40th Ann. Meeting on Association for Computational Linguistics, pp.400-408 (2002).

- Calculating similarity between words or sentences
Cilibrasi, R.L. and Vitanyi, P.M.B.: “The Google Similarity Distance”, IEEE Trans. on Knowledge and Data Engineering, Vol.19, No.3, pp.370-383 (2007).

- Evaluation of the quality of web pages
Gelman, I.A. and Barletta, A.L.: “A “quick and dirty” website data quality indicator”, Proc. of the 2nd ACM Workshop on Information credibility on the web, pp.43-46 (2008).

NLP research using web data is mostly based on search engines’ hit counts.

Search engines’ hit counts have now become one of the indispensable web resources.

Example: “Honto? Search” system [4] by Y. Yamamoto et al. (a fact search system)

It utilizes the difference among hit counts.

Hit count vs. the number of returned web documents

• Difference between the hit count on the 1st SERP and the number of returned web documents

[Figure: hit count against the number of returned web documents for Google, Yahoo! JAPAN, and MSN. Experiment using 1,000 queries in Nov. 2008.]

How often does the relationship turn over?

・Using 1,000 queries, compare the relationship, i.e. which query has the larger hit count, for every pair of the 1,000 queries.

CASE1: comparing the hit counts on the 1st SERP.

CASE2: comparing the numbers of returned web documents.

[Figure: turn-over ratio of the number of results of every pair, for Google, Yahoo! JAPAN, and Live Search; data labels 11.1% and 5.9%.]

We should make clear the trustworthiness of search engines’ hit counts.
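The pairwise comparison on this slide can be sketched in a few lines of Python. This is only an illustration under an assumption: a pair of queries is taken to "turn over" when its ordering by hit count on the 1st SERP (CASE1) is the opposite of its ordering by the number of returned web documents (CASE2). The function name `turn_over_ratio` and the sample numbers are hypothetical, not taken from the experiment.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def turn_over_ratio(hit_counts, returned_docs):
    """Fraction of query pairs whose ordering differs between the two measures.

    hit_counts[i]    : hit count on the 1st SERP for query i (CASE1)
    returned_docs[i] : number of web documents actually returned for query i (CASE2)
    Ties in either measure are not counted as a turn-over (an assumption).
    """
    if len(hit_counts) != len(returned_docs):
        raise ValueError("both lists must cover the same queries")
    pairs = list(combinations(range(len(hit_counts)), 2))
    if not pairs:
        return 0.0
    turned = sum(
        1
        for i, j in pairs
        if sign(hit_counts[i] - hit_counts[j]) * sign(returned_docs[i] - returned_docs[j]) < 0
    )
    return turned / len(pairs)


# Made-up values for three queries, for illustration only.
example_hit_counts = [8870, 5000, 300]
example_returned = [207, 640, 300]
example_ratio = turn_over_ratio(example_hit_counts, example_returned)
```

With 1,000 queries this loops over 499,500 pairs, which is still cheap enough to run directly.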

2．RELATED WORK

Related work

• Googleology is Bad Science
– Kilgarriff, A.: “Googleology is Bad Science”, Computational Linguistics, Vol.33, Issue 1, pp.147-151 (2007).

Queries repeated the following day gave hit counts that varied by more than 10%.

• Quantitative comparisons of search engine results
– Thelwall, M.: “Quantitative comparisons of search engine results”, J. of the American Society for Information Science and Technology, Vol.59, Issue 11, pp.1702-1710 (2008).

He compared three search engines, i.e. the differences among the three.

• Investigation of the accuracy of search engine hit counts
– Uyar, A.: “Investigation of the accuracy of search engine hit counts”, J. of Information Science, Vol.35, Issue 4, pp.469-480 (2009).

He compared three search engines to find out which one returns the most accurate hit count.

He assumed that the number of returned web documents is correct.

Related work

None of the previous studies establishes what the reliable hit count in a search engine is.

3．EXPERIMENT AND TRUSTWORTHINESS OF HIT COUNT

We cannot trust search engines’ hit counts

Enter a query and click the “Search” button.

Click the “Next” button.

Where have the 6,800 pages gone?

Sometimes, the hit count decreases to 1/10 or increases to 10 times.

Purpose of Our Research

To provide a “reliable hit count in a search engine” for research that uses it.

Cues of hit count ‘dance’ (change):

CASE1: Clicking the “Search” button many times

CASE2: Clicking the “Next” button step by step to reach the last page of the search results

CASE3: Searching with the same query on different days

Experimental Setting

Target: search engine APIs (Google, Bing, Yahoo!)

API Settings: non-phrase search, normal-safe filter, Japanese language

Queries: 10,000 queries, provided by Yahoo! JAPAN as the top 10,000 frequent queries in December 2007.

Experimental Period: From October 2009 to December 2009.

Case 1: Clicking the “Search” button many times

[Diagram: each query (query1, query2, ..., query10000) is submitted 100 times in 5 min., giving 100 hit counts and one CV per query.]

1. Submit the same query to each SE 100 times in 5 min.

2. Calculate the coefficient of variation (CV) from the 100 hit counts.

\[ \mathrm{CV} = \frac{\text{standard deviation}}{\text{average}} = \frac{\sqrt{\text{variance}}}{\text{average}} \]

3. Execute 1 and 2 for all queries, then produce a histogram of the CVs.
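Steps 2 and 3 can be sketched as follows. This is a minimal illustration, assuming the population standard deviation is meant (the slide does not say whether population or sample variance was used) and assuming 1% wide histogram bins. The names `coefficient_of_variation` and `cv_histogram` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def coefficient_of_variation(hit_counts):
    """CV = standard deviation / average for one query's repeated hit counts.

    Uses the population standard deviation (an assumption).
    """
    avg = mean(hit_counts)
    if avg == 0:
        return 0.0  # all hit counts are zero, so there is no variation to report
    return pstdev(hit_counts) / avg


def cv_histogram(cvs, bin_width=0.01):
    """Count CVs per bin; bin n covers [n * bin_width, (n + 1) * bin_width)."""
    return Counter(int(cv // bin_width) for cv in cvs)


# Made-up hit counts for two queries (the experiment uses 100 submissions per query).
cv_constant = coefficient_of_variation([100, 100, 100])
cv_dancing = coefficient_of_variation([90, 110])
```

A query whose CV stays below 0.05 corresponds to the "less than 5.0%" region discussed on the histogram slide.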

Histogram of the coefficient of variation

For all search engines, almost all CVs are less than 5.0%.

We may ignore the dance in Case 1.

Case 2: Clicking the “Next” button step by step

1. Submit one query to the search engine to get HitCount(1,10), HitCount(11,20), ..., HitCount(991,1000), i.e. the hit counts on the 1st SERP, 2nd SERP, ..., 100th SERP.

2. Calculate the Deep hit count Vector (DV).

Example: DV = (20,000 for 1-10, 20,000 for 11-20, ..., 12,000 for 991-1000)

3. Execute 1 and 2 for all queries (query1, query2, ..., query10000), then apply k-means clustering to the set of DVs.

Note: Case 2 is applied only to Bing and Yahoo! because the Google API does not return 1,000 results.
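Collecting one DV per query, as described in steps 1 and 2, could look like the sketch below. The search call is deliberately left abstract: `fetch_hit_count(query, start, end)` is a hypothetical callable standing in for whichever search API is used, and `change_ratios` (each hit count divided by the first one) is an assumed normalisation, not something the slide defines.

```python
def deep_hit_count_vector(query, fetch_hit_count, max_results=1000, page_size=10):
    """Return [HitCount(1,10), HitCount(11,20), ..., HitCount(991,1000)].

    fetch_hit_count(query, start, end) is a caller-supplied function (hypothetical)
    that returns the hit count reported on the SERP covering results start..end.
    """
    return [
        fetch_hit_count(query, start, start + page_size - 1)
        for start in range(1, max_results + 1, page_size)
    ]


def change_ratios(dv):
    """Normalise a DV by its first element (an assumed definition of 'change ratio')."""
    first = dv[0]
    if first == 0:
        raise ValueError("first hit count is zero; ratio is undefined")
    return [hit_count / first for hit_count in dv]


def fake_fetch(query, start, end):
    """Stand-in for a real API: reports 20,000 hits until the last SERP, then 12,000."""
    return 20000 if start < 991 else 12000


example_dv = deep_hit_count_vector("query1", fake_fetch)
example_ratios = change_ratios(example_dv)
```

Normalising before clustering keeps queries with very different absolute hit counts comparable, which is one plausible reason to cluster on ratios rather than raw counts.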

Clustering (by using k-means)

• Objective
– extract transition patterns of the hit counts

• Method
– vary the number of clusters k from 1 to 6, then select the best k manually.
– choose the best k based on the following points:
1) the start offsets at which hit count dances begin are clearly different among clusters.
2) the curves of change ratio are clearly different among clusters.
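The procedure above, running k-means for k = 1 to 6 and then inspecting the clusters by hand, can be sketched with a small pure-Python version of Lloyd's algorithm. This is an illustration only: the slides do not say which k-means implementation, distance, or initialisation was used, so the random initialisation, squared Euclidean distance, and the names `kmeans` and `cluster_for_all_k` are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def cluster_for_all_k(vectors, max_k=6):
    """Run k-means for k = 1..max_k so the results can be compared manually."""
    return {k: kmeans(vectors, k) for k in range(1, min(max_k, len(vectors)) + 1)}


# Made-up change-ratio curves: two flat ones and two that drop sharply.
example_vectors = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 0.9],
    [1.0, 0.5, 0.1],
    [1.0, 0.4, 0.1],
]
example_results = cluster_for_all_k(example_vectors)
```

Each centroid is itself a curve over search offsets, so plotting the centroids for every k is a direct way to check the two selection criteria listed above.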

Transition of Hit Counts - Case 2

[Figure: transition of the hit count against search offset N, one panel per search engine.]

Bing: adjusts the final hit count to the number of web documents actually returned.

Yahoo!: re-calculates the hit count every 100 results.

Discussion (our assumption)

[Diagram: successive top-k result sets taken from the full index.]

At first, the hit count is calculated by using the top-k results.

When more results are requested, the search engine will use another set of top-k results to calculate the hit count.

The most reliable hit count will be the hit count when “k” is the largest, provided it has not been adjusted, as Bing does.

Search Engines’ Pruned Index Architecture

[Diagram: components are the Results Cache, the Pruned Index, a Low-level cache, and the Full Index. A query takes one of three paths:
- it hits the results cache;
- it does not hit the results cache, but hits the pruned index;
- it hits neither the results cache nor the pruned index.]

Gleb Skobeltsyn, Flavio P. Junqueira, Vassilis Plachouras, Ricardo Baeza-Yates: “ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines”, SIGIR2008, pp.131-138

Discussion (our assumption)

Reliable Hit Count in Case 2 is HitCount(k, k+9), where k is the largest available start offset, usually 991 for a search having more than 1,000 matched pages.

If a search engine adjusts the last hit count, we should use the hit count just before the adjusted count.
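The selection rule for Case 2 is simple enough to write down directly. In this sketch the caller must already know whether the engine adjusts its last hit count (as Bing is described as doing). The flag `last_is_adjusted` and the function name `reliable_hit_count` are hypothetical, and the sample numbers are made up.

```python
def reliable_hit_count(dv, last_is_adjusted=False):
    """Pick the reliable hit count from a Deep hit count Vector.

    dv               : hit counts ordered by search offset, e.g.
                       [HitCount(1,10), HitCount(11,20), ..., HitCount(991,1000)]
    last_is_adjusted : True if the engine replaces the final hit count with the
                       number of documents actually returned (caller-supplied).
    """
    if not dv:
        raise ValueError("DV is empty")
    if last_is_adjusted and len(dv) >= 2:
        return dv[-2]  # the hit count just before the adjusted one
    return dv[-1]  # the hit count at the largest offset


# Made-up vectors for illustration.
unadjusted_choice = reliable_hit_count([20000, 20000, 12000])
adjusted_choice = reliable_hit_count([20000, 15000, 987], last_is_adjusted=True)
```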

Case 3: Searching on different days

1. Submit queries to the search engine every day during 2 months (from October 11, 2009, to December 12, 2009).

2. Calculate Vectors of Variational ratio (VV).

3. Apply k-means clustering to VVs, as in Case 2.

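Clustering Result - Case 3

[Figure: clustering results, panels labelled (k=4) and (k=5); annotation: changes more than 30% within a week.]

Clustering Result - Case 3

[Figure: clustering results, panels labelled (k=3) and (k=4); annotation: changes more than 30% within a week.]

Conclusion of Case 3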

Search engines have two phases:

“Dancing Hard Phase”: the index update phase. During this phase, hit counts change by more than 30% within a week, according to our observation.

“Stable Phase”: hit counts do not change by more than 30% within a week.

Reliable Hit Count in Case 3 is...

A hit count obtained when the search engine is in the stable phase.
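A stable-phase check based on the 30% rule might be sketched as below. The slides do not define how the weekly change is measured, so this example assumes it is the deviation of each day's hit count from the first day of a 7-day window. The names `is_stable` and `stable_start_days` and the sample series are hypothetical.

```python
def is_stable(window, threshold=0.30):
    """True if no hit count in the window deviates from the first day's
    hit count by more than the threshold (assumed definition of 'change')."""
    base = window[0]
    if base == 0:
        return all(hit_count == 0 for hit_count in window)
    return all(abs(hit_count - base) / base <= threshold for hit_count in window)


def stable_start_days(daily_hit_counts, days=7, threshold=0.30):
    """Indices of days that start a stable window of the given length."""
    return [
        start
        for start in range(len(daily_hit_counts) - days + 1)
        if is_stable(daily_hit_counts[start:start + days], threshold)
    ]


# Made-up daily hit counts: a quiet week followed by a jump of roughly 50%.
example_daily = [1000, 1010, 990, 1005, 1000, 995, 1002, 1500, 1490]
example_stable = stable_start_days(example_daily)
```

Hit counts sampled inside a window reported by `stable_start_days` would be the ones treated as reliable under this reading of the Case 3 conclusion.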

4．CONCLUSIONS

Conclusion

Hit counts have become one of the indispensable web resources.

A reliable hit count can be the last hit count obtained when a search engine is in the stable phase, where the change during one week is less than 30%.

Our next challenges:

- Building a benchmark to correct/refine hit counts so that they become more reliable.

- Providing a quality/reliability measure for hit counts.

THANK YOU FOR YOUR ATTENTION
