Takuya FUNAHASHI and Hayato YAMANA (yamana@acm.org)

Computer Science and Engineering Div., Waseda Univ. JAPAN

Have you ever experienced this?

Put a query and click the “Search” button.

Click the “Next” button.

8,870 to 2,070? Where have 6,800 web pages gone?

How can we have reliable hit counts?

We want to make this clear.

Agenda:

PART1: Importance of Search Engines’ hit counts

PART2: Related Work

PART3: Experiment and Trustworthiness of hit counts

PART4: Conclusions

Introduction

1.IMPORTANCE OF SEARCH ENGINES’ HIT COUNTS

Background: How important are hit counts?

Many studies are based on hit counts.

Research using hit counts:

- Translation support
  Al-Onaizan, Y. and Knight, K.: “Translating named entities using monolingual and bilingual resources”, Proc. of the 40th Ann. Meeting on Association for Computational Linguistics, pp.400-408 (2002).

- Calculating similarity between words or sentences
  Cilibrasi, R.L. and Vitanyi, P.M.B.: “The Google Similarity Distance”, IEEE Trans. on Knowledge and Data Engineering, Vol.19, No.3, pp.370-383 (2007).

- Evaluation of the quality of web pages
  Gelman, I.A. and Barletta, A.L.: “A “quick and dirty” website data quality indicator”, Proc. of the 2nd ACM Workshop on Information credibility on the web, pp.43-46 (2008).

NLP research using web data is mostly based on search engines’ hit counts.

Search engines’ hit counts have now become one of the indispensable web resources.

Example: “Honto? Search” system [4] by Y. Yamamoto et al. (fact search system)

It utilizes the differences among hit counts.

Hit count vs. the number of returned web documents

• Difference between the hit count on the 1st SERP and the number of returned web documents

[Figure: three plots, one each for Google, MSN, and Yahoo! JAPAN; axes: hit counts on the 1st SERP and number of returned web documents]

Experiment using 1,000 queries in Nov. 2008

How often does the relationship turn over?

• Using 1,000 queries, we compare the relationship, i.e., which query has the larger hit count, for every pair of the 1,000 queries.

CASE1: comparing the hit counts on the 1st SERP.

CASE2: comparing the numbers of returned web documents.

[Figure: bar chart of the turn-over ratio of the number of results of every pair, for Google, Yahoo! JAPAN, and Live Search; vertical axis from 0 to 12; data labels 11.1%, 5.9, and 2.6]

We should make clear the trustworthiness of search engines’ hit counts.
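The pairwise comparison above can be sketched as follows. This is only an illustration under an assumption: a pair of queries is counted as "turned over" when CASE1 and CASE2 order the two queries in strictly opposite ways. The slide does not state how ties are handled, and the function name and the sample numbers are made up for this sketch.

```python
from itertools import combinations


def turn_over_ratio(first_serp_hits, returned_docs):
    """Fraction of query pairs whose ordering flips between the two measures.

    Assumption: a pair "turns over" only if the two orderings are strictly
    opposite; ties in either measure are not counted as a flip.
    """
    if len(first_serp_hits) != len(returned_docs):
        raise ValueError("both lists must cover the same queries")
    flipped = 0
    total = 0
    for i, j in combinations(range(len(first_serp_hits)), 2):
        total += 1
        by_hits = first_serp_hits[i] - first_serp_hits[j]  # CASE1 ordering
        by_docs = returned_docs[i] - returned_docs[j]      # CASE2 ordering
        # Opposite signs mean the two measures disagree on which query is larger.
        if by_hits * by_docs < 0:
            flipped += 1
    return flipped / total if total else 0.0


# Made-up numbers for three queries, not taken from the experiment.
hits = [9000, 5000, 1200]
docs = [200, 640, 100]
print(turn_over_ratio(hits, docs))
```

With 1,000 queries this loop visits 499,500 pairs, which is still cheap to compute directly.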

2.RELATED WORK

Related work

• Googleology is Bad Science
– Kilgarriff, A.: “Googleology is Bad Science”, Computational Linguistics, Vol.33, Issue 1, pp.147-151 (2007).
Queries repeated the following day gave hit counts that varied by more than 10%.

• Quantitative comparisons of search engine results
– Thelwall, M.: “Quantitative comparisons of search engine results”, J. of the American Society for Information Science and Technology, Vol.59, Issue 11, pp.1702-1710 (2008).
He compared three search engines, i.e., the differences among the three.

• Investigation of the accuracy of search engine hit counts
– Uyar, A.: “Investigation of the accuracy of search engine hit counts”, J. of Information Science, Vol.35, Issue 4, pp.469-480 (2009).
He compared three search engines to find out which search engine returns the most accurate hit count. He assumed that the number of returned web documents is correct.

Related work

None of the previous studies identifies what the reliable hit count of a search engine is.

3.EXPERIMENT AND TRUSTWORTHINESS OF HIT COUNT

We cannot trust search engines’ hit counts

Put a query and click the “Search” button.

Click the “Next” button.

Where have the 6,800 pages gone?

Sometimes, the hit count decreases to 1/10 or increases to 10 times.

Purpose of Our Research

To provide a “reliable hit count in a search engine” for research that uses it

Cues of hit count ‘dance’ (change):

CASE1: Clicking the “Search” button many times

CASE2: Clicking the “Next” button step by step to reach the last page of the search results

CASE3: Searching with the same query on different days

Experimental Setting

Target: search engines (APIs)

API Settings: non-phrase search, normal safe filter, Japanese language

Queries: 10,000 queries, provided by Yahoo! JAPAN as the top 10,000 frequent queries in December 2007.

Experimental Period: From October 2009 to December 2009.

Case 1: Clicking the “Search” button many times

[Diagram: each query (query1, query2, ..., query10000) is submitted 100 times in 5 min., giving 100 hit counts and one CV per query, e.g. the CV of “query1”]

1. Submit the same query to each SE 100 times in 5 min.

2. Calculate the coefficient of variation (CV) from the 100 hit counts.

$$ \mathrm{CV} = \frac{\text{standard deviation}}{\text{average}} = \frac{\sqrt{\text{variance}}}{\text{average}} $$

3. Execute 1 and 2 for all queries, then produce the histogram of the CVs.
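Steps 2 and 3 can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the population standard deviation (the slide does not say whether the population or the sample form was used), and the function names and the 1% bin width are choices made for this sketch.

```python
from statistics import mean, pstdev


def coefficient_of_variation(hit_counts):
    """CV = standard deviation / average of one query's repeated hit counts.

    Assumption: population standard deviation (pstdev); the slide does not
    specify population vs. sample.
    """
    avg = mean(hit_counts)
    if avg == 0:
        return 0.0  # all counts are zero, so there is no variation
    return pstdev(hit_counts) / avg


def cv_histogram(cvs, bin_width=0.01):
    """Count CVs per bin; key b covers [b * bin_width, (b + 1) * bin_width)."""
    histogram = {}
    for cv in cvs:
        b = int(cv / bin_width)
        histogram[b] = histogram.get(b, 0) + 1
    return histogram


# Made-up hit counts for two queries (the experiment used 100 counts per query).
samples = {"query1": [100, 100, 100], "query2": [90, 110]}
cvs = [coefficient_of_variation(counts) for counts in samples.values()]
print(cv_histogram(cvs))
```

A CV below 0.05 corresponds to the "less than 5.0%" reading used on the histogram slide.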

Histogram of the coefficient of variation

[Figure: histogram of the CVs for each search engine]

Almost all of the search engines’ CVs are less than 5.0%.

We may ignore the dance in Case 1.

Case 2: Clicking the “Next” button step by step

[Diagram: each query (query1, query2, ..., query10000) is submitted, and the “Next” button is clicked through the 1st SERP, 2nd SERP, ..., 100th SERP]

1. Submit one query to the search engine to get HitCount(1,10), HitCount(11,20), ..., HitCount(991,1000).

2. Calculate the Deep hit count Vector (DV).

DV = [Equation: definition of DV]

Example hit counts: 20,000 for results 1-10; 20,000 for results 11-20; ...; 12,000 for results 991-1000.

3. Execute 1 and 2 for all queries, then apply k-means clustering to the set of DVs.

Note: Case 2 is applied only to Bing and Yahoo! because the Google API does not return 1,000 results.
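As a rough illustration of step 2, the sketch below builds a vector from the hit counts observed at each result offset. The slide does not preserve the exact definition of the DV, so this is an assumption: each element is taken to be the hit count at that offset divided by the hit count on the 1st SERP, in line with the "Change Ratio" axis used in the later plots. The function name is hypothetical.

```python
def deep_hit_count_vector(hit_counts_by_serp):
    """Change ratio of each SERP's hit count relative to the 1st SERP.

    Assumption: DV element = HitCount at that offset / HitCount(1,10).
    The exact definition is not given in the slide text.
    """
    first = hit_counts_by_serp[0]
    if first == 0:
        raise ValueError("hit count on the 1st SERP is zero")
    return [count / first for count in hit_counts_by_serp]


# The three values shown in the slide's example (offsets 1-10, 11-20, 991-1000);
# the SERPs in between are elided on the slide and therefore omitted here.
example = [20000, 20000, 12000]
print(deep_hit_count_vector(example))  # [1.0, 1.0, 0.6]
```

Normalizing by the first hit count makes queries with very different popularity comparable before clustering, which is one plausible reason to cluster ratios rather than raw counts.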

Clustering (by using k-means)

• Objective
– extract transition patterns of the hit counts

• Method
– vary the number of clusters k from 1 to 6, then select the best size manually.
– choose the best size based on the following points:
1) the start offsets at which hit count dances begin are clearly different among clusters.
2) the curves of the change ratio are clearly different among clusters.
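The clustering step can be sketched with a plain Lloyd-style k-means in pure Python. This is an illustrative stand-in, not the implementation used in the experiment: the initialization (random sampling with a fixed seed), the iteration limit, and all function names are assumptions, and the final choice of k stays manual as described on the slide.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


def cluster_for_each_k(vectors, k_values=range(1, 7)):
    """Run k-means for k = 1..6 so the results can be inspected by hand."""
    return {k: k_means(vectors, k) for k in k_values if k <= len(vectors)}


# Made-up change-ratio vectors: two that stay flat and two that drop.
vectors = [[1.0, 1.0, 1.0], [1.0, 0.99, 1.0], [1.0, 0.5, 0.1], [1.0, 0.52, 0.12]]
centroids, labels = k_means(vectors, 2)
print(labels)
```

Plotting each centroid against the search offset would then give one curve per cluster, which is what the two selection criteria above are judged on.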

Transition of Hit Counts - Case 2

[Figure, (k=1): change ratio vs. search offset N]
Adjusting the final hit count to the number of actually returned web documents.

[Figure, (k=2): change ratio vs. search offset N]
Re-calculate hit count! Every 100 results, Yahoo! re-calculates the hit count.

Discussion (our assumption)

[Diagram: top-k results drawn from the Full Index]

At first, the hit count should be calculated by using top-k results.

When more results are requested, the search engine will use another set of top-k results to calculate the hit count.

The most reliable hit count will be the hit count when “k” is the largest number, excluding a count that has been adjusted, as with Bing.

Search Engines’ Pruned Index Architecture

[Diagram: three query paths through the Results Cache, the Pruned Index, the Low-level cache, and the Full Index: (1) the query hits the results cache; (2) the query does not hit the results cache, but hits the pruned index; (3) the query hits neither the results cache nor the pruned index]

Gleb Skobeltsyn, Flavio P. Junqueira, Vassilis Plachouras, Ricardo Baeza-Yates: “ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines”, SIGIR2008, pp.131-138
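The three query paths in this architecture can be pictured with a toy lookup cascade. This is a deliberately simplified sketch, not the ResIn design itself: the caches and indexes are modeled as plain dictionaries, a "hit" is reduced to key membership, and storing the answer in the results cache afterwards is an assumption of this sketch.

```python
def answer_query(query, results_cache, pruned_index, full_index):
    """Return (results, path) following the three paths named on the slide."""
    # Path 1: the query hits the results cache.
    if query in results_cache:
        return results_cache[query], "results cache"
    if query in pruned_index:
        # Path 2: results cache miss, pruned index hit.
        results = pruned_index[query]
        path = "pruned index"
    else:
        # Path 3: neither the results cache nor the pruned index can answer.
        results = full_index.get(query, [])
        path = "full index"
    # Assumption: the answer is stored in the results cache for later queries.
    results_cache[query] = results
    return results, path


# Made-up contents: the pruned index holds only part of what the full index holds.
cache = {}
pruned = {"a": [1]}
full = {"a": [1, 2], "b": [3]}
print(answer_query("a", cache, pruned, full))
print(answer_query("a", cache, pruned, full))
print(answer_query("b", cache, pruned, full))
```

In this toy setup the answer for "a" comes from the pruned index and contains fewer documents than the full index holds, which mirrors the assumption that a hit count computed from a partial structure can differ from the full one.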

Discussion (our assumption)

[Diagram: top-k results drawn from the Full Index]

At first, the hit count is calculated by using top-k results.

When more results are requested, the search engine will use another set of top-k results to calculate the hit count.

The most reliable hit count will be the hit count when “k” is the largest number, excluding a count that has been adjusted, as with Bing.

Reliable Hit Count in Case 2 is

HitCount(k, k+9), where k is defined as the largest number, usually 991 for a search having more than 1,000 matched pages. If a search engine adjusts the last hit count, we should use the hit count just before the adjusted count.
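The selection rule for Case 2 is small enough to write down directly. The sketch below assumes the caller already knows whether the engine adjusts its last hit count; the slide does not describe how to detect an adjustment automatically. The function name and the sample numbers are hypothetical.

```python
def reliable_hit_count_case2(hit_counts_by_serp, last_count_is_adjusted):
    """Pick HitCount(k, k+9) for the largest k, skipping an adjusted last count.

    hit_counts_by_serp: hit counts in SERP order, i.e. HitCount(1,10),
    HitCount(11,20), ... up to the last SERP that was reached.
    last_count_is_adjusted: True if the engine is known to adjust the last count.
    """
    if not hit_counts_by_serp:
        raise ValueError("no hit counts were collected")
    if last_count_is_adjusted and len(hit_counts_by_serp) >= 2:
        return hit_counts_by_serp[-2]  # the count just before the adjusted one
    return hit_counts_by_serp[-1]


# Made-up sequence in which the last value looks like an adjusted count.
counts = [20000, 20000, 12000, 950]
print(reliable_hit_count_case2(counts, last_count_is_adjusted=True))   # 12000
print(reliable_hit_count_case2(counts, last_count_is_adjusted=False))  # 950
```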

Case 3: Searching on different days

1. Submit queries to the search engine every day during 2 months. (From October 11, 2009, to December 12, 2009)

2. Calculate Vectors of Variational ratio (VV).

VV = [Equation: definition of VV]

3. Apply k-means clustering to the VVs, as in Case 2.

Clustering Result - Case 3

[Figure: clustering results for (k=4) and (k=5); vertical axis: change ratio; annotation: changes more than 30% within a week]

Clustering Result - Case 3

[Figure: clustering result for (k=3); vertical axis: change ratio; annotation: changes more than 30% within a week]

Conclusion of Case 3

[Figure: clustering results for (k=4), (k=5), and (k=3); vertical axis: change ratio]

Search engines have two phases:

“Dancing Hard Phase”: index update phase. During this phase, hit counts change by more than 30% within a week, according to our observation.

“Stable Phase”: Hit counts do not change by more than 30% within a week.

Reliable Hit Count in Case 3 is...

A hit count when a search engine is in the stable phase.
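A simple way to apply the 30% rule to a series of daily hit counts is sketched below. The slides do not define how the change within a week is measured, so this sketch assumes the relative spread (max - min) / min over a 7-day window. The function names and the sample series are hypothetical.

```python
def is_stable(daily_hit_counts, day, window=7, threshold=0.30):
    """True if the counts in the window ending at `day` vary by at most the threshold.

    Assumption: "change within a week" = (max - min) / min over the window.
    """
    start = max(0, day - window + 1)
    recent = daily_hit_counts[start:day + 1]
    lowest, highest = min(recent), max(recent)
    if lowest == 0:
        return highest == 0  # all zeros is stable; zero to non-zero is not
    return (highest - lowest) / lowest <= threshold


def last_stable_hit_count(daily_hit_counts, window=7, threshold=0.30):
    """Most recent hit count observed on a day with a full, stable window, or None."""
    for day in range(len(daily_hit_counts) - 1, window - 2, -1):
        if is_stable(daily_hit_counts, day, window, threshold):
            return daily_hit_counts[day]
    return None


# Made-up daily hit counts: steady for eight days, then a jump.
counts = [100, 102, 101, 99, 100, 103, 100, 100, 160, 158]
print(last_stable_hit_count(counts))  # 100
```

The last two days fall in a window that contains the jump from about 100 to 160, so they are skipped and the value from the last calm week is returned.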

4.CONCLUSIONS

Conclusion

Hit counts have become one of the indispensable web resources.

A reliable hit count can be the last hit count when a search engine is in the stable phase, in which the change is small (less than 30%) during one week.

Our next challenges:

Building a benchmark to correct/refine the hit count so that it becomes more reliable.

Providing a quality/reliability measure for the hit count.

THANK YOU FOR YOUR ATTENTION
