Web Page Language Identification Based on URLs

Reporter: 鄭志欣. Advisor: Hsing-Kuo Pao. Reference: E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), pages 176-188. ACM, 2008.

Page 1: Web Page Language Identification Based on URLs

Reporter: 鄭志欣
Advisor: Hsing-Kuo Pao

Page 2: Reference

E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), pages 176-188. ACM, 2008.

Page 3: Outline

  • Introduction
  • Language Identification Based on URLs
  • Experimental Setup
  • Experimental Results
  • Conclusions

Page 4: Introduction

  • Given only the URL of a web page, can we identify its language?
      • Web crawlers
      • Personalized web browser
  • We consider the problem of determining the language of a web page using only its URL.
      • English, French, German, Spanish, and Italian
      • .com (60%), .org (10%)
      • www.wasserbett-test.com

Page 5: Introduction

  • Applying machine learning techniques
  • Features
      • Word features
      • N-gram features
      • Custom-made features
  • Machine learning algorithms
      • Naïve Bayes
      • Decision Tree
      • Relative Entropy
      • Maximum Entropy

Page 7: Extracting Feature Vectors

  • Words as features
      • Remove “www”, “index”, “html”, etc.
      • For example, http://www.internetwordstats.com/africa2.htm is split into: internetwordstats, com, africa
      • “cnn” and “gov” are indicative of English
      • “produits” and “recherche” are indicative of French
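
The word-splitting step on this slide can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function name `url_to_tokens` is hypothetical, and the stop list is an assumption, since the slide names only “www”, “index”, and “html” before trailing off with “etc.”; “http”, “https”, and “htm” are added here so that the slide's example comes out as shown.

```python
import re

# Assumed stop list: the slide names "www", "index", "html" and then "etc.".
# "http", "https" and "htm" are added only so the slide's example works out.
SPECIAL_WORDS = {"www", "index", "html", "htm", "http", "https"}

def url_to_tokens(url):
    """Split a URL into lowercase runs of letters and drop the special words."""
    # Digits and punctuation act as separators, so "africa2" yields "africa".
    tokens = re.findall(r"[a-z]+", url.lower())
    return [t for t in tokens if t not in SPECIAL_WORDS]

print(url_to_tokens("http://www.internetwordstats.com/africa2.htm"))
# ['internetwordstats', 'com', 'africa']
```

Each remaining token would then serve as one word feature for the classifier.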

Page 8: Trigrams as features

  • Start with the same tokens as in the method above (words as features)
  • E.g., “weather” yields “_we”, “wea”, “eat”, “ath”, “the”, “her”, “er_”
  • “_th” and “ing” are very common in English
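
A short sketch of the trigram extraction, assuming each token is padded with a single underscore on both sides, which is what the “weather” example on the slide suggests. The function name `token_trigrams` is hypothetical.

```python
def token_trigrams(token, pad="_"):
    """Return all character trigrams of a token padded with one boundary mark per side."""
    padded = f"{pad}{token}{pad}"
    # Slide a window of width 3 across the padded token.
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(token_trigrams("weather"))
# ['_we', 'wea', 'eat', 'ath', 'the', 'her', 'er_']
```

Because the padding marks word boundaries, “_th” (a word starting with “th”) is a different feature from a “th” occurring mid-word.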

Page 9: Custom-made features

  • Top-level domain country code
  • OpenOffice dictionaries
  • Dictionary with city names
  • Number of hyphens
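
One way the four custom-made features could be computed is sketched below. Everything specific here is an assumption for illustration: the tiny word and city sets stand in for the real OpenOffice and city-name dictionaries, and the function name `custom_features` and the feature keys are hypothetical.

```python
import re
from urllib.parse import urlparse

# Toy stand-ins (assumptions) for the OpenOffice and city-name dictionaries.
TOY_DICTIONARIES = {
    "en": {"weather", "news", "test"},
    "de": {"wasserbett", "wetter", "test"},
}
TOY_CITIES = {"en": {"london"}, "de": {"berlin"}}

def custom_features(url):
    """Compute TLD, hyphen count, and per-language dictionary and city hits."""
    # urlparse only finds the host when the URL includes a scheme such as http://.
    host = urlparse(url).hostname or ""
    tokens = re.findall(r"[a-z]+", url.lower())
    features = {
        "tld": host.rsplit(".", 1)[-1],
        "hyphens": url.count("-"),
    }
    for lang, words in TOY_DICTIONARIES.items():
        features[f"dict_{lang}"] = sum(t in words for t in tokens)
    for lang, cities in TOY_CITIES.items():
        features[f"city_{lang}"] = sum(t in cities for t in tokens)
    return features

print(custom_features("http://www.wasserbett-test.com/berlin"))
```

For this URL the German dictionary and city counts are higher than the English ones even though the top-level domain is .com.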

Page 10: Classification Algorithms

  • Country code top-level domain only (ccTLD)
  • Country code top-level domain plus (ccTLD+)
  • Naïve Bayes (NB)
  • Decision Trees (DT)
  • Relative Entropy (RE)
  • Maximum Entropy (ME)
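
To make one of the listed algorithms concrete, here is a minimal Naïve Bayes sketch over trigram features. It is a generic multinomial Naïve Bayes with add-one smoothing, not necessarily the variant used in the paper; the class name `TrigramNaiveBayes` and the two-URL training set are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def trigrams(token):
    """Character trigrams of a token padded with one underscore per side."""
    padded = f"_{token}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

class TrigramNaiveBayes:
    def __init__(self):
        self.counts = defaultdict(Counter)  # language -> trigram counts
        self.docs = Counter()               # language -> number of training URLs
        self.vocab = set()                  # all trigrams seen in training

    def fit(self, examples):
        """examples: iterable of (list_of_tokens, language) pairs."""
        for tokens, lang in examples:
            self.docs[lang] += 1
            for tok in tokens:
                for gram in trigrams(tok):
                    self.counts[lang][gram] += 1
                    self.vocab.add(gram)
        return self

    def predict(self, tokens):
        """Return the language with the highest smoothed log-probability."""
        grams = [g for tok in tokens for g in trigrams(tok)]
        total_docs = sum(self.docs.values())
        best_lang, best_score = None, -math.inf
        for lang in self.docs:
            total = sum(self.counts[lang].values())
            score = math.log(self.docs[lang] / total_docs)  # class prior
            for g in grams:
                # Add-one smoothing keeps unseen trigrams from zeroing the product.
                score += math.log((self.counts[lang][g] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

# Made-up training data, far smaller than any realistic training set.
train = [
    (["weather", "news"], "English"),
    (["wetter", "nachrichten"], "German"),
]
model = TrigramNaiveBayes().fit(train)
print(model.predict(["wetter"]))   # German
print(model.predict(["weather"]))  # English
```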

Page 12: Data Sets

  • The algorithms were evaluated on three different data sets:
      • Open Directory Project
      • Microsoft’s Live Search
      • 1260 pages from a large web crawl, labeled by hand

Page 13: Data set sizes

Data set                Language  Training size  Test size
Open Directory Project  English   145,000        4,910
                        German    144,999        4,965
                        French    144,996        4,961
                        Spanish   144,974        4,878
                        Italian   144,987        4,933
Search Engine Results   English   99,992         999
                        German    99,572         992
                        French    99,549         997
                        Spanish   99,838         997
                        Italian   99,786         997
Web Crawl               English   0              1,082
                        German    0              81
                        French    0              57
                        Spanish   0              19
                        Italian   0              21

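Page 15: Precision, recall, and F-measure

  • Precision: $P = \dfrac{n_{+}\, p(+ \mid +)}{n_{+}\, p(+ \mid +) + n_{-}\,\bigl(1 - p(- \mid -)\bigr)}$
  • Recall: $R = p(+ \mid +)$, the fraction of positive test examples classified as positive
  • $p(- \mid -)$: the fraction of negative test examples classified as negative
  • F-measure: $F = \dfrac{2}{1/R + 1/P}$

Here $n_{+}$ and $n_{-}$ are the numbers of positive and negative test examples.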
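
The formulas on this slide can be checked numerically with a few lines of code. The function names and the input numbers below are made up for illustration and are not results from the paper.

```python
def precision(n_pos, n_neg, p_pos_given_pos, p_neg_given_neg):
    """P = n+ p(+|+) / (n+ p(+|+) + n- (1 - p(-|-)))."""
    true_positives = n_pos * p_pos_given_pos
    false_positives = n_neg * (1.0 - p_neg_given_neg)
    return true_positives / (true_positives + false_positives)

def f_measure(recall, prec):
    """F = 2 / (1/R + 1/P), the harmonic mean of recall and precision."""
    return 2.0 / (1.0 / recall + 1.0 / prec)

# Hypothetical test set: 1000 positives, 4000 negatives.
R = 0.90                                 # recall, p(+|+)
P = precision(1000, 4000, R, 0.95)       # 900 / (900 + 200)
print(round(P, 4), round(f_measure(R, P), 4))
```

With these numbers, 900 true positives compete with 200 false positives, so precision falls to about 0.82 even though 95% of the negatives are rejected. This is why precision depends on the class balance $n_{+} : n_{-}$ while $p(+ \mid +)$ and $p(- \mid -)$ do not.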

Page 16: Human Performance

Page 17: Baseline: ccTLD
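
The ccTLD baseline amounts to a lookup on the URL's country-code top-level domain. The sketch below assumes a small illustrative mapping; the slides do not list which country codes are assigned to which language, and the function name `cctld_baseline` is hypothetical.

```python
from urllib.parse import urlparse

# Illustrative mapping only (an assumption): the slides do not say which
# country codes are assigned to which language.
CCTLD_TO_LANGUAGE = {
    "uk": "English",
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
}

def cctld_baseline(url):
    """Return the language suggested by the URL's country-code TLD, or None."""
    # urlparse only finds the host when the URL includes a scheme such as http://.
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_TO_LANGUAGE.get(tld)

print(cctld_baseline("http://www.example.de/index.html"))  # German
print(cctld_baseline("http://www.wasserbett-test.com"))    # None
```

The second call shows the limitation of the baseline: a German-looking URL under .com receives no label at all.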

Page 22: Conclusions

  • This paper shows that high-quality language identifiers for web pages can be built based on URLs alone.
  • The largest challenge is to identify English-looking URLs of non-English web pages.