
25th NATIONAL RADIO SCIENCE CONFERENCE (NRSC 2008)

C52, March 18-20, 2008, Faculty of Engineering, Tanta Univ., Egypt

IMPROVED FOCUSED CRAWLING USING BAYESIAN OBJECT BASED APPROACH

Ahmed Ghozia and Hoda Sorour

Computer Eng. and Science Department

Faculty of Electronic Eng., Menofiya University

[email protected], [email protected]

Ashraf Aboshosha

Eng. Dept., NCRRT, Atomic Energy Authority

[email protected]

ABSTRACT

The rapid growth of the World Wide Web has made it difficult for general-purpose search engines, e.g. Google and Yahoo, to retrieve most of the relevant results in response to user queries. Vertical search engines specialized in a specific topic have therefore become vital. Vertical search engines are built with the help of a focused crawler. A focused crawler traverses the web, selecting pages relevant to a predefined topic and neglecting those outside it. The focused crawler is guided toward those relevant pages through a crawling strategy. In this paper, a new crawling strategy is presented that helps build a vertical search engine. With this strategy, the crawler stays focused on the user's interests in the topic. We build a model that describes the web page features that distinguish relevant web documents from irrelevant ones. This is accomplished through a supervised learning process: each web page is treated as an object having a set of features, and the feature values determine the relevancy of the page through a Bayesian model. Results from practical experiments demonstrate the efficiency of the proposed crawling strategy.

1. INTRODUCTION

General-purpose search engines (GPSEs) aim to collect all published web documents, indexing and ranking all these pages so as to ideally answer every query in any category. A crawler is a main component of a general-purpose search engine [1]; it traverses the Web, retrieving pages into the GPSE repository. From each crawled page, the crawler extracts new URLs and adds them to a queue of URLs to be retrieved, and so on. In this way, the crawler follows a breadth-first search algorithm: URLs are retrieved in the order they are discovered, and the goal is to download as many documents as possible in a given time. The web publishing rate is increasing so quickly that commercial search engines find it difficult to keep their indexes up to date, and users often find it difficult to locate useful, high-quality information on the web using them. As a potential solution, vertical search engines collect, index and provide search services in a specific domain, for example sports, nanotechnology or computer science papers. Focused crawlers [2] are the main element in building domain-specific search engines. They are agents, or spiders, that traverse the web collecting only data relevant to a predefined topic while at the same time neglecting off-topic pages. The crawler is kept focused through a crawling strategy, which determines the relevancy degree of the web page to the predefined topic; depending on this degree, a decision is made whether to download the page or not.
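To make the breadth-first frontier of the general-purpose crawler described above concrete, the following minimal Java sketch manages a FIFO queue of discovered URLs; the fetchAndExtractLinks helper is a hypothetical placeholder for downloading a page and parsing its anchors, not part of any crawler used in this paper.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    // Minimal sketch of a breadth-first crawler frontier (illustrative only).
    public class BreadthFirstCrawler {

        // Hypothetical placeholder: a real crawler would download the page
        // and parse its anchor tags; here we simply return no links.
        static List<String> fetchAndExtractLinks(String url) {
            return List.of();
        }

        public static void crawl(List<String> seeds, int maxPages) {
            Queue<String> frontier = new ArrayDeque<>(seeds); // URLs discovered but not yet fetched
            Set<String> visited = new HashSet<>();            // avoid fetching a URL twice
            int fetched = 0;

            while (!frontier.isEmpty() && fetched < maxPages) {
                String url = frontier.poll();                 // FIFO order = breadth-first
                if (!visited.add(url)) continue;
                fetched++;
                for (String link : fetchAndExtractLinks(url)) {
                    if (!visited.contains(link)) frontier.add(link);
                }
            }
        }

        public static void main(String[] args) {
            crawl(List.of("http://example.org/"), 100);
        }
    }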


In contrast to the GPSE crawler, the focused crawler does not address this performance problem; it aims to collect relevant pages focusing on a specific subject. In this paper, we propose a crawling strategy that keeps the crawler focused on the user-defined topic. A Bayesian probabilistic model is built in a supervised learning process. This model aims to capture the web page features that distinguish relevant pages from irrelevant ones, where a relevant web page is one considered relevant from the user's viewpoint. The rest of the paper is structured as follows. Section 2 reviews related research work. Section 3 describes our proposed domain-specific web page classifier. In Section 4, the evaluation methodology and the experimental results are presented, and in Section 5 we discuss our conclusions and future directions.

2. RELATED WORK

Focused crawling was first introduced by Chakrabarti et al. in 1999 [2]. Michael Chau et al. introduced the Hopfield Net Spider [3]; they modeled the web as a neural network in which the nodes are web pages and the links are hypertext links, and incorporated a spreading activation algorithm for knowledge discovery and retrieval. In [4], Yunming Ye et al. presented iSurfer, a focused crawler that uses an incremental method to learn a page classification model and a link prediction model; it uses an online sample detector to incrementally distill new samples from crawled web pages for online updating of the learned models. An association metric was introduced by S. Ganesh et al. in [5]; this metric estimates the semantic content of the URL based on a domain-dependent ontology, which in turn strengthens the metric used to prioritize the URL queue. George Almpanidis et al. [6] presented a latent semantic indexing classifier that combines link analysis with text content in order to retrieve and index domain-specific web documents. Milad Shokouhi et al. introduced Gcrawler [7], an intelligent crawler that uses a genetic algorithm to improve its crawling performance; it estimates the best crawling path and expands the initial keywords using the genetic algorithm.

3. PROPOSED CRAWLING STRATEGY

We aim to build a crawling strategy that guides the spider to collect web pages relevant to the user's interests. The user experience is taken into consideration because a document is not considered relevant until the user judges the page to be relevant. To accomplish this task, a randomly selected set of web pages is presented to the user. These pages represent a sample of the pages that the crawler will traverse on its path. Each web page is treated as an object having a set of features, and the feature values determine the relevancy of the page through a Bayesian model, leading to a Bayesian Object Based crawling strategy (the BOB crawler). These features include the following:

1. The existence of the topic keywords in the page URL (F1).
2. The existence of the topic keywords in the page title (F2).
3. The textual similarity between the web page and the focused crawler topic (F3).

Features F1 and F2 are calculated directly from the page URL and title; feature F3 is discussed in detail in Section 3.1.
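A minimal sketch of how the Boolean features F1 and F2 can be extracted is given below; the method names and the case-insensitive substring test are illustrative assumptions of ours, not the exact implementation used in the paper.

    import java.util.List;
    import java.util.Locale;

    // Illustrative extraction of features F1 and F2 (assumed substring matching).
    public class KeywordFeatures {

        // F1: does any topic keyword occur in the page URL?
        static boolean f1(String url, List<String> topicKeywords) {
            return containsAny(url, topicKeywords);
        }

        // F2: does any topic keyword occur in the page title?
        static boolean f2(String title, List<String> topicKeywords) {
            return containsAny(title, topicKeywords);
        }

        private static boolean containsAny(String text, List<String> keywords) {
            String lower = text.toLowerCase(Locale.ROOT);
            for (String kw : keywords) {
                if (lower.contains(kw.toLowerCase(Locale.ROOT))) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            List<String> kws = List.of("graphics", "rendering");
            System.out.println(f1("http://example.org/computer-graphics/intro.html", kws)); // true
            System.out.println(f2("Home page", kws));                                       // false
        }
    }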

The user then selects the pages that are relevant to the topic from his viewpoint. Two main questions are answered in this phase:

1. Which web pages have been chosen as relevant?
2. Which feature values distinguish the relevant pages from those selected as irrelevant?

With these two questions answered, a Bayesian model is built, as described in Section 3.2. This model acts as the crawling recommender, measuring the relevancy level of the crawled web pages.

3.1 Estimating the Textual Similarity

To capture the textual aspect of the specified topic in depth, a centroid representing the topic is created [8]. The topic name is sent as a query to the Google web search engine and the first 7 results are retrieved. This is done automatically using the formal SOAP API announced by Google [9]. The retrieved pages are parsed, stop words such as "the" and "is" are eliminated, words are stemmed using the Porter stemming algorithm [10], and the number of occurrences of each word is counted. The 20 words with the most occurrences are recorded and used to build the topic centroid. The centroid is a vector of 20 elements representing the term weights.
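The sketch below illustrates how such a centroid could be assembled once the top result pages have been fetched and reduced to plain text; the simple tokenizer, the small stop-word list and the identity "stemmer" stand in for the HTML parsing, stop-word removal and Porter stemming described above, and are assumptions of this illustration only.

    import java.util.*;
    import java.util.stream.Collectors;

    // Illustrative construction of a 20-term topic centroid from already-fetched page texts.
    public class TopicCentroid {

        // Assumption: a trivial stand-in for the Porter stemmer used in the paper.
        static String stem(String word) {
            return word;
        }

        static final Set<String> STOP_WORDS = Set.of("the", "is", "a", "an", "and", "of", "to", "in");

        // Count stemmed, non-stop-word occurrences over all retrieved topic pages.
        static Map<String, Integer> termCounts(List<String> pageTexts) {
            Map<String, Integer> counts = new HashMap<>();
            for (String text : pageTexts) {
                for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                    if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
                    counts.merge(stem(token), 1, Integer::sum);
                }
            }
            return counts;
        }

        // Keep the 20 most frequent terms and weight each by n_i / n_max, as in Eq. (1) below.
        static Map<String, Double> centroid(List<String> pageTexts) {
            Map<String, Integer> counts = termCounts(pageTexts);
            List<Map.Entry<String, Integer>> top = counts.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(20)
                    .collect(Collectors.toList());
            double nMax = top.isEmpty() ? 1.0 : top.get(0).getValue();
            Map<String, Double> weights = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> e : top) {
                weights.put(e.getKey(), e.getValue() / nMax);
            }
            return weights;
        }

        public static void main(String[] args) {
            List<String> pages = List.of(
                    "computer graphics rendering and shading",
                    "graphics pipelines render images for computer displays");
            System.out.println(centroid(pages));
        }
    }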


The term weights of the centroid, centroid = <C_1, C_2, ..., C_i, ..., C_20>, are computed as:

c_i = \frac{n_i}{n_{\max}}    (1)

where n_i is the term's number of occurrences and n_max is the frequency of the term with the most occurrences. A vector is then created for each web page of the training set; this vector contains 20 elements:

\mathrm{Page} = \langle V_1, V_2, \ldots, V_i, \ldots, V_{20} \rangle    (2)

where V_i represents the weight of the topic word W_i under the standard TF-IDF weighting scheme [11]. For each term, the weight is calculated as follows:

V_i = TF_i \cdot IDF_i    (3)

Term frequency (TF):

tf_i = \frac{n_i}{n_{\max}}    (4)

where n_i is the number of occurrences of the considered term and n_max is the count of the word with the maximum number of occurrences in the page.

Inverse document frequency (IDF):

IDF_i = \log \frac{D}{\left| \{ d : t_i \in d \} \right|}    (5)

where D is the total number of pages and |{d : t_i \in d}| is the number of pages containing the term t_i. The textual similarity between the page and the topic centroid is the cosine similarity between the tf-idf vector of the topic and the tf-idf vector of each web page:

\mathrm{sim}(c, v_j) = \frac{\sum_{i=1}^{20} c_i \, v_{ij}}{\sqrt{\sum_{i=1}^{20} c_i^2} \; \sqrt{\sum_{i=1}^{20} v_{ij}^2}}    (6)
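A minimal sketch of Equations (3)-(6) follows, assuming the centroid terms and their document frequencies are already known; the toy three-term vectors in main are illustrative only and stand in for the 20-term vectors used in the paper.

    // Illustrative computation of the F3 textual-similarity feature, Eqs. (3)-(6).
    public class TextualSimilarity {

        // Eq. (4): tf_i = n_i / n_max for one page.
        static double tf(int termCount, int maxCount) {
            return maxCount == 0 ? 0.0 : (double) termCount / maxCount;
        }

        // Eq. (5): idf_i = log(D / df_i), where df_i is the number of pages containing the term.
        static double idf(int totalPages, int pagesContainingTerm) {
            return pagesContainingTerm == 0 ? 0.0 : Math.log((double) totalPages / pagesContainingTerm);
        }

        // Eq. (3): v_i = tf_i * idf_i, computed for each centroid term.
        static double[] pageVector(int[] termCounts, int maxCount, int totalPages, int[] docFreqs) {
            double[] v = new double[termCounts.length];
            for (int i = 0; i < termCounts.length; i++) {
                v[i] = tf(termCounts[i], maxCount) * idf(totalPages, docFreqs[i]);
            }
            return v;
        }

        // Eq. (6): cosine similarity between the centroid vector c and the page vector v.
        static double cosine(double[] c, double[] v) {
            double dot = 0.0, normC = 0.0, normV = 0.0;
            for (int i = 0; i < c.length; i++) {
                dot += c[i] * v[i];
                normC += c[i] * c[i];
                normV += v[i] * v[i];
            }
            return (normC == 0 || normV == 0) ? 0.0 : dot / (Math.sqrt(normC) * Math.sqrt(normV));
        }

        public static void main(String[] args) {
            double[] centroid = {1.0, 0.5, 0.25};      // toy 3-term centroid
            int[] counts = {3, 0, 2};                  // occurrences of those terms in one page
            double[] page = pageVector(counts, 3, 100, new int[]{40, 10, 5});
            System.out.println(cosine(centroid, page));
        }
    }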

3.2 Developing the Bayesian Model

After calculating the web page features and recording the user's judgments on these web pages, the following probabilities can be calculated:


1. P(D): the prior probability that the web page is relevant; it is prior in the sense that it does not take into account any information about the features.

2. P(D^c): the prior probability that the web page is irrelevant.
3. P(F|D): the probability that the feature exists in a relevant web page.
4. P(F|D^c): the probability that the feature exists in an irrelevant web page.

These probabilities are fed to a Bayesian model that calculates the web page relevancy as follows:

P(D \mid F) = \frac{P(F \mid D) \cdot P(D)}{P(F \mid D) \cdot P(D) + P(F \mid D^c) \cdot P(D^c)}    (7)

where P(D|F) is the posterior probability that the page is relevant given the feature F, P(D) is the prior and P(F|D) is the likelihood.
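The sketch below shows how Equation (7) can be applied. The paper does not spell out how the three features are combined or how the probabilities are estimated, so the naive independence assumption, the binary treatment of the features and the toy probability estimates used here are assumptions of this illustration only.

    // Illustrative application of Eq. (7): posterior relevancy of a page given one feature,
    // plus a naive-independence combination over several features (an assumption of this sketch).
    public class BayesianRelevance {

        // Eq. (7) for a single binary feature F.
        static double posterior(double pD, double pFgivenD, double pFgivenDc) {
            double pDc = 1.0 - pD;
            double evidence = pFgivenD * pD + pFgivenDc * pDc;
            return evidence == 0.0 ? 0.0 : (pFgivenD * pD) / evidence;
        }

        // Assumption: features F1..Fn are treated as conditionally independent and
        // combined by multiplying their likelihoods before normalizing.
        static double posterior(double pD, double[] pFgivenD, double[] pFgivenDc, boolean[] observed) {
            double likeD = 1.0, likeDc = 1.0;
            for (int i = 0; i < observed.length; i++) {
                likeD  *= observed[i] ? pFgivenD[i]  : 1.0 - pFgivenD[i];
                likeDc *= observed[i] ? pFgivenDc[i] : 1.0 - pFgivenDc[i];
            }
            double evidence = likeD * pD + likeDc * (1.0 - pD);
            return evidence == 0.0 ? 0.0 : (likeD * pD) / evidence;
        }

        public static void main(String[] args) {
            // Toy estimates: P(D) from the fraction of pages the user marked relevant,
            // P(F|D) and P(F|D^c) from feature counts in the relevant / irrelevant sets.
            double pD = 0.3;
            double[] pFgivenD  = {0.7, 0.6, 0.8};   // F1, F2, F3 present in relevant pages
            double[] pFgivenDc = {0.2, 0.3, 0.4};   // F1, F2, F3 present in irrelevant pages
            boolean[] page = {true, false, true};   // features observed on a candidate page
            System.out.println(posterior(pD, pFgivenD, pFgivenDc, page));
        }
    }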

4. EXPERIMENTS AND ANALYSIS

4.1 Experiment Setup

The Bayesian model was built through a training phase implemented in Java. For the crawling module, Heritrix [12], the open-source Java crawler, was used. An HTML parser was implemented to extract the title, the URL and the textual content of the web pages. English words were stemmed using the well-known Porter stemming algorithm, and stop words such as "are" and "the" were removed. The crawler seed URLs were obtained by searching the Google and Yahoo! search engines; they were chosen to be of as high authority as possible. Two crawling strategies were implemented for performance comparison:

1. Naïve Bayesian strategy, in which web page relevancy is estimated using the well-known naive Bayesian classifier available through the BOW package library [13].
2. BOB strategy, in which web page relevancy is estimated by the proposed model and a decision is made whether to retrieve the page or not.

Crawled pages were evaluated by the evaluation module to determine their relevancy. In these experiments, we used only the vector space model to measure the similarity between downloaded web pages and the topic centroid.

4.2 Evaluation Method

Many metrics have been proposed to evaluate the performance of a focused crawler. Precision and recall have been used, but only over a predefined test bed [3]. Expert judgment is cited as the best way to evaluate focused crawler performance, but the large number of crawled web pages makes it impractical [2]. We employed the harvest rate to evaluate the crawled pages; it has been widely used as the performance metric in the focused crawling community [2, 4, 14, 15, 16, 17]. The harvest rate is defined as the average relevancy of the crawled pages:

HR = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Relevance}(URL_i, T)    (8)

where Relevance(URL_i, T) is a measure of the relevance of the page with URL_i to the target topic T, and N is the total number of crawled pages.
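A minimal sketch of the harvest-rate computation of Equation (8) is given below; the relevance scores are assumed to have already been produced by the evaluation module, so only the averaging step is shown.

    import java.util.List;

    // Illustrative computation of the harvest rate, Eq. (8):
    // the average relevance of the N crawled pages.
    public class HarvestRate {

        static double harvestRate(List<Double> relevanceScores) {
            if (relevanceScores.isEmpty()) return 0.0;
            double sum = 0.0;
            for (double r : relevanceScores) sum += r; // Relevance(URL_i, T) for each crawled page
            return sum / relevanceScores.size();       // divide by N, the number of crawled pages
        }

        public static void main(String[] args) {
            // Toy relevance scores for five crawled pages.
            System.out.println(harvestRate(List.of(0.9, 0.1, 0.4, 0.6, 0.3)));
        }
    }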

4.3 Experiments

Here, we provide practical performance measurements for four focused crawling tasks using the naive Bayesian crawling strategy and our BOB focused crawler. The target topics were computer graphics, electronics, Middle East news and motorcycle sales. For each topic, we constructed a seed set of 50 URLs from the pages returned by the Google and Yahoo search engines for the corresponding queries. Table 1 lists the harvest ratio of the naive Bayesian crawler and the BOB crawler for the pages retrieved for each topic.


The harvest ratios varied among the different topics and seed sets, possibly because of differences in the linkage density of pages under a particular topic or in the quality of the seed sets. For the two scientific topics, computer graphics (Figure 1) and electronics (Figure 2), the BOB focused crawler outperformed the naive Bayesian focused crawler by 19.75% and 52%, respectively; the improvement appeared especially after the first few hundred pages. The case was not the same for the Middle East news (Figure 3) and motorcycle sales (Figure 4) topics. For Middle East news, the BOB crawler still performed better, by 17.45%, which is less than the improvement obtained for the computer graphics and electronics topics. In the motorcycle sales experiment, the BOB crawler began crawling at a lower performance rate and gradually closed the gap until the performance of the two crawlers was approximately the same; overall, the naïve Bayesian crawler performed better by 37.6%. The higher performance of the BOB crawler on the scientific topics indicates that the crawler is better suited to scientific topics, which enjoy a well-defined literature and better-defined content, making it easier to build a model that describes the topic of concern. Scientific papers are usually published after a good amount of review, and their content tends to share keywords and phrases, which adds a degree of semantics to the published content.

Table 1. Comparison of the Naïve Bayesian and BOB crawlers.

Target Topic         Naïve Bayesian Crawler   BOB Crawler   Improvement (%)
Computer graphics    0.31228                  0.37396        +19.75
Electronics          0.22530                  0.34248        +52
Middle East news     0.43096                  0.50619        +17.45
Motorcycle sales     0.34348                  0.2496         -37.6

5. CONCLUSION

In this paper, we have proposed a user-based crawling strategy for focused crawling. Web pages are treated as objects having a set of features, and the aim is to predict which feature values indicate relevant pages. We have implemented the BOB focused crawler on this framework. Experiments on several topics have shown that our user-driven focused crawler improves the harvest rate. In the future, we plan to use more features that describe other aspects of the web page, especially metadata and linkage structure features. We will also study the dependency relationships that may exist among these features and their ability to predict web page relevancy.

6. REFERENCES

[1] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine", Computer Networks, Vol. 30(1-7), pp. 107-117, 1998.
[2] S. Chakrabarti, M. van den Berg and B. Dom, "Focused crawling: A new approach to topic-specific web resource discovery", Computer Networks, Vol. 31(11-16), pp. 1623-1640, 1999.
[3] M. Chau and H. Chen, "Comparison of Three Vertical Search Spiders", IEEE Computer, Vol. 36(5), pp. 56-62, 2003.
[4] Y. Ye, F. Ma, Y. Lu, M. Chiu and J. Huang, "iSurfer: A Focused Web Crawler Based on Incremental Learning from Positive Samples", APWeb, Springer, pp. 122-134, 2004.
[5] S. Ganesh, M. Jayaraj, V. Kalyan, S. Murthy and G. Aghila, "Ontology-based Web Crawler", IEEE Computer Society, Las Vegas, Nevada, USA, pp. 337-341, 2004.
[6] G. Almpanidis, C. Kotropoulos and I. Pitas, "Focused Crawling Using Latent Semantic Indexing - An Application for Vertical Search Engines", ECDL, pp. 402-413, April 2005.


[7] M. Shokouhi, P. Chubak and Z. Raeesy, "Enhancing Focused Crawling with Genetic Algorithms", ITCC (2), pp. 503-508, 2005.
[8] D. Bergmark, C. Lagoze and A. Sbityakov, "Focused crawls, tunneling, and digital libraries", ECDL, pp. 91-106, 2002.
[9] Google Inc., "Google SOAP Search API (beta)", http://code.google.com/apis/soapsearch/reference.html, 2007 (accessed September 10, 2007).
[10] M. F. Porter, "An algorithm for suffix stripping", Readings in Information Retrieval, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 313-316, 1997.
[11] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval", Inf. Process. Manage., Vol. 24(5), pp. 513-523, 1988.
[12] Internet Archive, "Heritrix open source web crawler", http://crawler.archive.org/, 2007 (accessed September 15, 2007).
[13] A. McCallum, "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering", http://www.cs.cmu.edu/~mccallum/bow, 2007 (accessed October 20, 2007).
[14] M. Jamali, H. Sayyadi, B. Hariri and H. Abolhassani, "A method for focused crawling using combination of link structure and content similarity", Web Intelligence, pp. 753-756, 2006.
[15] C. Su, Y. Gao, J. Yang and B. Luo, "An efficient adaptive focused crawler based on ontology learning", HIS, pp. 73-78, 2005.
[16] I. Altingövde and Ö. Ulusoy, "Exploiting interclass rules for focused crawling", IEEE Intelligent Systems, Vol. 19(6), pp. 66-73, 2004.
[17] N. Luo, W. Zuo, F. Yuan and C. Zhang, "A new method for focused crawler cross tunnel", RSKT, pp. 632-637, 2006.


Figure 1. Performance comparison of the two focused crawlers for the topic of computer graphics. Each segment is 100 pages.

Figure 2. Performance comparison of the two focused crawlers for the topic of electronics. Each segment is 100 pages.


Figure 3. Performance comparison of the two focused crawlers for the topic of Middle East news. Each segment is 100 pages.

Figure 4. Performance comparison of the two focused crawlers for the topic of motorcycle sales. Each segment is 100 pages.