
IT 11 030

Degree project, 30 credits, June 2011

Web Recommendation System with Image Retrieval

Bin Yan

Department of Information Technology

Faculty of Science and Technology, UTH unit. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0. Postal address: Box 536, 751 21 Uppsala. Telephone: 018 – 471 30 03. Fax: 018 – 471 30 00. Website: http://www.teknat.uu.se/student

Abstract

Web Recommendation System with Image Retrieval

Bin Yan

The amount of information on the Internet has increased dramatically in recent years, causing a problem known as “information overload”, which search engines can only partially solve. Although there is a considerable literature on search engines that addresses information overload, the problem has still not been completely overcome, owing to concerns about commercial interests, individual differences and the objective process. Recommendation systems, which are information-filtering systems that can recommend information without the explicit participation of the user, were designed to address those concerns.

The recommendation system collects the interests of users to create an independent profile for each user. It then compares the user profile to some reference characteristics and recommends information of potential interest to the user. Because recommendation systems focus on the specific characteristics of each user, they compensate for the shortcomings of search engines.

Unlike previous literature that focuses on text, this thesis presents an improved recommendation system that considers the information stored in images. Based on an analysis of methods for user modeling and user profile representation, a new design for user profiles, combined with methods for content-based image retrieval, is presented. In this design, the new user profile contains information from images on the web pages to increase the accuracy of the recommendation. Furthermore, algorithms for updating the user model according to user feedback are introduced so that the user model can reflect changes in the users' interests. Using a real-world deployment, the thesis shows that the new system achieves better accuracy than existing text-only methods when only a small amount of data is available.

Finally, the thesis argues that feature selection in image analysis is the bottleneck for the recommendation system. It appears very hard to significantly improve the existing system without new features and semantic analysis.

Printed by: Reprocentralen ITC. IT 11 030. Examiner: Anders Jansson. Subject reviewer: Ivan Christoff. Supervisor: Chenxi Zhang

Table of Contents

1 Introduction
1.1 Background
1.2 Related Work
1.3 Objectives
1.4 Thesis Overview

2 User Profile
2.1 Classification
2.2 The Data Source
2.3 Representation

3 Multi-Dimensional User Profile
3.1 Relativity between Text and Image Content in Web Pages
3.1.1 Experimental Design
3.1.2 Experiment Result
3.1.3 Analysis and Conclusion
3.2 Vector Space Model Based on HTML Structure
3.3 Image Information Extraction and Representation
3.3.1 Common Methods of Content Based Image Retrieval (CBIR)
3.3.2 Color Features Extraction and Representation based on Image Block
3.4 User Profile Design
3.4.1 Interest Vector
3.4.2 Similar Interest Judgment
3.4.3 User Modeling Algorithm
3.4.4 Distance between Users
3.4.5 Similar User Clustering
3.4.6 User Feedback
3.4.7 Updating User Profile

4 A Hybrid System: Combined Content Based and Collaborative Recommendations
4.1 Advantage of Hybrid System
4.2 System Architecture
4.3 System Flow
4.4 Key Technology Used in the System
4.4.1 Chinese Word Segmentation
4.4.2 Calculation of Chinese Compound Word
4.4.3 Capture User Browsing Behavior
4.5 The Brief Introduction of the Prototype
4.5.1 Server Modules
4.5.2 Client Interface
4.6 The Experiment

5 Conclusions
5.1 Result
5.2 Future Work

References

1 Introduction

1.1 Background

According to the China Internet Network Information Center (CNNIC), the number of Internet users in China had reached more than 384 million by Dec. 31, 2009. The total number of domain names in China is more than 16.28 million, and the number of web pages has reached 33.6 billion [1]. Figure 1.1 shows the increasing trend in the number of web pages in China.

Figure 1.1 The trend of the number of web pages in China from 2003 to 2009

With the rapid growth of the Internet, the amount of information online has increased enormously, and this increase may result in the problem called “information overload”, which refers to the difficulty users have in making decisions based on the available information [2].

Search engines can help Internet users handle information overload to some extent. However, search engines and information retrieval systems have the following defects:

1. The search result is distorted by commercial interests. Currently, search engines rely on advertisements to generate revenue. Service providers place advertisements in the search results, which reduces the quality of the results.

2. Search engines ignore the differences between users. Users get the same results if they enter the same keywords, even though the emphasis differs between individuals.

3. The user dominates the search process. The search engine depends on the user's input, but users' requirements are often unclear. To improve the accuracy of the results, the user has to modify the search keywords and search again.


Recommendation systems were designed to address those defects. They are information-filtering systems that can recommend information without the explicit participation of the user. The system collects user information to create an independent profile for each user. It then compares the user profile to some reference features and recommends items to users who have a potential interest in the specified topic.

However, previous web page recommendation systems are based on the plain text content of the web pages rather than the information contained in the images [3], [4]. The goal of this thesis is to design a new user profile that handles the information contained in the images in web pages. The thesis also implements a prototype of the web page recommendation system and measures the performance of the new user profile by experiments.

1.2 Related Work

Recommendation System

A recommendation system is a specific type of information filtering system that attempts to recommend information items that are likely to be of interest to the user. Typically, a recommendation system includes three key elements: candidate items, users and a recommendation algorithm. Figure 1.2 illustrates the architecture of a recommendation system. When the user's profile is built, users can explicitly provide interest information, or the system can collect data implicitly. The recommendation algorithm scores the candidate items against the user interest profile to make recommendations.

Figure 1.2 architecture of recommendation system

Research Situation

Robert Armstrong et al. proposed the first recommendation system, Webwatcher, in 1995. Since then, a variety of recommendation systems have been developed, for example at Amazon, eBay and Taobao. Table 1.1 lists mainstream recommendation systems in both research and commercial areas, classified by area.

Table 1.1 Mainstream recommendation systems

Area          Recommendation System
E-commerce    Amazon.com, eBay, Levis, Ski-europe.com
Web Page      Fab, Foxtrot, ifWeb, MEMOIR, METIOREW, ProfBuilder, QuIC, Quickstep, R2P, Siteseer, SurfLen
Music         CDNOW, CoCoA, Ringo, Music.Yahoo.com
Movie         Netflix.com, Moviefinder.com, MovieLens, Reel.com
News          GroupLens, PHOAKS, P-Tango

In the theoretical literature, recommendation systems have become an independent discipline that touches areas such as e-commerce, network economics and sociology. In recent years, research on recommendation systems has grown rapidly: 1) ACM set up a dedicated conference, ACM Recommender Systems; 2) the number of papers on recommendation systems at top conferences on human-computer interaction, data retrieval and machine learning (SIGCHI, KDD, SIGIR, WWW etc.) has increased year by year; 3) top journals (such as IEEE Transactions on Knowledge and Data Engineering and ACM Transactions on Information Systems) have published several papers on recommendation systems. Leading research institutes (and researchers) in recommendation systems include New York University (Alexander Tuzhilin), the GroupLens group at the University of Minnesota (Joseph A. Konstan, John Riedl etc.), the University of Michigan (Paul Resnick), Carnegie Mellon University (Jamie Callan), Microsoft Research (Ryen W. White) etc. In addition, the University of Michigan has offered a recommendation system course since 2006.

Previous Work on Web Page Recommendation System

1. Fab

Marko Balabanović and Yoav Shoham implemented a hybrid content-based, collaborative system in 1997 [7]. As part of a digital library project, Fab aims to help users filter useful information from the huge quantity of information on the Internet. The system combines content-based recommendation and collaborative filtering recommendation to create a hybrid recommendation system. The recommendation process can be divided into two phases: 1. collect information and set up a manageable database; 2. select certain information for a certain user.

Fab is composed of three parts: collection agents, selection agents and a central router. Every agent maintains a profile that contains words from web pages that have been rated. A collection agent's profile represents its current topic, whereas a selection agent's profile represents a single user's interests. Pages found by the collection agents are sent to the central router, which forwards them to those users whose profiles they match above some threshold. The users are required to assign appropriate ratings on a 7-point scale. A user's ratings are used to update his or her personal selection agent's profile and are also forwarded back to the originating collection agents, which use them to adapt their profiles. Additionally, any highly rated pages are passed directly to the user's nearest neighbors, that is, other people with similar profiles. In conclusion, by combining two recommendation algorithms, Fab can filter large-scale, rapidly changing information and process dynamic feedback.

2. Siteseer

James Rucker and Marcos J. Polanco developed the Siteseer system in 1997 [8]. Siteseer is a web page recommendation system that utilizes user bookmarks (including Hotlist and Favorites) and the organization of bookmarks. Bookmarks represent users' interests, especially classified bookmarks, which reflect a conscious behavior of the user. Siteseer compares the bookmarks of different users and calculates user similarity based on the similarity of the URLs in their bookmarks. The system then recommends to each user the URLs that appear only in the bookmarks of his or her neighbors.
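The bookmark comparison described for Siteseer can be sketched roughly as follows. This is a minimal illustration and not Siteseer's actual algorithm: the Jaccard overlap used as the similarity measure, the function names and the example URLs are all assumptions made for this sketch.

```python
def bookmark_similarity(a, b):
    """Jaccard overlap of two bookmark URL sets (an assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend_from_neighbors(user, bookmarks, top_k=1):
    """Suggest URLs that the user's most similar neighbors have and the user lacks."""
    mine = bookmarks[user]
    # Rank all other users by the overlap of their bookmarks with the user's.
    ranked = sorted(
        (other for other in bookmarks if other != user),
        key=lambda other: bookmark_similarity(mine, bookmarks[other]),
        reverse=True,
    )
    suggestions = []
    for other in ranked[:top_k]:
        # Sorted only to make the output order deterministic.
        for url in sorted(bookmarks[other] - mine):
            if url not in suggestions:
                suggestions.append(url)
    return suggestions


# Hypothetical bookmark data for three users.
example_bookmarks = {
    "ann": {"a.com", "b.com", "c.com"},
    "bob": {"a.com", "b.com", "d.com"},
    "cat": {"x.com"},
}
print(recommend_from_neighbors("ann", example_bookmarks))
```

Here the neighbor with the largest URL overlap contributes the URLs that the user has not bookmarked yet; a real system would also need a minimum similarity threshold so that unrelated users are never treated as neighbors.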

3. ifWeb

Fabio A. Asnicar and Carlo Tasso introduced the ifWeb system in 1997 [9]. ifWeb is a content-based web page recommendation system. Unlike Fab, ifWeb uses a multi-dimensional description of the web page. ifWeb not only utilizes the key words in web pages but also exploits the domain name, the size of the HTML file, the number of images etc. to represent the content of the web pages.

Analysis of Previous Web Page Recommendation Systems

Because of a variety of limitations, no web page recommendation system has achieved commercial success. The main defects are listed below:

1. The description of the user profile and of knowledge is one-dimensional. For instance, Syskill & Webert [10] and Fab [4] both use several key words to represent the content of the web pages. Accordingly, the matching between candidate items and users is entirely based on text matching methods.

2. The over-fit problem is caused by using only a content-based recommendation algorithm. A purely content-based recommendation algorithm can only recommend items similar to those the user has browsed before. The over-fit problem means that the recommendations are either too similar to previous items or not related at all.

3. The cold start problem includes the new user problem and the new item problem. When new users or new items are added to the system, the system cannot make recommendations until the new users have given enough feedback ratings or the new items have received enough feedback ratings. The sparsity problem means that when the number of users is too small compared to the number of items, there are always some items that have not received any feedback. The sparsity problem can also arise when the difference between any two users is too large.

1.3 Objectives

The objective of this thesis is to design and specify a new user profile based on both the text and the image content of web pages. The thesis also implements a prototype of the web page recommendation system using this user profile. Finally, the thesis shows the performance of the new recommendation system by experiments.

1.4 Thesis Overview

General ideas and methods for user profiles are introduced in chapter 2. Chapter 3 presents a new user profile that represents interest in both the text and image content of web pages. Chapter 3 also includes the modeling and updating methods of the user profile. Chapter 4 introduces the architecture of the whole system and its implementation. Future work is discussed in chapter 5.

2 User Profile

There is no standard definition of user profile. Yangjun Pei defines user profile as “a model used for capturing, recording and managing user’s interest” [11]. Zhiwei Guan defines it as “a description of users’ understanding of the outer world and the interaction with the outer world” [12]. Alfred Kobsa defines it as a set of the user’s goals, plans (with which to reach the goals), beliefs and knowledge about a particular domain [13]. In conclusion, in a recommendation system, the user profile refers to a model that depicts the user’s interests and requirements over a period of time. A user profile should be recognizable by machines and calculable. It is an algorithm-oriented formal description with a particular data structure.

The user profile plays a core role in a recommendation system. It is the key factor for increasing recommendation accuracy and improving the efficiency of the system. A good user profile system can analyze the browsing history of the user and infer the user’s preferences from interaction with the system. Based on these features, the system provides the user with the most suitable recommendations. This chapter introduces some critical techniques for modeling user profiles.

2.1 Classification

1. Manual Modeling by User

Manual modeling means that the user provides preference information manually or selects from predefined options. For example, during the registration process at my.yahoo.com, a new user is required to provide location, interests etc. (see Figure 2.1).

Figure 2.1 my.yahoo.com new user registration

2. User Sample Modeling

User sample modeling requires users to provide samples that they are interested in. The general method is to use rated items that the user has browsed as samples. For example, Amazon.com requires a new user to select a preset category of goods. Then the user is asked to input some key words for searching. After searching, the user needs to rate items in the search result. The process iterates until the user is satisfied with the search result (see Figure 2.2).

Figure 2.2 Sample modeling of Amazon.com

3. Automatic Modeling by System

Automatic modeling means that the user does not explicitly participate in the modeling process; instead, the system establishes the user profile based on the user’s implicit feedback. For example, the Google Chrome browser lists the most frequently visited web sites on its home page. The data is collected implicitly (see Figure 2.3).

Figure 2.3 Most frequently visited web sites list in Chrome

2.2 The Data Source

1. Data from Web Server

The log file on the web server records the web page URL, the time the user browsed it, upload and download behaviors etc.

2. Data from Proxy Server

As an intermediate node between the user and the web server, the proxy server records the user’s behavior when browsing multiple web sites.

3. Data from Client

Data on the client side that represents the user’s interest includes:

1) The user’s browsing history and temporary Internet files. The local history or temporary files record the web sites the user has browsed recently and the time of browsing;

2) The user’s bookmarks. Bookmarks represent the user’s interest, especially classified bookmarks, which reflect conscious behavior of the user;

3) Search keywords;

4) The browsing behavior of the user, which includes the time the user stays on each page, keyboard and mouse operations, printing or saving the page, adding bookmarks etc.;

5) Cookies and forms saved by the browser;

6) Documents downloaded by the user.

Among the data sources mentioned above, search keywords can only represent the current interest, not the long-term interest. Cookies are difficult to interpret without particular knowledge of the web server. Generally, users only add pages they are interested in to their bookmarks, so bookmarks represent the user’s interest well. However, bookmarks are few compared to the history files or temporary files, and users do not add every interesting page to their bookmarks. Modeling based on bookmarks therefore cannot reflect the overall interest of the user.

In contrast to bookmarks, the history file can better represent the user’s interest. The browsing history file is saved implicitly by the browser, so the system can establish the user profile without the explicit participation of the user. Of course, this data source also has some drawbacks. Not all web pages in the history folder interest the user. For example, when a hyperlink does not describe the page well, the user may find that the page is not interesting after opening the link. This means that the history files contain some noise. The system should filter out that noise when utilizing history files as a data source.

The browsing behavior can also reflect the user’s interest. When a user stays on a certain page for a relatively long time, it is inferred that the user may be interested in the page. To represent the user’s interest, browsing behavior should be used together with the page being browsed.

The log on a web server or proxy server records not only the pages the user has browsed but also the various behaviors during browsing. The log on a proxy server records all web sites the user has browsed, so it can represent the user’s interest completely. A web server, on the other hand, only records visits to its own web site and knows nothing about other sites. The web server log is therefore only suitable for user modeling on that particular site.

The documents downloaded or saved by the user can also represent his or her interest. Normally, users only download and save documents they are interested in. Besides, to make them easier to manage and access, the downloaded files are usually classified. The information collected from those classified files can reflect the topics the user is concerned with.

To sum up, the server log, the browsing history and the browsing behavior represent the user’s interest most completely. Bookmarks may not reflect the overall interest but still represent the user’s concerns.

2.3 Representation

According to [14], [15], [16], traditional representation methods include:

1. Topic words representation method

This method uses topic words to represent the user’s interest. A topic word represents a particular domain, for example “Sports” or “News”. This method is usually used together with manual modeling. For instance, my.yahoo.com records the selection of preset options like “Sports”, “Technology” and “Finance”. Then the information

2. List of key words method

This method uses a list of key words to represent the user’s interest. For example, if a user is interested in football, the user profile may look like {football, World Cup, Messi, UEFA Champions League}. The key words can be determined by the user or learned by the system. A typical recommendation system that uses a key word list is Webwatcher. Webwatcher requires the user to input key words of interest first. It then recommends web pages to the user during browsing.

3. Methods based on Vector Space Model

This method uses vectors in the vector space of key words to represent the user’s interest. The vector space model is a common method for representing documents. Every document $d$ can be represented as $(t_1, w_1, t_2, w_2, \ldots, t_n, w_n)$, in which $t_i$ is an item (word or phrase) and $w_i$ is the weight of $t_i$ in $d$. When the method is used to represent a user profile, $t_1, t_2, \ldots, t_n$ represent the items the user is interested in, and $w_1, w_2, \ldots, w_n$ represent the degree of interest in each item.

4. Bookmark Method

Users add web pages they are interested in to their bookmarks (including Hotlist and Favorites) in order to visit them again. Systems that use this method include Siteseer, Open Bookmark and online bookmark services.

5. Methods based on User-item Matrix

The user-item matrix method uses a matrix $R_{m \times n}$ to represent the user profile [17]. In the matrix, $m$ is the number of users and $n$ is the number of items. Every element $r$ in the matrix represents the rating given by the user (the row) to the item (the column). Generally $r$ is an integer, for example from 1 to 5. An empty value means that the user has not rated the item yet. This method is suitable for systems based on collaborative filtering.
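Two of the representations above, the term-weight vector and the user-item matrix, can be illustrated with a short sketch. The cosine measure used to compare a profile with a document is a common choice for vector space models but is an assumption here, since the text above does not name a matching measure; all example terms, weights and ratings are made up for illustration.

```python
import math


def cosine_similarity(profile, document):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    shared = set(profile) & set(document)
    dot = sum(profile[t] * document[t] for t in shared)
    norm_p = math.sqrt(sum(w * w for w in profile.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_p == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_p * norm_d)


def rated_items(matrix, user_index):
    """Column indices of the items that the given user (row) has rated."""
    return [j for j, r in enumerate(matrix[user_index]) if r is not None]


# Vector space model: each key is an item t_i, each value its weight w_i.
user_profile = {"football": 0.9, "World Cup": 0.6, "Messi": 0.4}
sports_page = {"football": 0.8, "Messi": 0.5, "transfer": 0.2}
finance_page = {"stock": 0.9, "market": 0.7}

# User-item matrix R (m = 2 users, n = 3 items); None marks an unrated item.
ratings = [
    [5, None, 3],
    [None, 4, 1],
]

print(cosine_similarity(user_profile, sports_page))
print(cosine_similarity(user_profile, finance_page))
print(rated_items(ratings, 0))
```

A page that shares no item with the profile scores zero, while a page with overlapping, highly weighted items scores close to one. The matrix form keeps no content information at all, which is why it fits collaborative filtering rather than content-based matching.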

None of the methods mentioned above considers the multimedia content contained in the web page. This thesis combines user profile modeling with image retrieval in order to use the information contained in images to refine the representation of users’ interests.

3 Multi-Dimensional User Profile

3.1 Relativity between Text and Image Content in Web Pages

Images convey information through visual language, which is more direct and has a larger information capacity. Besides, images are more accurate than text: it is harder to tamper with or distort images. Thus most web pages on the Internet use images to better represent their content. However, not all images in a web page are related to the main topic of the page. Web pages contain many images such as advertisements, UI elements or logos. The first part of this chapter introduces an experiment to find out the relativity between text and image content in web pages.

3.1.1 Experimental Design

The experiment analyzes the relativity of images and text by automatically scanning 244 web pages. The key points of the experiment include:

1) The selection of web pages

This experiment has two groups of web pages. The first group consists of 100 web pages randomly selected from the Internet. The second group consists of 144 web pages selected from the data source THE SYSKILL AND WEBERT WEB PAGE RATINGS [33].

2) Image analysis

This experiment analyzes the HTML files of the selected web pages. The number of <img> tags is counted to get the number of images. The attributes src, alt and title of the <img> tag represent the image content from the web page producer’s point of view. The src attribute sets the source file of the image. The alt attribute specifies alternative text (alt text) that is rendered when the element to which it is applied cannot be rendered. The title attribute sets the title of the image.

3) Relativity judgment

The words in the web page are sorted by word frequency. The 5 most frequent words are compared with the words in the src, alt and title attributes of each <img> tag. If a matching word is found, the image is considered relevant to the topic of the web page.
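The image analysis and relativity judgment steps above can be sketched with Python's standard `html.parser` module. This is an illustrative reconstruction, not the scanner used in the experiment: the tokenization rule (lowercase runs of letters and digits), the skipping of script and style content, and the names `PageScanner` and `image_relevance` are assumptions. No stop-word filtering is applied, because the description above does not mention any.

```python
import re
from collections import Counter
from html.parser import HTMLParser

WORD = re.compile(r"[a-z0-9]+")  # assumed tokenization rule


class PageScanner(HTMLParser):
    """Collects the visible words of a page and the words attached to each <img>."""

    def __init__(self):
        super().__init__()
        self.words = []    # words found in the page text
        self.images = []   # one set of src/alt/title words per <img> tag
        self._skip = 0     # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            attr_map = dict(attrs)
            text = " ".join(attr_map.get(name) or "" for name in ("src", "alt", "title"))
            self.images.append(set(WORD.findall(text.lower())))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(WORD.findall(data.lower()))


def image_relevance(html, top_n=5):
    """Return (number of images, number of images sharing a word with the top_n page words)."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    top_words = {word for word, _ in Counter(scanner.words).most_common(top_n)}
    relevant = sum(1 for image_words in scanner.images if image_words & top_words)
    return len(scanner.images), relevant


sample_page = (
    "<html><body><p>football football football match match goal</p>"
    "<img src='football.jpg' alt='team'><img src='banner.gif' alt='ads'>"
    "</body></html>"
)
print(image_relevance(sample_page))
```

In the sample page the first image counts as relevant because its file name contains a frequent page word, while the banner does not. Images without meaningful src, alt or title text can never match, which is the limitation that section 3.1.3 discusses.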

3.1.2 Experiment Result

In the first group, 99% of the web pages have images, and more than 78% of the pages have at least one image relevant to the page topic. Among all the pages, 14.7% were relevant to the page topic. See Figure 3.1.

Figure 3.1 Ratio of relevant images in randomly crawled pages (legend: pages with relevant images, pages with only irrelevant images, pages with no images)

In the data source [33], only 34.2% of the pages have images. Among all the images, 44.8% are relevant to the page topic. See Figure 3.2. Among all the pages, 17% were relevant to the page topic.

Figure 3.2 Ratio of relevant images in data source [33] (legend: pages with relevant images, pages with only irrelevant images, pages with no images)

3.1.3 Analysis and Conclusion

The result shows that although most images are not relevant to the page’s topic, most pages have at least one image relevant to their topic. In fact, many <img> tags have no alt or title attribute, which lowers the ratio. The actual ratio is certainly higher than the one measured in the experiment.

In previous work on web page recommendation systems, the information contained in images was not considered. The last part of this chapter introduces a more accurate user profile that incorporates information from image content.

3.2 Vector Space Model Based on HTML Structure

As mentioned in 2.3, the vector space model is a common method for representing user profiles. Because this method is calculable and operable, this thesis uses the vector space model to represent the part of the user profile extracted from text content.

The vector space model uses a vector like $(t_1, w_1, t_2, w_2, \ldots, t_n, w_n)$ to represent the user’s interest. In the vector, $t_i$ is an item, that is, a word depicting the user’s interest, and $w_i$ is its weight. The web pages a user is interested in may contain many words that do little to represent the user’s interest. Besides, as the user description file grows, memory usage and computation cost also increase. Thus how to select items and their weights is the central problem of the vector space model.

In this thesis, the vector space model is based on the HTML structure and word frequency. Words that occur more frequently have higher weights. Besides, words in different tags of the HTML file differ in how well they represent the topic of the page. Generally, words in the title are more closely related to the topic than words that occur in other parts of the web page. Considering the HTML structure, words in different tags should be assigned different weights. See Table 3.1.

Table 3.1 Tags that affect item weights

Tag                                  Specification          Weight
<title></title>                      Title of the page      10
<h1></h1>, <h2></h2>, … <h6></h6>    Headings of the page   depends on 6 and the level of the heading (expression not recoverable)
<body></body>                        Text body              1
<em></em>, <strong></strong>         Emphasis               1

The weight of key word t is computed as

$$w_t = \sum_k n_{t,k}\,\omega_k \qquad (3.1)$$

in which k denotes the type of tag, $n_{t,k}$ is the number of times key word t occurs in tag k, and $\omega_k$ is the corresponding weight in Table 3.1. The emphasis function is

$$f(t) = \begin{cases} 1, & t \text{ in tag <em> or <strong>} \\ 0, & t \text{ not in tag <em> or <strong>} \end{cases} \qquad (3.2)$$

Now key word t and its weight can be calculated. After the weights of all words in the page have been calculated, vector V is obtained by assembling the n words with the highest weights:

$$V = \{(t_1, w_1), (t_2, w_2), \dots, (t_n, w_n)\} \qquad (3.3)$$
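The weighting scheme of Table 3.1 and equations (3.1) and (3.3) can be sketched in a few lines of Python. This is a minimal sketch, not the thesis implementation: the function names and the input layout (a mapping from tag type to the words found in that tag) are hypothetical, and the heading weight is fixed at a flat 6 because the level-dependent expression in Table 3.1 is not legible.

```python
from collections import Counter

# Tag weights from Table 3.1. The flat heading weight is an assumption:
# the table's heading-level expression did not survive in the source.
TAG_WEIGHTS = {"title": 10, "heading": 6, "body": 1, "emphasis": 1}


def keyword_weights(words_by_tag, tag_weights=TAG_WEIGHTS):
    """Sum (occurrences in tag) x (tag weight) for every word, as in (3.1)."""
    weights = Counter()
    for tag, words in words_by_tag.items():
        for word, count in Counter(words).items():
            weights[word] += count * tag_weights[tag]
    return weights


def page_vector(words_by_tag, n=20):
    """Keep the n highest-weighted (word, weight) pairs, as in (3.3)."""
    ranked = sorted(keyword_weights(words_by_tag).items(),
                    key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]


# Hypothetical page content, already split by tag type.
page = {
    "title": ["image", "retrieval"],
    "body": ["image", "color", "color", "histogram"],
    "emphasis": ["histogram"],
}
vector = page_vector(page, n=3)
```

Ties are broken alphabetically here only to make the result deterministic; the thesis does not specify a tie-breaking rule.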


3.3 Image Information Extraction and Representation

3.3.1 Common Methods of Content Based Image Retrieval (CBIR)

An image retrieval system is a system for browsing, searching and retrieving images from a large database of digital images. Traditional methods rely on text descriptions of images, which has serious disadvantages. First, images are generally rich in detail and extended meaning, which is difficult to describe with a few keywords or a simple comment. Second, different people understand the same image differently, which makes it difficult to answer user queries accurately from text labels. Third, text annotation of images can only be done by hand, which is feasible only when the number of images is small; if the number of images grows too fast, annotation by hand becomes very difficult.

Content‐based image retrieval (CBIR) addresses these problems by extracting features from the image itself. The features extracted from images are objective and comprehensive, and the entire process can be done automatically, which increases the speed and accuracy of retrieval. The common methods of CBIR include:

1. Retrieval based on color

Compared to shape, color is invariant to rotation and scale [18]. The basic idea of color‐based retrieval is to use features of the color distribution. The color histogram is a common method: it represents the color distribution by a histogram and takes the distance between two images to be the distance between their color histograms, so image retrieval becomes color histogram matching. The disadvantage is that the color histogram keeps no spatial information, so search results can be inaccurate: the color histograms of two completely different images may be very similar.

2. Retrieval based on texture

Texture is an important feature of objects, reflecting changes in surface color and grayscale. Retrieval based on texture can be divided into three categories: statistical methods, structural methods and spectral methods. Statistical methods identify numerical characteristics of the images, such as the Fourier spectrum [23], the co‐occurrence matrix [24] and Markov random field models [25‐26]. Structural approaches assume that the texture pattern consists of texture primitives arranged according to certain rules, which suits only some regular images.

3. Retrieval based on shape feature

Shape is one of the essential characteristics of objects, and it is also where people start when they learn about things. The problem in shape‐based image retrieval is segmenting objects from the image with an appropriate image segmentation method; the key is to find shape characteristics consistent with human visual perception. Traditional shape‐based retrieval builds a shape feature vector from shape features. The classic shape descriptions in image analysis include Fourier descriptors, moment invariants, various simple form factors (size, roundness, eccentricity, etc.), major axis orientation and so on.

3.3.2 Color Features Extraction and Representation Based on Image Block

Considering the computing complexity, this thesis selects the color feature to represent images. To reduce the error, it also divides the image into blocks, which represents shape features to some extent.

1. Color Feature Extraction and Representation

The color histogram is widely used in image retrieval systems to represent color features. It records the proportion of each color in the whole image but ignores the spatial location of each color, so it cannot describe the objects in an image. The color histogram is particularly suitable for images from which objects are difficult to segment automatically.

The color histogram can be based on different color spaces and coordinate systems. The most common color space is RGB, because most digital images are expressed in it. However, the structure of RGB space is not consistent with people's subjective judgments of color similarity. Therefore, the RGB color space is often converted into the HSV color space.

The HSV (hue, saturation, value) color model corresponds to a conical subset of the cylindrical coordinate system. The top surface of the cone corresponds to V = 1; it contains the faces R = 1, G = 1 and B = 1 of the RGB model and represents the bright colors. Hue H is the angle of rotation around the V axis: red corresponds to 0°, green to 120° and blue to 240°. In the HSV color model, each color differs from its complementary color by 180°. Saturation S ranges from 0 to 1, so the top surface of the cone has a radius of 1. In the HSV color model, a color with one hundred percent saturation generally has a purity of less than one hundred percent. At the cone vertex (the origin), V = 0 and H and S are undefined; this point represents black. At the center of the top surface, S = 0, V = 1 and H is undefined; this point represents white. The axis from this point to the origin represents progressively darker grays, i.e. the grayscale; for these points, S = 0 and H is undefined. In other words, the V axis of the HSV model corresponds to the main diagonal of the RGB color space. The circumference on the top surface of the cone is the solid

color which has V = 1, S = 1. The HSV color model corresponds to the way a painter mixes pigment. Painters vary the tint and shade of a pure color to obtain different colors: mixing in white changes the tint (lower saturation), and adding black changes the shade (lower value). Adding different proportions of white and black produces a variety of colors. The HSV color space can be illustrated by a conical space model, as shown in Figure 3.3.

Figure 3.3 HSV color space

The formula for converting RGB space into HSV space is

$$
\begin{aligned}
v &= \max(r, g, b) \\
s &= \frac{v - \min(r, g, b)}{v} \\
h &= \begin{cases}
5 + b' & \text{if } r = \max(r, g, b) \text{ and } g = \min(r, g, b) \\
1 - g' & \text{if } r = \max(r, g, b) \text{ and } g \neq \min(r, g, b) \\
1 + r' & \text{if } g = \max(r, g, b) \text{ and } b = \min(r, g, b) \\
3 - b' & \text{if } g = \max(r, g, b) \text{ and } b \neq \min(r, g, b) \\
3 + g' & \text{if } b = \max(r, g, b) \text{ and } r = \min(r, g, b) \\
5 - r' & \text{otherwise}
\end{cases} \\
r' &= \frac{v - r}{v - \min(r, g, b)}, \quad
g' = \frac{v - g}{v - \min(r, g, b)}, \quad
b' = \frac{v - b}{v - \min(r, g, b)}
\end{aligned}
\qquad (3.4)
$$

In the equation, $r, g, b \in [0, 1]$, $h \in [0, 6]$ and $s, v \in [0, 1]$. Before calculating, the RGB values should be normalized. To calculate the color histogram, the color space is divided into several small color ranges. Each range is a bin in the histogram. This process is called color quantization. Then, by counting the number of pixels that fall within each bin, the histogram is obtained. This thesis calculates a three‐channel H, S, V color histogram. The hue histogram is divided into 180 bins, which means adjacent hue angles are merged into one bin.

Saturation and value each have 256 bins, obtained by multiplying the original S and V by 255. The histogram is thus obtained by computing the H, S and V of each pixel and counting the number of pixels that fall within each bin. Expressed as vectors:

$$
\begin{aligned}
\mathrm{histH} &= \{h_1, h_2, \dots, h_{180}\} \\
\mathrm{histS} &= \{s_0, s_1, \dots, s_{255}\} \\
\mathrm{histV} &= \{v_0, v_1, \dots, v_{255}\}
\end{aligned}
\qquad (3.5)
$$

In the equation, $h_i$, $s_j$ and $v_k$ are the numbers of pixels that fall into each bin.

2. Image Block

The disadvantage of representing an image by color features alone is that two completely different images may have similar color distributions. For example, the two different images in Figure 3.4 have 9 pixels each and the same color distribution.

Figure 3.4 Different images with the same color distribution

Image blocks avoid this problem to a certain extent. The method splits the whole image into small pieces and extracts color features from each piece separately; the color features of corresponding blocks are then matched. This confines the problem to each block, which keeps the error over the whole image from becoming too large. The number of blocks is a trade‐off between computing complexity and the desired effect. The problem can also be reduced by using the text description in the web page. Taking the computing complexity into account, this thesis segments an image into 4 × 4 sub‐blocks, so each image is represented by three 16‐dimensional image feature vectors; see equations (3.6), (3.7) and (3.8).

$$I_H = \{H_1, H_2, \dots, H_{16}\} \qquad (3.6)$$

$$I_S = \{S_1, S_2, \dots, S_{16}\} \qquad (3.7)$$

$$I_V = \{V_1, V_2, \dots, V_{16}\} \qquad (3.8)$$
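The color feature pipeline of this section (RGB to HSV conversion, quantization into 180 hue bins and 256 saturation and value bins, and the 4 × 4 block division) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis code: all function names are hypothetical, the image is assumed to be a 2‐D list of 8‐bit (r, g, b) tuples, and black or gray pixels, whose hue is undefined, are counted in hue bin 0.

```python
def rgb_to_hsv(r, g, b):
    """Convert r, g, b in [0, 1] to (h, s, v), h in [0, 6), as in (3.4)."""
    v = max(r, g, b)
    lo = min(r, g, b)
    if v == lo:
        # Black or gray: hue is undefined; 0 is assumed here.
        return 0.0, 0.0, v
    s = (v - lo) / v
    r2, g2, b2 = ((v - c) / (v - lo) for c in (r, g, b))
    if r == v and g == lo:
        h = 5 + b2
    elif r == v:
        h = 1 - g2
    elif g == v and b == lo:
        h = 1 + r2
    elif g == v:
        h = 3 - b2
    elif b == v and r == lo:
        h = 3 + g2
    else:
        h = 5 - r2
    return h % 6, s, v  # h = 6 is the same hue as h = 0


def block_histograms(pixels):
    """Count pixels per bin for H (180 bins), S and V (256 bins each)."""
    hist_h, hist_s, hist_v = [0] * 180, [0] * 256, [0] * 256
    for r, g, b in pixels:
        h, s, v = rgb_to_hsv(r / 255, g / 255, b / 255)
        hist_h[int(h * 30) % 180] += 1  # 60 degrees per unit, 2 per bin
        hist_s[round(s * 255)] += 1
        hist_v[round(v * 255)] += 1
    return hist_h, hist_s, hist_v


def split_blocks(image, rows=4, cols=4):
    """Split a 2-D pixel grid into rows x cols blocks, in row-major order."""
    height, width = len(image), len(image[0])
    blocks = []
    for i in range(rows):
        top, bottom = i * height // rows, (i + 1) * height // rows
        for j in range(cols):
            left, right = j * width // cols, (j + 1) * width // cols
            blocks.append([px for row in image[top:bottom]
                           for px in row[left:right]])
    return blocks


def image_features(image):
    """Return one histogram per block and channel: I_H, I_S, I_V."""
    features = {"H": [], "S": [], "V": []}
    for block in split_blocks(image):
        hist_h, hist_s, hist_v = block_histograms(block)
        features["H"].append(hist_h)
        features["S"].append(hist_s)
        features["V"].append(hist_v)
    return features


# A hypothetical 8 x 8 image that is pure red everywhere.
red_image = [[(255, 0, 0)] * 8 for _ in range(8)]
red_features = image_features(red_image)
```

In this sketch each of the 16 entries of I_H, I_S and I_V is a full block histogram, which is one reading of "16‐dimensional image feature vector" in equations (3.6) to (3.8).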


3.4 User Profile Design

3.4.1 Interest Vector

Vector T (equation 3.9) describes a topic the user is interested in. It is composed of a keyword‐weight pair and the image feature vectors. The key word represents the topic, and the image feature vectors represent the related image. If no image related to the topic exists, the image feature vector part is set to null.

$$T = \{\text{key word}, \text{weight}, I_H, I_S, I_V\} \qquad (3.9)$$

The final user profile is composed of several interest vectors.

3.4.2 Similar Interest Judgment

The interest vector is composed of two parts: the key word and weight extracted from the text content, and the image feature vectors extracted from image content related to the topic. The distance between two interest vectors is therefore composed of the distance between key words and the distance between image feature vectors.

1. Distance between key words with weight

Considering the computing complexity, this thesis does not perform semantic analysis of the key words. Identical key words are considered to have distance 0, and different key words are considered to have distance ∞. Thus, when similar interest vectors are merged, only vectors with the same topic are merged.

2. Distance between color histograms

If the heights of the histogram bins are regarded as the distribution of a discrete random variable, the distance between color histograms can be represented by the correlation coefficient of the two random variables. The benefit of using the correlation coefficient is that it handles negative correlation, so similar images in complementary colors can also be recognized. It is calculated as follows:

$$
\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}
= \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E(X)^2}\,\sqrt{E(Y^2) - E(Y)^2}}
\qquad (3.10)
$$

If the two histograms are positively correlated, $\rho_{X,Y}$ is positive, and $\rho_{X,Y} = 1$ when they are perfectly positively correlated. If the two histograms are negatively correlated, $\rho_{X,Y}$ is negative, and $\rho_{X,Y} = -1$ when they are perfectly negatively correlated. The closer the absolute value of the correlation coefficient is to 1, the more closely the two histograms are related and the higher the similarity of the images. The closer it is to 0, the less closely the two histograms are related and the lower the similarity of the images.

3. Distance between image feature vectors

The distance between the color histograms of each pair of corresponding blocks can now be calculated. To calculate the distance between image feature vectors, all blocks are considered equally important. The distance between image feature vectors is the mean of the distances between corresponding blocks:

$$
D_j = \frac{1}{16}\sum_{i=1}^{16} D_{ij}, \quad j \in \{H, S, V\}
$$

$$
D = \frac{1}{3}\,(D_H + D_S + D_V) \qquad (3.11)
$$

4. Distance between interest vectors

The interest vector distance F is defined as:

$$
F(T_1, T_2) = \begin{cases} \infty, & \text{different key words} \\ D, & \text{same key words} \end{cases}
\qquad (3.12)
$$

If the distance is less than the threshold S, the two vectors are similar.
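The four distance steps above can be sketched as follows. The sketch rests on two assumptions that the text does not settle: the block distance is taken as 1 − |ρ|, so that strongly correlated histograms get a small distance, and D is read as the mean of the three channel distances. The function names and the dictionary layout of an interest vector are hypothetical.

```python
import math


def correlation(x, y):
    """Pearson correlation of two equal-length histograms, as in (3.10)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mean_x * mean_y
    var_x = sum(a * a for a in x) / n - mean_x ** 2
    var_y = sum(b * b for b in y) / n - mean_y ** 2
    if var_x <= 0 or var_y <= 0:
        return 0.0  # flat histogram: correlation undefined, 0 assumed
    return cov / math.sqrt(var_x * var_y)


def histogram_distance(x, y):
    """Assumed mapping from correlation to distance: 1 - |rho|."""
    return 1.0 - abs(correlation(x, y))


def image_distance(features_a, features_b):
    """Mean block distance per channel, then mean over H, S, V (3.11)."""
    channel_means = []
    for channel in ("H", "S", "V"):
        pairs = zip(features_a[channel], features_b[channel])
        distances = [histogram_distance(p, q) for p, q in pairs]
        channel_means.append(sum(distances) / len(distances))
    return sum(channel_means) / 3


def interest_distance(t1, t2):
    """Equation (3.12): infinite for different key words, else D."""
    if t1["keyword"] != t2["keyword"]:
        return math.inf
    return image_distance(t1["features"], t2["features"])


# Hypothetical interest vectors with tiny 3-bin histograms per block.
blocks = {c: [[1, 2, 3]] * 16 for c in ("H", "S", "V")}
cat_a = {"keyword": "cat", "features": blocks}
cat_b = {"keyword": "cat", "features": blocks}
dog = {"keyword": "dog", "features": blocks}
```

The sketch does not handle interest vectors whose image part is null; the thesis does not state how the distance is defined in that case.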

3.4.3 User Modeling Algorithm

Algorithm 3.1 gives the user modeling algorithm.

3.4.4 Distance between Users

In this thesis, the distance between users is calculated from the user profiles only; the user rating is used for updating the user profile. The distance is calculated as follows.

$U_1$ and $U_2$ are two different user profiles, each composed of several interest vectors. Calculating the distance between them takes the following steps:

1. Digitalize Vector

Assume user profile $U_1$ contains the key words $t_1, \dots, t_m, x_1, \dots, x_k$ and user profile $U_2$ contains the key words $x_1, \dots, x_k, v_1, \dots, v_n$, in which $x_1, \dots, x_k$ are the common (same or similar, see 3.4.2) interest vectors contained in both user profiles. The two user profiles can be converted into two equal‐length numeric vectors in the way illustrated in Table 3.2. Key words that occur in only one profile are replaced by their corresponding weights. To reflect the difference between the two user profiles on the same or similar key words, these are replaced by the corresponding weight combined with the distance between the two interest vectors. If a key word does not exist in one of the user profiles, the corresponding position is set to 0.

Table 3.2 Digitalized user profiles

User profile    t_1 … t_m    x_1 … x_k                  v_1 … v_n
U_1             [...]        [...] 0.5 … [...] 0.5      0 … 0
U_2             0 … 0        [...]′ 0.5 … [...]′ 0.5    [...]′ … [...]′

2. Calculate the Cosine Distance

After digitalization, the cosine distance between user profiles $U_1$ and $U_2$ can be calculated:

$$
\cos(U_1, U_2) = \frac{\sum_{i=1}^{K} U_{1,i}\,U_{2,i}}{\sqrt{\sum_{i=1}^{K} U_{1,i}^2}\;\sqrt{\sum_{i=1}^{K} U_{2,i}^2}}
\qquad (3.13)
$$

In the formula, K is the number of interest vectors in the user profile. In this thesis it is 20.
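The two steps can be sketched as below. The digitalization here is deliberately simplified: shared key words keep their plain weights, because the exact way the thesis combines weight and interest vector distance in Table 3.2 is not legible. Function names and the keyword‐to‐weight dictionaries are hypothetical.

```python
import math


def digitalize(profile_1, profile_2):
    """Align two keyword -> weight profiles into equal-length vectors.

    Simplification: shared key words keep their plain weights; the
    distance-based adjustment described for Table 3.2 is omitted.
    """
    keywords = sorted(set(profile_1) | set(profile_2))
    u1 = [profile_1.get(k, 0.0) for k in keywords]
    u2 = [profile_2.get(k, 0.0) for k in keywords]
    return u1, u2


def cosine(u1, u2):
    """Cosine of the angle between two numeric vectors, as in (3.13)."""
    dot = sum(a * b for a, b in zip(u1, u2))
    norm_1 = math.sqrt(sum(a * a for a in u1))
    norm_2 = math.sqrt(sum(b * b for b in u2))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0  # an empty profile is treated as unrelated
    return dot / (norm_1 * norm_2)


# Hypothetical profiles sharing the key word "b".
u1, u2 = digitalize({"a": 1.0, "b": 2.0}, {"b": 2.0, "c": 1.0})
similarity = cosine(u1, u2)
```

A value near 1 means the two profiles point in nearly the same direction; a value of 0 means they share no weighted key words.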

3.4.5 Similar User Clustering

This thesis uses an improved K‐means algorithm to cluster similar users.

Algorithm 3.2

Input: user profile set U

Output: user profile clusters

1. Randomly select log K user profiles as the initial cluster centers;

2. For k from log K to K

3. Calculate the distance between the other profiles and the cluster centers, and assign each user profile to the nearest cluster;

4. Select the profile farthest from each cluster center as the center of a new cluster;

5. k = 2 * k;

6. End
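One possible reading of Algorithm 3.2 is sketched below. Several points are assumptions rather than statements from the thesis: K is taken as the target number of clusters, the logarithm is base 2, profiles are plain numeric tuples compared by Euclidean distance, and growth stops at exactly K centers instead of overshooting when doubling. All names are hypothetical.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(profiles, centers, dist):
    """Attach every profile to its nearest center."""
    clusters = [[] for _ in centers]
    for profile in profiles:
        nearest = min(range(len(centers)),
                      key=lambda c: dist(profile, centers[c]))
        clusters[nearest].append(profile)
    return clusters


def doubling_cluster(profiles, K, dist=euclidean, seed=0):
    """Grow from about log2(K) centers to K, roughly doubling each round."""
    rng = random.Random(seed)
    k = max(1, int(math.log2(K)))
    centers = rng.sample(profiles, k)
    clusters = assign(profiles, centers, dist)
    while k < K:
        # Each cluster donates its farthest member as a new center.
        for center, members in zip(list(centers), clusters):
            if len(centers) >= K:
                break
            farthest = max(members, key=lambda p: dist(p, center),
                           default=None)
            if farthest is not None and farthest not in centers:
                centers.append(farthest)
        if len(centers) == k:
            break  # no new center could be found
        k = len(centers)
        clusters = assign(profiles, centers, dist)
    return centers, clusters


# Hypothetical two-dimensional profiles forming three loose groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
centers, clusters = doubling_cluster(points, K=3)
```

In the thesis setting, the distance would come from the user profile distance of section 3.4.4 rather than from Euclidean distance.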


3.4.6 User Feedback

This thesis considers several aspects of user feedback rating:

1. Explicit rating by the user

After the user clicks a link in the recommendation result list to access the page, the system asks the user to assign a rating on a 5‐point scale.

2. User browsing behavior

A user's browsing behavior can imply the user's preferences. For example, while browsing, the user may add a bookmark, download, copy or perform other operations. The browsing behavior can be seen as a kind of feedback on the recommendation result. This thesis considers that different behaviors represent different degrees of interest, ordered as: Add Bookmark > Download > Save > Copy. Let $b_i$ denote the weight of a browsing behavior. The page behavior vector can then be represented as $\{b_1$ (adding a bookmark), $b_2$ (download), $b_3$ (save page), $b_4$ (copy)$\}$, with the values 4, 3, 2 and 1 respectively. Each weight takes only two possible values, either 0 or the specified value. For example, if a user bookmarked a page and also copied some content, but did not download or save it, the user behavior vector is {4, 0, 0, 1}.

The score of the user's browsing behavior is the sum of the operation weights:

$$\text{score} = \sum_{i=1}^{4} b_i + 1 \qquad (3.14)$$

Once the user reads a page, the absence of all four browsing operations does not mean that the user is completely uninterested in the page. Therefore, a constant 1 is added at the end of the equation.

Finally, the feedback is based mainly on the user's explicit rating. If the user has not given any rating on the website, the score is derived from the user's browsing behavior. Both situations give a feedback score from 1 to 5. A feedback score is produced only after the user clicks one of the links in the recommendation result list to enter the page. Pages in the result list that the user does not access get a score of 0.
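The scoring rules above can be sketched as follows. The cap at 5 is an assumption made here to reconcile the sum in equation (3.14), which can exceed 5, with the stated 1‐5 range; the thesis does not say how the two are reconciled. The function and action names are hypothetical.

```python
# Behavior weights as given in the text: bookmark > download > save > copy.
BEHAVIOR_WEIGHTS = {"bookmark": 4, "download": 3, "save": 2, "copy": 1}


def behavior_score(actions):
    """Sum of the weights of the distinct actions, plus 1 (equation 3.14)."""
    return sum(BEHAVIOR_WEIGHTS[a] for a in set(actions)) + 1


def feedback_score(visited, explicit_rating=None, actions=()):
    """Feedback for one recommended page.

    Unvisited pages score 0; an explicit rating takes priority;
    otherwise the behavior score is used, capped at 5 (assumption).
    """
    if not visited:
        return 0
    if explicit_rating is not None:
        return explicit_rating
    return min(5, behavior_score(actions))


# The example from the text: bookmarked and copied, nothing else.
example = behavior_score(["bookmark", "copy"])
```

A page that was opened but triggered no tracked action still scores 1, which matches the reason given for the constant in equation (3.14).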

3.4.7 Updating User Profile

The user profile should reflect changes in user interest, so it needs to be updated and maintained. In addition, the user profile can be refined based on the user's feedback. In this thesis, updating the user profile consists of two aspects:

1. Regular updating by the system

According to the user's new browsing behavior in the most recent period, the system captures changes in the user's interest and automatically updates the user profile. The user model update algorithm is described as Algorithm 3.3.

2. Updating based on user feedback ratings

In combination with collaborative filtering, the system asks the user to rate the results. At the same time, the system captures the user's browsing behavior on pages in the result list. Using these ratings and the browsing behavior, the system refines the user profile. See Algorithm 3.4.

If a web page gets a feedback score of less than 3, the interest vectors extracted from the page are added to the user profile; if an interest vector already exists in the user profile, its weight in the user profile is decreased. If the feedback score is greater than or equal to 3, the interest vectors extracted from the page are added to the user profile; if an interest vector already exists in the user profile, its weight in the user profile is increased.
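The feedback‐driven update rule can be sketched as below. The size and form of the weight change are assumptions: the thesis says only that the weight is decreased or increased, so a hypothetical multiplicative step of 10% is used. Matching by identical key word follows section 3.4.2, where different key words are infinitely distant; image feature vectors are left out for brevity.

```python
def apply_feedback(profile, page_vectors, score, step=0.1):
    """Return a new keyword -> weight profile updated from one page.

    profile, page_vectors: dict mapping key word to weight.
    score: feedback score of the page.
    step: hypothetical relative change per feedback event.
    """
    updated = dict(profile)
    for keyword, weight in page_vectors.items():
        if keyword not in updated:
            updated[keyword] = weight        # unknown interest: insert it
        elif score < 3:
            updated[keyword] *= 1 - step     # known interest, low score
        else:
            updated[keyword] *= 1 + step     # known interest, score >= 3
    return updated


# Hypothetical profile and page.
before = {"camera": 1.0}
page = {"camera": 0.5, "lens": 0.3}
after_low = apply_feedback(before, page, score=2)
after_high = apply_feedback(before, page, score=4)
```

The function returns a copy rather than changing the profile in place, so the old profile can still be compared with the new one.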


Algorithm 3.1

Input: set of web pages W

Output: user profile U

⑴ Preprocess W;

WHILE W != ∅

{

⑵ Take a page P ∈ W;

⑶ Analyze P, counting the frequency of each word in the different tags (Table 3.1);

⑷ Calculate the weight of each word using equation 3.1;

⑸ Calculate vector V using equation 3.3;

⑹ Sort V by weight and keep the top 20 keyword‐weight pairs;

⑺ Judge the relevance between images and key words (using the alt, title and src attributes, as in chapter 3.11);

⑻ Calculate the vectors in equations 3.6, 3.7 and 3.8 for each image related to at least one key word;

⑼ Combine the image feature vectors with the keyword‐weight pairs to get the interest vectors T_i (1 ≤ i ≤ 20). If one image is related to multiple key words, add its image feature vectors to all of those interest vectors;

⑽ Add the interest vectors T_i (1 ≤ i ≤ 20) to the interest vector set T;

⑾ Delete P from W;

}

WHILE T != ∅

{

⑿ ∀ T_i, T_j ∈ T, if F(T_i, T_j) < S, then T_i and T_j are similar (the distance F is calculated by equation 3.12; S is the threshold);

⒀ Merge similar interest vectors T_i, T_j:

(i) Sum the weights of the two vectors,

(ii) Compute the mean of the vectors in equations 3.6, 3.7 and 3.8;

⒁ Add the interest vector (merged, or with nothing to merge) to U, and delete it from T;

}

⒂ Sort U by weight and keep the top μ interests.


Algorithm 3.3

Input: user profile U, temporary web page set W′

Output: new user profile U′

⑴ Preprocess W′;

WHILE W′ != ∅

{

⑵ Take a page P′ ∈ W′;

⑶ Analyze P′, counting the frequency of each word in the different tags (Table 3.1);

⑷ Calculate the weight of each word using equation 3.1;

⑸ Calculate vector V′ using equation 3.3;

⑹ Sort V′ by weight and keep the top 20 keyword‐weight pairs;

⑺ Judge the relevance between images and key words (using the alt, title and src attributes, as in chapter 3.11);

⑻ Calculate the vectors in equations 3.6, 3.7 and 3.8 for each image related to at least one key word;

⑼ Combine the image feature vectors with the keyword‐weight pairs to get the interest vectors T_i (1 ≤ i ≤ 20). If one image is related to multiple key words, add its image feature vectors to all of those interest vectors;

⑽ Add the interest vectors T_i (1 ≤ i ≤ 20) to the interest vector set T′;

⑾ Delete P′ from W′;

}

WHILE T′ != ∅

{

⑿ ∀ T_i ∈ T′, ∀ T_j ∈ U, if F(T_i, T_j) < S, then T_i and T_j are similar (the distance F is calculated by equation 3.12; S is the threshold);

⒀ Merge similar interest vectors T_i, T_j:

(i) Sum the weights of the two vectors,

(ii) Compute the mean of the vectors in equations 3.6, 3.7 and 3.8;

⒁ Else if ∃ T_i ∈ T′ such that ∀ T_j ∈ U, F(T_i, T_j) ≥ S, then add T_i to U;

}

⒂ Sort U by weight and keep the top μ interests; U′ = U.


Algorithm 3.4

Input: user profile U, feedback score set S, interest vector set T of the result pages

Output: new user profile U′

1. For every interest vector T_i in T

2. ∃ s ∈ S, where s is the feedback score of the pages that contain T_i

3. IF s < 3 THEN

4. IF ∃ T_j ∈ U, F(T_i, T_j) < D THEN

5. decrease the weight of T_j [...];

6. ELSE

7. insert T_i into U;

8. END

9. ELSE IF s > 3 THEN

10. IF ∃ T_j ∈ U, F(T_i, T_j) < D THEN

11. increase the weight of T_j [...];

12. ELSE

13. insert T_i into U;

14. END

15. END


4 A Hybrid System ‐ Combined Content Based and Collaborative Recommendations

4.1 Advantage of Hybrid System

This thesis uses a hybrid recommendation approach that combines content‐based recommendation with collaborative filtering. The advantage is that it avoids the problems caused by using a single recommendation method:

1. Over‐fit problem caused by using only content‐based recommendation.

The content‐based method makes recommendations based on the relevance between candidate items and the user profile. The over‐fit problem is that the recommendations are either too similar to the items browsed previously or not related at all. It comes from the incompleteness of the data. When the content‐based method is combined with collaborative filtering, the system can also use data from other similar users.

2. Feature extraction problem of the content‐based method.

When the features of a candidate item are difficult to extract or describe, or only the general characteristics of the item can be obtained rather than its precise characteristics, content‐based recommendation can hardly capture the user's interest accurately. Collaborative filtering can then predict the user's interest from other similar users' feedback on the same item.

3. New item problem caused by using only collaborative filtering.

When new items are added to the system, the system cannot recommend them until they have received enough feedback ratings. Content‐based recommendation can recommend a new item directly, based on its content.

4. Sparsity problem caused by using only collaborative filtering.

The sparsity problem means that when the number of users is too small compared to the number of items, some items never receive any feedback. The content‐based method can recommend items that have not received any user evaluation, based directly on their content. In the extreme case, the system can still provide service with only one user.

5. Sparsity problem caused by large differences in selection.

The problem arises when the ratings of any two users differ greatly, or when some users' preferences are too unusual compared to other users. When combined with content‐based recommendation, the system can make recommendations even for users with special interests, without referring to other users' interests.

4.2 System Architecture

[Figure 4.1: system architecture. Labels in the diagram: Client, Server, Collector, Distributor, Selector, Feedback Controller, User Profile Database, Web Pages, User, Similar User]

The system uses a client‐server architecture. The server includes three main modules: the collector, which collects pages from the Internet; the distributor, which matches web pages against user profiles and then makes recommendations using the content‐based method; and the user profile database, which stores the user profiles.

The selector receives the recommendation result from the distributor and filters out pages already visited by the user as well as extra pages from the same website. The feedback controller collects user feedback ratings and captures user browsing behavior. Once a page gets a feedback score of more than 4 points, it is directly recommended to other similar users belonging to the same cluster. That is how collaborative filtering works in this thesis. In addition, the feedback controller is responsible for updating the user profile based on user feedback.


4.3 System Flow

The system flow is as follows.

First, when a user accesses the system, the server checks for the system cookie in the client request. If the cookie exists, the user is logged in automatically. If there is no cookie, the server sends the login page to the user. On the login page, registered users can log in manually, and unregistered users can go to the registration page.

After a successful login, the system collects the user's browsing history. If the user is not a new user, the system discards the browsing records from before the last login time stored in the cookie and uses the recent browsing history to update the user profile. If the cookie does not exist, the user profile is not updated this time. For new users, the system calculates the user profile from the entire browsing history. The system then calculates the distance between the new user and the existing cluster centers and adds the new user to the proper cluster.

In the next step, the system matches the pages in the database against the user profile and recommends the 10 best matches. The result list is placed in a page sent back to the user as the HTTP response. While the user browses the result page, the system collects the feedback and the user's browsing behavior, and then updates the user profile based on the feedback. If a page gets a score of more than 4, it is also recommended to all similar users. See Figure 4.2.


Figure 4.2 system flow

4.4 Key Technology Used in the System

4.4.1 Chinese Word Segmentation

Unlike English, Chinese has no space or other separator between words. To count word frequency, the system needs to segment Chinese text into separate Chinese words. Existing segmentation algorithms can be divided into three categories: segmentation based on string matching, segmentation based on understanding and segmentation based on statistical methods.

1. Segmentation based on string matching

This method, also known as mechanical segmentation, matches the Chinese string against the entries of a "sufficiently large" machine dictionary according to a certain strategy. If a word is found both in the dictionary and in the string, the match is successful (the word is identified). The methods can be divided into forward matching and reverse matching according to the scanning direction, or into maximum (longest) matching and minimum (shortest) matching according to which length is given priority. The common methods combine the approaches mentioned above, as follows:

1) Forward maximum matching (from left to right);

2) Reverse maximum matching (from right to left);

3) Least segmentation (segment the string into the minimum number of words);

4) Bi‐directional maximum matching (from left to right, then from right to left).

2. Segmentation based on understanding

This segmentation method lets the computer simulate the way people understand sentences in order to identify words. The basic idea is to perform syntactic and semantic analysis during segmentation to deal with ambiguity. It usually contains three parts: the segmentation subsystem, the syntax and semantics subsystem, and the control subsystem. Under the coordination of the control subsystem, the segmentation subsystem obtains syntactic and semantic information about words and sentences to resolve segmentation ambiguity. This method requires a large amount of language knowledge and information. Because knowledge of Chinese is complex and general, it is hard to organize the language information into a form that a machine can process directly. Segmentation based on understanding is currently still experimental.

3. Segmentation based on statistical methods

From a formal point of view, a word is a stable combination of characters. The more often adjacent characters occur together in context, the more likely they are to form a word. This method analyzes the co‐occurrence of adjacent characters. When the co‐occurrence frequency of adjacent characters is higher than a threshold, the characters may constitute a word. The method is based only on the statistical frequency of groups of characters in the corpus, and no segmentation dictionary is needed. It is therefore also called the dictionary‐free method.
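Of the string matching strategies listed under category 1, forward maximum matching is the simplest to illustrate. The sketch below is a generic textbook version, not the segmenter used in the thesis; the function name, the tiny dictionary and the maximum word length of 4 are illustrative assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text left to right, always taking the longest dictionary word.

    A single character that is not in the dictionary is emitted as its
    own word, so the scan always advances.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical dictionary entries for a short example.
dictionary = {"计算", "计算机", "操作", "系统", "操作系统"}
segments = forward_max_match("计算机操作系统", dictionary)
```

With this dictionary the scan first rejects the four‐character candidate "计算机操", accepts "计算机", and then accepts "操作系统" in one step.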

Because Chinese segmentation is not the focus of this thesis, the system uses an open source project called ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) [38]. ICTCLAS 3.0 reaches a segmentation speed of 996 KB/s and a segmentation accuracy of 98.45%. The API is smaller than 200 KB, and all the dictionary data take less than 3 MB. ICTCLAS supports Linux, FreeBSD and Windows, and is available for C/C++, C#, Delphi, Java and other mainstream development languages.


4.4.2 Calculation of Chinese compound word

A Chinese nominal compound word is defined as a group of consecutive words that is functionally equivalent to a simple noun, is semantically complete and contains no function words [39]. It is like a phrase in English, such as "计算机操作系统" (computer operating system), which combines "计算机" (computer), "操作" (operating) and "系统" (system). In Chinese, a compound word is treated as a single word. If the compound word is segmented into "computer", "operating" and "system", the vector space of the document is calculated from the wrong items.

How to correctly identify the compound word has been a hot research topic in

information retrieval, machine translation and text classification. There are many

sophisticated methods, including:

1. Statistical methods

⑴ Approach based on the frequency

If two words or more words occurred together for a lot of times, the possibility

that they constitute a compound word is higher. At least means the words

together represent a special meaning. The actual use of the method is often

combined with linguistic and heuristic rules to improve the precise and recall.

⑵ Hypothesis testing

Using statistical methods to determine whether it is by chance the words

combined a compound word. Judge under what condition that the words do not

constitute a compound word.

⑶ Likelihood ratio

Use likelihood ratio to indicate how larger one possibility of the way through

which compound word is constitute is than other possibilities.

⑷Relative frequency compared

Compound words in a domain are often occurred in the same domain, but are

rarely occurred in other domains. According to this features of compound words,

comparison of occurrence frequency of compound word in two or more different

corpus can also help finding compound words.

2. Based on rules

Rule‐based methods use the context information and the internal components to

indentify the compound word. But due to the complex of linguistic rules, the

rules extraction is rely on artificial work. Thus linguistic rules are difficult to use

accurately. Now rule‐based analysis is used rarely.

3. Statistical and rule based methods

Statistical and rule based approach is the current trend of compound word

research. Because of the complexity of natural language, statistical methods

cannot identify the low frequency compound words. The artificial rules are

32

difficult to extract. Thus, many researches try to combine the two methods to

extract and identify the compound word.

This hybrid method combined the statistical knowledge and linguistic knowledge

(syntactic and semantic information). In specific implementations, there are a

variety of forms. For example, first use statistical methods to identify candidate

compound words, and then use the linguistic knowledge to filter.

This thesis uses the statistical method [38]. The probability that two words constitute a compound word is calculated as follows:

$$f = \frac{f_{xy}}{f_x + f_y - f_{xy}} \qquad (5.1)$$

where $f_x$ is the frequency of the word X, $f_y$ is the frequency of the word Y, and $f_{xy}$ is the frequency of the compound word XY. When $f > 1\%$, XY is considered a compound word.
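As a small worked example, the threshold test can be written as below. The sketch assumes that Equation (5.1) has the form f = f_xy / (f_x + f_y − f_xy), i.e. the co‐occurrence count divided by the number of occurrences of either word; the function names and the sample counts are illustrative and do not come from the thesis.

```python
def compound_score(f_x, f_y, f_xy):
    """Score f of Equation (5.1), under the assumed form
    f = f_xy / (f_x + f_y - f_xy)."""
    denominator = f_x + f_y - f_xy
    if denominator <= 0:
        raise ValueError("frequencies must satisfy f_x + f_y > f_xy")
    return f_xy / denominator


def is_compound(f_x, f_y, f_xy, threshold=0.01):
    """XY is treated as a compound word when f exceeds the 1% threshold."""
    return compound_score(f_x, f_y, f_xy) > threshold


# Illustrative counts: X occurs 120 times, Y 80 times, XY together 40 times.
print(compound_score(120, 80, 40))   # 40 / 160 = 0.25
print(is_compound(120, 80, 40))      # True
print(is_compound(1000, 1000, 5))    # 5 / 1995 is below 1%, so False
```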

4.4.3 Capture User Browsing Behavior

In Internet Explorer (IE), user browsing behavior can be captured by a BHO (Browser Helper Object).

1. Principle of interaction between BHO and the Internet Explorer

Internet Explorer and a BHO communicate via COM interfaces. A BHO is a COM object that implements a specific interface. A developed BHO plug‐in is registered in the registry under a certain key. When the browser starts, Internet Explorer checks the key and loads all objects registered under it. Internet Explorer initializes each object and checks for the corresponding interface. If the interface is found, Internet Explorer uses the methods provided by the interface to pass its IUnknown pointer to the BHO object.

The browser may find a number of CLSIDs (Class IDs) in the registry, and it creates an in‐process instance for each CLSID. As a result, these objects are loaded into the same memory space and run as if they were native components. Because Internet Explorer is built on COM, a BHO that wants to hook the browser's events needs to create a COM‐based communication channel and implement an interface called IObjectWithSite. Through IObjectWithSite, Internet Explorer can pass its IUnknown pointer. The BHO can store this pointer and query it for the other interfaces it needs, such as IWebBrowser2, IDispatch and IConnectionPointContainer.

A BHO object is loaded when the browser's main window is displayed and is unloaded when the main window is destroyed. The BHO object is loaded no matter which command is used to start the browser. Only when iexplore.exe is explicitly run several times are multiple copies of Internet Explorer created and multiple BHOs loaded. When a new Internet Explorer window is opened from an existing one, the new window only creates a new thread rather than a new process; therefore, the BHO is not reloaded.

2. Process of capturing Internet Explorer browsing behavior

The process, from loading the BHO when IE starts to unloading it when IE exits, can be divided into four stages:

(1) Stage of loading the BHO

This stage can be summarized as four steps: Internet Explorer starts; IE finds the subkey under a key in the registry and loads the BHO; IE requests the IObjectWithSite interface of the BHO; and, once it has obtained the interface, IE passes its own IUnknown interface to the BHO.

(2) Stage of establishing connections

This stage is the prerequisite for receiving all kinds of IE events. The BHO first requests IConnectionPointContainer and IWebBrowser through the IUnknown interface. If the request succeeds, the BHO asks the IE browser to establish a connection through IConnectionPointContainer. The BHO also requests connections to the HTML window and the HTML document through the IWebBrowser interface. IE then creates these connections.

(3) Stage of event processing

IE sends event IDs through the IDispatch interface that the BHO exposes. The BHO handles each event according to its ID. This process repeats until IE is closed.

(4) Stage of closing

When the IE browser is closed, the BHO is unloaded.
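The four stages can be illustrated with a small simulation. A real BHO is an in‐process COM object, normally written in C++; the sketch below is plain Python that only mimics the control flow. The classes `FakeBrowser` and `BrowsingRecorder` and their methods are hypothetical stand‐ins (for the browser with its connection point, and for the BHO with its SetSite and Invoke methods). The three event IDs are believed to match the DWebBrowserEvents2 values in the Windows SDK header exdispid.h, but they should be checked against that header before use.

```python
# Event IDs assumed to match exdispid.h (DWebBrowserEvents2).
DISPID_BEFORENAVIGATE2 = 250
DISPID_ONQUIT = 253
DISPID_DOCUMENTCOMPLETE = 259


class FakeBrowser:
    """Stand-in for Internet Explorer; not a real COM object."""

    def __init__(self):
        self._sinks = {}
        self._next_cookie = 1

    def advise(self, sink):
        """Register an event sink and return a connection cookie."""
        cookie = self._next_cookie
        self._next_cookie += 1
        self._sinks[cookie] = sink
        return cookie

    def unadvise(self, cookie):
        self._sinks.pop(cookie, None)

    def fire(self, dispid, *args):
        """Send one event to every connected sink (stage 3)."""
        for sink in list(self._sinks.values()):
            sink.invoke(dispid, *args)


class BrowsingRecorder:
    """Stand-in for the BHO: records the URL of every completed document."""

    def __init__(self):
        self.site = None
        self.cookie = None
        self.visited = []

    def set_site(self, site):
        if site is None:
            # Stage 4: the browser is closing, so drop the connection.
            if self.site is not None and self.cookie is not None:
                self.site.unadvise(self.cookie)
            self.site = None
            self.cookie = None
        else:
            # Stage 1: keep the browser reference handed over at load time.
            self.site = site
            # Stage 2: connect to the browser's event source.
            self.cookie = site.advise(self)

    def invoke(self, dispid, *args):
        # Stage 3: dispatch on the event ID; unknown IDs are ignored.
        if dispid == DISPID_DOCUMENTCOMPLETE:
            self.visited.append(args[0])
        elif dispid == DISPID_ONQUIT:
            self.set_site(None)


browser = FakeBrowser()
bho = BrowsingRecorder()
bho.set_site(browser)
browser.fire(DISPID_BEFORENAVIGATE2, "http://example.com/")
browser.fire(DISPID_DOCUMENTCOMPLETE, "http://example.com/")
browser.fire(DISPID_ONQUIT)
browser.fire(DISPID_DOCUMENTCOMPLETE, "http://example.org/")  # not received
print(bho.visited)
```

Dispatching on a numeric event ID inside a single `invoke` method mirrors how the BHO receives every browser event through one IDispatch entry point.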

4.5 The Brief Introduction of the Prototype

4.5.1 Server Modules

The modules actually run in the background on the server. The graphical interface of these modules is used for unit testing.

Figure 4.3 shows the HTML page analysis module. This module uses a recursive method to analyze nested HTML tags. It keeps only the content of the tags listed in Table 3.1, ignores all other tags, and classifies the content by tag. The example page in Figure 4.3 comes from: http://baike.baidu.com/view/47277.htm
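The idea of keeping only selected tags and classifying their content by tag can be sketched with Python's standard `html.parser` module. The set `KEPT_TAGS` is a hypothetical stand‐in for the tags of Table 3.1, which is not reproduced here, and the sketch tracks nesting with an explicit stack of open tags instead of the recursive method used by the module.

```python
from collections import defaultdict
from html.parser import HTMLParser

KEPT_TAGS = {"title", "h1", "h2", "p", "a"}   # assumed stand-in for Table 3.1
SKIPPED_TAGS = {"script", "style"}            # content that is never text


class TagContentCollector(HTMLParser):
    """Collects text and files it under the nearest enclosing kept tag."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags, outermost first
        self.content = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags
        # such as <br> that were pushed but never closed.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIPPED_TAGS for t in self.stack):
            return
        for tag in reversed(self.stack):
            if tag in KEPT_TAGS:
                self.content[tag].append(text)
                break


page = """<html><head><title>Demo</title><style>p{color:red}</style></head>
<body><h1>Heading</h1><div><p>Some <a href="#">link</a> text<br>more</p></div>
<script>var x = 1;</script></body></html>"""

collector = TagContentCollector()
collector.feed(page)
result = dict(collector.content)
print(result)
```

Text inside `<div>` alone would be dropped, because `div` is not in the assumed tag set, while text inside a nested `<a>` is filed under `a` rather than under the surrounding `<p>`.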

Figure 4.3 HTML analysis

Figure 4.4 shows the Chinese word segmentation module. As described in Section 4.3.1, this module utilizes the components of the project ICTCLAS. The segmentation is based on the results of the HTML analysis.

Figure 4.4 Chinese Segmentation

The calculation of word frequency and weight is based on the segmentation results. After segmentation, many words contribute little to describing the content of the web page, such as meaningless auxiliary words. The module for calculating word frequency and weight uses a stop word list, according to which the useless words are filtered out. The module is shown in Figure 4.5.
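A minimal sketch of stop word filtering followed by frequency counting is given below. The stop word list is hypothetical, English tokens are used for readability, and the weight is assumed to be the plain relative frequency of each remaining word; the weighting scheme of the actual module may differ.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a", "is", "and"}   # hypothetical stop word list


def term_weights(tokens, stop_words=STOP_WORDS):
    """Remove stop words, count the rest, and normalize to relative frequency."""
    kept = [token for token in tokens if token not in stop_words]
    counts = Counter(kept)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


tokens = "the color of the image and the histogram of the image".split()
weights = term_weights(tokens)
print(weights)   # image: 0.5, color: 0.25, histogram: 0.25
```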

Figure 4.5 Words Frequency and Weight Calculation

Figure 4.6 shows the calculation of the color histogram, and Figure 4.7 shows the module for comparing color histograms.

Figure 4.6 Calculating the Color Histogram

Figure 4.7 Contrast Images
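The two steps shown in Figures 4.6 and 4.7 can be sketched in plain Python as follows. The sketch assumes a quantized RGB histogram with four bins per channel and uses histogram intersection as the similarity measure; both choices are illustrative assumptions rather than the settings of the prototype, and images are represented simply as lists of (r, g, b) tuples.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized RGB histogram; pixels are (r, g, b) tuples in 0..255."""
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each channel value 0..255 to a bin index 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1
    count = len(pixels)
    return [value / count for value in hist] if count else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for normalized histograms; 1 means identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


red_image = [(255, 0, 0)] * 4
mixed_image = [(255, 0, 0)] * 2 + [(0, 0, 255)] * 2

h_red = color_histogram(red_image)
h_mixed = color_histogram(mixed_image)
print(histogram_intersection(h_red, h_red))    # 1.0
print(histogram_intersection(h_red, h_mixed))  # 0.5
```

Because the histogram discards pixel positions, two images with the same color distribution but different content obtain a similarity of 1, which is the limitation of color‐only features.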

4.5.2 Client Interface

The user first accesses the login page, see Figure 4.8. A new user can register on the register page, see Figure 4.9.

Figure 4.8 Login Page

Figure 4.9 Register Page

After login, the system gives recommendations, see Figure 4.10. When the user accesses a link in the result list, the system requires the user to give a rating, see Figure 4.11.

Figure 4.10 Recommendation Page

Figure 4.11 Rating Page

4.6 The Experiment

In this thesis, the experiments adopt WebSpider to crawl web pages. The crawler collects web pages and saves them into the database. In the experiment, WebSpider first collected web pages for more than 72 hours and gathered more than 500,000 pages. Then 10 volunteers were asked to use the system. They provided their browsing history as well as ratings of the recommendation results. For each user, the user profile was updated over at least three rounds of iteration. The result is shown in Figure 4.12.

Figure 4.12 Experiment Result

This thesis adopts the ndpm (normalized distance‐based performance measure) metric [41] for measuring the recommendation performance.

In order to measure ndpm, an ideal ranking must be defined. An ideal ranking of

some pages for source S is one where the user prefers every page from S to every

page not from S. It does not matter how the user ranks the pages from S relative to

one another, nor the pages not from S. The greater the preference the user

expresses for pages from S over the other pages supplied, the smaller the ndpm

distance between the user’s actual ranking and the ideal ranking for S.
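The following sketch shows one way to compute ndpm between two rankings. It assumes the usual definition of the measure, ndpm = (2·C⁻ + Cᵘ) / (2·Cⁱ), where Cⁱ is the number of item pairs that are strictly ordered in the reference ranking, C⁻ is the number of those pairs that the compared ranking orders the opposite way, and Cᵘ is the number of those pairs that the compared ranking leaves tied. Rankings are given as dictionaries of scores; the function name and the sample data are illustrative.

```python
from itertools import combinations


def ndpm(reference_scores, compared_scores):
    """ndpm distance; both arguments map item -> score (higher = preferred)."""
    contradictory = compatible = preferred = 0
    for a, b in combinations(reference_scores, 2):
        ref_diff = reference_scores[a] - reference_scores[b]
        if ref_diff == 0:
            continue          # no preference in the reference: pair not counted
        preferred += 1
        cmp_diff = compared_scores[a] - compared_scores[b]
        if cmp_diff == 0:
            compatible += 1   # the compared ranking ties the pair
        elif (ref_diff > 0) != (cmp_diff > 0):
            contradictory += 1
    if preferred == 0:
        raise ValueError("ndpm is undefined without any strict preference")
    return (2 * contradictory + compatible) / (2 * preferred)


# Ideal ranking for a source S: pages a and b come from S, pages c and d do not.
ideal = {"a": 1, "b": 1, "c": 0, "d": 0}
actual = {"a": 4, "b": 2, "c": 3, "d": 1}   # the user ranks c above b
print(ndpm(ideal, actual))   # one of four preferred pairs is reversed: 0.25
```

A value of 0 means that every page from S is ranked above every other page, 1 means the complete opposite, and 0.5 is obtained when the compared ranking expresses no preference at all.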

When little feedback was given, the recommendation combined with image analysis was more accurate. Given a large amount of feedback, the accuracy of the text‐only recommendation and that of the recommendation combined with image analysis were similar. This result is reasonable: when the user has given more feedback scores, the user profile has already become accurate. However, when the user has given less feedback, combining the recommendation with image analysis can improve the accuracy. When user information is lacking, image information can help to refine the user profile.

[Plot of Figure 4.12: ndpm (vertical axis, 0 to 0.5) against the number of rated Web items (horizontal axis, 5 to 25), with two series: “Recommend with Image Information” and “Recommendation only Use Text Information”.]

5 Conclusions

5.1 Result

With the rapid growth of the Internet, the amount of information on it has increased enormously, which causes the problem of information overload. Search engines help to solve the problem, but some of their features constrain their effectiveness. A personalized recommendation system may be a better solution: it does not depend on user‐provided keywords but guides users to the results they require. Traditional web page recommendation is based on the text content of web pages. However, web pages contain a large number of images, and traditional methods do not utilize them; as a result, part of the information is wasted. This thesis designed a recommendation system combined with content‐based image analysis. The main tasks are:

The traditional user profile is analyzed, including modeling, representation methods and data sources. Combined with content‐based image analysis, a new user profile is designed. The user profile contains information about the images the user is interested in, in order to improve information utilization and increase recommendation accuracy.

The thesis states the algorithm that uses user feedback scores to iteratively update the user profile, so that the user profile can reflect how the user's interests change over time.

A hybrid recommendation system that combines the content‐based method and collaborative filtering is proposed.

The prototype is implemented and the experiment is carried out.

5.2 Future Work

Although this study has achieved initial success, there is still a long way to go. Much further research remains, which is briefly discussed below:

Image feature extraction focuses only on the color feature, since running time is a critical metric of system performance. Color cannot represent all the features of an image. For example, the same item can appear in a variety of colors, and images that have the same color distribution may depict completely different scenes. Although this thesis adopts image blocks to reduce false positives, the performance still does not reach a satisfactory accuracy in the use case study. Future work should include further analysis of image texture and shape, and refined image feature extraction and expression.

In this thesis, the way the features of web text and images are combined is relatively rough. The correlation between images and text is determined only from the src, title and alt attributes of the <img> tag. However, HTML files that are not written to the standard cannot be analyzed reliably. A possible improvement is to use image content features to retrieve related web pages directly.

The user profile lacks semantic analysis of the text vector and simply matches keywords. If keywords do not match each other, they are considered to represent different interests. Further research can improve the ability of the system to understand the semantic meaning of the text.

Although the combination of content‐based and collaborative filtering solves some problems in recommendation systems, such as the sparsity problem and the new item problem, other problems remain to be resolved, such as security issues and criteria for system evaluation.

References

[1] China Internet Network Information Center, Statistical Report on Internet

Development in China, 2010,

http://www.cnnic.net.cn/uploadfiles/pdf/2010/1/15/101600.pdf date visited

“2010‐11‐05”

[2] Shuning Li. Research on the Information Overload Problem in Web Information

Environment, Information Science, 2005,Vol.23(10):1587‐1590

[3] Yang, C.C.; Chen, Hsinchun; Honga, Kay (2003). "Visualization of large category

map for Internet browsing". Decision Support Systems 35 (1): 89–

102. doi:10.1016/S0167‐9236(02)00101‐X

[4] Hailing Xu, Xiao Wu, Xiaodong Li etc. Comparison Study of Internet

Recommendation System. Journal of Software, 2009,Vol.20(2):350‐362

[5] Jianguo Liu, Tao Zhou, Binghong Wang. Advances in Personalized

Recommendation system. Progress in Natural Science, 2009, Vol.19(1):1‐15.

[6] Resnick P, Varian HR. Recommender systems. Communications of the ACM, 1997,

Vol.40(3):56−58

[7] Marko Balabanović, Yoav Shoham. Content Based, Collaborative Recommendation. Communications of the ACM, 1997, Vol.40(3): 66‐72

[8] James Rucker, Marcos J. Polanco. Siteseer: Personalized Navigation for the Web. Communications of the ACM, 1997, Vol.40(3): 73‐75

[9] Fabio A. Asnicar, Carlo Tasso. ifWeb: a Prototype of User Model‐Based Intelligent

Agent for Document Filtering and Navigation in the World Wide Web. In Proc. of

6th International Conference on User Modelling (2‐5 June 1997)

[10] Michael Pazzani, Jack Muramatsu, Daniel Billsus. Syskill & Webert: Identifying

interesting web sites. AAAI Technical Report SS‐96‐05

[11] Yangjun Pei. Research on User Interest Profile in Personalized Service System. Master Thesis, Chongqing University, 2005

[12] Zhiwei Guan. User oriented Intelligent Human Computer Interaction. Doctoral Thesis, Institute of Software, Chinese Academy of Sciences, 2000

[13] Alfred Kobsa. User Modeling in Dialog Systems: Potentials and Hazards [J]. AI & Society, 1990, Vol.4(3): 214‐240

[14] Yuxiang Yan. Research on Recommendation System Based on Semantic web.

Master Thesis, Taiyuan University of Technology, 2010

[15] Xialuo. Research of The User Model in Web Mining. Master Thesis. Sichuan

Normal University, 2009

[16] Liying Xiao. Research on Personalized User Profile on Internet. Master Thesis,

Central South University, 2003

[17] Badrul Sarwar, George Karypis, Joseph Konstan, et al. Item‐Based Collaborative

Filtering Recommendation Algorithms. Proceedings of the 10th international

conference on World Wide Web, 2001

[18] Anil K. Jain, Aditya Vailaya. Image Retrieval using Color and Shape. Pattern Recognition, 1996, Vol.29, Issue 8: 1233‐1244

[19] Weicheng Liu, Hongji Sun. Summary on Content Based Image Retrieval. 2002,

Information Science, Vol.20(4):431‐437

[20] Guanming Lu, Content Based Image and Video Retrieval. 2002, Journal of

Nanjing University of Posts and Telecommunications (Natural Science), Vol.22(2):

23‐26

[21] Shaoli Wang, Li Zhang, Jing Fu etc. A System of Query Based on the Content of

Image and Video. Computer Engineering and Applications, 2001, Vol.37(7):

113‐117

[22] Wenlong Qu, Weidong Li, Bingyu Yang. Overview of Image Mining Research. Computer Engineering and Applications, 2004, Vol.40(5): 1‐3

[23] B. S. Manjunath, W. Y. Ma. Texture features for browsing and retrieval of image data. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 1996, Vol.18(8): 837‐842

[24] Mari Partio, Bogdan Cramariuc, Moncef Gabbouj, et al. Rock Texture Retrieval

Using Gray Level Co‐occurrence Matrix. 5th Nordic Signal Processing Symposium

October 4‐7, 2002

[25] George R. Cross, Anil K. Jain. Markov Random Field Texture Models. Pattern

Analysis and Machine Intelligence, IEEE Transactions on. 1983, Vol.PAMI‐5(1):

25‐39

[26] R. Chellappa, S. Chatterjee. Classification of Textures Using Gaussian Markov Random Fields. Acoustics, Speech and Signal Processing, IEEE Transactions on, 1985, Vol.33(4): 959‐963

[27] Haitao J., Abdel S. H. Scene change detection techniques for video database

system [J]. Multimedia System, 1998(6):186‐195

[28] Patel N. V., Sethi I. K. Video Shot Detection and Characterization for Video

Database [J]. Pattern Recognition, 1997, Vol.30(4): 583‐592

[29] A Nagasaka, et al. Automatic Video Indexing and Full Video Search for Object

Appearances[C]. Second Working Conference on Visual Database Systems, IFIP

WG2.6, 1991: 119‐133.

[30] Zhang H. J., et al. Video Parsing, Retrieval and Browsing: An Integrated and Content‐based Solution [A]. Proceedings of ACM Multimedia ’95 [C]. 1995: 15‐24

[31] Song M. H., Kwon, T. H. On Detection of Gradual Scene Changes for Parsing of

Video Data [J]. SPIE, 1997, Vol.33(12): 404‐409

[32] Zhang H. J., Wu Jianhua, et al. An Integrated System for Content‐Based Video

Retrieval and Browsing[J]. Pattern Recognition. 1997, Vol.30(4): 643‐657

[33] Michael Pazzani, Syskill and Webert Web Page Ratings,

http://kdd.ics.uci.edu/databases/SyskillWebert/SyskillWebert.data.html, date

visited “2010‐11‐10”

[34] Nicholas Kushmerick, Internet Advertisements Data Set,

http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements, date visited

“2010‐11‐10”

[35] The 4 Universities Data Set, http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/, date visited “2010‐11‐10”

[36] M. Stricker, M. Orengo. Similarity of color images. SPIE Storage and Retrieval for

Image and Video Databases III. 1995, Vol. 2185:381‐392

[37] Chinese Segmentation. http://baike.baidu.com/view/19109.htm, date visited “2010‐12‐03”

[38] ICTCLAS. http://ictclas.org/index.html, date visited “2010‐12‐03”

[39] Changxiong Chen. Compounds Phrase Analysis and Application in Information

Retrieval. Master Thesis. Shanghai JiaoTong University, 2008

[40] Zhi Cai. Research on Intelligent Information Capturing on World Wide Web. Master Thesis, University of Science and Technology of China, 2002

[41] Yao, Y. Y. Measuring retrieval effectiveness based on user preference of

documents. J. Amer. Soc. Info. Sci. 1995, Vol.46( 2): 133–145