IT 11 030
Degree project, 30 credits, June 2011
Web Recommendation System with Image Retrieval
Bin Yan
Department of Information Technology
Faculty of Science and Technology, UTH unit. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0. Postal address: Box 536, 751 21 Uppsala. Telephone: 018 – 471 30 03. Fax: 018 – 471 30 00. Website: http://www.teknat.uu.se/student
Abstract
The amount of information on the Internet has increased dramatically in recent years, causing the problem known as “information overload”, which search engines can solve only partially. Although there is a considerable literature on search engines that addresses information overload, the problem has not been completely overcome to date because of commercial interests, individual differences between users, and the user-dominated search process. Recommendation systems, which are information-filtering systems that can recommend information without explicit participation of the user, were designed to address these concerns.
A recommendation system collects the interests of users to create an independent profile for each user. It then compares the user profile with some reference characteristics and recommends information of potential interest to the user. Because they focus on the specific characteristics of each user, recommendation systems compensate for the shortcomings of search engines.
Unlike previous literature, which focuses on text, this thesis presents an improved recommendation system that also considers the information stored in images. Based on an analysis of methods for user modeling and user profile representation, a new design for user profiles combined with methods for content-based image retrieval is presented. In this design, the new user profile contains information from images on the web pages to increase the accuracy of the recommendation. Furthermore, algorithms for updating the user model according to user feedback are introduced so that the user model can reflect changes in the users’ interests. Using a real-world deployment, the thesis shows that the new system achieves better accuracy than existing text-only methods when only a small amount of data is available.
Finally, the thesis argues that feature selection in image analysis is the bottleneck for recommendation systems. It appears very hard to significantly improve existing systems without new features and semantic analysis.
Printed by: Reprocentralen ITC
IT 11 030
Examiner: Anders Jansson
Subject reviewer: Ivan Christoff
Supervisor: Chenxi Zhang
Table of Contents
1 INTRODUCTION
1.1 Background
1.2 Related Work
1.3 Objectives
1.4 Thesis Overview
2 USER PROFILE
2.1 Classification
2.2 The Data Source
2.3 Representation
3 MULTI DIMENSIONAL USER PROFILE
3.1 Relativity between Text and Image Content in Web Pages
3.1.1 Experimental Design
3.1.2 Experiment Result
3.1.3 Analysis and Conclusion
3.2 Vector Space Model Based on HTML Structure
3.3 Image information extraction and representation
3.3.1 Common Methods of Content Based Image Retrieval (CBIR)
3.3.2 Color Features Extraction and Representation based on Image Block
3.4 User Profile Design
3.4.1 Interest Vector
3.4.2 Similar Interest Judgment
3.4.3 User Modeling Algorithm
3.4.4 Distance between Users
3.4.5 Similar User Clustering
3.4.6 User Feedback
3.4.7 Updating User Profile
4 A HYBRID SYSTEM - COMBINED CONTENT BASED AND COLLABORATIVE RECOMMENDATIONS
4.1 Advantage of Hybrid System
4.2 System Architecture
4.3 System Flow
4.4 Key Technology Used in the System
4.4.1 Chinese word Segmentation
4.4.2 Calculation of Chinese compound word
4.4.3 Capture User Browsing Behavior
4.5 The Brief Introduction of the Prototype
4.5.1 Server Modules
4.5.2 Client Interface
4.6 The Experiment
5 CONCLUSIONS
5.1 Result
5.2 Future Work
REFERENCES
1 Introduction
1.1 Background
According to the China Internet Network Information Center (CNNIC), the number of
Internet users in China exceeded 384 million as of December 31, 2009. The total
number of domain names in China was more than 16.28 million, and the number of web
pages reached 33.6 billion [1]. Figure 1.1 shows the growth in the number of web pages
in China.
Figure 1.1 The trend in the number of web pages in China from 2003 to 2009
With the rapid growth of the Internet, the amount of information online has
increased enormously. This growth results in the problem called
“information overload”, which refers to the difficulty users have in making decisions
when faced with too much information [2].
Search engines help Internet users handle information overload only
partially. Search engines, and information retrieval systems in general, have the
following defects:
1. The search result is distorted by commercial interests. Currently, search engines
rely on advertisements to generate revenue. Service providers place
advertisements among the search results, which reduces the quality of the results.
2. Search engines ignore the differences between users. Users get the
same results if they enter the same keywords, even though their emphasis
differs.
3. The user dominates the search process. The search engine depends on the user’s
input, but the user’s requirements are often unclear. To improve the accuracy
of the results, the user has to modify the keywords and search again.
To address these defects, recommendation systems were designed. They are
information-filtering systems that can recommend information without explicit
participation of the user. The system collects user information to create an
independent profile for each user. It then compares the user profile with some
reference features and recommends items to users who have a potential
interest in the corresponding topic.
However, previous web page recommendation systems are based on the plain text
content of web pages rather than the information contained in images [3],
[4]. The goal of this thesis is to design a new user profile that handles the
information contained in the images in web pages. The thesis also implements a
prototype of the web page recommendation system and measures the performance
of the new user profile by experiments.
1.2 Related Work
Recommendation System
A recommendation system is a specific type of information filtering system
that attempts to recommend information items that are likely to be of
interest to the user. Typically, a recommendation system includes three key elements:
candidate items, users and a recommendation algorithm. Figure 1.2 illustrates the
architecture of a recommendation system. When the user’s profile is built, the user
can provide interest information explicitly, or the system can collect data implicitly.
The recommendation algorithm evaluates the candidate items against the user interest
profile to make recommendations.
Figure 1.2 Architecture of a recommendation system
Research Situation
Robert Armstrong et al. proposed the first recommendation system, Webwatcher, in
1995. Since then, a variety of recommendation systems have been developed, for example
at Amazon, eBay and Taobao. Table 1.1 lists mainstream recommendation
systems in both research and commercial areas, grouped by application area.
Table 1.1 Mainstream recommendation systems
Area         Recommendation System
E-commerce   Amazon.com, eBay, Levis, Ski-europe.com
Web Page     Fab, Foxtrot, ifWeb, MEMOIR, METIOREW, ProfBuilder, QuIC, Quickstep, R2P, Siteseer, SurfLen
Music        CDNOW, CoCoA, Ringo, Music.Yahoo.com
Movie        Netflix.com, Moviefinder.com, MovieLens, Reel.com
News         GroupLens, PHOAKS, P-Tango
In the theoretical literature, recommendation systems have become an independent
discipline that draws on areas such as e-commerce, network economics and sociology. In
recent years, research on recommendation systems has increased rapidly: 1) ACM set up
a dedicated conference, ACM Recommender Systems; 2) the number of papers on
recommendation systems has increased year by year in top conferences on human-computer
interaction, data retrieval and machine learning (SIGCHI, KDD, SIGIR, WWW etc.); 3) top
journals (such as IEEE Transactions on Knowledge and Data Engineering and ACM
Transactions on Information Systems) have published several papers on recommendation
systems. Leading research institutes (and researchers) in recommendation systems include
New York University (Alexander Tuzhilin), the GroupLens group at the University of
Minnesota (Joseph A. Konstan, John Riedl etc.), the University of Michigan (Paul Resnick),
Carnegie Mellon University (Jamie Callan) and Microsoft Research (Ryen W. White). In
addition, the University of Michigan has offered a course on recommendation systems
since 2006.
Previous Work on Web Page Recommendation System
1. Fab
Marko Balabanović and Yoav Shoham implemented a hybrid content-based,
collaborative system in 1997 [7]. As part of a digital library project, Fab aims to help
users filter useful information from the huge quantity of information on the Internet. The
system combines content-based recommendation and collaborative filtering
recommendation to create a hybrid recommendation system. The recommendation
process can be divided into two phases: 1. collect information and set up a
manageable database; 2. select certain information for a certain user.
Fab is composed of three parts: collection agents, selection agents and a central router.
Every agent maintains a profile that contains words from web pages that have been rated.
The profile of a collection agent represents its current topic, whereas a selection
agent’s profile represents a single user’s interests. Pages found by the collection
agents are sent to the central router, which forwards them to those users whose
profiles they match above some threshold. The users are required to assign
ratings on a 7-point scale. A user’s ratings are used to update the
personal selection agent’s profile and are also forwarded back to the originating
collection agents, which use them to adapt their profiles. Additionally, any highly
rated pages are passed directly to the user’s nearest neighbors, that is, other people with
similar profiles. In conclusion, by combining two recommendation algorithms, Fab
can filter large-scale, rapidly changing information and process dynamic feedback.
2. Siteseer
James Rucker and Marcos J. Polanco developed the Siteseer system in 1997 [8]. Siteseer is
a web page recommendation system that uses the user’s bookmarks (including Hotlist
and Favorites) and the organization of those bookmarks. Bookmarks represent users’
interests, especially classified bookmarks, which reflect conscious behavior of the user.
Siteseer compares the bookmarks of different users and calculates user similarity based on
the similarity of the URLs in their bookmarks. The system then recommends to each user the
URLs that appear only in the bookmarks of neighboring users.
3. ifWeb
Fabio A. Asnicar and Carlo Tasso introduced the ifWeb system in 1997 [9]. ifWeb is a
content-based web page recommendation system. Unlike Fab, ifWeb uses a
multi-dimensional description of the web page. ifWeb not only uses the key words
in web pages but also exploits the domain name, the size of the HTML file, the number of
images etc. to represent the content of the web pages.
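The bookmark comparison described for Siteseer can be illustrated with a short sketch. It assumes that user similarity is measured as the Jaccard overlap of the two sets of bookmarked URLs, which is one plausible choice and not necessarily the measure used in [8]. The function names and sample URLs are hypothetical.

```python
def bookmark_similarity(a, b):
    """Jaccard overlap between two sets of bookmarked URLs (assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend_from_neighbor(user_urls, neighbor_urls):
    """URLs the neighbor has bookmarked that the user has not."""
    return sorted(set(neighbor_urls) - set(user_urls))


alice = {"http://a.example/", "http://b.example/", "http://c.example/"}
bob = {"http://b.example/", "http://c.example/", "http://d.example/"}

similarity = bookmark_similarity(alice, bob)        # 2 shared URLs out of 4 distinct
suggestions = recommend_from_neighbor(alice, bob)   # URLs only bob has bookmarked
```

In this toy case the two users share two of four distinct URLs, so the similarity is 0.5 and the only suggestion for the first user is the URL that appears solely in the second user's bookmarks.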
Analysis of Previous Web Page Recommendation Systems
Because of a variety of limitations, no web page recommendation
system has achieved commercial success. The main defects are listed below:
1. The description of the user profile and of knowledge is one-dimensional. For instance,
Syskill & Webert [10] and Fab [4] both use several key words to represent the content of
the web pages. Accordingly, the matching between candidate items and users is
based entirely on text matching methods.
2. Over-fitting caused by using only a content-based recommendation algorithm.
A purely content-based recommendation algorithm can only recommend items
similar to those the user has browsed before. The over-fitting problem is that the
recommendations are either too similar to previous items or not related at all.
3. The cold start problem, which includes the new user problem and the new item problem.
When new users or new items are added to the system, the system cannot make
recommendations until the new users have given enough feedback ratings or the new
items have received enough feedback ratings. The related sparsity problem means that
when the number of users is too small compared to the number of items, there are always
some items that have not received any feedback. Sparsity can also arise when the
difference between any two users is too large.
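The new item and sparsity conditions in item 3 can be made concrete with a small check. This is a minimal sketch that assumes ratings are stored in a dictionary keyed by (user, item); the storage layout, names and numbers are hypothetical.

```python
def unrated_items(ratings, items):
    """Items that have not received any feedback rating yet (new item problem)."""
    rated = {item for (_user, item) in ratings}
    return sorted(set(items) - rated)


def sparsity(ratings, users, items):
    """Fraction of user-item pairs that carry no rating."""
    total = len(users) * len(items)
    return 1.0 - len(ratings) / total if total else 0.0


users = ["u1", "u2"]
items = ["p1", "p2", "p3", "p4", "p5"]
ratings = {("u1", "p1"): 4, ("u1", "p2"): 2, ("u2", "p2"): 5}

cold_items = unrated_items(ratings, items)           # items nobody has rated
matrix_sparsity = sparsity(ratings, users, items)    # 3 ratings out of 10 pairs
```

With two users and five items, three pages have no rating at all and 70% of the user-item pairs are empty, which is the situation in which a purely rating-based recommender has nothing to work with.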
1.3 Objectives
The objective of this thesis is to design and specify a new user profile based on both
the text and image content of web pages. The thesis also implements a
prototype of the web page recommendation system using this user profile. Finally,
the thesis evaluates the performance of the new recommendation system by experiments.
1.4 Thesis Overview
General ideas and methods for user profiles are introduced in chapter 2. Chapter 3
presents a new user profile that represents interest in both the text and image content of
web pages, and also covers the modeling and updating methods for the user
profile. Chapter 4 introduces the architecture and the implementation of the whole
system. Future work is discussed in chapter 5.
2 User Profile
There is no standard definition of user profile. Yangjun Pei defines user profile as “a
model used for capturing, recording and managing user’s interest” [11]. Zhiwei Guan
defines it as “a description of users’ understanding of the outer world and the
interaction with the outer world” [12]. Alfred Kobsa defines it as a set of the user’s goals,
plans (with which to reach the goals), beliefs and knowledge about a particular domain
[13]. In conclusion, in a recommendation system, a user profile is a model that
depicts the user’s interests and requirements over a period of time. A user profile should be
recognizable by a machine and calculable. It is an algorithm-oriented formal description
with a particular data structure.
The user profile plays a core role in a recommendation system. It is the key factor for
increasing recommendation accuracy and improving the efficiency of the system. A
good user profile system can analyze the browsing history of the user and infer the
user’s preferences from the interaction with the system. Based on these features, the
system provides the user with the most suitable recommendations. This chapter
introduces some critical techniques for modeling user profiles.
2.1 Classification
1. Manual Modeling by User
Manual modeling means that the user provides preference information manually or selects
from predefined options. For example, during the registration process at my.yahoo.com, a
new user is required to provide location, interests etc. (see Figure 2.1).
Figure 2.1 my.yahoo.com new user registration
2. User Sample Modeling
User sample modeling requires users to provide samples in which they are interested.
The general method is to use rated items that the user has browsed as samples. For
example, Amazon.com requires a new user to select a preset category of goods. The
user is then asked to enter some key words for searching. After searching, the user
needs to rate items in the search result. The process iterates until the user is satisfied
with the search result (see Figure 2.2).
Figure 2.2 Sample modeling at Amazon.com
3. Automatic Modeling by System
Automatic modeling means that the user does not explicitly participate in the modeling
process; instead, the system establishes the user profile based on the user’s implicit
feedback. For example, the Google Chrome browser lists the most frequently visited
web sites on its home page. The data is collected implicitly (see Figure 2.3).
Figure 2.3 List of most frequently visited web sites in Chrome
2.2 The Data Source
1. Data from Web Server
The log file on the web server records the web page URL, the time at which the user
browsed it, upload and download behaviors etc.
2. Data from Proxy Server
As an intermediate node between the user and the web server, the proxy server records
the user’s behavior of browsing multiple web sites.
3. Data from Client
Data on the client side that represents the user’s interest includes:
1) The user’s browsing history and temporary Internet files. The local history or
temporary files record the web sites the user has browsed recently and the
time of browsing;
2) The user’s bookmarks. Bookmarks represent the user’s interest, especially
classified bookmarks, which reflect conscious behavior of the user;
3) Search keywords;
4) The browsing behavior of the user, which includes the time the user stays on each
page, keyboard and mouse operations, printing or saving the page, adding
bookmarks etc.;
5) Cookies and forms saved by the browser;
6) Documents downloaded by the user.
Among the data sources mentioned above, search keywords can only represent
the current interest, not the long-term interest. Cookies are difficult to
interpret without specific knowledge of the web server. Generally, users only add
pages they are interested in to their bookmarks, so bookmarks represent the user’s
interest well. However, bookmarks are few compared to the history files or temporary
files, because users do not add every interesting page to their bookmarks. Modeling
based on bookmarks therefore cannot reflect the overall interest of the user.
In contrast to bookmarks, the history file can better represent the user’s interest. The
browsing history file is saved implicitly by the browser, so the system can establish the
user profile without explicit participation of the user. This data source also has
some drawbacks. Not all web pages in the history folder interest the user. For
example, when a hyperlink does not describe the target page well, the user may find that
the page is not interesting after opening the link. That means the history files contain
some noise. The system should filter out that noise when using history files
as a data source.
The browsing behavior can also reflect the user’s interest. When the user stays on a
certain page for a relatively long time, it can be inferred that the user may be interested
in the page. To represent the user’s interest, browsing behavior should be used together
with the page being browsed.
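One way to combine the browsing history with browsing behavior is to keep only the pages on which the user stayed longer than some threshold. The sketch below assumes history records of the form (URL, seconds on page) and a cutoff of 30 seconds. Both the record layout and the cutoff are illustrative assumptions and not values taken from this thesis.

```python
def filter_history(history, min_seconds=30.0):
    """Keep URLs whose accumulated dwell time reaches the threshold."""
    dwell = {}
    for url, seconds in history:
        dwell[url] = dwell.get(url, 0.0) + seconds
    return {url: t for url, t in dwell.items() if t >= min_seconds}


history = [
    ("http://news.example/football", 95.0),
    ("http://ads.example/banner", 3.0),      # opened and closed almost at once
    ("http://news.example/football", 40.0),  # a second visit to the same page
    ("http://blog.example/post", 12.0),
]

interesting = filter_history(history)
```

Here only the page that was visited twice for a total of 135 seconds survives the filter. The briefly opened pages are treated as noise in the history.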
The log on a web server or proxy server records not only the pages the user has
browsed but also the various behaviors while browsing. The log on a proxy server
records all web sites the user has browsed, so it can represent the user’s
interest completely. The web server, on the other hand, only records visits to its
own web site and has no information about other sites. The web server log is therefore
only suitable for user modeling on that particular site.
Documents downloaded or saved by the user can also represent his/her interest.
Normally, a user only downloads and saves documents he/she is interested in. Besides,
to make them easier to manage and access, downloaded files are usually classified.
The information collected from those classified files can reflect the topics the user is
concerned with.
To sum up, the server log, the browsing history and the browsing behavior represent
the user’s interest most completely. Bookmarks may not reflect the overall interest but
still represent the user’s concerns.
2.3 Representation
According to [14], [15], [16], traditional representation methods include:
1. Topic words representation method
This method uses topic words to represent the user’s interest. A topic word
represents a particular domain, for example “Sports” or “News”.
This method is usually used together with manual modeling. For instance,
my.yahoo.com records the selection of preset options like “Sports”, “Technology”,
and “Finance” etc. Then the information
2. List of key words method
This method uses a list of key words to represent the user’s interest. For example,
if a user is interested in football, the user profile may look like {football,
World Cup, Messi, UEFA Champions League}. The key words can be specified by the
user or learned by the system. A typical recommendation system that uses a key word
list is Webwatcher. Webwatcher requires the user to enter key words of interest
first. It then recommends web pages to the user while browsing.
3. Methods based on Vector Space Model
This method uses vectors in the vector space of key words to represent the user’s
interest. The vector space model is a common method for representing documents. Every
document $d$ can be represented as $\{(t_1, w_1), (t_2, w_2), \dots, (t_n, w_n)\}$, in which
$t_i$ is an item (a word or phrase) and $w_i$ is the weight of $t_i$ in $d$. When the method
is used to represent a user profile, $t_1, t_2, \dots, t_n$ represent the items the user is
interested in, and $w_1, w_2, \dots, w_n$ represent the degree of interest in each item.
4. Bookmark Method
Users add web pages they are interested in to their bookmarks (including
Hotlist and Favorites) in order to visit them again. Systems that use this method include
Siteseer, Open Bookmark and online bookmark services.
5. Methods based on User-item Matrix
The user-item matrix method uses a matrix $R_{m \times n}$ to represent user profiles [17].
In the matrix, $m$ is the number of users and $n$ is the number of items. Every element
$r_{ij}$ of the matrix represents the rating given by user $i$ (the row) to item $j$ (the column).
Generally $r_{ij}$ is an integer, for example from 1 to 5. An empty value means the user has
not rated the item yet. Systems based on collaborative filtering are well suited to
this method.
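Two of the representations above translate directly into data structures. The following sketch stores a vector space profile as a mapping from term to weight and a user-item matrix as a list of rows in which `None` marks a missing rating. Cosine similarity is used here to compare a profile with a document. That choice, and all names and numbers, are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Vector space profile: items the user is interested in and the degree of interest.
profile = {"football": 3.0, "messi": 4.0}
page_a = {"football": 3.0, "messi": 4.0}   # same direction as the profile
page_b = {"stock": 2.0, "market": 1.0}     # shares no term with the profile

# User-item matrix: rows are users, columns are items, ratings 1 to 5, None = unrated.
R = [
    [5, None, 3],
    [None, 2, None],
]
rated_per_user = [sum(r is not None for r in row) for row in R]
```

A page whose vector points in the same direction as the profile scores 1, and a page with no shared terms scores 0. The matrix makes the empty ratings explicit, which is what a collaborative filtering method has to cope with.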
All the methods mentioned above do not consider the multimedia content
contained in web pages. This thesis combines user profile modeling with image
retrieval in order to use the information contained in images to refine the
representation of users’ interests.
3 Multi Dimensional User Profile
3.1 Relativity between Text and Image Content in Web Pages
Images convey information through visual language, which is more direct and has a larger
information capacity. Besides, images are more accurate than text, and they are harder to
tamper with or distort. Thus most web pages on the Internet use images to better
represent their content. However, not all images in a web page are related to the main
topic of the page. Web pages contain many images such as advertisements, UI elements or
logos. The first part of this chapter introduces an experiment to
determine the relativity between text and image content in web pages.
3.1.1 Experimental Design
The experiment analyzes the relativity of images and text by automatically scanning
244 web pages. The key points of the experiment are:
1) The selection of web pages
This experiment has two groups of web pages. The first group consists of 100
web pages randomly selected from the Internet. The second group consists of 144 web
pages selected from the data source THE SYSKILL AND WEBERT WEB PAGE RATINGS [33].
2) Image analysis
The experiment analyzes the HTML files of the selected web pages. The
number of <img> tags is counted to obtain the number of images. The src, alt and
title attributes of the <img> tag represent the image content from the web page
producer’s point of view. The src attribute sets the source file of the image. The alt
attribute specifies alternative text (alt text) that is rendered when the
element to which it is applied cannot be rendered. The title attribute sets the title
of the image.
3) Relativity judgment
The words in a web page are sorted by word frequency. The 5 most frequent
words are compared with the words in the src, alt and title attributes of each <img> tag.
Once a matching word is found, the image is considered relevant to the topic of the web
page.
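The three steps of the experiment can be sketched with Python's standard `html.parser` module. This is an illustrative reconstruction and not the code used in the thesis. It assumes English text split on runs of letters, applies no stop-word filtering, and skips text inside script and style elements. The class and function names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

WORD = re.compile(r"[a-z]+")


class PageScanner(HTMLParser):
    """Collects text words and the src/alt/title words of each <img> tag."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.images = []   # one set of attribute words per <img> tag
        self._skip = 0     # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            text = " ".join(v or "" for k, v in attrs if k in ("src", "alt", "title"))
            self.images.append(set(WORD.findall(text.lower())))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(WORD.findall(data.lower()))


def relevant_images(html, top_n=5):
    """Return (number of images, number judged relevant to the page topic)."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    top = {w for w, _ in Counter(scanner.words).most_common(top_n)}
    relevant = sum(1 for attr_words in scanner.images if attr_words & top)
    return len(scanner.images), relevant


page = """
<html><head><title>Football news</title></head><body>
<p>Football football football match report. The football match ended.</p>
<img src="img/football_match.jpg" alt="football match">
<img src="ads/banner1.gif" alt="buy now">
</body></html>
"""
total, relevant = relevant_images(page)
```

For this sample page the first image shares the words "football" and "match" with the most frequent words of the text and is judged relevant, while the advertisement banner is not. Without a stop-word list, common words such as "the" can enter the top five, so a real run would likely need one.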
3.1.2 Experiment Result
In the first group, 99% of the web pages have images, and more than 78% of the pages
have at least one image relevant to the page topic. Among all the pages, 14.7% were
relevant to the page topic. See Figure 3.1.
Figure 3.1 Ratio of relevant images in randomly captured pages (legend: pages that have relevant images, pages that only have irrelevant images, pages that have no images)
In the data source [33], only 34.2% of the pages have images. Among all the images, 44.8%
are relevant to the page topic. See Figure 3.2. Among all the pages, 17% were
relevant to the page topic.
Figure 3.2 Ratio of relevant images in data source [33] (legend: pages that have relevant images, pages that only have irrelevant images, pages that have no images)
3.1.3 Analysis and Conclusion
The result shows that although most images are not relevant to the page’s topic,
most pages have at least one image relevant to their topic. In fact, many <img> tags
have no alt or title attribute, which lowers the measured ratio. The actual ratio is
therefore certainly higher than the ratio found in the experiment.
In previous work on web page recommendation systems, the information contained in
images was not considered. The last part of this chapter introduces a more
accurate user profile that incorporates information from image content.
3.2 Vector Space Model Based on HTML Structure
As mentioned in 2.3, the vector space model is a common method for representing user
profiles. Because this method is calculable and operable, this thesis uses the vector space
model to represent the part of the user profile extracted from text content.
The vector space model uses a vector of the form $\{(t_1, w_1), (t_2, w_2), \dots, (t_n, w_n)\}$
to represent the user’s interest. In the vector, $t_i$ is an item, that is, a word describing the
user’s interest, and $w_i$ is its weight. The web pages that interest the user may contain many
words that contribute little to representing the user’s interest. Besides, as the user
description file grows, memory use and computation cost also increase. Thus, how to
select items and their weights is the central problem of the vector space model.
In this thesis, the vector space model is based on the HTML structure and the word
frequency. Words that occur more frequently have a higher weight. Besides, words
in different tags of the HTML file differ in how important they are for representing the
topic of the page. Generally, words in the title have a closer relation to the topic
than words that occur in other parts of the web page. Considering the HTML
structure, words in different tags should be assigned different weights. See Table
3.1.
Table 3.1 Tags that affect item weights
Tag                                  Specification          Weight
<title></title>                      Title of the page      10
<h1></h1>, <h2></h2>, … <h6></h6>    Headings of the page   6 …, in which … is the level of the heading
<body></body>                        Text body              1
<em></em>, <strong></strong>         Emphasis               1
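The weight of key word $t$ is computed as
$$w_t = \sum_k f_k(t)\, W_k \qquad (3.1)$$
in which $k$ denotes the type of the tag, $f_k(t)$ is the number of times key word $t$ occurs
in tag $k$, and $W_k$ is the corresponding weight in Table 3.1. The function $T(t)$ is
$$T(t) = \begin{cases} 1, & t \text{ in tag } \texttt{<em>} \text{ or } \texttt{<strong>} \\ 0, & t \text{ not in tag } \texttt{<em>} \text{ and } \texttt{<strong>} \end{cases} \qquad (3.2)$$
Now key word $t$ and its weight $w_t$ can be calculated. After the weights of all the words in
the page have been calculated, the vector $V$ is obtained by assembling the $n$ words with the
highest weights:
$$V = \{(t_1, w_1), (t_2, w_2), \dots, (t_n, w_n)\} \qquad (3.3)$$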
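The weighting scheme of Table 3.1 and equations 3.1 to 3.3 can be sketched as follows. Several details are assumptions rather than statements of the thesis: the level-dependent heading weight is passed in as a function, and the default of 6 divided by the level is only a placeholder; each word is attributed to the innermost title or heading tag that encloses it and otherwise counts as body text; and the emphasis indicator $T(t)$ is added once to the weight of each emphasised key word. All names are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

WORD = re.compile(r"[a-z]+")
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
EMPHASIS = {"em", "strong"}


def heading_weight(level):
    # Placeholder assumption for the level-dependent heading weight.
    return 6.0 / level


class WeightScanner(HTMLParser):
    """Accumulates tag-weighted word counts and records emphasised words."""

    def __init__(self):
        super().__init__()
        self.weights = defaultdict(float)
        self.emphasised = set()
        self._stack = []   # open tags that influence the weight

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in HEADINGS or tag in EMPHASIS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        structural = [t for t in self._stack if t not in EMPHASIS]
        if structural and structural[-1] == "title":
            weight = 10.0
        elif structural:
            weight = heading_weight(int(structural[-1][1]))
        else:
            weight = 1.0   # plain body text
        in_emphasis = any(t in EMPHASIS for t in self._stack)
        for word in WORD.findall(data.lower()):
            self.weights[word] += weight
            if in_emphasis:
                self.emphasised.add(word)


def page_vector(html, n=3):
    """Top-n (key word, weight) pairs of a page, highest weight first."""
    scanner = WeightScanner()
    scanner.feed(html)
    scanner.close()
    # Assumption: T(t) contributes 1 once per emphasised key word.
    totals = {t: w + (1.0 if t in scanner.emphasised else 0.0)
              for t, w in scanner.weights.items()}
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


page = (
    "<html><head><title>Football</title></head><body>"
    "<h2>Match report</h2>"
    "<p>The football match was <em>great</em>. Football fans cheered.</p>"
    "</body></html>"
)
vector = page_vector(page)
```

With the placeholder heading weight, "football" collects 10 from the title plus 2 from the body, "match" collects 3 from the level 2 heading plus 1 from the body, and "report" collects 3 from the heading, so these three words form the page vector for n = 3. Replacing `heading_weight` with the exact expression from Table 3.1 changes only the heading contributions.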
3.3 Image information extraction and representation
3.3.1 Common Methods of Content Based Image Retrieval (CBIR)
An image retrieval system is a system for browsing, searching and retrieving images from
a large database of digital images. Traditional methods rely on text descriptions of
images, but this approach has major disadvantages. First, images are generally rich in
detail and extended meaning, which is difficult to describe with a few keywords or a simple
comment. Second, different people understand the same image differently,
which makes it difficult to use text labels to respond to user queries
accurately. Third, text annotation of images can only be done by hand, which is
feasible only when the number of images is small. If the total number of
images grows too fast, annotation by hand becomes very difficult.
Content-based image retrieval provides a good solution to these problems by
extracting characteristics from the image itself. The features extracted from images
are objective and comprehensive. Besides, the entire process can be done
automatically, which increases the speed and accuracy of retrieval. The common
methods of CBIR include:
1. Retrieval based on color
Compared to shape, color is invariant to rotation and scale [18]. The basic idea of
color based retrieval is to utilize features of the color distribution. The color
histogram is a common method of color based retrieval. This method represents
the color distribution by a histogram and takes the distance between two images
to be the distance between their color histograms. Image retrieval therefore
becomes color histogram matching. The method has a disadvantage: because the
color histogram does not retain any spatial information, the search results may be
inaccurate. The color histograms of two completely different images may be very
similar.
2. Retrieval based on texture
Texture is an important feature of objects, reflecting the changes of surface color
and grayscale. Retrieval based on texture can be divided into three categories:
statistical methods, structural methods and spectral methods. Statistical methods
identify numerical characteristics of the images, such as the Fourier spectrum [23],
the co‐occurrence matrix [24] and Markov random field models [25‐26]. Structural
approaches assume that the texture pattern consists of texture primitives arranged
according to certain rules. This is only suitable for images with regular texture.
3. Retrieval Based on Shape Feature
Shape is one of the essential characteristics for characterizing objects. Shape is
also the first thing people perceive when they learn about objects. The main
difficulty of image retrieval based on object shape is segmenting objects from the
image with an appropriate image segmentation method. The key is to find shape
characteristics consistent with human visual perception. Traditional shape‐based
retrieval relies on a shape feature vector composed of shape features. The classic
shape descriptions in image analysis include Fourier descriptors, moment
invariants, various simple form factors (size, roundness, eccentricity, etc.), major
axis orientation and so on.
3.3.2 Color Features Extraction and Representation based on Image Block
Considering the computing complexity, this thesis selects color features to represent
the images. In order to reduce the error, this thesis also utilizes image block division
to represent shape features to some extent.
1. Color Feature Extraction and Representation
The color histogram is widely used in many image retrieval systems to represent
color features. It describes the proportion of different colors in the whole image
but ignores the spatial location of each color, so it cannot describe the objects in
an image. The color histogram is particularly suitable for describing images in
which objects are difficult to segment automatically.
The color histogram can be based on different color spaces and coordinate
systems. The most common color space is the RGB color space, because most
digital images are expressed in it. However, the structure of the RGB space is not
consistent with people's subjective judgments of color similarity. Therefore, the
RGB color space is often converted into the HSV color space.
The HSV (hue, saturation, value) color space model corresponds to a conical subset
of the cylindrical coordinate system. The top surface of the cone corresponds to
V = 1. It contains the three faces R = 1, G = 1 and B = 1 of the RGB model and
represents the bright colors. Hue H is determined by the angle of rotation around
the V axis. Red corresponds to an angle of 0°, green to 120°, and blue to 240°. In
the HSV color model, each color and its complementary color differ by 180°.
Saturation S takes values from 0 to 1, so the top surface of the cone has a radius
of 1. A color with one hundred percent saturation in the HSV color model generally
has a purity of less than one hundred percent. At the cone vertex (i.e. the origin),
V = 0 and H and S are not defined; this point represents black. At the center of
the top surface of the cone, S = 0, V = 1 and H is not defined; this point represents
white. The axis from this point to the origin represents grays of decreasing
brightness, i.e. the grayscale. For these points, S = 0 and H is not defined. In other
words, the V axis of the HSV model corresponds to the main diagonal of the RGB
color space. The circumference of the top surface of the cone holds the solid
colors, which have V = 1 and S = 1. The HSV color model corresponds to the way
the painter
mixes pigment. Painters change the tint and the shade to obtain different colors
from one solid color: mixing the solid color with white changes the tint (it lowers
the saturation), and adding black changes the shade (it lowers the value). Adding
different proportions of white and black produces a variety of colors. The HSV
color space can be illustrated by a conical space model, as shown in Figure 3.3.
Figure 3.3 HSV color space
The formula that converts the RGB space into the HSV space is
$$
\begin{aligned}
v &= \max(r, g, b) \\
s &= [v - \min(r, g, b)] / v \\
h &= \begin{cases}
5 + b' & \text{if } r = \max(r, g, b) \text{ and } g = \min(r, g, b) \\
1 - g' & \text{if } r = \max(r, g, b) \text{ and } g \neq \min(r, g, b) \\
1 + r' & \text{if } g = \max(r, g, b) \text{ and } b = \min(r, g, b) \\
3 - b' & \text{if } g = \max(r, g, b) \text{ and } b \neq \min(r, g, b) \\
3 + g' & \text{if } b = \max(r, g, b) \text{ and } r = \min(r, g, b) \\
5 - r' & \text{otherwise}
\end{cases} \\
r' &= [v - r] / [v - \min(r, g, b)] \\
g' &= [v - g] / [v - \min(r, g, b)] \\
b' &= [v - b] / [v - \min(r, g, b)]
\end{aligned}
\qquad (3.4)
$$
In the equation, $r, g, b \in [0,1]$, $h \in [0,6]$, and $s, v \in [0,1]$. Before
calculating, the RGB values should be normalized. In order to calculate the color
histogram, the color space is divided into several small color ranges. Each range is
a bin in the histogram. This process is called color quantization. Then, by counting
the number of pixels that fall within each bin, the histogram is obtained. This thesis
calculates a three‐channel H, S, V color histogram. The hue histogram is divided
into 180 bins, which means that adjacent hue angles are merged into one bin.
Saturation and value each have 256 bins, obtained by multiplying the original S and
V by 255. Thus, the histogram is obtained by computing H, S and V for each pixel
and then counting the number of pixels that fall within each bin. It is expressed by
the vectors:
$$
\begin{aligned}
\mathrm{histH} &= \{h_1, h_2, \ldots, h_{180}\} \\
\mathrm{histS} &= \{s_0, s_1, \ldots, s_{255}\} \\
\mathrm{histV} &= \{v_0, v_1, \ldots, v_{255}\}
\end{aligned}
\qquad (3.5)
$$
In the equation, $h_i$, $s_j$, $v_k$ are the numbers of pixels that fall into each bin.
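The conversion and the three-channel histogram can be sketched as follows. This is a minimal sketch, assuming the usual hexcone form of the RGB to HSV conversion with h in [0, 6) and 8-bit input pixels. The function names are hypothetical, gray pixels (undefined hue) are assigned h = 0 by convention, and the 180 hue bins are assumed to cover two degrees each.

```python
def rgb_to_hsv_hexcone(r, g, b):
    """Convert normalized r, g, b in [0, 1] to (h, s, v) with h in [0, 6)."""
    v = max(r, g, b)
    lo = min(r, g, b)
    if v == lo:
        # Gray pixel (including black): S = 0 and H is undefined.
        # Returning h = 0 here is an assumed convention.
        return 0.0, 0.0, v
    s = (v - lo) / v
    r1 = (v - r) / (v - lo)
    g1 = (v - g) / (v - lo)
    b1 = (v - b) / (v - lo)
    if r == v and g == lo:
        h = 5 + b1
    elif r == v:
        h = 1 - g1
    elif g == v and b == lo:
        h = 1 + r1
    elif g == v:
        h = 3 - b1
    elif b == v and r == lo:
        h = 3 + g1
    else:
        h = 5 - r1
    return h % 6.0, s, v  # h = 6 wraps around to 0 (both mean red)


def hsv_histograms(pixels):
    """Count 8-bit (R, G, B) pixels into 180 hue, 256 saturation, 256 value bins."""
    hist_h, hist_s, hist_v = [0] * 180, [0] * 256, [0] * 256
    for r8, g8, b8 in pixels:
        h, s, v = rgb_to_hsv_hexcone(r8 / 255, g8 / 255, b8 / 255)
        hist_h[min(int(h * 30), 179)] += 1  # 6 sectors * 30 bins = 180 bins
        hist_s[int(round(s * 255))] += 1
        hist_v[int(round(v * 255))] += 1
    return hist_h, hist_s, hist_v
```

With this binning, pure red, green and blue fall into hue bins 0, 60 and 120, which matches the angles 0°, 120° and 240° at two degrees per bin.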
2. Image Block
The disadvantage of representing an image by color features is that two
completely different images may have a similar color distribution. For example,
the two different images in Figure 3.4 have 9 pixels each, but they have the same
color distribution.
Figure 3.4 Different images have the same color distribution
Image blocks may avoid this problem to a certain extent. The method splits the
whole image into small pieces and extracts color features from each piece
separately. The color features of corresponding blocks are then matched against
each other. By doing so, the problem is confined to each block, which keeps the
error over the whole image from becoming too large. The number of blocks is a
trade‐off between computing complexity and the desired effect. The problem can
also be mitigated by the text description in the web page. Taking the computing
complexity into account, this thesis segments an image into 4 × 4 sub‐blocks, so
each image is represented by three 16‐dimensional image feature vectors; see
equations (3.6), (3.7) and (3.8).
$$I_H = \{H_1, H_2, \ldots, H_{16}\} \qquad (3.6)$$
$$I_S = \{S_1, S_2, \ldots, S_{16}\} \qquad (3.7)$$
$$I_V = \{V_1, V_2, \ldots, V_{16}\} \qquad (3.8)$$
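The block division can be sketched as below. The names `split_into_blocks` and `block_feature_vector` are hypothetical, the image is assumed to be a 2-D list of pixels whose width and height are at least 4, and `feature` stands for any per-block feature function (in the thesis, a color histogram of one channel).

```python
def split_into_blocks(image, rows=4, cols=4):
    """Split a 2-D list of pixels into rows * cols blocks, listed row by row."""
    height, width = len(image), len(image[0])
    blocks = []
    for bi in range(rows):
        top, bottom = bi * height // rows, (bi + 1) * height // rows
        for bj in range(cols):
            left, right = bj * width // cols, (bj + 1) * width // cols
            blocks.append([image[y][x]
                           for y in range(top, bottom)
                           for x in range(left, right)])
    return blocks


def block_feature_vector(image, feature):
    """Apply a per-block feature function to each of the 16 blocks."""
    return [feature(block) for block in split_into_blocks(image)]
```

Two images with the same global color counts, for example one with a white right half and one with a white bottom half, yield different 16-element vectors because the white pixels land in different blocks.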
3.4 User Profile Design
3.4.1 Interest Vector
Vector T [equation (3.9)] describes a topic the user is interested in. It is composed
of a key word‐weight pair and the image feature vectors. The key word represents
the topic, and the image feature vectors represent the related image. If no image
related to the topic exists, the image feature vector part is set to a null value.
$$T = \{\text{key words}, \text{weight}, I_H, I_S, I_V\} \qquad (3.9)$$
The final user profile is composed of several interest vectors.
3.4.2 Similar Interest Judgment
The interest vector is composed of two parts: the key words and weight extracted
from the text content, and the image feature vectors extracted from the image
content related to the topic. The distance between two interest vectors is therefore
composed of the distance between key words and the distance between image
feature vectors.
1. Distance between key words with weight
Considering the computing complexity, this thesis does not perform semantic
analysis of the key words. Identical key words are considered to have distance 0
and different key words are considered to have distance ∞. Thus, when similar
interest vectors are merged, only vectors with the same topic are merged.
2. Distance between color histograms
If the heights of the histogram bins are regarded as the distribution of a discrete
random variable, the distance between color histograms can be represented by
the correlation coefficient of the two random variables. The benefit of using the
correlation coefficient is its ability to handle negative correlation, which means
that similar images in complementary colors can also be recognized. It is
calculated as follows:
$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}} \qquad (3.10)$$
If the two histograms are positively correlated, $\rho$ is positive, and $\rho = 1$ when
they are perfectly positively correlated. If the two histograms are negatively
correlated, $\rho$ is negative, and $\rho = -1$ when they are perfectly negatively
correlated. The closer the absolute value of the correlation coefficient is to 1, the
more closely the two histograms are related and the higher the similarity of the
images. The closer it is to 0, the less closely the two histograms are related and
the lower the similarity of the images.
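The correlation coefficient of two histograms can be computed directly from the bin heights. The sketch below assumes both histograms have the same number of bins; the name `histogram_correlation` is hypothetical, and returning 0 for a completely flat histogram (zero variance) is an assumed convention that the thesis does not specify.

```python
import math


def histogram_correlation(x, y):
    """Pearson correlation coefficient of two equal-length histograms."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mean_x * mean_y
    var_x = sum(a * a for a in x) / n - mean_x ** 2
    var_y = sum(b * b for b in y) / n - mean_y ** 2
    if var_x <= 0 or var_y <= 0:
        return 0.0  # assumed convention: a flat histogram has no correlation
    return cov / math.sqrt(var_x * var_y)
```

A histogram compared with a scaled copy of itself gives a value of 1 up to rounding, and compared with its mirror image it gives -1.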
3. Distance between image feature vectors
Now the distance between the color histograms of corresponding blocks can be
calculated. To calculate the distance between image feature vectors, all blocks are
considered equally important. The distance between image feature vectors is the
mean value of the distances between corresponding blocks:
$$D_j = \frac{1}{16}\sum_{i=1}^{16} \rho_{ij}, \quad j \in \{H, S, V\}; \qquad D = \frac{1}{3}(D_H + D_S + D_V) \qquad (3.11)$$
where $\rho_{ij}$ is the distance (3.10) between the channel‐j histograms of the i‐th
blocks of the two images.
4. The distance between interest vectors
The interest vector distance F is defined as:
$$F(T_1, T_2) = \begin{cases} \infty, & \text{different key words} \\ D, & \text{same key words} \end{cases} \qquad (3.12)$$
If the distance is less than the threshold S, then the two vectors are similar.
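The two-level averaging and the key word rule can be sketched as follows. The data layout is an assumption: an interest vector is a dict with the hypothetical keys `keyword` and `image`, and `image` maps each channel to its 16 block histograms. `block_distance` is a placeholder for whatever per-histogram distance is used (the thesis uses the correlation coefficient); treating a missing image as distance 0 is also an assumption.

```python
import math


def image_feature_distance(image_a, image_b, block_distance):
    """Mean block distance per channel, then the mean over H, S and V."""
    channel_means = []
    for channel in ("H", "S", "V"):
        blocks_a, blocks_b = image_a[channel], image_b[channel]
        total = sum(block_distance(a, b) for a, b in zip(blocks_a, blocks_b))
        channel_means.append(total / len(blocks_a))
    return sum(channel_means) / 3


def interest_distance(vec_a, vec_b, block_distance):
    """Infinity for different key words, image distance for identical ones."""
    if vec_a["keyword"] != vec_b["keyword"]:
        return math.inf
    if vec_a["image"] is None or vec_b["image"] is None:
        return 0.0  # assumption: no image means only the key word is compared
    return image_feature_distance(vec_a["image"], vec_b["image"], block_distance)


def l1_distance(a, b):
    """Stand-in histogram distance used only to exercise the sketch."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Two vectors are then treated as similar when `interest_distance(...)` is below the threshold S.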
3.4.3 User Modeling Algorithm
Algorithm 3.1 gives the user modeling algorithm.
3.4.4 Distance between Users
In this thesis, the distance between users is calculated based only on the user
profiles. The user rating is used for updating the user profile. The distance is
calculated as follows:
$U_1$ and $U_2$ are two different user profiles, each composed of several interest
vectors. Calculating the distance between them is divided into the following steps:
1. Digitalize Vector
Assume user profile $U_1$ contains the key words $t_1, \ldots, t_m, x_1, \ldots, x_k$ and
user profile $U_2$ contains the key words $x_1, \ldots, x_k, v_1, \ldots, v_n$, in which
$x_1, \ldots, x_k$ are the common (same or similar, see 3.4.2) interest vectors
contained in both user profiles. The two user profiles can be converted into two
equal‐length numeric vectors in the way illustrated in Table 3.2. Key words that
differ between the profiles are replaced by their corresponding weights. In order
to reflect the difference between the two user profiles on the same or similar key
words, these are replaced by the corresponding weight combined with the distance
between the two interest vectors. If a key word does not exist in one of the user
profiles, the corresponding position is set to 0.
Table 3.2 Digitalized user profiles
User Profile … … …
1U … 0.5 … 0.5 0 … 0
2U 0 … 0 ′ 0.5 … ′ 0.5 ′ … ′
2. Calculate the Cosine Distance
The cosine distance between user profiles $U_1$ and $U_2$ can be calculated after
digitalization.
$$L = \cos(U_1, U_2) = \frac{\sum_{i=1}^{K} U_{1,i}\, U_{2,i}}{\sqrt{\sum_{i=1}^{K} U_{1,i}^2}\,\sqrt{\sum_{i=1}^{K} U_{2,i}^2}} \qquad (3.13)$$
In the formula, K is the number of interest vectors in the user profile. In this
thesis it is 20.
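Once both profiles are digitalized into equal-length numeric vectors, the cosine measure is a few lines of code. The sketch below takes the digitalized vectors as given (the digitalization itself is not reproduced); the function name is hypothetical, and returning 0 for an all-zero vector is an assumed convention.

```python
import math


def cosine_measure(u1, u2):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u1, u2))
    norm1 = math.sqrt(sum(a * a for a in u1))
    norm2 = math.sqrt(sum(b * b for b in u2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for an empty profile
    return dot / (norm1 * norm2)
```

Profiles with no key words in common have non-zero entries in disjoint positions, so their cosine is 0; profiles with proportional weights give a value of 1 up to rounding.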
3.4.5 Similar User Clustering
This thesis uses an improved K‐means algorithm to cluster similar users.
Algorithm 3.2
Input: user profile set U
Output: user profile clusters
1. Randomly select log K user profiles as the initial cluster centers;
2. For k from log K to K
3. Calculate the distance between the other profiles and the cluster centers, and
assign each user profile to the nearest cluster;
4. Select the profile farthest from each cluster center as the center of a new
cluster;
5. k = 2 * k;
6. End
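One possible reading of Algorithm 3.2 is sketched below. Several details are assumptions rather than statements of the thesis: the logarithm is taken to base 2, the loop stops before the number of centers would exceed K, and a cluster whose only member is its own center does not spawn a new cluster. The names `cluster_profiles`, `k_max` and `seed` are hypothetical, and `distance` is any profile distance function.

```python
import math
import random


def cluster_profiles(profiles, distance, k_max, seed=0):
    """Grow clusters by repeatedly promoting each cluster's farthest member."""
    rng = random.Random(seed)
    k = max(1, min(len(profiles), int(math.log2(k_max))))  # assumed base 2
    centers = rng.sample(profiles, k)
    while True:
        # Step 3: assign every profile to its nearest center.
        clusters = [[] for _ in centers]
        for profile in profiles:
            nearest = min(range(len(centers)),
                          key=lambda c: distance(profile, centers[c]))
            clusters[nearest].append(profile)
        if 2 * len(centers) > k_max:
            break
        # Step 4: the farthest member of each cluster becomes a new center.
        new_centers = []
        for center, members in zip(centers, clusters):
            farthest = max(members, key=lambda p: distance(p, center), default=None)
            if farthest is not None and distance(farthest, center) > 0:
                new_centers.append(farthest)
        if not new_centers:
            break
        centers = centers + new_centers  # step 5: roughly doubles k
    return centers, clusters
```

The function returns the final centers together with the profiles assigned to each of them; every profile ends up in the cluster of its nearest center.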
3.4.6 User Feedback
This thesis includes several aspects of user feedback rating:
1. Explicit rating by the user
After the user clicks a link in the recommendation result list to access the page,
the system asks the user to assign a rating on a 5‐point scale.
2. User's browsing behavior
The user's browsing behavior can imply the user's preferences. For example, while
browsing, the user may add a bookmark, download, copy or perform other
operations. The browsing behavior can be seen as a kind of feedback on the
recommendation result. This thesis considers that different behaviors represent
different degrees of user interest, ordered as: Add Bookmark > Download > Save >
Copy. Let $b$ denote the weight of a browsing behavior. The page behavior vector
can then be represented as {$b_1$ (adding bookmark), $b_2$ (download), $b_3$ (save
page), $b_4$ (copy)}. Their values are set to 4, 3, 2 and 1 respectively. Each behavior
weight has only two possible values, either 0 or the specified value. For example,
if a user added a page to the bookmarks and also copied some content, but did
not download or save it, then the user behavior vector is {4,0,0,1}.
The score of the user's browsing behavior is the sum of the operation weights:
$$\text{score} = \sum_{i=1}^{4} b_i + 1 \qquad (3.14)$$
Once the user is reading a page, performing none of the four browsing operations
does not mean that the user is completely uninterested in the page. Therefore, a
constant 1 is added at the end of the equation.
Finally, the feedback is mainly based on the user's explicit rating. If the user gives
no rating on the website, the score is obtained from the user's browsing behavior.
Both situations give a feedback score from 1 to 5. A feedback score occurs only
after the user clicks one of the links in the recommendation result list to enter the
page. Pages in the result list that the user does not access get a score of 0.
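The feedback rules can be sketched as below. The behavior weights and the added constant follow the text; the names `BEHAVIOR_WEIGHTS`, `behavior_score` and `feedback_score` are hypothetical. Capping the behavior-based score at 5 is an assumption made here so that the result stays on the 1 to 5 scale, since the plain sum can exceed 5.

```python
# Weights of the four browsing behaviors as given in the text.
BEHAVIOR_WEIGHTS = {"bookmark": 4, "download": 3, "save": 2, "copy": 1}


def behavior_score(behaviors):
    """Sum of the weights of the observed behaviors, plus the constant 1."""
    return sum(BEHAVIOR_WEIGHTS[b] for b in set(behaviors)) + 1


def feedback_score(visited, explicit_rating=None, behaviors=()):
    """Explicit rating if present, otherwise the browsing behavior score."""
    if not visited:
        return 0  # pages the user never opened get a score of 0
    if explicit_rating is not None:
        return explicit_rating
    return min(behavior_score(behaviors), 5)  # cap at 5 is an assumption
```

For the behavior vector {4,0,0,1} from the text, `behavior_score(["bookmark", "copy"])` gives 6, which the cap reduces to 5.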
3.4.7 Updating User Profile
The user profile should reflect changes in the user's interest. Thus the user profile
needs to be updated and maintained. In addition, the user profile can be refined
based on the user's feedback. In this thesis, updating the user profile consists of two
aspects:
1. Regular updating by the system
According to the user's new browsing behavior in the most recent period, the
system captures the changes in the user's interest and automatically updates the
user profile. The user model update algorithm is described in Algorithm 3.3.
2. Updating based on user feedback ratings
In combination with collaborative filtering, the system asks the user to rate the
results. At the same time, the system captures the user's browsing behavior on
pages in the result list. Using these ratings and browsing behavior, the system
refines the user profile; see Algorithm 3.4.
If a web page gets a feedback score less than 3, the interest vectors extracted
from the page are added to the user profile. If an interest vector already exists in
the user profile, the corresponding weight in the user profile is decreased. If the
feedback score is greater than or equal to 3, the interest vectors extracted from
the page are added to the user profile. If an interest vector already exists in the
user profile, the corresponding weight in the user profile is increased.
Algorithm 3.1
Input: set of web pages W
Output: user profile U
⑴ Preprocess W;
WHILE W != ∅
{
⑵ ∀P ∈ W;
⑶ Analyze P, counting the frequency of each word in different tags (Table 3.1);
⑷ Calculate the weight of each word using equation 3.1;
⑸ Calculate vector V using equation 3.3;
⑹ Sort V by weight, keep the top 20 key word‐weight pairs;
⑺ Judge the relativity between images and key words (use the alt, title, src
attributes like chapter 3.11);
⑻ Calculate the vectors in equations 3.6, 3.7, 3.8 for each image related to at
least one key word;
⑼ Compose the image feature vectors with the key word‐weight pairs to
get interest vectors $T_i$ (1 ≤ i ≤ 20). If one image is related to multiple key words,
add the image feature vectors to all the corresponding interest vectors;
⑽ Add interest vectors $T_i$ (1 ≤ i ≤ 20) into interest vector set T;
⑾ Delete P from W;
}
WHILE T != ∅
{
⑿ ∀$T_i$, $T_j$ ∈ T, if F($T_i$, $T_j$) < S, then $T_i$ and $T_j$ are similar (the distance
F is calculated by equation 3.12, S is the threshold);
⒀ Merge similar interest vectors $T_i$, $T_j$:
(i) Sum the weights of the two vectors,
(ii) Compute the mean value of the vectors in equations 3.6, 3.7, 3.8;
⒁ Add the interest vector (merged, or unchanged if no merge is needed) into U,
and delete it from T;
}
⒂ Sort U by weight, keep the top μ interest vectors.
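The merge phase of the algorithm (steps ⑿ to ⒂) can be sketched as follows. The sketch simplifies the similarity test to "same key word" and ignores the image distance threshold, which is an assumption. The image feature vector is represented as a flat list of numbers or `None`, the name `merge_interest_vectors` and the dict keys are hypothetical, and the pairwise averaging used here is only an exact mean when two vectors are merged.

```python
def merge_interest_vectors(vectors, top=None):
    """Merge vectors with the same key word and keep the heaviest `top`."""
    merged = {}
    for vec in vectors:
        key = vec["keyword"]
        if key not in merged:
            merged[key] = dict(vec)  # copy so the input is not modified
            continue
        kept = merged[key]
        kept["weight"] += vec["weight"]          # (i) sum the weights
        if kept["image"] is None:
            kept["image"] = vec["image"]
        elif vec["image"] is not None:           # (ii) average image features
            kept["image"] = [(a + b) / 2
                             for a, b in zip(kept["image"], vec["image"])]
    ranked = sorted(merged.values(), key=lambda v: v["weight"], reverse=True)
    return ranked[:top]
```

Calling the function with `top` set to μ corresponds to the final sorting step of the algorithm.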
Algorithm 3.3
Input: User profile U, temporary web page set W’
Output: New user profile U’
⑴ Preprocess W’;
WHILE W’ != ∅
{
⑵ ∀P’ ∈ W’;
⑶ Analyze P’, counting the frequency of each word in different tags (Table 3.1);
⑷ Calculate the weight of each word using equation 3.1;
⑸ Calculate vector V’ using equation 3.3;
⑹ Sort V’ by weight, keep the top 20 key word‐weight pairs;
⑺ Judge the relativity between images and key words (use the alt, title, src
attributes like chapter 3.11);
⑻ Calculate the vectors in equations 3.6, 3.7, 3.8 for each image related to at
least one key word;
⑼ Compose the image feature vectors with the key word‐weight pairs to
get interest vectors $T_i$ (1 ≤ i ≤ 20). If one image is related to multiple key words,
add the image feature vectors to all the corresponding interest vectors;
⑽ Add interest vectors $T_i$ (1 ≤ i ≤ 20) into interest vector set T’;
⑾ Delete P’ from W’;
}
WHILE T’ != ∅
{
⑿ ∀$T_i$ ∈ T’, ∀$T_j$ ∈ U, if F($T_i$, $T_j$) < S, then $T_i$ and $T_j$ are similar (the
distance F is calculated by equation 3.12, S is the threshold);
⒀ Merge similar interest vectors $T_i$, $T_j$:
(i) Sum the weights of the two vectors,
(ii) Compute the mean value of the vectors in equations 3.6, 3.7, 3.8;
⒁ Else if ∃$T_i$ ∈ T’, ∀$T_j$ ∈ U, F($T_i$, $T_j$) ≥ S, then add $T_i$ into U;
}
⒂ Sort U by weight, keep the top μ interest vectors, U’ = U.
Algorithm 3.4
Input: User profile U, feedback score set S, interest vector set T of the result
pages
Output: New user profile U’
1. For every interest vector $T_i$ in T
2. ∃s ∈ S, s is the feedback score of the pages containing $T_i$
3. IF s < 3 THEN
4. IF ∃$T_j$ ∈ U, F($T_i$, $T_j$) < D THEN
5. decrease the weight of $T_j$;
6. ELSE
7. insert $T_i$ into U;
8. END
9. ELSE IF s ≥ 3 THEN
10. IF ∃$T_j$ ∈ U, F($T_i$, $T_j$) < D THEN
11. increase the weight of $T_j$;
12. ELSE
13. insert $T_i$ into U;
14. END
15. END
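The feedback-driven update can be sketched as below. The profile is simplified to a mapping from key word to weight, two interest vectors are taken to match when their key words are identical, and the amount by which a weight changes (`step`) is a hypothetical parameter, since the size of the adjustment is not given here. Scores of 3 and above are treated as positive feedback.

```python
def apply_feedback(profile, page_vectors, score, step=1.0):
    """Return an updated copy of the profile after one page's feedback."""
    updated = dict(profile)
    for keyword, weight in page_vectors.items():
        if keyword in updated:
            # Existing interest: lower it on a low score, raise it otherwise.
            updated[keyword] += -step if score < 3 else step
        else:
            # New interest: inserted with the weight extracted from the page.
            updated[keyword] = weight
    return updated
```

After the update, the profile would be sorted by weight and truncated to the top μ interest vectors, as in the other profile algorithms.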
4 A Hybrid System ‐ Combined Content based and Collaborative Recommendations
4.1 Advantage of Hybrid System
This thesis uses a hybrid recommendation approach, combining content‐based
recommendation with collaborative filtering. The advantage is that it avoids the
problems caused by using a single recommendation method:
1. Over‐fit problem caused by using only content‐based recommendation.
The content‐based method makes recommendations based on the relevance
between candidate items and the user profile. The over‐fit problem is that the
recommendations are either too similar to the items browsed previously or not
related at all. It comes from the incompleteness of the data. When the
content‐based method is combined with collaborative filtering, the system can
make full use of data from other similar users.
2. Feature extraction problem of the content‐based method.
When the features of a candidate item are difficult to extract or describe, or only
the general characteristics of the item can be obtained but not its precise
characteristics, it is difficult to match the user's interest accurately with
content‐based recommendation. Collaborative filtering can then be used to predict
the user's interest based on other similar users' feedback on the same item.
3. New item problem caused by using only collaborative filtering.
The problem is that when new items are added to the system, the system cannot
recommend them until they have received enough feedback ratings.
Content‐based recommendation can make recommendations directly from the
content of the new item.
4. Sparsity problem caused by using only collaborative filtering.
The sparsity problem means that when the number of users is too small compared
to the number of items, there are always some items that have not received any
feedback. The content‐based method can make recommendations directly from
the content of items that have not received any user evaluation. In the extreme
case, the system can still provide service even if it has only one user.
5. Sparsity problem caused by large selection differences.
The problem is that the rating difference between any two users is very large, or
some user's preferences are too unusual compared to other users. When combined
with content‐based recommendation, the system can make recommendations
even for users with special interests, without referring to other users' interests.
4.2 System Architecture
Figure 4.1 System architecture (client and server sides; components: Collector,
Distributor, Selector, Feedback Controller and User Profile Database; connected to
User, Similar User and Web Pages)
The system uses a client‐server architecture. The server includes three main
modules: the collector, responsible for collecting pages from the Internet; the
distributor, responsible for matching web pages against user profiles and then
making recommendations with the content‐based method; and the user profile
database, which stores the user profiles.
The selector receives the recommendation result from the distributor and filters out
pages that have already been visited by the user, as well as extra pages on the same
website. The feedback controller collects user feedback ratings and captures the
user's browsing behavior. Once a page gets a feedback score of more than 4 points,
it is directly recommended to the other similar users belonging to the same cluster.
That is how collaborative filtering works in this thesis. In addition, the feedback
controller is also responsible for updating the user profile based on user feedback.
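The selector's filtering step can be sketched as follows. The sketch assumes that recommendations arrive as a ranked list of URLs and that "extra pages on the same website" means keeping only the best-ranked page per host name; the function name `select_pages` is hypothetical.

```python
from urllib.parse import urlparse


def select_pages(recommended_urls, visited_urls):
    """Drop visited pages and keep at most one page per website."""
    visited = set(visited_urls)
    seen_hosts = set()
    selected = []
    for url in recommended_urls:
        host = urlparse(url).netloc
        if url in visited or host in seen_hosts:
            continue
        seen_hosts.add(host)
        selected.append(url)
    return selected
```

Because the input is processed in ranking order, the page that survives for each website is the one the distributor ranked highest.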
4.3 System Flow
System flow is as follows:
First, when the user accesses the system, the server checks for the system cookie
after receiving the client request. If the cookie exists, the user is logged on
automatically. If there is no cookie, the login page is sent to the user. On the login
page, registered users can log in manually, and unregistered users can jump to the
registration page.
After a successful login, the system collects the user's browsing history. If the user is
not a new user, the system discards the browsing records from before the last login
time stored in the cookie and uses the recent browsing history to update the user
profile. If the cookie does not exist, the user profile is not updated this time. For new
users, the system calculates the user profile based on the entire browsing history.
Then the system calculates the distance between the new user and the existing
cluster centers and adds the new user to the proper cluster.
In the next step, the system matches the pages in the database against the user
profile and recommends the 10 best matches. The result list is contained in a page
sent back to the user as the HTTP response. The user browses the result page while
the system collects the feedback and the user's browsing behavior. Then the system
updates the user profile based on the feedback. If a page gets a score of more than
4, the page is also recommended to all the similar users. See Figure 4.2.
Figure 4.2 system flow
4.4 Key Technology Used in the System
4.4.1 Chinese word Segmentation
Unlike English, Chinese text has no spaces or other separators between words. In
order to count word frequencies, the system needs to segment the Chinese text
into separate Chinese words. Existing segmentation algorithms can be divided into
three categories: segmentation based on string matching, segmentation based on
understanding and segmentation based on statistical methods.
1. Segmentation based on string matching
This method is also known as the mechanical segmentation method. It matches
the Chinese string against the entries of a "sufficiently large" machine dictionary
according to a certain strategy. If a word in the string is found in the dictionary,
the match is successful (the word is identified). The methods can be divided into
forward matching and reverse matching according to the scanning direction, or
into maximum (longest) matching and minimum (shortest) matching according to
which length is given priority. The common methods are combinations of the
methods mentioned above, as follows:
1) Forward maximum matching (from left to right);
2) Reverse maximum matching (from right to left);
3) Least segmentation (segment the string into the minimum number of words);
4) Bi‐directional maximum matching (from left to right, then from right to left).
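Forward maximum matching, the first of the listed strategies, can be sketched in a few lines. The dictionary below is a toy example, the function name is hypothetical, and characters that match no dictionary entry are emitted as single-character words, which is a common fallback assumed here.

```python
def forward_maximum_matching(text, dictionary, max_len=None):
    """Segment text left to right, always taking the longest dictionary word."""
    if max_len is None:
        max_len = max(map(len, dictionary))  # assumes a non-empty dictionary
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink it until it matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

With the dictionary {"计算", "计算机", "操作", "系统", "操作系统"}, the string "计算机操作系统" is segmented into "计算机" and "操作系统", because the longest matching entry wins at each position.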
2. Segmentation based on understanding
This method lets the computer simulate the way people understand sentences in
order to identify words. The basic idea is to perform syntactic and semantic
analysis during segmentation to deal with ambiguity. It usually contains three
parts: the segmentation subsystem, the syntax and semantics subsystem, and the
control subsystem. Under the coordination of the control subsystem, the
segmentation subsystem obtains syntactic and semantic information about words
and sentences to resolve segmentation ambiguity, which simulates the process by
which people understand a sentence. This method requires a large amount of
language knowledge and information. Because knowledge of the Chinese
language is general and complex, it is hard to organize the language information
into a form that can be read directly by a machine. Segmentation based on
understanding is currently still at the experimental stage.
3. Segmentation based on statistical methods
Formally, a word is a stable combination of characters. The more often adjacent
characters occur together in context, the more likely they are to form a word. This
method analyzes the co‐occurrence of adjacent characters. When the
co‐occurrence frequency of adjacent characters is higher than a threshold, the
characters may constitute a word. The method is based only on the statistical
frequency of groups of characters in the corpus, and no segmentation dictionary
is needed. Thus the method is also called a dictionary‐free method.
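The co-occurrence idea can be illustrated with pairs of adjacent characters. This is a minimal sketch that uses raw counts over a tiny corpus: the function name and the `threshold` parameter are hypothetical, and practical systems typically use normalized measures rather than raw counts.

```python
from collections import Counter


def frequent_bigrams(corpus, threshold):
    """Return adjacent character pairs that occur more than `threshold` times."""
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return {pair: n for pair, n in counts.items() if n > threshold}
```

In the corpus ["系统设计", "操作系统", "系统"], only the pair "系统" occurs three times, so it is the only candidate word for a threshold of 2.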
Because Chinese segmentation is not the focus of this thesis, the system uses an
open source project called ICTCLAS (Institute of Computing Technology, Chinese
Lexical Analysis System) [38]. ICTCLAS 3.0 reaches a segmentation speed of
996 KB/s and a segmentation accuracy of 98.45%. The size of the API is less than
200 KB, and all the dictionary data take less than 3 MB. ICTCLAS supports Linux,
FreeBSD and Windows, and has versions for C/C++, C#, Delphi, Java and other
mainstream development languages.
4.4.2 Calculation of Chinese compound word
A Chinese compound is defined as follows: a nominal compound word is a group of
consecutive words that is equivalent to a simple noun in overall function, is
semantically complete, and does not contain function words [39]. It is like a phrase
in English, such as "计算机操作系统" (computer operating system), which is
composed of "计算机" (computer), "操作" (operating) and "系统" (system). In Chinese,
a compound word is treated as a single word. If the compound word is segmented
into "computer", "operating" and "system", it will distort the vector space of the
document.
How to correctly identify compound words has been a hot research topic in
information retrieval, machine translation and text classification. There are many
sophisticated methods, including:
1. Statistical methods
⑴ Approach based on frequency
If two or more words occur together many times, the possibility that they
constitute a compound word is higher, or at least the words together carry a
special meaning. In practice the method is often combined with linguistic and
heuristic rules to improve precision and recall.
⑵ Hypothesis testing
Statistical tests are used to determine whether the words occur together merely
by chance, that is, under what conditions the words do not constitute a compound
word.
⑶ Likelihood ratio
The likelihood ratio indicates how much more likely one hypothesis about how the
compound word is formed is than the alternatives.
⑷ Relative frequency comparison
Compound words of a domain often occur within that domain but rarely occur in
other domains. Based on this feature, comparing the occurrence frequency of a
candidate in two or more different corpora can also help find compound words.
2. Rule‐based methods
Rule‐based methods use context information and the internal components to
identify compound words. However, because linguistic rules are complex, rule
extraction relies on manual work. Thus linguistic rules are difficult to apply
accurately, and rule‐based analysis is now rarely used.
3. Statistical and rule based methods
The combined statistical and rule‐based approach is the current trend in
compound word research. Because of the complexity of natural language,
statistical methods cannot identify low‐frequency compound words, while manual
rules are difficult to extract. Thus, many researchers try to combine the two
methods to extract and identify compound words.
The hybrid method combines statistical knowledge with linguistic knowledge
(syntactic and semantic information). There are a variety of specific
implementations. For example, statistical methods are first used to identify
candidate compound words, and linguistic knowledge is then used to filter them.
This thesis uses the statistical method [38]. The probability that two words
constitute a compound word is calculated as follows:
$$f = \frac{f_{xy}}{f_x + f_y - f_{xy}} \qquad (5.1)$$
in which $f_x$ is the frequency of word X, $f_y$ is the frequency of word Y, and
$f_{xy}$ is the frequency of the compound word XY. When f > 1%, XY should be
considered a compound word.
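The compound word test can be sketched as below, assuming that equation (5.1) is the ratio of the joint frequency to the combined frequency of the two words, f_xy / (f_x + f_y - f_xy). That reading is an assumption, as are the function names; only the frequencies and the 1% threshold come from the text.

```python
def compound_score(f_x, f_y, f_xy):
    """Share of the occurrences of X or Y in which they appear together as XY."""
    denominator = f_x + f_y - f_xy
    if denominator <= 0:
        return 0.0  # no occurrences at all; assumed convention
    return f_xy / denominator


def is_compound(f_x, f_y, f_xy, threshold=0.01):
    """Treat XY as a compound word when the score exceeds the threshold."""
    return compound_score(f_x, f_y, f_xy) > threshold
```

For example, two words that each occur 1000 times but appear together only 5 times score about 0.0025 and are not treated as a compound, while 50, 40 and 30 occurrences give a score of 0.5.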
4.4.3 Capture User Browsing Behavior
In IE, the user browsing behavior can be captured by BHO (Browser Helper Objects).
1. Principle of interaction between BHO and the Internet Explorer
Internet Explorer and BHO communicate via COM interfaces. BHO is a COM
object implement a specific interface. Developed BHO plug‐in is registered in the
registry under a certain key. When the browser starts, Internet Explorer checks
the key and the loads all objects under the key. Internet Explorer initializes the
object and checks the corresponding interface. If the interface is founded,
Internet Explorer uses the methods provided by the interface to pass its
IUnknown pointer to the BHO object.
The browser may find a number of CLSIDs (Class IDs) in the registry, and it creates an
in‐process instance for each CLSID. As a result, these objects are loaded into the
browser's memory space and run as if they were native components. Because Internet
Explorer is built on COM, a BHO that wants to hook browser events needs to
create a COM‐based communication channel and implement an interface called
IObjectWithSite. Through IObjectWithSite, Internet Explorer passes
its IUnknown pointer. The BHO can store this pointer and query it for the
other interfaces it needs, such as IWebBrowser2, IDispatch, and
IConnectionPointContainer.
The BHO is loaded when the browser's main window is displayed and is unloaded when
the main window is destroyed. The BHO is loaded no matter which command is used to
start the browser. Multiple copies of Internet Explorer are created, and multiple
BHOs are loaded, only when several iexplore.exe processes are started explicitly. When
a new Internet Explorer window is opened from an existing one, the new
window only creates a new thread rather than a new process; therefore, the
BHO is not reloaded.
2. Process of capturing Internet Explorer browsing behavior
The process, from loading the BHO when IE starts to unloading it when IE exits,
can be divided into four stages:
(1) Stage of loading the BHO
This stage can be summarized in four steps: Internet Explorer starts; IE
finds the sub key under a certain key in the registry and loads the BHO; IE requests the
IObjectWithSite interface of the BHO; and, on obtaining the interface, IE passes its
own IUnknown interface to the BHO.
(2) Stage of establishing connections
This stage is the prerequisite for receiving all kinds of IE events. The
BHO first requests IConnectionPointContainer and IWebBrowser
through the IUnknown interface. If the request succeeds, the BHO asks the IE browser,
through IConnectionPointContainer, to establish a connection.
The BHO also requests connections with the HTML window and the HTML
document through the IWebBrowser interface. IE then creates the connections
mentioned above.
(3) Stage of event processing
IE sends event IDs through the IDispatch interface that the BHO exposes. The BHO handles
each event according to its event ID. This process repeats until IE is closed.
(4) Stage of closing
When the IE browser is closed, the BHO is unloaded.
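The four stages can be illustrated with a small simulation. This is only a sketch in Python, not COM code: the classes SimulatedBrowser and BrowsingRecorder, the methods set_site, advise, unadvise and invoke, and the event identifiers are hypothetical stand‐ins for the browser, the BHO, IObjectWithSite, the connection point and IDispatch, and the numeric values are placeholders rather than real dispatch IDs.

```python
from enum import Enum


class BrowserEvent(Enum):
    # Placeholder identifiers for this sketch; these are not real dispatch IDs.
    BEFORE_NAVIGATE = 1
    DOCUMENT_COMPLETE = 2
    QUIT = 3


class SimulatedBrowser:
    """Stand-in for the browser side of the connection."""

    def __init__(self):
        self.sinks = []

    def advise(self, sink):
        # Stage 2: the helper registers itself as an event sink.
        self.sinks.append(sink)

    def unadvise(self, sink):
        self.sinks.remove(sink)

    def fire(self, event, url=None):
        # Stage 3: the browser sends the event ID to every connected sink.
        for sink in list(self.sinks):
            sink.invoke(event, url)


class BrowsingRecorder:
    """Stand-in for the helper object that records visited URLs."""

    def __init__(self):
        self.browser = None
        self.visited = []

    def set_site(self, browser):
        # Stage 1: the browser hands over a reference to itself.
        self.browser = browser
        browser.advise(self)

    def invoke(self, event, url):
        # Stage 3: dispatch on the event ID.
        if event is BrowserEvent.DOCUMENT_COMPLETE:
            self.visited.append(url)
        elif event is BrowserEvent.QUIT:
            # Stage 4: disconnect and release the browser reference.
            self.browser.unadvise(self)
            self.browser = None


browser = SimulatedBrowser()
recorder = BrowsingRecorder()
recorder.set_site(browser)
browser.fire(BrowserEvent.BEFORE_NAVIGATE, "http://example.com/a")
browser.fire(BrowserEvent.DOCUMENT_COMPLETE, "http://example.com/a")
browser.fire(BrowserEvent.QUIT)
# No sink is connected any more, so this event is not recorded.
browser.fire(BrowserEvent.DOCUMENT_COMPLETE, "http://example.com/b")
print(recorder.visited)
```

Only the page that completed loading before the quit event is recorded, which mirrors the rule that event handling repeats until the browser is closed.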
4.5 The Brief Introduction of the Prototype
4.5.1 Server Modules
The modules actually run in the background on the server. Their graphical interfaces
are used only for unit testing.
Figure 4.3 shows the HTML page analysis module. This module analyzes nested HTML
tags recursively. It keeps only the content inside the tags listed in Table 3.1,
ignores all other tags, and classifies the content by tag. The
example page in Figure 4.3 comes from: http://baike.baidu.com/view/47277.htm
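The tag filtering described above can be sketched with Python's standard html.parser module. The sketch tracks nesting with an explicit stack instead of recursion, and the tag set KEPT_TAGS is a hypothetical stand‐in for the tags of Table 3.1; the class name and the sample page are also illustrative.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical stand-in for the tag set of Table 3.1.
KEPT_TAGS = {"title", "h1", "h2", "a", "b"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TagContentCollector(HTMLParser):
    """Collect text by the innermost enclosing tag that belongs to KEPT_TAGS."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.content = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching opening tag, tolerating unclosed inner tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for tag in reversed(self.open_tags):
            if tag in KEPT_TAGS:
                self.content[tag].append(text)
                break  # text under no kept tag is simply dropped


page = (
    "<html><head><title>Color histogram</title></head><body>"
    "<h1>Image <b>retrieval</b></h1><p>ignored text</p>"
    "<a href='x.htm'>related page</a></body></html>"
)
collector = TagContentCollector()
collector.feed(page)
result = dict(collector.content)
print(result)
```

Text inside the paragraph tag is dropped because that tag is not in the assumed tag set, while the remaining text is grouped by tag for later weighting.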
Figure 4.3 HTML analysis
Figure 4.4 shows the Chinese word segmentation module. As described in
Section 4.3.1, this module utilizes the components of the ICTCLAS project. The
segmentation is based on the results of the HTML analysis.
Figure 4.4 Chinese Segmentation
The calculation of word frequency and weight is based on the segmentation results. After
segmentation, many words contribute little to describing the content of the
web page, such as meaningless auxiliary words. The module for calculating
word frequency and weight therefore uses a stop word list, according to which the useless
words are filtered out. The module is shown in Figure 4.5.
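A minimal sketch of stop word filtering followed by frequency counting is given below. The stop word list and the sample tokens are hypothetical, and the weight is assumed here to be the relative frequency among the kept words, which may differ from the weighting used in the prototype.

```python
from collections import Counter

# Hypothetical stop word list made of common Chinese auxiliary words.
STOP_WORDS = {"的", "了", "是", "在"}


def term_weights(tokens, stop_words=STOP_WORDS):
    """Return {word: (count, weight)} for the tokens that are not stop words.

    The weight is assumed to be count / number of kept tokens.
    """
    kept = [token for token in tokens if token not in stop_words]
    counts = Counter(kept)
    total = sum(counts.values())
    return {word: (n, n / total) for word, n in counts.items()}


# Tokens as a segmenter might return them (image, of, retrieval, is, image,
# analysis, of, basis).
tokens = ["图像", "的", "检索", "是", "图像", "分析", "的", "基础"]
weights = term_weights(tokens)
print(weights["图像"])
```

Five tokens survive the filter, so the word that occurs twice receives a weight of 0.4.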
Figure 4.5 Words Frequency and Weight Calculation
Figure 4.6 shows the calculation of the color histogram, and Figure 4.7 shows the
module for comparing color histograms.
Figure 4.6 Calculating the Color Histogram
Figure 4.7 Contrast Images
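The two image modules can be sketched in a few lines of Python. The sketch assumes a normalized RGB histogram with four bins per channel and compares two histograms by histogram intersection; both choices, the function names and the sample pixels are assumptions made for illustration and may differ from the measure used in the prototype.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized RGB histogram with bins_per_channel ** 3 bins.

    pixels is a sequence of (r, g, b) tuples with channel values in 0..255.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    histogram = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..bins_per_channel - 1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        histogram[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    return [count / len(pixels) for count in histogram]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical four-pixel images.
red = [(255, 0, 0)] * 4
mostly_red = [(250, 10, 5)] * 3 + [(0, 0, 255)]
blue = [(0, 0, 255)] * 4

same = histogram_intersection(color_histogram(red), color_histogram(red))
close = histogram_intersection(color_histogram(red), color_histogram(mostly_red))
far = histogram_intersection(color_histogram(red), color_histogram(blue))
print(same, close, far)
```

Coarse bins make the comparison tolerant of small color differences, at the cost of treating images with the same color distribution but different content as similar.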
4.5.2 Client Interface
The user first accesses the login page, see Figure 4.8.
Figure 4.8 Login Page
A new user can register on the register page, see Figure 4.9.
Figure 4.9 Register Page
After login, the system gives recommendations, see Figure 4.10.
Figure 4.10 Recommendation Page
When the user accesses a link in the result list, the system requires the user to give a
rating, see Figure 4.11.
Figure 4.11 Rating Page
4.6 The Experiment
In this thesis, the experiments use WebSpider to crawl web pages. The
crawler collects web pages and saves them into the database. In the
experiment, WebSpider first collected web pages for more than 72 hours and
gathered more than 500,000 pages. Then 10 volunteers were asked to use the system.
They provided their browsing history and rated the recommendation results. For
each user, the user profile was updated for at least three rounds of iteration. The result
is shown in Figure 4.12.
Figure 4.12 Experiment Result
This thesis adopts the ndpm (normalized distance‐based performance measure) metric for measuring the recommendation performance.
In order to measure ndpm, an ideal ranking must be defined. An ideal ranking of
some pages for source S is one where the user prefers every page from S to every
page not from S. It does not matter how the user ranks the pages from S relative to
one another, nor the pages not from S. The greater the preference the user
expresses for pages from S over the other pages supplied, the smaller the ndpm
distance between the user’s actual ranking and the ideal ranking for S.
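The ndpm distance can be sketched as follows. The sketch assumes the usual pairwise definition, ndpm = (2C- + Cu) / (2C), where C is the number of page pairs for which the user expresses a preference, C- is the number of those pairs that the system orders the opposite way, and Cu is the number of those pairs that the system leaves tied. The function name, the dictionary representation and the sample scores are illustrative and are not taken from the prototype.

```python
from itertools import combinations


def ndpm(user_scores, system_scores):
    """Distance in [0, 1] between a user ranking and a system ranking.

    Both arguments map a page to a number; a higher number means preferred.
    0 means the system agrees with every user preference, 1 means it
    reverses every one of them.
    """
    contradictory = 0
    tied_by_system = 0
    preferred_pairs = 0
    for a, b in combinations(sorted(user_scores), 2):
        user_diff = user_scores[a] - user_scores[b]
        if user_diff == 0:
            continue  # the user has no preference, so the pair is ignored
        preferred_pairs += 1
        system_diff = system_scores[a] - system_scores[b]
        if system_diff == 0:
            tied_by_system += 1
        elif (user_diff > 0) != (system_diff > 0):
            contradictory += 1
    if preferred_pairs == 0:
        raise ValueError("the user ranking expresses no preference")
    return (2 * contradictory + tied_by_system) / (2 * preferred_pairs)


# Hypothetical ratings for three pages.
user = {"page_a": 5, "page_b": 3, "page_c": 1}
agree = ndpm(user, {"page_a": 0.9, "page_b": 0.5, "page_c": 0.1})
reverse = ndpm(user, {"page_a": 0.1, "page_b": 0.5, "page_c": 0.9})
all_tied = ndpm(user, {"page_a": 0.5, "page_b": 0.5, "page_c": 0.5})
print(agree, reverse, all_tied)
```

A smaller ndpm value therefore indicates a better recommendation, which is how the curves of the experiment should be read.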
When little feedback was given, recommendation combined with image
analysis was more accurate. Given a large amount of feedback, the accuracy of
text‐only recommendation and of recommendation combined with image
analysis was similar. This result is reasonable: once the user has given many feedback
scores, the user profile is already accurate. However, when the user has given
little feedback, recommendation combined with image analysis can improve the
accuracy. When user information is lacking, picture information can help
to refine the user profile.
Figure 4.12 plots ndpm (vertical axis, 0 to 0.5) against the number of rated Web sites (horizontal axis, 5 to 25) for two series: recommendation with image information and recommendation using only text information.
5 Conclusions
5.1 Result
With the rapid growth of the Internet, the amount of information on it has increased
enormously, causing the problem of information
overload. Search engines may help to solve the problem, but some of their
features constrain their effectiveness. A personalized recommendation system may
be a better solution: it does not depend on user‐provided keywords but
guides users to the results they need. Traditional
web page recommendation is based on the text content of web pages. However,
web pages contain a large number of images that traditional methods do not utilize;
as a result, part of the information is wasted. This thesis designed a
recommendation system combined with content‐based image analysis. The main tasks are:
Analyzed the traditional user profile, including modeling, representation methods
and data sources. Combined with content‐based image analysis, a new user profile was
designed. The user profile contains information about the images the user is interested in,
which improves information utilization and increases recommendation accuracy.
Stated the algorithm that uses user feedback scores to iteratively update the user
profile, so that the user profile reflects changes in the user's interests over time.
Proposed a hybrid recommendation system that combines the content‐based method
and collaborative filtering.
Implemented the prototype and carried out the experiment.
5.2 Future Work
Although this study has achieved initial success, there is still a long way to go.
Much further research remains, which is briefly discussed below:
Image feature extraction focuses only on color features, because running time is a critical
metric of system performance. Color cannot represent all the features of an
image. For example, the same item can appear in a variety of colors, and images with
the same color distribution may depict completely different scenes. Although this
thesis adopts image blocks to reduce false positives, the performance
still does not reach satisfactory accuracy in the use case study. Future work should
include further analysis of image texture and shape, and refined image feature extraction
and expression.
In this thesis, the way features of web text and images are combined is relatively rough.
The correlation between images and text is determined only from the src,
title and alt attributes of the <img> tag. However, HTML files that are not written to the
standard cannot be analyzed reliably. A possible improvement is to use image content
features to retrieve related web pages directly.
The user profile lacks semantic analysis of the text vector and simply matches
key words. If key words do not match each other, they are considered to
represent different interests. Further research can increase the system's ability
to understand the semantic meaning of the text.
Although the combination of content‐based and collaborative filtering solved some
problems in recommendation systems, such as the sparsity problem and the new item
problem, other problems remain to be resolved,
such as security issues and criteria for system evaluation.
References
[1] China Internet Network Information Center, Statistical Report on Internet
Development in China, 2010,
http://www.cnnic.net.cn/uploadfiles/pdf/2010/1/15/101600.pdf date visited
“2010‐11‐05”
[2] Shuning Li. Research on the Information Overload Problem in Web Information
Environment, Information Science, 2005,Vol.23(10):1587‐1590
[3] Yang, C.C.; Chen, Hsinchun; Honga, Kay (2003). "Visualization of large category
map for Internet browsing". Decision Support Systems 35 (1): 89–
102. doi:10.1016/S0167‐9236(02)00101‐X
[4] Hailing Xu, Xiao Wu, Xiaodong Li etc. Comparison Study of Internet
Recommendation System. Journal of Software, 2009,Vol.20(2):350‐362
[5] Jianguo Liu, Tao Zhou, Binghong Wang. Advances in Personalized
Recommendation system. Progress in Natural Science, 2009, Vol.19(1):1‐15.
[6] Resnick P, Varian HR. Recommender systems. Communications of the ACM, 1997,
Vol.40(3):56−58
[7] Marko Balabanović, Yoav Shoham, Content Based, Collaborative
Recommendation, Communications of the ACM, 1997, Vol.40(3): 66‐72
[8] James Rucker, Marcos J. Polanco. Siteseer: Personalized Navigation for the Web.
Communications of the ACM, 1997, Vol.40(3):73‐75
[9] Fabio A. Asnicar, Carlo Tasso. ifWeb: a Prototype of User Model‐Based Intelligent
Agent for Document Filtering and Navigation in the World Wide Web. In Proc. of
6th International Conference on User Modelling (2‐5 June 1997)
[10] Michael Pazzani, Jack Muramatsu, Daniel Billsus. Syskill & Webert: Identifying
interesting web sites. AAAI Technical Report SS‐96‐05
[11] Yangjun Pei. Research on User Interest Profile in Personalized Service System.
Master Thesis, Chongqing University, 2005
[12] Zhiwei Guan. User oriented Intelligent Human Computer Interaction, Doctoral
Thesis. Institute of Software, Chinese Academy of Sciences, 2000
[13] Alfred Kobsa. User Modeling in Dialog Systems: Potentials and Hazards [J]. AI &
Society, 1990, Vol.4(3):214‐240
[14] Yuxiang Yan. Research on Recommendation System Based on Semantic web.
Master Thesis, Taiyuan University of Technology, 2010
[15] Xialuo. Research of The User Model in Web Mining. Master Thesis. Sichuan
Normal University, 2009
[16] Liying Xiao. Research on Personalized User Profile on Internet. Master Thesis,
Central South University, 2003
[17] Badrul Sarwar, George Karypis, Joseph Konstan, et al. Item‐Based Collaborative
Filtering Recommendation Algorithms. Proceedings of the 10th international
conference on World Wide Web, 2001
[18] Anil K. Jain, Aditya Vailaya. Image Retrieval using Color and Shape. Pattern
Recognition, 1996, Vol.29, Issue 8: 1233‐1244
[19] Weicheng Liu, Hongji Sun. Summary on Content Based Image Retrieval. 2002,
Information Science, Vol.20(4):431‐437
[20] Guanming Lu, Content Based Image and Video Retrieval. 2002, Journal of
Nanjing University of Posts and Telecommunications (Natural Science), Vol.22(2):
23‐26
[21] Shaoli Wang, Li Zhang, Jing Fu etc. A System of Query Based on the Content of
Image and Video. Computer Engineering and Applications, 2001, Vol.37(7):
113‐117
[22] Wenlong Qu, Weidong Li, Bingyu Yang. Overview of Image Mining Research.
Computer Engineering and Applications, 2004, Vol.40(5): 1‐3
[23] B. S. Manjunath, W. Y. Ma. Texture features for browsing and retrieval of image
data. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 1996,
Vol.18(8): 837‐842
[24] Mari Partio, Bogdan Cramariuc, Moncef Gabbouj, et al. Rock Texture Retrieval
Using Gray Level Co‐occurrence Matrix. 5th Nordic Signal Processing Symposium
October 4‐7, 2002
[25] George R. Cross, Anil K. Jain. Markov Random Field Texture Models. Pattern
Analysis and Machine Intelligence, IEEE Transactions on. 1983, Vol.PAMI‐5(1):
25‐39
[26] R. Chellappa, S. Chatterjee. Acoustics, Speech and Signal Processing, IEEE
Transactions on, 1985, Vol.33(4):959‐963
[27] Haitao J., Abdel S. H. Scene change detection techniques for video database
system [J]. Multimedia System, 1998(6):186‐195
[28] Patel N. V., Sethi I. K. Video Shot Detection and Characterization for Video
Database [J]. Pattern Recognition, 1997, Vol.30(4): 583‐592
[29] A Nagasaka, et al. Automatic Video Indexing and Full Video Search for Object
Appearances[C]. Second Working Conference on Visual Database Systems, IFIP
WG2.6, 1991: 119‐133.
[30] Zhang H. J, et al. Video Parsing, Retrieval and Browsing: An Integrated and
Content‐based Solution [A]. Proceed in ACM Multimedia’95 [C]. 1995:15‐24
[31] Song M. H., Kwon, T. H. On Detection of Gradual Scene Changes for Parsing of
Video Data [J]. SPIE, 1997, Vol.33(12): 404‐409
[32] Zhang H. J., Wu Jianhua, et al. An Integrated System for Content‐Based Video
Retrieval and Browsing[J]. Pattern Recognition. 1997, Vol.30(4): 643‐657
[33] Michael Pazzani, Syskill and Webert Web Page Ratings,
http://kdd.ics.uci.edu/databases/SyskillWebert/SyskillWebert.data.html, date
visited “2010‐11‐10”
[34] Nicholas Kushmerick, Internet Advertisements Data Set,
http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements, date visited
“2010‐11‐10”
[35] The 4 Universities Data Set,
http://www‐2.cs.cmu.edu/afs/cs.cmu.edu/project/theo‐20/www/data/, date
visited “2010‐11‐10”
[36] M. Stricker, M. Orengo. Similarity of color images. SPIE Storage and Retrieval for
Image and Video Databases III. 1995, Vol. 2185:381‐392
[37] Chinese Segmentation. http://baike.baidu.com/view/19109.htm, date visited
“2010‐12‐03”
[38] ICTCLAS. http://ictclas.org/index.html, date visited “2010‐12‐03”
[39] Changxiong Chen. Compounds Phrase Analysis and Application in Information
Retrieval. Master Thesis. Shanghai JiaoTong University, 2008
[40] Zhi Cai. Research on Intelligent Information Capturing on World Wide Web.
Master Thesis., University of Science and Technology of China, 2002
[41] Yao, Y. Y. Measuring retrieval effectiveness based on user preference of
documents. J. Amer. Soc. Info. Sci. 1995, Vol.46( 2): 133–145