Upload
teresa-sebastian
View
216
Download
0
Embed Size (px)
Citation preview
8/3/2019 Web Miningppt
1/29
1
Web Mining Taxonomy
8/3/2019 Web Miningppt
2/29
Why Web Usage Mining?
Explosive growth of E-commerce
Provides an cost-efficient way doing business
Amazon.com: online Wal-Mart
Hidden Useful information
Visitors profiles can be discovered
Measuring online marketing efforts, launching
marketing campaigns, etc.
8/3/2019 Web Miningppt
3/29
How to perform Web Usage Mining
Obtain web traffic data from
Web server log files
Corporate relational databases
Registration forms
Apply data mining techniques and other Webmining techniques
Two categories:
Pattern Discovery Tools
Pattern Analysis Tools
8/3/2019 Web Miningppt
4/29
Pattern Analysis Tools
Answer Questions like:
How are people using this site?
which Pages are being accessed most
frequently?
This requires the analysis of the structure of
hyperlinks and the contents of the pages
8/3/2019 Web Miningppt
5/29
Pattern Analysis Tools
O/P of Analysis
The frequency of visits
per document
Most recent visit per
document
Frequency of use of
each hyperlink Most recent use of each
hyperlink
Techniques:
Visualization techniques
OLAP techniques
Data & Knowledge
Querying Usability analysis
8/3/2019 Web Miningppt
6/29
Pattern Discovery Tools
Data Pre-processing
Filtering/clean Web log files
eliminate outliers and irrelevant items
Integration of Web Usage data from:
Web Server Logs
Referral logs
Registration file Corporate Database
8/3/2019 Web Miningppt
7/29
Pattern Discovery Techniques
Converting IP addresses to Domain Names
Domain Name System does the conversion
Discover information from visitors domain names:
Ex: .ca(Canada), .cn(China), etc
Converting URLs to Page Titles
Page Title: between and
8/3/2019 Web Miningppt
8/29
Pattern Discovery Techniques
Path Analysis
Uses Graph Model
Provide insights to navigational problems
Example of info. Discovered by Path analysis:
78% company-> whats new->sample-> order
60% left sites after 4 or less page references
=> most important info must be within the first 4
pages of site entry points.
8/3/2019 Web Miningppt
9/29
Filter
Pattern Discovery Techniques
Grouping
Groups similar info. to help draw higher-level
conclusions
Ex: all URLs containing the word Yahoo
Filtering
Allows to answer specific questions like:
how many visitors to the site in this week?
8/3/2019 Web Miningppt
10/29
Pattern Discovery Techniques
Dynamic Site Analysis
Dynamic html links to the database, and
requires parameters appended to URLs
http://search.netscape.com/cgi-
in/search?search=Federal+Tax+Return+Form&c
p=ntserch
Knowledge: What the visitors looked for
What keywords S/B purchased from Search engineer
http://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserch8/3/2019 Web Miningppt
11/29
Pattern Discovery Techniques
Cookies
Randomly assigned ID by web server to browser
Cookies are beneficial to both web sitedevelopers and visitors
Cookie field entry in log file can be used by Web
traffic analysis software to track repeat visitors loyal customers.
8/3/2019 Web Miningppt
12/29
Pattern Discovery Techniques
Association Rules
help find spending patterns on related products
30% who accessed/company/products/bread.html,
also accessed /company/products/milk.htm.
Sequential Patterns
help find inter-transaction patterns
50% who bought items in /pcworld/computers/, alsobought in /pcworld/accessories/ within 15 days
8/3/2019 Web Miningppt
13/29
Pattern Discovery Techniques
Clustering
Identifies visitors with common characteristics
based on visitors profiles
50% who applied discover platinum card in
/discovercard/customerService/newcard, were in
the 25-35 age group, with annual income between
$40,000 50,000.
8/3/2019 Web Miningppt
14/29
Pattern Discovery Techniques
Decision Trees
a flow chart of questions leading to a decision
Ex: car buying decision tree
What Brand?
What Year?
What Type?
2000 Model Honda
Accord EX
8/3/2019 Web Miningppt
15/29
Week 1: Data Mining II 15
Web Content Mining
Extends work of basic search engines
Search Engines
IR application
Keyword based
Similarity between query and document
Crawlers
Indexing Profiles
Link analysis
8/3/2019 Web Miningppt
16/29
Week 1: Data Mining II 16
Crawlers Robot (spider) traverses the hypertext structure in
the Web. Collect information from visited pages
Used to construct indexes for search engines
Traditional Crawler visits entire Web (?) andreplaces index
Periodic Crawler visits portions of the Web andupdates subset of index
Incremental Crawler selectively searches the Weband incrementally modifies index
Focused Crawler visits pages related to a particularsubject
8/3/2019 Web Miningppt
17/29
Week 1: Data Mining II 17
Focused Crawler
Only visit links from a page if that page isdetermined to be relevant.
Classifier is static after learning phase.
Components: Classifier which assigns relevance score to each
page based on crawl topic.
Distiller to identify hub pages. Crawler visits pages to based on crawler and
distiller scores.
8/3/2019 Web Miningppt
18/29
Week 1: Data Mining II 18
Focused Crawler
Classifier to related documents to topics
Classifier also determines how useful outgoing
links are
Hub Pages contain links to many relevant
pages. Must be visited even if not high
relevance score.
8/3/2019 Web Miningppt
19/29
Week 1: Data Mining II 19
Focused Crawler
8/3/2019 Web Miningppt
20/29
Week 1: Data Mining II 20
Context Focused Crawler
Context Graph:
Context graph created for each seed document .
Root is the seed document.
Nodes at each level show documents with links todocuments at next higher level.
Updated during crawl itself .
Approach:
1. Construct context graph and classifiers using seeddocuments as training data.
2. Perform crawling using classifiers and context graphcreated.
8/3/2019 Web Miningppt
21/29
Week 1: Data Mining II 21
Context Graph
8/3/2019 Web Miningppt
22/29
Week 1: Data Mining II 22
Virtual Web View Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller)and centralized than the one beneath it.
Upper layers of MLDB are structured and can be accessed withSQL type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in firstlayer of MLDB.
Higher levels contain more summarized data obtained throughgeneralizations of the lower levels.
8/3/2019 Web Miningppt
23/29
Week 1: Data Mining II 23
Personalization
Web access or contents tuned to better fit the
desires of each user.
Manual techniques identify users preferences
based on profiles or demographics. Collaborative filtering identifies preferences based
on ratings from similar users.
Content based filtering retrieves pages based on
similarity between pages and user profiles.
8/3/2019 Web Miningppt
24/29
Week 1: Data Mining II 24
Web Structure Mining
Mine structure (links, graph) of the Web
Techniques
PageRank
CLEVER
Create a model of the Web organization.
May be combined with content mining tomore effectively retrieve important pages.
8/3/2019 Web Miningppt
25/29
Week 1: Data Mining II 25
PageRank
Used by Google Prioritize pages returned from search by
looking at Web structure.
Importance of page is calculated based onnumber of pages which point to itBacklinks.
Weighting is used to provide more
importance to backlinks coming formimportant pages.
8/3/2019 Web Miningppt
26/29
Week 1: Data Mining II 26
PageRank (contd)
PR(p) = c (PR(1)/N1+ + PR(n)/Nn)
PR(i): PageRank for a page i which points to target
page p.
Ni: number of links coming out of page i
8/3/2019 Web Miningppt
27/29
Week 1: Data Mining II 27
CLEVER
Identify authoritative and hub pages.
Authoritative Pages :
Highly important pages.
Best source for requested information.
Hub Pages :
Contain links to highly important pages.
8/3/2019 Web Miningppt
28/29
Week 1: Data Mining II 28
HITS
Hyperlink-Induces Topic Search
Based on a set of keywords, find set of
relevant pages R.
Identify hub and authority pages for these.
Expand R to a base set, B, of pages linked to or
from R.
Calculate weights for authorities and hubs.
Pages with highest ranks in R are returned.
8/3/2019 Web Miningppt
29/29
Week 1: Data Mining II 29
HITS Algorithm