Web Miningppt

8/3/2019 Web Miningppt

1/29

1

Web Mining Taxonomy


2/29

Why Web Usage Mining?

Explosive growth of E-commerce

Provides an cost-efficient way doing business

Amazon.com: online Wal-Mart

Hidden Useful information

Visitors profiles can be discovered

Measuring online marketing efforts, launching

marketing campaigns, etc.


3/29

How to perform Web Usage Mining

Obtain web traffic data from

Web server log files

Corporate relational databases

Registration forms

Apply data mining techniques and other Webmining techniques

Two categories:

Pattern Discovery Tools

Pattern Analysis Tools


4/29


Answer Questions like:

How are people using this site?

which Pages are being accessed most

frequently?

This requires the analysis of the structure of

hyperlinks and the contents of the pages


5/29


O/P of Analysis

The frequency of visits

per document

Most recent visit per

document

Frequency of use of

each hyperlink Most recent use of each

hyperlink

Techniques:

Visualization techniques

OLAP techniques

Data & Knowledge

Querying Usability analysis


6/29

Pattern Discovery Tools

Data Pre-processing

Filtering/clean Web log files

eliminate outliers and irrelevant items

Integration of Web Usage data from:

Web Server Logs

Referral logs

Registration file Corporate Database


7/29

Pattern Discovery Techniques

Converting IP addresses to Domain Names

Domain Name System does the conversion

Discover information from visitors domain names:

Ex: .ca(Canada), .cn(China), etc

Converting URLs to Page Titles

Page Title: between and


8/29


Path Analysis

Uses Graph Model

Provide insights to navigational problems

Example of info. Discovered by Path analysis:

78% company-> whats new->sample-> order

60% left sites after 4 or less page references

=> most important info must be within the first 4

pages of site entry points.


9/29

Filter


Grouping

Groups similar info. to help draw higher-level

conclusions

Ex: all URLs containing the word Yahoo

Filtering

Allows to answer specific questions like:

how many visitors to the site in this week?


10/29


Dynamic Site Analysis

Dynamic html links to the database, and

requires parameters appended to URLs

http://search.netscape.com/cgi-

in/search?search=Federal+Tax+Return+Form&c

p=ntserch

Knowledge: What the visitors looked for

What keywords S/B purchased from Search engineer
http://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserch


11/29


Cookies

Randomly assigned ID by web server to browser

Cookies are beneficial to both web sitedevelopers and visitors

Cookie field entry in log file can be used by Web

traffic analysis software to track repeat visitors loyal customers.


12/29


Association Rules

help find spending patterns on related products

30% who accessed/company/products/bread.html,

also accessed /company/products/milk.htm.

Sequential Patterns

help find inter-transaction patterns

50% who bought items in /pcworld/computers/, alsobought in /pcworld/accessories/ within 15 days


13/29


Clustering

Identifies visitors with common characteristics

based on visitors profiles

50% who applied discover platinum card in

/discovercard/customerService/newcard, were in

the 25-35 age group, with annual income between

$40,000 50,000.


14/29


Decision Trees

a flow chart of questions leading to a decision

Ex: car buying decision tree

What Brand?

What Year?

What Type?

2000 Model Honda

Accord EX


15/29

Week 1: Data Mining II 15

Web Content Mining

Extends work of basic search engines

Search Engines

IR application

Keyword based

Similarity between query and document

Crawlers

Indexing Profiles

Link analysis


16/29


Crawlers Robot (spider) traverses the hypertext structure in

the Web. Collect information from visited pages

Used to construct indexes for search engines

Traditional Crawler visits entire Web (?) andreplaces index

Periodic Crawler visits portions of the Web andupdates subset of index

Incremental Crawler selectively searches the Weband incrementally modifies index

Focused Crawler visits pages related to a particularsubject


17/29


Focused Crawler

Only visit links from a page if that page isdetermined to be relevant.

Classifier is static after learning phase.

Components: Classifier which assigns relevance score to each

page based on crawl topic.

Distiller to identify hub pages. Crawler visits pages to based on crawler and

distiller scores.


18/29


Focused Crawler

Classifier to related documents to topics

Classifier also determines how useful outgoing

links are

Hub Pages contain links to many relevant

pages. Must be visited even if not high

relevance score.


19/29


Focused Crawler


20/29


Context Focused Crawler

Context Graph:

Context graph created for each seed document .

Root is the seed document.

Nodes at each level show documents with links todocuments at next higher level.

Updated during crawl itself .

Approach:

1. Construct context graph and classifiers using seeddocuments as training data.

2. Perform crawling using classifiers and context graphcreated.


21/29


Context Graph


22/29


Virtual Web View Multiple Layered DataBase (MLDB) built on top of the Web.

Each layer of the database is more generalized (and smaller)and centralized than the one beneath it.

Upper layers of MLDB are structured and can be accessed withSQL type queries.

Translation tools convert Web documents to XML.

Extraction tools extract desired information to place in firstlayer of MLDB.

Higher levels contain more summarized data obtained throughgeneralizations of the lower levels.


23/29


Personalization

Web access or contents tuned to better fit the

desires of each user.

Manual techniques identify users preferences

based on profiles or demographics. Collaborative filtering identifies preferences based

on ratings from similar users.

Content based filtering retrieves pages based on

similarity between pages and user profiles.


24/29


Web Structure Mining

Mine structure (links, graph) of the Web

Techniques

PageRank

CLEVER

Create a model of the Web organization.

May be combined with content mining tomore effectively retrieve important pages.


25/29


PageRank

Used by Google Prioritize pages returned from search by

looking at Web structure.

Importance of page is calculated based onnumber of pages which point to itBacklinks.

Weighting is used to provide more

importance to backlinks coming formimportant pages.


26/29


PageRank (contd)

PR(p) = c (PR(1)/N1+ + PR(n)/Nn)

PR(i): PageRank for a page i which points to target

page p.

Ni: number of links coming out of page i


27/29


CLEVER

Identify authoritative and hub pages.

Authoritative Pages :

Highly important pages.

Best source for requested information.

Hub Pages :

Contain links to highly important pages.


28/29


HITS

Hyperlink-Induces Topic Search

Based on a set of keywords, find set of

relevant pages R.

Identify hub and authority pages for these.

Expand R to a base set, B, of pages linked to or

from R.

Calculate weights for authorities and hubs.

Pages with highest ranks in R are returned.


29/29


HITS Algorithm

Documents

Web Miningppt