Web Miningppt

Embed Size (px)

Citation preview

  • 8/3/2019 Web Miningppt

    1/29

    1

    Web Mining Taxonomy

  • 8/3/2019 Web Miningppt

    2/29

    Why Web Usage Mining?

    Explosive growth of E-commerce

    Provides an cost-efficient way doing business

    Amazon.com: online Wal-Mart

    Hidden Useful information

    Visitors profiles can be discovered

    Measuring online marketing efforts, launching

    marketing campaigns, etc.

  • 8/3/2019 Web Miningppt

    3/29

    How to perform Web Usage Mining

    Obtain web traffic data from

    Web server log files

    Corporate relational databases

    Registration forms

    Apply data mining techniques and other Webmining techniques

    Two categories:

    Pattern Discovery Tools

    Pattern Analysis Tools

  • 8/3/2019 Web Miningppt

    4/29

    Pattern Analysis Tools

    Answer Questions like:

    How are people using this site?

    which Pages are being accessed most

    frequently?

    This requires the analysis of the structure of

    hyperlinks and the contents of the pages

  • 8/3/2019 Web Miningppt

    5/29

    Pattern Analysis Tools

    O/P of Analysis

    The frequency of visits

    per document

    Most recent visit per

    document

    Frequency of use of

    each hyperlink Most recent use of each

    hyperlink

    Techniques:

    Visualization techniques

    OLAP techniques

    Data & Knowledge

    Querying Usability analysis

  • 8/3/2019 Web Miningppt

    6/29

    Pattern Discovery Tools

    Data Pre-processing

    Filtering/clean Web log files

    eliminate outliers and irrelevant items

    Integration of Web Usage data from:

    Web Server Logs

    Referral logs

    Registration file Corporate Database

  • 8/3/2019 Web Miningppt

    7/29

    Pattern Discovery Techniques

    Converting IP addresses to Domain Names

    Domain Name System does the conversion

    Discover information from visitors domain names:

    Ex: .ca(Canada), .cn(China), etc

    Converting URLs to Page Titles

    Page Title: between and

  • 8/3/2019 Web Miningppt

    8/29

    Pattern Discovery Techniques

    Path Analysis

    Uses Graph Model

    Provide insights to navigational problems

    Example of info. Discovered by Path analysis:

    78% company-> whats new->sample-> order

    60% left sites after 4 or less page references

    => most important info must be within the first 4

    pages of site entry points.

  • 8/3/2019 Web Miningppt

    9/29

    Filter

    Pattern Discovery Techniques

    Grouping

    Groups similar info. to help draw higher-level

    conclusions

    Ex: all URLs containing the word Yahoo

    Filtering

    Allows to answer specific questions like:

    how many visitors to the site in this week?

  • 8/3/2019 Web Miningppt

    10/29

    Pattern Discovery Techniques

    Dynamic Site Analysis

    Dynamic html links to the database, and

    requires parameters appended to URLs

    http://search.netscape.com/cgi-

    in/search?search=Federal+Tax+Return+Form&c

    p=ntserch

    Knowledge: What the visitors looked for

    What keywords S/B purchased from Search engineer

    http://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserchhttp://search.netscape.com/cgi-in/search?search=Federal+Tax+Return+Form&cp=ntserch
  • 8/3/2019 Web Miningppt

    11/29

    Pattern Discovery Techniques

    Cookies

    Randomly assigned ID by web server to browser

    Cookies are beneficial to both web sitedevelopers and visitors

    Cookie field entry in log file can be used by Web

    traffic analysis software to track repeat visitors loyal customers.

  • 8/3/2019 Web Miningppt

    12/29

    Pattern Discovery Techniques

    Association Rules

    help find spending patterns on related products

    30% who accessed/company/products/bread.html,

    also accessed /company/products/milk.htm.

    Sequential Patterns

    help find inter-transaction patterns

    50% who bought items in /pcworld/computers/, alsobought in /pcworld/accessories/ within 15 days

  • 8/3/2019 Web Miningppt

    13/29

    Pattern Discovery Techniques

    Clustering

    Identifies visitors with common characteristics

    based on visitors profiles

    50% who applied discover platinum card in

    /discovercard/customerService/newcard, were in

    the 25-35 age group, with annual income between

    $40,000 50,000.

  • 8/3/2019 Web Miningppt

    14/29

    Pattern Discovery Techniques

    Decision Trees

    a flow chart of questions leading to a decision

    Ex: car buying decision tree

    What Brand?

    What Year?

    What Type?

    2000 Model Honda

    Accord EX

  • 8/3/2019 Web Miningppt

    15/29

    Week 1: Data Mining II 15

    Web Content Mining

    Extends work of basic search engines

    Search Engines

    IR application

    Keyword based

    Similarity between query and document

    Crawlers

    Indexing Profiles

    Link analysis

  • 8/3/2019 Web Miningppt

    16/29

    Week 1: Data Mining II 16

    Crawlers Robot (spider) traverses the hypertext structure in

    the Web. Collect information from visited pages

    Used to construct indexes for search engines

    Traditional Crawler visits entire Web (?) andreplaces index

    Periodic Crawler visits portions of the Web andupdates subset of index

    Incremental Crawler selectively searches the Weband incrementally modifies index

    Focused Crawler visits pages related to a particularsubject

  • 8/3/2019 Web Miningppt

    17/29

    Week 1: Data Mining II 17

    Focused Crawler

    Only visit links from a page if that page isdetermined to be relevant.

    Classifier is static after learning phase.

    Components: Classifier which assigns relevance score to each

    page based on crawl topic.

    Distiller to identify hub pages. Crawler visits pages to based on crawler and

    distiller scores.

  • 8/3/2019 Web Miningppt

    18/29

    Week 1: Data Mining II 18

    Focused Crawler

    Classifier to related documents to topics

    Classifier also determines how useful outgoing

    links are

    Hub Pages contain links to many relevant

    pages. Must be visited even if not high

    relevance score.

  • 8/3/2019 Web Miningppt

    19/29

    Week 1: Data Mining II 19

    Focused Crawler

  • 8/3/2019 Web Miningppt

    20/29

    Week 1: Data Mining II 20

    Context Focused Crawler

    Context Graph:

    Context graph created for each seed document .

    Root is the seed document.

    Nodes at each level show documents with links todocuments at next higher level.

    Updated during crawl itself .

    Approach:

    1. Construct context graph and classifiers using seeddocuments as training data.

    2. Perform crawling using classifiers and context graphcreated.

  • 8/3/2019 Web Miningppt

    21/29

    Week 1: Data Mining II 21

    Context Graph

  • 8/3/2019 Web Miningppt

    22/29

    Week 1: Data Mining II 22

    Virtual Web View Multiple Layered DataBase (MLDB) built on top of the Web.

    Each layer of the database is more generalized (and smaller)and centralized than the one beneath it.

    Upper layers of MLDB are structured and can be accessed withSQL type queries.

    Translation tools convert Web documents to XML.

    Extraction tools extract desired information to place in firstlayer of MLDB.

    Higher levels contain more summarized data obtained throughgeneralizations of the lower levels.

  • 8/3/2019 Web Miningppt

    23/29

    Week 1: Data Mining II 23

    Personalization

    Web access or contents tuned to better fit the

    desires of each user.

    Manual techniques identify users preferences

    based on profiles or demographics. Collaborative filtering identifies preferences based

    on ratings from similar users.

    Content based filtering retrieves pages based on

    similarity between pages and user profiles.

  • 8/3/2019 Web Miningppt

    24/29

    Week 1: Data Mining II 24

    Web Structure Mining

    Mine structure (links, graph) of the Web

    Techniques

    PageRank

    CLEVER

    Create a model of the Web organization.

    May be combined with content mining tomore effectively retrieve important pages.

  • 8/3/2019 Web Miningppt

    25/29

    Week 1: Data Mining II 25

    PageRank

    Used by Google Prioritize pages returned from search by

    looking at Web structure.

    Importance of page is calculated based onnumber of pages which point to itBacklinks.

    Weighting is used to provide more

    importance to backlinks coming formimportant pages.

  • 8/3/2019 Web Miningppt

    26/29

    Week 1: Data Mining II 26

    PageRank (contd)

    PR(p) = c (PR(1)/N1+ + PR(n)/Nn)

    PR(i): PageRank for a page i which points to target

    page p.

    Ni: number of links coming out of page i

  • 8/3/2019 Web Miningppt

    27/29

    Week 1: Data Mining II 27

    CLEVER

    Identify authoritative and hub pages.

    Authoritative Pages :

    Highly important pages.

    Best source for requested information.

    Hub Pages :

    Contain links to highly important pages.

  • 8/3/2019 Web Miningppt

    28/29

    Week 1: Data Mining II 28

    HITS

    Hyperlink-Induces Topic Search

    Based on a set of keywords, find set of

    relevant pages R.

    Identify hub and authority pages for these.

    Expand R to a base set, B, of pages linked to or

    from R.

    Calculate weights for authorities and hubs.

    Pages with highest ranks in R are returned.

  • 8/3/2019 Web Miningppt

    29/29

    Week 1: Data Mining II 29

    HITS Algorithm