Mining High-Speed Data Streams Presented by: Tyler J. Sawyer UVM Spring 2014 - CS 332 Data Mining Pedro Domingos Geoff Hulten Sixth ACM SIGKDD International Conference - 2000


Page 1: Mining High-Speed Data Streams

Mining High-Speed Data Streams

Presented by: Tyler J. Sawyer
UVM Spring 2014 - CS 332 Data Mining

Pedro Domingos
Geoff Hulten

Sixth ACM SIGKDD International Conference - 2000

Page 2: Mining High-Speed Data Streams

Outline

→ Introduction
→ Hoeffding Trees
→ The VFDT System
→ Performance Study
→ Conclusion / Summary
→ Review Questions

Page 4: Mining High-Speed Data Streams

Introduction

• In today’s society, the ability to extract and interpret knowledge and data quickly and efficiently is an increasingly important task.
• Many organizations today have expandable databases that grow at a rate of several million records per day.
• Mining these databases yields the following:
  o Unique opportunities for data analysis
  o Complex challenges to overcome

Page 5: Mining High-Speed Data Streams

Introduction - Cont.

• Knowledge discovery systems are limited by the following:
  o Time
  o Memory
  o Sample size
• Traditional systems:
  o The amount of available data is small
  o Systems use a fraction of their computational power to avoid overfitting
• Current systems:
  o The bottleneck is time and memory
  o Most of the available data goes unused, so underfitting becomes the issue.

Page 6: Mining High-Speed Data Streams

Introduction - Cont.

• Today’s algorithms:
  o Efficient, but cannot handle supermassive databases.
  o Current data mining systems are not equipped to handle the exponential growth of data.
  o New examples arrive at a higher rate than they can be mined
  o → Data corruption!

Page 7: Mining High-Speed Data Streams

Introduction - Cont.

• Requirements for ‘Modern’ Algorithms:
  o Operate continuously and indefinitely
  o Incorporate new examples as they become available
  o Never lose potentially valuable information
  o Build a model using at most one scan of a database or dataset
  o Use only a fixed amount of main memory
  o Require small, constant time per record
  o Make a usable model available at any point during the algorithm’s runtime

Page 8: Mining High-Speed Data Streams

Introduction - Cont.

• What can fulfill these requirements?
  o Incremental learning methods, also known as:
      Online methods
      Successive methods
      Sequential methods
• While these methods are efficient, they are not always accurate.
• These methods rarely recover from a set of unfavorable early examples.

Page 10: Mining High-Speed Data Streams

Hoeffding Trees

• Classic decision tree learners
  o Examples: ID3, C4.5, CART
  o Assume all examples can be stored simultaneously in main memory, which limits the number of examples they can learn from.
• Disk-based decision tree learners
  o Examples: SLIQ, SPRINT
  o Assume examples are stored on disk.
  o Large datasets easily fill the disk, and these learners fail when the dataset is too large to fit.

Page 11: Mining High-Speed Data Streams

Hoeffding Trees - Cont.

• A typical classification problem
  o Given: N training examples of the form (x, y)
  o y = discrete class label
  o x = vector of d attributes
  o Goal: produce a model, y = f(x), that predicts the classes y of future examples x with high accuracy.

Page 12: Mining High-Speed Data Streams

Hoeffding Trees - Cont.

• Challenge: design a decision tree learner for extremely large (potentially infinite) datasets with high accuracy and low computational cost.
• Given a stream of examples:
  o The first ones are used to choose the root test
  o Succeeding ones are passed down to the corresponding leaves
  o The best attribute is picked at each leaf
  o The process continues recursively

Page 13: Mining High-Speed Data Streams

Hoeffding Trees - Cont.

• But how do we decide how many examples are necessary at each node?
  o Use a statistical result!
  o The Hoeffding bound (also known as the additive Chernoff bound)

Page 14: Mining High-Speed Data Streams

Hoeffding Trees - Cont.

• Hoeffding Bound:
  o G : heuristic measure used to choose test attributes
      C4.5 ⇒ information gain
      CART ⇒ Gini index
      Assume G(.) is to be maximized
  o Ḡ : value of the heuristic measure observed after seeing n examples
  o Xa : attribute with the highest observed Ḡ
  o Xb : second-best attribute
  o ΔG : difference between the observed values for Xa and Xb, ΔG = Ḡ(Xa) - Ḡ(Xb) > 0
  o δ : probability of choosing the wrong attribute

Page 15: Mining High-Speed Data Streams

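Hoeffding Trees - Cont.

• The Hoeffding Bound: with probability 1 - δ, the true mean of a random variable r is at least its observed mean minus ϵ, where $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
  o R = range of the real-valued random variable r
  o n = number of independent observations of this variable
• After n examples, if ΔG > ϵ, then Xa is the best attribute with probability 1 - δ.
• A node needs to accumulate examples from the stream until ϵ becomes smaller than ΔG.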
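The bound can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `hoeffding_bound` and `best_attribute_is_reliable` are hypothetical, and the default range R = 1 assumes information gain on a two-class problem.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean is
    within epsilon of the mean observed over n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def best_attribute_is_reliable(g_best: float, g_second: float,
                               delta: float, n: int,
                               value_range: float = 1.0) -> bool:
    """True when the observed gap between the two best attributes exceeds epsilon."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Epsilon shrinks as more examples accumulate at the node.
    for n in (200, 2000, 20000):
        print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
```

Because ϵ depends only on R, δ, and n, a node can evaluate the test without storing the examples themselves.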

Page 16: Mining High-Speed Data Streams

Hoeffding Tree Algorithm

• Inputs:
  o S : sequence of examples
  o X : set of discrete attributes
  o G(.) : split evaluation function
  o δ : desired probability of choosing the wrong attribute at any given node
• Output:
  o HT : a decision tree (Hoeffding Tree)

Page 17: Mining High-Speed Data Streams

Hoeffding Tree Algorithm - Cont.

Page 18: Mining High-Speed Data Streams

Hoeffding Tree Algorithm - Cont.

Page 19: Mining High-Speed Data Streams

Hoeffding Tree Algorithm - Cont.
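The algorithm's content on these slides did not survive extraction, so the following is only a sketch of how a Hoeffding tree learner could look given the inputs and output listed on Slide 16. It assumes discrete attributes, information gain as G, and examples supplied as dictionaries; the class and method names (`Node`, `HoeffdingTree`, `learn_one`, `predict`) are illustrative and not taken from the paper.

```python
import math
from collections import defaultdict


def entropy(label_counts):
    """Shannon entropy (in bits) of a {label: count} mapping."""
    total = sum(label_counts.values())
    return -sum(c / total * math.log2(c / total)
                for c in label_counts.values() if c > 0)


class Node:
    def __init__(self, attributes):
        self.attributes = set(attributes)    # attributes not yet used on this path
        self.split_attr = None               # stays None while the node is a leaf
        self.children = {}                   # attribute value -> child Node
        self.class_counts = defaultdict(int)
        # counts[attr][value][label]: sufficient statistics kept at a leaf
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def info_gain(self, attr):
        total = sum(self.class_counts.values())
        remainder = sum(sum(by_label.values()) / total * entropy(by_label)
                        for by_label in self.counts[attr].values())
        return entropy(self.class_counts) - remainder


class HoeffdingTree:
    def __init__(self, attributes, n_classes, delta):
        self.root = Node(attributes)
        self.delta = delta
        self.value_range = math.log2(n_classes)  # range R of information gain

    def _sort_to_leaf(self, x):
        node = self.root
        while node.split_attr is not None:
            value = x[node.split_attr]
            if value not in node.children:       # create leaves lazily
                node.children[value] = Node(node.attributes - {node.split_attr})
            node = node.children[value]
        return node

    def learn_one(self, x, y):
        leaf = self._sort_to_leaf(x)
        leaf.class_counts[y] += 1
        for attr in leaf.attributes:
            leaf.counts[attr][x[attr]][y] += 1
        if len(leaf.class_counts) < 2 or not leaf.attributes:
            return  # only one class seen so far, or nothing left to split on
        n = sum(leaf.class_counts.values())
        # Candidate gains, plus a "do not split" option with gain 0 (attr None).
        gains = [(leaf.info_gain(a), a) for a in leaf.attributes] + [(0.0, None)]
        gains.sort(key=lambda pair: pair[0], reverse=True)
        (best_gain, best_attr), (second_gain, _) = gains[0], gains[1]
        epsilon = math.sqrt(self.value_range ** 2 * math.log(1 / self.delta) / (2 * n))
        if best_attr is not None and best_gain - second_gain > epsilon:
            leaf.split_attr = best_attr
            leaf.counts.clear()  # the leaf's statistics are no longer needed

    def predict(self, x):
        node, label = self.root, None
        while node is not None:
            if node.class_counts:
                label = max(node.class_counts, key=node.class_counts.get)
            if node.split_attr is None:
                break
            node = node.children.get(x[node.split_attr])
        return label
```

Each example updates counts at exactly one leaf and is then discarded, which is what keeps memory use independent of the number of examples seen. The sketch recomputes G on every example for simplicity.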

Page 20: Mining High-Speed Data Streams

Hoeffding Trees - Cont.

• The Hoeffding Tree algorithm guarantees that, under realistic assumptions, the trees it generates will be similar to those produced by batch learners.
  o p : leaf probability, the probability that an example reaching a given level of the tree falls into a leaf at that level (assumed constant)
  o HTδ : tree produced by the HT algorithm with desired δ, given an infinite sequence of examples S
  o DT* : decision tree produced by choosing at each node the attribute with the true best G
  o Δi : intensional disagreement between two decision trees
      P(x) : probability that the attribute vector x will be observed
      I(.) : indicator function (1 : True, 0 : False)
      ⇒ $\Delta_i(DT_1, DT_2) = \sum_x P(x)\, I[\mathrm{Path}_1(x) \neq \mathrm{Path}_2(x)]$
• Theorem 1: $E[\Delta_i(HT_\delta, DT_*)] \le \delta / p$
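As a rough numeric illustration of Theorem 1, take δ = 10^-7 (the value used later in the performance study) and assume a leaf probability of p = 0.01. The value of p is hypothetical and chosen only to make the arithmetic concrete.

```latex
E[\Delta_i(HT_\delta, DT_*)] \le \frac{\delta}{p} = \frac{10^{-7}}{10^{-2}} = 10^{-5}
```

Under that assumption, the expected probability that an example follows a different path in the two trees is at most one in 100,000.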

Page 21: Mining High-Speed Data Streams

Hoeffding Trees - Cont.

• Suppose the G values of Xa and Xb differ by roughly 10%.
• According to the Hoeffding bound:
  o δ = 0.1% requires only 380 examples
  o δ = 0.0001% requires only 345 more examples
• An exponential improvement in δ can be obtained with a linear increase in the number of examples.
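The figures on this slide can be checked by solving the bound for n. The sketch below assumes R = 1 and ϵ = 0.1; the function name and the `two_sided` switch are illustrative. With ln(1/δ) the bound gives about 345 examples for δ = 0.1%, while the slide's figure of 380 is consistent with a two-sided form that uses ln(2/δ). That reading is an assumption; in both forms the step to δ = 0.0001% costs about 345 more examples.

```python
import math


def examples_needed(epsilon: float, delta: float,
                    value_range: float = 1.0, two_sided: bool = False) -> float:
    """Number of observations n at which the Hoeffding epsilon equals `epsilon`."""
    tail = (2.0 if two_sided else 1.0) / delta
    return value_range ** 2 * math.log(tail) / (2.0 * epsilon ** 2)


if __name__ == "__main__":
    for two_sided in (False, True):
        n_first = examples_needed(0.1, 1e-3, two_sided=two_sided)
        n_second = examples_needed(0.1, 1e-6, two_sided=two_sided)
        print(two_sided, round(n_first), round(n_second), round(n_second - n_first))
```

Shrinking δ by a factor of 1000 adds a constant number of examples rather than multiplying it, which is the exponential-for-linear trade the slide describes.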

Page 23: Mining High-Speed Data Streams

The VFDT System

• Very Fast Decision Tree learner (VFDT)
• A decision tree learning system
• Based on the Hoeffding Tree algorithm
• VFDT allows the use of either information gain or the Gini index as the attribute evaluation measure.

Page 24: Mining High-Speed Data Streams

The VFDT System - Cont.

• Includes a number of refinements to the Hoeffding Tree algorithm:
  o Ties
  o G-Computation
  o Memory
  o Poor Attributes
  o Initialization
  o Rescans

Page 25: Mining High-Speed Data Streams

The VFDT System - Ties

• Two or more attributes may have similar G values.
• A large number of examples may be required to decide between them with high confidence.
• In this case, the chosen attribute makes little difference.
• In VFDT, the user specifies a threshold, τ.
• Thus, if ΔG < ϵ < τ, split on the current best attribute.
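A minimal sketch of the tie rule follows, assuming the Hoeffding bound with R = 1. The function names are hypothetical. It also shows how many examples a leaf waits before a tie is broken at the settings δ = 10^-7 and τ = 5% used in the experiments.

```python
import math


def should_split(delta_g: float, epsilon: float, tau: float) -> bool:
    """Split when the best attribute is clearly ahead, or when epsilon has
    dropped below the tie threshold tau so the choice no longer matters much."""
    return delta_g > epsilon or epsilon < tau


def examples_until_tie_break(tau: float, delta: float, value_range: float = 1.0) -> int:
    """Smallest n at which the Hoeffding epsilon is no larger than tau."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * tau ** 2))


if __name__ == "__main__":
    print(examples_until_tie_break(tau=0.05, delta=1e-7))  # 3224
```

Without τ, two attributes with nearly identical G could hold a leaf indefinitely; with it, the wait is capped at a few thousand examples under these settings.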

Page 26: Mining High-Speed Data Streams

The VFDT System - G-Computation

• The most significant part of the time cost per example is recomputing G.
• Computing a G value for every new example is inefficient.
• In VFDT, users can specify an n_min value.
• n_min : number of new examples that must accumulate at a leaf before G is recomputed.
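One way to picture n_min is a per-leaf counter that only signals a recomputation every n_min examples. This is a sketch under that assumption; the `LeafCounter` class is illustrative and not part of VFDT's published interface.

```python
class LeafCounter:
    """Tracks examples seen at a leaf and says when G should be recomputed."""

    def __init__(self, n_min: int = 200):
        self.n_min = n_min
        self.seen = 0
        self.seen_at_last_check = 0

    def observe(self) -> bool:
        """Record one example; return True if n_min new examples have accumulated."""
        self.seen += 1
        if self.seen - self.seen_at_last_check >= self.n_min:
            self.seen_at_last_check = self.seen
            return True
        return False


if __name__ == "__main__":
    counter = LeafCounter(n_min=200)
    checks = sum(counter.observe() for _ in range(1000))
    print(checks)  # 5 recomputations instead of 1000
```

The trade-off is that a split may be made up to n_min - 1 examples later than strictly necessary.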

Page 27: Mining High-Speed Data Streams

The VFDT System - Memory

• VFDT’s memory use is dominated by the memory required to keep counts for all growing leaves.
• If the maximum available memory is reached, VFDT deactivates the least promising leaves.
• The least promising leaves are considered to be the ones with the lowest values of p_l e_l, where p_l is the probability that an arbitrary example falls into leaf l and e_l is the observed error rate at that leaf.
• When a leaf is deactivated, its memory is freed, except for a single number used to store the value of p_l e_l.
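The ranking step can be sketched as follows. The data layout (a dictionary from leaf id to a `(p_l, e_l)` pair) and the `max_active` budget are assumptions made for illustration; the slides do not say how VFDT represents either.

```python
def leaves_to_deactivate(leaves: dict, max_active: int) -> list:
    """Return the ids of the least promising leaves, ranked by p_l * e_l,
    so that at most `max_active` leaves remain active."""
    ranked = sorted(leaves, key=lambda leaf_id: leaves[leaf_id][0] * leaves[leaf_id][1])
    n_drop = max(0, len(leaves) - max_active)
    return ranked[:n_drop]


if __name__ == "__main__":
    # leaf id -> (p_l, e_l); the numbers are made up for the example
    leaves = {"A": (0.5, 0.10), "B": (0.3, 0.40), "C": (0.2, 0.05)}
    print(leaves_to_deactivate(leaves, max_active=2))  # ['C']
```

A leaf that few examples reach, or that already classifies well, has little accuracy left to gain from further splitting, so its counts are the cheapest to give up.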

Page 28: Mining High-Speed Data Streams

The VFDT System - Poor Attributes

• VFDT’s memory usage is also reduced by dropping, early on, attributes that do not look promising.
• As soon as the difference between an attribute’s G and the best attribute’s G becomes greater than ϵ, the attribute can be dropped.
• The memory used to store the corresponding counts can then be freed.
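The pruning rule amounts to a filter over the current G values. The sketch below assumes the observed G values are held in a dictionary keyed by attribute name; the function name is hypothetical.

```python
def surviving_attributes(gains: dict, epsilon: float) -> dict:
    """Keep only attributes whose G is within epsilon of the best observed G."""
    best = max(gains.values())
    return {attr: g for attr, g in gains.items() if best - g <= epsilon}


if __name__ == "__main__":
    gains = {"a": 0.40, "b": 0.35, "c": 0.05}
    print(sorted(surviving_attributes(gains, epsilon=0.1)))  # ['a', 'b']
```

An attribute that trails the leader by more than ϵ is unlikely (at confidence 1 - δ) to be the true best, so its counts no longer need to be maintained at that leaf.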

Page 29: Mining High-Speed Data Streams

The VFDT System - Initialization

• VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data.
• The tree can either be input as it is or over-pruned.
• This gives VFDT a “head start”.

Page 30: Mining High-Speed Data Streams

The VFDT System - Rescans

• VFDT can rescan previously seen examples.
• Rescans are activated if:
  o The data arrives slowly enough that there is time for rescans
  o The dataset is finite and small enough that rescanning is feasible
• With rescans, VFDT need never grow a smaller (and potentially less accurate) tree than other algorithms just because it uses each example only once.

Page 32: Mining High-Speed Data Streams

Synthetic Data Study

• Compared VFDT with C4.5 Release 8
• Restricted the two systems to the same amount of RAM
• VFDT used information gain as the G function.
  o 14 concepts were used, all with 2 classes and 100 attributes.
  o For each level after the first 3:
      A fraction f of all the nodes was replaced by leaves
      The rest became splits on a random attribute
  o At a depth of 18, all the nodes were replaced with leaves.
  o Each leaf was randomly assigned a class.
• A stream of training examples was then generated by:
  o Sampling uniformly from the instance space
  o Assigning classes according to the target tree
  o Adding various levels of class and attribute noise
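The generation procedure can be approximated with a short sketch. Several details here are assumptions rather than facts from the slides: attributes are taken to be binary, each node past the third level becomes a leaf with probability f (instead of replacing an exact fraction f of the nodes at that level), an attribute is not reused along a path, and noise replaces each attribute value and the class with a random value with the same probability. All function names are illustrative.

```python
import random


def random_tree(rng, attributes, f, depth=0, max_depth=18):
    """Return a class label (leaf) or a tuple (attribute, subtree_if_0, subtree_if_1)."""
    forced_split = depth < 3                      # the first 3 levels always split
    if depth >= max_depth or not attributes or (not forced_split and rng.random() < f):
        return rng.randint(0, 1)                  # leaf with a random class
    attr = rng.choice(attributes)
    rest = [a for a in attributes if a != attr]   # do not reuse an attribute on a path
    return (attr,
            random_tree(rng, rest, f, depth + 1, max_depth),
            random_tree(rng, rest, f, depth + 1, max_depth))


def classify(tree, x):
    """Follow the target tree down to a leaf and return its class."""
    while isinstance(tree, tuple):
        attr, if_zero, if_one = tree
        tree = if_one if x[attr] else if_zero
    return tree


def example_stream(rng, tree, n_attributes, noise=0.0):
    """Yield (x, y) pairs sampled uniformly from the instance space, forever."""
    while True:
        x = [rng.randint(0, 1) for _ in range(n_attributes)]
        y = classify(tree, x)
        # Assumed noise model: corrupt attribute values and the class label.
        x = [rng.randint(0, 1) if rng.random() < noise else v for v in x]
        if rng.random() < noise:
            y = rng.randint(0, 1)
        yield x, y


if __name__ == "__main__":
    rng = random.Random(0)
    target = random_tree(rng, list(range(10)), f=0.3, max_depth=6)
    stream = example_stream(rng, target, n_attributes=10, noise=0.05)
    for _ in range(3):
        print(next(stream))
```

The usage example uses 10 attributes and a depth limit of 6 to keep the tree small; the study's settings (100 attributes, depth 18) can be passed in the same way, though small values of f then produce very large trees.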

Page 33: Mining High-Speed Data Streams

Synthetic Data Study - Cont.

Figure: Accuracy as a function of the number of training examples (δ = 10^-7, n_min = 200, τ = 5%)

Page 34: Mining High-Speed Data Streams

Synthetic Data Study - Cont.

Figure: Tree size as a function of the number of training examples (δ = 10^-7, n_min = 200, τ = 5%)

Page 35: Mining High-Speed Data Streams

Synthetic Data Study - Cont.

Figure: Accuracy as a function of the noise level (C4.5: 100k examples, VFDT: 20 million examples)

Page 36: Mining High-Speed Data Streams

Lesion Study

Figure: Effect of initializing VFDT with C4.5, with and without pruning

Page 37: Mining High-Speed Data Streams

Web Data - Trial Run

• Application of VFDT to mining a stream of Web page requests
• Test location: the entire University of Washington campus
• δ = 10^-7, n_min = 200, τ = 5%
• Statistics for mining 1.6 million examples:
  o VFDT took 1450 seconds to do one pass over the training data
  o 983 of those seconds were spent reading data from the disk
  o C4.5 took 24 hours to mine 1.6 million examples.
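The timing figures quoted on this slide can be put in proportion with a little arithmetic. The variable names below are illustrative, and the numbers are taken directly from the slide.

```python
vfdt_total_s = 1450          # one VFDT pass over the training data
disk_read_s = 983            # portion of that pass spent reading from disk
c45_total_s = 24 * 60 * 60   # 24 hours, in seconds

learning_s = vfdt_total_s - disk_read_s     # time not spent on disk reads
disk_fraction = disk_read_s / vfdt_total_s  # share of the pass that is I/O
speedup = c45_total_s / vfdt_total_s        # C4.5 time relative to VFDT time

print(learning_s, round(disk_fraction, 2), round(speedup, 1))
```

On these figures, roughly two thirds of VFDT's pass is disk I/O, and the whole pass is about 60 times shorter than the C4.5 run.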

Page 38: Mining High-Speed Data Streams

Web Data - Trial Run Results

Figure: VFDT performance on Web data

Page 40: Mining High-Speed Data Streams

Conclusion - Hoeffding Trees

• A method for learning online
• Learns from the increasingly common high-volume data streams
• Allows learning in very small constant time per example
• Strong guarantees of high asymptotic similarity to the corresponding batch trees

Page 41: Mining High-Speed Data Streams

Conclusion - VFDT Systems

• A high-performance data mining system
• Based on Hoeffding trees
• Empirical studies show its effectiveness in taking advantage of massive numbers of examples
• Practical, efficient, and accurate

Page 43: Mining High-Speed Data Streams

Review Questions - 1 of 3

• Question: Name four challenges that modern algorithms have to overcome today.
  o Answer: See Slide 7.
  o Operate continuously and indefinitely
  o Incorporate new examples as they become available
  o Never lose potentially valuable information
  o Build a model using at most one scan of a database or dataset
  o Use only a fixed amount of main memory
  o Require small, constant time per record
  o Make a usable model available at any point during the algorithm’s runtime

Page 44: Mining High-Speed Data Streams

Review Questions - 2 of 3

• Question: List the input requirements of the HT algorithm, and state what output is generated.
  o Answer: See Slide 16.
  o Inputs:
      S : sequence of examples
      X : set of discrete attributes
      G(.) : split evaluation function
      δ : desired probability of choosing the wrong attribute at any given node
  o Output:
      HT : a decision tree (Hoeffding Tree)

Page 45: Mining High-Speed Data Streams

Review Questions - 3 of 3

• Question: How is memory management handled differently in VFDT than in a Hoeffding Tree?
  o Answer: See Slide 27 (& 28).
  o VFDT’s memory use is dominated by the memory required to keep counts for all growing leaves.
  o If the maximum available memory is reached, VFDT deactivates the least promising leaves.
  o The least promising leaves are considered to be the ones with the lowest values of p_l e_l.
  o When a leaf is deactivated, its memory is freed, except for a single number used to store the value of p_l e_l.
  o One might also note that unpromising attributes are dropped early on for memory efficiency.

Page 46: Mining High-Speed Data Streams

Any Questions?