Practical Machine Learning in Spark Chih-Chieh Hung Tamkang University

AI and Big Data Processing: Hands-on Spark (2017-12-16)


Page 1

Practical Machine Learning in Spark

Chih-Chieh Hung

Tamkang University

Page 2

Chih-Chieh Hung 洪智傑

• Tamkang University (Assistant Professor), 2016-
• Rakuten Inc., Japan (Data Scientist), 2013-2015
• Yahoo! Inc., Taiwan (Research Engineer), 2011-2013
• Microsoft Research Asia, China (Research Intern), 2010

Page 3

Something About Big Data

Page 6

Big Data Definition

• No single standard definition…

"Big Data" is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

Page 7

Scale (Volume)

• Data volume: 44x increase from 2009 to 2020
  • From 0.8 zettabytes to 35 ZB
• Data volume is increasing exponentially

Page 8

Complexity (Variety)

• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
• Static data vs. streaming data
• A single application can generate/collect many types of data

To extract knowledge, all these types of data need to be linked together.

Page 9

Speed (Velocity)

• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions → missed opportunities

Page 10

Four V Challenges in Big Data

*. http://www-05.ibm.com/fr/events/netezzaDM_2012/Solutions_Big_Data.pdf

Page 11

Apache Hadoop Stack

Page 12

Apache Hadoop

• The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

• Three major modules:
  • Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: a framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.

Page 13

Hadoop Components: HDFS

• File system
  • Sits on top of a native file system
  • Based on Google's GFS
  • Provides redundant storage
• Read/Write
  • Good at large, sequential reads
  • Files are "write once"
• Components
  • NameNode: stores the metadata of files
  • DataNodes: store the actual blocks
  • Secondary NameNode: periodically merges the fsimage and the edits log files, keeping the edits log size within a limit

Page 14

Hadoop Components: YARN

• Manages resources (the "data operating system" of Hadoop).
• YARN = Yet Another Resource Negotiator
• Manages and monitors workloads.
• Maintains a multi-tenant platform.
• Implements security controls.
• Supports multiple processing models in addition to MapReduce.

Page 15

Hadoop Components: MapReduce

• Processes data in the cluster.
• Two phases: Map + Reduce
  • Between the two is the "shuffle-and-sort" stage
• Map
  • Operates on a discrete portion of the overall dataset
• Reduce
  • After all maps are complete, the intermediate data are shuffled to the nodes that perform the Reduce phase.

Page 16

The MapReduce Framework

Page 17

MapReduce Algorithm For Word Count

• Input and Output

Page 18

Step 1: Design Mapper (Must Implement)

• Write the mapper: output the key-value pair <word, 1>

Page 19

Step 2: Sort and Shuffle (Don't Need to Do)

• The values with the same key will be sent to the same reducer.

Page 20

Step 3: Design Reducer (Must Implement)

• Write the reducer to output (word, sum of all the values)
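The slides illustrate these steps with figures. As a rough sketch (my own, not from the deck), the mapper and reducer can be written as Hadoop Streaming-style Python scripts, assuming Streaming's convention of tab-separated key-value pairs on stdin/stdout:

#!/usr/bin/env python
# mapper.py -- Step 1: emit <word, 1> for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- Step 3: input arrives sorted by key (Step 2 is done by the
# framework), so all counts for one word are adjacent and can be summed in one pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))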

Page 21

Spark

Page 22

What is Spark?

• A fast and expressive cluster computing system compatible with Apache Hadoop

Efficient
• General execution graphs
• In-memory storage

Usable
• Rich APIs in Java, Scala, Python
• Interactive shell

Page 23

Key Concepts

Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)

• Write programs in terms of transformations on distributed datasets

Page 24

Language Support

Standalone Programs
• Python, Scala, & Java

Interactive Shells
• Python & Scala

Performance
• Java & Scala are faster due to static typing
• …but Python is often fine

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();

Page 25

Spark Ecosystem

Page 26

A Simple Example of a Spark App

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])

(sc creates the SparkContext; the rest are RDD ops.)

Page 27

SparkContext

• Main entry point

• SparkContext is the object that manages the connection to the cluster in Spark and coordinates the processes running on it. SparkContext connects to cluster managers, which manage the actual executors that run the specific computations.

Page 28

SparkContext

• Main entry point to Spark functionality
• Available in the shell as the variable sc
• In standalone programs, you'd make your own (see later for details)

Page 29

Create SparkContext: Local Mode

• Very simple
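The code screenshot did not survive extraction; a minimal sketch of local mode ("MyApp" is a placeholder application name):

from pyspark import SparkContext

# master = "local", application name = "MyApp"
sc = SparkContext("local", "MyApp")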

Page 30

Create SparkContext: Cluster Mode

• Need to build a SparkConf describing the cluster
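The code screenshot is missing here too; a sketch of the cluster-mode setup, assuming a standalone master at spark://master:7077 (a placeholder URL):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://master:7077")      # where the cluster manager runs
        .setAppName("MyApp")
        .set("spark.executor.memory", "2g"))   # example executor setting
sc = SparkContext(conf=conf)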

Page 31

Resilient Distributed Datasets (RDD)

• An RDD is Spark's representation of a dataset that is distributed across the RAM, or memory, of lots of machines.

• An RDD object is essentially a collection of elements: you can use it to hold lists of tuples, dictionaries, lists, etc.

• Lazy evaluation: Spark postpones running a calculation until it is absolutely necessary.

Page 32

Working with RDDs

Page 33

Transformations and Actions in Spark

• RDDs have actions, which return values, and transformations, which return pointers to new RDDs.

• An RDD's value is only computed once that RDD is evaluated as part of an action.

Page 34

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()           # action
messages.filter(lambda s: "php" in s).count()
. . .

(Diagram: the driver sends tasks to workers, each reading one block of the file; results return to the driver, and the messages RDD stays cached in each worker's memory.)

Full-text search of Wikipedia:
• 60 GB on 20 EC2 machines
• 0.5 sec (cached) vs. 20 sec for on-disk data

Page 35

Creating RDDs

# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")

# Use an existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Page 36

Most Widely-Used Action and Transformation

Page 37

Transformation

Page 38

Basic Transformations

> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x * x)   # => {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # => {4}

# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))   # => {0, 0, 1, 0, 1, 2}

range(x) is a sequence of numbers 0, 1, …, x-1

Page 39

map() and flatMap()

• map()

The map() transformation applies a function to each element of the RDD and returns a new RDD of the results. When the function returns an iterable per element (e.g. splitting a line into words), the result is an iterable of iterables: one inner iterable per input element.

Page 40

map() and flatMap()

• flatMap()

This transformation applies the function to each element just like map(), but flattens the results: instead of an iterable of iterables, you get a single flat RDD holding all the output elements.

Page 41

map() and flatMap() examples

> lines.take(2)
['#good d#ay #', '#good #weather']

> words = lines.map(lambda line: line.split(' '))
[['#good', 'd#ay', '#'], ['#good', '#weather']]

> words = lines.flatMap(lambda line: line.split(' '))
['#good', 'd#ay', '#', '#good', '#weather']

Page 42

filter()

• The filter() transformation returns a new RDD containing only the elements of the old RDD that satisfy some condition.

Page 43

filter() example

• How to filter hashtags out of words:

> hashtags = words.filter(lambda word: word.startswith("#")) \
                  .filter(lambda word: word != "#")
['#good', '#good', '#weather']

Page 44

join()

• Returns an RDD containing all pairs of elements having the same key in the two original RDDs

Page 45

Join() Example
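The example image did not survive extraction; a minimal sketch (my own data) of join() on two pair RDDs:

x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("a", 3)])
x.join(y).collect()
# => [('a', (1, 2)), ('a', (1, 3))]   # 'b' has no match in y, so it is dropped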

Page 46

KeyBy()

• Create a Pair RDD, forming one pair for each item in the original RDD. The pair’s key is calculated from the value via a user-defined function.

Page 47

KeyBy() examples
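The examples image is missing; a minimal sketch (my own data), keying each word by its first letter:

words = sc.parallelize(["apple", "banana", "avocado"])
words.keyBy(lambda w: w[0]).collect()
# => [('a', 'apple'), ('b', 'banana'), ('a', 'avocado')]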

Page 48

GroupBy()

• Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key.

Page 49

GroupBy() example
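The example image is missing; a minimal sketch (my own data), grouping numbers by parity:

nums = sc.parallelize([1, 2, 3, 4, 5])
sorted((k, sorted(v)) for k, v in nums.groupBy(lambda x: x % 2).collect())
# => [(0, [2, 4]), (1, [1, 3, 5])]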

Page 50

GroupByKey()

• Group the values for each key in the original RDD. Create a new pair where the original key corresponds to this collected group of values.

Page 51

GroupByKey() example
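The example image is missing; a minimal sketch (my own data):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
sorted((k, sorted(v)) for k, v in pairs.groupByKey().collect())
# => [('a', [1, 3]), ('b', [2])]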

Page 52

reduceByKey()

• reduceByKey(f) combines tuples with the same key using a function f that we specify.

> hashtagsNum = hashtags.map(lambda word: (word, 1))
[('#good', 1), ('#good', 1), ('#weather', 1)]

> hashtagsCount = hashtagsNum.reduceByKey(lambda a, b: a + b)
[('#good', 2), ('#weather', 1)]

Page 53

The Difference between GroupByKey() and ReduceByKey()
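The comparison diagram did not survive extraction. The key point (a standard fact, not from the slide): reduceByKey combines values within each partition before shuffling (a map-side combine), while groupByKey ships every value across the network. A minimal sketch of the two on the same data:

pairs = sc.parallelize([("a", 1), ("a", 1), ("b", 1)])

pairs.reduceByKey(lambda x, y: x + y).collect()   # pre-aggregates per partition; preferred
pairs.groupByKey().mapValues(sum).collect()       # shuffles every value, then sums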

Page 54

Example: Word Count

> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda x, y: x + y)

"to be or"  → "to", "be", "or"  → (to, 1), (be, 1), (or, 1)  → (be, 2), (not, 1)
"not to be" → "not", "to", "be" → (not, 1), (to, 1), (be, 1) → (or, 1), (to, 2)

Page 55

Actions

Page 56

Basic Actions

> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]

# Return first K elements
> nums.take(2)   # => [1, 2]

# Count number of elements
> nums.count()   # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

Page 57

Collect()

• Returns all elements in the RDD to the driver in a single list

• Don't do this on a big RDD: everything must fit in the driver's memory.

Page 58

Reduce()

• Aggregates all the elements of the RDD by applying a user function pairwise to elements and partial results, and returns the result to the driver.

Page 59

Aggregate()

• Aggregates all elements of the RDD by:
  • applying a user function seqOp to combine elements with user-supplied objects,
  • then combining those user-defined results via a second user function combOp,
  • and finally returning a result to the driver.

Page 60

Aggregate(): Using the seqOp in Each Partition

Page 61

Aggregate(): Using combOp among Partitions

Page 62

Aggregate() example
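The example image is missing; a minimal sketch (my own) that computes (sum, count) in a single pass:

nums = sc.parallelize([1, 2, 3, 4])
seqOp = lambda acc, x: (acc[0] + x, acc[1] + 1)    # fold one element into (sum, count)
combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge the per-partition results
nums.aggregate((0, 0), seqOp, combOp)              # => (10, 4)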

Page 63

More RDD Operators

• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
• sample, take, first, partitionBy, mapWith, pipe, save, …

Page 64

Lab 1

Page 65

Example: PageRank

• Good example of a more complex algorithm
  • Multiple stages of map & reduce
• Benefits from Spark's in-memory caching
  • Multiple iterations over the same data

Page 66

Basic Idea

Give pages ranks (scores) based on links to them:
• Links from many pages → high rank
• A link from a high-rank page → high rank

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

Page 67

Algorithm

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × Σ contribs

(Pages 67-72 animate these steps on a four-page example graph: all ranks start at 1.0; after one iteration they become 0.58, 1.0, 1.85, and 0.58; after another, 0.39, 1.72, 1.31, and 0.58; and they converge to a final state of 0.46, 1.37, 1.44, and 0.73.)
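A compact PySpark sketch of these three steps (my own, following the well-known Spark PageRank example; a tiny three-page graph stands in for the slides' example):

# links: pair RDD of (page, [neighbors])
links = sc.parallelize([("a", ["b", "c"]), ("b", ["c", "a"]), ("c", ["a"])]).cache()
ranks = links.mapValues(lambda _: 1.0)            # step 1: every rank starts at 1

for _ in range(10):                               # steps 2-3, iterated
    # each page sends rank/|neighbors| to every neighbor
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    ranks = contribs.reduceByKey(lambda x, y: x + y) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())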

Page 73

Lab 2

Page 74

Machine Learning in 30 min

Page 75

Machine Learning is…

• "Machine learning is about predicting the future based on the past."
  -- Hal Daume III

(Diagram: past training data → model/predictor; the model then predicts on future testing data.)

Page 76

Machine Learning Types

Page 77

Supervised vs. Unsupervised Learning

Page 78

Reinforcement Learning

Page 79

General Flow for Machine Learning

Page 80

Training Data, Testing Data, Validation Data

• Training data: used to train a model (we have it)
• Testing data: used to test the performance of a model (we don't have it at training time)
• Validation data: "artificial" testing data held out from the data we have

Page 81

Model Evaluation: What Are We Seeking?

• Minimize the error between training data and the model

Page 82

Example: The Error of The Model

Page 83

General Flow of Training and Testing

Page 84

Classification Concept

Page 85

Supervised Learning in A Nutshell

• Think about how you learned as a baby: Mom taught you…

Page 86

Supervised Learning in A Nutshell

• What is it?

Page 87

Supervised Learning in A Nutshell

• Training data: features + label
• Testing data: features only; the label (e.g. "Rabbit!") is what we guess

Page 88

Handwritten Recognition

• Input: (1) hand-written words and their labels; (2) a hand-written word W
• Output: the label of W

Page 89

General Classification Flow

Page 90

Before Hands-on

Page 91

What is MLlib

• MLlib is the Apache Spark component focusing on machine learning:
  • MLlib is Spark's core ML library
  • Developed by the MLbase team in AMPLab
  • 80+ contributions from various organizations
  • Supports Scala, Python, and Java APIs

Page 92

Spark Ecosystem

Page 93

Algorithms in MLlib

• Statistics: description, correlation
• Clustering: k-means
• Classification: SVMs, naive Bayes, decision trees, logistic regression
• Regression: linear regression (+lasso, +ridge)
• Dimensionality reduction: SVD, PCA
• Optimization primitives: SGD, parallel gradient
• Collaborative filtering: ALS

Page 94

Why MLlib

• Scalability
• Performance
• User-friendly documentation and APIs
• Cost of maintenance

Page 95

Performance

Page 96

Data Types

• Dense vector
• Sparse vector
• Labeled point

Page 97

Dense & Sparse

• Raw data:

ID  A  B  C  D  E  F
 1  1  0  0  0  0  3
 2  0  1  0  1  0  2
 3  1  1  1  0  1  1
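As a sketch, row 1 of this table (features A-F = 1, 0, 0, 0, 0, 3) can be encoded either way with MLlib's Vectors factory:

from pyspark.mllib.linalg import Vectors

dense = Vectors.dense([1.0, 0.0, 0.0, 0.0, 0.0, 3.0])
sparse = Vectors.sparse(6, [0, 5], [1.0, 3.0])   # size, non-zero indices, values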

Page 98

Dense vs. Sparse

• A case study:
  - number of examples: 12 million
  - number of features: 500
  - sparsity: 10%

• The sparse representation not only saves storage, but also gave a 4x speed-up:

          Dense   Sparse
Storage   47 GB   7 GB
Time      240 s   58 s

Page 99

Labeled Point

• Dummy variable (1, 0)
• Categorical variable (0, 1, 2, …)

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

Page 100

Descriptive Statistics

• Supported functions: count, max, min, mean, variance, …
• Supported data types: dense, sparse, labeled point

Page 101

Example

from pyspark.mllib.stat import Statistics
from math import sqrt
import numpy as np

## example data (2 x 5 matrix)
data = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                 [1.0, 2.0, 3.0, 4.0, 5.0]])

## to RDD
distData = sc.parallelize(data)

## compute statistics
summary = Statistics.colStats(distData)
print "Duration Statistics:"
print " Mean: {}".format(round(summary.mean()[0], 3))
print " St. deviation: {}".format(round(sqrt(summary.variance()[0]), 3))
print " Max value: {}".format(round(summary.max()[0], 3))
print " Min value: {}".format(round(summary.min()[0], 3))
print " Total value count: {}".format(summary.count())
print " Number of non-zero values: {}".format(summary.numNonzeros()[0])

Page 102

Classification Algorithms

Page 103

1. Naïve Bayesian Classification

• Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:

  P(h|D) = P(D|h) P(h) / P(D)

• MAP (maximum a posteriori) hypothesis:

  h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)

Page 104

Play-Tennis Example

• Given the training set below and an unseen sample X = <rain, hot, high, false>, what class will X be?

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Page 105

Training Step: Compute Probabilities

• From the table we can compute:

P(p) = 9/14, P(n) = 5/14

outlook:     P(sunny|p) = 2/9,    P(sunny|n) = 3/5
             P(overcast|p) = 4/9, P(overcast|n) = 0
             P(rain|p) = 3/9,     P(rain|n) = 2/5
temperature: P(hot|p) = 2/9,      P(hot|n) = 2/5
             P(mild|p) = 4/9,     P(mild|n) = 2/5
             P(cool|p) = 3/9,     P(cool|n) = 1/5
humidity:    P(high|p) = 3/9,     P(high|n) = 4/5
             P(normal|p) = 6/9,   P(normal|n) = 2/5
windy:       P(true|p) = 3/9,     P(true|n) = 3/5
             P(false|p) = 6/9,    P(false|n) = 2/5

Page 106

Prediction Step

• An unseen sample X = <rain, hot, high, false>

1. P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582

2. P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286

• Sample X is classified as class n (don't play)

Page 107

Try It on Spark

• Download Experimental Data:

https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_naive_bayes_data.txt

• Download the Example Code of Naïve Bayes Classification:

https://raw.githubusercontent.com/apache/spark/master/examples/src/main/python/mllib/naive_bayes_example.py

Page 108

Experimental Data

0,1 0 0

0,2 0 0

0,3 0 0

0,4 0 0

1,0 1 0

1,0 2 0

1,0 3 0

1,0 4 0

2,0 0 1

2,0 0 2

2,0 0 3

2,0 0 4

Example: the line "1,0 2 0" means feature vector (0, 2, 0) and class label 1.

Page 109

Naïve Bayes in Spark

• Step 1: Prepare data
• Step 2: NaiveBayes.train()
• Step 3: NaiveBayes.predict()
• Step 4: Evaluation

*. Full version: https://spark.apache.org/docs/latest/mllib-naive-bayes.html
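A minimal sketch of the four steps, in the spirit of the linked example (the parse matches the "label,f1 f2 f3" format of the sample data on Page 108):

from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

def parseLine(line):
    # "1,0 2 0" -> label 1.0, features (0, 2, 0)
    label, features = line.split(',')
    return LabeledPoint(float(label),
                        Vectors.dense([float(x) for x in features.split(' ')]))

data = sc.textFile("sample_naive_bayes_data.txt").map(parseLine)          # step 1
training, test = data.randomSplit([0.6, 0.4], seed=0)

model = NaiveBayes.train(training, 1.0)                                   # step 2

predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))  # step 3

accuracy = 1.0 * predictionAndLabel.filter(                               # step 4
    lambda pl: pl[0] == pl[1]).count() / test.count()
print("model accuracy: {}".format(accuracy))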

Page 110

2. Decision Tree

• Decision tree
  • A flow-chart-like tree structure
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • Leaf nodes represent class labels or class distributions
• Decision tree generation consists of two phases
  • Tree construction
    • At the start, all the training examples are at the root
    • Partition examples recursively based on selected attributes
  • Tree pruning
    • Identify and remove branches that reflect noise or outliers
• Use of a decision tree: classifying an unknown sample
  • Test the attribute values of the sample against the decision tree

Page 111

Example: Predict buys_computer

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Page 112

Decision Tree

(Reconstructed from the slide's tree figure:)

age?
• <=30   → student?        no → no;        yes → yes
• 30..40 → yes
• >40    → credit rating?  excellent → no; fair → yes

Page 113

Build A Decision Tree

• Step 1: All data in the root
• Step 2: Split the node that leads to the purest sub-nodes
• Step 3: Repeat until the terminal conditions are met

Page 114

Measures for Purity

• Information Gain, Gini Index,…

• Example
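The example image did not survive extraction. For reference, the two measures can be sketched in Python from their textbook definitions (not from the slide):

import math

def entropy(props):
    # props: class proportions at a node, summing to 1
    return -sum(p * math.log(p, 2) for p in props if p > 0)

def gini(props):
    return 1 - sum(p * p for p in props)

entropy([9.0/14, 5.0/14])   # ~0.940 for the play-tennis root node (9 P, 5 N)
gini([9.0/14, 5.0/14])      # ~0.459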

Page 115

Terminal Conditions

Page 116

Decision Tree in Spark

• Step 1: Prepare data
• Step 2: DT.trainClassifier()
• Step 3: DT.predict()
• Step 4: Evaluation

*. Full version: https://spark.apache.org/docs/latest/mllib-decision-tree.html
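A minimal sketch of the four steps, following the linked MLlib docs (the LIBSVM sample path is the one shipped with Spark):

from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")    # step 1
trainingData, testData = data.randomSplit([0.7, 0.3])

model = DecisionTree.trainClassifier(trainingData, numClasses=2,          # step 2
                                     categoricalFeaturesInfo={},
                                     impurity="gini", maxDepth=5)

predictions = model.predict(testData.map(lambda x: x.features))           # step 3
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)

testErr = labelsAndPredictions.filter(                                    # step 4
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print("Test error: {}".format(testErr))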

Page 117

Ensemble Decision-Tree-based Algorithms

• Random Forest: pick random subsets to build trees
• AdaBoost: improve trees sequentially

Page 118

3. Logistic Regression

• A classification algorithm

Page 119

Hypothesis Function

• Hypothesis (the equation image did not survive extraction; the standard logistic hypothesis is):

  h_θ(x) = 1 / (1 + e^(-θᵀx))

Page 120

When outcome is only 1/0

Page 121

Logistic Regression in Spark

• Step 1: Prepare data
• Step 2: LR.train()
• Step 3: LR.predict()
• Step 4: Evaluation

*. Full version: https://spark.apache.org/docs/latest/mllib-linear-methods.html#classification
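A minimal sketch of the four steps, following the linked docs (training error is computed on the training set for brevity):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")     # step 1
model = LogisticRegressionWithLBFGS.train(data)                            # step 2

labelsAndPreds = data.map(lambda p: (p.label, model.predict(p.features)))  # step 3
trainErr = labelsAndPreds.filter(                                          # step 4
    lambda lp: lp[0] != lp[1]).count() / float(data.count())
print("Training error: {}".format(trainErr))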

Page 122

4. Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.

• The decision function is fully specified by a subset of the training samples, the support vectors.

Page 123

What If the Data Are Not Linearly Separable?

• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable. The mapping is done via a kernel function.

Page 124

Kernels

• Why use kernels?
  • Make a non-separable problem separable
  • Map data into a better representational space
• Common kernels
  • Linear
  • Polynomial: K(x, z) = (1 + xᵀz)^d
  • Radial basis function (RBF)

Page 125

SVM with Different Kernels

Page 126

SVM in Spark

• Step 1: Prepare data
• Step 2: SVM.train()
• Step 3: SVM.predict()
• Step 4: Evaluation

*. Full version: https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machines-svms
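A minimal sketch of the four steps, following the linked docs. Note that spark.mllib's SVMWithSGD trains a linear SVM only; the kernels of Pages 123-125 are not available here.

from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")     # step 1
model = SVMWithSGD.train(data, iterations=100)                             # step 2

labelsAndPreds = data.map(lambda p: (p.label, model.predict(p.features)))  # step 3
trainErr = labelsAndPreds.filter(                                          # step 4
    lambda lp: lp[0] != lp[1]).count() / float(data.count())
print("Training error: {}".format(trainErr))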

Page 127

Lab 3