(Big) Data Analysis Using R, 2014.7.10, 윤형기 ([email protected]), v.3

2014.7.10 ([email protected])„-이용한... · 순서 오전 도입 빅데이터 분석 개요 데이터와 데이터베이스 빅데이터와 빅데이터 분석 r 기초 개요,



  • (Big) Data Analysis Using R

    2014.7.10

    ([email protected])

    v.3

    mailto:[email protected]

  • R ,

    - /

    R (1) -

    , ,

    R (2) Clickstream Profiling

    R (3)

    &

    2

  • 3

  • !

    !

    (Thought Experiment)

    http://www.openwith.net/?page_id=766

    4


  • 5

  • Prelude:

  • Berlin on Tolstoy

    Analyzed Tolstoy in a popular essay in 1953 based on Archilochus

    What was Tolstoy?

    Many talents, like a fox

    Believed we should be hedgehogs!

  • Fox compared against Hedgehog

    The hedgehog knows one big thing... ,

    "The fox knows many little things... ,

    .

    8

  • (1)

    Plato Aristotle

    Michelangelo Da Vinci

    Marx, Churchill, Hitler ???

    9

  • (2)

    Laozi Sima Qian

    Mao Zhou, Deng,

    10

  • The Fox and the hedgehog in a project life

    vs.

    hedgehog risk

    Systems Thinking: A Foxy Approach

    OODA: A fox dressed like a hedgehog

    11

  • Devops

    :

    ++ ~

    Cross-functional team,

    Widely-shared metrics,

    Automating repetitive tasks,

    Post-mortem, Regular release,

    + +

  • 13

  • I.

    II.BI (Business Intelligence)

    III. Hadoop

    IV.Data Science

    14

  • (/, /, /, )

    : Excel

    DBMS/DW

    RDBMS/SQL, NoSQL, ...

    ETL, CDC,

    15

  • BI

    BI

    16

    BSC: Balanced Scorecard

    VBM: Value-Based Management

    ABC: Activity-Based Costing

    OLAP: On-Line Analytical Processing

    ERP, CRM, SCM (as sources for BI)

    ETL: Extraction-Transformation-Loading

    DW: Data Warehouse (repository)

    BI Portal

  • Hadoop

    Tidal Wave 3VC

    Supercomputer High-throughput computing

    2 :

    , (grid computing)

    (MPP)

    Scale-Up vs. Scale-Out

    BI (Business Intelligence) DW/OLAP/

    17

  • Hadoop

    Google!

    Nutch/Lucene 2006

    (Flat linearity)

    18

    1990s: Excite, AltaVista, Yahoo

    2000: Google; PageRank, GFS/MapReduce

    2003~4: Google papers

    2005: Hadoop (D. Cutting & Cafarella)

    2006: Apache

  • Google Papers (2003~2010)

    Percolator: Handling individual updates

    Dremel: Online visualizations

    Google File System: a distributed file system

    MapReduce : to compute their search indices

    Pregel: Scalable graph computing

    19

  • Big Picture

    20

  • Framework

    21

  • Hadoop Ecosystems

    22

  • Major Influencers

    Open Source Tipping Point

    Google

    Before Google vs. After Google

  • Data Science , ,

    OR (Operations Research)

    ////...

    (Statistical Inference), (Parametric),

    , , , /Expert System ()

    Data Science : HPC + Google shock

    , Graph , topology,

    AI (ANN, SVM, )

    , Semantic web

    Fusion of Python or R?, DS + Cloud BDaaS,

    24

  • R

    25

  • I.

    II.R

    III.R

    IV.

    V.

    26

  • I. R

    27

  • R

    + S . () + packages

    / (DSL :Domain Specific Language) (Windows, Unix, MacOS). .

    . (Functional Programming)

    , (loop) ,

    Script , Interpreter (OOP)

    Generic Polymorphic. object

    I-1 28

  • II. R

    29

  • R

    R (Workspace)

    (Assignment)

    Batch

    30

  • R

    R CRAN (Comprehensive R Archive Network)

    http://www.cran.r-project.org/

    31


  • RStudio

    GUI R

    RStudio

    R Commander

    Rstudio

    32

  • (comment)

    #

    Examples: help.start()   # open the HTML help pages
    help(seq)                # help for seq
    ?seq                     # shorthand for help(seq)
    RSiteSearch("lm")        # search help pages and mailing lists

    History: history()       # show the last 25 commands (default)

    savehistory(file="myfile")   # save command history (default file is ".Rhistory")

    loadhistory(file="myfile")   # recall a saved history

    33

  • : rnorm(10)

    mean(abs(rnorm(100))); hist(rnorm(10))

    R . dataset :

    data( ) # Load package .

    dataset : help(datasetname)

    Session option options() # option

    getwd() #

    dir.create("d:/Rtraining"); setwd("d:/Rtraining") # \ /

    getwd()

    34

  • source( )

    session script ( ) source("myfile.R") # script (.R .r)

    lm(mpg~wt, data=mtcars)

    .

    object fit
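    The fitted-model object above is truncated; a minimal sketch of the same idea (assuming the script is saved as myfile.R), using the built-in mtcars data:

    # contents of myfile.R, run with source("myfile.R")
    fit <- lm(mpg ~ wt, data = mtcars)   # regress mpg on weight
    summary(fit)                         # inspect the fitted object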

  • sink( ) - (redirect) sink("myfile", append=FALSE, split=FALSE) #

    sink() #

    append option override () or append

    split option . # : ( )

    sink("c:/projects/output.txt")

    # : ( , )

    sink("myfile.txt", append=TRUE, split=TRUE)

    Graphic output can be redirected in the same way, e.g. pdf("mygraph.pdf")

    36

  • (Package)

    Package

    = R , .

    Packages ( ). install.packages(package)

    CRAN Mirror . (e.g. Korea)

    session load (session ) library(package)

    Library

    Package load library() # library packages

    search() # load packages

    Package help(package=)

    37

  • Customization

    R Rprofile.site . MS Windows: C:\Program Files\R\R-n.n.n\etc directory.

    Rprofile .

    Rprofile.site

    >

    Rprofile.site 2 .First( ) R session

    .Last( ) R session

    38

  • Batch

    (non-interactively) MS Windows ( )

    C:/Program Files/R/R-3.0.2/R.exe CMD BATCH C:/Rtraining/a.R

    Linux R CMD BATCH [options] my_script.R [outfile]

    > sqrt(-2)

    [1] NaN

    Warning message:

    In sqrt(-2) : NaNs produced

    > q()

    39

  • III. R

    40

  • R

    (Operators)

    (Merge)

    apply()

    41

  • (Assignment)

    R =, x+y

    > print(x+y)

    > x=pi

    > x

    > rm(x)

    42

  • #

    age

  • Import from: csv

    mydata
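    The import example above is cut off; a minimal sketch, assuming a file named mydata.csv with a header row:

    mydata <- read.csv("mydata.csv", header = TRUE, stringsAsFactors = FALSE)
    str(mydata)   # check the column types after import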

  • R

    Variable

    Continuous (nominal, ratio)

    Ordinal

    Nominal (categorical)

    (identifier), (date)

    Factor

    45

    (Mode) numeric

    character (strings)

    logical: TRUE, FALSE

    FALSE maps to 0; any non-zero value is treated as TRUE

    complex (imaginary numbers)

    Raw (bytes)

    R (data structure) Vector, matrix

    Array, Data frame

    List, Class 46

  • (dataset)

    data vector c()

    > Rev_2012 = c(110,105,120,140) # :

    > Rev_2013 = c(105,115,140,135)

    > Revenue = cbind(Rev_2012, Rev_2013) # column

    > Revenue

    Rev_2012 Rev_2013

    [1,] 110 105

    [2,] 105 115

    [3,] 120 140

    [4,] 140 135

    >

    47

  • R vectors (numerical, character, logical)

    one-dimensional

    R has no separate scalar type: a scalar is simply a vector of length 1

    a character string is a single-element vector of mode character

    Matrices

    2

    Arrays

    3

    data frames

    Column mode ( )

    Lists

    48

  • is.numeric() is.character()

    is.vector() is.matrix()

    is.data.frame()

    ~

    as.numeric(): FALSE becomes 0; "1", "2" become 1, 2

    as.logical(): 0 becomes FALSE, non-zero becomes TRUE

    as.character(): 1, 2 become "1", "2"; FALSE becomes "FALSE"

    as.factor(): encode a vector as a factor

    as.vector(), as.matrix(), as.data.frame(): convert between data structures

    49

  • - Vector

    mode

    a x = c(1,3,5,7)

    > x

    [1] 1 3 5 7

    > family = c("", "","","")

    > family

    [1] "" "" "" ""

    > c(T,T,F,T)

    [1] TRUE TRUE FALSE TRUE

    50

  • Vector indexing Vector (elements) ([ ] )

    a[c(2,4)] # 2 4 > new_a new_a

    [1] 1.0 5.3 6.0 -2.0 4.0

    vector : seq() sequence rep() - vector

    Vector (Vectorized Operations) = vector element

    : Vector In, Vector Out or Vector In, Matrix Out

    51
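    A short sketch of the indexing and generation functions named above (the vector values are illustrative):

    a <- c(1.0, 5.3, 6.0, -2.0, 4.0)
    a[c(2, 4)]               # pick the 2nd and 4th elements
    seq(1, 10, by = 2)       # 1 3 5 7 9
    rep(c(0, 1), times = 3)  # 0 1 0 1 0 1
    sqrt(a[a > 0])           # vectorized: vector in, vector out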

  • Recycling : 2 vector c(1,2) + c(5,6,7)

    Filtering > z w 0]

    > w

    [1] 5 3

    subset()

    NA vs. NULL: NA is a placeholder for a missing value

    NULL is an undefined value that does not exist (it takes up no space in a vector)

    52

  • : Vector

    R matrix

    vector %*%, +

    cbind() rbind()

    library(MASS) ginv()

    t() (transpose)

    : x y

    53
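    A brief sketch of the matrix operations listed above (ginv() needs the MASS package):

    x <- matrix(1:4, nrow = 2)
    y <- matrix(c(2, 0, 0, 2), nrow = 2)
    x %*% y        # matrix multiplication
    x + y          # element-wise addition
    t(x)           # transpose
    library(MASS)
    ginv(x)        # generalized inverse
    rbind(x, y); cbind(x, y)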

  • Matrices

    = row column vector mode( or )

    column

    (nrow=, ncol = )

    mymatrix

  • cells

  • Matrix row column apply()

    apply(m, dimcode, f, fargs) m = matrix,

    dimcode = 1: row , 2: column ,

    f= , fargs = optional argts

    > m # row

    > apply(m, 1, mean)

    [1] 3.5 4.5 5.5 6.5 7.5

    > # column

    > apply(m, 2, mean)

    [1] 3 8

    > # 2

    > apply(m, 1:2, function(x) x/2)

    56
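    The matrix m used above is not shown; a definition consistent with the printed results (an assumption) would be:

    m <- matrix(1:10, nrow = 5)       # 5 x 2 matrix, columns 1:5 and 6:10
    apply(m, 1, mean)                 # row means: 3.5 4.5 5.5 6.5 7.5
    apply(m, 2, mean)                 # column means: 3 8
    apply(m, 1:2, function(x) x / 2)  # applied to every cell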

  • Matrix > x x x

    rbind(), cbind() > B = matrix(c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)

    > C = matrix(c(7, 4, 2), nrow=3, ncol=1)

    > cbind(B, C)

    57

  • Matrix vector Matrix vector + Matrix

    > z z

    [,1] [,2]

    [1,] 1 5

    [2,] 2 6

    [3,] 3 7

    [4,] 4 8

    > length(z)

    [1] 8

    > class(z)

    [1] "matrix"

    > attributes(z)

    $dim

    [1] 4 2

    58

  • Array

    Arrays extend matrices to more than 2 dimensions. Example consistent with the output below: a 4 x 3 x 3 array filled with 1~36

    > x <- array(1:36, dim = c(4, 3, 3))

    > x[1,,]

    [,1] [,2] [,3]

    [1,] 1 13 25

    [2,] 5 17 29

    [3,] 9 21 33

    > x[,,1]

    [,1] [,2] [,3]

    [1,] 1 5 9

    [2,] 2 6 10

    [3,] 3 7 11

    [4,] 4 8 12

    59

  • List

    (ordered collection of objects). (unrelated)

    n = c(2, 3, 5)

    s = c("aa", "bb", "cc", "dd", "ee")

    b = c(TRUE, FALSE, TRUE, FALSE, FALSE)

    x = list(n, s, b, 3) # x contains copies of n, s, b

    x[2]

    x[c(2, 4)]

    [[]] . x[[3]]

    60

  • List , > z z

    > z$c z

    > # .

    > z[[4]] lapply(list(2:5,35:39), median) # list

    > sapply(list(2:5, 35:39), median) # vector/matrix

    61
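    A compact sketch of the list operations referred to above (the component values are illustrative):

    z <- list(a = 1:3, b = "hello", c = c(TRUE, FALSE))
    z$b                               # access a component by name
    z[[1]]                            # access a component by position
    lapply(list(2:5, 35:39), median)  # returns a list: 3.5, 37
    sapply(list(2:5, 35:39), median)  # simplifies to a vector: 3.5 37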

  • Data Frame

    = list special case

    Column (, , factor ) .

    d

  • list . vector

    vector .

    merge 2 merge() # merge two data frames by ID

    total
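    The merge example above is cut off; a minimal sketch with hypothetical data frames sharing a key column ID:

    dataframeA <- data.frame(ID = 1:3, x = c(10, 20, 30))
    dataframeB <- data.frame(ID = 2:4, y = c("a", "b", "c"))
    total <- merge(dataframeA, dataframeB, by = "ID")   # rows matched on ID
    total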

  • Factor

    (nominal or categorical) [ 1... k ] vector

    factor() ordered() option .

    x

  • Factor tapply()

    Vector > ages party tapply(ages, party, mean)

    57 30 34

    split()

    split(x,f) x () > g split(1:7, g)

    65
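    A self-contained sketch of the grouping functions above (the ages and party labels are made-up values):

    ages  <- c(25, 26, 55, 37, 21, 42)
    party <- factor(c("R", "D", "D", "R", "U", "D"))
    tapply(ages, party, mean)   # mean age per party level

    g <- c("M", "F", "F", "I", "M", "M", "F")
    split(1:7, g)               # splits positions 1..7 by group label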

  • (contingency table)

    2-way contingency table

    : > trial colnames(trial) rownames(trial) trial.table trial.table

    sick healthy

    risk 34 9

    no_risk 11 32

    66
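    The construction of the table above is garbled; a reconstruction consistent with the printed 2 x 2 table (an assumption about the original code):

    trial <- matrix(c(34, 11, 9, 32), ncol = 2)
    colnames(trial) <- c("sick", "healthy")
    rownames(trial) <- c("risk", "no_risk")
    trial.table <- as.table(trial)
    trial.table
    prop.table(trial.table)     # cell proportions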

  • Dataset

    ls() # objects

    names(mydata) # mydata

    str(mydata) # mydata

    levels(mydata$v1) # mydata v1 factor level

    dim(object) # object (dimensions)

    class(object) # object (numeric, matrix, data frame, ) class

    mydata # mydata

    head(mydata, n=10) # mydata 10 row

    tail(mydata, n=5) # mydata 5 row

    67

  • (Operators)

    Binary vector, matrix scalar .

    Arithmetic Operators

    + (addition)

    - (subtraction)

    * (multiplication)

    / (division)

    ^ or ** (exponentiation)

    x %% y (x mod y) 5%%2 is 1

    x %/% y

    integer division 5%/%2 is 2

    68

    < less than

    <= less than or equal to

    > greater than

    >= greater than or equal to

    == exactly equal to

    != not equal to

    !x Not x

    x | y x OR y

    x & y x AND y

    isTRUE(x) test if X is TRUE

    69

  • substr(x, start=n1,

    stop=n2)

    vector substring

    grep(pattern, x ,

    ignore.case=FALSE,

    fixed=FALSE)

    Search for pattern in x. With fixed=FALSE the pattern is a regular expression; with fixed=TRUE it is matched literally. The indices of the matching elements are returned: grep("A", c("b","A","c"), fixed=TRUE) returns 2

    sub(pattern,

    replacement, x,

    ignore.case =FALSE,

    fixed=FALSE)

    Replace the first match of pattern in x with replacement. With fixed=FALSE the pattern is a regular expression; with fixed=TRUE it is matched literally. sub("\\s", ".", "Hello There") returns "Hello.There"

    strsplit(x, split) splits the elements of x at the given delimiter.

    strsplit("abc", "") returns a 3-element vector: "a", "b", "c"

    paste(..., sep="") sep (Concatenate)

    toupper(x)

    tolower(x)

    70
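    A few one-line illustrations of the string functions above:

    substr("abcdef", 2, 4)                       # "bcd"
    grep("A", c("b", "A", "c"), fixed = TRUE)    # 2
    sub("\\s", ".", "Hello There")               # "Hello.There"
    strsplit("abc", "")                          # "a" "b" "c"
    paste("x", 1:3, sep = "")                    # "x1" "x2" "x3"
    toupper("abc"); tolower("ABC")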

  • seq(from , to, by) (sequence) indices

  • expr { } .

    if-else if (cond) expr

    if (cond) expr1 else expr2

    for for (var in seq) expr

    while while (cond) expr

    switch switch(expr, ...)

    ifelse ifelse(test,yes,no)

    72
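    Small sketches of the control structures listed above:

    x <- 5
    if (x > 0) "positive" else "non-positive"

    for (i in 1:3) print(i^2)

    i <- 1
    while (i <= 3) { print(i); i <- i + 1 }

    ifelse(c(-1, 2, -3) > 0, "pos", "neg")   # vectorized if-else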

  • order( )

    By default, sorting is ASCENDING; prefix the sorting variable with a minus sign for DESCENDING order.

    # Example: sorting the mtcars data

    attach(mtcars)

    # sort by mpg

    newdata
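    The sort result above is truncated; a minimal completion in the spirit of the slide:

    attach(mtcars)
    newdata  <- mtcars[order(mpg), ]         # ascending by mpg
    newdata2 <- mtcars[order(cyl, -mpg), ]   # by cyl, then descending mpg
    detach(mtcars)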

  • IV. R

    74

  • R

    R

    plot()

    Plots

    (Dot) Plots

    (Bar) Plots

    (Line Charts)

    (Pie Charts)

    (Boxplots)

    Scatter Plots

    75

  • R

    R

    demo(graphics); > demo(persp)

    plot(c(1,2,3),c(1,2,4))

    Nile

    mean(Nile)

    sd(Nile)

    hist(Nile)

    76

  • plot()

    plot( ) object (plot)

    Generic density, data frame,

    : plot(x,y, arguments)

    attach(mtcars)

    plot(wt, mpg)

    abline(lm(mpg~wt))

    title("Regression of MPG on Weight")

    plot() :

    77

  • Option

    type = type=p (point) type=l (line) type=b type=o type=h type=s (step)

    xlim =

    ylim =

    x y . xlim = c(1,10) xlim = range(x)

    xlab =

    ylab =

    labels for the x and y axes

    main = (main title).

    sub = (subtitle).

    bg=

    bty= 78

  • pch

    lty

    Option

    pch =

    lty = line type: 1 (solid), 2 (dashed), 3 (dotted), 4 (dot-dash)

    col= : red,green,blue

    mar = c(bottom, left, top, right) . c(5,4,4,2) + 0.1

    asp = aspect ratio (y/x)

    79

    Example: par(mfrow = c(2,2))   # mfrow lays out multiple plots per page
    plot(x, y, type="b", main = "cosine", sub = "type = b")
    plot(x, y, type="o", las = 1, bty = "u", sub = "type = o")

    plot(x,y, type="h", bty = "7", sub = "type = h")

    plot(x,y, type="s", bty = "n", sub = "type = s")

    80

  • abline()

    abline(a,b) # =a, =b

    abline(h=y) #

    abline(v=x) # abline(lm.obj) # lm.obj

    : data(cars)

    attach(cars)

    par(mfrow=c(2,2))

    plot(speed, dist, pch=1); abline(v=15.4)

    plot(speed, dist, pch=2); abline(h=43)

    plot(speed, dist, pch=3); abline(-14,3)

    plot(speed, dist, pch=8); abline(v=15.4); abline(h=43)

    81

  • plotting

    Plotting (Dot Plot) dotchart(x, labels=)

    x vector, labels .

    The groups= option splits x by a factor. Example: dotchart(mtcars$mpg, labels = row.names(mtcars), cex=.7,

    main=" ", xlab = "Miles per Gallon")

    82

  • # Dotplot: , (: mpg, group), (by cylinder)

    x

  • (Bar) Plots

    barplot(height)

    height vector matrix.

    If (height vector)

    .

    If (height matrix AND option beside=FALSE)

    bar height column stacked sub-bars )

    If (height matrix AND beside=TRUE)

    Column

    option names.arg=( ) label

    option horiz=TRUE barplot

    , Bar plot bar plotting. (mean, median, sd )

    aggregate( ) barplot( )

    84

  • # Simple Bar Plot

    counts

  • Stacked Bar Plot counts

  • Grouped Bar Plot counts
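    The three counts objects above are truncated; minimal sketches using the mtcars data (titles are illustrative):

    # simple bar plot
    counts <- table(mtcars$gear)
    barplot(counts, main = "Car Distribution", xlab = "Number of Gears")

    # stacked bar plot (height is a matrix; beside = FALSE is the default)
    counts <- table(mtcars$vs, mtcars$gear)
    barplot(counts, legend = rownames(counts))

    # grouped bar plot (bars side by side)
    barplot(counts, legend = rownames(counts), beside = TRUE)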

  • (Line Charts)

    (Line Charts) lines(x, y, type=)

    x y vector

    type=

    Type Description

    p

    l

    o overplotted points and lines

    b, c points joined by lines ("c" omits the points)

    s, S stair steps

    h histogram-like vertical lines

    n

    88

  • lines( ) plot(x, y) .

    Typical use: plot( ) plots the (x,y) points; calling plot( ) with the type="n" option draws only the axes and titles, so lines( ) can then add the data.

    : x

  • 90

  • plot( ) type= options

    x

  • pie(x, labels=)

    x is a non-negative numeric vector (the slice sizes)

    labels=

    slice vector # Simple Pie Chart

    slices

  • Pie

    # Pie Chart with %

    slices
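    The slices vectors above are truncated; a minimal sketch (values and labels are illustrative):

    slices <- c(10, 12, 4, 16, 8)
    lbls   <- c("US", "UK", "Australia", "Germany", "France")
    pie(slices, labels = lbls, main = "Simple Pie Chart")

    # pie chart with percentages added to the labels
    pct  <- round(slices / sum(slices) * 100)
    lbls <- paste(lbls, " ", pct, "%", sep = "")
    pie(slices, labels = lbls, main = "Pie Chart with %")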

  • Box Plot

    A box-and-whisker plot shows the minimum, maximum, median, Q1 and Q3 of a variable.

    Boxplot .

    boxplot(x, data= )

    x formula, data=

    formula (: y~group ), horizontal=TRUE

    94

  • # Cylinder MPG

    attach(mtcars)

    boxplot(mpg~cyl,data=mtcars, main=" Milage ",

    xlab="Cylinder ", ylab="Miles Per Gallon")

    detach(mtcars)

    95

  • (, Scatterplots)

    .

    Scatterplot: plot(x, y), where x and y are numeric vectors to plot against each other

    plot(wt, mpg, main="Scatterplot Example",

    xlab="Weight", ylab="Miles Per Gallon", pch=19)

    96

  • V. R

    97

  • R

    (, , )

    ()

    Crosstabs

    98

  • abs(x)

    sqrt(x)

    ceiling(x) ceiling(3.475) 4

    floor(x) floor(3.475) 3

    trunc(x) trunc(5.99) 5

    round(x, digits=n) round(3.475, digits=2) 3.48

    cos(x), sin(x), tan(x) acos(x), cosh(x), acosh(x)

    log(x)

    log10(x)

    exp(x) e^x

    factorial(x) factorial(5) 120

    99

  • () (random sample) simulation

    (d/p/q/r) +

    d: (density)

    p: (probability)

    q: 4 (quantile)

    r: (random number)

    100

  • dnorm(x) (default m=0 sd=1)

    pnorm(q) (area under the normal curve to the left of q)

    qnorm(p) normal quantile , p percentile

    rnorm(n, m=0,sd=1) n (random normal deviates)

    dbinom(x, size, prob)

    pbinom(q, size, prob)

    qbinom(p, size, prob)

    rbinom(n, size, prob)

    (size = , prob = )

    dpois(x, lambda)

    ppois(q, lambda)

    qpois(p, lambda)

    rpois(n, lambda)

    Poisson distribution (mean = variance = lambda). # probability of 0, 1, or 2 events when lambda = 4: dpois(0:2, 4)

    dunif(x, min=0, max=1)

    punif(q, min=0, max=1)

    qunif(p, min=0, max=1)

    runif(n, min=0, max=1)

    (uniform distribution) #10 uniform random variates

    x
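    The last example above is cut off; a minimal sketch of the d/p/q/r families:

    dpois(0:2, lambda = 4)             # P(X = 0), P(X = 1), P(X = 2) for lambda = 4
    x <- runif(10, min = 0, max = 1)   # 10 uniform random variates
    pnorm(1.96)                        # standard normal CDF at 1.96 (~0.975)
    qnorm(0.975)                       # the matching quantile (~1.96)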

  • 102

  • na.rm . Object vector .

    103

  • mean(x, trim=0,

    na.rm=FALSE)

    mean of object x. # trimmed mean (drop the top and bottom 5%): mx <- mean(x, trim = 0.05, na.rm = TRUE)

  • (Descriptive Statistics)

    = (summary statistics)

    sapply( ) # mydata . ,

    sapply(mydata, mean, na.rm=TRUE)

    sapply :

    mean, sd, var, min, max, median, range, and quantile.

    (histogram, density plot, ) .

    summary(mydata) # min, 1st quartile, median, mean, 3rd quartile, max for each column

    fivenum(x) # Tukey min,lower-hinge, median,upper-hinge,max

    105

  • Histograms hist(x)

    x plotting vector

    freq=FALSE option breaks= option bin

    Histogram .

    #

    hist(mtcars$mpg)

    # .

    hist(mtcars$mpg, breaks=12, col="red")

    106

  • Plot

    (Kernel Density) Plots plot(density(x)) , x vector.

    # Kernel Density Plot

    d
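    The density example above is truncated; a minimal completion:

    d <- density(mtcars$mpg)           # kernel density estimate of mpg
    plot(d, main = "Kernel Density of MPG")
    polygon(d, col = "red", border = "blue")   # optional fill under the curve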

  • Kernel Density Group sm package sm.density.compare(x, factor)

    where x is a numeric vector and factor is the grouping variable; superimposes the kernel density plots of two or

    more groups.

    # MPG (cars with 4,6, or 8 cylinders) library(sm)

    attach(mtcars)

    # value label (factor . cyl=4,6,8 numeric ) cyl.f
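    The cyl.f factor above is truncated; a sketch along the same lines (requires the sm package):

    library(sm)
    attach(mtcars)
    cyl.f <- factor(cyl, levels = c(4, 6, 8),
                    labels = c("4 cyl", "6 cyl", "8 cyl"))
    sm.density.compare(mpg, cyl.f, xlab = "Miles Per Gallon")
    title(main = "MPG Distribution by Cylinder Count")
    legend("topright", levels(cyl.f), fill = 2:4)
    detach(mtcars)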

  • 109

  • (contingency table)

    table( )

    prop.table( )

    margin.table( ) marginal

    2-way contingency table (2 ) ;

    110

  • cor( )

    cov( )

    : cor(x, use=, method= )

    Option x Matrix data frame

    use . Options: all.obs ( ), complete.obs (listwise deletion), pairwise.complete.obs (pairwise deletion)

    method Options: pearson, spearman, kendall.

    111

  • # mtcars /. listwise deletion

    cor(mtcars, use="complete.obs", method="kendall")

    cov(mtcars, use="complete.obs")

    cor.test( ) tests a single correlation coefficient; neither cor( ) nor cov( ) produces tests of significance.

    The rcorr( ) function in the Hmisc package computes pearson and spearman correlations/covariances with significance levels for a matrix, using pairwise deletion.

    #

    library(Hmisc)

    rcorr(x, type="pearson") # pearson spearman

    rcorr(as.matrix(mtcars)) # mtcars data frame

    112

  • cor(X, Y) rcorr(X, Y) column X column Y

    # mtcars Correlation matrix

    # rows: mpg, cyl, disp

    # columns:hp, drat, wt

    x
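    The x object above is cut off; a completion consistent with the comment lines:

    x <- mtcars[, c("mpg", "cyl", "disp")]
    y <- mtcars[, c("hp", "drat", "wt")]
    cor(x, y)    # rows mpg/cyl/disp, columns hp/drat/wt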

  • (x) (y)

    Simple linear regression (one predictor): Yi = β0 + β1 xi + εi

    (least squares method)

    , ,

    data(women)

    women

    fit
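    The women example above is truncated; a minimal sketch of the least-squares fit:

    data(women)                        # height (in) and weight (lb) of 15 women
    fit <- lm(weight ~ height, data = women)
    summary(fit)                       # slope, intercept, R-squared
    plot(women$height, women$weight)
    abline(fit)                        # add the fitted line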

  • 115

  • I.

    II.

    III. Taxonomy

    IV.

    V.

    VI.Underfitting Overfitting

    VII.Data Exploration

    116

  • Data Mining

    Predictive Analysis

    Data Analysis

    Data Science

    OLAP

    BI

    Analytics

    Text Mining

    SNA (Social Network Analysis)

    Modeling

    Prediction

    Machine Learning

    Statistical/Mathematical Analysis

    KDD (Knowledge Discovery)

    Decision Support System

    Simulation

    () (Data Analysis), (Data Mining)

    117

  • Data Preparation

    Data Exploration

    Modeling ( )

    Evaluation

    Deployment

    118

  • CRISP-DM

    Cross-Industry Standard Process for DM

    119

  • 120

  • 121

  • 122

  • 123

  • Dataset

    124

  • Taxonomy

    125

  • (Univariate)

    Table

    Barplot

    Pie chart

    Dot chart

    Factor

    Stem-and-leaf plot

    Strip chart

    , ,

    Variation: Variance, , IQR

    Histogram

    Mode, Symmetry, Skew

    Boxplot

    126

  • 2 (bivariate)

    2-way Table (summarized/unsummarized)

    Marginal distribution

    2-way Contingency table

    Boxplot

    Densityplot

    Strip chart

    Q-Q (quantile-quantile) plot

    Scatterplot

    2 (correlation)

    127

  • (multivariate)

    R data frame list

    Boxplot xtabs()

    split() stack()

    Lattice

    128

  • Pearson

    Spearman Rank

    Kendal Rank

    129

  • Functions

    cor( ) function produces correlations

    cov( ) function to produces covariances.

    mtcars cor(mtcars, use="complete.obs", method="kendall")

    cov(mtcars, use="complete.obs")

    130

    Option

    x Matrix or data frame

    use missing . Options are: all.obs (assumes no missing data), complete.obs (listwise deletion), pairwise.complete.obs (pairwise deletion)

    method Correlation method. Options are: pearson, spearman, kendall.

  • , .

    Hmisc package rcorr( ) produces correlations/covariances and significance

    levels for pearson and spearman correlations. # Correlations with significance levels

    library(Hmisc)

    rcorr(x, type="pearson") # pearson/spearman

    rcorr(as.matrix(mtcars))

    cor(X, Y) or rcorr(X, Y) --> X, Y column correlation.

    # Correlation matrix from mtcars with mpg, cyl, and disp as rows

    # and hp, drat, and wt as columns

    x

  • 1

    2

    2

    2

    1

    2

    132

  • (Machine Learning)

    How do machines learn?

    Abstraction Knowledge Representation 133

  • (Supervised ML) We know the labels and the number of classes

    (Unsupervised ML) We do not know the labels and may not know the

    number of classes

    134

  • 135

  • Classification

    136

  • Underfitting Overfitting

    Underfitting

    137

  • Overfitting

    training data noise .

    138

  • Data Exploration

    - ()

    , () visualization

    : Box Plot, Histogram, PCA, charting(Pareto, MV, ...)

    139

  • R (1)

    140

  • I.

    II.

    III.

    141

  • I.

    :

    : KNN

    R coding : KNN

    142

  • (Classification)

    (training set ) Class

    Model as a function of the values of other attributes.

    KNN (K-Nearest Neighbors)

    Naïve Bayes

    Decision Tree

    Regression

    SVM (Support Vector Machine)

    143

  • 144

  • Classification Marketing

    (Target Marketing)

    Market Segmentation -->

    Fraud Detection

    /

    , , ...

    (Attrition/Churn)

    (model for loyalty)

    (Sky survey)

    (star or galaxy/ )

    145

  • = A flow-chart-like tree structure

    Leaf node

    = class label or class label distribution

    146

  • heuristic recursive partitioning.

    Root node

    target class feature tree branch

    divide-and-conquer the nodes

    () 3 : mainstream

    hit/critics choice/box-office bust

    : movie script pattern

    scatter plot

    films proposed shooting budget/the number of A-list celebrities for starring roles/the categories of success

    147

  • 148

  • 149

  • C5.0 decision tree algorithm

    On which feature should the best split be made?

    Purity is measured with entropy: entropy = 0 means a completely homogeneous set,

    entropy = 1 means maximum disorder (for two classes)

    Example: a set that is 60% red and 40% white; its entropy can be computed directly,

    and the curve() function can plot entropy over all possible two-class splits (see the sketch after this slide)

    150
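    A short sketch of the entropy calculation and the curve() plot mentioned above (two classes, 60% red / 40% white):

    -0.60 * log2(0.60) - 0.40 * log2(0.40)   # about 0.971

    # entropy of a two-class split as the proportion x of one class varies
    curve(-x * log2(x) - (1 - x) * log2(1 - x),
          col = "red", xlab = "x", ylab = "Entropy", lwd = 2)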

    How is the split point chosen? IG (Information Gain) = entropy before the split minus the (weighted) entropy after the split.

    The feature whose split produces the most homogeneous groups, i.e. the largest IG, is chosen.

    IF IG = 0: no reduction in entropy from the split.

    IG is maximal when the entropy after the split is 0, i.e. the split produces completely homogeneous groups.

    Pruning the decision tree

    A decision tree can otherwise continue to grow indefinitely and become overly specific (overfitting); this is controlled by pre-pruning or post-pruning.

    151

  • Best Split Node Impurity

    152

  • Tree Construction ()

    All the training examples are at the root.

    Tree Pruning () Data Noise branch

    Tree Induction ( ) Greedy Strategy

    Split the records based on an attribute test that optimizes certain criterion.

    Issues Determine how to split the records

    How to specify the attribute test conditions?

    How to determine the best split?

    Determine when to stop splitting

    153

  • Node Impurity

    Information Gain Entropy

    ID3

    Gain Ratio IG Splitinfo

    C4.5

    Gini Binary split

    CART

    154

    Entropy(S) = -p log2(p) - q log2(q), the impurity of S

    S: a set of examples

    p: proportion of positive examples

    q: proportion of negative examples

    Gain(T, X)

    = Entropy(T) - Entropy(T, X)

    155

  • Overfitting Prepruning

    Tree construction (threshold) goodness measure

    Postpruning

    "Fully grown" tree branch get a sequence of progressively pruned trees

    Training data "best pruned tree"

    156

  • 157

  • Information Gain Best Predictor?

    158

  • Decision Tree Root node

    159

  • Tree Rule

    160

  • Decision Tree Regression

    161

  • Entropy vs.

    162

  • R

    C5.0

    163
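    A minimal sketch of fitting a C5.0 tree in R (assumes the C50 package and hypothetical train/test data frames whose first column is the class label):

    library(C50)
    model <- C5.0(x = train[, -1], y = train$class)   # features vs. class factor
    summary(model)                                    # tree structure and rules
    pred  <- predict(model, test)                     # predicted classes for new data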

  • KNN (K-Nearest Neighbors)

    KNN classifies unlabeled examples based on their similarity

    with examples in the training set

    Given an unlabeled example xu, find the k closest labeled examples

    in the training data set and assign xu to the class that appears most frequently within the k-subset

    k-NN only requires

    a value of k

    a set of labeled examples (training data)

    a measure of closeness (distance metric)

    The labeled examples form the training dataset;

    the unlabeled examples to be classified form the test dataset.

    164

  • KNN algorithm feature (feature space) 2-/3-/4-dimensional

    165

  • 166

    Choosing k is a balance between overfitting and underfitting.

    Large k: smooths out noisy data, but risks ignoring small but important patterns

    Small k: sensitive to noisy data, e.g. a single accidentally mislabeled item

    167

  • k 1-Nearest Neighbor

    3-Nearest Neighbor

    knn Min-max normalization

    Z-score standardization

  • Voronoi Diagram

    Decision surface formed by the training examples

    169

  • Rescaling

    Min-max

    Z-score 170
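    A sketch of the two rescaling approaches listed above:

    # min-max normalization to the [0, 1] range
    normalize <- function(x) (x - min(x)) / (max(x) - min(x))
    normalize(c(1, 2, 3, 4, 5))   # 0.00 0.25 0.50 0.75 1.00

    # z-score standardization (mean 0, sd 1); scale() does this per column
    scale(c(1, 2, 3, 4, 5))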

  • R (KNN):

    : /

    1

    2 /

    3

    4

    5

    10

    (Radius, Texture, Perimeter, Area, Smoothness, ...)

    171

  • R coding

    172
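    A minimal sketch of the k-NN step (assumes the class package and hypothetical rescaled train/test feature data frames plus label vectors):

    library(class)
    pred <- knn(train = train_features, test = test_features,
                cl = train_labels, k = 21)   # k near sqrt(n) is a common rule of thumb
    table(pred, test_labels)                 # simple confusion matrix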

  • II.

    Clustering

    : K-Means

    R : Clustering

    173

  • Cluster

    class object object object

    Clustering

    class grouping .

    Group (partition)

    Group label unsupervised ML group .

    actionable insight meaningful label !

    174

  • Clustering

    Scalability

    attribute (// )

    Attribute shape

    High dimensionality, Noisy data , Interpretability

    : ,

    , , pattern

    : taxonomy, ,

    175

  • : infer specialty by examining their research publications

    176

  • K Means

    The k-means algorithm for clustering

    Initial assignment phase

    Choose k initial cluster centers; from this initial guess the algorithm converges to a locally optimal clustering.

    With 3 clusters, k = 3

    Update phase

    Each center is shifted to the centroid of the examples currently assigned to it, and examples are re-assigned to the nearest center.

    177

  • 178

  • Output: (i) the cluster assignment of each example and

    (ii) the coordinates of the cluster centroids.

    Choosing k (the number of clusters) is again a balance between overfitting and underfitting

    It may come from a priori knowledge (a priori belief),

    be chosen randomly or from business requirements, or via the rule of thumb sqrt(n/2)

    For a large dataset, look for the elbow point in the within-cluster variance

    179
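    A minimal sketch of running k-means in R (k = 3 as in the example above; data assumed numeric and already rescaled):

    set.seed(123)                       # k-means starts from random centers
    km <- kmeans(iris[, 1:4], centers = 3)
    km$size                             # number of points per cluster
    km$centers                          # coordinates of the cluster centroids
    table(km$cluster, iris$Species)     # compare clusters with known labels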

  • Clustering

    Partitioning Method

    partition (k) (partitioning) , iteratively relocate (Object )

    Density-based Method

    (threshold) cluster

    , for each data point within a given cluster, the radius of a given cluster has to contain at least a minimum number of points.

    180

  • Hierarchical Method Agglomerative approach

    bottom-up Group merging, until termination condition holds.

    Divisive approach

    top-down cluster split, until termination condition holds.

    181

  • III.

    Apriori

    R coding

    182

  • .

    Example rule: Bread => Milk [sup = 5%, conf = 100%]

    Each transaction t (a set of items) is a subset of I.

    I = {i1, i2, ..., im}: a set of items

    Transaction database T = {t1, t2, ..., tn}

    Example market-basket transactions: t1: {bread, cheese, milk}

    t2: {apple, eggs, salt, yogurt}

    tn: {biscuit, eggs, milk}

    183

  • Rule evaluation measures

    1. Support: the probability Pr(A ∪ B) that a transaction contains both A and B

    Support = (number of transactions containing both A and B) / (total number of transactions)

    (symmetric: 'A => B' and 'B => A' have the same support)

    2. Confidence: the conditional probability Pr(B | A) of B given A

    Confidence = (number of transactions containing both A and B) / (number of transactions containing A)

    ('A => B' and 'B => A' generally have different confidence)

    3. Lift (improvement)

    Lift = support(A and B) / (support(A) x support(B))

    Lift greater than 1 indicates positive association, equal to 1 independence, and less than 1 negative association.

    184

  • In formula form, for a rule X => Y over n transactions:

    support = count(X ∪ Y) / n

    confidence = count(X ∪ Y) / count(X)

    Support measures how often the rule applies; confidence measures how reliable it is.

    185

  • Find all rules that satisfy the user-specified minimum

    support (minsup) and minimum confidence (minconf).

    Features Completeness: find all rules.

    No target item(s) on the right-hand-side

    Mining with data on hard disk (not in memory)

    !

    the Apriori Algorithm

    186

  • Apriori

    Two steps: (Step 1) Find all itemsets that satisfy minimum support

    (these frequent itemsets are also called large itemsets).

    (2 ) Use frequent itemsets to generate rules.

    : (frequent itemset) {Chicken, Clothes, Milk} [sup = 3/7]

    and one rule from the frequent itemset: Clothes => Milk, Chicken [sup = 3/7, conf = 3/3]

    187

  • 1: Mining (frequent itemset)

    = an itemset whose support is >= minsup.

    :

    apriori property (downward closure property): any subsets of a frequent itemset are also frequent itemsets

    188

    (Itemset lattice over items A, B, C, D: pairs AB AC AD BC BD CD, triples ABC ABD ACD BCD)

  • 2: rule Frequent itemsets association rules

    One more step is needed to generate association rules

    For each frequent itemset X,

    For each proper nonempty subset A of X,

    Let B = X - A

    A => B is an association rule if confidence(A => B) >= minconf, where

    support(A => B) = support(A ∪ B) = support(X) and confidence(A => B) = support(A ∪ B) / support(A)

    189

  • R coding

    190
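    A minimal sketch of the two-step Apriori run in R (assumes the arules package and a hypothetical transactions file):

    library(arules)
    trans <- read.transactions("market_basket.csv", sep = ",")   # hypothetical file
    rules <- apriori(trans,
                     parameter = list(support = 0.05, confidence = 0.5))
    inspect(sort(rules, by = "lift")[1:5])   # top rules by lift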

  • R (2) CLICKSTREAM PROFILING

    191

  • Clickstream

    Clickstream Data Warehouse by Mark Sweiger

    Schema

  • Identify who you are from where you go

    Click path

    Web log

    Page Tagging

    : Google Analytics

    Internet Traffic

    Google Analytics. Yandex, Kontagent

    Crowd-sourcing

  • Quiz: Which one is a more frustrated?

  • (Path Analysis)

    Choice model of Browsing

    Text

    Markov

  • Path Analysis

  • Page

    User Session

  • Probability of Viewing a Page

    Transition Matrix

  • Predicting Purchase Conversion

  • Profiling:

    Data Vectorization!!!

  • Clustering, SVM,

  • Page Contents = HTML Code + Regular Text

  • Tokenization & Lexical Parsing

    HTML code,

    , Stop word ,

    term frequency (TF)

    Result: Document Vector

  • Classifying Document Vector

  • Markov Chain

    : Auto Insurance Risk :

    low risk or high risk - 12

    :

    a high-risk driver has a 60% chance of remaining high risk

    and a 40% chance of becoming low risk.

    a low-risk driver has a 15% chance of becoming high risk

    and an 85% chance of remaining low risk.

    Task:

    Set up a probability tree, transition diagram, and transition matrix to our process.

  • The transition matrix summarizes movement between states over time (steps, sequences, trials, etc.)

    It carries the same information as the probability tree.

    Each row is a current state and each column a next state;

    the states are mutually exclusive.

    Task

    Find the probability of being in any given state many steps into the process.
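    A sketch of the transition matrix for the insurance example above, and how to step the process forward in R:

    P <- matrix(c(0.60, 0.40,     # from high risk: 60% high, 40% low
                  0.15, 0.85),    # from low risk: 15% high, 85% low
                nrow = 2, byrow = TRUE,
                dimnames = list(c("high", "low"), c("high", "low")))
    s0 <- c(1, 0)    # start in the high-risk state
    s1 <- s0 %*% P   # state distribution after one period
    s2 <- s1 %*% P   # after two periods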

  • Random guessing: 7%

    Text Classification: 25%

    + Domain Model 41%

    + Browsing Model 78%

    Source: http://www.andrew.cmu.edu/user/alm3/

    http://www.andrew.cmu.edu/user/alm3/

  • Profiling

    ()

    213

  • (churn analysis) (app)

  • 215

  • Churn Rate

    Most of the Apps Lose Half of their Peak Users within 3 Months

  • churn analysis

  • Business Objective: Reduce Customer Churn

    Solution #1 .

    Solution #2 .

    Action Plans

    App (eg. Gaming App, Social App)

    List down the activities that users perform on your app

    core feature

    average life-time

  • Churn Criteria

    Cut-off date

    = app ~

    : A app 2014531 inactive Cut-off Date 40 days

    data points: app activities

    app

    app

    core feature

  • 1

    2

    3

    (Preprocessing)

  • Variable (Feature)

  • Google Analytics R

    Image source: Google Analytics Core Reporting API Dev Guide

  • app

  • , Classification Problem

    Logistic Regression

    Predictor(dependent) variable will be unique key(Visitor ID) for each visitors

    Predicted label would be

    1 : Visitor will churn vs. 0 : Visitor would not churn

  • Process

    Random Train Test

    Train Data-set

    Test Data-set

    Test Data

  • (Accuracy)

    Confusion Matrix

    Accuracy

    = (No of Correctly Predicted Labels) / Total No of Labels

    = (620 + 1024)/ (620 + 4 + 7 + 1024)

    ~ 99.34 %
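    A minimal sketch of the logistic-regression step described above (hypothetical data frame churn_data with a 0/1 column churned and feature columns):

    idx   <- sample(nrow(churn_data), 0.7 * nrow(churn_data))   # random train/test split
    train <- churn_data[idx, ]
    test  <- churn_data[-idx, ]

    fit  <- glm(churned ~ ., data = train, family = binomial)
    prob <- predict(fit, newdata = test, type = "response")
    pred <- ifelse(prob > 0.5, 1, 0)

    cm <- table(Predicted = pred, Actual = test$churned)   # confusion matrix
    sum(diag(cm)) / sum(cm)                                # accuracy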

  • User Segmentation

  • User types

  • (Market Segmentation using k-means)

  • Market Matching

    Segmentation

    dissecting the marketplace into submarkets that require different marketing mixes

    Targeting

    Process of reviewing market segments and deciding which one(s) to pursue

    Positioning

    Establishing a differentiating image for a product or service in relation to its competition

    230

  • SNS 10

    231

    Segmentation

    Geographic

    Demographic Psychographic

    Behavioral Geodemographic

  • R coding

    232

  • R (3)

    233

  • ?

    Non-Math/Stats Model

    Representation of Some Phenomenon

    Math/Stats Model

    Describe Relationship between Variables

    (Deterministic) Models (no randomness)

    (Probabilistic) (with randomness)

    234

  • Deterministic Models (no randomness)

    Hypothesize Exact Relationships

    Prediction Error

    : (Body mass index: BMI)

    BMI = Weight in Kilograms/ (Height in Meters)2

    (with randomness) Hypothesize 2 Components

    Deterministic

    Random Error

    : (Systolic blood pressure)

    SBP = 6 x age(d) + random error

    , Random Error (: Birthweight)

    235

  • Probabilistic models

    Regression Models

    Correlation models

    Other models

    236

  • (Regression) ( ) ()

    Use equation to set up relationship

    Numerical Dependent (Response) Variable

    1 or More Numerical or Categorical Independent (Explanatory) Variables

    1. Hypothesize Deterministic Component

    Estimate Unknown Parameters

    2. Random Error Term

    Estimate Standard Deviation of Error

    3. Fitted Model

    4. Use Model for Prediction & Estimation

    237

  • Specifying the deterministic component

    1.

    2. (Hypothesize Nature of Relationship)

    Expected Effects (i.e., Coefficients Signs)

    Functional Form (Linear or Non-Linear)

    Interactions

    1. (: Epidemiology)

    2.

    3. (Previous Research)

    4. Common Sense

    238

  • : Which model is more logical?

    239

    Years since seroconversion

    CD+ counts

    CD+ counts

    Years since seroconversion

    Years since seroconversion

    Years since seroconversion

    CD+ counts

    CD+ counts

  • Regression

    240

    Regression

    (Simple)

    (Multiple)

    2 1

  • Linear Equation

    241

  • R coding

    (Simple Linear Regression)

    lm()

    : coef()

    (fitted value): fitted()

    (residuals): residuals()

    : confint()

    predict()

    predict.glm(), predict.lm(), predict.nls()

    summary()

    F

    ANOVA 242
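    A compact sketch tying the functions above together (using the built-in cars data):

    fit <- lm(dist ~ speed, data = cars)
    coef(fit)             # estimated coefficients
    head(fitted(fit))     # fitted values
    head(residuals(fit))  # residuals
    confint(fit)          # confidence intervals for the coefficients
    predict(fit, newdata = data.frame(speed = c(10, 20)))
    summary(fit)          # t-tests, F statistic, R-squared
    anova(fit)            # ANOVA table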

  • (Multiple Linear Regression)

    n : I()

    (outlier)

    243

  • R class ts

    frequency=7: a weekly series

    frequency=12: a monthly series

    frequency=4: a quarterly series

  • Time Series Decomposition

    4 : Trend component: long term trend

    Seasonal component: seasonal variation

    Cyclical component: repeated but non-periodic fluctuations

    Irregular component: the residuals

    : AirPassengers : plot(AirPassengers)

    apts
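    The apts example above is cut off; a minimal completion:

    apts <- ts(AirPassengers, frequency = 12)   # monthly series
    f <- decompose(apts)                        # trend + seasonal + random
    plot(f)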

  • Popular models

    Autoregressive moving average (ARMA)

    Autoregressive integrated moving average (ARIMA)

    # build an ARIMA model

    fit
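    The fit line above is truncated; a minimal sketch of a seasonal ARIMA fit and forecast (the order is chosen only for illustration):

    fit  <- arima(AirPassengers, order = c(1, 0, 0),
                  seasonal = list(order = c(0, 1, 1), period = 12))
    fore <- predict(fit, n.ahead = 24)          # two years ahead
    ts.plot(AirPassengers, fore$pred, lty = c(1, 2))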

  • 247

  • R

    R / R

    R

    R

    248

  • R

    Up to ~1 million records: plain R

    1 million ~ 1 billion records: R with tuning

    >= 1 billion records: MapReduce

    (: 10 K record hierarchical clustering 50 M )

    Workarounds when R objects exceed memory: sampling, H/W upgrade (64-bit, up to 8 TB RAM), rewriting hot spots in a compiled language (C/C++), or Parallel R

    249

  • R MapReduce -

    An R script can serve as the map step of a Hadoop streaming job;

    joins, aggregation and sorting are done in the reduce step. $ export HADOOP_HOME=/usr/lib/hadoop

    $ ${HADOOP_HOME}/bin/hadoop fs -rmr output

    $ ${HADOOP_HOME}/bin/hadoop fs

    -put test-data/stocks.txt stocks.txt

    $ ${HADOOP_HOME}/bin/hadoop \

    jar ${HADOOP_HOME}/contrib/streaming/*.jar \

    -D mapreduce.job.reduces=0 \

    -inputformat org.apache.hadoop.mapred.TextInputFormat \

    -input stocks.txt \

    -output output \

    -mapper `pwd`/src/main/test/stock_day_avg.R \

    -file `pwd`/src/main/test/stock_day_avg.R 250

  • R Map Reduce R

    ()

    R .

    $ cat test-data/stocks.txt | src/main/test/stock_day_avg.R |

    sort --key 1,1 | src/main/test/stock_cma.R

    Hadoop job .

    251

  • R

    Visualization

    252

  • R ()

    Shiny R framework

    R Application

    http://shiny.rstudio.com/ ( http://shiny.rstudio.com/gallery/ )

    , Professional ()

    253
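    A minimal single-file Shiny sketch (the input name and plot are illustrative):

    library(shiny)
    ui <- fluidPage(
      sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
      plotOutput("hist")
    )
    server <- function(input, output) {
      output$hist <- renderPlot(hist(rnorm(input$n)))
    }
    shinyApp(ui, server)   # launches the app in a browser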

  • !!

    Data as a Strategic Value

    Data Science

    , , ,

    254

  • 255