R ()
2014.7.10
v.3
mailto:[email protected]
R ,
- /
R (1) -
, ,
R (2) Clickstream Profiling
R (3)
&
2
3
!
!
(Thought Experiment)
http://www.openwith.net/?page_id=766
4
5
Prelude:
Berlin on Tolstoy
Isaiah Berlin analyzed Tolstoy in a famous 1953 essay, "The Hedgehog and the Fox", built on a fragment from Archilochus
What was Tolstoy?
Many talents, like a fox
Believed we should be hedgehogs!
Fox compared against Hedgehog
"The fox knows many little things;
the hedgehog knows one big thing."
8
(1)
Plato Aristotle
Michelangelo Da Vinci
Marx, Churchill, Hitler ???
9
(2)
Laozi Sima Qian
Mao Zhou, Deng,
10
The Fox and the hedgehog in a project life
vs.
hedgehog risk
Systems Thinking: A Foxy Approach
OODA: A fox dressed like a hedgehog
11
Devops
:
++ ~
Cross-functional team,
Widely-shared metrics,
Automating repetitive tasks,
Post-mortem, Regular release,
+ +
13
I.
II.BI (Business Intelligence)
III. Hadoop
IV.Data Science
14
(/, /, /, )
: Excel
DBMS/DW
RDBMS/SQL, NoSQL, ...
ETL, CDC,
15
BI
BI
16
BI BSC Balanced Scorecard. .
VBM Value-based Management. .
ABC Activity Based Costing. .
BI OLAP On-line Analytical Processing.
ERP,CRM ERP, CRM, SCM BI
BI
ETL Extraction-Translation-Loading.
DW Data Warehouse. (repository)
BI Portal .
Hadoop
Tidal Wave 3VC
Supercomputer High-throughput computing
2 :
, (grid computing)
(MPP)
Scale-Up vs. Scale-Out
BI (Business Intelligence) DW/OLAP/
17
Hadoop
Google!
Nutch/Lucene 2006
(Flat linearity)
18
1990s: Excite, AltaVista, Yahoo, ...
2000 Google ; PageRank,
GFS/MapReduce
2003~4 Google Papers
2005 Hadoop (D. Cutting & Cafarella)
2006 Apache
Google Papers (2003~2010)
Percolator: Handling individual updates
Dremel: Online visualizations
Google File System: a distributed file system
MapReduce : to compute their search indices
Pregel: Scalable graph computing
19
Big Picture
20
Framework
21
Hadoop Ecosystems
22
Major Influencers
Open Source Tipping Point
Before Google vs. After Google
Data Science , ,
OR (Operations Research)
////...
(Statistical Inference), (Parametric),
, , , /Expert System ()
Data Science : HPC + Google shock
, Graph , topology,
AI (ANN, SVM, )
, Semantic web
Fusion of Python or R?, DS + Cloud BDaaS,
24
R
25
I.
II.R
III.R
IV.
V.
26
I. R
27
R
+ S . () + packages
/ (DSL :Domain Specific Language) (Windows, Unix, MacOS). .
. (Functional Programming)
, (loop) ,
Script , Interpreter (OOP)
Generic Polymorphic. object
I-1 28
II. R
29
R
R (Workspace)
(Assignment)
Batch
30
R
R CRAN (Comprehensive R Archive Network)
http://www.cran.r-project.org/
31
RStudio
GUI R
RStudio
R Commander
Rstudio
32
(comment)
#
Examples:
help.start()       # open the HTML help pages
help(seq)          # help on function seq
?seq               # shortcut for help(seq)
RSiteSearch("lm")  # search help pages and mailing lists
History
history()                    # show the last 25 commands
savehistory(file="myfile")   # save the command history (default file is ".Rhistory")
loadhistory(file="myfile")   # recall a saved history
33
Examples: rnorm(10)
mean(abs(rnorm(100))); hist(rnorm(10))
R ships with built-in datasets:
data()             # list the datasets in loaded packages
help(datasetname)  # describe a dataset
Session options:
options()          # view current options
getwd()            # show the working directory
dir.create("d:/Rtraining"); setwd("d:/Rtraining")  # note: use / instead of \
34
source( )
Runs a script file in the current session: source("myfile.R")  # script files end in .R or .r
lm(mpg~wt, data=mtcars)
prints its result; assigning it to an object (e.g. fit) stores it instead.
sink( ) redirects output to a file:
sink("myfile", append=FALSE, split=FALSE)  # start redirecting
sink()                                     # restore output to the terminal
The append option: append=FALSE overwrites the file, append=TRUE appends to it.
The split option: split=TRUE sends output to both the file and the screen.
# example: overwrite, output to file only
sink("c:/projects/output.txt")
# example: append, output to both file and screen
sink("myfile.txt", append=TRUE, split=TRUE)
For graphics, open a device instead, e.g. pdf("mygraph.pdf")
36
(Package)
Package
= R , .
Packages ( ). install.packages(package)
CRAN Mirror . (e.g. Korea)
session load (session ) library(package)
Library
Package load library() # library packages
search() # load packages
Package help(package=)
37
Customization
R Rprofile.site . MS Windows: C:\Program Files\R\R-n.n.n\etc directory.
Rprofile .
Rprofile.site
>
Rprofile.site recognizes two special functions:
.First( ) runs at the start of each R session
.Last( ) runs at the end of each R session
38
Batch
(non-interactively) MS Windows ( )
C:/Program Files/R/R-3.0.2/R.exe CMD BATCH C:/Rtraining/a.R
Linux R CMD BATCH [options] my_script.R [outfile]
> sqrt(-2)
[1] NaN
Warning message:
In sqrt(-2) : NaNs produced
> q()
39
III. R
40
R
(Operators)
(Merge)
apply()
41
(Assignment)
In R, assignment is written <- or =; an expression such as x+y is evaluated and its value printed.
> print(x+y)
> x = pi
> x
[1] 3.141593
> rm(x)
42
#
age
Import from: csv
mydata
R
Variable
Continuous (interval, ratio)
Ordinal
Nominal (categorical)
(identifier), (date)
Factor
45
(Mode) (numeric) -
(character) -
(logical) - TRUE, FALSE
FALSE 0, 0 TRUE
(: imaginary number)
Raw (byte)
R (data structure) Vector, matrix
Array, Data frame
List, Class 46
(dataset)
data vector c()
> Rev_2012 = c(110,105,120,140) # :
> Rev_2013 = c(105,115,140,135)
> Revenue = cbind(Rev_2012, Rev_2013) # column
> Revenue
Rev_2012 Rev_2013
[1,] 110 105
[2,] 105 115
[3,] 120 140
[4,] 140 135
>
47
R vectors (numerical, character, logical)
1
R scalar . (= vector)
(string) mode single-element vector
Matrices
2
Arrays
3
data frames
Column mode ( )
Lists
48
is.numeric() is.character()
is.vector() is.matrix()
is.data.frame()
~
(numeric)
as.numeric(): FALSE becomes 0; "1","2" become 1, 2
(logical) as.logical(): 0 becomes FALSE
as.character(): 1, 2 become "1","2"; FALSE becomes "FALSE"
Factor as.factor() (factor)
Vector as.vector()
Matrix as.matrix() Matrix
as.dataframe() 49
- Vector
mode
> x = c(1,3,5,7)
> x
[1] 1 3 5 7
> family = c("", "","","")
> family
[1] "" "" "" ""
> c(T,T,F,T)
[1] TRUE TRUE FALSE TRUE
50
Vector indexing: vector elements are accessed with [ ]
a[c(2,4)]  # elements 2 and 4
> new_a
[1] 1.0 5.3 6.0 -2.0 4.0
[1] 1.0 5.3 6.0 -2.0 4.0
vector : seq() sequence rep() - vector
Vector (Vectorized Operations) = vector element
: Vector In, Vector Out or Vector In, Matrix Out
51
Recycling: when two vectors differ in length, the shorter one is recycled, e.g. c(1,2) + c(5,6,7)
Filtering:
> z <- c(5, -1, 3, -2)
> w <- z[z > 0]
> w
[1] 5 3
subset()
NA Null NA ; (missing value)
Null : undefined value ( X)
52
: Vector
R matrix
vector %*%, +
cbind() rbind()
library(MASS) ginv()
t() (transpose)
: x y
53
Matrices
= row column vector mode( or )
column
(nrow=, ncol = )
mymatrix
cells
Matrix row column apply()
apply(m, dimcode, f, fargs) m = matrix,
dimcode = 1: row , 2: column ,
f= , fargs = optional argts
> m # row
> apply(m, 1, mean)
[1] 3.5 4.5 5.5 6.5 7.5
> # column
> apply(m, 2, mean)
[1] 3 8
> # 2
> apply(m, 1:2, function(x) x/2)
56
Matrices can be combined row- or column-wise with rbind() and cbind():
> B = matrix(c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)
> C = matrix(c(7, 4, 2), nrow=3, ncol=1)
> cbind(B, C)
57
A matrix is a vector with a dim attribute (matrix = vector + dimensions):
> z <- matrix(1:8, nrow=4)
> z
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
> length(z)
[1] 8
> class(z)
[1] "matrix"
> attributes(z)
$dim
[1] 4 2
58
Array
Arrays extend matrices beyond two dimensions. Example: a 4 x 3 x 3 array filled with 1~36:
> x <- array(1:36, dim=c(4,3,3))
> x[1,,]
[,1] [,2] [,3]
[1,] 1 13 25
[2,] 5 17 29
[3,] 9 21 33
> x[,,1]
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
59
List
(ordered collection of objects). (unrelated)
n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
x = list(n, s, b, 3) # x contains copies of n, s, b
x[2]
x[c(2, 4)]
[[]] . x[[3]]
60
List components are accessed by name or position:
> z$c       # component named c
> z[[4]]    # fourth component
> lapply(list(2:5, 35:39), median)  # returns a list
> sapply(list(2:5, 35:39), median)  # simplifies to a vector/matrix
61
Data Frame
= list special case
Column (, , factor ) .
d
list . vector
vector .
merge() joins two data frames:
total <- merge(dfA, dfB, by="ID")  # merge two data frames (dfA, dfB) by a shared ID column
Factor
(nominal or categorical) [ 1... k ] vector
factor() ordered() option .
x
Factor tapply()
Example: given a numeric vector ages and a factor party,
> tapply(ages, party, mean)
returns the mean age per party, e.g.
57 30 34
split()
split(x, f) partitions x by the factor f, e.g. for a grouping vector g: > split(1:7, g)
65
(contingency table)
2-way contingency table
Example:
> trial <- matrix(c(34, 11, 9, 32), ncol=2)
> colnames(trial) <- c("sick", "healthy")
> rownames(trial) <- c("risk", "no_risk")
> trial.table <- as.table(trial)
> trial.table
sick healthy
risk 34 9
no_risk 11 32
66
Dataset
ls() # objects
names(mydata) # mydata
str(mydata) # mydata
levels(mydata$v1) # mydata v1 factor level
dim(object) # object (dimensions)
class(object) # object (numeric, matrix, data frame, ) class
mydata # mydata
head(mydata, n=10) # mydata 10 row
tail(mydata, n=5) # mydata 5 row
67
(Operators)
Binary vector, matrix scalar .
Arithmetic Operators
+
-
*
/
^ or **
()
x %% y (x mod y) 5%%2 is 1
x %/% y
integer division 5%/%2 is 2
68
< less than
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
!x Not x
x | y x OR y
x & y x AND y
isTRUE(x) test if X is TRUE
69
substr(x, start=n1, stop=n2)
Extract or replace substrings in a character vector.
grep(pattern, x, ignore.case=FALSE, fixed=FALSE)
Search for pattern in x; fixed=FALSE treats pattern as a regular expression, fixed=TRUE as a literal string; returns the matching indices. grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
Replace the first match of pattern in x; fixed=FALSE for a regular expression, fixed=TRUE for a literal string. sub("\\s", ".", "Hello There") returns "Hello.There"
strsplit(x, split)
Split the elements of x. strsplit("abc", "") returns a 3-element vector, i.e. "a","b","c"
paste(..., sep="")
Concatenate strings, separated by sep
toupper(x)
Convert to upper case
tolower(x)
Convert to lower case
70
seq(from, to, by)  generates a sequence of indices
expr: a single statement, or a group of statements enclosed in { }
if-else if (cond) expr
if (cond) expr1 else expr2
for for (var in seq) expr
while while (cond) expr
switch switch(expr, ...)
ifelse ifelse(test,yes,no)
72
order( )
Sorts in ASCENDING order by default.
# Example: sorting mtcars
attach(mtcars)
# sort by mpg
newdata <- mtcars[order(mpg), ]
IV. R
74
R
R
plot()
Plots
(Dot) Plots
(Bar) Plots
(Line Charts)
(Pie Charts)
(Boxplots)
Scatter Plots
75
R
R
demo(graphics); > demo(persp)
plot(c(1,2,3),c(1,2,4))
Nile
mean(Nile)
sd(Nile)
hist(Nile)
76
plot()
plot( ) object (plot)
Generic density, data frame,
: plot(x,y, arguments)
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
plot() :
77
Option
type = : type="p" (points), type="l" (lines), type="b" (both), type="o" (overplotted), type="h" (histogram-like), type="s" (steps)
xlim =
ylim =
x y . xlim = c(1,10) xlim = range(x)
xlab =
ylab =
Labels for the x and y axes.
main = (main title).
sub = (subtitle).
bg=
bty= 78
pch
lty
Option
pch =
lty = line type: 1 (solid line), 2 (dashed), 3 (dotted), 4 (dot-dash)
col= : red,green,blue
mar = c(bottom, left, top, right) . c(5,4,4,2) + 0.1
asp = Apsect ratio (= y/x )
79
Example:
par(mfrow = c(2,2))  # mfrow: multiple plots per page
plot(x, y, type="b", main = "cosine", sub = "type = b")
plot(x, y, type="o", las = 1, bty = "u", sub = "type = o")
plot(x, y, type="h", bty = "7", sub = "type = h")
plot(x, y, type="s", bty = "n", sub = "type = s")
80
abline()
abline(a, b)    # line with intercept a, slope b
abline(h=y)     # horizontal line at y
abline(v=x)     # vertical line at x
abline(lm.obj)  # regression line from a fitted lm object
: data(cars)
attach(cars)
par(mfrow=c(2,2))
plot(speed, dist, pch=1); abline(v=15.4)
plot(speed, dist, pch=2); abline(h=43)
plot(speed, dist, pch=3); abline(-14,3)
plot(speed, dist, pch=8); abline(v=15.4); abline(h=43)
81
plotting
Plotting (Dot Plot) dotchart(x, labels=)
x vector, labels .
The groups= option groups x by a factor.
dotchart(mtcars$mpg, labels = row.names(mtcars), cex=.7,
  main="Gas Mileage for Car Models", xlab="Miles Per Gallon")
82
# Dotplot: , (: mpg, group), (by cylinder)
x
(Bar) Plots
barplot(height)
height vector matrix.
If height is a vector, its values determine the bar heights.
If height is a matrix AND the option beside=FALSE,
each bar is a stack of sub-bars given by the column values.
If height is a matrix AND beside=TRUE,
the column values are drawn side by side.
The option names.arg= supplies bar labels.
The option horiz=TRUE draws a horizontal barplot.
Often a bar plot shows a statistic (mean, median, sd, ...) per group:
aggregate( ) followed by barplot( )
84
# Simple Bar Plot
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", xlab="Number of Gears")
# Stacked Bar Plot
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
  xlab="Number of Gears", col=c("darkblue","red"), legend = rownames(counts))
# Grouped Bar Plot
barplot(counts, main="Car Distribution by Gears and VS",
  xlab="Number of Gears", col=c("darkblue","red"),
  legend = rownames(counts), beside=TRUE)
(Line Charts)
(Line Charts) lines(x, y, type=)
x y vector
type=
Type Description
p  points
l  lines
o  overplotted points and lines
b, c  points joined by lines ("c" omits the points)
s, S  stair steps
h  histogram-like vertical lines
n  no plotting
88
lines( ) adds lines to an existing plot created with plot(x, y).
Tip: plot( ) plots the (x,y) points; with the type="n" option it sets up the plotting axes and titles without drawing, so lines can be added afterwards.
: x
90
plot( ) type= options
x
pie(x, labels=)
x is a non-negative numeric vector giving the slice areas;
labels= names the slices.
# Simple Pie Chart
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries")
# Pie Chart with %
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct, "%")
pie(slices, labels = lbls, main="Pie Chart of Countries")
Box Plot
A box-and-whisker plot shows the minimum, maximum, median, Q1 and Q3 of a variable.
Boxplots are drawn with:
boxplot(x, data= )
x is a formula (e.g. y~group) and data= names the data frame;
horizontal=TRUE rotates the plot.
94
# MPG by number of cylinders
attach(mtcars)
boxplot(mpg~cyl, data=mtcars, main="Car Mileage Data",
  xlab="Number of Cylinders", ylab="Miles Per Gallon")
detach(mtcars)
95
(Scatterplots)
A scatterplot shows the relationship between two variables.
Scatterplots are drawn with plot(x, y), where x and y are numeric vectors.
plot(wt, mpg, main="Scatterplot Example",
  xlab="Car Weight", ylab="Miles Per Gallon", pch=19)
96
V. R
97
R
(, , )
()
Crosstabs
98
abs(x)
sqrt(x)
ceiling(x) ceiling(3.475) 4
floor(x) floor(3.475) 3
trunc(x) trunc(5.99) 5
round(x, digits=n) round(3.475, digits=2) 3.48
cos(x), sin(x), tan(x) acos(x), cosh(x), acosh(x)
log(x)
log10(x)
exp(x) e^x
factorial(x) factorial(5) 120
99
() (random sample) simulation
(d/p/q/r) +
d: (density)
p: (probability)
q: 4 (quantile)
r: (random number)
100
dnorm(x) (default m=0 sd=1)
pnorm(q) cumulative probability (area under the normal curve to the left of q)
qnorm(p) normal quantile , p percentile
rnorm(n, m=0,sd=1) n (random normal deviates)
dbinom(x, size, prob)
pbinom(q, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
(size = , prob = )
dpois(x, lambda)
ppois(q, lambda)
qpois(p, lambda)
rpois(n, lambda)
Poisson distribution (mean = variance = lambda) # e.g. probability of 0, 1, or 2 events when lambda=4: dpois(0:2, 4)
dunif(x, min=0, max=1)
punif(q, min=0, max=1)
qunif(p, min=0, max=1)
runif(n, min=0, max=1)
(uniform distribution) # 10 uniform random variates:
x <- runif(10)
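A few quick checks of the d/p/q/r naming convention, using the distributions listed above:

```r
pnorm(1.96)                       # ~0.975: area to the left of q = 1.96
qnorm(0.975)                      # ~1.96: the matching quantile
dbinom(2, size = 10, prob = 0.5)  # P(exactly 2 successes in 10 trials)
dpois(0:2, 4)                     # P(0), P(1), P(2) events when lambda = 4
```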
102
Most of these functions accept na.rm to drop missing values; the object argument is typically a vector.
103
mean(x, trim=0, na.rm=FALSE)
mean of object x
# trimmed mean, dropping the top and bottom 5%:
mx <- mean(x, trim=0.05, na.rm=TRUE)
(Descriptive Statistics)
= (summary statistics)
sapply( ) # mydata . ,
sapply(mydata, mean, na.rm=TRUE)
sapply :
mean, sd, var, min, max, median, range, and quantile.
(histogram, density plot, ) .
summary(mydata) # min, max, mean, median, 1st/3rd quartiles (and counts for factors)
fivenum(x) # Tukey min,lower-hinge, median,upper-hinge,max
105
Histograms hist(x)
x plotting vector
The freq=FALSE option plots probability densities instead of counts; the breaks= option controls the number of bins.
Histogram .
#
hist(mtcars$mpg)
# .
hist(mtcars$mpg, breaks=12, col="red")
106
Plot
(Kernel Density) Plots: plot(density(x)), where x is a numeric vector.
# Kernel Density Plot
d <- density(mtcars$mpg)
plot(d)
To compare kernel densities across groups, the sm package provides sm.density.compare(x, factor),
where x is a numeric vector and factor is the grouping variable; it superimposes the kernel density plots of two or more groups.
# Compare MPG distributions for cars with 4, 6, or 8 cylinders
library(sm)
attach(mtcars)
# create value labels (cyl is numeric: 4, 6, 8)
cyl.f <- factor(cyl, levels=c(4,6,8),
  labels=c("4 cylinder","6 cylinder","8 cylinder"))
sm.density.compare(mpg, cyl.f, xlab="Miles Per Gallon")
109
(contingency table)
table( )
prop.table( )
margin.table( ) marginal
2-way contingency table (2 ) ;
110
cor( )
cov( )
: cor(x, use=, method= )
Option x Matrix data frame
use . Options: all.obs ( ), complete.obs (listwise deletion), pairwise.complete.obs (pairwise deletion)
method Options: pearson, spearman, kendall.
111
# mtcars /. listwise deletion
cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")
cor.test( ) tests the significance of a correlation coefficient; neither cor( ) nor cov( ) produces tests of significance.
The rcorr( ) function in the Hmisc package computes Pearson and Spearman correlations/covariances with significance levels for a matrix, using pairwise deletion.
#
library(Hmisc)
rcorr(x, type="pearson") # pearson spearman
rcorr(as.matrix(mtcars)) # mtcars data frame
112
cor(X, Y) or rcorr(X, Y) correlates the columns of X with the columns of Y.
# mtcars correlation matrix
# rows: mpg, cyl, disp
# columns: hp, drat, wt
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)
Regression relates a response (y) to a predictor (x).
Simple linear regression fits a straight line: Yi = b0 + b1*xi + ei,
with the coefficients estimated by the least squares method.
Example:
data(women)
women
fit <- lm(weight ~ height, data=women)
summary(fit)
115
I.
II.
III. Taxonomy
IV.
V.
VI.Underfitting Overfitting
VII.Data Exploration
116
Data Mining
Predictive Analysis
Data Analysis
Data Science
OLAP
BI
Analytics
Text Mining
SNA (Social Network Analysis)
Modeling
Prediction
Machine Learning
Statistical/Mathematical Analysis
KDD (Knowledge Discovery)
Decision Support System
Simulation
() (Data Analysis), (Data Mining)
117
Data Preparation
Data Exploration
Modeling ( )
Evaluation
Deployment
118
CRISP-DM
Cross-Industry Standard Process for DM
119
120
121
122
123
Dataset
124
Taxonomy
125
(Univariate)
Table
Barplot
Pie chart
Dot chart
Factor
Stem-and-leaf plot
Strip chart
, ,
Variation: Variance, , IQR
Histogram
Mode, Symmetry, Skew
Boxplot
126
2 (bivariate)
2-way Table (summarized/unsummarized)
Marginal distribution
2-way Contingency table
Boxplot
Densityplot
Strip chart
Q-Q (quantile-quantile) plot
Scatterplot
2 (correlation)
127
(multivariate)
R data frame list
Boxplot xtabs()
split() stack()
Lattice
128
Pearson
Spearman Rank
Kendal Rank
129
Functions
cor( ) function produces correlations
cov( ) function to produces covariances.
mtcars cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")
130
Option
x Matrix or data frame
use missing . Options are: all.obs (assumes no missing data), complete.obs (listwise deletion), pairwise.complete.obs (pairwise deletion)
method Correlation type. Options are: pearson, spearman, kendall.
, .
Hmisc package rcorr( ) produces correlations/covariances and significance
levels for pearson and spearman correlations. # Correlations with significance levels
library(Hmisc)
rcorr(x, type="pearson") # pearson/spearman
rcorr(as.matrix(mtcars))
cor(X, Y) or rcorr(X, Y) --> correlation of each column of X with each column of Y.
# Correlation matrix from mtcars with mpg, cyl, and disp as rows
# and hp, drat, and wt as columns
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)
132
(Machine Learning)
How do machines learn?
Abstraction Knowledge Representation 133
(Supervised ML) We know the labels and the number of classes
(Unsupervised ML) We do not know the labels and may not know the
number of classes
134
135
Classification
136
Underfitting Overfitting
Underfitting
137
Overfitting
training data noise .
138
Data Exploration
- ()
, () visualization
: Box Plot, Histogram, PCA, charting(Pareto, MV, ...)
139
R (1)
140
I.
II.
III.
141
I.
:
: KNN
R coding : KNN
142
(Classification)
(training set ) Class
Model as a function of the values of other attributes.
KNN (K-Nearest Neighbors)
Naïve Bayes
Decision Tree
Regression
SVM (Support Vector Machine)
143
144
Classification Marketing
(Target Marketing)
Market Segmentation -->
Fraud Detection
/
, , ...
(Attrition/Churn)
(model for loyalty)
(Sky survey)
(star or galaxy/ )
145
= A flow-chart-like tree structure
Leaf node
= class label or class label distribution
146
Decision trees are built by heuristic recursive partitioning:
starting at the root node,
the feature most predictive of the target class is chosen and the tree branches on it,
dividing-and-conquering the nodes recursively.
Example: predicting a movie's success in 3 classes: mainstream
hit / critics' choice / box-office bust
based on patterns in movie script data:
a scatter plot of
films' proposed shooting budget vs. the number of A-list celebrities for starring roles, colored by the categories of success
147
148
149
C5.0 decision tree algorithm
Which feature gives the best split?
purity is measured with entropy: entropy = 0: completely homogeneous
entropy = 1: maximum disorder (for two classes)
Example: a segment with red (60%) and white (40%) has entropy -0.6*log2(0.6) - 0.4*log2(0.4) = 0.971
The curve() function can plot entropy across all class proportions
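The 60/40 entropy value and the curve() plot mentioned above, computed directly:

```r
# entropy of a two-class segment: 60% red, 40% white
p <- 0.6
entropy <- -p * log2(p) - (1 - p) * log2(1 - p)
entropy  # ~ 0.971

# entropy as a function of the class proportion x (peaks at x = 0.5)
curve(-x * log2(x) - (1 - x) * log2(1 - x),
      col = "red", xlab = "x", ylab = "entropy")
```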
150
How is the split point chosen? IG (Information Gain) = entropy before the split minus the (weighted) entropy after the split.
The feature whose split yields the most homogeneous groups has the highest IG.
IF IG = 0: no reduction in entropy from the split.
ELSE IF IG is maximal: the entropy after the split is 0, i.e. the split is completely homogeneous!
Pruning the decision tree
A decision tree can continue to grow indefinitely (becoming overly specific); pre-pruning or post-pruning counters this
151
Best Split Node Impurity
152
Tree Construction ()
All the training examples are at the root.
Tree Pruning: remove branches that reflect noise in the data
Tree Induction ( ) Greedy Strategy
Split the records based on an attribute test that optimizes certain criterion.
Issues Determine how to split the records
How to specify the attribute test conditions?
How to determine the best split?
Determine when to stop splitting
153
Node Impurity
Information Gain Entropy
ID3
Gain Ratio IG Splitinfo
C4.5
Gini Binary split
CART
154
Entropy measures the impurity of a set S:
S: a set of examples
p: proportion of positive examples
q: proportion of negative examples
Entropy(S) = -p*log2(p) - q*log2(q)
Gain(T,X) = Entropy(T) - Entropy(T,X)
155
To avoid overfitting:
Prepruning: halt tree construction early when a goodness measure falls below a threshold
Postpruning:
Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees,
then use data independent of the training data to pick the "best pruned tree"
156
157
Information Gain Best Predictor?
158
Decision Tree Root node
159
Tree Rule
160
Decision Tree Regression
161
Entropy vs.
162
R
C5.0
163
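The C5.0 algorithm is available in R through the C50 package; a minimal sketch on the built-in iris data (assuming the C50 package is installed):

```r
library(C50)  # C5.0 decision trees (install.packages("C50") if missing)

# grow a tree: features in x, class labels in y
model <- C5.0(x = iris[, 1:4], y = iris$Species)
summary(model)                   # tree structure and training error
predict(model, iris[1:3, 1:4])   # predicted classes for new rows
```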
KNN (K-Nearest Neighbors)
KNN classifies unlabeled examples based on their similarity
with examples in the training set:
given an unlabeled example xu, find the k closest labeled examples
in the training data set and assign xu to the class that appears most frequently within the k-subset
k-NNR only requires
k
A set of labeled examples (training data)
closeness
training dataset
, Unlabeled example test dataset .
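As a concrete illustration of those three ingredients (k, labeled training data, a closeness measure), a minimal run with knn() from the class package on the built-in iris data; the 100-row training sample is arbitrary:

```r
library(class)  # provides knn()

set.seed(42)
idx   <- sample(nrow(iris), 100)     # 100 labeled training examples
train <- iris[idx, 1:4]              # feature columns only
test  <- iris[-idx, 1:4]             # unlabeled examples to classify
cl    <- iris$Species[idx]           # labels of the training data

pred <- knn(train, test, cl, k = 5)  # majority vote among the 5 nearest
table(Predicted = pred, Actual = iris$Species[-idx])
```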
164
KNN algorithm feature (feature space) 2-/3-/4-dimensional
165
166
Choosing k is a balance between overfitting and underfitting.
Large k: reduces the impact of noisy data, but carries the risk of ignoring small but important patterns
Small k: noisy data carries more weight, e.g. a single accidentally mislabeled item can flip the result
167
k 1-Nearest Neighbor
3-Nearest Neighbor
Because knn relies on distances, features should be rescaled first:
Min-max normalization
Z-score standardization
Voronoi Diagram
Decision surface formed by the training examples
169
Rescaling
Min-max
Z-score
170
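A sketch of the two rescaling formulas as R functions (normalize and zscore are ad-hoc helper names; scale() is the built-in equivalent of the latter):

```r
# min-max normalization: rescales values to [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# z-score standardization: mean 0, sd 1
zscore <- function(x) (x - mean(x)) / sd(x)

normalize(c(1, 5, 9))  # 0.0 0.5 1.0
zscore(c(1, 5, 9))     # -1 0 1
```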
R (KNN):
: /
1
2 /
3
4
5
10
(Radius, Texture, Perimeter, Area, Smoothness, ...)
171
R coding
172
II.
Clustering
: K-Means
R : Clustering
173
Cluster
A cluster is a group of objects similar to each other and dissimilar to objects outside the group.
Clustering
groups data without predefined class labels:
partition the data into groups,
then label the groups; as unsupervised ML, clustering itself does not name the groups.
Turning clusters into actionable insight requires attaching meaningful labels!
174
Clustering
Scalability
attribute (// )
Attribute shape
High dimensionality, Noisy data , Interpretability
: ,
, , pattern
: taxonomy, ,
175
: infer specialty by examining their research publications
176
K Means
The k-means algorithm for clustering
initial assignment phase
k initial cluster centers are chosen; this initial guess means the algorithm converges to a locally optimal clustering.
Example: k=3 for 3 clusters
Update phase
Each center shifts to the centroid of the examples assigned to its cluster, and examples are reassigned to the nearest new center
177
178
Output: (i) the cluster assignment of each example;
(ii) the coordinates of the cluster centroids.
Choosing k (the number of clusters) balances overfitting vs. underfitting.
Use a priori knowledge (= a priori belief),
a random guess, business requirements, or the heuristic k = sqrt(n/2);
for a large dataset, look for the elbow point in a plot of within-cluster variance against k
179
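A common way to locate the elbow point, sketched on mtcars as a stand-in dataset (the final choice k = 3 is illustrative):

```r
set.seed(1)
data <- scale(mtcars)  # standardize features before clustering

# total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k)
  kmeans(data, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "k (number of clusters)", ylab = "total within-cluster SS")

fit <- kmeans(data, centers = 3, nstart = 25)  # e.g. pick k = 3 at the elbow
fit$centers  # (ii) coordinates of the cluster centroids
fit$cluster  # (i) cluster assignment of each example
```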
Clustering
Partitioning Method
Construct k partitions of the data, then iteratively relocate objects between partitions
Density-based Method
Grow clusters as long as the density in the neighborhood exceeds a threshold;
i.e., for each data point within a given cluster, a neighborhood of a given radius has to contain at least a minimum number of points.
180
Hierarchical Method
Agglomerative approach:
bottom-up group merging, until the termination condition holds.
Divisive approach:
top-down cluster splitting, until the termination condition holds.
181
III.
Apriori
R coding
182
.
Example rule: Bread => Milk [sup = 5%, conf = 100%]
Each transaction t (= a set of items) satisfies t ⊆ I.
I = {i1, i2, ..., im}: the set of all items.
Transaction database T = {t1, t2, ..., tn}.
: Market basket transactions: t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
tn: {biscuit, eggs, milk}
183
1. Rule support (Support)
The fraction of transactions containing both A and B, Pr(A ∪ B):
Support = (# of transactions containing both A and B) / (total # of transactions)
(the support of 'A=>B' equals that of 'B=>A')
2. Confidence
The probability of B given A, Pr(B|A):
Confidence = (# of transactions containing both A and B) / (# of transactions containing A)
(the confidence of 'A=>B' and 'B=>A' generally differ)
3. Lift (improvement)
Lift = Support(A and B) / (Support(A) x Support(B))
Lift > 1: positive association; lift = 1: independence; lift < 1: negative association.
184
Lift measures how much more often A and B occur together than expected under independence:
lift above 1 means A and B are positively associated;
lift below 1 means A and B are negatively associated.
185
support(X => Y) = count(X ∪ Y) / n
confidence(X => Y) = count(X ∪ Y) / count(X)
Find all rules that satisfy the user-specified minimum
support (minsup) and minimum confidence (minconf).
Features Completeness: find all rules.
No target item(s) on the right-hand-side
Mining with data on hard disk (not in memory)
!
the Apriori Algorithm
186
Apriori
: (1 ) minimum support itemset
(frequent itemsets large itemsets).
(2 ) Use frequent itemsets to generate rules.
: (frequent itemset) {Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset Clothes Milk, Chicken [sup = 3/7, conf = 3/3]
187
1: Mining (frequent itemset)
= an itemset whose support is minsup.
:
apriori property (downward closure property): any subsets of a frequent itemset are also frequent itemsets
188
(Itemset lattice over items A, B, C, D: pairs AB, AC, AD, BC, BD, CD; triples ABC, ABD, ACD, BCD)
2: rule Frequent itemsets association rules
One more step is needed to generate association rules
For each frequent itemset X,
For each proper nonempty subset A of X,
Let B = X - A
A B is an association rule if
Confidence(A B) minconf, support(AB) = support(A B) = support(X) confidence(AB) = support(A B) / support(A)
189
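Both phases can be run end-to-end with the arules package (assuming it is installed); the toy transactions repeat the market-basket example earlier, and the minsup/minconf thresholds are illustrative:

```r
library(arules)  # apriori() (install.packages("arules") if missing)

# toy transactions matching the market-basket example above
trans <- as(list(t1 = c("bread", "cheese", "milk"),
                 t2 = c("apple", "eggs", "salt", "yogurt"),
                 t3 = c("biscuit", "eggs", "milk")),
            "transactions")

# phase 1 and 2 in one call: frequent itemsets above minsup,
# then rules above minconf
rules <- apriori(trans, parameter = list(supp = 0.3, conf = 0.8))
inspect(rules)  # support, confidence and lift of each rule
```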
R coding
190
R (2) CLICKSTREAM PROFILING
191
Clickstream
Clickstream Data Warehouse by Mark Sweiger
Schema
Identify who you are from where you go
Click path
Web log
Page Tagging
: Google Analytics
Internet Traffic
Google Analytics. Yandex, Kontagent
Crowd-sourcing
Quiz: Which user is more frustrated?
(Path Analysis)
Choice model of Browsing
Text
Markov
Path Analysis
Page
User Session
Probability of Viewing a Page
Transition Matrix
Predicting Purchase Conversion
Profiling:
Data Vectorization!!!
Clustering, SVM,
Page Contents = HTML Code + Regular Text
Tokenization & Lexical Parsing
HTML code,
, Stop word ,
term frequency (TF)
Result: Document Vector
Classifying Document Vector
Markov Chain
: Auto Insurance Risk :
low risk or high risk - 12
:
a high risker has a 60% chance of staying high risk
a high risker has a 40% chance of becoming low risk.
a low risker has a 15% chance of becoming high risk
a low risker has an 85% chance of staying low risk.
Task:
Set up a probability tree, transition diagram, and transition matrix to our process.
In the transition matrix, rows index the current state and columns the next state; each step is one unit (time, sequence, trials, etc.)
In the probability tree,
branches leaving the same state
are mutually exclusive
and their probabilities sum to 1.
Task
Find the probability of being in any given state many steps into the process.
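The transition figures above can be encoded as a matrix and powered to answer the task (state order high/low; P2 holds the two-step probabilities):

```r
# one-step transition matrix: rows = current state, columns = next state
P <- matrix(c(0.60, 0.40,    # high -> high, high -> low
              0.15, 0.85),   # low  -> high, low  -> low
            nrow = 2, byrow = TRUE,
            dimnames = list(c("high", "low"), c("high", "low")))

P2 <- P %*% P       # two-step transition probabilities
P2["high", "low"]   # P(low two steps ahead | high now) = 0.58
```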
Random guessing: 7%
Text Classification: 25%
+ Domain Model 41%
+ Browsing Model 78%
Source: http://www.andrew.cmu.edu/user/alm3/
http://www.andrew.cmu.edu/user/alm3/
Profiling
()
213
(churn analysis) (app)
215
Churn Rate
Most of the Apps Lose Half of their Peak Users within 3 Months
churn analysis
Business Objective: Reduce Customer Churn
Solution #1 .
Solution #2 .
Action Plans
App (eg. Gaming App, Social App)
List down the activities that users perform on your app
core feature
average life-time
Churn Criteria
Cut-off date
= app ~
: A app 2014531 inactive Cut-off Date 40 days
data points: app activities
app
app
core feature
1
2
3
(Preprocessing)
Variable (Feature)
Google Analytics R
Image source: Google Analytics Core Reporting API Dev Guide
app
, Classification Problem
Logistic Regression
Each visitor is identified by a unique key (Visitor ID); app activities serve as the predictor variables
The predicted label would be
1 : Visitor will churn vs. 0 : Visitor would not churn
Process
Randomly split the data into Train and Test sets
Fit the model on the Train data-set
Predict on the Test data-set
Evaluate against the Test data
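A minimal sketch of this process with glm(); the churn_data frame and its columns (sessions, days_active, churned) are synthetic stand-ins, not the deck's actual dataset:

```r
# synthetic stand-in for the churn table: one row per visitor,
# 'churned' is the 0/1 label, the other columns are activity features
set.seed(1)
churn_data <- data.frame(sessions    = rpois(200, 5),
                         days_active = rpois(200, 10))
churn_data$churned <- rbinom(200, 1, plogis(2 - 0.3 * churn_data$days_active))

idx   <- sample(nrow(churn_data), 140)   # random 70/30 train-test split
train <- churn_data[idx, ]
test  <- churn_data[-idx, ]

fit  <- glm(churned ~ ., data = train, family = binomial)  # logistic regression
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)         # 1 = will churn, 0 = will not

table(Actual = test$churned, Predicted = pred)  # confusion matrix
```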
(Accuracy)
Confusion Matrix
Accuracy
= (No of Correctly Predicted Labels) / Total No of Labels
= (620 + 1024)/ (620 + 4 + 7 + 1024)
~ 99.34 %
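The accuracy arithmetic can be checked directly, using the counts from the confusion matrix above:

```r
# confusion-matrix counts from the slide
correct <- 620 + 1024            # correctly predicted labels
total   <- 620 + 4 + 7 + 1024    # all labels
accuracy <- correct / total
round(accuracy * 100, 2)  # 99.34
```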
User Segmentation
User types
(Market Segmentation using k-means)
Market Matching
Segmentation
dissecting the marketplace into submarkets that require different marketing mixes
Targeting
Process of reviewing market segments and deciding which one(s) to pursue
Positioning
Establishing a differentiating image for a product or service in relation to its competition
230
SNS 10
231
Segmentation
Geographic
Demographic Psychographic
Behavioral Geodemographic
R coding
232
R (3)
233
?
Non-Math/Stats Model
Representation of Some Phenomenon
Math/Stats Model
Describe Relationship between Variables
(Deterministic) Models (no randomness)
(Probabilistic) (with randomness)
234
Deterministic Models (no randomness)
Hypothesize Exact Relationships
Prediction Error
Example: Body Mass Index (BMI)
BMI = Weight in Kilograms / (Height in Meters)^2
(with randomness) Hypothesize 2 Components
Deterministic
Random Error
Example: systolic blood pressure (SBP)
SBP = 6 x age + random error
, Random Error (: Birthweight)
235
Probabilistic models
Regression Models
Correlation models
Other models
236
(Regression) ( ) ()
Use equation to set up relationship
Numerical Dependent (Response) Variable
1 or More Numerical or Categorical Independent (Explanatory) Variables
1. Hypothesize Deterministic Component
Estimate Unknown Parameters
2. Random Error Term
Estimate Standard Deviation of Error
3. Fitted Model
4. Use Model for Prediction & Estimation
237
Specifying the deterministic component
1.
2. (Hypothesize Nature of Relationship)
Expected Effects (i.e., Coefficients Signs)
Functional Form (Linear or Non-Linear)
Interactions
1. (: Epidemiology)
2.
3. (Previous Research)
4. Common Sense
238
: Which model is more logical?
239
Years since seroconversion
CD+ counts
CD+ counts
Years since seroconversion
Years since seroconversion
Years since seroconversion
CD+ counts
CD+ counts
Regression
240
Regression
(Simple)
(Multiple)
2 1
Linear Equation
241
R coding
(Simple Linear Regression)
lm()
: coef()
(fitted value): fitted()
(residual): residuals()
: confint()
predict()
predict.glm(), predict.lm(), predict.nls()
summary()
F
ANOVA 242
(Multiple Linear Regression)
n : I()
(outlier)
243
R class ts
frequency=7: a weekly series
frequency=12: a monthly series
frequency=4: a quarterly series
Time Series Decomposition
Four components: Trend component: long-term trend
Seasonal component: seasonal variation
Cyclical component: repeated but non-periodic fluctuations
Irregular component: the residuals
Example: AirPassengers: plot(AirPassengers)
apts <- ts(AirPassengers, frequency=12)
f <- decompose(apts)
plot(f)
Popular models
Autoregressive moving average (ARMA)
Autoregressive integrated moving average (ARIMA)
# build an ARIMA model (illustrative orders)
fit <- arima(AirPassengers, order=c(1,0,1),
  seasonal=list(order=c(0,1,1), period=12))
247
R
R / R
R
R
248
R
~1 million records: plain R; 1M ~ 1 billion: tuning; >= 1 billion records: MapReduce
(e.g. hierarchical clustering of 10K records already needs on the order of 50M pairwise distances)
When data outgrows R objects in memory: sampling; H/W upgrade (64-bit, up to 8TB RAM); rewriting hot spots outside the interpreter (in C/C++); parallelism: R --> Parallel R
249
R and MapReduce -
An R script can run as the map step of a Hadoop streaming job;
joins, aggregation and sorting happen in the reduce step.
$ export HADOOP_HOME=/usr/lib/hadoop
$ ${HADOOP_HOME}/bin/hadoop fs -rmr output
$ ${HADOOP_HOME}/bin/hadoop fs -put test-data/stocks.txt stocks.txt
$ ${HADOOP_HOME}/bin/hadoop \
  jar ${HADOOP_HOME}/contrib/streaming/*.jar \
  -D mapreduce.job.reduces=0 \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -input stocks.txt \
  -output output \
  -mapper `pwd`/src/main/test/stock_day_avg.R \
  -file `pwd`/src/main/test/stock_day_avg.R
250
Emulating Map-Reduce locally with R
(without a cluster)
The same pipeline can be run through Unix pipes:
$ cat test-data/stocks.txt | src/main/test/stock_day_avg.R | \
  sort --key 1,1 | src/main/test/stock_cma.R
This mirrors what the Hadoop job does.
251
R
Visualization
252
R ()
Shiny R framework
R Application
http://shiny.rstudio.com/ ( http://shiny.rstudio.com/gallery/ )
, Professional ()
253
!!
Data as a Strategic Value
Data Science
, , ,
254
255