(Big) Data Analysis Using R, 2014.7.10, 윤형기 ([email protected]), v.3

2014.7.10 ([email protected])„-이용한... · 순서 오전 도입 빅데이터 분석 개요 데이터와 데이터베이스 빅데이터와 빅데이터 분석 r 기초 개요,



  • (Big) Data Analysis Using R

    2014.7.10

    ([email protected])

    v.3

    mailto:[email protected]

  • R ,

    - /

    R (1) -

    , ,

    R (2) Clickstream Profiling

    R (3)

    &

    2

  • 3

  • !

    !

    (Thought Experiment)

    http://www.openwith.net/?page_id=766

    4


  • 5

  • Prelude:

  • Berlin on Tolstoy

    Analyzed Tolstoy in a popular essay in 1953 based on Archilochus

    What was Tolstoy?

    Many talents, like a fox

    Believed we should be hedgehogs!

  • Fox compared against Hedgehog

    The hedgehog knows one big thing... ,

    "The fox knows many little things... ,

    .

    8

  • (1)

    Plato Aristotle

    Michelangelo Da Vinci

    Marx, Churchill, Hitler ???

    9

  • (2)

    Laozi Sima Qian

    Mao Zhou, Deng,

    10

  • The Fox and the hedgehog in a project life

    vs.

    hedgehog risk

    Systems Thinking: A Foxy Approach

    OODA: A fox dressed like a hedgehog

    11

  • Devops

    :

    ++ ~

    Cross-functional team,

    Widely-shared metrics,

    Automating repetitive tasks,

    Post-mortem, Regular release,

    + +

  • 13

  • I.

    II.BI (Business Intelligence)

    III. Hadoop

    IV.Data Science

    14

  • (/, /, /, )

    : Excel

    DBMS/DW

    RDBMS/SQL, NoSQL, ...

    ETL, CDC,

    15

  • BI

    BI

    16

    BSC: Balanced Scorecard

    VBM: Value-Based Management

    ABC: Activity-Based Costing

    OLAP: On-Line Analytical Processing

    ERP, CRM, SCM (as sources for BI)

    ETL: Extraction-Transformation-Loading

    DW: Data Warehouse (repository)

    BI Portal

  • Hadoop

    Tidal Wave 3VC

    Supercomputer High-throughput computing

    2 :

    , (grid computing)

    (MPP)

    Scale-Up vs. Scale-Out

    BI (Business Intelligence) DW/OLAP/

    17

  • Hadoop

    Google!

    Nutch/Lucene 2006

    (Flat linearity)

    18

    1990s: Excite, AltaVista, Yahoo

    2000: Google; PageRank, GFS/MapReduce

    2003~4: Google papers

    2005: Hadoop (D. Cutting & Cafarella)

    2006: Apache

  • Google Papers (2003~2010)

    Percolator: Handling individual updates

    Dremel: Online visualizations

    Google File System: a distributed file system

    MapReduce : to compute their search indices

    Pregel: Scalable graph computing

    19

  • Big Picture

    20

  • Framework

    21

  • Hadoop Ecosystems

    22

  • Major Influencers

    Open Source Tipping Point

    Google

    Before Google vs. After Google

  • Data Science , ,

    OR (Operations Research)

    ////...

    (Statistical Inference), (Parametric),

    , , , /Expert System ()

    Data Science : HPC + Google shock

    , Graph , topology,

    AI (ANN, SVM, )

    , Semantic web

    Fusion of Python or R?, DS + Cloud BDaaS,

    24

  • R

    25

  • I.

    II.R

    III.R

    IV.

    V.

    26

  • I. R

    27

  • R

    + S . () + packages

    / (DSL :Domain Specific Language) (Windows, Unix, MacOS). .

    . (Functional Programming)

    , (loop) ,

    Script , Interpreter (OOP)

    Generic Polymorphic. object

    I-1 28

  • II. R

    29

  • R

    R (Workspace)

    (Assignment)

    Batch

    30

  • R

    R CRAN (Comprehensive R Archive Network)

    http://www.cran.r-project.org/

    31


  • RStudio

    GUI R

    RStudio

    R Commander

    Rstudio

    32

  • (comment)

    #

    Examples: help.start()   # open the HTML help pages
    help(seq)                # help for seq
    ?seq                     # shorthand for help(seq)
    RSiteSearch("lm")        # search help pages and mailing lists

    History: history()       # show the last 25 commands (default)

    savehistory(file="myfile")   # save command history (default file is ".Rhistory")

    loadhistory(file="myfile")   # recall a saved history

    33

  • : rnorm(10)

    mean(abs(rnorm(100))); hist(rnorm(10))

    R . dataset :

    data( ) # Load package .

    dataset : help(datasetname)

    Session option options() # option

    getwd() #

    dir.create("d:/Rtraining"); setwd("d:/Rtraining") # \ /

    getwd()

    34

  • source( )

    session script ( ) source("myfile.R") # script (.R .r)

    lm(mpg~wt, data=mtcars)

    .

    object fit
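    The fitted-model object above is truncated; a minimal sketch of the same idea (assuming the script is saved as myfile.R), using the built-in mtcars data:

    # contents of myfile.R, run with source("myfile.R")
    fit <- lm(mpg ~ wt, data = mtcars)   # regress mpg on weight
    summary(fit)                         # inspect the fitted object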

  • sink( ) - (redirect) sink("myfile", append=FALSE, split=FALSE) #

    sink() #

    append option override () or append

    split option . # : ( )

    sink("c:/projects/output.txt")

    # : ( , )

    sink("myfile.txt", append=TRUE, split=TRUE)

    Graphic output can be redirected in the same way, e.g. pdf("mygraph.pdf")

    36

  • (Package)

    Package

    = R , .

    Packages ( ). install.packages(package)

    CRAN Mirror . (e.g. Korea)

    session load (session ) library(package)

    Library

    Package load library() # library packages

    search() # load packages

    Package help(package=)

    37

  • Customization

    R Rprofile.site . MS Windows: C:\Program Files\R\R-n.n.n\etc directory.

    Rprofile .

    Rprofile.site

    >

    Rprofile.site 2 .First( ) R session

    .Last( ) R session

    38

  • Batch

    (non-interactively) MS Windows ( )

    C:/Program Files/R/R-3.0.2/R.exe CMD BATCH C:/Rtraining/a.R

    Linux R CMD BATCH [options] my_script.R [outfile]

    > sqrt(-2)

    [1] NaN

    Warning message:

    In sqrt(-2) : NaNs produced

    > q()

    39

  • III. R

    40

  • R

    (Operators)

    (Merge)

    apply()

    41

  • (Assignment)

    R =, x+y

    > print(x+y)

    > x=pi

    > x

    > rm(x)

    42

  • #

    age

  • Import from: csv

    mydata
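    The import example above is cut off; a minimal sketch, assuming a file named mydata.csv with a header row:

    mydata <- read.csv("mydata.csv", header = TRUE, stringsAsFactors = FALSE)
    str(mydata)   # check the column types after import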

  • R

    Variable

    Continuous (nominal, ratio)

    Ordinal

    Nominal (categorical)

    (identifier), (date)

    Factor

    45

    (Mode) numeric

    character (strings)

    logical: TRUE, FALSE

    FALSE maps to 0; any non-zero value is treated as TRUE

    complex (imaginary numbers)

    Raw (bytes)

    R (data structure) Vector, matrix

    Array, Data frame

    List, Class 46

  • (dataset)

    data vector c()

    > Rev_2012 = c(110,105,120,140) # :

    > Rev_2013 = c(105,115,140,135)

    > Revenue = cbind(Rev_2012, Rev_2013) # column

    > Revenue

    Rev_2012 Rev_2013

    [1,] 110 105

    [2,] 105 115

    [3,] 120 140

    [4,] 140 135

    >

    47

  • R vectors (numerical, character, logical)

    one-dimensional

    R has no separate scalar type: a scalar is simply a vector of length 1

    a character string is a single-element vector of mode character

    Matrices

    2

    Arrays

    3

    data frames

    Column mode ( )

    Lists

    48

  • is.numeric() is.character()

    is.vector() is.matrix()

    is.data.frame()

    ~

    as.numeric(): FALSE becomes 0; "1", "2" become 1, 2

    as.logical(): 0 becomes FALSE, non-zero becomes TRUE

    as.character(): 1, 2 become "1", "2"; FALSE becomes "FALSE"

    as.factor(): encode a vector as a factor

    as.vector(), as.matrix(), as.data.frame(): convert between data structures

    49

  • - Vector

    mode

    a x = c(1,3,5,7)

    > x

    [1] 1 3 5 7

    > family = c("", "","","")

    > family

    [1] "" "" "" ""

    > c(T,T,F,T)

    [1] TRUE TRUE FALSE TRUE

    50

  • Vector indexing Vector (elements) ([ ] )

    a[c(2,4)] # 2 4 > new_a new_a

    [1] 1.0 5.3 6.0 -2.0 4.0

    vector : seq() sequence rep() - vector

    Vector (Vectorized Operations) = vector element

    : Vector In, Vector Out or Vector In, Matrix Out

    51
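    A short sketch of the indexing and generation functions named above (the vector values are illustrative):

    a <- c(1.0, 5.3, 6.0, -2.0, 4.0)
    a[c(2, 4)]               # pick the 2nd and 4th elements
    seq(1, 10, by = 2)       # 1 3 5 7 9
    rep(c(0, 1), times = 3)  # 0 1 0 1 0 1
    sqrt(a[a > 0])           # vectorized: vector in, vector out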

  • Recycling : 2 vector c(1,2) + c(5,6,7)

    Filtering > z w 0]

    > w

    [1] 5 3

    subset()

    NA vs. NULL: NA is a placeholder for a missing value

    NULL is an undefined value that does not exist (it takes up no space in a vector)

    52

  • : Vector

    R matrix

    vector %*%, +

    cbind() rbind()

    library(MASS) ginv()

    t() (transpose)

    : x y

    53
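    A brief sketch of the matrix operations listed above (ginv() needs the MASS package):

    x <- matrix(1:4, nrow = 2)
    y <- matrix(c(2, 0, 0, 2), nrow = 2)
    x %*% y        # matrix multiplication
    x + y          # element-wise addition
    t(x)           # transpose
    library(MASS)
    ginv(x)        # generalized inverse
    rbind(x, y); cbind(x, y)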

  • Matrices

    = row column vector mode( or )

    column

    (nrow=, ncol = )

    mymatrix

  • cells

  • Matrix row column apply()

    apply(m, dimcode, f, fargs) m = matrix,

    dimcode = 1: row , 2: column ,

    f= , fargs = optional argts

    > m # row

    > apply(m, 1, mean)

    [1] 3.5 4.5 5.5 6.5 7.5

    > # column

    > apply(m, 2, mean)

    [1] 3 8

    > # 2

    > apply(m, 1:2, function(x) x/2)

    56
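    The matrix m used above is not shown; a definition consistent with the printed results (an assumption) would be:

    m <- matrix(1:10, nrow = 5)       # 5 x 2 matrix, columns 1:5 and 6:10
    apply(m, 1, mean)                 # row means: 3.5 4.5 5.5 6.5 7.5
    apply(m, 2, mean)                 # column means: 3 8
    apply(m, 1:2, function(x) x / 2)  # applied to every cell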

  • Matrix > x x x

    rbind(), cbind() > B = matrix(c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)

    > C = matrix(c(7, 4, 2), nrow=3, ncol=1)

    > cbind(B, C)

    57

  • Matrix vector Matrix vector + Matrix

    > z z

    [,1] [,2]

    [1,] 1 5

    [2,] 2 6

    [3,] 3 7

    [4,] 4 8

    > length(z)

    [1] 8

    > class(z)

    [1] "matrix"

    > attributes(z)

    $dim

    [1] 4 2

    58

  • Array

    Arrays extend matrices to more than 2 dimensions. Example consistent with the output below: a 4 x 3 x 3 array filled with 1~36

    > x <- array(1:36, dim = c(4, 3, 3))

    > x[1,,]

    [,1] [,2] [,3]

    [1,] 1 13 25

    [2,] 5 17 29

    [3,] 9 21 33

    > x[,,1]

    [,1] [,2] [,3]

    [1,] 1 5 9

    [2,] 2 6 10

    [3,] 3 7 11

    [4,] 4 8 12

    59

  • List

    (ordered collection of objects). (unrelated)

    n = c(2, 3, 5)

    s = c("aa", "bb", "cc", "dd", "ee")

    b = c(TRUE, FALSE, TRUE, FALSE, FALSE)

    x = list(n, s, b, 3) # x contains copies of n, s, b

    x[2]

    x[c(2, 4)]

    [[]] . x[[3]]

    60

  • List , > z z

    > z$c z

    > # .

    > z[[4]] lapply(list(2:5,35:39), median) # list

    > sapply(list(2:5, 35:39), median) # vector/matrix

    61
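    A compact sketch of the list operations referred to above (the component values are illustrative):

    z <- list(a = 1:3, b = "hello", c = c(TRUE, FALSE))
    z$b                               # access a component by name
    z[[1]]                            # access a component by position
    lapply(list(2:5, 35:39), median)  # returns a list: 3.5, 37
    sapply(list(2:5, 35:39), median)  # simplifies to a vector: 3.5 37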

  • Data Frame

    = list special case

    Column (, , factor ) .

    d

  • list . vector

    vector .

    merge 2 merge() # merge two data frames by ID

    total
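    The merge example above is cut off; a minimal sketch with hypothetical data frames sharing a key column ID:

    dataframeA <- data.frame(ID = 1:3, x = c(10, 20, 30))
    dataframeB <- data.frame(ID = 2:4, y = c("a", "b", "c"))
    total <- merge(dataframeA, dataframeB, by = "ID")   # rows matched on ID
    total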

  • Factor

    (nominal or categorical) [ 1... k ] vector

    factor() ordered() option .

    x

  • Factor tapply()

    Vector > ages party tapply(ages, party, mean)

    57 30 34

    split()

    split(x,f) x () > g split(1:7, g)

    65
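    A self-contained sketch of the grouping functions above (the ages and party labels are made-up values):

    ages  <- c(25, 26, 55, 37, 21, 42)
    party <- factor(c("R", "D", "D", "R", "U", "D"))
    tapply(ages, party, mean)   # mean age per party level

    g <- c("M", "F", "F", "I", "M", "M", "F")
    split(1:7, g)               # splits positions 1..7 by group label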

  • (contingency table)

    2-way contingency table

    : > trial colnames(trial) rownames(trial) trial.table trial.table

    sick healthy

    risk 34 9

    no_risk 11 32

    66
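    The construction of the table above is garbled; a reconstruction consistent with the printed 2 x 2 table (an assumption about the original code):

    trial <- matrix(c(34, 11, 9, 32), ncol = 2)
    colnames(trial) <- c("sick", "healthy")
    rownames(trial) <- c("risk", "no_risk")
    trial.table <- as.table(trial)
    trial.table
    prop.table(trial.table)     # cell proportions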

  • Dataset

    ls() # objects

    names(mydata) # mydata

    str(mydata) # mydata

    levels(mydata$v1) # mydata v1 factor level

    dim(object) # object (dimensions)

    class(object) # object (numeric, matrix, data frame, ) class

    mydata # mydata

    head(mydata, n=10) # mydata 10 row

    tail(mydata, n=5) # mydata 5 row

    67

  • (Operators)

    Binary vector, matrix scalar .

    Arithmetic Operators

    + (addition)

    - (subtraction)

    * (multiplication)

    / (division)

    ^ or ** (exponentiation)

    x %% y (x mod y) 5%%2 is 1

    x %/% y

    integer division 5%/%2 is 2

    68

    < less than

    <= less than or equal to

    > greater than

    >= greater than or equal to

    == exactly equal to

    != not equal to

    !x Not x

    x | y x OR y

    x & y x AND y

    isTRUE(x) test if X is TRUE

    69

  • substr(x, start=n1,

    stop=n2)

    vector substring

    grep(pattern, x ,

    ignore.case=FALSE,

    fixed=FALSE)

    Search for pattern in x. With fixed=FALSE the pattern is a regular expression; with fixed=TRUE it is matched literally. The indices of the matching elements are returned: grep("A", c("b","A","c"), fixed=TRUE) returns 2

    sub(pattern,

    replacement, x,

    ignore.case =FALSE,

    fixed=FALSE)

    Replace the first match of pattern in x with replacement. With fixed=FALSE the pattern is a regular expression; with fixed=TRUE it is matched literally. sub("\\s", ".", "Hello There") returns "Hello.There"

    strsplit(x, split) splits the elements of x at the given delimiter.

    strsplit("abc", "") returns a 3-element vector: "a", "b", "c"

    paste(..., sep="") sep (Concatenate)

    toupper(x)

    tolower(x)

    70
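    A few one-line illustrations of the string functions above:

    substr("abcdef", 2, 4)                       # "bcd"
    grep("A", c("b", "A", "c"), fixed = TRUE)    # 2
    sub("\\s", ".", "Hello There")               # "Hello.There"
    strsplit("abc", "")                          # "a" "b" "c"
    paste("x", 1:3, sep = "")                    # "x1" "x2" "x3"
    toupper("abc"); tolower("ABC")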

  • seq(from , to, by) (sequence) indices

  • expr { } .

    if-else if (cond) expr

    if (cond) expr1 else expr2

    for for (var in seq) expr

    while while (cond) expr

    switch switch(expr, ...)

    ifelse ifelse(test,yes,no)

    72
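    Small sketches of the control structures listed above:

    x <- 5
    if (x > 0) "positive" else "non-positive"

    for (i in 1:3) print(i^2)

    i <- 1
    while (i <= 3) { print(i); i <- i + 1 }

    ifelse(c(-1, 2, -3) > 0, "pos", "neg")   # vectorized if-else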

  • order( )

    By default, sorting is ASCENDING; prefix the sorting variable with a minus sign for DESCENDING order.

    # Example: sorting the mtcars data

    attach(mtcars)

    # sort by mpg

    newdata
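    The sort result above is truncated; a minimal completion in the spirit of the slide:

    attach(mtcars)
    newdata  <- mtcars[order(mpg), ]         # ascending by mpg
    newdata2 <- mtcars[order(cyl, -mpg), ]   # by cyl, then descending mpg
    detach(mtcars)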

  • IV. R

    74

  • R

    R

    plot()

    Plots

    (Dot) Plots

    (Bar) Plots

    (Line Charts)

    (Pie Charts)

    (Boxplots)

    Scatter Plots

    75

  • R

    R

    demo(graphics); > demo(persp)

    plot(c(1,2,3),c(1,2,4))

    Nile

    mean(Nile)

    sd(Nile)

    hist(Nile)

    76

  • plot()

    plot( ) object (plot)

    Generic density, data frame,

    : plot(x,y, arguments)

    attach(mtcars)

    plot(wt, mpg)

    abline(lm(mpg~wt))

    title("Regression of MPG on Weight")

    plot() :

    77

  • Option

    type = type=p (point) type=l (line) type=b type=o type=h type=s (step)

    xlim =

    ylim =

    x y . xlim = c(1,10) xlim = range(x)

    xlab =

    ylab =

    labels for the x and y axes

    main = (main title).

    sub = (subtitle).

    bg=

    bty= 78

  • pch

    lty

    Option

    pch =

    lty = line type: 1 (solid), 2 (dashed), 3 (dotted), 4 (dot-dash)

    col= : red,green,blue

    mar = c(bottom, left, top, right) . c(5,4,4,2) + 0.1

    asp = aspect ratio (y/x)

    79

    Example: par(mfrow = c(2,2))   # mfrow lays out multiple plots per page
    plot(x, y, type="b", main = "cosine", sub = "type = b")
    plot(x, y, type="o", las = 1, bty = "u", sub = "type = o")

    plot(x,y, type="h", bty = "7", sub = "type = h")

    plot(x,y, type="s", bty = "n", sub = "type = s")

    80

  • abline()

    abline(a,b) # =a, =b

    abline(h=y) #

    abline(v=x) # abline(lm.obj) # lm.obj

    : data(cars)

    attach(cars)

    par(mfrow=c(2,2))

    plot(speed, dist, pch=1); abline(v=15.4)

    plot(speed, dist, pch=2); abline(h=43)

    plot(speed, dist, pch=3); abline(-14,3)

    plot(speed, dist, pch=8); abline(v=15.4); abline(h=43)

    81

  • plotting

    Plotting (Dot Plot) dotchart(x, labels=)

    x vector, labels .

    The groups= option splits x by a factor. Example: dotchart(mtcars$mpg, labels = row.names(mtcars), cex=.7,

    main=" ", xlab = "Miles per Gallon")

    82

  • # Dotplot: , (: mpg, group), (by cylinder)

    x

  • (Bar) Plots

    barplot(height)

    height vector matrix.

    If (height vector)

    .

    If (height matrix AND option beside=FALSE)

    bar height column stacked sub-bars )

    If (height matrix AND beside=TRUE)

    Column

    option names.arg=( ) label

    option horiz=TRUE barplot

    , Bar plot bar plotting. (mean, median, sd )

    aggregate( ) barplot( )

    84

  • # Simple Bar Plot

    counts

  • Stacked Bar Plot counts

  • Grouped Bar Plot counts
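    The three counts objects above are truncated; minimal sketches using the mtcars data (titles are illustrative):

    # simple bar plot
    counts <- table(mtcars$gear)
    barplot(counts, main = "Car Distribution", xlab = "Number of Gears")

    # stacked bar plot (height is a matrix; beside = FALSE is the default)
    counts <- table(mtcars$vs, mtcars$gear)
    barplot(counts, legend = rownames(counts))

    # grouped bar plot (bars side by side)
    barplot(counts, legend = rownames(counts), beside = TRUE)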

  • (Line Charts)

    (Line Charts) lines(x, y, type=)

    x y vector

    type=

    Type Description

    p

    l

    o overplotted points and lines

    b, c points joined by lines ("c" omits the points)

    s, S stair steps

    h histogram-like vertical lines

    n

    88

  • lines( ) plot(x, y) .

    Typical use: plot( ) plots the (x,y) points; calling plot( ) with the type="n" option draws only the axes and titles, so lines( ) can then add the data.

    : x

  • 90

  • plot( ) type= options

    x

  • pie(x, labels=)

    x is a non-negative numeric vector (the slice sizes)

    labels=

    slice vector # Simple Pie Chart

    slices

  • Pie

    # Pie Chart with %

    slices
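    The slices vectors above are truncated; a minimal sketch (values and labels are illustrative):

    slices <- c(10, 12, 4, 16, 8)
    lbls   <- c("US", "UK", "Australia", "Germany", "France")
    pie(slices, labels = lbls, main = "Simple Pie Chart")

    # pie chart with percentages added to the labels
    pct  <- round(slices / sum(slices) * 100)
    lbls <- paste(lbls, " ", pct, "%", sep = "")
    pie(slices, labels = lbls, main = "Pie Chart with %")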

  • Box Plot

    A box-and-whisker plot shows the minimum, maximum, median, Q1 and Q3 of a variable.

    Boxplot .

    boxplot(x, data= )

    x formula, data=

    formula (: y~group ), horizontal=TRUE

    94

  • # Cylinder MPG

    attach(mtcars)

    boxplot(mpg~cyl,data=mtcars, main=" Milage ",

    xlab="Cylinder ", ylab="Miles Per Gallon")

    detach(mtcars)

    95

  • (, Scatterplots)

    .

    Scatterplot: plot(x, y), where x and y are numeric vectors to plot against each other

    plot(wt, mpg, main="Scatterplot Example",

    xlab="Weight", ylab="Miles Per Gallon", pch=19)

    96

  • V. R

    97

  • R

    (, , )

    ()

    Crosstabs

    98

  • abs(x)

    sqrt(x)

    ceiling(x) ceiling(3.475) 4

    floor(x) floor(3.475) 3

    trunc(x) trunc(5.99) 5

    round(x, digits=n) round(3.475, digits=2) 3.48

    cos(x), sin(x), tan(x) acos(x), cosh(x), acosh(x)

    log(x)

    log10(x)

    exp(x) e^x

    factorial(x) factorial(5) 120

    99

  • () (random sample) simulation

    (d/p/q/r) +

    d: (density)

    p: (probability)

    q: 4 (quantile)

    r: (random number)

    100

  • dnorm(x) (default m=0 sd=1)

    pnorm(q) (area under the normal curve to the left of q)

    qnorm(p) normal quantile , p percentile

    rnorm(n, m=0,sd=1) n (random normal deviates)

    dbinom(x, size, prob)

    pbinom(q, size, prob)

    qbinom(p, size, prob)

    rbinom(n, size, prob)

    (size = , prob = )

    dpois(x, lambda)

    ppois(q, lambda)

    qpois(p, lambda)

    rpois(n, lambda)

    Poisson distribution (mean = variance = lambda). # probability of 0, 1, or 2 events when lambda = 4: dpois(0:2, 4)

    dunif(x, min=0, max=1)

    punif(q, min=0, max=1)

    qunif(p, min=0, max=1)

    runif(n, min=0, max=1)

    (uniform distribution) #10 uniform random variates

    x
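    The last example above is cut off; a minimal sketch of the d/p/q/r families:

    dpois(0:2, lambda = 4)             # P(X = 0), P(X = 1), P(X = 2) for lambda = 4
    x <- runif(10, min = 0, max = 1)   # 10 uniform random variates
    pnorm(1.96)                        # standard normal CDF at 1.96 (~0.975)
    qnorm(0.975)                       # the matching quantile (~1.96)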

  • 102

  • na.rm . Object vector .

    103

  • mean(x, trim=0,

    na.rm=FALSE)

    mean of object x. # trimmed mean (drop the top and bottom 5%): mx <- mean(x, trim = 0.05, na.rm = TRUE)

  • (Descriptive Statistics)

    = (summary statistics)

    sapply( ) # mydata . ,

    sapply(mydata, mean, na.rm=TRUE)

    sapply :

    mean, sd, var, min, max, median, range, and quantile.

    (histogram, density plot, ) .

    summary(mydata) # min, 1st quartile, median, mean, 3rd quartile, max for each column

    fivenum(x) # Tukey min,lower-hinge, median,upper-hinge,max

    105

  • Histograms hist(x)

    x plotting vector

    freq=FALSE option breaks= option bin

    Histogram .

    #

    hist(mtcars$mpg)

    # .

    hist(mtcars$mpg, breaks=12, col="red")

    106

  • Plot

    (Kernel Density) Plots plot(density(x)) , x vector.

    # Kernel Density Plot

    d
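    The density example above is truncated; a minimal completion:

    d <- density(mtcars$mpg)           # kernel density estimate of mpg
    plot(d, main = "Kernel Density of MPG")
    polygon(d, col = "red", border = "blue")   # optional fill under the curve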

  • Kernel Density Group sm package sm.density.compare(x, factor)

    where x is a numeric vector and factor is the grouping variable; superimposes the kernel density plots of two or

    more groups.

    # MPG (cars with 4,6, or 8 cylinders) library(sm)

    attach(mtcars)

    # value label (factor . cyl=4,6,8 numeric ) cyl.f
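    The cyl.f factor above is truncated; a sketch along the same lines (requires the sm package):

    library(sm)
    attach(mtcars)
    cyl.f <- factor(cyl, levels = c(4, 6, 8),
                    labels = c("4 cyl", "6 cyl", "8 cyl"))
    sm.density.compare(mpg, cyl.f, xlab = "Miles Per Gallon")
    title(main = "MPG Distribution by Cylinder Count")
    legend("topright", levels(cyl.f), fill = 2:4)
    detach(mtcars)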

  • 109

  • (contingency table)

    table( )

    prop.table( )

    margin.table( ) marginal

    2-way contingency table (2 ) ;

    110

  • cor( )

    cov( )

    : cor(x, use=, method= )

    Option x Matrix data frame

    use . Options: all.obs ( ), complete.obs (listwise deletion), pairwise.complete.obs (pairwise deletion)

    method Options: pearson, spearman, kendall.

    111

  • # mtcars /. listwise deletion

    cor(mtcars, use="complete.obs", method="kendall")

    cov(mtcars, use="complete.obs")

    cor.test( ) tests a single correlation coefficient; neither cor( ) nor cov( ) produces tests of significance.

    The rcorr( ) function in the Hmisc package computes pearson and spearman correlations/covariances with significance levels for a matrix, using pairwise deletion.

    #

    library(Hmisc)

    rcorr(x, type="pearson") # pearson spearman

    rcorr(as.matrix(mtcars)) # mtcars data frame

    112

  • cor(X, Y) rcorr(X, Y) column X column Y

    # mtcars Correlation matrix

    # rows: mpg, cyl, disp

    # columns:hp, drat, wt

    x
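    The x object above is cut off; a completion consistent with the comment lines:

    x <- mtcars[, c("mpg", "cyl", "disp")]
    y <- mtcars[, c("hp", "drat", "wt")]
    cor(x, y)    # rows mpg/cyl/disp, columns hp/drat/wt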

  • (x) (y)

    Simple linear regression (one predictor): Yi = β0 + β1 xi + εi

    (least squares method)

    , ,

    data(women)

    women

    fit
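    The women example above is truncated; a minimal sketch of the least-squares fit:

    data(women)                        # height (in) and weight (lb) of 15 women
    fit <- lm(weight ~ height, data = women)
    summary(fit)                       # slope, intercept, R-squared
    plot(women$height, women$weight)
    abline(fit)                        # add the fitted line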

  • 115

  • I.

    II.

    III. Taxonomy

    IV.

    V.

    VI.Underfitting Overfitting

    VII.Data Exploration

    116

  • Data Mining

    Predictive Analysis

    Data Analysis

    Data Science

    OLAP

    BI

    Analytics

    Text Mining

    SNA (Social Network Analysis)

    Modeling

    Prediction

    Machine Learning

    Statistical/Mathematical Analysis

    KDD (Knowledge Discovery)

    Decision Support System

    Simulation

    () (Data Analysis), (Data Mining)

    117

  • Data Preparation

    Data Exploration

    Modeling ( )

    Evaluation

    Deployment

    118

  • CRISP-DM

    Cross-Industry Standard Process for DM

    119

  • 120

  • 121

  • 122

  • 123

  • Dataset

    124

  • Taxonomy

    125

  • (Univariate)

    Table

    Barplot

    Pie chart

    Dot chart

    Factor

    Stem-and-leaf plot

    Strip chart

    , ,

    Variation: Variance, , IQR

    Histogram

    Mode, Symmetry, Skew

    Boxplot

    126

  • 2 (bivariate)

    2-way Table (summarized/unsummarized)

    Marginal distribution

    2-way Contingency table

    Boxplot

    Densityplot

    Strip chart

    Q-Q (quantile-quantile) plot

    Scatterplot

    2 (correlation)

    127

  • (multivariate)

    R data frame list

    Boxplot xtabs()

    split() stack()

    Lattice

    128

  • Pearson

    Spearman Rank

    Kendal Rank

    129

  • Functions

    cor( ) function produces correlations

    cov( ) function to produces covariances.

    mtcars cor(mtcars, use="complete.obs", method="kendall")

    cov(mtcars, use="complete.obs")

    130

    Option

    x Matrix or data frame

    use missing . Options are: all.obs (assumes no missing data), complete.obs (listwise deletion), pairwise.complete.obs (pairwise deletion)

    method Correlation method. Options are: pearson, spearman, kendall.

  • , .

    Hmisc package rcorr( ) produces correlations/covariances and significance

    levels for pearson and spearman correlations. # Correlations with significance levels

    library(Hmisc)

    rcorr(x, type="pearson") # pearson/spearman

    rcorr(as.matrix(mtcars))

    cor(X, Y) or rcorr(X, Y) --> X, Y column correlation.

    # Correlation matrix from mtcars with mpg, cyl, and disp as rows

    # and hp, drat, and wt as columns

    x

  • 1

    2

    2

    2

    1

    2

    132

  • (Machine Learning)

    How do machines learn?

    Abstraction Knowledge Representation 133

  • (Supervised ML) We know the labels and the number of classes

    (Unsupervised ML) We do not know the labels and may not know the

    number of classes

    134

  • 135

  • Classification

    136

  • Underfitting Overfitting

    Underfitting

    137

  • Overfitting

    training data noise .

    138

  • Data Exploration

    - ()

    , () visualization

    : Box Plot, Histogram, PCA, charting(Pareto, MV, ...)

    139

  • R (1)

    140

  • I.

    II.

    III.

    141

  • I.

    :

    : KNN

    R coding : KNN

    142

  • (Classification)

    (training set ) Class

    Model as a function of the values of other attributes.

    KNN (K-Nearest Neighbors)

    Naïve Bayes

    Decision Tree

    Regression

    SVM (Support Vector Machine)

    143

  • 144

  • Classification Marketing

    (Target Marketing)

    Market Segmentation -->

    Fraud Detection

    /

    , , ...

    (Attrition/Churn)

    (model for loyalty)

    (Sky survey)

    (star or galaxy/ )

    145

  • = A flow-chart-like tree structure

    Leaf node

    = class label or class label distribution

    146

  • heuristic recursive partitioning.

    Root node

    target class feature tree branch

    divide-and-conquer the nodes

    () 3 : mainstream

    hit/critics choice/box-office bust

    : movie script pattern

    scatter plot

    films proposed shooting budget/the number of A-list celebrities for starring roles/the categories of success

    147

  • 148

  • 149

  • C5.0 decision tree algorithm

    On which feature should the best split be made?

    Purity is measured with entropy: entropy = 0 means a completely homogeneous set,

    entropy = 1 means maximum disorder (for two classes)

    Example: a set that is 60% red and 40% white; its entropy can be computed directly,

    and the curve() function can plot entropy over all possible two-class splits (see the sketch after this slide)

    150
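    A short sketch of the entropy calculation and the curve() plot mentioned above (two classes, 60% red / 40% white):

    -0.60 * log2(0.60) - 0.40 * log2(0.40)   # about 0.971

    # entropy of a two-class split as the proportion x of one class varies
    curve(-x * log2(x) - (1 - x) * log2(1 - x),
          col = "red", xlab = "x", ylab = "Entropy", lwd = 2)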

    How is the split point chosen? IG (Information Gain) = entropy before the split minus the (weighted) entropy after the split.

    The feature whose split produces the most homogeneous groups, i.e. the largest IG, is chosen.

    IF IG = 0: no reduction in entropy from the split.

    IG is maximal when the entropy after the split is 0, i.e. the split produces completely homogeneous groups.

    Pruning the decision tree

    A decision tree can otherwise continue to grow indefinitely and become overly specific (overfitting); this is controlled by pre-pruning or post-pruning.

    151

  • Best Split Node Impurity

    152

  • Tree Construction ()

    All the training examples are at the root.

    Tree Pruning () Data Noise branch

    Tree Induction ( ) Greedy Strategy

    Split the records based on an attribute test that optimizes certain criterion.

    Issues Determine how to split the records

    How to specify the attribute test conditions?

    How to determine the best split?

    Determine when to stop splitting

    153

  • Node Impurity

    Information Gain Entropy

    ID3

    Gain Ratio IG Splitinfo

    C4.5

    Gini Binary split

    CART

    154

    Entropy(S) = -p log2(p) - q log2(q), the impurity of S

    S: a set of examples

    p: proportion of positive examples

    q: proportion of negative examples

    Gain(T, X)

    = Entropy(T) - Entropy(T, X)

    155

  • Overfitting Prepruning

    Tree construction (threshold) goodness measure

    Postpruning

    "Fully grown" tree branch get a sequence of progressively pruned trees

    Training data "best pruned tree"

    156

  • 157

  • Information Gain Best Predictor?

    158

  • Decision Tree Root node

    159

  • Tree Rule

    160

  • Decision Tree Regression

    161

  • Entropy vs.

    162

  • R

    C5.0

    163
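    A minimal sketch of fitting a C5.0 tree in R (assumes the C50 package and hypothetical train/test data frames whose first column is the class label):

    library(C50)
    model <- C5.0(x = train[, -1], y = train$class)   # features vs. class factor
    summary(model)                                    # tree structure and rules
    pred  <- predict(model, test)                     # predicted classes for new data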

  • KNN (K-Nearest Neighbors)

    KNN classifies unlabeled examples based on their similarity

    with examples in the training set

    Given an unlabeled example xu, find the k closest labeled examples

    in the training data set and assign xu to the class that appears most frequently within the k-subset

    k-NN only requires

    a value of k

    a set of labeled examples (training data)

    a measure of closeness (distance metric)

    The labeled examples form the training dataset;

    the unlabeled examples to be classified form the test dataset.

    164

  • KNN algorithm feature (feature space) 2-/3-/4-dimensional

    165

  • 166

    Choosing k is a balance between overfitting and underfitting.

    Large k: smooths out noisy data, but risks ignoring small but important patterns

    Small k: sensitive to noisy data, e.g. a single accidentally mislabeled item

    167

  • k 1-Nearest Neighbor

    3-Nearest Neighbor

    knn Min-max normalization

    Z-score standardization

  • Voronoi Diagram

    Decision surface formed by the training examples

    169

  • Rescaling

    Min-max

    Z-score 170
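    A sketch of the two rescaling approaches listed above:

    # min-max normalization to the [0, 1] range
    normalize <- function(x) (x - min(x)) / (max(x) - min(x))
    normalize(c(1, 2, 3, 4, 5))   # 0.00 0.25 0.50 0.75 1.00

    # z-score standardization (mean 0, sd 1); scale() does this per column
    scale(c(1, 2, 3, 4, 5))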

  • R (KNN):

    : /

    1

    2 /

    3

    4

    5

    10

    (Radius, Texture, Perimeter, Area, Smoothness, ...)

    171

  • R coding

    172
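    A minimal sketch of the k-NN step (assumes the class package and hypothetical rescaled train/test feature data frames plus label vectors):

    library(class)
    pred <- knn(train = train_features, test = test_features,
                cl = train_labels, k = 21)   # k near sqrt(n) is a common rule of thumb
    table(pred, test_labels)                 # simple confusion matrix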

  • II.

    Clustering

    : K-Means

    R : Clustering

    173

  • Cluster

    class object object object

    Clustering

    class grouping .

    Group (partition)

    Group label unsupervised ML group .

    actionable insight meaningful label !

    174

  • Clustering

    Scalability

    attribute (// )

    Attribute shape

    High dimensionality, Noisy data , Interpretability

    : ,

    , , pattern

    : taxonomy, ,

    175

  • : infer specialty by examining their research publications

    176

  • K Means

    The k-means algorithm for clustering

    Initial assignment phase

    Choose k initial cluster centers; from this initial guess the algorithm converges to a locally optimal clustering.

    With 3 clusters, k = 3

    Update phase

    Each center is shifted to the centroid of the examples currently assigned to it, and examples are re-assigned to the nearest center.

    177

  • 178

  • Output: (i) the cluster assignment of each example and

    (ii) the coordinates of the cluster centroids.

    Choosing k (the number of clusters) is again a balance between overfitting and underfitting

    It may come from a priori knowledge (a priori belief),

    be chosen randomly or from business requirements, or via the rule of thumb sqrt(n/2)

    For a large dataset, look for the elbow point in the within-cluster variance

    179
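    A minimal sketch of running k-means in R (k = 3 as in the example above; data assumed numeric and already rescaled):

    set.seed(123)                       # k-means starts from random centers
    km <- kmeans(iris[, 1:4], centers = 3)
    km$size                             # number of points per cluster
    km$centers                          # coordinates of the cluster centroids
    table(km$cluster, iris$Species)     # compare clusters with known labels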

  • Clustering

    Partitioning Method

    partition (k) (partitioning) , iteratively relocate (Object )

    Density-based Method

    (threshold) cluster

    , for each data point within a given cluster, the radius of a given cluster has to contain at least a minimum number of points.

    180

  • Hierarchical Method Agglomerative approach

    bottom-up Group merging, until termination condition holds.

    Divisive approach

    top-down cluster split, until termination condition holds.

    181

  • III.

    Apriori

    R coding

    182

  • .

    Example rule: Bread => Milk [sup = 5%, conf = 100%]

    Each transaction t (a set of items) is a subset of I.

    I = {i1, i2, ..., im}: a set of items

    Transaction database T = {t1, t2, ..., tn}

    Example market-basket transactions: t1: {bread, cheese, milk}

    t2: {apple, eggs, salt, yogurt}

    tn: {biscuit, eggs, milk}

    183

  • Rule evaluation measures

    1. Support: the probability Pr(A ∪ B) that a transaction contains both A and B

    Support = (number of transactions containing both A and B) / (total number of transactions)

    (symmetric: 'A => B' and 'B => A' have the same support)

    2. Confidence: the conditional probability Pr(B | A) of B given A

    Confidence = (number of transactions containing both A and B) / (number of transactions containing A)

    ('A => B' and 'B => A' generally have different confidence)

    3. Lift (improvement)

    Lift = support(A and B) / (support(A) x support(B))

    Lift greater than 1 indicates positive association, equal to 1 independence, and less than 1 negative association.

    184

  • In formula form, for a rule X => Y over n transactions:

    support = count(X ∪ Y) / n

    confidence = count(X ∪ Y) / count(X)

    Support measures how often the rule applies; confidence measures how reliable it is.

    185

  • Find all rules that satisfy the user-specified minimum

    support (minsup) and minimum confidence (minconf).

    Features Completeness: find all rules.

    No target item(s) on the right-hand-side

    Mining with data on hard disk (not in memory)

    !

    the Apriori Algorithm

    186

  • Apriori

    Two steps: (Step 1) Find all itemsets that satisfy minimum support

    (these frequent itemsets are also called large itemsets).

    (2 ) Use frequent itemsets to generate rules.

    : (frequent itemset) {Chicken, Clothes, Milk} [sup = 3/7]

    and one rule from the frequent itemset: Clothes => Milk, Chicken [sup = 3/7, conf = 3/3]

    187

  • 1: Mining (frequent itemset)

    = an itemset whose support is >= minsup.

    :

    apriori property (downward closure property): any subsets of a frequent itemset are also frequent itemsets

    188

    (Itemset lattice over items A, B, C, D: pairs AB AC AD BC BD CD, triples ABC ABD ACD BCD)

  • 2: rule Frequent itemsets association rules

    One more step is needed to generate association rules

    For each frequent itemset X,

    For each proper nonempty subset A of X,

    Let B = X - A

    A => B is an association rule if confidence(A => B) >= minconf, where

    support(A => B) = support(A ∪ B) = support(X) and confidence(A => B) = support(A ∪ B) / support(A)

    189

  • R coding

    190
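    A minimal sketch of the two-step Apriori run in R (assumes the arules package and a hypothetical transactions file):

    library(arules)
    trans <- read.transactions("market_basket.csv", sep = ",")   # hypothetical file
    rules <- apriori(trans,
                     parameter = list(support = 0.05, confidence = 0.5))
    inspect(sort(rules, by = "lift")[1:5])   # top rules by lift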

  • R (2) CLICKSTREAM PROFILING

    191

  • Clickstream

    Clickstream Data Warehouse by Mark Sweiger

    Schema

  • Identify who you are from where you go

    Click path

    Web log

    Page Tagging

    : Google Analytics

    Internet Traffic

    Google Analytics. Yandex, Kontagent

    Crowd-sourcing

  • Quiz: Which one is a more frustrated?

  • (Path Analysis)

    Choice model of Browsing

    Text

    Markov

  • Path Analysis

  • Page

    User Session

  • Probability of Viewing a Page

    Transition Matrix

  • Predicting Purchase Conversion

  • Profiling:

    Data Vectorization!!!

  • Clustering, SVM,

  • Page Contents = HTML Code + Regular Text

  • Tokenization & Lexical Parsing

    HTML code,

    , Stop word ,

    term frequency (TF)

    Result: Document Vector

  • Classifying Document Vector

  • Markov Chain

    : Auto Insurance Risk :

    low risk or high risk - 12

    :

    a high-risk driver has a 60% chance of remaining high risk

    and a 40% chance of becoming low risk.

    a low-risk driver has a 15% chance of becoming high risk

    and an 85% chance of remaining low risk.

    Task:

    Set up a probability tree, transition diagram, and transition matrix to our process.

  • The transition matrix summarizes movement between states over time (steps, sequences, trials, etc.)

    It carries the same information as the probability tree.

    Each row is a current state and each column a next state;

    the states are mutually exclusive.

    Task

    Find the probability of being in any given state many steps into the process.
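    A sketch of the transition matrix for the insurance example above, and how to step the process forward in R:

    P <- matrix(c(0.60, 0.40,     # from high risk: 60% high, 40% low
                  0.15, 0.85),    # from low risk: 15% high, 85% low
                nrow = 2, byrow = TRUE,
                dimnames = list(c("high", "low"), c("high", "low")))
    s0 <- c(1, 0)    # start in the high-risk state
    s1 <- s0 %*% P   # state distribution after one period
    s2 <- s1 %*% P   # after two periods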

  • Random guessing: 7%

    Text Classification: 25%

    + Domain Model 41%

    + Browsing Model 78%

    Source: http://www.andrew.cmu.edu/user/alm3/

    http://www.andrew.cmu.edu/user/alm3/

  • Profiling

    ()

    213

  • (churn analysis) (app)

  • 215

  • Churn Rate

    Most of the Apps Lose Half of their Peak Users within 3 Months

  • churn analysis

  • Business Objective: Reduce Customer Churn

    Solution #1 .

    Solution #2 .

    Action Plans

    App (eg. Gaming App, Social App)

    List down the activities that users perform on your app

    core feature

    average life-time

  • Churn Criteria

    Cut-off date

    = app ~

    : A app 2014531 inactive Cut-off Date 40 days

    data points: app activities

    app

    app

    core feature

  • 1

    2

    3

    (Preprocessing)

  • Variable (Feature)

  • Google Analytics R

    Image source: Google Analytics Core Reporting API Dev Guide

  • app

  • , Classification Problem

    Logistic Regression

    Predictor(dependent) variable will be unique key(Visitor ID) for each visitors

    Predicted label would be

    1 : Visitor will churn vs. 0 : Visitor would not churn

  • Process

    Random Train Test

    Train Data-set

    Test Data-set

    Test Data

  • (Accuracy)

    Confusion Matrix

    Accuracy

    = (No of Correctly Predicted Labels) / Total No of Labels

    = (620 + 1024)/ (620 + 4 + 7 + 1024)

    ~ 99.34 %
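    A minimal sketch of the logistic-regression step described above (hypothetical data frame churn_data with a 0/1 column churned and feature columns):

    idx   <- sample(nrow(churn_data), 0.7 * nrow(churn_data))   # random train/test split
    train <- churn_data[idx, ]
    test  <- churn_data[-idx, ]

    fit  <- glm(churned ~ ., data = train, family = binomial)
    prob <- predict(fit, newdata = test, type = "response")
    pred <- ifelse(prob > 0.5, 1, 0)

    cm <- table(Predicted = pred, Actual = test$churned)   # confusion matrix
    sum(diag(cm)) / sum(cm)                                # accuracy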

  • User Segmentation

  • User types

  • (Market Segmentation using k-means)

  • Market Matching

    Segmentation

    dissecting the marketplace into submarkets that require different marketing mixes

    Targeting

    Process of reviewing market segments and deciding which one(s) to pursue

    Positioning

    Establishing a differentiating image for a product or service in relation to its competition

    230

  • SNS 10

    231

    Segmentation

    Geographic

    Demographic Psychographic

    Behavioral Geodemographic

  • R coding

    232

  • R (3)

    233

  • ?

    Non-Math/Stats Model

    Representation of Some Phenomenon

    Math/Stats Model

    Describe Relationship between Variables

    (Deterministic) Models (no randomness)

    (Probabilistic) (with randomness)

    234

  • Deterministic Models (no randomness)

    Hypothesize Exact Relationships

    Prediction Error

    : (Body mass index: BMI)

    BMI = Weight in Kilograms/ (Height in Meters)2

    (with randomness) Hypothesize 2 Components

    Deterministic

    Random Error

    : (Systolic blood pressure)

    SBP = 6 x age(d) + random error

    , Random Error (: Birthweight)

    235

  • Probabilistic models

    Regression Models

    Correlation models

    Other models

    236

  • (Regression) ( ) ()

    Use equation to set up relationship

    Numerical Dependent (Response) Variable

    1 or More Numerical or Categorical Independent (Explanatory) Variables

    1. Hypothesize Deterministic Component

    Estimate Unknown Parameters

    2. Random Error Term

    Estimate Standard Deviation of Error

    3. Fitted Model

    4. Use Model for Prediction & Estimation

    237

  • Specifying the deterministic component

    1.

    2. (Hypothesize Nature of Relationship)

    Expected Effects (i.e., Coefficients Signs)

    Functional Form (Linear or Non-Linear)

    Interactions

    1. (: Epidemiology)

    2.

    3. (Previous Research)

    4. Common Sense

    238

  • : Which model is more logical?

    239

    Years since seroconversion

    CD+ counts

    CD+ counts

    Years since seroconversion

    Years since seroconversion

    Years since seroconversion

    CD+ counts

    CD+ counts

  • Regression

    240

    Regression

    (Simple)

    (Multiple)

    2 1

  • Linear Equation

    241

  • R coding

    (Simple Linear Regression)

    lm()

    : coef()

    (fitted value): fitted()

    (residuals): residuals()

    : confint()

    predict()

    predict.glm(), predict.lm(), predict.nls()

    summary()

    F

    ANOVA 242
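    A compact sketch tying the functions above together (using the built-in cars data):

    fit <- lm(dist ~ speed, data = cars)
    coef(fit)             # estimated coefficients
    head(fitted(fit))     # fitted values
    head(residuals(fit))  # residuals
    confint(fit)          # confidence intervals for the coefficients
    predict(fit, newdata = data.frame(speed = c(10, 20)))
    summary(fit)          # t-tests, F statistic, R-squared
    anova(fit)            # ANOVA table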

  • (Multiple Linear Regression)

    n : I()

    (outlier)

    243

  • R class ts

    frequency=7: a weekly series

    frequency=12: a monthly series

    frequency=4: a quarterly series

  • Time Series Decomposition

    4 : Trend component: long term trend

    Seasonal component: seasonal variation

    Cyclical component: repeated but non-periodic fluctuations

    Irregular component: the residuals

    : AirPassengers : plot(AirPassengers)

    apts
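    The apts example above is cut off; a minimal completion:

    apts <- ts(AirPassengers, frequency = 12)   # monthly series
    f <- decompose(apts)                        # trend + seasonal + random
    plot(f)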

  • Popular models

    Autoregressive moving average (ARMA)

    Autoregressive integrated moving average (ARIMA)

    # build an ARIMA model

    fit
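    The fit line above is truncated; a minimal sketch of a seasonal ARIMA fit and forecast (the order is chosen only for illustration):

    fit  <- arima(AirPassengers, order = c(1, 0, 0),
                  seasonal = list(order = c(0, 1, 1), period = 12))
    fore <- predict(fit, n.ahead = 24)          # two years ahead
    ts.plot(AirPassengers, fore$pred, lty = c(1, 2))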

  • 247

  • R

    R / R

    R

    R

    248

  • R

    Up to ~1 million records: plain R

    1 million ~ 1 billion records: R with tuning

    >= 1 billion records: MapReduce

    (: 10 K record hierarchical clustering 50 M )

    Workarounds when R objects exceed memory: sampling, H/W upgrade (64-bit, up to 8 TB RAM), rewriting hot spots in a compiled language (C/C++), or Parallel R

    249

  • R MapReduce -

    An R script can serve as the map step of a Hadoop streaming job;

    joins, aggregation and sorting are done in the reduce step. $ export HADOOP_HOME=/usr/lib/hadoop

    $ ${HADOOP_HOME}/bin/hadoop fs -rmr output

    $ ${HADOOP_HOME}/bin/hadoop fs

    -put test-data/stocks.txt stocks.txt

    $ ${HADOOP_HOME}/bin/hadoop \

    jar ${HADOOP_HOME}/contrib/streaming/*.jar \

    -D mapreduce.job.reduces=0 \

    -inputformat org.apache.hadoop.mapred.TextInputFormat \

    -input stocks.txt \

    -output output \

    -mapper `pwd`/src/main/test/stock_day_avg.R \

    -file `pwd`/src/main/test/stock_day_avg.R 250

  • R Map Reduce R

    ()

    R .

    $ cat test-data/stocks.txt | src/main/test/stock_day_avg.R |

    sort --key 1,1 | src/main/test/stock_cma.R

    Hadoop job .

    251

  • R

    Visualization

    252

  • R ()

    Shiny R framework

    R Application

    http://shiny.rstudio.com/ ( http://shiny.rstudio.com/gallery/ )

    , Professional ()

    253
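    A minimal single-file Shiny sketch (the input name and plot are illustrative):

    library(shiny)
    ui <- fluidPage(
      sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
      plotOutput("hist")
    )
    server <- function(input, output) {
      output$hist <- renderPlot(hist(rnorm(input$n)))
    }
    shinyApp(ui, server)   # launches the app in a browser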

  • !!

    Data as a Strategic Value

    Data Science

    , , ,

    254

  • 255