Course Schedule (a 4-day grid; topics listed in order)

Morning topics: introduction; big data background/concepts; big data platforms. Data analysis concepts and process (1): CRISP-DM; analysis strategy (goals, hypotheses, metric systems); analysis tools. Basic statistical theory: descriptive/inferential statistics. Data collection overview: Excel, SQL/NoSQL. Analysis process (2): modeling overview; bias-variance trade-off; resampling. Statistical modeling (3): nonlinear models. Linear algebra and multivariate analysis. Data cleaning and EDA: theory and practice. Machine learning (3): neural networks, clustering, association analysis. Model development (3): model evaluation, performance tuning.

Afternoon topics: setting up the practice environment (R, RStudio); R basics; R data structures, writing functions; statistical modeling with R (1). Statistical modeling (2): regression analysis; model selection and regularization; time series analysis. Machine learning (1): KNN, decision trees. Machine learning (2): SVM, naïve Bayes. Visualization. Big data platforms: Hadoop, Spark. Wrap-up: cloud, DL.

Tracks: big data concepts and analysis platforms · data analysis concepts and modeling · statistical analysis · machine learning · the R language
Day 1
Big Data Overview
• Background – the 3Vs
  – Data tidal wave – 3VC
  – Supercomputing – high-throughput computing
    – Two directions:
      • remote, distributed large-scale computing (grid computing)
      • centralized (MPP)
  – Scale-up vs. scale-out
  – BI (Business Intelligence) – especially DW/OLAP/data mining
Hadoop
• The birth of Hadoop – background
  – Google!
  – Spun off from the Nutch/Lucene project in 2006 – Doug Cutting
    – An Apache top-level open-source project
• Characteristics
  – A distributed processing framework for large-scale data – http://hadoop.apache.org – pure software
  – Linear scalability ("flat linearity") through a simplified programming model – "function-to-data model vs. data-to-function" (locality)
    – KVP (Key-Value Pair)
Timeline: 1990s – Excite, AltaVista, Yahoo, …; 2000 – Google: PageRank, GFS/MapReduce; 2003–04 – the Google papers; 2005 – Hadoop born (D. Cutting & Cafarella); 2006 – registered as an Apache project.
• Hadoop Kernel
• Hadoop distributions
  – Apache releases: 2.8.x, 1.x
  – Third-party distributions: Cloudera, Hortonworks, and MapR
• Hadoop & its ecosystem
Big Data Strategy and Analysis Projects
• General strategy theory
  – MBO (Management by Objectives)
    • Cascades the currently established business goals downstream (rather than modeling causal relationships among tasks), so that each organization and level knows clearly what it must do
  – BSC (Balanced Scorecard)
    • Defines causal relationships among goals and pursues balanced growth across multiple perspectives
• Balanced scorecard and KPI analysis
• Selecting specific big data projects
[Figure: positioning projects by "strategic importance vs. realistic execution capability" – discovering solution factors links strategic goals (achievability, execution power) to tactics; problems are rated by importance and urgency, and execution by capability and effectiveness.]
• Big data analysis project process
  – POC phase: form a TFT; internal training (problem awareness + methodology + basic skills); define problems that need solving and can be solved; identify the feature set required to solve them; build candidate models, evaluate, and select; assess effectiveness
  – Rollout phase 1: identify and execute additional problems; review an enterprise data management strategy; evaluate phase 1
  – Rollout phase 2: establish an enterprise-wide data strategy; establish an enterprise-wide talent management strategy; (a data-driven culture)
Major Big Data Use Cases
(classified by data type – structured vs. unstructured – and by data speed – real time vs. batch)
– Risk analysis (banking); fraud detection (credit cards); money-laundering risk detection
– Social network analysis; marketing for financial and telecom companies
– Distribution optimization (simulation); detection of fraudulent insurance claims and tax-evasion risk
– Preventive maintenance (aviation); sentiment analysis / SNA; demand forecasting in manufacturing; health insurance / disease data analysis
– Traditional DW; text analytics; real-time video surveillance
Data Analysis Overview
(moved from a later section)
Concept and Scope of Data Analysis
• Data Mining / Predictive Analysis
• Data Science
• BI/OLAP
• Analytics
• Modeling
• Machine Learning
• Mathematical/statistical analysis
• KDD (Knowledge Discovery)
• Decision Support Systems
• Evolution toward Data Science
  – Traditional analysis: BI/OLAP/DB queries, spreadsheet-centric analysis; statistical analysis
  – + text analytics (SNA/sentiment analysis, mining, search)
  – + machine learning / deep learning
Data Science
• Data Science
• Statistics and machine learning – corresponding terminology:
  Statistics           | Machine Learning
  Population & sample  | All data
  Estimation           | Learning
  Hypothesis           | Classifier
  Example/instance     | Data point
  Regression           | Supervised learning
  Covariate            | Features
  Response             | Label
  …                    | …
Machine Learning / Analysis Process
• CRISP-DM (Cross Industry Standard Process for Data Mining)
  – Business Understanding: determine business objectives; assess the situation; set data mining goals; produce a project plan
  – Data Understanding: collect initial data; describe data; explore data; verify data quality
  – Data Preparation: select data; clean data; construct data; integrate data; format data
  – Modeling: select modeling techniques; design tests; build models; assess models
  – Evaluation: evaluate the modeling results; review the process; determine next steps
  – Deployment: plan deployment; plan monitoring & maintenance; final report; review the project
Reference: https://www.the-modeling-agency.com/crisp-dm.pdf
• Analysis tools – Big Bang
  – Commercial: Excel, SAS, SPSS, Matlab, …
  – Open source: R vs. Python vs. Octave vs. Julia, …
R
• An open-source mathematical/statistical analysis tool and programming language
  – Originated from the S language; some 7,000 packages
    • CRAN: http://cran.r-project.org/
  – Excellent performance and visualization capabilities
Basic Statistics
Contents
• Unit I: Overview – 1. Overview and descriptive statistics; 2. Probability theory and Bayes
• Unit II: Analysis by number of variables – 3. Univariate/bivariate/multivariate
• Unit III: Distributions and sampling – 4. Discrete and continuous distributions; 5. Sampling and sampling distributions
• Unit IV: Parameter estimation – 6. Estimation (one/two populations); 7. Hypothesis testing; 8. Analysis of variance and experimental design
• 1.2 Descriptive Statistics
  – (1) Central tendency (ungrouped data)
    • Mode, mean, median • percentiles, quantiles/quartiles
  – (2) Variability (ungrouped data)
    • Range & IQR (interquartile range) • MAD (mean absolute deviation) • variance, standard deviation
    • Empirical rule and Chebyshev's theorem
    • Population vs. sample variance and standard deviation – the unbiased estimator
    • z-score
    • Coefficient of variation (CV)
– (3) Measures of shape
  • Skewness – coefficient of skewness
  • Kurtosis
  • Box-and-whisker plots
– (4) Measures of association
  • Correlation
    – Pearson product-moment correlation coefficient
    – Spearman correlation coefficient
    – Kendall tau (τ) correlation coefficient » for ordinal association between two variables
2. Probability Theory and Bayes
• 2.1 Basic concepts
  – Experiment, (elementary) events, sample space, independent events, unions, intersections
  – MECE (Mutually Exclusive, Collectively Exhaustive): P(X ∩ Y) = 0
  – Marginal, union, joint probabilities
  – Counting possibilities
    • mn counting rule: m × n
    • Sampling from a population with replacement: N^n possibilities
    • Combinations – sampling from a population without replacement: NCn = N! / (n!(N − n)!)
3. Analysis Tools by Number of Variables
• 3.1 Univariate
  – Categorical data: tables, bar plots, pie charts, dot charts
  – Numeric data
    • Stem-and-leaf plots, strip charts
    • Center: mean, median & mode
    • Range, variance, …
  – Shape of the distribution
    • Mode, symmetry and skew
    • Boxplot, histogram
• 3.2 Bivariate data
  – Pairs of categorical variables: two-way tables – marginal distributions, conditional distributions, contingency tables
  – Comparing independent samples: side-by-side boxplots, density plots, strip charts, Q-Q plots
  – Relationships in numeric data: analyzing correlation with scatter plots
  – Simple regression analysis
• 3.3 Multivariate data
  – Summarizing multivariate data: categorical multivariate summaries; comparing independent samples and examining relationships
  – Modeling multivariate data
    • Boxplots and multivariate models
    • Contingency tables – xtabs()
    • split() and stack()
  – Using lattice graphics
4. Discrete and Continuous Distributions
• 4.1 Overview – random variable
  • = a variable that contains the outcomes of a chance experiment
• 4.2 Shape of a discrete distribution
  – Mean, or expected value = the long-run average of occurrences
  – Variance and standard deviation
• 4.2 Binomial distribution – the binomial formula
  – Mean and standard deviation of the binomial distribution
• 4.3 Poisson distribution – the law of improbable events; λ = the long-run average
• 4.5 Hypergeometric distribution
  – Overview: the probability distribution arising when sampling without replacement from a finite population
  – Used instead of the binomial when:
    • (i) sampling is done without replacement, and
    • (ii) n ≥ 5% of N
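These distributions map directly onto base R's d*/p* functions; a minimal sketch (the parameter values are illustrative):

  dbinom(3, size = 10, prob = 0.2)   # binomial: P(X = 3), n = 10, p = 0.2
  pbinom(3, size = 10, prob = 0.2)   # binomial: P(X <= 3)
  dpois(2, lambda = 1.5)             # Poisson: P(X = 2), long-run average 1.5
  dhyper(1, m = 5, n = 15, k = 4)    # hypergeometric: 1 success in 4 draws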
(Continuous Distributions)
• 4.6 Uniform distribution
• 4.7 Normal distribution – overview
  • The Gaussian distribution
  • The probability density function of the normal distribution
  – Standardized normal distribution
    • z score = the number of standard deviations from the mean
    • The z distribution
• 4.8 Normal approximation to the binomial
  – Empirical rule: roughly 99.7% of the normal curve lies within 3 s.d.
  – Applicable when n·p > 5 and n·q > 5
  – Correcting for continuity: converting a discrete distribution into a continuous distribution
• 4.7 Exponential distribution
  – = the probability distribution of the time between random occurrences
  – Probabilities under the exponential distribution
    • Inter-arrival times between random arrivals are exponentially distributed
  – cf. the Poisson distribution = random occurrences over some interval
5. Sampling and Sampling Distributions
• 5.1 Sampling methods
• 5.2 The sampling distribution of x̄
  – Central limit theorem: μ_x̄ = μ, σ_x̄ = σ/√n
  – z formula for sample means
  – Sampling from a finite population
• 5.3 The sampling distribution of p̂
6. Estimation
• Confidence interval estimation (one population)
  – CI estimation using the z statistic (σ known)
    • Point estimation
    • The 100(1 − α)% confidence interval for μ (σ known)
    • The finite population correction factor
    • Small sample sizes – so far mostly n ≥ 30; the z formula still applies (by the CLT) when the sample is large, or even for small samples when the population is normal with σ known
  – CI estimation using the t statistic (one population, σ unknown)
    • Apply the t distribution when the population is normal but the population s.d. is unknown
      – The distribution differs with sample size
      – Assumption of the t statistic: the population is normally distributed
        » If the population is not normal or is unknown, use nonparametric techniques
      – The t distribution is robust
    • Confidence intervals for a population mean using the t statistic
  – Estimating a population proportion
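A minimal sketch of a t-based confidence interval in R (the sample values are illustrative):

  x <- c(9.8, 10.2, 10.1, 9.7, 10.4, 10.0, 9.9)  # small sample, sigma unknown
  t.test(x, conf.level = 0.95)$conf.int           # 95% t-based CI for the mean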
– Estimating a population variance
  • (…)
  – Sample variance
  – Relationship between the population and sample variance: the χ² distribution
– Determining sample size
  • Sample size for estimating μ – use the z formula
  • Sample size for estimating p
7. Hypothesis Testing (One Population)
• 7.1 Overview
  – Types of hypotheses
  – Statistical hypotheses: H0, Ha
  – The hypothesis-testing procedure
  – Rejection and nonrejection regions
  – Type I and Type II errors
• 7.2 Testing a population mean with the z statistic (σ known)
  – z test for a single mean
  – Testing the mean of a finite population
  – Testing with the p-value
    • p-value = the observed level of significance
      – Defines the smallest value of α for which H0 can be rejected
    • "H0 can be rejected only if α is larger than the p-value"
  – Testing with the critical value method
• 7.3 Testing a population mean with the t statistic (σ unknown) – (…)
  – Testing with the critical value method • rejecting H0 using p-values
• 7.4 Tests about a proportion – […]
  • z test of a population proportion
  • Using the p-value
  • Using the critical value method
• 7.5 Tests about a variance
  – Table χ² vs. observed χ²
  – H0 can also be tested by the critical value method: applying the critical χ² value for α (instead of the observed χ²) and solving for s² yields the critical sample variance, s_c²
• 7.6 Type II errors
(Estimation – Two Populations)
• 7.7 Estimating/testing the difference of two means with the z statistic (σ known)
  – (…)
  – CLT: "the difference in two sample means, x̄1 − x̄2, is approximately normally distributed for large samples (both n1 and n2 ≥ 30), regardless of the shape of the populations"
  – z formula for the difference in two sample means
  – Hypothesis testing – H0: μ1 − μ2 = δ; Ha: μ1 − μ2 ≠ δ
  – Confidence intervals
A/B Testing
• Concept
  – = two-sample hypothesis testing = bucket tests = split-run testing
  – A randomized experiment with two variants (in marketing/BI)
  – Example:
    • Comparing two versions of a website that differ only in the design of a single button element; the relative efficacy of the two designs can be measured (e.g., click-through rate for a banner advertisement)
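In R, a two-proportion comparison of this kind can be run with base prop.test(); the counts below are illustrative:

  clicks      <- c(120, 150)     # clicks for variants A and B
  impressions <- c(2400, 2500)   # impressions for variants A and B
  prop.test(clicks, impressions) # tests H0: the two click-through rates are equal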
• 7.8 Estimating/testing the difference of two means: independent samples, σ unknown
  – Hypothesis testing
  – CIs and tests for the difference of two population means using the t test
• 7.9 Inference for related populations
  – Types • before-and-after studies • matched pairs with built-in relatedness, as an experimental control mechanism (e.g., twins, siblings)
  – Hypothesis tests
  – Confidence intervals
• 7.10 Inference for the difference of two proportions (p1 − p2)
  – (…)
  – Hypothesis tests
  – Confidence intervals
• 7.11 Inference for two population variances
8. Analysis of Variance and Experimental Design
• 8.1 Experimental design – concept
  • = a plan and a structure to test hypotheses in which the researcher either controls or manipulates one or more variables
  – Independent variables (I.V.)
    • Treatment variable = a variable the experimenter controls or manipulates
    • Classification variable (factor) = some characteristic of the subject that was present prior to the experiment and is not a result of manipulation or control
    • Each I.V. has 2 or more levels (= classifications = subcategories)
  – Dependent variable (D.V.)
• 8.2 Completely Randomized Design (CRD)
  – One-way analysis of variance
    • H0: μ1 = μ2 = μ3 = … = μk
    • Ha: at least one of the means is different from the others
– Values from the F distribution table
  – ANOVA tests are always one-tailed, with the rejection region in the upper tail
  – "Observed F value" vs. "critical F value" (= the table F value, looked up by d.f.)
  – Reject H0 if observed F > critical F
  – Comparing F and t values • F = t² when dfC = 1
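A minimal one-way ANOVA sketch in R on the built-in chickwts data (illustrative; the Tukey call previews the multiple comparisons of 8.3 below):

  fit <- aov(weight ~ feed, data = chickwts)  # does mean weight differ across feeds?
  summary(fit)                                # observed F and its p-value
  TukeyHSD(fit)                               # Tukey's HSD pairwise comparisons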
• 8.3 Multiple comparison tests – (…)
  • ANOVA is useful for testing differences among the means of multiple groups
    – (Advantage) The Type I error rate, α, is controlled
  – Tukey's HSD test: the case of equal sample sizes
    • = pairwise multiple comparisons
  – Tukey–Kramer procedure: the case of unequal sample sizes
• 8.4 Randomized Block Design (RBD) • = CRD (I.V. = treatment variable) + a blocking variable
  – Blocking variable: to control a confounding/concomitant variable
    » one the researcher wants to control but that is not the treatment of interest
• 8.5 Factorial Design (Two-Way ANOVA) – advantages of factorial designs, compared:
  – CRD: analyzes each variable's effect separately (one per design), i.e., examines variables independently
  – RBD: focuses on one treatment variable and controls for the blocking effect
  – Factorial design: analyzes two variables simultaneously in one experimental design, and can analyze their interaction
    • Because a confounding or concomitant variable can be controlled within a single study, power can increase over CRD; the second variable's additional effect is removed from SSE
    • An FD with 2 treatments is similar to an RBD, but attends to the effects of both variables (the interaction between the 2 treatment variables can be analyzed if multiple measurements are taken under every combination of levels of the 2 treatments)
– Factorial designs with two treatment variables
– Statistical tests for factorial designs
  • Row effects: H0: all row means are equal. Ha: at least one row mean differs.
  • Column effects: H0: all column means are equal. Ha: at least one column mean differs.
  • Interaction effects: H0: interaction effects = 0. Ha: an interaction effect is present.
  • Each of these observed F values is compared to a table F value, determined by α, df_num, and df_denom.
Setting Up the Practice Environment
(additional slides)
R and RStudio
• Installing R
• Installing RStudio
R Basics
• R data structures
• Control statements
• Writing R functions
• OOP in R
• Separate materials provided
Statistical Modeling with R (1)
Practice
Day 2
Data Collection and Loading
Data Collection Overview
• 1. Files
  – Plain files: txt, csv/tsv, …; xml, json
  – Excel
• 2. Databases – RDB; NoSQL
• 3. RESTful APIs
  • Using packages (examples):
    install.packages("jsonlite")
    install.packages(c("RCurl", "XML"))
• Data conversion in R
  – Checking data types • is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()
  – Coercion • as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()
  – Dates • character ↔ numeric data
  – Conversions between structures:
    From \ To   | one long vector     | matrix                   | data frame
    vector      | c(x, y)             | cbind(x, y), rbind(x, y) | data.frame(x, y)
    matrix      | as.vector(mymatrix) | –                        | as.data.frame(mymatrix)
    data frame  | –                   | as.matrix(myframe)       | –
R Data Collection – Files
• File import:
  – scan() • scan(file = "", what = double(0), n = -1, sep = "", dec = ".", skip = 0, na.strings = "NA")
  – read.table() • read.table(file, header = FALSE, sep = "", dec = ".", row.names, col.names)
  – read.csv() • read.csv(file, header = TRUE, sep = ",", dec = ".")
  – read.csv2() • read.csv2(file, header = TRUE, sep = ";", dec = ",")
  – read.delim() • read.delim(file, header = TRUE, sep = "\t", dec = ".", fill = TRUE, ...)
  – read.delim2() • read.delim2(file, header = TRUE, sep = "\t", dec = ",", fill = TRUE, ...)
• File export:
  – Using cat() • cat(x, file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE)
  – Using write() • write(x, file = "data", ncolumns = , append = FALSE, sep = " ")
    • write.table(x, file = "", append = FALSE, sep = " ", na = "NA", dec = ".", row.names = TRUE, col.names = TRUE)
  – Using write.csv() • write.csv(x, file = "") • write.csv2(x, file = "")
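A minimal round-trip sketch with the functions above (the file name is illustrative):

  write.csv(mtcars, file = "mtcars.csv", row.names = TRUE)   # export a data frame
  df <- read.csv("mtcars.csv", header = TRUE, row.names = 1) # read it back
  str(df)   # always inspect a freshly read data.frame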
R Data Collection – Spreadsheets
• 1. Using read.table() – caveats
  • Encoding – fileEncoding = "UCS-2LE" # Windows 'Unicode' files
    – fileEncoding = "UTF-8"
  • The header line and the separator
• 2. Fixed-width-format files
• 3. DIF – Data Interchange Format
• 4. scan()
• 5. Re-shaping data
R Data Collection – RDB
• 1. RDB overview and SQL
• 2. R database interfaces – overview
  • Most packages (RODBC is the exception) are offered per DBMS product
  • A proposal for unified access – DBI, a unified 'front-end' package (https://developer.r-project.org/db)
    – 'Back-ends' are provided by partner packages
  – Representative packages: • RMySQL • ROracle • RPostgreSQL • RSQLite • RJDBC • RpgSQL
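A minimal DBI sketch using the RSQLite back-end with an in-memory database (illustrative):

  library(DBI)      # unified front-end
  library(RSQLite)  # back-end driver
  con <- dbConnect(RSQLite::SQLite(), ":memory:")
  dbWriteTable(con, "mtcars", mtcars)
  dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")
  dbDisconnect(con)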
R Data Collection – NoSQL
• NoSQL overview
  – Properties a distributed system should have: Brewer's CAP theorem
    • Consistency – all nodes show the same data for the same item at the same time
    • Availability – all users can read and write; failure of some nodes must not affect the others
    • Partition tolerance – the system keeps operating even when messages between nodes are lost
  – In practice:
    • Guaranteeing all three is impossible, so one property is traded away
    • NoSQL: partition tolerance is required for scale-out, so choose between C and A
• NoSQL overview (continued) – NoSQL categories
  • Key-value stores
    – Origins: DHTs / Amazon's Dynamo paper
    – Examples: Memcached, Coherence, Redis
  • Column stores
    – Origins: Google's BigTable paper
    – Examples: HBase, Cassandra, Hypertable
  • Document stores
    – Origins: Lotus Notes
    – Examples: CouchDB, MongoDB, Cloudant
  • Graph databases
    – Origins: Euler & graph theory
    – Examples: Neo4j, FlockDB
• MongoDB – general
  • mongoDB = "humongous DB"
  • Document-based, open source
  • "High performance, high availability"
  • Automatic scaling
  • C-P on CAP
  – Managing documents
    • Each document is stored in a collection in BSON format
    • Collections
      – Have an index set in common
      – Are like the tables of relational DBs
      – Documents do not have to have a uniform structure
• Checking the service: service --status-all | grep mongod; /etc/rc.d/init.d
• Starting/stopping MongoDB: sudo service mongod start; sudo service mongod stop
• Connecting & selecting a DB: show dbs; use mydb; db
• Creating the mydb database and a testData collection, then inserting documents:
    use mydb
    j = {name: "mongodb"}
    k = {x: 3}
    db.testData.insert(j)
    db.testData.insert(k)
    show collections
• Confirming: db.testData.find()
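The commands above run in the mongo shell; from R, one option is the mongolite package (an assumption – the slides do not prescribe an R client). A minimal sketch, assuming a local mongod is running:

  library(mongolite)
  m <- mongo(collection = "testData", db = "mydb", url = "mongodb://localhost")
  m$insert('{"name": "mongodb"}')  # same documents as the shell example
  m$insert('{"x": 3}')
  m$find('{}')                     # equivalent of db.testData.find()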
mongoDB                  | SQL
Document                 | Tuple
Collection               | Table/View
PK: the _id field        | PK: any attribute(s)
Uniformity not required  | Uniform relation schema
Index                    | Index
Embedded structure       | Joins
Shard                    | Partition
Analysis Modeling – BIAS-VARIANCE TRADE-OFF
Source: http://www-bcf.usc.edu/~gareth/ISL/
Modeling
– The Advertising data set
Estimating f?
• Prediction
  • The accuracy of Ŷ as a prediction for Y depends on two quantities, which we will call the reducible error and the irreducible error.
• Inference
  • Answering questions about the relationship (not treating it as a black box) – e.g., for the Advertising data above
• Prediction and inference
• Parametric methods
  – 1. Make an assumption about the functional form, or shape, of f.
  – 2. After selecting a model, fit it with the training data.
  – For a linear model, estimate the parameters (β0, β1, ..., βp) – i.e., find values of these parameters such that the model fits the data.
• Non-parametric methods
  – Do not explicitly assume a functional form for f; instead seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly.
  – The analyst must choose the level of smoothness.
Quantitative vs. Qualitative Variables
• Quantitative variables – regression – (ex) OLS
• Qualitative variables – classification – (ex) logistic regression
• Quantitative + qualitative variables – (ex) logistic regression, KNN, boosting
• Note
  – Whether a predictor is qualitative or quantitative is not critical when selecting a model
  – Most statistical learning methods can be applied regardless of the predictor variable type, provided that any qualitative predictors are properly coded before the analysis is performed
• Supervised versus unsupervised learning
• Regression versus classification problems
Trade-off: prediction accuracy vs. model interpretability
Assessing Model Accuracy
• Measuring quality of fit – in the regression setting, the mean squared error (MSE), given by MSE = (1/n) Σ_{i=1}^{n} (y_i − f̂(x_i))²
  • Choose the method that gives the lowest test MSE
  – Caveat: the overfitting problem
• Bias-variance trade-off – regression
  – The classification setting
    • Generally, minimize the error rate
    • The Bayes classifier » BER
    • K-nearest neighbors
Analysis Modeling – RESAMPLING & CV
Source: http://www-bcf.usc.edu/~gareth/ISL/
Cross-Validation
• Method – hold back a validation dataset (split into training data and testing data)
• Pros and cons
  – Pros: simple, so easy to apply
  – Cons: the validation MSE is highly variable; and because the model is fit on only a subset of the observations, it is likely to perform poorly when the training data are few
• Example: the Auto data – mpg ~ horsepower vs. mpg ~ horsepower + horsepower²
• Assessing model fit via cross-validation
  – Randomly allocate the data into training (196 obs.) and validation (196 obs.) sets
  – Fit on the training data, then evaluate on the validation data (minimum MSE)
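A minimal sketch of this validation-set comparison, assuming the Auto data from the ISLR package (the slides draw on that book):

  library(ISLR)
  set.seed(1)
  train <- sample(nrow(Auto), 196)                          # 196 training obs.
  fit1  <- lm(mpg ~ horsepower, data = Auto, subset = train)
  fit2  <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
  mean((Auto$mpg - predict(fit1, Auto))[-train]^2)          # validation MSE, linear
  mean((Auto$mpg - predict(fit2, Auto))[-train]^2)          # validation MSE, quadratic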
• Other resampling techniques
  – Bootstrap
    • Samples drawn at random, with replacement
  – Jackknife
    • A resampling technique especially useful for variance and bias estimation
  – Permutation (= randomization test)
    • (Like bootstrapping,) a permutation test builds – rather than assumes – the sampling distribution by resampling the observed data
    • (Unlike bootstrapping,) this is done without replacement
LOOCV
• Concept
  – Similar to the validation-set approach, but tries to remedy its drawbacks
  – Leave-One-Out Cross-Validation
• Method
  – Split the n observations as follows: a training set of size n − 1 and a validation set of size 1
  – Fit the model on the training data
  – Validate on the held-out observation and compute its MSE
  – Repeat this process n times
[Figure 5.3 (ISLR): a set of n data points is repeatedly split into a training set containing all but one observation (blue) and a validation set containing only that observation (beige); the test error is estimated by averaging the n resulting MSEs.]
A prediction ŷ1 is made for the excluded observation using its value x1. Since (x1, y1) was not used in the fitting process, MSE1 = (y1 − ŷ1)² is an approximately unbiased estimate of the test error – but a poor one, because it is highly variable, being based on a single observation. Repeating the procedure for (x2, y2), (x3, y3), … produces n squared errors MSE1, …, MSEn, and the LOOCV estimate of the test MSE is their average:

  CV(n) = (1/n) Σ_{i=1}^{n} MSE_i    (5.1)

LOOCV has two major advantages over the validation-set approach. First, it has far less bias: each training set contains n − 1 observations, almost the entire data set (versus roughly half for the validation-set approach), so LOOCV tends not to overestimate the test error rate as much. Second, unlike the validation approach, which yields different results when applied repeatedly due to randomness in the splits, LOOCV always gives the same result.
k-fold Cross-Validation
• Concept
  – LOOCV is the special case of k-fold CV with k = n
[Figure 5.5 (ISLR): a set of n observations is randomly split into five non-overlapping groups; each fifth acts in turn as a validation set (beige) and the remainder as a training set (blue); the test error is estimated by averaging the five resulting MSE estimates.]
k-fold CV involves randomly dividing the observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, the method is fit on the remaining k − 1 folds, and MSE1 is computed on the held-out fold. This procedure is repeated k times, each time treating a different fold as the validation set, yielding k estimates of the test error, which are averaged:

  CV(k) = (1/k) Σ_{i=1}^{k} MSE_i    (5.3)

In practice one typically performs k-fold CV with k = 5 or k = 10. The most obvious advantage over k = n is computational: LOOCV requires fitting the method n times, which can be expensive (except for linear models fit by least squares, where a shortcut formula (5.2) applies; in general that formula does not hold and the model must be refit n times).
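A minimal sketch of LOOCV and 10-fold CV with boot::cv.glm(), again assuming the ISLR Auto data (a glm with the default Gaussian family is an ordinary linear model):

  library(ISLR); library(boot)
  fit <- glm(mpg ~ horsepower, data = Auto)
  cv.glm(Auto, fit)$delta[1]          # LOOCV estimate of the test MSE
  cv.glm(Auto, fit, K = 10)$delta[1]  # 10-fold CV estimate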
LOOCV vs. k-fold CV
• The Auto data:
  – Left: the LOOCV error curve
  – Right: k = 10 cross-validation
  – Both are stable, but LOOCV is more computationally intensive!
• Assessing accuracy in the classification setting
  – Error rate = (1/n) Σ_{i=1}^{n} I(ŷ_i ≠ y_i)
  • I(ŷ_i ≠ y_i) is an indicator function, which gives 1 if the condition holds and 0 otherwise
  • i.e., the error rate = the proportion of incorrect classifications (misclassifications)
  – Examples:
    • The Bayes error rate
    • Classification accuracy in KNN
– Bayes classifier
  • 0-1 loss is most commonly used
  • The optimal classifier (Bayes classifier) is: C(x) = argmax_j P(Y = j | X = x)
  • Our goal: learn a proxy f(x) for the Bayes rule from training-set examples
– Bayes error rate (BER)
  • = the lowest possible error rate that could be achieved if we somehow knew exactly what the "true" probability distribution of the data looked like
  • On test data, no classifier (or statistical learning method) can achieve a lower error rate than the Bayes error rate
  • Of course, in real-life problems the BER cannot be calculated exactly
  • (Figure: Bayes decision boundary; Bayes error rate 0.133)
– Example: checking against the Bayes boundary
  • Logistic regression improves as the polynomial degree increases: error rates 0.201, 0.197, 0.160, 0.162
– KNN
  • k nearest neighbors
  • The smaller k is, the more flexible the method will be (e.g., K = 3)
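A minimal KNN sketch with class::knn() on the built-in iris data (illustrative):

  library(class)
  set.seed(1)
  idx  <- sample(nrow(iris), 100)                    # random training indices
  pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
              cl = iris$Species[idx], k = 3)         # k = 3: fairly flexible
  mean(pred != iris$Species[-idx])                   # test error rate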
– Example: choosing between logistic regression and KNN via cross-validation
  • (Figure: logistic regression vs. KNN; brown – test error, blue – training error, black – 10-fold CV error)
Linear Regression
Regression Analysis
• Simple regression – correlation
  – Simple regression analysis
    • Dependent variable = the variable to be predicted (y)
    • Independent variable = explanatory variable = the predictor (x)
    • SLR covers only a straight-line relationship between two variables
  – Determining the regression line equation
    • The deterministic regression model is y = β0 + β1x
    • The probabilistic regression model is y = β0 + β1x + ε
Regression
• Overview
  – An equation relating a single numeric D.V. (the value to be predicted) to one or more numeric I.V.s (the predictors)
  – "Regression" = the process of fitting lines to data (Galton)
  – Also used for hypothesis testing: determining whether the data indicate that a presupposition is more likely to be true or false
• Applies to many models
  – SLR
  – MLR
  – GLM
  – Link functions
  – Logistic regression, Poisson regression, …
SLR
• OLS
  – The standard error of the estimate
    • For error analysis, instead of computing the residuals (= the estimation errors for individual points), use the standard error of the estimate
      – SSE is in part a function of the number of pairs of data used to compute it, which lessens its value as a measurement of error
      – A better metric = the standard error of the estimate (s_e), the standard deviation of the errors of the regression model
      – (Normal-distribution empirical rule: 68% lies within μ ± 1σ, 95% within μ ± 2σ; regression likewise assumes that for a given x, the error terms ~ ND)
      – Since the error terms ~ ND, s_e is the s.d. of the errors, and the average error is 0, so:
        » 68% of the error values (residuals) should be within 0 ± 1·s_e
        » 95% of the error values (residuals) should be within 0 ± 2·s_e
      – s_e provides a single measure of the magnitude of the errors in the model
      – Also used to identify outliers (e.g., outside ±2·s_e or ±3·s_e)
– The coefficient of determination
  • R² = how much of the variability of the D.V. (y) the I.V. (x) explains
    » ranges from r² = 0 to r² = 1
  – The D.V. (y) has variation, measured by the sum of squares of y (SS_yy):
    » SS_yy = SSR + SSE
    » Dividing each term by SS_yy gives the decomposition into explained and unexplained proportions
  – r² is the proportion of y's variability explained by the regression model
  • Relationship between r and r²
    – r² = (r)²
    » linking the coefficients of correlation and determination
– Hypothesis testing of the regression slope & testing the model overall
  • The slope
Estimating the Coefficients
– OLS chooses β0 and β1 to minimize the RSS, using some calculus
Accuracy of the Estimated Coefficients
– (Q) How accurate is the estimate μ̂?
– (A) Compute SE(μ̂), the standard error of μ̂
– Likewise, compute the standard errors of β̂0 and β̂1
  • Residual standard error: RSE = √(RSS / (n − 2))
Accuracy of the Linear Model
• Residual standard error
• R² statistic
  – r = Cor(x, y)
  – R² = r²
Multiple Regression
• SLR and MLR
  • Simple regression model: y = β0 + β1x + ε
  • Multiple regression model: y = β0 + β1x1 + β2x2 + … + βkxk + ε
• MR model with two independent variables (first order) – y = β0 + β1x1 + β2x2 + ε
  – The constant and coefficients are estimated from the sample: ŷ = b0 + b1x1 + b2x2 (a response surface / response plane)
• Significance tests for the regression model and its coefficients – <analyzing the adequacy of the regression model>
  – Testing the overall model
    • Simple regression: a t test of the slope of the regression line, to see if it ≠ 0 (i.e., whether the I.V. contributes significantly to predicting the D.V.)
    • Multiple regression: an analogous test makes use of the F statistic
> # OLS coefficients via the normal equations: b = (X'X)^(-1) X'y
> reg <- function(y, x) {
    x <- as.matrix(x)
    x <- cbind(Intercept = 1, x)       # prepend an intercept column
    solve(t(x) %*% x) %*% t(x) %*% y   # (X'X)^{-1} X'y
  }
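A usage sketch checking reg() against lm() on a built-in data set:

  reg(y = mtcars$mpg, x = mtcars[, c("wt", "hp")])
  coef(lm(mpg ~ wt + hp, data = mtcars))   # the same coefficients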
– Significance tests for the regression coefficients
  • Individual significance tests for each regression coefficient, with a t test:
    – H0: β1 = 0, H0: β2 = 0, …, H0: βk = 0
    – Ha: β1 ≠ 0, Ha: β2 ≠ 0, …, Ha: βk ≠ 0
    – The d.f. for each individual coefficient test is n − k − 1
– Residuals, the standard error of the estimate, and R²
  • Residuals
    – = the errors of the regression model
    – Uses: outlier detection; checking the assumptions of the regression analysis
  • SSE and the standard error of the estimate
    – = the standard error of the estimated values (the estimated standard error)
    – = the dispersion of points around the best-fit line
    – = how the actual y scores are distributed around ŷ (the regression line)
    – SSE = Σ(y − ŷ)²
    – Given the regression assumption (error terms ~ ND(0)) plus the empirical rule (roughly 68% of residuals within ±1·s_e, 95% within ±2·s_e), the standard error of the estimate is useful for measuring how well the regression model fits the data
Estimating the Coefficients (MLR)
Key Issues
• (1) Is there a relationship between the response and the predictors? – a hypothesis test
  • H0: β1 = β2 = ··· = βp = 0
  • Ha: at least one βj is non-zero
  – Compute the F-statistic: F = ((TSS − RSS)/p) / (RSS/(n − p − 1)), where TSS = Σ(y_i − ȳ)² and RSS = Σ(y_i − ŷ_i)²
  – If H0 is true (= no relationship between response and predictors), the F value is close to 1
  – If Ha is true, then E{(TSS − RSS)/p} > σ², so we expect F > 1
• (2) Deciding which variables are important – variable selection
  • Criteria: Mallow's Cp; the Akaike information criterion (AIC); the Bayesian information criterion (BIC); adjusted R²
  – But there are 2^p models, so use:
    • Forward selection
    • Backward selection
    • Mixed selection
• (3) Model fit
  – In SLR, R² = the square of the correlation between the response and the predictor
  – In MLR, it equals Cor(Y, Ŷ)²
  – A property of the fitted linear model: it maximizes this correlation among all possible linear models
  – The p-value quantifies the degree of improvement in R²
  – Definition of the RSE: RSE = √(RSS / (n − p − 1))
    • Thus, models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p
• (4) Predictions
  – Even if we knew the true values of β0, β1, ..., βp, perfect prediction would be impossible because of the random error (i.e., the irreducible error)
  – Confidence intervals
  – Prediction intervals
Other Key Issues
• Interaction terms
• Non-linear effects
• Multicollinearity
• Model selection
• Non-linear models via mathematical transformation
  – <First-order models>
    • One independent variable: y = β0 + β1x1 + ε
    • Two independent variables: y = β0 + β1x1 + β2x2 + ε
  – <Polynomial regression models>
    • Contain squared, cubed, or higher powers of the predictor variable(s) and response surfaces that are curvilinear; yet they are still special cases of the general linear model given by:
    • y = β0 + β1x1 + β2x2 + … + βkxk + ε
  – <Second-order model with one independent variable> • y = β0 + β1x1 + β2x1² + ε
  – <Quadratic model> of degree 2 (= a polynomial equation of degree 2)
    • = a special case of the general linear model – curvilinear regression by recoding the data before the multiple regression analysis is attempted
  – Quadratic form and quadratic curve:
    • x'Ax = [x1 x2] [[a, b], [c, d]] [x1, x2]' = a·x1² + b·x1·x2 + c·x1·x2 + d·x2² = a·x1² + (b + c)·x1·x2 + d·x2²
• Specialized non-linear models (discussed later)
Model Transformation
• Concept
  – Exponential model: take logs, fit, then take the antilog
  – Inverse model
• Tukey's ladder of transformations
• Examples of non-linear relationships – polynomial regression
• Interaction in regression analysis – treating the interaction term as a separate independent variable
  • An interaction predictor variable can be created by multiplying the data values of one variable by those of another:
  • y = β0 + β1x1 + β2x2 + β3x1x2 + ε
    – (the x1x2 term = the interaction term)
  • Even though this model has 1 as the highest power of any one variable, it is considered a second-order equation because of the x1x2 term
• Model building: search procedures
  – Developing a regression model should:
    • (i) maximize the explained proportion of the variation of the y values
    • (ii) be as parsimonious as possible
  – Search procedures
    • All possible regressions
      – With k independent variables, all possible regressions will determine 2^k − 1 different models
    • Stepwise regression
      – Starts with a single predictor variable, then adds and deletes predictors one step at a time, examining the fit of the model at each step until no more significant predictors remain outside the model (STEP 1/2/3: …)
    • Forward selection
      – = the same as stepwise regression, except that once a variable is entered into the process, it is never dropped
    • Backward elimination – …
• Multicollinearity
  – = two or more independent variables are highly correlated (two: collinearity; several: multicollinearity)
  – 1. It is difficult to interpret the estimates of the regression coefficients
  – 2. Inordinately small t values for the regression coefficients may result
  – 3. The standard deviations of the regression coefficients are overestimated
  – 4. The algebraic sign of an estimated regression coefficient may be the opposite of what would be expected for a particular predictor
  – Multicollinearity also affects the t values used to evaluate the regression coefficients:
    • it can inflate the estimated s.d. of the coefficients, so t values tend to be under-representative when multicollinearity is present
  – (Approaches)
    • Examine a correlation matrix to search for possible intercorrelations among potential predictor variables
    • Use stepwise regression to prevent the problem of multicollinearity
Interaction
• Concept – when the effect on Y of increasing X1 depends on another variable X2
• Example – the Advertising data: TV and radio advertising both increase sales
  Sales = b0 + b1·TV + b2·Radio + b3·TV·Radio
  Parameter estimates:
  Term      | Estimate  | Std Error | t Ratio | Prob>|t|
  Intercept | 6.7502202 | 0.247871  | 27.23   | <.0001 *
  TV        | 0.0191011 | 0.001504  | 12.70   | <.0001 *
  Radio     | 0.0288603 | 0.008905  | 3.24    | 0.0014 *
  TV*Radio  | 0.0010865 | 5.242e-5  | 20.73   | <.0001 *
• Dummy coding
  – Example: "men" and "women" (category listings)
    • Code as indicator variables (dummy variables): male = 0, female = 1
    • Suppose we want to include income and gender:
      – β2 = the average extra balance each month that females carry at a given income level; males are the "baseline"
  – Example with different intercepts and the same slopes:
    • Regression equations – females: salary = 112.77 + 1.86 + 6.05·position; males: salary = 112.77 − 1.86 + 6.05·position (a line for women and a line for men)
  – Regression coefficients (balance example):
    Coefficient   | Estimate | Std Err | t-value | p-value
    Constant      | 233.7663 | 39.5322 | 5.9133  | 0.0000
    Income        | 0.0061   | 0.0006  | 10.4372 | 0.0000
    Gender_Female | 24.3108  | 40.8470 | 0.5952  | 0.5521
LOGISTIC REGRESSION
Logistic Regression
• Regression and classification
  – Concept: applying a regression equation to a categorical response
  – Why not linear regression?
  – Solution: model the probability with the logistic function
    p = P(Y = 1) = e^(β0 + β1·X) / (1 + e^(β0 + β1·X))
• Interpreting β1
  – We are predicting P(Y), not Y:
    • If β1 = 0, there is no relationship between Y and X
    • If β1 > 0, the probability of Y = 1 grows as X gets larger
    • If β1 < 0, the probability of Y = 1 shrinks as X gets larger
  – But how much bigger or smaller depends on where we are on the slope
• Are the coefficients significant?
  – Hypothesis test: whether we can be sure β0 and β1 are significantly ≠ 0
  • The z test changes nothing about how the p-value is interpreted
  • The p-value for balance is very small and b1 is positive, so we can be sure that as balance increases, the probability of default increases as well
• Example:
  – The probability of default at an average balance of $1000? Less than 1%
  – For a balance of $2000 the probability is much higher: 0.586 (58.6%)
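A minimal sketch reproducing these numbers, assuming the Default data from the ISLR package:

  library(ISLR)
  fit <- glm(default ~ balance, data = Default, family = binomial)
  summary(fit)$coefficients   # b1 is positive with a tiny p-value
  predict(fit, newdata = data.frame(balance = c(1000, 2000)), type = "response")
  # roughly 0.6% and 58.6%, as quoted above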
• Multiple logistic regression
  – We can fit a multiple logistic model just like regular regression
  – The Default data: predict default using balance (quantitative) + income (quantitative) + student (qualitative)
  – Prediction
    • A student with a credit card balance of $1,500 and an income of $40,000 has an estimated probability of default
  – An apparent contradiction!
    • Students (orange) vs. non-students (blue)
    – To whom should credit be offered?
Non-linear Models
• Concept – extending beyond linearity: the truth is never linear!
  – But often the linearity assumption is good enough
• Types
  – Polynomial regression
  – Step functions
  – Basis functions
  – Regression splines
  – Smoothing splines
  – Local regression
  – GAMs
Polynomial Regression
• Create new variables X1 = X, X2 = X², etc., and then treat as multiple linear regression
Step Functions
• The choice of cutpoints, or knots, can be problematic
• In R: I(year < 2005) or cut(age, c(18, 25, 40, 65, 90))
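A minimal sketch of both fits, assuming the Wage data from the ISLR package:

  library(ISLR)
  fit_poly <- lm(wage ~ poly(age, 4), data = Wage)   # degree-4 polynomial in age
  fit_step <- lm(wage ~ cut(age, c(18, 25, 40, 65, 90), include.lowest = TRUE),
                 data = Wage)                        # step function over age bands
  summary(fit_step)$coefficients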
Regression Splines
• Piecewise polynomials
• Choosing the number and placement of knots
Smoothing Splines
• Overview
• Choosing the smoothing parameter λ
Local Regression
• Concept
  – Fitting at a target point x0 using only the nearby training observations (= a memory-based procedure: all the training data are kept)
• Choices
  – The weighting function K
  – The fitting method: linear, constant, quadratic regression, …
  – The span s
GAMs (Generalized Additive Models)
• Overview
  – Allow flexible nonlinearities in several variables, but retain the additive structure of linear models
  – Replace each linear component βj·xij with a smooth nonlinear function fj(xij)
MODEL SELECTION AND REGULARIZATION
Outline
• Concept
  – When there are many independent variables, reduce their number to simplify the model
  – i.e., minimize MSE via alternative fittings to OLS
  – Motivation
    • Prediction accuracy
    • Model interpretability
• Subset selection – best subset selection
  – Stepwise selection
  – Choosing the optimal model
• Shrinkage methods – ridge regression
  – The lasso
• 1. Prediction accuracy
  – When the relationship between X and Y is linear and n ≫ p, OLS has relatively low bias and low variance (n = # of observations, p = # of predictors)
  – However:
    • when n is not much larger than p, the OLS fit can have high variance and may result in overfitting and poor estimates on unseen observations
    • when n < p, the variability of the least squares fit increases dramatically, and the variance of these estimates is infinite
• 2. Model interpretability
  – When the number of independent variables X is large, many have little effect on Y
    • Leaving these variables in the model makes it harder to see the "big picture", i.e., the effect of the "important variables"
    • The model is easier to interpret after removing (i.e., setting the coefficients to zero for) the unimportant variables
Solutions
• Subset selection
  – Identify a subset of the p predictors and fit the model on that subset
  – Examples: best subset selection, stepwise selection
• Shrinkage
  – Shrinking the estimated coefficients towards zero reduces variance
  – Some of the coefficients may shrink to exactly zero, so shrinkage methods can also perform variable selection
  – Examples: ridge regression, lasso
• Dimension reduction
  – Involves projecting all p predictors into an M-dimensional space where M < p, and then fitting a linear regression model
  – Example: principal components regression
Best Subset Selection
• One simple approach
  – = take the subset with the smallest RSS or the largest R²
  – But R² increases (equivalently, RSS decreases) as more variables enter the model
  – Example
• Measures of comparison – add a penalty to RSS for the number of variables (complexity)
  – Criteria:
    • Adjusted R²
    • AIC (Akaike information criterion)
    • BIC (Bayesian information criterion)
    • Cp (equivalent to AIC for linear regression)
• Stepwise selection
  – Background
    • Best subset selection is computationally intensive, especially when we have a large number of predictors (large p)
  – More attractive methods:
    • Forward stepwise selection:
      – Begins with the model containing no predictors, then adds the one predictor at a time that improves the model the most, until no further improvement is possible
    • Backward stepwise selection:
      – Begins with the model containing all predictors, then deletes the one predictor at a time whose removal improves the model the most, until no further improvement is possible
Shrinkage Methods
• Ridge regression
  – Ordinary least squares (OLS) estimates β by minimizing RSS = Σ_i (y_i − β0 − Σ_j β_j·x_ij)²
  – Ridge regression uses a slightly different criterion: RSS + λ·Σ_j β_j²
  – The tuning parameter λ
    • is a positive value
    • has the effect of "shrinking" large values of β towards zero
    • It turns out that such a constraint can improve the fit, because shrinking the coefficients can significantly reduce their variance
    • Notice that when λ = 0, we recover OLS!
– As λ increases, the standardized coefficients shrink towards 0
– Effects
  • OLS estimates generally have low bias but can be highly variable – in particular when n and p are of similar size or when n < p, the OLS estimates will be extremely variable
  • The penalty term makes the ridge regression estimates biased, but can also substantially reduce their variance
  • Thus, there is a bias/variance trade-off
  • In general:
    – Ridge estimates will be more biased than OLS but have lower variance
    – Ridge regression works best in situations where the OLS estimates have high variance
  • If p is large, the best-subset-selection approach requires searching through an enormous number of possible models
  • With ridge regression, for any given λ we only need to fit one model, and the computations turn out to be very simple
  • Ridge regression can even be used when p > n!
Lasso
• Concept
  – (Background) Ridge regression isn't perfect:
    • the penalty term will never force any of the coefficients to be exactly zero, so the final model will include all variables, which makes it harder to interpret
  – The LASSO is similar, but its penalty term is different
• Penalty term
  – Ridge regression minimizes RSS + λ·Σ_j β_j²
  – The LASSO estimates β by minimizing RSS + λ·Σ_j |β_j|
• Choosing the tuning parameter λ
  – Select a grid of candidate values, use cross-validation to estimate the error rate on test data for each value of λ, and select the value that gives the lowest error rate
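A minimal sketch with the glmnet package (the Hitters data from ISLR is an assumption, used only for illustration):

  library(ISLR); library(glmnet)
  x <- model.matrix(Salary ~ ., data = na.omit(Hitters))[, -1]  # predictor matrix
  y <- na.omit(Hitters)$Salary
  cv_ridge <- cv.glmnet(x, y, alpha = 0)   # ridge: alpha = 0
  cv_lasso <- cv.glmnet(x, y, alpha = 1)   # lasso: alpha = 1
  cv_lasso$lambda.min                      # lambda with the lowest CV error
  coef(cv_lasso, s = "lambda.min")         # note: some coefficients are exactly zero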
Time Series Analysis
Overview
• Components of time series data
  – Trend
  – Cycles
  – Seasonality
  – Irregularities
• Key concepts – measuring forecasting error
  – Error
  – Mean absolute deviation (MAD)
  – Mean squared error (MSE)
Smoothing Techniques
• Naïve forecasting models
• Averaging models – simple averages
  – Moving averages
  – Weighted moving averages
• Exponential smoothing
Trend Analysis
• Linear regression trend analysis
• Regression trend analysis with a quadratic model
  – The quadratic regression model: y = β0 + β1·x1 + β2·x1² + ε
• Holt's two-parameter exponential smoothing technique
  – The exponential smoothing above (single exponential smoothing) is suitable for forecasting stationary time-series data, but unsuitable for time series with a trend (because the forecasts will lag behind the trend)
  – Holt's technique uses a second weight (β) to smooth the trend, in a manner similar to the smoothing (α) used in single exponential smoothing
Seasonal Effects
• Decomposition
• Winters' three-parameter exponential smoothing method
Autocorrelation and Autoregression
• Autocorrelation
  – = serial correlation
  – = occurs when the error terms of a forecasting model are correlated with one another
  – A problem in regression analysis
    • A precondition of regression analysis: the error terms are independent, or random (not correlated)
    – Problems arising when autocorrelation is present
• Ways to address the autocorrelation problem
  – Adding independent variables
  – Transforming variables
  – Autoregression
Extensions
• Index numbers – simple index numbers
  – Unweighted aggregate price index numbers
  – Weighted aggregate price index numbers
  – Laspeyres price index
  – Paasche price index
• More advanced models
  – White noise, autoregressive (AR), moving average (MA), and ARMA models
  – Stationarity, detrending, differencing, and seasonality
  – The autocorrelation function (ACF) and the partial autocorrelation function (PACF)
  – Dickey-Fuller tests
  – The Box-Jenkins methodology for selecting ARMA models
Time Series ARIMA Models
• Time series examples:
  – Modeling relationships using data collected over time: prices, quantities, GDP, etc.
  – Forecasting: predicting economic growth
  – Time series analysis involves decomposition into trend, seasonal, cyclical, and irregular components
• Problems caused by ignoring lags
  – Values of y_t are affected by the values of y in past periods
  – For example, the amount of money in your bank account in one month is related to the amount in your account in the previous month
  – Regression without lags fails to account for the relationships through time and overestimates the relationship between the dependent and independent variables
Autoregressive (AR) Models
• Autoregressive (AR) models are models in which the value of a variable in one period is related to its values in previous periods
• AR(p) is an autoregressive model with p lags: y_t = μ + γ1·y_{t−1} + … + γp·y_{t−p} + ε_t
  – where μ is a constant and γp is the coefficient for the variable lagged p periods
• AR(1) is expressed as: y_t = μ + γ1·y_{t−1} + ε_t
  – (Figure: simulated AR(1) series with coefficients of 0.8 and −0.8)
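A minimal sketch simulating and fitting an AR(1) with base R's stats functions (the coefficient 0.8 is illustrative):

  set.seed(1)
  y <- arima.sim(model = list(ar = 0.8), n = 200)  # simulate AR(1), gamma_1 = 0.8
  fit <- arima(y, order = c(1, 0, 0))              # fit an AR(1)
  fit$coef                                         # ar1 estimate near 0.8
  acf(y); pacf(y)                                  # ACF/PACF (see the extensions above)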
Day 3
Linear Algebra Basics
Matrices
– Square matrix: has the same number of rows as columns
– Transpose: created by converting its rows into columns
• Matrix multiplication
• Identity matrix – AI = A
• Orthogonal matrix – a matrix A is orthogonal if AA^T = A^T·A = I
Vectors
• Concept – vectors = points; the number of components → the dimension
• Length
• Vector operations
  – Addition
  – Scalar multiplication • if v = [3, 6, 8, 4], then 1.5·v = 1.5·[3, 6, 8, 4] = [4.5, 9, 12, 6]
  – Inner product • = dot product = scalar product
• Orthogonality – perpendicular ⇔ inner product = 0
• Normal vectors
• Orthonormal vectors – vectors of unit length that are orthogonal to each other
Eigenvectors and Eigenvalues
• Eigenvector – a nonzero vector v that satisfies Av = λv, where A = a square matrix, v = an eigenvector, λ = an eigenvalue
• Finding eigenvectors and eigenvalues
• Eigendecomposition – diagonalization via eigenvalues (possible only for square matrices)
  – Matrix multiplication with a diagonal matrix
• SVD (singular value decomposition)
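A minimal sketch with base R's eigen() and svd() (the matrix is illustrative):

  A <- matrix(c(2, 1, 1, 2), nrow = 2)   # a symmetric 2x2 matrix
  e <- eigen(A)
  e$values                               # eigenvalues: 3 and 1
  e$vectors                              # eigenvectors (as columns)
  A %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]  # ~0: A v = lambda v
  svd(A)$d                               # singular values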
Multivariate Statistical Analysis
Statistics in Vector Terms
• Raw-score, deviation-score, and standard-score vectors
  – "Centered" = the raw score X minus its mean X̄
  – "Centered & scaled" = centered score / standard deviation (s) = the standard score (z)
  – The standard deviation in vector terms
    • The standard deviation of variable X: s = √(Σ(X − X̄)² / (N − 1))
    – The numerator is the length of the deviation-score vector → it reflects the variable's variability
    • Relationship between deviation-vector length and standard deviation: ‖X − X̄‖ = √(N − 1)·s
    • That is, z-standardization makes every variable vector have length √(N − 1)
  Subject | Raw score (X) | Deviation (X − X̄) | Standard score (z)
  1       | 15            | 0                  | 0
  2       | 12            | −3                 | −1
  3       | 18            | 3                  | 1
  X̄       | 15            | 0                  | 0
  s       | 3             | 3                  | 1
Source: Park Kwang-bae, Multivariate Analysis, Hakjisa
– The correlation coefficient in vector terms
  – ‖z_A‖ = √(0² + (−1)² + 1²) = 1.414
  – ‖z_B‖ = √(0.92² + (−1.06)² + 0.13²) = 1.414
    » i.e., the correlation of the two variables is r = cos θ
• Linear combinations and data analysis
  – Projection points – the angle of the linear-combination axis C ↔ the composite weights
  • Standardization:
    2 / √(2² + 0.8²) = 0.9285 and 0.8 / √(2² + 0.8²) = 0.3714; the sum of their squares = 1
  • Also cos θ_A = 0.9285, cos θ_B = 0.3714
  Subject | A  | B         Subject | z_A | z_B
  1       | 15 | 21        1       | 0   | 0.92
  2       | 12 | 16        2       | −1  | −1.06
  3       | 18 | 19        3       | 1   | 0.13
  Subject | Variable A | Variable B | C = 2A + 0.8B
  1       | 1          | 3          | 4.4
  2       | 2          | 2          | 5.6
• The SSCP matrix – Sum-of-Squares and Cross-Products
  • SSCP = X'X = A'A; for example:
    A' =
      [  1   2   3 ]
      [ −4  −6  −2 ]
      [  3   9   6 ]
    A =
      [  1  −4   3 ]
      [  2  −6   9 ]
      [  3  −2   6 ]
    A'A = SSCP =
      [  14  −22   39 ]
      [ −22   56  −78 ]
      [  39  −78  126 ]
• The variance–covariance matrix
  – Variance
  – Covariance
  – In the example:
    • (A's entries − each column's mean) → form the SSCP, then
    • SSCP / (the number of rows of A) → the variance–covariance matrix:
      (1/3) · [ −1   0   1 ]   [ −1   0  −3 ]   =   [ 0.667  0.667   1 ]
              [  0  −2   2 ] · [  0  −2   3 ]       [ 0.667  2.667  −2 ]
              [ −3   3   0 ]   [  1   2   0 ]       [ 1     −2       6 ]
• The correlation matrix
  – Standardize A's entries column by column → form the SSCP, then
  – (every element of that SSCP matrix) / (the number of rows of A) → the correlation matrix:
      (1/3) · [ −1.225   0       1.225 ]   [ −1.225   0      −1.225 ]   =   [ 1    0.5   0.5 ]
              [  0      −1.225   1.225 ] · [  0      −1.225   1.225 ]       [ 0.5  1    −0.5 ]
              [ −1.225   1.225   0     ]   [  1.225   1.225   0     ]       [ 0.5 −0.5   1   ]
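These matrices can be verified in R; note the example divides by n (= 3) rather than n − 1, so base cov() must be rescaled (cor() is unaffected because the divisor cancels):

  A <- matrix(c(1, -4, 3,
                2, -6, 9,
                3, -2, 6), nrow = 3, byrow = TRUE)
  t(A) %*% A                                   # SSCP
  C <- scale(A, center = TRUE, scale = FALSE)  # column-centered A
  t(C) %*% C / nrow(A)                         # variance-covariance (divisor n)
  cov(A) * (nrow(A) - 1) / nrow(A)             # the same result via cov()
  cor(A)                                       # correlation matrix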
Extending the Concept of Distance
• Distance of a point from the mean in univariate space = x_i − x̄
• Euclidean distance
• Measuring distance in multivariate data:
  √((x_i − x̄)² + (y_i − ȳ)² + ⋯ + (n_i − n̄)²)
  • Euclidean distance
  • Definition
  • Limitation – the data have some degree of covariance, which this distance ignores
Distance Measures
• Distance between numeric data points
  – Minkowski distance
    • Euclidean distance: when p = 2
    • Manhattan distance: when p = 1
  – Mahalanobis distance
• Others
  – Distance between categorical data points: Hamming distance, Jaccard, …
  – Distance between sequences (strings, time series)
• Other related concepts – z-transform, Pearson, …
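A minimal sketch of these distances with base R/stats (iris is used for illustration):

  x <- as.matrix(iris[, 1:4])
  dist(x[1:5, ], method = "euclidean")            # Minkowski with p = 2
  dist(x[1:5, ], method = "manhattan")            # Minkowski with p = 1
  dist(x[1:5, ], method = "minkowski", p = 3)     # general Minkowski
  # Mahalanobis distance from the mean, accounting for the covariance structure
  mahalanobis(x, center = colMeans(x), cov = cov(x))[1:5]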
Variance-Covariance Matrix
Covariance and Distance
• It would be easier to calculate distance if we could rescale the coordinates so they didn't have any covariance
Distances as Vectors
• Distances in coordinate space can be described as vectors
MVA Techniques
  Technique                          | Inter/Dependence | Exploratory/Confirmatory
  Factor analysis                    | Interdependence  | Exploratory / Confirmatory
  MDS                                | Interdependence  | Exploratory
  Cluster analysis                   | Interdependence  | Exploratory
  Canonical correlation              | Dependence       | Confirmatory
  SEM (structural equation modeling) | Dependence       | Confirmatory
  ANOVA                              | Dependence       | Confirmatory
  Discriminant analysis              | Dependence       | Confirmatory
  Logit choice model                 | Dependence       | Confirmatory
Source: Analyzing Multivariate Data, J.M. Lattin et al.
PCA
Dimensionality Reduction Overview
• Concept
  – Reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
  – Retains most of the sample's information
  – Useful for the compression and classification of data
  – By information we mean the variation present in the sample, given by the correlations between the original variables
  – The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains
• The curse of dimensionality
  – The risk of overfitting
  – Curse of dimensionality
• Main approaches for dimensionality reduction – projection; manifold learning
• Feature reduction algorithms
  – Unsupervised
    • Latent Semantic Indexing (LSI): truncated SVD • Independent Component Analysis (ICA) • Principal Component Analysis (PCA) • Canonical Correlation Analysis (CCA)
  – Supervised • Linear Discriminant Analysis (LDA)
  – Semi-supervised • a research topic
  – PCA vs. MDA • what is a "good" subspace?
• Projection
Geometric Meaning of PCs
• The 1st PC, Z1, is a minimum-distance fit to a line in X space
• The 2nd PC, Z2, is a minimum-distance fit to a line in the plane perpendicular to the 1st PC
• PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones
• Choosing the Right Number of Dimensions
Algebraic Definition of PCs
• Given a sample of n observations x_1, x_2, …, x_n on d variables,
• define the 1st PC of the sample by the linear transformation
    z_1 = a_1^T x_j = Σ_{i=1}^{d} a_{i1} x_{ij},   j = 1, 2, …, n,
  where the vectors
    a_1 = (a_11, a_21, ..., a_d1)
    x_j = (x_1j, x_2j, ..., x_dj)
  and a_1 is chosen such that var[z_1] is maximal.
Covariance
• For 2-dimensional data: cov(x, y)
• For 3-dimensional data: cov(x, y), cov(x, z), cov(y, z)
• For an n-dimensional data set: n! / ((n − 2)!·2) different covariance values
• So the covariance matrix for an n-dimensional data set is the n × n matrix C with entries C_ij = cov(Dim_i, Dim_j)
• Eigenvector?
  – A non-zero vector that, after being multiplied by the matrix, remains parallel to the original vector
  • To keep eigenvectors standard, we usually scale them to length 1, so that all eigenvectors have the same length
  • (Figure: a non-eigenvector vs. an eigenvector)
• Procedure
  – Step 1: obtain and prepare the data
    • Subtract the mean & compute the covariance matrix
  – Step 4: compute the eigenvectors and eigenvalues of the covariance matrix
  – Step 5: choose components and form a feature vector
    • The eigenvector with the highest eigenvalue is the principal component of the data set
    • The rest can be omitted…
  – Step 6: derive the new data set
    – RowFeatureVector = the matrix with the eigenvectors in the columns, transposed, with the most significant eigenvector at the top
    – RowDataAdjust = the mean-adjusted data, transposed (i.e., the data items are in each column, with each row holding a separate dimension)
    – Transform the original data onto the vectors we have chosen
    – The patterns are the lines that most closely describe the relationships between the data
  • Getting the old data back (see the sketch below)
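A minimal sketch of the steps above by hand, checked against prcomp() (iris is used for illustration):

  X  <- as.matrix(iris[, 1:4])
  Xc <- scale(X, center = TRUE, scale = FALSE)   # Step 1: subtract the mean
  e  <- eigen(cov(Xc))                           # Step 4: eigen of the covariance matrix
  e$values / sum(e$values)                       # fraction of variance per PC
  W  <- e$vectors[, 1:2]                         # Step 5: keep the top-2 eigenvectors
  Z  <- Xc %*% W                                 # Step 6: the new (projected) data
  X_back <- Z %*% t(W) + rep(colMeans(X), each = nrow(X))  # approximate reconstruction
  head(prcomp(X)$x[, 1:2])                       # prcomp agrees, up to sign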
• Biplot – shows the proportions of each variable along the 2 PCs
• Scree plot
PCA vs. LDA
• LDA is not guaranteed to be better for classification
  – Assumes classes are unimodal Gaussians
  – Fails when the discriminatory information is not in the mean but in the variance of the data
• Example where PCA gives the better projection
LDA (Linear Discriminant Analysis)
(Classifier families: frequency table – ZeroR, OneR, naive Bayes, decision trees; covariance matrix – LDA, logistic regression; similarity functions – KNN; others – ANN, SVM)
LDA
• LDA picks a new dimension that gives:
  – Maximum separation between the means of the projected classes
  – Minimum variance within each projected class
• Solution: eigenvectors based on the between-class and within-class covariance matrices
• Modeling the differences between groups for the purpose of separating 2 or more classes, objects, or categories
  – much like logit and probit models
• LDA seeks to reduce dimensionality while preserving as much of the (two-)class discriminatory information as possible
• (ex)
  – Assume D-dimensional samples, N1 of which belong to class w1 and N2 to class w2
  – Obtain a scalar y by projecting the samples x onto a line: y = w^T x
  • Select the projection that maximizes the separability of the scalars
• Examples
  – Discriminating students in high school who
    • will go to college
    • will go to trade school
    • will discontinue education
  – Some pattern must be there, so we collect
    • family background
    • academic information
  – Discriminating between male and female based on height
Theory
• Discrimination by comparing the means of variables • can involve several variables
• Assumes multivariate normality – the independent variables need to be continuous
• Homogeneous variance
• Creates an equation that minimizes the probability of misclassifying cases into their respective groups or categories:
  – D = a1·X1 + a2·X2 + ... + ai·Xi + b
  • where: D = the discriminant function score; X = the response score for that variable; a = the discriminant coefficient (analogous to a regression coefficient); b = a constant; i = the number of discriminant variables
• Based on the concept of searching for a linear combination of variables (predictors) that best separates two classes (targets)
• To capture the notion of separability, Fisher defined a score function of the standard form S(β) = β^T(μ1 − μ2) / (β^T C β) – the projected between-class separation relative to the within-class variance
• Given the score function, the problem is to estimate the linear coefficients that maximize the score, which is solved by the following equations:
  – β = C⁻¹(μ1 − μ2)                      (model coefficients)
  – C = (1/(n1 + n2))·(n1·C1 + n2·C2)     (pooled covariance matrix)
    » β: the linear model coefficients
    » C1, C2: the covariance matrices
    » μ1, μ2: the mean vectors
• Mahalanobis distance between the two groups
  – Δ² = β^T(μ1 − μ2), where Δ is the Mahalanobis distance between the two groups
  – A distance greater than 3 means that the two averages differ by more than 3 standard deviations
  – It means that the overlap (probability of misclassification) is quite small
• Finally, a new point is classified by projecting it onto the maximally separating direction, classifying it as C1 if:
  β^T (x − (μ1 + μ2)/2) > log(P(c1)/P(c2))
  • β: the coefficient vector
  • x: the data vector
  • μ1, μ2: the mean vectors
  • P(c1), P(c2): the class probabilities
LDA Example
• Judging the default risk of bank clients (small businesses)
  – Clients who defaulted (red squares) and those who did not (blue circles), separated by delinquent days (DAYSDELQ) and number of months in business (BUSAGE)
  – A discriminant model (default vs. non-default) using LDA
  – Data (number of observations = 100):
    BUSAGE | DAYSDELQ | DEFAULT | Z | Z − Z0 | Prediction
    87     | 13       | N       | … | …      | …
    89     | 27       | Y       | … | …      | …
    ...
• We use LDA to find an optimal linear model that best separates the two classes (default and non-default)
  – Z0 = 0.3985302; log(P(N)/P(Y)) = 0.4771212547
• The first step is to calculate the mean (average) vectors, covariance matrices, and class probabilities
• Then we calculate the pooled covariance matrix and, finally, the coefficients of the linear model
• Assume we have a point with BUSAGE = 111 and DAYSDELQ = 24, i.e., x = [111, 24]
  – Apply the rule β^T (x − (μ1 + μ2)/2) > log(p(c1)/p(c2)), where β = the coefficient vector, x = the data vector, (μ1 + μ2)/2 = the midpoint of the two mean vectors, and p(c1), p(c2) = the class probabilities:
    [−0.0095, −0.1408] · ([111, 24] − ([116.23, 16.89] + [115.04, 55.32])/2) >? log(0.75/0.25)
• A Mahalanobis distance of 2.32 shows a small overlap between the two groups, which means the linear model separates the classes well:
  – Δ² = β^T(μ1 − μ2) = 5.40, so Δ = 2.32
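A minimal sketch of the same analysis with MASS::lda(); the loans data frame with BUSAGE, DAYSDELQ, and DEFAULT is hypothetical, mirroring the slide's table:

  library(MASS)
  # prior = c(0.75, 0.25) matches P(N), P(Y) above; by default lda() estimates it
  fit <- lda(DEFAULT ~ BUSAGE + DAYSDELQ, data = loans, prior = c(0.75, 0.25))
  fit$scaling                                          # coefficients (cf. beta above)
  predict(fit, data.frame(BUSAGE = 111, DAYSDELQ = 24))$class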
EDA (Exploratory Data Analysis)
• Main uses
  – 1. Gain insight into a data set
  – 2. Identify variables with a critical impact on the data and understand their relationships
  – 3. Check for the presence of outliers
  – 4. Test the underlying assumptions of a data set
• Exploratory vs. confirmatory data analysis
  – Confirmatory data analysis
    • tests a hypothesis • settles questions • (inferential statistics)
  – Exploratory data analysis
    • finds a good description • raises new questions • (descriptive statistics)
• Exploratory Data Analysis (EDA) – an approach/philosophy for data analysis that employs a
variety of techniques (mostly graphical) to
• maximize insight into a data set;
• uncover underlying structure;
• extract important variables;
• detect outliers and anomalies;
• test underlying assumptions;
• develop parsimonious models; and
• determine optimal factor settings.
Three Approaches
• For classical analysis, the sequence is
  – Problem => Data => Model => Analysis => Conclusions
• For EDA, the sequence is
  – Problem => Data => Analysis => Model => Conclusions
• For Bayesian, the sequence is
  – Problem => Data => Model => Prior Distribution => Analysis => Conclusions
• The data analysis process – visualization and EDA
(Figure source: Wickham and Grolemund)
• Data munging – transforming data
  – Raw data → usable data
  – Data must be cleaned first
• Main tasks
  – Renaming variables
  – Data type conversion
  – Encoding, decoding, or recoding data
  – Merging data sets
  – Transforming data
  – Handling missing data (imputing)
  – Handling anomalous values
1. Technical Aspects
• (1) Reading data – read.table
  • read.delim, read.delim2, read.csv • read.csv2, read.table, read.fwf
  – A freshly read data.frame should always be inspected with functions like head, str, and summary
• (2) Type conversion – coercion
  • as.numeric, as.logical, as.integer • as.factor, as.character, as.ordered
  – Factor conversion • factor()
  – Date conversion • library(lubridate)
• (3) Strings and encodings
  – Sys.getlocale("LC_CTYPE")
  – f <- file("myUTF16file.txt", encoding = "UTF-16")
2. Consistent Data
• (1) Missing values
  – na.rm = TRUE
  – persons_complete <- na.omit(person)
• (2) The special-value problem – e.g.:
    is.special <- function(x) {
      if (is.numeric(x)) !is.finite(x) else is.na(x)
    }
• (3) The outlier problem
3. Corrections
• Applying substitute values (imputation) – the impute()/is.imputed() calls here are from, e.g., the Hmisc package:
    x <- 1:5              # create a vector...
    x[2] <- NA            # ...with an empty value
    x <- impute(x, mean)  # replace the NA with the mean
    x
    ## 1     2     3     4     5
    ## 1.00  3.25* 3.00  4.00  5.00
    is.imputed(x)

    # -- ratio imputation: fill missing x from a related, fully observed variable y
    I <- is.na(x)
    R <- sum(x[!I]) / sum(y[!I])
    x[I] <- R * y[I]

    # -- model-based imputation: predict the missing values from other variables
    data(iris)
    iris$Sepal.Length[1:10] <- NA
    model <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
    I <- is.na(iris$Sepal.Length)
    iris$Sepal.Length[I] <- predict(model, newdata = iris[I, ])
dplyr 기초
• 6가지의 주된 함수 – Pick observations by their values (filter()). – Reorder the rows (arrange()). – Pick variables by their names (select()). – Create new variables with functions of existing variables (mutate()). – Collapse many values down to a single summary (summarise()). – + group_by()
• changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
• 사용법
– The first argument is a data frame. – The subsequent arguments describe what to do with the data
frame, using the variable names (without quotes). – The result is a new data frame.
• Logical operations in dplyr's filter (see the sketch below)
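A minimal sketch of these verbs on R's built-in mtcars data (the dataset choice is ours, not from the slides):

  library(dplyr)
  mtcars %>%
    filter(cyl == 4 & mpg > 25) %>%   # pick rows by value (logical operators & | !)
    select(mpg, cyl, wt) %>%          # pick columns by name
    arrange(desc(mpg))                # reorder rows

  mtcars %>%
    group_by(cyl) %>%                 # each verb now works group-by-group
    summarise(mean_mpg = mean(mpg)) %>%
    mutate(rank = rank(-mean_mpg))    # new variable from an existing one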
Tidy Data Set
• Rules for a tidy dataset:
  – Each variable must have its own column.
– Each observation must have its own row.
– Each value must have its own cell.
> table4a
# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

> table4a %>%
+   gather(`1999`, `2000`, key = "year", value = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Brazil      1999   37737
3 China       1999  212258
4 Afghanistan 2000    2666
5 Brazil      2000   80488
6 China       2000  213766
Relational data
• A primary key – uniquely identifies an observation in its own table. – (ex) planes$tailnum is a primary key
• A foreign key – uniquely identifies an observation in another table. – (ex) flights$tailnum is a foreign key
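A small join sketch connecting the two keys above, assuming the nycflights13 package that these planes/flights tables come from:

  library(dplyr)
  library(nycflights13)  # provides the flights and planes tables

  # flights$tailnum (foreign key) matches planes$tailnum (primary key)
  flights %>%
    left_join(planes, by = "tailnum") %>%  # keep every flight, add plane info
    select(tailnum, model, seats) %>%
    head()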
Visualization
(v.0.9) 242
Overview
• Base Graphics – plot()
• hist() and boxplot().
– points(), lines(), text(), mtext(), axis(), rug(), identify()
• Specialized packages:
  – http://www.computerworld.com/article/2921176/business-intelligence/great-r-packages-for-data-import-wrangling-visualization.html
ggplot2
• Characteristics of ggplot2
  – (Strength) Consistent underlying grammar of graphics
  – (Limitation) Things you cannot do:
    • 3-dimensional graphics
    • Graph-theory type graphs (nodes/edges layout)
• Anatomy of a plot:
  – data, aesthetic mapping
  – geometric objects, statistical transformations
  – scales, coordinate system
  – position adjustments, faceting
• ggplot2 vs. Base Graphics
  – ggplot2 is more verbose for simple / canned graphics
  – is less verbose for complex / custom graphics
  – does not have methods (data should always be in a data.frame)
  – uses a different system for adding plot elements
• Geometric Objects And Aesthetics – Aesthetic Mapping
• In ggplot, an aesthetic is "something you can see" – position (i.e., on the x and y axes)
– color ("outside" color)
– fill ("inside" color)
– shape (of points)
– linetype
– size
• > aes()
– Geometric Objects • = actual marks we put on a plot
– points (geom_point, for scatter plots, dot plots, etc)
– lines (geom_line, for time series, trend lines, etc)
– boxplot (geom_boxplot, for, well, boxplots!)
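A minimal sketch combining an aesthetic mapping with geometric objects, on the built-in mtcars data (our example, not the slides'):

  library(ggplot2)
  # data + aesthetic mapping + geom layers
  ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +  # mapping
    geom_point(size = 2) +                                     # scatter layer
    geom_smooth(method = "lm", se = FALSE)                     # per-group trend lines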
Lattice Graphics
• Lattice = a flavour of trellis graphics – For lattice, graphics formulae are mandatory.
– grid = a low-level graphics system. It was used to build lattice.
• Lattice vs. base graphics – xyplot() vs. plot() – plot() gives a graph as a side effect of the command.
– xyplot() generates a graphics object.
• As this is output to the command line, the object is “printed”, i.e., a graph appears.
246
graph_type   description                formula example
barchart     bar chart                  x~A or A~x
bwplot       boxplot                    x~A or A~x
cloud        3D scatterplot             z~x*y|A
contourplot  3D contour plot            z~x*y
densityplot  kernel density plot        ~x|A*B
dotplot      dotplot                    ~x|A
histogram    histogram                  ~x
levelplot    3D level plot              z~y*x
parallel     parallel coordinates plot  data frame
splom        scatterplot matrix         data frame
stripplot    strip plots                A~x or x~A
xyplot       scatterplot                y~x|A
wireframe    3D wireframe graph         z~y*x
247
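A minimal lattice sketch that also illustrates the object-vs-side-effect point above (our mtcars example):

  library(lattice)
  # scatterplot of mpg vs. wt, conditioned on the number of cylinders
  p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars, layout = c(3, 1),
              xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
  print(p)  # xyplot() returns a graphics object; printing it draws the plot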
Machine Learning Modeling
248
Machine Learning?
• Concept – = subfield of Artificial Intelligence (AI)
– “construction and study of systems that can learn from data”
• Types
  – http://en.wikipedia.org/wiki/Machine_learning
• Methodology
  – “A computer program learns from experience (E) with some class of tasks (T) and a performance measure (P) if its performance at tasks in T, as measured by P, improves with E”
249
• Terminology
  – Features
    • = distinct traits that can be used to describe each item in a quantitative manner.
  – Samples
    • an item to process (e.g. classify).
    • a document, a picture, a sound, a video, a row in a database or CSV file, …
  – Feature vector
    • an n-dimensional vector of numerical features that represents some object.
  – Feature extraction
    • preparation of the feature vector – transforms the data in the high-dimensional space to a space of fewer dimensions.
  – Training/Evaluation set
    • set of data used to discover potentially predictive relationships.
250
Learning vs. Training
251
Procedure
252
Types
• Supervised Learning
• Unsupervised Learning
• Semi-Supervised Learning
• Reinforcement Learning
  – allows the machine or software agent to learn its behavior based on feedback from the environment.
  – This behavior can be learnt once and for all, or keep on adapting as time goes by.
253
Types of ML Algorithms
• Predictive model
  – = aims to discover or model the relationship between the target variable and the other features
  – = supervised learning → clear instruction on what to learn and how (here the target values, not a person, provide the supervisory role)
• Descriptive model
  – no target to learn → no single feature is more important than the others
  – (ex) pattern discovery (market basket analysis, clustering)
254
Supervised Learning: kNN, (Naïve) Bayes, Decision Tree, (Classification Rule Learning), Linear Regression, Model Tree, Neural Net, SVM
Unsupervised Learning: Association Rules, k-means, …
Other Types: Incremental Learning, …
255
KNN
256
KNN
• = classify unlabeled examples by assigning them to the class of the most similar labeled example
• [Example] Assigning a tomato to a class via blind testing
257
258
259
260
• Distance calculation
  – Euclidean distance
  – Manhattan distance
• Nearest neighbors
  – 1-NN: the single nearest neighbor is an orange, so classify the tomato as a fruit
  – 3-NN: take a vote among the 3 nearest neighbors
• Choosing an appropriate value of K

  Large K                   Small K
  Variance decreases        Bias decreases
  but risks underfitting    but risks overfitting

  – With K = 1, a single noisy point or outlier can decide the classification
  – In practice: depends on the complexity of the concept to be learned and the number of training records 261
Procedure
• Data preparation – transform features to a standard range
  • shrinking/rescaling → min-max normalization
  • Z-score standardization
  – for nominal features, use dummy coding
    • for ordinal data, assign numbers and then normalize
• Characteristics
  – lazy learning → no abstraction, no generalization
  – instead: instance-based learning
  – non-parametric learning
A minimal kNN sketch follows.
262
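A sketch of min-max normalization followed by a 3-NN classification, using the built-in iris data (our example; class::knn() is the standard implementation):

  library(class)  # provides knn()

  # min-max normalization to a standard [0, 1] range
  normalize <- function(x) (x - min(x)) / (max(x) - min(x))
  iris_n <- as.data.frame(lapply(iris[1:4], normalize))

  set.seed(1)
  idx  <- sample(nrow(iris_n), 100)             # 100 training rows
  pred <- knn(train = iris_n[idx, ],
              test  = iris_n[-idx, ],
              cl    = iris$Species[idx], k = 3) # 3-NN vote
  table(pred, iris$Species[-idx])               # confusion table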
Applications
• Voronoi diagram
  – the decision surface induced by the training examples
263
BAYESIAN AND NAÏVE BAYES
264
Basic Concepts
• Background:
  – the estimated likelihood of an event should be based on the evidence at hand.
• Probability
265
• Conditional probability – Bayes’ theorem
  – Prior probability
    • the most reasonable guess would be the probability that any prior message was spam (~20% in the example).
  – Likelihood
    • the probability that the word “Viagra” was used in previous spam messages
  – Marginal likelihood
    • the probability that “Viagra” appeared in any message at all.
• Posterior probability
  – use Bayes' theorem to compute the posterior probability that a message is spam
  – IF (> 50%) THEN the message is likely to be spam → it should be filtered.
266
• Applying Bayes’ theorem
• Frequency table
  • P(spam ∩ Viagra) = P(Viagra|spam) * P(spam) = (4/20) * (20/100) = 0.04
  • P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra) = (4/20) * (20/100) / (5/100) = 0.80
267
Naïve Bayes Classification
• Extending the spam mail example
  – train by constructing a likelihood table for the appearance of 4 words:
  – probability computation, e.g. for Viagra=Y, Money=N, Groceries=N, Unsubscribe=Y
  – use class-conditional independence to simplify the computation
268
– Likelihood of spam:
  • (4/20) * (10/20) * (20/20) * (12/20) * (20/100) = 0.012
  • probability of spam: 0.012 / (0.012 + 0.002) = 0.857
– Likelihood of ham:
  • (1/80) * (66/80) * (71/80) * (23/80) * (80/100) = 0.002
  • probability of ham: 0.002 / (0.012 + 0.002) = 0.143
– => expect that the message is spam with probability 85.7%, ham with 14.3%.
  • That is, “this message is 6 times more likely to be spam than ham.”
– In general, the probability of level L for class C, given the evidence provided by features F1 … Fn, is:
269
• Laplace estimator
  – (IF) a message contains: Viagra, Groceries, Money, Unsubscribe.
    • likelihood of spam under the naive Bayes algorithm:
      – (4/20) * (10/20) * (0/20) * (12/20) * (20/100) = 0
    • and the likelihood of ham is:
      – (1/80) * (14/80) * (8/80) * (23/80) * (80/100) = 0.00005
    • probability of spam: 0 / (0 + 0.00005) = 0; probability of ham: 0.00005 / (0 + 0.00005) = 1
  – (Solution) Laplace estimator
    • add a number (typically 1) to each count in the frequency table → ensures that each feature has a nonzero probability of occurring with each class.
• Using numeric features in Naïve Bayes
  – by discretizing/binning
270
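A minimal naive Bayes sketch with the Laplace estimator; e1071::naiveBayes() is one standard implementation, and iris stands in for the spam data here:

  library(e1071)  # naiveBayes(); laplace = 1 applies the Laplace estimator
  model <- naiveBayes(Species ~ ., data = iris, laplace = 1)
  pred  <- predict(model, iris)
  table(pred, iris$Species)  # resubstitution confusion table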
Day 4
271
DECISION TREES
272
Decision Trees
• Example of a decision tree
  – = a flow-chart-like tree structure
• Leaf node
  – = class label or class label distribution
273
• Heuristic
  – recursive partitioning.
  – starting from the root node,
  – choose the feature that best separates the target classes to form the first tree branch,
  – then divide-and-conquer the nodes until a stopping condition is met
• (Example) Movie classification
  – predict each movie into one of 3 categories: mainstream hit / critics' choice / box-office bust
  – data: examine movie scripts and analyze patterns
  – scatter plot
  – each film's proposed shooting budget / the number of A-list celebrities in starring roles / the category of success
274
275
276
C5.0 decision tree algorithm
• Choosing the best split
  – on which feature should we split?
  – purity criterion: entropy
    • entropy = 0: the data is completely homogeneous
    • entropy = 1: maximum disorder
  – example: for red (60%) vs. white (40%), the entropy is
    −0.60 × log2(0.60) − 0.40 × log2(0.40) ≈ 0.971
  – extending this to all possible two-class proportions with the curve() function (sketched below)
277
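The curve() call referenced above can be written as follows; it shows that two-class entropy peaks at a 50/50 split:

  # two-class entropy as the proportion x of one class varies from 0 to 1
  curve(-x * log2(x) - (1 - x) * log2(1 - x),
        col = "red", lwd = 2, xlab = "x", ylab = "Entropy")
  # the 60/40 example: -0.60 * log2(0.60) - 0.40 * log2(0.40)  # ≈ 0.971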
• Where should the split point be?
  – IG (Information Gain) = compare the entropy before the split with the entropy after the split; the post-split entropy is weighted by the size of each partition.
  – for each feature, compute how much homogeneity would change if we split on it
  – the higher the IG, the more homogeneous the resulting groups.
    • IF IG = 0: no reduction in entropy
    • ELSE IF IG is maximal: entropy after the split = 0, i.e., the split produces completely homogeneous groups!
• Pruning the decision tree
  – a decision tree can continue to grow indefinitely (i.e., become overly specific), hence pre-pruning or post-pruning
278
• Determining the best split – node impurity
279
• Building a decision tree
  – Tree construction
    • all the training examples start at the root.
  – Tree pruning
    • remove branches caused by noise in the data → improves classification accuracy
• Tree induction
  – Greedy strategy
    • split the records based on an attribute test that optimizes a certain criterion.
  – Issues
    • determine how to split the records
      – How to specify the attribute test conditions?
      – How to determine the best split?
    • determine when to stop splitting
280
– Measures of node impurity
  • Information Gain
    – uses entropy
    – used in the ID3 algorithm
  • Gain Ratio
    – uses IG and SplitInfo
    – used in the C4.5 algorithm
  • Gini index
    – usable only for binary splits
    – used in the CART algorithm
281
– Entropy calculation
  • = degree of impurity of S
    » S: a set of examples
    » p: proportion of positive examples
    » q: proportion of negative examples
– Gain(T, X)
  • = Entropy(T) − Entropy(T, X)
282
• Avoiding overfitting
  – Prepruning
    • stop tree construction early – do not split a node if its goodness measure falls below a threshold
  – Postpruning
    • remove branches from a "fully grown" tree → get a sequence of progressively pruned trees
    • use data different from the training data to decide which is the "best pruned tree"
283
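A minimal C5.0 sketch on the built-in iris data; the C50 package implements the algorithm described above, including pruning:

  library(C50)  # C5.0 implementation for R
  model <- C5.0(Species ~ ., data = iris)
  summary(model)  # shows the chosen splits and the training error
  plot(model)     # draws the tree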
SVM
284
What is an SVM?
• Classification with a hyperplane
  – classify so that both sides of a flat boundary (= hyperplane) form fairly homogeneous partitions
  – ≈ instance-based nearest neighbors + linear regression
285
Finding the maximum margin
• Searching for the MMH (maximum-margin hyperplane)
  – creates the greatest separation between the 2 classes
• Support vectors
  – = the points from each class that are closest to the MMH
• The algorithm relies on vector geometry
286
• For linearly separable data
  – Search
    • the MMH is as far away as possible from the convex hulls (= outer boundaries) of the two data groups
• Hyperplane
  – in n-dimensional space,
    • (goal) find the set of weights that specifies the hyperplane
• In n-dimensional space, the following equation is used:
• Using this formula, the goal of the process is to find a set of weights that specify two hyperplanes.
• Vector geometry defines the distance between these two planes as 2 / ||w||;
• so in order to maximize the distance, we need to minimize ||w||. To facilitate finding the solution, the task is typically re-expressed as a set of constraints:
288
Applying kernels to non-linear spaces
• ; a kernel maps the problem into a higher-dimensional space
• SVMs with non-linear kernels add additional dimensions to the data and can thereby create separation.
• Kernel functions (see the sketch below)
  – Linear kernel
  – Polynomial kernel
  – Sigmoid kernel
  – Gaussian RBF kernel
289
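A minimal SVM sketch with a Gaussian RBF kernel, using e1071::svm() on iris (our example; the kernel argument selects among the four kernels listed above):

  library(e1071)
  # kernel = "linear", "polynomial" or "sigmoid" select the other kernels
  model <- svm(Species ~ ., data = iris,
               kernel = "radial", cost = 1, gamma = 0.25)
  table(predict(model, iris), iris$Species)  # resubstitution confusion table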
290
NEURAL NETWORKS AND DEEP LEARNING
291
Understanding Neural Networks
• Overview
  – an attempt to model the relationship between input and output signals, following a model of the human brain
  – = a network of interconnected cells (= neurons) that creates a massive parallel processor
    • 85 billion neurons (human) (cf. mouse: 75 million, fly: 100k)
  – the ultimate comparison is the Turing test
  – early versions handled only simple logic comparisons → rapid development since
292
From biological to artificial neurons
• (Description)
  – incoming signals arrive via the dendrites
    • = a biochemical process that allows each impulse to be weighted according to its relative importance or frequency
    • as the cell body accumulates incoming signals, a threshold is reached → the cell fires → an output signal is transmitted down the axon via an electrochemical process; at the synapse (= a tiny gap), chemical signals are passed to neighboring neurons.
  – X variables ≈ dendrites, Σ ≈ the directed network diagram, f = activation function
293
• Elements that determine the characteristics of an NN:
  – Activation function
    • transforms a neuron's net input signal into a single output signal to be broadcast further in the network
  – Network topology (or architecture)
    • the number of neurons in the model + the number of connected layers
  – Training algorithm
    • how connection weights are set to inhibit or excite neurons in proportion to the input signal
294
Activation Functions
• Threshold activation function → modeled after nature
  – unit step activation function
  – sigmoid activation function
    • ; not binary, differentiable
295
• Other types of activation functions
296
• Differences among the various activation functions:
  – the output signal range – typically one of (0, 1), (-1, +1), or (-inf, +inf).
  – this allows the construction of specialized neural networks.
    » (ex) linear activation function → linear regression model
    » (ex) Gaussian activation function → RBF network model
• The squashing problem
  – For many of the activation functions, the range of input values that affect the output signal is relatively narrow. (Example: for the sigmoid, the output signal is essentially 0 or 1 for any input signal below -5 or above +5, respectively.) Compressing the signal this way saturates it at the high and low ends of very dynamic inputs, just as turning a guitar amplifier up too high distorts the sound by clipping the peaks of the sound waves.
  – i.e., the function squeezes input values into a smaller range of outputs
• Solution
  – transform all neural network inputs so that the feature values fall within a small range around 0, typically by standardizing or normalizing the features. Limiting the input values keeps the activation function active across its entire range and prevents large-valued features such as household income from dominating small-valued features such as the number of children in a household.
297
Training Neural Networks with Backpropagation
• The network's connection weights come to reflect the patterns observed over time → backpropagation
• The backpropagation algorithm iterates through many cycles of two processes; each iteration of the algorithm is known as an epoch.
  – Because the network contains no a priori (existing) knowledge, the weights are typically set randomly before beginning. The algorithm then cycles through the processes until a stopping criterion is reached. Each cycle includes:
    • a forward phase
    • a backward phase
Strengths:
• Can be adapted to classification or numeric prediction problems
• Among the most accurate modeling approaches
• Makes few assumptions about the data's underlying relationships

Weaknesses:
• Reputation for being computationally intensive and slow to train, particularly if the network topology is complex
• Easy to overfit or underfit training data
• Results in a complex black-box model that is difficult if not impossible to interpret
298
• Gradient descent
  – determines how much (and whether) a weight should be changed, using error information sent backward to reduce the total error
  – the backpropagation algorithm uses the derivative of each neuron's activation function to identify the gradient in the direction of each of the incoming weights – hence the importance of having a differentiable activation function.
• Learning rate
  – the gradient suggests how steeply the error will be reduced or increased by a change in the weight.
  – the algorithm changes each weight in the direction of greatest error reduction, by an amount scaled by the learning rate.
299
• = a 1st-order optimization algorithm
  – Optimization
    • = finding the “best” value of a function, here the minimum value of the function.
  – To find a local minimum of a function, one takes steps proportional to the negative of the gradient of the function at the current point.
300
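A minimal 1-D gradient descent sketch of the rule just described (our example function, not the slides'):

  # gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
  grad <- function(x) 2 * (x - 3)
  x  <- 0     # starting point
  lr <- 0.1   # learning rate
  for (i in 1:50) {
    x <- x - lr * grad(x)  # step against the gradient
  }
  x  # approaches the minimum at x = 3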
• Gradient is the slope of a function.
– The number of “turning points” of a function depends on the order of the function.
• Not all turning points are minima.
– The least of all the minimum points is called the “global” minimum.
– Every minimum is a “local” minimum.
301
[Figure: a curve f(x) vs. x, annotated with its global maximum, local minimum, inflection point, and global minimum]
Gradient Descent
302
DEEP LEARNING
303
Concepts
• Perceptron
  – artificial neuron (receives multiple input signals and emits one output signal)
Perceptrons and Logic Circuits
• Truth tables
  – AND
  – NAND
  – OR – e.g. (b, w1, w2) = (-0.5, 1.0, 1.0)
  – XOR → requires an MLP (multi-layer perceptron)
ANN
• Neural network
  – bias
  – activation functions
    • sigmoid function
      – implementation: h(x) = 1 / (1 + exp(-x))
    • step function
    • comparison of the sigmoid and the step function
    • non-linear functions
    • ReLU function
  – inner product in an NN
    – Inner_Product_NN.py
X · W = Y, with shapes (1×2), (2×3), and (1×3)
– Signal propagation between layers (3-layer example)
  a1(1) = w11(1)·x1 + w12(1)·x2 + b1(1)
  • Input layer → Layer 1
  • Layer 1 → Layer 2
  • Layer 2 → Output layer
– Output layer
  • identity function
  • softmax
    y_k = exp(a_k) / Σ_{i=1}^{n} exp(a_i)
    – characteristics
      » the outputs sum to 1
      » y = exp(x) is monotonic → the ordering of the elements is preserved
      » but beware of overflow – it needs a fix (see the sketch below)!
– Batch processing
  • bundle many inputs into one predict() call (e.g., 100 images)
  – mini-batch
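The course's own implementation is Python (Inner_Product_NN.py); here is the same overflow fix sketched in R for consistency with the rest of these notes:

  # naive softmax overflows: exp(1010) is Inf in double precision
  softmax <- function(a) {
    exp_a <- exp(a - max(a))  # subtracting max(a) leaves the result unchanged
    exp_a / sum(exp_a)
  }
  softmax(c(1010, 1000, 990))  # works; the outputs sum to 1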
Training an NN – Cost
• Cross-entropy loss
  E = − Σ_k t_k · log(y_k)
• Mean squared error
  E = (1/2) Σ_k (y_k − t_k)²
• Misclassification rate
• L1 loss
  – = LAD (least absolute deviation)
  – cf. L2 loss = least squares
Optimizers
• SGD with Momentum
• RMS propagation
• Adagrad
• Adadelta
• Adam
Deep Architecture
• composed of multiple levels of non-linear operations – e.g., neural nets with many hidden layers
318
• Deep Learning – Deep Neural Networks
• (focused on developing these networks)
319
PACKAGE AVAILABLE ARCHITECTURE
MXNetR Feed-forward neural network, CNN
Darch Restricted Boltzmann machine, deep belief network
Deepnet Feed-forward NN, restricted Boltzmann machine, deep belief network, stacked autoencoders
H2O Feed-forward neural network, deep autoencoders
Deepr H2O + deepnet
CNN (Convolutional Neural Networks)
320
C layers are convolutions, S layers pool/sample. A CNN often starts with fairly raw features at the initial input and discovers an improved feature layer for the final supervised learner, e.g., an MLP trained with backpropagation.
Deep Belief Network
321
UNSUPERVISED LEARNING
322
Overview
• Unsupervised learning
  – discovering how the data is organized.
  – unlike supervised learning or reinforcement learning, no target values are given for the inputs.
• Characteristics
  – density estimation, in statistical terms
  – summarizes and explains the key features of the data
• Representative examples
  – clustering
  – Independent Component Analysis (ICA)
323
Clustering
• Automatically divide the data into clusters, or groups of similar items, without being told ahead of time what the groups should look like.
  – unsupervised
• Guided by the principle that records inside a cluster should be very similar to each other, but very different from those outside.
• The class labels obtained from an unsupervised classifier have no intrinsic meaning.
• Clustering as a machine learning task
324
325
Types of Clustering Approaches
• 1. Hard clustering
  – a given data point belongs to exactly one cluster.
  – = exclusive clustering.
  – e.g., k-means clustering
• 2. Soft clustering
  – a given data point can belong to more than one cluster.
  – = overlapping clustering.
  – e.g., the fuzzy k-means algorithm
• 3. Hierarchical clustering
  – builds a hierarchy of clusters using a top-down (divisive) or bottom-up (agglomerative) approach.
• 4. Flat clustering
  – a simple technique where no hierarchy is present.
• 5. Model-based clustering
  – the data is modeled using a standard statistical model to work with different distributions; the idea is to find the model that best fits the data.
326
• Clustering as a machine learning task
  – (example) seat assignment at a conference
    • (i) computer/DB,
    • (ii) math/stats,
    • (iii) machine learning
  – without (or having forgotten) a survey: assign seats by examining each attendee's research history instead.
327
Categories of Clustering Algorithms
• Partitional clustering
  – directly decomposes the data set into a set of disjoint clusters.
  – CLARA, CLARANS
• Hierarchical clustering
  – proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters → dendrogram
  – agglomerative / divisive
• Density-based clustering
  – groups neighbouring objects of a data set into clusters based on density conditions.
  – DBSCAN
• Grid-based clustering
  – mainly proposed for spatial data mining.
  – STING
328
k-means
• The k-means algorithm for clustering
  – using distance to assign and update clusters
  – choosing the appropriate number of clusters
• Finds a locally optimal solution via a heuristic procedure.
  – Unless k and n are extremely small, it is not feasible to compute the optimal clusters across all possible combinations of examples. So the algorithm starts with an initial guess for the cluster assignments, then modifies the assignments slightly to see if the changes improve the homogeneity within the clusters.
• (2 phases)
  – (i) assign examples to an initial set of k clusters.
  – (ii) update the assignments by adjusting the cluster centers.
  – Updating and assigning repeat until further changes no longer improve the cluster fit.
    • If the results vary dramatically across runs, this could indicate a problem: the data may not have natural groupings, or the value of k may have been poorly chosen.
    • So run the cluster analysis more than once to test the robustness of your findings.
329
• Using distance to assign and update clusters
  – choose k points (in the feature space) to serve as the cluster centers. Often the points are chosen by selecting k random examples from the training dataset. Because we hope to identify three clusters, k = 3 points are selected.
  – these points are indicated by the star, triangle, and diamond in the following figure:
330
– Update phase
  • shift the initial centers to a new location (= the centroid, calculated as the mean value of the points currently assigned to that cluster).
  • Cluster A claims an additional example from Cluster B.
  • the update phases continue.
331
– The learned clusters can be reported in one of two ways:
  • simply report the cluster assignment of each example, or
  • report the coordinates of the cluster centroids after the final update.
332
• Choosing the appropriate number of clusters
  – a bias vs. variance tradeoff
  – use a priori knowledge and business requirements
• Elbow method (see the sketch below)
  – attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k.
  – choose k at the elbow point
333
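A minimal elbow-method sketch with base R's kmeans() on the iris measurements (our example data):

  set.seed(42)
  X <- scale(iris[, 1:4])  # standardize the features first

  # total within-cluster sum of squares for k = 1..8
  wss <- sapply(1:8, function(k)
    kmeans(X, centers = k, nstart = 10)$tot.withinss)
  plot(1:8, wss, type = "b",
       xlab = "k", ylab = "Total within-cluster SS")
  # choose k near the bend ("elbow") of this curve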
k-medoids
• Similar to k-means
• Differences
  – each cluster is represented by the object closest to the center of the cluster → more robust to outliers
    • cf. the centroid in the k-means algorithm
  – PAM (Partitioning Around Medoids) → CLARA algorithm
    • draws multiple samples of the data, applies PAM to each sample, and returns the best clustering; it performs better than PAM on larger data.
  – package cluster: functions pam() and clara()
  – package fpc: function pamk() does not require choosing k in advance
334
Density-based Clustering
• Overview
  – groups objects into clusters by densely populated areas
• DBSCAN algorithm (package fpc)
  – (2 key parameters)
    • eps: reachability distance (= size of the neighborhood)
    • MinPts: minimum number of points
  – If the number of points in the neighborhood of point α is no less than MinPts, then α is a dense point. All the points in its neighborhood are density-reachable from α and are put into the same cluster as α.
  – Advantages
    • can discover clusters of various shapes and sizes
    • insensitive to noise
335
ASSOCIATION RULES
Concepts
Algorithms: Apriori, ECLAT and FP-growth
Interestingness measures
Applications
Association Rules
• Discovering association rules
  – itemsets that occur together frequently
  – A ⇒ B, where A and B are items or attribute-value pairs.
• Rules
  – database tuples having the items on the left-hand side of the rule are also likely to have the items on the right-hand side.
  – rules presenting association or correlation between itemsets.
• Examples:
  – bread ⇒ butter
  – computer ⇒ software
  – age in [20,29] & income in [60K,100K] ⇒ buying up-to-date mobile handsets
• support(A ⇒ B) = P(A & B); confidence(A ⇒ B) = P(A & B) / P(A)
  – where P(A) is the % (or probability) of cases containing A.
Example
• Out of 100 students,
  – 10 know data mining techniques,
  – 8 know the R language,
  – and 6 know both. Then for the rule
  – knows R ⇒ knows data mining:
  – support = P(R & data mining) = 6/100 = 0.06
  – confidence = support / P(R) = 0.06 / 0.08 = 0.75
  – lift = confidence / P(data mining) = 0.75 / 0.10 = 7.5
Association Rule Mining
• 2 steps:
  – frequent itemset generation
    • is computationally intensive.
    • finds all frequent itemsets whose support is no less than a minimum support threshold;
    • the number of possible itemsets is 2^n − 1 (where n is the number of unique items)
    • algorithms: Apriori, ECLAT, FP-Growth
  – from these frequent itemsets, discover the association rules
    • rules with confidence above a minimum confidence threshold
    • comparatively straightforward
Downward-Closure Property
• anti-monotonicity
  – For a frequent itemset, all its subsets are also frequent:
    if {A,B} is frequent, then both {A} and {B} are frequent.
  – For an infrequent itemset, all its supersets are infrequent:
    if {A} is infrequent, then {A,B}, {A,C} and {A,B,C} are infrequent.
  – useful for pruning candidate itemsets
Itemset Lattice
Apriori
• Apriori [Agrawal and Srikant, 1994]: a classic algorithm
  – a level-wise, breadth-first algorithm
  – counts transactions to find frequent itemsets
  – generates candidate itemsets by exploiting the downward-closure property of support
• The Apriori process (an R sketch follows the figure)
  – 1. Find all frequent 1-itemsets L1
  – 2. Join step: generate candidate k-itemsets by joining Lk−1 with itself
  – 3. Prune step: prune candidate k-itemsets using the downward-closure property
  – 4. Scan the dataset to count the frequency of candidate k-itemsets and select the frequent k-itemsets Lk
  – 5. Repeat the above process until no more frequent itemsets can be found.
From [Zaki and Meira, 2014]
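A minimal Apriori sketch with the arules package and its bundled Groceries transactions (our example thresholds):

  library(arules)
  data(Groceries)  # transaction data bundled with arules

  rules <- apriori(Groceries,
                   parameter = list(supp = 0.01, conf = 0.5))
  inspect(head(sort(rules, by = "lift"), 3))  # top rules by lift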
FP-growth
• frequent-pattern growth, which mines frequent itemsets without candidate generation [Han et al., 2004]
• Compresses the input database creating an FP-tree instance to represent frequent items.
• Divides the compressed database into a set of conditional databases, each one associated with one frequent pattern.
• Each such database is mined separately.
• It reduces search costs by looking for short patterns recursively and then concatenating them into long frequent patterns.
• https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm
FP-tree
From [Han, 2005]
ECLAT
• A depth-first search algorithm using set intersection
  – ECLAT: equivalence class transformation [Zaki et al., 1997]
  – works recursively.
  – the initial call uses all single items with their tid-sets
• Idea:
  – use tid-set intersection to compute the support of a candidate itemset, avoiding the generation of subsets that do not exist in the prefix tree.
  – t(AB) = t(A) ∩ t(B)
  – support(AB) = |t(AB)|
• Eclat
  – intersects the tid-sets only if the frequent itemsets share a common prefix.
  – traverses the prefix search tree in a DFS-like manner, processing a group of itemsets that have the same prefix, also called a prefix equivalence class.
From [Zaki and Meira, 2014]
Interestingness Measures
• 2 categories: subjective and objective
• Objective measures
  – e.g., lift, odds ratio, and conviction
  – often data-driven; they express interestingness in terms of statistics or information theory.
• Subjective (user-driven) measures
  – e.g., unexpectedness and actionability
  – focus on finding interesting patterns by matching against a given set of user beliefs.
Applications
• Market basket analysis
• Churn analysis and selective marketing
• Credit card risk analysis
• Stock market analysis
• Medical diagnosis
Redundant Rules
• There are often too many association rules discovered from a dataset.
• It is necessary to remove redundant rules before a user can study the rules and identify the interesting ones.
• Removing redundant rules and examining the remaining rules
Interpretation and visualization
MEASURES OF MODEL PERFORMANCE
353
Overview
• Measuring classification performance
  – working with classification prediction data in R
  – a closer look at confusion matrices
  – measuring performance with confusion matrices
  – beyond accuracy – other measures of performance
    • the kappa statistic
    • sensitivity and specificity
    • precision and recall
    • the F-measure
  – visualizing performance tradeoffs
    • ROC curves
• Estimating future performance
  – holdout method
  – cross-validation
  – bootstrap sampling
• Summary
354
Measuring Classification Performance
• Accuracy = (number of correct predictions) / (total number of predictions)
• But: the class imbalance problem
  – = the trouble associated with data in which a large majority of records belong to a single class
  – hence we need model performance measures beyond raw accuracy
355
Working with Classification Prediction Data
• 3 major types of data used to evaluate a classifier
  – actual class values
    • > actual_outcome <- testdata$outcome
  – predicted class values
    • > predicted_outcome <- predict(model, testdata)
  – estimated probability of the prediction
    • = the model's internal prediction probability
• Usually, predict() allows you to specify the type of prediction.
  – (ex) class types: prob, posterior, raw
  – (example) sms_results.csv – the model is sometimes very confident and sometimes gives a less extreme probability
  – In spite of such mistakes, is the model still useful?
356
Confusion Matrices in Detail
• Confusion matrix
  – = a table that categorizes predictions according to whether they match the actual value in the data
357
Measuring Performance with Confusion Matrices
• accuracy = (TP + TN) / (TP + TN + FP + FN)
• error rate = (FP + FN) / (TP + TN + FP + FN) = 1 − accuracy
• Example:
  table(sms_results$actual_type, sms_results$predict_type)
  xtabs(~ actual_type + predict_type, sms_results)
  library(gmodels)
  CrossTable(sms_results$actual_type, sms_results$predict_type)
  (154 + 1202) / (154 + 1202 + 5 + 29)  # accuracy
  (5 + 29) / (154 + 1202 + 5 + 29)      # error rate
  1 - 0.9755396                         # error rate = 1 - accuracy
  library(caret)
  confusionMatrix(sms_results$predict_type, sms_results$actual_type,
                  positive = "spam")
358
Beyond Accuracy – Other Performance Measures
• The kappa statistic
  – adjusts accuracy by accounting for the possibility of a correct prediction by chance alone.
  – κ = 1 indicates perfect agreement between the model's predictions and the true values – a rare occurrence; κ ≤ 0 indicates agreement no better than chance.
  – (common interpretation:)
    • Poor agreement = less than 0.20; Fair agreement = 0.20 to 0.40
    • Moderate agreement = 0.40 to 0.60; Good agreement = 0.60 to 0.80
    • Very good agreement = 0.80 to 1.00
  – κ = (Pr(a) − Pr(e)) / (1 − Pr(e))
  – note that there are several ways to compute the kappa statistic
359
• Sensitivity and specificity
  – balance the tradeoff between being overly conservative and overly aggressive
  – example: an e-mail filter ; …
  – a model's sensitivity (= true positive rate)
    • = the proportion of positive examples that were correctly classified.
    • = number of true positives / total number of positives in the data
      – the denominator counts those correctly classified (the true positives) as well as those incorrectly classified (the false negatives).
    • sensitivity = TP / (TP + FN)
  – a model's specificity (= true negative rate)
    • = the proportion of negative examples that were correctly classified.
    • = number of true negatives / total number of negatives
    • specificity = TN / (TN + FP)
360
• Precision and recall
  – precision (= positive predictive value)
    • = the proportion of positive predictions that are truly positive; i.e., when the model predicts the positive class, how often is it correct? A precise model only predicts the positive class in cases very likely to be positive; it is very trustworthy.
    • precision = TP / (TP + FP)
  – recall
    • measures how complete the results are = number of true positives / total number of positives
    • the same formula as sensitivity; only the interpretation differs.
    • a model with high recall captures a large portion of the positive examples, meaning it has wide breadth.
      – e.g., high recall in search; high recall in an SMS spam filter
    • recall = TP / (TP + FN)
361
• F-measure
  – precision + recall ==> F-measure (= F1 score or F-score)
  – combines precision and recall using the harmonic mean:
    F-measure = (2 × precision × recall) / (recall + precision) = 2·TP / (2·TP + FP + FN)
  – computing the F-measure from the earlier precision & recall values:
    • > f <- (2 * prec * rec) / (prec + rec)
    • This is the same as using the counts from the confusion matrix:
    • > f2 <- (2 * 154) / (2 * 154 + 5 + 29)
  – Note: this assumes equal weight for precision and recall (not always valid); applying other weights instead is tricky at best and arbitrary at worst.
  – Better: use the F-score in combination with methods that consider a model's strengths and weaknesses more globally, such as those described in the next section.
362
Receiver Operating Characteristic Curves
363
Visualizing Performance Tradeoffs
• ROC curves
  – ; examine the tradeoff between detecting true positives and avoiding false positives.
  – the ROCR package's prediction() (see the sketch below)
• The curves are defined on a plot with:
  – Y axis: proportion of true positives; X axis: proportion of false positives
  – equivalent to sensitivity and (1 – specificity), respectively,
  – == a sensitivity/specificity plot
  – The points comprising ROC curves indicate the true positive rate at varying false positive thresholds. To create the curves, a classifier's predictions are sorted by the model's estimated probability of the positive class, with the largest values first.
  – Beginning at the origin, each prediction's impact on the true positive rate and false positive rate results in a curve tracing vertically (for a correct prediction) or horizontally (for an incorrect prediction).
364
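A minimal ROCR sketch; it assumes sms_results has an actual_type column and an estimated spam probability in a column named prob_spam (the column name is our assumption):

  library(ROCR)
  pred <- prediction(predictions = sms_results$prob_spam,  # assumed column
                     labels = sms_results$actual_type)
  perf <- performance(pred, measure = "tpr", x.measure = "fpr")
  plot(perf, col = "blue", lwd = 2, main = "ROC curve")
  abline(a = 0, b = 1, lty = 2)  # diagonal: classifier with no predictive value
  unlist(performance(pred, measure = "auc")@y.values)  # AUC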
365
• 3 hypothetical classifiers are contrasted in the plot.
  – (i) the diagonal line
    • represents a classifier with no predictive value → it detects true positives and false positives at exactly the same rate. It is the baseline by which other classifiers may be judged; ROC curves falling close to this line indicate models that are not very useful.
  – (ii) the perfect classifier
    • has a curve passing through the point with a 100% TP rate and a 0% FP rate.
  – (iii) most real-world classifiers
    • fall somewhere in the zone between perfect and useless.
• AUC (area under the ROC curve)
  – the closer the curve is to the perfect classifier, the better → this can be quantified with the AUC.
  – AUC ranges from 0.5 (a classifier with no predictive value) to 1.0 (a perfect classifier).
    • 0.9 – 1.0 = A (outstanding); 0.8 – 0.9 = B (excellent/good)
    • 0.7 – 0.8 = C (acceptable/fair); 0.6 – 0.7 = D (poor)
    • 0.5 – 0.6 = F (no discrimination)
  – Note: two ROC curves may have different shapes yet the same AUC → AUC alone can be misleading. Use AUC in combination with a qualitative examination of the ROC curve.
366
Estimating Future Performance
• Resubstitution error
  – the confusion matrices and performance measures produced at model-building time (by some R packages) → give insight into the model's resubstitution error.
  – ; occurs when the training data is incorrectly predicted in spite of the model being built directly from this data. This information is intended as a rough diagnostic, particularly to identify obviously poor performers.
  – it is not an indicator of future performance:
  – the error rate on the training data can be extremely optimistic about a model's future performance.
    • e.g., a model that used rote memorization to perfectly classify every training instance (that is, zero resubstitution error) would be unable to make predictions on data it has never seen before.
• (Recommendation)
  – evaluate models on new (unseen) data
    • the caret package
367
Holdout Method
• Conventional validation = a simple holdout method (e.g., 2/3 training and 1/3 testing)
  – It is easy to unknowingly violate this rule by choosing the best model based on repeated testing.
• Hence, in addition to the training and test datasets, use a third validation dataset.
  – used for iterating and refining the model or models chosen
  – leave the test dataset to be used only once, as a final step, to report an estimated error rate for future predictions.
  – a typical split between training, test, and validation = 50%, 25%, and 25%, respectively.
• However, holdout sampling has a problem: each partition may have a larger or smaller proportion of some classes.
• (Solution) stratified random sampling (see the sketch below)
  – ensures that the generated random partitions have approximately the same proportion of each class as the full dataset.
368
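A minimal stratified-split sketch with caret's createDataPartition(), using iris$Species as the class to stratify on (our example data):

  library(caret)
  set.seed(123)
  # stratified 75/25 split that preserves the class proportions
  in_train <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
  train <- iris[in_train, ]
  test  <- iris[-in_train, ]
  prop.table(table(train$Species))  # ~ the same proportions as the full data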
BAGGING, RANDOM FORESTS, AND HOLD-OUT TECHNIQUES
369
Overview
• Bagging and boosting
  – = meta-algorithms that pool decisions from multiple classifiers.
  – bagging, by Leo Breiman: bootstrap aggregating.
    • L. Breiman, “Bagging predictors,” Machine Learning, 24(2):123-140, 1996.
  – majority vote from classifiers trained on bootstrap samples of the training data.
370
Bagging
• Generate B bootstrap samples of the training data: random sampling with replacement.
• Train a classifier or a regression function using each bootstrap sample. – For classification: majority vote on the classification results.
– For regression: average on the predicted values.
• Reduces variance.
• Improves performance for unstable classifiers which vary significantly with small changes in the data set, e.g., CART.
• Found to improve CART a lot, but not the nearest neighbor classifier.
371
Boosting
• Overview
  – iteratively learns weak classifiers
  – the final result is the weighted sum of the results of the weak classifiers.
• A well-known boosting algorithm:
  – AdaBoost (adaptive boosting) by Y. Freund and R. Schapire
• Other boosting algorithms:
  – LPBoost: linear programming boosting, a margin-maximizing classification algorithm with boosting.
  – BrownBoost: increases robustness against noisy datasets; discards points that are repeatedly misclassified.
  – LogitBoost: J. Friedman, T. Hastie and R. Tibshirani, “Additive logistic regression: a statistical view of boosting,” Annals of Statistics, 28(2), 337-407, 2000.
372
• Adaboost for Binary Classification
373
RANDOM FOREST
374
Random Forests
• Concept
  – = an ensemble classifier built from many decision tree models.
  – can be used for classification or regression.
  – accuracy and variable-importance information are provided with the results.
• Split point (a randomForest sketch follows)
375
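A minimal random forest sketch with the randomForest package on iris (our example data):

  library(randomForest)
  set.seed(7)
  rf <- randomForest(Species ~ ., data = iris,
                     ntree = 500, importance = TRUE)
  rf              # prints the out-of-bag (OOB) error estimate
  importance(rf)  # variable importance measures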
HADOOP
376
GFS Architecture
Figure source: Ghemawat et al., “Google File System”, SOSP, 2003
377
MapReduce – Programming Model
378
Hadoop Ecosystem
Figure source: https://www.mssqltips.com/
379
Big Data and R
(v.0.9) 380
High Performance R
• Overview
  – R is slow
  – bad programmers are slower
  – R can't fix bad programming
• Using the debugger
  – …
• Profiling
  – use system.time() to get a general sense of a method.
  – use Rprof() for more detailed profiling.
381
• Improving R code quality
  – loops – avoid them where possible
  – the *ply functions
    • apply() • lapply() • mapply() • sapply()
  – vectorization (see the timing sketch below)
    • vectorize wherever possible!
    • however,
      – the *ply functions are not themselves vectorized.
      – vectorization is fastest, but often needs lots of memory.
  – map(), reduce()
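A minimal timing sketch contrasting a loop with its vectorized equivalent (our example computation):

  x <- rnorm(1e6)

  # loop version: slow in R
  system.time({ s <- 0; for (v in x) s <- s + v^2 })

  # vectorized version: fast, but materializes x^2 as a second large vector
  system.time(sum(x^2))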
382
• R and parallelism
  – Concept
  – Types
    • Implicit parallelism: parallel details hidden from the user
    • Explicit parallelism: some assembly required…
    • Embarrassingly parallel (loosely coupled): it is obvious how to parallelize
    • Tightly coupled: the opposite of embarrassingly parallel; lots of dependence between computations.
383
384
Streaming R
385
RHadoop
386
• https://github.com/RevolutionAnalytics/RHadoop
• Native R
  – + C++ + Java
• Includes the following packages
  – rmr ; interface for running MapReduce jobs via Hadoop streaming in R
  – rhbase ; interface to read/write data to/from HBase tables
  – rhdfs ; HDFS access
387
388
389
390
If R could, it would:

Map:
  imd <- lapply(input, function(j)
    list(key = K1(j), value = V1(j)))
  keys   <- lapply(imd, "[[", 1)
  values <- lapply(imd, "[[", 2)

Reduce:
  tapply(values, keys, function(k, v)
    list(key = K1(k, v), value = V1(v, k)))
391
SparkR
• SparkR is a language binding that seamlessly integrates R with Spark, and enables native R programmers to scale in a distributed setting
• https://github.com/amplab-extras/SparkR-pkg
392
• R + RDD = RRDD
  – (map)
    • lapply
  – (mapPartitions)
    • lapplyPartition
  – groupByKey, reduceByKey, sampleRDD, ...
  – collect, cache, ...
  – textFile, parallelize, broadcast, includePackage
• SparkR workflow
394
• Capturing closures:
Source: http://blog.obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
395
SPARK
396
The Big Data Analysis Triad
Batch
Interactive Streaming
397
The Hadoop stack
• Ecosystem tools – Hive DB, Hbase
– Pig
– Storm
• Hadoop – Map
– Reduce
– Shuffle, partition, sort
– HDFS
• Distributed data processing – Fault tolerant
– Processes petabyte-scale data sets
398
Limitations of Hadoop
• Batch mode only
  – covers only the batch layer in the Lambda pattern
  – no real-time processing
  – no repetitive queries
• Weak at:
  – iterative algorithms
  – interactive data querying
  – poor support for distributed memory
399
Spark: An overview
• “Over time, fewer projects will use MapReduce, and more will use Spark” – Doug Cutting, creator of Hadoop
• Characteristics – New architecture: scale better and simplify
– In memory processing for Big Data
– Cached intermediate data sets
– Multi-step DAG based execution
– Resilient Distributed Datasets (RDDs)
– The core innovation in Spark
400
Spark Ecosystem tools
401
RDD
• DAG Execution Engine – = Directed Acyclic Graph
• RDD – Resilient Distributed Data sets
– Features • Read only
• Fault tolerance without replication – Uses data lineage for recovery
• Low network I/O
• Partitions/Slices
• parallel tasks
402
WRAP-UP
(v.0.9) 403
Summary
404
Q&A
405