
Big Data Analytics Training

2015/11/9 ~ 13

윤형기 ([email protected])

Big Data Analytics Training (2015-11)


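Schedule

• Mon (D1)

– Morning: Big data technology overview (big data background, Hadoop, analysis techniques)

– Afternoon: Environment setup / MR programming (Hadoop installation, MapReduce programming, DB/DW (Hive), NoSQL)

• Tue (D2)

– Morning: R basics (R data types, writing functions, visualization)

– Afternoon: R statistics (basic statistics, estimation, hypothesis testing, ANOVA, …, regression analysis)

• Wed (D3)

– Morning: R and machine learning (1) (introduction, linear and nonlinear models)

– Afternoon: R and machine learning (1) (classification, artificial neural networks / SVM)

• Thu (D4)

– Morning: R and machine learning (3) (association rules, clustering, MVA)

– Afternoon: Python and data analysis (1) (Python, iPython, numpy, pandas, … and how to use them)

• Fri (D5)

– Morning: Python and data analysis (2) (recommender systems)

– Afternoon: Model evaluation and wrap-up (model evaluation and improvement, wrap-up)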


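• D1 Big data technology overview / MR programming

• D2 R basics / R statistics

• D3 R and machine learning

• D4 R and machine learning / Python and data analysis

• D5 Python and data analysis / Model evaluation and wrap-up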


D1

Big Data Technology Overview


Big Data Background


Background – 3V

• Tidal Wave – 3VC

• Supercomputer – High-throughput computing

– Two directions:

• Remote, distributed large-scale computing (grid computing)

• Centralized (MPP)

• Scale-Up vs. Scale-Out

• BI (Business Intelligence) – in particular DW/OLAP/data mining


BI

• BI overview (category / solution / description)

– Strategic BI

• BSC: Balanced Scorecard.

• VBM: Value-based Management.

• ABC: Activity-Based Costing.

– Analytical BI

• OLAP: On-line Analytical Processing. Multidimensional analysis.

– Extended BI

• ERP, CRM: BI functionality provided by extending ERP, CRM, SCM, and similar systems.

– Infrastructure / operational BI

• ETL: Extraction-Transformation-Loading.

• DW: Data Warehouse. A data repository.

– Delivery BI

• Portal: BI portal.


Hadoop

• The birth of Hadoop? – Background

• Google!

• Spun off from the Nutch/Lucene project in 2006 – Doug Cutting

– A top-level Apache open-source project

– Characteristics

• Distributed processing framework for large-scale data – http://hadoop.apache.org – pure software

• Linear scalability (flat linearity) through a simplified programming model – “function-to-data model vs. data-to-function” (Locality)

– KVP (Key-Value Pair)



Background to the birth of Hadoop

• 1990s – Excite, Alta Vista, Yahoo, …

• 2000 – Google; PageRank, GFS/MapReduce

• 2003~4 – Google papers

• 2005 – Hadoop is born (D. Cutting & Cafarella)

• 2006 – Listed as an Apache project


Frameworks


• Big Picture


• Hadoop Ecosystem Map


HADOOP


Apache Hadoop

• Open-source distributed computing framework

• Participants – about 60 committers from ~10 companies

• Cloudera, Yahoo!, Apple, and more

– Hundreds of contributors (feature development, bug fixes)

– Many related projects, applications, tools, etc.

• Hadoop Common – the base framework; contains libraries and utilities needed by other Hadoop modules.

• HDFS

• MapReduce – a programming model

• YARN – a resource-management platform


• Hadoop Kernel

• Hadoop distributions – Apache releases

• 2.x.x: based on 0.23.x

– Third-party distributions

• Cloudera, Hortonworks, and MapR

Hadoop ecosystem


HDFS

• Requirements – Commodity hardware

• Frequent failures are to be expected

– A large number of large files • hundreds of GB or TB

• Large streaming reads – not random access

– “Write-once, read-many-times”

– High throughput matters more than low latency

– “Modest” number of HUGE files • Just millions; each > 100MB & multi-GB files typical

– Large streaming reads • …


• How HDFS addresses these – files are stored in blocks

• Blocks are much larger than in a conventional file system (HDFS default: 64MB)

– Reliability through replication

• Each block replicated across 3+ DataNodes

– Single master (NameNode) coordinates access, metadata

• Simplified central management

– No data caching

• Caching is of little benefit for streaming reads

– Familiar interface, but customize the API

• Simplify the problem and focus on the distributed solution
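The block and replication figures above can be turned into a quick back-of-the-envelope calculation. The following is only a sketch under the slide's numbers (64MB blocks, replication factor 3); the function name `block_layout` is made up for illustration and is not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide
REPLICATION = 3      # the slide says "3+ DataNodes"; 3 is assumed here


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, stored block replicas, raw MB consumed) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies the bytes it actually holds,
    # so raw usage is simply file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


if __name__ == "__main__":
    # A 1 GB file: 16 blocks, 48 block replicas, 3 GB of raw storage.
    print(block_layout(1024))
```

The point of the large block size is visible here: a 1 GB file needs only 16 metadata entries on the NameNode, which keeps the single master manageable.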


GFS architecture

Figure source: Ghemawat et al., “The Google File System”, SOSP, 2003


• HDFS File Storage


• Ways to use HDFS – Command-line interface

– Java API

– Web Interface

– REST Interface

• WebHDFS REST API

– Mounting HDFS as a file system
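As an illustration of the REST interface, WebHDFS exposes file system operations under a `/webhdfs/v1/<path>?op=<OPERATION>` URL on the NameNode's HTTP port. The sketch below only builds such URLs and sends nothing; the host name and port 50070 are assumptions for illustration, and `webhdfs_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS URL for an HDFS path and operation (no request is sent)."""
    query = {"op": op}
    query.update(params or {})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"


if __name__ == "__main__":
    # Hypothetical NameNode address; adjust to the actual cluster.
    print(webhdfs_url("namenode.example.com", 50070, "/user/idcuser/data", "LISTSTATUS"))
    print(webhdfs_url("namenode.example.com", 50070,
                      "/user/idcuser/data/cit-Patents.txt", "OPEN",
                      {"user.name": "idcuser"}))
```

The resulting URL can be fetched with any HTTP client; `LISTSTATUS` corresponds roughly to `hadoop fs -ls` and `OPEN` to reading a file.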


HDFS command-line interface

• Create a directory

$ hadoop fs -mkdir /user/idcuser/data

• Copy a file from the local filesystem to HDFS

$ hadoop fs -copyFromLocal cit-Patents.txt /user/idcuser/data/.

• List the files under data/ in HDFS

$ hadoop fs -ls data/*

• Show the end of the specified HDFS file

$ hadoop fs -tail /user/idcuser/data/cit-patents-copy.txt

• Concatenate multiple files and write them to HDFS (via stdin/pipes)

$ cat /data/ita13-tutorial/pg*.txt | hadoop fs -put - data/all_gutenberg.txt


• File/Directory commands: – copyFromLocal, copyToLocal, cp, getmerge, ls, lsr (recursive ls), moveFromLocal, moveToLocal, mv, rm, rmr (recursive rm), touchz, mkdir

• Status/List/Show commands: – stat, tail, cat, test (checks for existence of path, file, zero length files), du, dus

• Misc commands: – setrep, chgrp, chmod, chown, expunge (empties trash folder)


HDFS Java API

• Listing files/directories (globbing)

• Open/close inputstream

• Copy bytes (IOUtils)

• Seeking

• Write/append data to files

• Create/rename/delete files

• Create/remove directory

• Reading Data from HDFS

– org.apache.hadoop.fs.FileSystem (abstract)

• org.apache.hadoop.hdfs.DistributedFileSystem

• org.apache.hadoop.fs.LocalFileSystem

• org.apache.hadoop.fs.s3.S3FileSystem


HDFS Web Interface


MapReduce

• Topology


• MapReduce – programming model


Hadoop 1 Limitations

• Scalability – Max cluster size: 4,000 nodes – Max concurrent tasks: 40,000 – Coarse synchronization in the JobTracker

• Single point of failure (NameNode, JobTracker) – A failure kills all queued and running jobs

• Re-startability – Restart is very tricky due to complex state

• Low resource utilization – Hard partition of resources into map and reduce slots

• Limited to MR – Doesn’t support other programs – Iterative application implementations are 10x slower

• Lack of wire-compatible protocols – Client and cluster must be of the same version – Applications and workflows cannot migrate to different clusters


Hadoop 2 Design concept

• The JobTracker's role is split into two functions

– Cluster resource management

– Application life-cycle management

• MR becomes a user library, or one of the applications residing in Hadoop


Key concepts for understanding MR2

• Application – a job submitted to the framework – e.g., an MR job

• Container – the basic unit of allocation; fine-grained resource allocation – e.g., container A = 2GB, 1 CPU – replaces the fixed MR slots

• ResourceManager – the global resource scheduler – hierarchical queues

• NodeManager – per-machine agent – manages the container life cycle – container resource monitoring

• ApplicationMaster – one per application; manages application scheduling and task execution – e.g., the MR ApplicationMaster


• YARN = MR2.0 + – a framework to develop and/or execute distributed processing applications

– e.g., MR, Spark, Hama, Giraph


Comparison

Hadoop 1.x

Hadoop 2.x


Problems with YARN

• Complexity – Protocols are at a very low level and very verbose

• Not well suited to long-running jobs

• Application doesn't survive Master crash

• No built-in communication between container and master

• Hard to debug


Hadoop's strengths, weaknesses, and responses

• Hadoop's strengths

– commodity h/w

– scale-out

– fault-tolerance

– flexibility by MR

• Hadoop's weaknesses

– MR!

– Missing! – schema, optimizer, index, view, ...

– Lack of compatibility with existing tools

• Solution: Hive

– SQL to MR

– Compiler + execution engine

– Pluggable storage layer (SerDes)

• Open issues: Hive

– ANSI SQL, UDF, ...

– MR latency overhead

– Work still in progress...!


SQL-on-MapReduce

• Direction – Query and analyze data stored in HDFS quickly with SQL – (Low-latency) real-time analysis that does not use MR

• New Architecture for SQL on Hadoop – Data locality – Real-time query (instead of MR) – Schema-on-read – Tight integration with the SQL ecosystem

• Examples of SQL-on-Hadoop projects – Google Dremel – Apache Drill – Cloudera Impala – Citus Data – Tajo

• Tajo was selected as an Apache Incubator project in March 2013 (APL V2.0) • Adopted by Korean companies – SK Telecom and others


Typical uses of Hadoop

• Text Mining

• Index building

• Graph analysis

• Pattern recognition

• Collaborative filtering

• Predictive models

• Sentiment analysis

• Risk analysis


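Usage patterns by type

(Figure: use cases plotted by data velocity, real time vs. batch, and data type, structured vs. unstructured.)

• Risk analysis (banking)

• Fraud detection (credit cards), money-laundering risk detection

• Social network analysis

• Marketing for financial and telecom companies (events)

• Distribution optimization (simulation)

• Detection of fraudulent insurance claims and tax-evasion risk

• Proactive preventive maintenance (aviation)

• Sentiment analysis / SNA

• Demand forecasting in manufacturing

• Health insurance / disease information analysis

• Traditional DW

• Text analysis

• Real-time video surveillance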

Use the right tool for the right job


The Ecosystem is the System

• "Hadoop Ecosystem" – Primary subprojects

• Zookeeper

• Hive and Pig

• HBase

• Flume

– Secondary subprojects

• Sqoop

• Oozie

• Hue

• Mahout


Ecosystem relationship diagram


Figure source: https://www.mssqltips.com/



Big Data Analytics


Big data platform

Figure source: it.toolbox.com

Analysis tools – Big Bang

• Functional specialization


R

• Open-source mathematical/statistical analysis tool and programming language – derived from the S language, with about 7,000 packages

• CRAN: http://cran.r-project.org/

– Strong performance and visualization capabilities




Python

• Open-source programming language

– Multi-platform

– Rich set of packages (≈ 10k)

• Readability

– Logic language

– Executable pseudocode

• Conciseness

– Expressiveness: less code

• Full-stack

– Web – GUI – OS – Science

• Active community


Analysis techniques

• Data Mining

• Predictive Analysis

• Data Analysis

• Data Science

• OLAP

• BI

• Analytics

• Text Mining

• SNA (Social Network Analysis)

• Modeling

• Prediction

• Machine Learning

• Statistical/Mathematical Analysis

• KDD (Knowledge Discovery)

• Decision Support System

• Simulation

For convenience, the terms data analysis and data mining are used interchangeably.


• Taxonomy of basic statistical theory


• Taxonomy of machine learning theory


HADOOP INSTALLATION


Basic environment

• Linux

• Java

• VMware Player – or VirtualBox – allocate 8GB RAM

• Cloudera Quick Start VM – Hadoop – Hive, HBase, …

• Windows > VM > CentOS > Hadoop & Ecosystems

• Practice with the vi editor: vimtutor • bash practice • Eclipse practice


Hadoop installation approaches

• Choice (1) Installation mode – Standalone – Pseudo-distributed cluster – Multi-node cluster

• Choice (2) Distribution – hadoop.apache.org – Cloudera – Hortonworks – MapR – others

• Choice (3) What to install – One-by-one vs. all-in-one – Cloud (e.g., Amazon) – Virtual machine?




MAPREDUCE PROGRAMMING


Programming basics

• Java programming

• Tools – Eclipse

• www.eclipse.org

– Maven

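Introduction to Maven

• What is Maven? – A build tool that builds deployable artifacts from source code

– A project management tool • It comprises a project object model, a set of standards, a project lifecycle, a dependency management system, and the logic for executing plugin goals at the phases defined in the lifecycle

– Creates projects with a suitable structure through archetypes

– Centrally manages the application name, version, and information about related libraries

– Makes it easy to define relationships between sub-projects

– Repository management through Nexus

– Features Maven provides • Builds, Documentation, Reporting, Dependencies, SCMs, Releases, Distribution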


Maven

• What is a build? – Producing a standalone, executable artifact from source code files

– The most important step is compilation

– The build tool takes care of the many source files and their versioning

• Build tool – preprocessing, compilation, packaging, testing, distribution

• Compiles source code into binary code

• Packages the binary code

• Runs the tests

• Deploys to the production system

• Generates documentation and release notes


• Creating a project – mvn archetype:generate -DgroupId=com.mycompany.app -DartifactId=my-app -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

– Right after a fresh installation, Maven downloads the most recent artifacts (plugin jars and other files) into the local repository.

– The generate goal creates a directory with the same name as the artifactId.

– cd my-app

• Standard project structure

• POM file – the information needed to build the project


POM file

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany.app</groupId>
  <artifactId>my-app</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>Maven Quick Start Archetype</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.8.2</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>


Maven Phases

• validate:

• compile:

• test: – using a suitable unit testing framework.

• package: – take the compiled code and package it in its distributable format, such as a JAR.

• integration-test: – process and deploy the package if necessary into an environment where integration tests can be run

• verify:

• install: – install the package into the local repository, for use as a dependency in other projects locally

• deploy: – copies the final package to the remote repository for sharing with other developers and projects.


• The standard layout for Maven projects – the application sources reside in ${basedir}/src/main/java – test sources reside in ${basedir}/src/test/java (where ${basedir} represents the directory containing pom.xml).


Building the JAR and installing it into the local repository

• mvn package

• mvn install

• mvn site

• mvn clean – removes the target directory with all the build data before starting, so that the build is fresh.

• Others – mvn idea:idea

– mvn eclipse:eclipse


MapReduce programming model





Job execution


Word Count Output


MapReduce programming

• Programming model – Borrows from functional programming

– Users implement interface of 2 functions:
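The two functions in question are conventionally a map function, which turns one input record into zero or more (key, value) pairs, and a reduce function, which folds all values collected for one key into a result. The sketch below imitates that model in a single Python process for word count; `run_job` is a made-up stand-in for the framework's shuffle-and-sort step, not a Hadoop API.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all the counts collected for one word."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Toy single-process driver: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for key in sorted(groups):          # the framework hands keys to reducers in sorted order
        results.extend(reducer(key, groups[key]))
    return results


if __name__ == "__main__":
    lines = ["Good morning", "Good evening", "Good night"]
    for word, count in run_job(enumerate(lines), map_fn, reduce_fn):
        print(word, count)
```

In a real job the map calls run in parallel on the nodes holding the input blocks, and the grouping step is the distributed shuffle; only `map_fn` and `reduce_fn` are written by the user.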


MR programming exercise (1)

• a) create the input directory /user/cloudera/wordcount/input in HDFS

– sudo su hdfs

– hadoop fs -mkdir /user/cloudera

– hadoop fs -chown cloudera /user/cloudera

– exit

– sudo su cloudera

– hadoop fs -mkdir /user/cloudera/wordcount /user/cloudera/wordcount/input


• b) create sample text files as input and move them to the input directory

– (in /home/cloudera)

– echo "Good morning" > file0

– echo "Good evening" > file1

– echo "Good night" > file2

– hadoop fs -put file* /user/cloudera/wordcount/input

– (or, equivalently) hadoop fs -copyFromLocal file* /user/cloudera/wordcount/input


• c) run – cd /usr/lib/hadoop-mapreduce/

– hadoop jar hadoop-mapreduce-examples.jar wordcount /user/cloudera/wordcount/input/ /user/cloudera/wordcount/output

• d) examine the output – hadoop fs -cat /user/cloudera/wordcount/output/part-r-00000


Exercise (2) mvn to Eclipse project

• mkdir maven.test01

• cd maven.test01

• mvn archetype:generate

• (defaults)

• groupId: maven.test01

• artifactId: maventest01

• (defaults)

• cd ./maventest01

• mvn eclipse:eclipse (generates the Eclipse project files from your POM)

• [Eclipse]

• Import the existing project into the workspace

• Modify the program, then run it


Exercise (3) Using the Training project

• Copy the Training project > MyProject – StubDriver.java > refactor > rename > ProjectDriver >


Exercise (4) Using GitHub

• GitHub? – A web-based hosting service for Git repositories

– Offers the distributed revision control & source code management (SCM) functionality of Git (cf. Git itself is a command-line tool).

– Git: a source code management tool (by Linus Torvalds)

• Exercise – https://github.com/eljefe6a/UnoExample

– $ git clone https://github.com/eljefe6a/UnoExample

– In Eclipse: File > Import > Existing Projects into Workspace > Select root directory > Browse > Cloudera > UnoExample > ...

– Project Card > Properties > Java Build Path > cdh_~~

Exercise (5) New Project from scratch

• Create the project – New Project > …

• Add external JARs – Properties > NewProj1

– Add External Jars > File System > usr > lib > hadoop > client-2.0 > select all > OK

– Add External Jars > hadoop > ~~.jars > OK

– hadoop.annotations.jar, ~~.auth.jar, ~~.common.jar

– Add External Jars > usr > lib > hadoop > lib > commons-httpclient-3.1.jar > OK

• Write the class – New > Class


Unix commands and the Streaming API

• Question: How many cities does each country have?

hadoop jar /mnt/biginsights/opt/ibm/biginsights/pig/test/e2e/pig/lib/hadoop-streaming.jar \

-input input/city.csv \

-output output \

-mapper "cut -f2 -d," \

-reducer "uniq -c" \

-numReduceTasks 5

• Explanation: cut -f2 -d, # Extract the 2nd column of a CSV

uniq -c # Collapse adjacent matching lines from INPUT,

# -c: prefix lines by the number of occurrences

additional remark: # numReduceTasks=0: no shuffle & sort phase!!
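To see why `cut` and `uniq -c` are enough, it helps to replay the job locally. The sketch below mimics the mapper, the framework's sort, and the reducer in plain Python; the sample rows are invented, and it assumes (as the job above does) that the country is the second column of city.csv.

```python
from itertools import groupby


def mapper(lines):
    """Equivalent of `cut -f2 -d,`: emit the second comma-separated field."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        # Like cut, pass a line through unchanged when it has no delimiter.
        yield fields[1] if len(fields) > 1 else fields[0]


def reducer(sorted_keys):
    """Equivalent of `uniq -c`: count each run of adjacent identical lines."""
    for key, run in groupby(sorted_keys):
        yield sum(1 for _ in run), key


if __name__ == "__main__":
    # Hypothetical rows in the form id,country,city
    rows = ["1,KR,Seoul", "2,US,Boston", "3,KR,Busan", "4,US,Austin", "5,JP,Tokyo"]
    # sorted() stands in for the shuffle & sort phase between map and reduce.
    for count, country in reducer(sorted(mapper(rows))):
        print(count, country)
```

`uniq -c` only counts adjacent duplicates, so the result is correct precisely because the framework sorts the mapper output before it reaches each reducer; with `numReduceTasks=0` that sort never happens.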


MapReduce High Level


MRv1 vs. MRv2


How it works

• Overview – The responsibilities of the JobTracker/TaskTracker are split up

• a global ResourceManager • a per-application ApplicationMaster • a per-node slave NodeManager • a per-application Container running on a NodeManager

– The ResourceManager and NodeManager are newly introduced

• ResourceManager – Arbitrates resource requests among applications – Allocates resources to applications through its scheduler

• ApplicationMaster – A framework-specific entity that obtains the resource containers it needs from the scheduler – After negotiating with the ResourceManager, runs the component tasks through the NodeManager(s) – Also tracks status and monitors progress

• NodeManager – A per-machine slave responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network), and reporting it to the ResourceManager.


MapReduce – example

• Scenario

– Twitter handles 500 million tweets per day (roughly 5,800 per second).

– Actions • Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.

• Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.

• Count − Generates a token counter per word.

• Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
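The four actions above chain naturally into a small pipeline. The following is a single-machine sketch of that flow, not how Twitter implements it; the stop-word list and the sample tweets are invented for illustration.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "to"}   # hypothetical list of unwanted words


def tokenize(tweets):
    """Tokenize: split each tweet into tokens and emit (token, 1) pairs."""
    for tweet in tweets:
        for token in tweet.lower().split():
            yield token, 1


def filter_tokens(pairs, stop_words=STOP_WORDS):
    """Filter: drop the pairs whose token is an unwanted word."""
    return ((token, n) for token, n in pairs if token not in stop_words)


def count(pairs):
    """Count: build one counter per word for a batch of pairs."""
    counter = Counter()
    for token, n in pairs:
        counter[token] += n
    return counter


def aggregate(counters):
    """Aggregate counters: merge per-batch counters into one total."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


if __name__ == "__main__":
    batches = [["Hadoop is fast", "the cluster is big"],
               ["Hadoop scales to the cluster"]]
    totals = aggregate(count(filter_tokens(tokenize(batch))) for batch in batches)
    print(totals.most_common(2))
```

Each batch here plays the role of one input split: tokenizing, filtering, and counting are independent per batch, and only the final aggregation needs to see results from all of them.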


Why MRv2 is needed

• Multi-tenancy – YARN allows multiple access engines to use Hadoop as the common standard for batch, interactive and real-time engines that can simultaneously access the same data set. Multi-tenant data processing improves an enterprise’s return on its Hadoop investments.

• Cluster utilization – Dynamic allocation of cluster resources improves MR workloads.

• Scalability – Improved scheduling strengthens scalability (thousands of nodes managing PBs of data).


Hadoop1 MR Daemons


DB/DW (HIVE)


Pig/Hive – analysis languages


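DBMS vs. DW

• Primary purpose

– Database: Online Transactional Processing (OLTP), although it can serve as a DW to a limited extent

– Data warehouse: Online Analytical Processing (OLAP)

• Tables and joins

– Database: normalized tables that are joined at query time; normalization reduces redundancy and saves storage space

– Data warehouse: de-normalized, so tables and joins are simple; shortens the response time of analytical queries

• Modeling

– Database: Entity-Relationship modeling

– Data warehouse: data modeling techniques for DW design

• Optimization

– Database: optimized for write operations

– Data warehouse: optimized for read operations

• Performance

– Database: low for analytical queries

– Data warehouse: high for analytical queries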


Background

• Started at Facebook

– Data was collected by nightly cron jobs into Oracle DB

– “ETL” via hand-coded python

– Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that.


Hive

• What is Hive? – An SQL-like interface to Hadoop – DW infrastructure – Data summarization and ad hoc queries on top of Hadoop – MR for execution & HDFS for storage

• Hive Query Language – Basic SQL: select, from, join, group-by – Equi-join, multi-table insert, multi-group-by

• SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid

• Why Hive and Pig? – MR is powerful but somewhat difficult. – Leverage SQL experience rather than writing Java code – Exploit the efficiency of scripting languages – Hive and Pig are MR tools for analysts that were developed independently – Hive was initially developed at Facebook, Pig at Yahoo


Hive is NOT



Hive Components

• Shell: – allows interactive queries, like a MySQL shell connected to a database

– Also supports web and JDBC clients

• Driver: – session handles, fetch, execute

• Compiler: – parse, plan, optimize

• Execution engine: – DAG of stages (M/R, HDFS, or metadata)

• Metastore: – schema, location in HDFS, SerDe


Data Model

• Tables – Typed columns (int, float, string, date, boolean)

– Also list and map types (for JSON-like data)

• Partitions – e.g., to range-partition tables by date

• Buckets – Hash partitions within ranges (useful for sampling, join optimization)



Data Model – Partitions

• Partitions – Nested sub-directories in HDFS – Comparable to dense indexes on the partition columns

– Efficiently identify the rows that satisfy certain criteria.

• e.g., all "US" data from "2009-12-23" is a partition of the page_views table, so a query can analyze only the "US" data for 2009-12-23

– Note: a partition name is only a convenience – keeping the name consistent with the data it contains is up to the user. – Partition columns are virtual columns; they are not part of the data itself but are derived on load.

• Example:

– Partition columns: ds, ctry

– HDFS for ds=20120410, ctry=US

» /wh/pvs/ds=20120410/ctry=US

– HDFS for ds=20120410, ctry=IN

» /wh/pvs/ds=20120410/ctry=IN
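The directory layout above is what makes partition pruning cheap: a filter on partition columns translates into choosing directories, without reading any rows. A minimal sketch of that idea follows; `partition_path` and `prune` are illustrative helpers, not Hive functions.

```python
def partition_path(table_dir, partition_values):
    """Build the HDFS directory for one partition from (column, value) pairs."""
    parts = [f"{column}={value}" for column, value in partition_values]
    return "/".join([table_dir.rstrip("/")] + parts)


def prune(partitions, **wanted):
    """Keep only the partitions whose column values match the filter."""
    return [p for p in partitions
            if all(dict(p).get(col) == val for col, val in wanted.items())]


if __name__ == "__main__":
    partitions = [
        [("ds", "20120410"), ("ctry", "US")],
        [("ds", "20120410"), ("ctry", "IN")],
        [("ds", "20120411"), ("ctry", "US")],
    ]
    # Directories a query with WHERE ctry = 'US' would have to scan.
    for p in prune(partitions, ctry="US"):
        print(partition_path("/wh/pvs", p))
```

The order of the partition columns in the table definition fixes the nesting order of the directories, which is why `ds` comes before `ctry` in every path.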


Metastore

• Metastore – Database: namespace containing a set of tables

– Holds table definitions (column types, physical layout)

– Partition data

– Uses JPOX ORM for implementation; can be stored in Derby, MySQL, many other relational databases

• Physical Layout – Warehouse directory in HDFS

• e.g., /home/hive/warehouse

– Tables stored in subdirectories of warehouse • Partitions, buckets form subdirectories of tables

– Actual data stored in flat files • Control char-delimited text, or SequenceFiles

• With custom SerDe, can use arbitrary format


• Table definition


Most common intersections

• What words appeared most frequently in both corpuses?

hive> SELECT word, shake_f, kjv_f,

(shake_f + kjv_f) AS ss

FROM merged SORT BY ss DESC

LIMIT 20;

• Some more advanced features… – “TRANSFORM:” Can use MapReduce in SQL statements

– Custom SerDe: Can use arbitrary file formats

– Metastore check tool

– Structured query log


Data Units

• Hive data is organized, in order of granularity, into: – Databases:

• Namespaces that keep tables and other data units free of naming conflicts.

– Tables: • Homogeneous units of data which have the same schema.

– Partitions: • Each table can have one or more partition keys, which determine how the data is stored.

– Buckets (or Clusters): • Data in each partition may be divided into buckets based on the value of a hash function of some column of the table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_views table. Buckets can be used to sample the data efficiently.


Type System

• Primitive types – Integers: TINYINT, SMALLINT, INT, BIGINT.

– Boolean: BOOLEAN.

– Floating point numbers: FLOAT, DOUBLE .

– String: STRING.

• Complex types – Structs: {a INT; b INT}.

– Maps: M['group'].

– Arrays: ['a', 'b', 'c'], A[1] returns 'b'.


Buckets

• Data is split by the hash value of a column – mainly for parallelism

• Within each partition, the data is further distributed into buckets by the hash of a specific table column.

• Example: – Bucket column: user, into 32 buckets

– HDFS file for user hash bucket 0

• /wh/pvs/ds=20120410/ctry=US/part-00000

– HDFS file for user hash bucket 20

• /wh/pvs/ds=20120410/ctry=US/part-00020
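The mapping from a row to its bucket file is just "hash the bucket column, take it modulo the number of buckets". The sketch below uses CRC32 purely as a stand-in deterministic hash; Hive's own hash function differs, so the bucket numbers produced here will not match a real table. The file naming follows the slide's `part-000NN` pattern.

```python
import zlib

NUM_BUCKETS = 32   # the slide buckets the user column into 32 buckets


def bucket_for(user, num_buckets=NUM_BUCKETS):
    """Pick a bucket from a deterministic hash of the bucket column (CRC32 assumed)."""
    return zlib.crc32(user.encode("utf-8")) % num_buckets


def bucket_file(partition_dir, user, num_buckets=NUM_BUCKETS):
    """Return the bucket file inside a partition directory for this user."""
    return f"{partition_dir}/part-{bucket_for(user, num_buckets):05d}"


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):   # hypothetical user ids
        print(user, bucket_file("/wh/pvs/ds=20120410/ctry=US", user))
```

Because the same user always hashes to the same bucket, sampling one bucket yields a consistent 1/32 slice of users, and two tables bucketed the same way on the join key can be joined bucket by bucket.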


External Tables

• Point to existing data directories in HDFS

• Can create table and partitions

• Data is assumed to be in Hive-compatible format

• Dropping external table drops only the metadata

• Example: creating an external table

CREATE EXTERNAL TABLE test_extern(c1 string, c2 int) LOCATION '/user/mytables/mydata';


Serialization and Deserialization

• Generic (De)Serialization Interface SerDe

• Uses LazySerDe

• Flexible interface to translate unstructured data into structured data

• Designed to read data separated by different delimiter characters

• The SerDes are located in 'hive_contrib.jar'


HQL overview

• Similar to SQL – Similar concepts and terminology (table, row, column, schema)

– Based on SQL-92

• Supports multi-table inserts via your code – Accesses "Big Data" via tables

• Converts SQL queries into MR jobs – Users do not need to know MR

• Also supports plugging custom MR scripts into queries


• Basic SQL – From clause sub-query – ANSI JOIN (equi-join only) – Multi-Table insert – Multi group-by – Sampling – Objects Traversal

• Extensibility – Pluggable Map-reduce scripts using TRANSFORM

• Usage – Interactive shell: $ hive (prompt: hive>) – Run a script file: $ hive -f myquery.hive


• Partitioning – Creating partitions

CREATE TABLE test_part(c1 string, c2 int) PARTITIONED BY (ds string, hr int);

INSERT OVERWRITE TABLE test_part PARTITION(ds='2009-01-01', hr=12) SELECT * FROM t;

ALTER TABLE test_part ADD PARTITION(ds='2009-02-02', hr=11);


SELECT * FROM test_part WHERE ds='2009-01-01';

• will only scan the files within the /user/hive/warehouse/test_part/ds=2009-01-01 directory

SELECT * FROM test_part WHERE ds='2009-02-02' AND hr=11;

• will only scan the files within the /user/hive/warehouse/test_part/ds=2009-02-02/hr=11 directory.


Creating a table and loading data

• Creating a table

hive> SHOW TABLES;

hive> CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

hive> DESCRIBE shakespeare;

• Generating data – e.g., (word, frequency) data from the Shakespeare data set:

$ hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep input shakespeare_freq '\w+'

• Loading data • Remove the MapReduce job logs:

$ hadoop fs -rmr shakespeare_freq/_logs

• Load the dataset into Hive:

hive> LOAD DATA INPATH 'shakespeare_freq' INTO TABLE shakespeare;


Working with data

hive> SELECT * FROM shakespeare LIMIT 10;

hive> SELECT * FROM shakespeare

WHERE freq > 100 SORT BY freq ASC

LIMIT 10;

hive> SELECT freq, COUNT(1) AS f2

FROM shakespeare GROUP BY freq

SORT BY f2 DESC LIMIT 10;

hive> EXPLAIN SELECT freq, COUNT(1) AS f2

FROM shakespeare GROUP BY freq

SORT BY f2 DESC LIMIT 10;


• Joining tables – A powerful feature of Hive is the ability to create queries that

join tables together

– We have (freq, word) data for Shakespeare

– Can also calculate it for KJV

– Let’s see what words show up a lot in both

• Creating the dataset:

$ tar zxf ~/bible.tar.gz -C ~

$ hadoop fs -put ~/bible bible

$ hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep bible bible_freq '\w+'


• Import data to Hive

$ hadoop fs -rmr bible_freq/_logs

hive> CREATE TABLE kjv (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

hive> LOAD DATA INPATH 'bible_freq' INTO TABLE kjv;

• Create an intermediate table

hive> CREATE TABLE merged (word STRING, shake_f INT, kjv_f INT);

• Run the join

hive> INSERT OVERWRITE TABLE merged SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1;

hive> SELECT * FROM merged LIMIT 20;
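For readers who want to check the logic of the join on a handful of rows, the same computation can be written in a few lines of Python. This is only an in-memory analogue of the equi-join on word, with made-up frequencies; Hive compiles the real query into MapReduce jobs.

```python
def join_frequencies(left, right):
    """Inner equi-join on word, keeping rows where both frequencies are >= 1."""
    merged = []
    for word, left_freq in left.items():
        if word in right and left_freq >= 1 and right[word] >= 1:
            merged.append((word, left_freq, right[word]))
    return merged


def top_combined(merged, limit):
    """Order rows by the summed frequency, highest first, and keep `limit` rows."""
    return sorted(merged, key=lambda row: row[1] + row[2], reverse=True)[:limit]


if __name__ == "__main__":
    # Hypothetical (word -> frequency) tables standing in for shakespeare and kjv.
    shakespeare = {"the": 30, "and": 20, "thou": 5}
    kjv = {"the": 60, "and": 50, "lord": 7}
    merged = join_frequencies(shakespeare, kjv)
    print(top_combined(merged, 20))
```

Words that appear in only one corpus ("thou", "lord") drop out, which is exactly what the inner JOIN does.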


• JOIN

SELECT t1.a1 AS c1, t2.b1 AS c2 FROM t1 JOIN t2 ON (t1.a2 = t2.b2);

• INSERTION

INSERT OVERWRITE TABLE t1 SELECT * FROM t2;

INSERT OVERWRITE TABLE sample1 SELECT * FROM sample WHERE ds='2012-02-24';

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;


Typical Use Cases

• Ad-hoc queries

• Summarization, data analysis

• Log data processing

• Text mining

• Document indexing

• Customer-facing business intelligence (e.g., Google Analytics)

• Predictive modeling, hypothesis testing


Architecture


Without Hive

Figure source: http://www.slideshare.net/cloudera/20101117-webinar-final-cloudera-hammerbacher-e-bay-madan
















With Hadoop/Hive – Stage 1a

• Copy or Archive


With Hadoop/Hive – Stage 1b

• Add unstructured data


With Hadoop/Hive – Stage 1c

• Consolidate Multiple DWs


With Hadoop/Hive – Stage 2

• Structure and Store


With Hadoop/Hive – Stage 3

• Ad hoc Query Support


Pig

• A high-level scripting language (Pig Latin)

• Process data one step at a time

• Makes MR programs simple to write

• Easy to understand

• Easy to debug

• A = LOAD 'a.txt' AS (id, name, age, ...);

• B = LOAD 'b.txt' AS (id, address, ...);

• C = JOIN A BY id, B BY id; STORE C INTO 'c.txt';


Hive vs. Pig

• Language

– Hive: HQL (Hive Query Language) – SQL-like

– Pig: Pig Latin – a scripting language

• Schema

– Hive: table definitions are stored in a metastore

– Pig: a schema is optionally defined at runtime

• Access method

– Hive: JDBC, ODBC

– Pig: PigServer


Wordcount example


Wordcount in MR


Wordcount in Pig

A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) as COUNT;

DUMP C;


Pig & Hive

• Pig and Hive work well together and many businesses use both.


NOSQL DATABASES


NoSQL?

• NoSQL is also a DBMS. – Conventional RDBMS:

• Table

• More functionality, less performance

– OLAP

• Cube

– NoSQL

• Collections

• Less functionality, more performance

• Focus: scalability, performance, HA

Structured Data Structured/ Unstructured Data


Theory – Brewer’s CAP Theorem

• Properties a distributed system should have: CAP – Consistency

• All nodes show the same data for the same item.

– Availability • All users must be able to read and write, and the failure of one node must not affect the other nodes.

– Partition tolerance • The system must keep operating correctly even when messages between nodes are lost.

• RDBMS – Transaction-oriented; of the CAP properties it gives up partition tolerance and keeps C and A.

– Atomicity – Consistency – Isolation – Durability


• NoSQL – Partition tolerance is essential for scale-out, so the choice is between C and A.

– ACID can only be supported with at least a two-phase commit structure, which is expensive in a distributed environment. This is where Brewer's BASE properties come in • Basically Available

• Soft-state

• Eventual Consistency

• NoSQL systems that adopt BASE instead of ACID fall into: – P + C: all nodes must work together while guaranteeing consistency – on a system failure, some or all nodes may be unable to respond (Google Bigtable, HBase)

– P + A: the data remains accessible during a failure, but it is not guaranteed to be the latest, correct data. (Amazon Dynamo, Apache Cassandra)


Types of NoSQL

• Key-Value Stores – Origins: DHTs / Amazon’s Dynamo paper

– Examples: Memcached, Coherence, Redis

• Column Store – Origins: Google's BigTable paper

– Examples: HBase, Cassandra, Hypertable

• Document Store – Origins: Lotus Notes

– Examples: CouchDB, MongoDB, Cloudant

• Graph Database – Origins: Euler & graph theory

– Examples: Neo4J, FlockDB


NoSQL Landscape


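NoSQL data model comparison


NoSQL characteristics

• Missing? – No support for joins

– No support for complex transactions (ACID)

– No support for constraints

• Available? – Query language

– High performance

– Horizontal scalability

(Figure: NoSQL vs. SQL plotted by performance and functionality.)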


MongoDB

• General – mongoDB = “Humongous DB”

– Document-based, open source (documents up to 16 MB)

– “High performance, high availability”

– Automatic scaling

– C-P on CAP

• Managing documents – Documents are in BSON format, consisting of field-value pairs

– Each document stored in a collection

– Collections • Have index set in common

• Like tables of relational db’s.

• Documents do not have to have uniform structure


mongoDB vs. SQL

• Document – Tuple

• Collection – Table/View

• PK: _id field – PK: any attribute(s)

• Uniformity not required – Uniform relation schema

• Index – Index

• Embedded structure – Joins

• Shard – Partition


Installation

• Configure yum – /etc/yum.repos.d/mongodb.repo

[mongodb]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64
gpgcheck=0
enabled=1

• Install – $ yum install mongo-10gen mongo-10gen-server – /etc/mongod.conf

# where to log
logpath=/var/log/mongodb/mongod.log
logappend=true
# fork and run in background
fork=true
#port=27017
dbpath=/var/lib/mongo
# location of pidfile
pidfilepath=/var/run/mongodb/mongod.pid


Getting started

• Check – service --status-all | grep mongod – /etc/rc.d/init.d – cat mongod

• Start MongoDB – $ sudo service mongod start – /var/log/mongodb/mongod.log

– sudo chkconfig mongod on – sudo service mongod stop

• Connect & Select DB

– db – show dbs – use mydb – db – help


• Create a collection and insert documents – use mydb

– j = {name: "mongodb"} – k = {x: 3}

• Create both the mydb database and the testData collection – db.testData.insert(j) – db.testData.insert(k)

– show collections

• Confirm

– db.testData.find()


Generate Test Data

• Insert multiple documents with a for loop – (~/.mongorc.js)

function insertData(dbName, colName, num) {
  var col = db.getSiblingDB(dbName).getCollection(colName);
  // Insert num documents of the form {x: 0}, {x: 1}, ...
  for (var i = 0; i < num; i++) {
    col.insert({x: i});
  }
  print(col.count());
}

– > insertData("test", "testData", 400) – > db.testData.find() – > it


Sample Document – BSON

{
  _id: ObjectId(7df78ad8902c),
  title: 'MongoDB Overview',
  description: 'MongoDB is no sql database',
  by: 'tutorials point',
  url: 'http://www.tutorialspoint.com',
  tags: ['mongodb', 'database', 'NoSQL'],
  likes: 100,
  comments: [
    { user: 'user1', message: 'My first comment', dateCreated: new Date(2011,1,20,2,15), like: 0 },
    { user: 'user2', message: 'My second comments', dateCreated: new Date(2011,1,25,7,45), like: 5 }
  ]
}

Query


Data Modification


MongoDB CRUD basic concepts

• Read operations – Cursors

• Queries return iterable objects, called cursors, that hold the full result set.

– Query optimization • Analyze and improve query performance.

– Distributed queries • Improve read performance through sharded clusters and replica sets.

• Write operations – Write concern

• The level of guarantee on the success of a write operation

– Distributed writes • Improve write performance through sharded clusters and replica sets.


Read Operations

• Cursors – Queries return iterable objects, called cursors, that hold the full result set.

• Query optimization – Analyze and improve query performance.

• Query Plans – MongoDB executes queries using optimal plans.

• Distributed queries – Describes how sharded clusters and replica sets affect the performance of read operations.


Read Operations overview

• specify criteria, or conditions, that identify the documents …

– A query may include a projection

• specifies the fields from the matching documents to return.

• limits the amount of data that MongoDB returns to the client.

• Query Interface

– db.collection.find()

• ; accepts both the query criteria and projections and returns a cursor to the matching documents.


> db.users.insert(

... {

... name: "sue",

... age:26,

... status:"A",

... groups:["news","sports"]

... }

... )

> db.users.insert( { name: "msrk", age:32, status:"B", groups:["programmings","golf"] } )

> db.users.insert( { name: "kim", age:29, status:"C", groups:["tennins","climbing"] } )

> db.users.find()

> db.users.find({age: {$gt:18}},{name:1, address:1}).limit(5)


Projections

• MongoDB default – queries return all fields of all matching documents.

• Projection – specified in the query as the 2nd argument to the find() method
– limits the amount of data MongoDB returns to the application
– reduces network overhead and processing requirements.
– either lists the fields to return or lists the fields to exclude from the result documents.
– Note: except for excluding the _id field in an inclusive projection, inclusive and exclusive projections cannot be mixed.



Projection의 예

• Exclude one field from the result set
db.records.find( { "user_id": { $lt: 42 } }, { "history": 0 } )
– excludes the history field from the documents in the result set.

• Return two fields and the _id field
db.records.find( { "user_id": { $lt: 42 } }, { "name": 1, "email": 1 } )
– returns just the _id field (implicitly included), the name field and the email field.

• Return two fields and exclude _id
db.records.find( { "user_id": { $lt: 42 } }, { "_id": 0, "name": 1, "email": 1 } )
– returns only the name and email fields of the documents in the result set.

• Projection behavior
– By default, the _id field is included.
– For fields that hold arrays, MongoDB provides the projection operators $elemMatch, $slice, and $.
– In an aggregation framework pipeline, use the $project pipeline stage.


Sharding

• = storing data split across multiple machines

• (Challenges)
– data scalability, high query rates, the storage capacity of a single machine, pressure on RAM, and the I/O capacity of disk drives.

• (Approaches)
– (i) vertical scaling
– (ii) sharding
• = horizontal scaling: divides the data set and distributes the data over multiple servers, or shards.
• Each shard is an independent database, and collectively, the shards make up a single logical database.



Sharding in MongoDB

– MongoDB supports sharding through the configuration of a sharded cluster.

• Components of a sharded cluster
– Shards
• store the data; for HA and consistency, each shard is a replica set.
– Query routers
• = mongos instances
• interface with client applications and direct operations to the appropriate shards.
• process and target operations to shards and then return the results to the clients.
• A sharded cluster can contain more than one query router to divide the client request load. A client sends requests to one query router. Most sharded clusters have many query routers.
– Config servers
• store the cluster's metadata. This data contains a mapping of the cluster's data set to the shards. The query router uses this metadata to target operations to specific shards. Production sharded clusters have exactly 3 config servers.



Data Partitioning

• Shard Keys

• Range-based sharding

• Hash-based sharding


D2

R – 기초와 통계


R 기초


분석도구 – Big Bang

• 기능특화


R

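• Open-source tool and programming language for mathematical and statistical analysis – derived from the S language, with a very large number of packages

• CRAN: http://cran.r-project.org/

• more than 5,100 packages at the time of this course (2015)

– Strong performance and visualization capabilities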







Installing R

Go to the R site and download the installer from CRAN (Comprehensive R Archive Network)

http://www.cran.r-project.org/

Double-click the downloaded file to start the installation


R 실행


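Features of R – Why R?

Open-source software – free as in freedom and free of charge. It grew out of the S language.

Base commands (functions) + externally installed packages

From the user's point of view
• Runs on any major platform (Windows, Unix, MacOS).
• Fast, with strong graphics capabilities.

From the programming point of view
• Object-oriented (OOP)
• Generic functions are polymorphic.
• Functions return their results as objects.
• Functional programming language
• Repetition is handled with functions instead of loops: compact code, easier parallel processing
• Other: the result of one function can be passed as input to another, …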


R의 사용자 인터페이스


R 사용자 인터페이스 <목차>

작업공간 (Workspace)

입출력

할당 (Assignment)

Packages

작업 시작 환경의 Customizing

특수한 출력 지정

Batch 처리

결과의 재사용


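Workspace

User interface
Command-line mode
• Interactive: type commands at the > prompt
• Pass an R script file to source()
Graphical mode
• Use the menus of the default R installation, or install a separate GUI front end for R

A look at the R workspace

R is command-driven by default (the prompt is >).
Considerable work has gone into graphical environments:
RStudio
R Commander
…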


Basic principle – functions and objects
Built in:
• vector, matrix, data frame, list, function
User written: program your own functions and objects

Conveniences
On exit, R can save an image of the workspace and reload it in the next session.
The up/down arrow keys step through the command history.
It is common to give each project its own physical folder.
• For MS Windows users:
• Wrong: "c:\my documents\myfile.txt"
• because "\" is the escape character in R strings.
• Right: "c:\\my documents\\myfile.txt" or
• "c:/my documents/myfile.txt"


Conveniences (continued)

Comments
• start with '#'

Help:
help.start() # general help
help(seq) # help about function seq
?seq # same thing
example(plot) # run the examples for function plot
RSiteSearch("lm") # search the help manuals and mailing lists

History:
history() # by default lists the 25 most recent commands
savehistory(file="myfile") # save the command history (the default file is ".Rhistory")
loadhistory(file="myfile") # reload a previously saved command history


A first taste: R as a calculator
> 5+4
> 7^2
> log(10)
> exp(1)

R ships with many example datasets. To list the available datasets:
data( ) # the result depends on which packages are loaded
To see the details of a particular dataset: help(datasetname)

Session options
help(options) # describes the available options
options() # shows the current option settings


Working directory
getwd() # print the current working directory
ls() # list the objects in the current workspace
setwd(mydirectory) # change to mydirectory
setwd("c:/docs/mydir") # on MS Windows, use / instead of \

Warnings and errors
> sqrt(-2)
[1] NaN
Warning message:
In sqrt(-2) : NaNs produced
> absolute(3)
Error: could not find function "absolute"

Quitting
> q()


입출력

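Input: the source( ) function
• runs a script in the current session (files are looked up in the current directory by default)
source("myfile") # run the script file

Output: the sink( ) function redirects output.
sink("myfile", append=FALSE, split=FALSE) # send output to a file
sink() # restore output to the terminal
• append option – whether to overwrite the file or add to it.
• split option – whether to also show the output on screen.

# Example: send output to a file (an existing file of that name is overwritten)
sink("c:/projects/output.txt")

# Example: append output to an existing file and show it on screen as well
sink("myfile.txt", append=TRUE, split=TRUE)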


Graphics output
• sink( ) does not redirect graphics; use one of the device functions below instead
• dev.off( ) closes the device and returns output to the screen.
jpeg("c:/mygraphs/myplot.jpg") # give the full path to save elsewhere
plot(x)
dev.off()

Function                       Output to
pdf("mygraph.pdf")             pdf file
win.metafile("mygraph.wmf")    Windows metafile
png("mygraph.png")             png file
jpeg("mygraph.jpg")            jpeg file
bmp("mygraph.bmp")             bmp file
postscript("mygraph.ps")       postscript file


할당 (Assignment)

Assigning a value to a variable

R normally uses <- for assignment; = and <<- can also be used.

Assigned objects occupy memory; rm() removes objects that are no longer needed.

> x=5; y=2

> x+3

[1] 8

> x+y

[1] 7

> print(x+y)

[1] 7

> x=pi

> x

[1] 3.141593

> rm(x)

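Packages

Package = a collection of R functions, data, and compiled code.
The base packages come with the initial R installation; anything else is installed separately.
Library: the directory where packages are stored. A package must be loaded before it can be used.
.libPaths() # show the library location
library() # list the packages available in the library
search() # list the packages currently loaded

Adding packages
① Download and install (only needed once): install.packages("packagename")
② Choose a CRAN mirror site (e.g. Korea)
③ Load it into the current session (once per session): library(packagename)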


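Customizing the startup environment

At startup R first runs the site file Rprofile.site. On MS Windows it lives in the C:\Program Files\R\R-n.n.n\etc directory.
A user-level .Rprofile file can be kept in the home directory or in a project directory.
Search order for .Rprofile
• current directory > user's home directory
Either file can define two functions
.First( ) – runs when the R session starts
.Last( ) – runs when the R session ends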


Batch processing

Running R non-interactively
On MS Windows (adjust the paths as needed):
"C:\Program Files\R\R-2.13.1\bin\R.exe" CMD BATCH --vanilla --slave "c:\my projects\my_script.R"
On Linux:
R CMD BATCH [options] my_script.R [outfile]


Reusing results

Printing the result on screen
lm(mpg~wt, data=mtcars)
The result is printed but not saved.

Saving the result in an object
fit <- lm(mpg~wt, data=mtcars)
The result is stored under the name fit and nothing is printed.

Inspecting the stored result
str(fit) # view the contents/structure of "fit"
Shows the components of the list named "fit".


데이터 입력


데이터 입력 <목차>

데이터 기초

데이터 타입

데이터 끌어오기 (Importing)

키보드 입력

DBMS의 액세스

데이터 보내기 (exporting)

Dataset에 대한 정보 획득

변수의 Labels

결측치 (Missing Data)

날짜/시간 데이터


데이터 기초

Dataset
• a collection of several observed values
• use c() to store such data as a data vector
> Rev_2012 = c(110,105,120,140) # example: quarterly revenue

> Rev_2013 = c(105,115,140,135)

> Revenue = cbind(Rev_2012, Rev_2013) # column별로 결합

> Revenue

Rev_2012 Rev_2013

[1,] 110 105

[2,] 105 115

[3,] 120 140

[4,] 140 135

>


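Data modes in R
Numeric
• made up of numbers only (if any element is a character value, the whole vector becomes character)
Character
• written with " " or ' '
Logical (TRUE, FALSE)
• internally TRUE is 1 and FALSE is 0
• when a number is converted to logical, any non-zero value counts as TRUE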


R의 데이터 타입 vectors (numerical, character, logical)

• R에는 scalar가 존재하지 않음. • = 한 개의 항목을 가지는 vector

• 문자열 (character string)도 문자 mode의 single-element vector

Matrices

Arrays

data frames

lists.


Converting between data types
Checking the type
is.numeric() is.character()
is.vector() is.matrix()
is.data.frame()

Conversion functions

Convert to     Function           Rule
numeric        as.numeric()       FALSE -> 0; "1", "2" -> 1, 2
logical        as.logical()       0 -> FALSE
character      as.character()     1, 2 -> "1", "2"; FALSE -> "FALSE"
factor         as.factor()        converts to a categorical (factor) variable
vector         as.vector()        converts to a vector
matrix         as.matrix()        converts to a matrix
data frame     as.data.frame()    converts to a data frame
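The conversion rules in the table can be tried directly. The following is a minimal sketch using only base R; the variable names are illustrative and not part of the original slides. It also shows one common pitfall that the table does not mention: converting a factor with as.numeric() returns the internal level codes, not the printed values.

```r
# Basic conversions from the table above
num_from_chr <- as.numeric(c("1", "2"))   # character -> numeric
num_from_lgl <- as.numeric(FALSE)         # logical   -> numeric (FALSE becomes 0)
lgl_from_num <- as.logical(c(0, 1, -3))   # numeric   -> logical (non-zero becomes TRUE)
chr_from_num <- as.character(c(1, 2))     # numeric   -> character
chr_from_lgl <- as.character(FALSE)       # logical   -> character

# Text that is not a number becomes NA (R also issues a warning)
bad <- suppressWarnings(as.numeric("abc"))

# Pitfall: a factor is stored as integer codes
f      <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                   # level codes
values <- as.numeric(as.character(f))     # the values as printed
```

When the original numbers stored in a factor are needed, going through as.character() first, as in the last line, is the usual idiom.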

데이터타입 - Vector

같은 mode의 데이터를 가지는 여러 항목으로 이루어진 것 a <- c(1,2,5.3,6,-2,4) # numeric vector

숫자 데이터 벡터 > x = c(1,3,5,7)

> x

[1] 1 3 5 7

문자 벡터 > family = c("아버지", "어머니","딸","아들")

> family

[1] "아버지" "어머니" "딸" "아들"

논리벡터 > c(T,T,F,T)

[1] TRUE TRUE FALSE TRUE


Vector indexing
Individual elements of a vector are selected with subscripts, [ ]
a[c(2,4)] # 2nd and 4th elements
> new_a <- a[-2] # everything except the 2nd element
> new_a
[1] 1.0 5.3 6.0 -2.0 4.0

Ways to create vectors
: operator
seq() – generate a sequence
rep() – repeat the elements of a vector

Vectorized operations
= applying a function to every element of a vector at once, which is more efficient than an explicit loop
• depending on the result: vector in, vector out, or vector in, matrix out
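A short sketch of the three creation methods and of the two kinds of vectorized result mentioned above. It uses base R only, and the variable names are illustrative.

```r
# Creating vectors
v1 <- 1:5                                # the : operator
v2 <- seq(from = 0, to = 1, by = 0.25)   # seq(): 0, 0.25, 0.5, 0.75, 1
v3 <- rep(c("a", "b"), times = 2)        # rep(): repeat the whole vector
v4 <- rep(c("a", "b"), each = 2)         # rep(): repeat each element

# Vector in, vector out: the operation is applied to every element
squares <- v1^2

# Vector in, matrix out: each element yields two values,
# so sapply() arranges the results as a 2 x 5 matrix
pow <- sapply(v1, function(k) c(k, k^2))
```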


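Filtering
Extract only the elements that satisfy a condition, either by logical indexing or with subset()
> z <- c(5,3,-2)
> w <- z[z > 0]
> w
[1] 5 3
> x <- c(6, 2:5, NA, 12)
> x[x>5] # logical indexing keeps the NA
[1] 6 NA 12
> subset(x, x>5) # subset() drops the NA
[1] 6 12

NA and NULL
NA: a missing value
NULL: an undefined value (no value exists at all)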
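The difference between NA and NULL shows up most clearly in how they affect length and arithmetic. A minimal base-R sketch with illustrative variable names:

```r
# NA is a placeholder for a missing value: it occupies a position
x <- c(1, NA, 3)
len_x   <- length(x)             # the NA is counted
mean_x  <- mean(x)               # NA propagates through arithmetic
mean_ok <- mean(x, na.rm = TRUE) # drop the NA before averaging
n_miss  <- sum(is.na(x))         # number of missing values

# NULL means "nothing at all": it disappears when combined
y <- c(1, NULL, 3)
len_y    <- length(y)
len_null <- length(NULL)
```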


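Matrix arithmetic with vectors
Not to be confused with matrix as an R data type

Vectors can be treated as matrices and combined with the usual operators
• matrix multiplication: %*%
• matrix addition: +
• combining: cbind() or rbind()
• inverse: solve(); for a generalized inverse, load library(MASS) and use ginv()
• transpose: t()

Example: vector operations on two vectors x and y defined as follows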


> x = c(1,2,3,4,5)

> y = c(-1,-2,-3,-4,-5)

> # x+ y

> x+y

[1] 0 0 0 0 0

> # x'y

> t(x) %*% y

[,1]

[1,] -55

> # xx'

> x %*% t(x)

[,1] [,2] [,3] [,4] [,5]

[1,] 1 2 3 4 5

[2,] 2 4 6 8 10

[3,] 3 6 9 12 15

[4,] 4 8 12 16 20

[5,] 5 10 15 20 25

> # 각 성분의 곱

> x * y

[1] -1 -4 -9 -16 -25

> # x, y를 합쳐 두 개의 열을 가진 행렬로 만들기

> new_matrix = cbind(x,y)

> new_matrix

x y

[1,] 1 -1

[2,] 2 -2

[3,] 3 -3

[4,] 4 -4

[5,] 5 -5

> # 행렬의 차원

> dim(new_matrix)

[1] 5 2

>
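The example above covers %*%, cbind() and t() but not the inverse. As a small sketch in base R (the matrix A and vector b are made-up values), solve() returns the inverse of a square, non-singular matrix and can also solve a linear system directly:

```r
# A 2 x 2 matrix, filled column by column: rows are (2, 1) and (1, 3)
A <- matrix(c(2, 1, 1, 3), nrow = 2)

A_inv <- solve(A)        # inverse of A
I2    <- A %*% A_inv     # should be the identity, up to rounding

# Solve A x = b without forming the inverse explicitly
b <- c(1, 2)
x <- solve(A, b)
```

Because of floating-point rounding, compare such results with all.equal() rather than ==.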


Data type – Matrices

= a vector with rows and columns
Every element has the same mode (numeric, character, etc.)
Every column holds the same number of elements
The size is normally given at creation (nrow=, ncol=)

General form
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
dimnames=list(char_vector_rownames, char_vector_colnames))

byrow=FALSE
• the matrix is filled column by column (the default, column-major order).
byrow=TRUE
• the matrix is filled row by row.
dimnames
• optional labels for the rows and columns

y<-matrix(1:20, nrow=5,ncol=4) # create a 5 x 4 numeric matrix
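The effect of byrow is easiest to see by filling the same values both ways. A minimal sketch in base R; the object names and labels are illustrative.

```r
# Same values, two fill orders
m_col <- matrix(1:6, nrow = 2, ncol = 3)                # column by column (default)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # row by row

first_row_col <- m_col[1, ]   # 1 3 5
first_row_row <- m_row[1, ]   # 1 2 3

# dimnames allow selection by label instead of by position
m_named <- matrix(1:4, nrow = 2,
                  dimnames = list(c("r1", "r2"), c("c1", "c2")))
cell <- m_named["r2", "c1"]
```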


또 다른 예 cells <- c(1,26,24,68)

rnames <- c("R1", "R2")

cnames <- c("C1", "C2")

mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,

dimnames=list(rnames, cnames))

Rows, columns and elements are selected with subscripts
x[,4] # 4th column of the matrix
x[3,] # 3rd row of the matrix
x[2:4,1:3] # rows 2,3,4 of columns 1,2,3


> y <- matrix(c(1,2,3,4), nrow=2)

> y

[,1] [,2]

[1,] 1 3

[2,] 2 4

> # matrix arithmetic

> y %*% y

[,1] [,2]

[1,] 7 15

[2,] 10 22

> 3 * y # scalar 곱

[,1] [,2]

[1,] 3 9

[2,] 6 12

> y + y

[,1] [,2]

[1,] 2 6

[2,] 4 8

> # another example: assigning into a submatrix
> x <- matrix(nrow=3, ncol=3)
> y <- matrix(c(4,5,2,3), nrow=2)

> y

[,1] [,2]

[1,] 4 2

[2,] 5 3

> x[2:3,2:3] <- y

> x

[,1] [,2] [,3]

[1,] NA NA NA

[2,] NA 4 2

[3,] NA 5 3



Applying a function to the rows and columns of a matrix
apply() function
• apply(m, dimcode, f, fargs)
• m = matrix,
• dimcode = 1: apply to rows, 2: apply to columns,
• f = function to apply, fargs = optional arguments passed on to f

> m <- matrix(c(1:10), nrow = 5, ncol = 2)

> # row means
> apply(m, 1, mean)
[1] 3.5 4.5 5.5 6.5 7.5
> # column means
> apply(m, 2, mean)
[1] 3 8
> # divide every element by 2
> apply(m, 1:2, function(x) x/2)

Changing contents (adding and removing elements)
> x <- matrix(c(12,3,7,16,4))
> x <- c(x,15) # c() flattens the matrix, so x is now a plain vector

> x

[1] 12 3 7 16 4 15

> x <- c(x[1:2], 15,x[3:6])

> x

[1] 12 3 15 7 16 4 15

> x <- x[-2:-3]

> x

[1] 12 7 16 4 15


Extending a matrix with rbind() and cbind()
> B = matrix(c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)

> C = matrix(c(7, 4, 2), nrow=3, ncol=1)

> cbind(B, C)

[,1] [,2] [,3]

[1,] 2 1 7

[2,] 4 5 4

[3,] 3 7 2

> D = matrix(c(6, 2),nrow=1,ncol=2)

> D

[,1] [,2]

[1,] 6 2

> rbind(B,D)

[,1] [,2]

[1,] 2 1

[2,] 4 5

[3,] 3 7

[4,] 6 2


Relationship between matrix and vector
A matrix is a vector plus the attributes specific to a matrix (its dimensions)
> z <- matrix(1:8, nrow=4)

> z

[,1] [,2]

[1,] 1 5

[2,] 2 6

[3,] 3 7

[4,] 4 8

> length(z)

[1] 8

> class(z)

[1] "matrix"

> attributes(z)

$dim

[1] 4 2


데이터 타입 – Array

Like matrices, but can have more than two dimensions.
Example: fill a 4 x 3 x 3 three-dimensional array with the values 1 to 36
> x <- array(1:36, c(4,3,3))

> x[1,,]

[,1] [,2] [,3]

[1,] 1 13 25

[2,] 5 17 29

[3,] 9 21 33

> x[,,1]

[,1] [,2] [,3]

[1,] 1 5 9

[2,] 2 6 10

[3,] 3 7 11

[4,] 4 8 12


데이터 타입 – List

An ordered collection of objects. A single name can hold a variety of (possibly unrelated) objects.

# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
w
…
w$age
…
# example: concatenating two lists into one
v <- c(list1,list2)

Components are selected with [[ ]].
mylist[[2]] # 2nd component of the list
mylist[["mynumbers"]] # component named mynumbers in list
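Two details are easy to miss here: single brackets return a sub-list while double brackets return the component itself, and c() concatenates lists whereas list() nests them. A minimal sketch with a made-up list named person:

```r
person <- list(name = "Fred", scores = c(90, 85), age = 5.3)

sub_list <- person["scores"]     # [ ]  -> a list of length 1
scores   <- person[["scores"]]   # [[ ]] -> the numeric vector itself
same     <- person$scores        # $ is shorthand for [[ ]] with a name

# c() joins lists end to end; list() keeps them as nested components
combined <- c(list(1, "a"), list(TRUE))      # flat list with 3 components
nested   <- list(list(1, "a"), list(TRUE))   # list holding 2 lists
```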


Adding and removing list components
> z <- list(a="abc", b=12)
> z
…
> z$c <- "sailing" # add a component
> z
…
> # a component can also be added by index
> z[[4]] <-28

Applying functions to a list (lapply(), sapply())
> lapply(list(2:5,35:39), median) # returns a list

[[1]]

[1] 3.5

[[2]]

[1] 37

> sapply(list(2:5, 35:39), median) # returns a vector or matrix
[1] 3.5 37.0

데이터 타입 – Data Frame

Each column can hold a different mode (numeric, character, factor, etc.).

d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names

Selecting parts of a data frame:
myframe[3:5] # columns 3, 4 and 5
myframe[c("ID","Age")] # the ID and Age columns
myframe$X1 # the variable X1


> d <- c(1,2,3,4)

> e <- c("red", "white", "red", NA)

> f <- c(TRUE,TRUE,TRUE,FALSE)

> mydata <- data.frame(d,e,f)

> mydata

…

> names(mydata) <- c("ID","Color","Passed") # 변수 명

> mydata

ID Color Passed

1 1 red TRUE

2 2 white TRUE

3 3 red TRUE

4 4 <NA> FALSE


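Technically a data frame is a kind of list whose components are vectors of equal length.
Much of what applies to lists and vectors therefore carries over directly.

Merging data frames
merge() combines two data frames
# merge two data frames by ID
total <- merge(dataframeA, dataframeB, by="ID")
Similar to a join in an RDBMS

Applying functions
lapply() and sapply() work on data frames just as they do on lists.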


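Combining and merging datasets
Data from different sources can be combined in three ways:
• Combining columns: when the rows have the same content and order, use data.frame() or cbind().
• Combining rows: when both datasets have the same columns (variables), use rbind().
• Combining differently shaped data: merge() finds the common columns and joins on them (a database-style join).
• With merge(), the set notions of intersection and union can be expressed (see also match() and %in%).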
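The intersection and union behaviour corresponds to the all, all.x and all.y arguments of merge(). A minimal sketch with two made-up data frames, left and right:

```r
left  <- data.frame(ID = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(ID = c(2, 3, 4), y = c(10, 20, 30))

# Intersection (inner join): only IDs present in both
inner <- merge(left, right, by = "ID")

# Union (full outer join): every ID, with NA where a side has no match
outer <- merge(left, right, by = "ID", all = TRUE)

# Left join: keep every row of the first data frame
left_join <- merge(left, right, by = "ID", all.x = TRUE)
```

merge() sorts the result by the key column by default, so inner holds IDs 2 and 3 in that order.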


데이터 타입 – Factor

Used when a variable is nominal (categorical)
Each value is stored as an integer code in the range [ 1... k ]
Options of factor() and ordered() control the mapping between the labels and their order.

A factor can also be used to create value labels.
# example: a variable gender with 20 "male" and 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# R now treats gender as a nominal variable (internally 1=female, 2=male)
summary(gender)

Ordinal variables use an ordered factor
# example: a variable rating coded as "large", "medium", "small"
rating <- ordered(rating)
# rating is now treated as ordinal, with levels in alphabetical order (1=large, 2=medium, 3=small)
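Alphabetical order is rarely the order that is wanted for an ordinal variable. The sketch below, with a made-up rating vector, shows how the levels argument sets the order explicitly:

```r
rating <- c("small", "large", "medium", "small")

# Default: levels sorted alphabetically (large < medium < small)
r_default <- ordered(rating)

# Explicit order: small < medium < large
r_sized <- factor(rating, levels = c("small", "medium", "large"),
                  ordered = TRUE)

codes       <- as.integer(r_sized)   # internal codes follow the given order
below_large <- r_sized < "large"     # comparisons respect the level order
```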


Factor에 유용한 함수 tapply()

• Vector를 그룹별로 나눈 후 지정한 함수를 적용 > ages <- c(25,26,34,43,57)

> party <- c("새누리","민주","민주","새누리","기타")

> tapply(ages, party, mean)

기타 민주 새누리

57 30 34

split() • 그룹별로 나누기만 함.

> g <- c("M","F","M","I","M","F","I")

> split(1:7, g)

…


> mons = c("March","April","January","November","January",
+ "September","October","September","November","August",
+ "January","November","November","February","May","August",
+ "July","December","August","August","September","November",
+ "February","April")
> mons = factor(mons)
> table(mons)
mons
    April    August  December  February   January      July
        2         4         1         2         3         1
    March       May  November   October September
        1         1         5         1         3

> summary(mons)


몇 가지 유용한 함수들

length(object) # object가 가진 항목의 개수

str(object) # object의 구조

class(object) # object의 class 또는 type

names(object) # names

c(object,object,...) # 객체를 결합(combine) vector를 만듬.

cbind(object, object, ...) # combine objects as columns

rbind(object, object, ...) # combine objects as rows

object # object를 출력

ls() # 현재의 object 목록을 출력

rm(object) # object 삭제

newobject <- edit(object) # 복사하여 새로 생성

fix(object) # 곧바로 수정, 변경


데이터타입 – table

2-way 테이블을 통해 범주형 (categorical) 데이터를 분석 contingency table 예: > trial <- matrix(c(34,11,9,32), ncol=2)

> colnames(trial) <- c('sick', 'healthy')

> rownames(trial) <- c('risk', 'no_risk')

> trial.table <- as.table(trial)

> trial.table

sick healthy

risk 34 9

no_risk 11 32


데이터 끌어오기 (Importing)

Import from: CSV text file
The first line holds the variable names and the field separator is a comma
Each row is identified by the id variable
On MS Windows use / instead of \

mydata <- read.table("c:/mydata.csv", header=TRUE,
sep=",", row.names="id")


Import from: Excel
The simplest route is to save the sheet as a CSV file and use the method above.
On MS Windows the RODBC package can be used. The first row should hold the column names.

library(RODBC)
channel <- odbcConnectExcel("c:/myexcel.xls")

mydata <- sqlFetch(channel, "mysheet")

odbcClose(channel)


Keyboard input
Data is usually imported from SAS, SPSS, Excel, Stata, a database or an ASCII file, but a data frame can also be typed in directly:

# create a data frame from scratch

age <- c(25, 30, 56)

gender <- c("male", "female", "male")

weight <- c(160, 110, 220)

mydata <- data.frame(age,gender,weight)

R's built-in editor can also be used.
# enter data using editor

mydata <- data.frame(age=numeric(0), gender=character(0),

weight=numeric(0))

mydata <- edit(mydata)


Accessing a DBMS

The RODBC package provides an ODBC interface.
Main functions:

Function: odbcConnect(dsn, uid="", pwd="")
Description: open a connection to an ODBC database

Function: sqlFetch(channel, sqtable)
Description: read a table from the ODBC database into a data frame

Function: sqlQuery(channel, query)
Description: submit a query to an ODBC database and return the results

Function: sqlSave(channel, mydf, tablename = sqtable, append = FALSE)
Description: write or update (append=TRUE) a data frame to a table in the ODBC database

Function: sqlDrop(channel, sqtable)
Description: remove a table from the ODBC database

Function: close(channel)
Description: close the connection


# Example: import two DBMS tables (Crime & Punishment)
# into two R data frames (crimedat & pundat)
library(RODBC)
myconn <-odbcConnect("mydsn", uid="Rob", pwd="aardvark")
crimedat <- sqlFetch(myconn, "Crime")
pundat <- sqlQuery(myconn, "select * from Punishment")
close(myconn)

Other interfaces
RMySQL package – interface to MySQL.
ROracle package – interface for Oracle.
RJDBC package – JDBC interface.


데이터 보내기 (exporting)

Tab Delimited Text File write.table(mydata, "c:/mydata.txt", sep="\t")

Excel Spreadsheet library(xlsReadWrite)

write.xls(mydata, "c:/mydata.xls")


Dataset에 대한 정보 획득

ls() # list objects
names(mydata) # list the variables in mydata
str(mydata) # show the structure of mydata
levels(mydata$v1) # levels of the factor v1 in mydata
dim(object) # dimensions of an object
class(object) # class of an object (numeric, matrix, data frame, etc)
mydata # print mydata
head(mydata, n=10) # print the first 10 rows of mydata
tail(mydata, n=5) # print the last 5 rows of mydata

Value Labels

The factor function can be used to create your own value labels.
# goal: variable v1 is coded 1, 2 or 3 and should be labelled 1=red, 2=blue, 3=green
mydata$v1 <- factor(mydata$v1,
levels = c(1,2,3),
labels = c("red", "blue", "green"))

# variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium, 5=High
mydata$y <- ordered(mydata$y,
levels = c(1,3, 5),
labels = c("Low", "Medium", "High"))

Use factor() for nominal data and ordered() for ordinal data.


결측치 (Missing Data)

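Missing values are shown as NA (not available).
Undefined numeric results (e.g. 0/0) are shown as NaN (not a number); a non-zero number divided by zero gives Inf.
The same NA symbol is used for character and numeric data.

Testing for missing values
is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns the vector (F F F T)

Recoding values as missing
mydata$v1[mydata$v1==99] <- NA # treat the value 99 in v1 as missing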


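If missing values are not excluded, the result of a computation is itself missing.
x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2

complete.cases() – returns a logical vector indicating which rows are complete.
# list rows of data that have missing values
mydata[!complete.cases(mydata),]

na.omit() – removes rows with missing values
# drop incomplete rows and store the result in newdata
newdata <- na.omit(mydata)

Other approaches: use the missing-value options of individual R functions.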
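The steps above can be combined into one small workflow. This is a sketch on a made-up data frame df in which 99 is assumed to be the code for "missing"; which() is used so that rows where v1 is already NA are not selected by the comparison.

```r
df <- data.frame(id = 1:4,
                 v1 = c(10, 99, 30, NA),
                 v2 = c("a", NA, "c", "d"))

# Recode the sentinel value 99 as missing
df$v1[which(df$v1 == 99)] <- NA

# Count missing values per column
n_missing <- colSums(is.na(df))

# Flag complete rows and keep only those
ok       <- complete.cases(df)
df_clean <- na.omit(df)
```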


날짜/시간 (Date) 데이터

Dates are stored as the number of days since 1970-01-01 (negative values are earlier dates).

# as.Date( ) – convert a string to a date
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]

Sys.Date( ) – today's date
date() – current date and time


The symbols below can be used with the format( ) function.
Example: print today's date
today <- Sys.Date()
format(today, format="%B %d %Y")
"June 20 2007"

Symbol   Meaning                    Example
%d       day as a number (01-31)    01-31
%a       abbreviated weekday        Mon
%A       unabbreviated weekday      Monday
%m       month (01-12)              01-12
%b       abbreviated month          Jan
%B       unabbreviated month        January
%y       2-digit year               07
%Y       4-digit year               2007


Date conversion
Character -> date
• use as.Date(x, "format").
• x is the character data and format describes how the strings are laid out.
# convert strings in 'mm/dd/yyyy' format to dates
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
• default format: yyyy-mm-dd
mydates <- as.Date(c("2007-06-22", "2004-02-13"))

Date -> character
• use as.character( ).
strDates <- as.character(dates)
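A short round trip through the conversions above, as a base-R sketch. Only numeric format symbols are used here, because month and weekday names such as %B depend on the locale of the machine.

```r
# Dates are day counts from 1970-01-01
epoch_num <- as.numeric(as.Date("1970-01-01"))

# Character -> Date with an explicit input format
parsed <- as.Date("08/16/1975", format = "%m/%d/%Y")

# Date -> character, default and custom layouts
iso    <- as.character(parsed)
custom <- format(parsed, "%d.%m.%y")

# Subtracting dates gives a difference in days
diff_days <- as.numeric(as.Date("2007-06-22") - as.Date("2004-02-13"))
```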


데이터 관리


데이터 관리 <목차>

새로운 변수의 생성

연산자 (Operators)

내장 함수

제어문

사용자 작성 함수

데이터의 정렬

데이터 병합 (Merge)

Aggregating Data

Reshaping Data

Subsetting Data

apply() 함수


새로운 변수의 생성

Use the assignment operator <-. Any of the following three approaches works:

mydata$sum <- mydata$x1 + mydata$x2

mydata$mean <- (mydata$x1 + mydata$x2)/2

attach(mydata)

mydata$sum <- x1 + x2

mydata$mean <- (x1 + x2)/2

detach(mydata)

mydata <- transform( mydata,

sum = x1 + x2,

mean = (x1 + x2)/2

)


Recoding variables

# example 1: create 2 age categories
mydata$agecat <- ifelse(mydata$age > 70,
c("older"), c("younger"))

# example 2: create 3 age categories

attach(mydata)

mydata$agecat[age > 75] <- "Elder"

mydata$agecat[age > 45 & age <= 75] <- "Middle Aged"

mydata$agecat[age <= 45] <- "Young"

detach(mydata)


연산자 (Operators)

Binary operators work on vectors, matrices and scalars alike.

Arithmetic Operators

Operator    Description
+           addition
-           subtraction
*           multiplication
/           division
^ or **     exponentiation
x %% y      modulus (x mod y); 5%%2 is 1
x %/% y     integer division; 5%/%2 is 2


Logical Operators

Operator Description

< less than

<= less than or equal to

> greater than

>= greater than or equal to

== exactly equal to

!= not equal to

!x Not x

x | y x OR y

x & y x AND y

isTRUE(x) test if X is TRUE


# 예:

x <- c(1:10)

x

1 2 3 4 5 6 7 8 9 10

x > 8

F F F F F F F F T T

x < 5

T T T T F F F F F F

x > 8 | x < 5

T T T T F F F F T T

x[c(T,T,T,T,F,F,F,F,T,T)]

1 2 3 4 9 10

x[(x>8) | (x<5)] # result: 1 2 3 4 9 10

Character functions

substr(x, start=n1, stop=n2)
  extract or replace substrings in a character vector

grep(pattern, x, ignore.case=FALSE, fixed=FALSE)
  search for pattern in x and return the indices of the matching elements.
  fixed=FALSE: pattern is a regular expression. fixed=TRUE: pattern is a literal text string.
  grep("A", c("b","A","c"), fixed=TRUE) returns 2

sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
  find pattern in x and replace it.
  fixed=FALSE: pattern is a regular expression. fixed=TRUE: pattern is a literal text string.
  sub("\\s",".","Hello There") returns "Hello.There"

strsplit(x, split)
  split the elements of a character vector at split.
  strsplit("abc", "") returns the three elements "a","b","c"

paste(..., sep="")
  concatenate strings, separated by sep

toupper(x)
  convert to uppercase

tolower(x)
  convert to lowercase


Examples
x <- "abcdef"
substr(x, 2, 4) # "bcd"
substr(x, 2, 4) <- "22222" # x is now "a222ef"
paste("x",1:3,sep="") # c("x1","x2","x3")
paste("x",1:3,sep="M") # c("xM1","xM2","xM3")
paste("Today is", date())

Probability
x <- pretty(c(-3,3), 30) # standard normal curve
y <- dnorm(x)
plot(x, y, type='l', xlab="Normal Deviate", ylab="Density",
yaxs="i")
pnorm(1.96) # about 0.975

# 50 random normal variates with mean=50, sd=10
x <- rnorm(50, mean=50, sd=10)

# probability of exactly 0, 1, ..., 5 heads in 10 tosses
dbinom(0:5, 10, .5)

# probability of 5 or fewer heads in 10 tosses
pbinom(5, 10, .5)


기타의 유용한 함수

Other functions

seq(from, to, by) – generate a sequence
indices <- seq(1,10,2)
# indices is c(1, 3, 5, 7, 9)

rep(x, ntimes) – repeat x ntimes times
y <- rep(1:3, 2)
# y is c(1, 2, 3, 1, 2, 3)

cut(x, n) – divide a continuous variable into a factor with n levels
y <- cut(x, 5)


제어문

In the forms below, expr can be a compound statement enclosed in { }.

if-else
if (cond) expr
if (cond) expr1 else expr2

for
for (var in seq) expr

while
while (cond) expr

switch
switch(expr, ...)

ifelse
ifelse(test,yes,no)
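Each of the five forms above in a minimal, runnable shape. This is a base-R sketch; the function and variable names are made up for illustration.

```r
# if-else: returns a value, so it can be the body of a function
classify <- function(n) {
  if (n %% 2 == 0) "even" else "odd"
}

# for: accumulate a running total over 1..5
total <- 0
for (i in 1:5) total <- total + i

# while: smallest k with 2^k >= 100
k <- 0
while (2^k < 100) k <- k + 1

# switch: pick a branch by name; the unnamed last value is the fallback
describe <- function(type) {
  switch(type,
         mean   = "average",
         median = "middle value",
         "unknown")
}

# ifelse: vectorized if-else, one result per element
labels <- ifelse(c(1, 4, 7) > 3, "high", "low")
```

Note the difference between if, which tests a single condition, and ifelse(), which works element by element on a vector.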


Example – transposing a matrix
# for real work prefer the built-in function t()

mytrans <- function(x) {

if (!is.matrix(x)) {

warning("argument is not a matrix: returning NA")

return(NA_real_)

}

y <- matrix(1, nrow=ncol(x), ncol=nrow(x))

for (i in 1:nrow(x)) {

for (j in 1:ncol(x)) {

y[j,i] <- x[i,j]

}

}

return(y)

}

# try it

z <- matrix(1:10, nrow=5, ncol=2)

tz <- mytrans(z)

사용자 작성 함수

사용자 함수의 표준형

myfunction <- function(arg1, arg2, ... ){

statements

return(object)

}


Objects created inside a function are local variables.
# example: compute central tendency and spread

mysummary <- function(x,npar=TRUE,print=TRUE) {

if (!npar) {

center <- mean(x); spread <- sd(x)

} else {

center <- median(x); spread <- mad(x)

}

if (print && !npar) {

cat("Mean=", center, "\n", "SD=", spread, "\n")

} else if (print && npar) {

cat("Median=", center, "\n", "MAD=", spread, "\n")

}

result <- list(center=center,spread=spread)

return(result)

}


# 함수 호출

set.seed(1234)

x <- rpois(500, 4)

y <- mysummary(x)

Median= 4

MAD= 1.4826

# y$center is the median (4)
# y$spread is the median absolute deviation (1.4826)

y <- mysummary(x, npar=FALSE, print=FALSE)

# no output

# y$center is the mean (4.052)

# y$spread is the standard deviation (2.01927)

Typing a function name without ( ) prints its source code.


데이터의 정렬

order( )

The default is ascending order.

Putting a – (minus) sign in front of a sorting variable gives descending order.
# example: sort the mtcars dataset

attach(mtcars)

# sort by mpg

newdata <- mtcars[order(mpg),]

# sort by mpg and cyl

newdata <- mtcars[order(mpg, cyl),]

#sort by mpg (ascending) and cyl (descending)

newdata <- mtcars[order(mpg, -cyl),]

detach(mtcars)

데이터 병합 (merge)

Adding columns
merge() joins two data frames (datasets) horizontally.
By default the join is on the common key variables (i.e., an inner join).
# merge two data frames by ID
total <- merge(dataframeA, dataframeB, by="ID")
# merge two data frames by ID and Country
total <- merge(dataframeA, dataframeB, by=c("ID","Country"))

Adding rows
rbind() joins two data frames (datasets) vertically.
Both data frames must have the same variables (the order may differ).
total <- rbind(dataframeA, dataframeB)

If dataframeA has variables that dataframeB lacks:
• 1. Delete the extra variables in dataframeA, or
• 2. Create the additional variables in dataframeB and set them to NA (missing) before joining them with rbind( ).
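Both options can be sketched on two small made-up data frames, a and b, where a has an extra column region and the shared columns appear in a different order:

```r
a <- data.frame(ID = 1:2, score = c(80, 90), region = c("N", "S"))
b <- data.frame(score = c(70, 60), ID = 3:4)

# Option 1: drop the extra variable from a, then stack
opt1 <- rbind(a[names(b)], b)

# Option 2: add the missing variable to b as NA, then stack
b2 <- b
b2$region <- NA
opt2 <- rbind(a, b2)
```

For data frames, rbind() matches columns by name, which is why the differing column order in b does not matter.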


Aggregating Data

Collapse the data by specifying BY variables and a summary function
# aggregate mtcars, computing the mean of each numeric variable
attach(mtcars)
aggdata <-aggregate(mtcars, by=list(cyl,vs), FUN=mean, na.rm=TRUE)
print(aggdata)
detach(mtcars)

Note that the by variables must always be supplied in a list
Both built-in and user-written functions can be used.

See also: summarize() in the Hmisc package
summaryBy() in the doBy package


Reshaping data

Transpose
t() – transpose a matrix or a data frame.
# example:
mtcars
t(mtcars)

Reshape package
"melt" the data so that each row is a unique id-variable combination.
Then "cast" the melted data into the shape you want.

mydata
id time x1 x2
1 1 5 6
1 2 3 5
2 1 6 1
2 2 2 4

# example of melt function
library(reshape)
mdata <- melt(mydata, id=c("id","time"))

mdata
id time variable value

1 1 x1 5

1 2 x1 3

2 1 x1 6

2 2 x1 2

1 1 x2 6

1 2 x2 5

2 1 x2 1

2 2 x2 4

# cast the melted data

# cast(data, formula, function)

subjmeans <- cast(mdata, id~variable, mean)

timemeans <- cast(mdata, time~variable, mean)

melt( ) and cast( ) offer many more options than shown here.

subjmeans
id x1 x2
1 4 5.5
2 4 2.5

timemeans
time x1 x2
1 5.5 3.5
2 2.5 4.5


Subsetting Data

R's indexing features are used to select particular variables and observations.

Selecting (keeping) variables:

# select variables v1, v2, v3

myvars <- c("v1", "v2", "v3")

newdata <- mydata[myvars]

# another method

myvars <- paste("v", 1:3, sep="")

newdata <- mydata[myvars]

# select 1st and 5th thru 10th variables

newdata <- mydata[c(1,5:10)]


Excluding (dropping) variables:
# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3")
newdata <- mydata[!myvars]

# exclude the 3rd and 5th variables
newdata <- mydata[c(-3,-5)]

# delete v3 and v5
mydata$v3 <- mydata$v5 <- NULL


Selecting rows (observations)
# first 5 observations
newdata <- mydata[1:5,]

# based on variable values
newdata <- mydata[ which(mydata$gender=='F'
& mydata$age > 65), ]

# or
attach(mydata)
newdata <- mydata[ which(gender=='F' & age > 65),]
detach(mydata)


Selection with the subset() function
# example (1): ID and Weight of people aged 20 or over, or under 10
newdata <- subset(mydata,age>= 20|age < 10,select=c(ID, Weight))

# example (2): all columns from weight through income for males over 25
newdata <- subset(mydata, sex=="m" & age > 25,
select=weight:income)


Random samples
sample( ) – draws a random sample of size n from a dataset.

# take a random sample of size 50 from a dataset mydata

# sample without replacement

mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]


Summary

Dates can be converted to character or numeric values.

Converting between structures:

                   to one long vector      to matrix                  to data frame
from vector        c(x,y)                  cbind(x,y), rbind(x,y)     data.frame(x,y)
from matrix        as.vector(mymatrix)                                as.data.frame(mymatrix)
from data frame                            as.matrix(myframe)


apply() 함수

Applies a function to the rows or columns of a matrix
Form: apply(data, dim, function)

• dim=1 applies function to each row

• dim=2 applies function to each column


## Compute row and column sums for a matrix:

x <- cbind(x1 = 3, x2 = c(4:1, 2:5))

dimnames(x)[[1]] <- letters[1:8]

apply(x, 2, mean, trim = .2)

col.sums <- apply(x, 2, sum)

row.sums <- apply(x, 1, sum)

rbind(cbind(x, Rtot = row.sums), Ctot = c(col.sums, sum(col.sums)))

stopifnot( apply(x,2, is.vector)) # not ok in R <= 0.63.2

apply(x, 2, sort) # sort each column

##- function with extra args:

cave <- function(x, c1,c2) c(mean(x[c1]),mean(x[c2]))

apply(x,1, cave, c1="x1", c2=c("x1","x2"))

ma <- matrix(c(1:4, 1, 6:8), nr = 2)

apply(ma, 1, table) #--> a list of length 2

apply(ma, 1, quantile)# 5 x n matrix with rownames

stopifnot(dim(ma) == dim(apply(ma, 1:2, sum)))


R의 그래프 기초


R의 그래프 기초 <목차>

R 그래프 개요

plot() 함수

그래프 생성

밀도 Plots

점 (Dot) Plots

막대 (Bar) Plots

선 도표 (Line Charts)

파이 차트 (Pie Charts)

상자 그림 (Boxplots)

Scatter Plots


R 그래프 개요

Examples of R's graphics capabilities
> demo(graphics); demo(persp)

Example using the sine and cosine functions:

x = (0:20) * pi / 10

y = cos(x)

plot(x,y)

ysin = sin(x)

ysin2 = sin(x) ^2

plot(x,y)

lines(x,ysin2)

par(mfrow = c(2,2))

plot(x,y, type="p")

plot(x,y, type="l")

plot(x,y, type="b")

plot(x,y, type="p", pch=19, col="red")

Saving graphs
From the menu of the graphics window:
• File -> Save As.
Or with the following functions.

Function                       Output to
pdf("mygraph.pdf")             pdf file
win.metafile("mygraph.wmf")    Windows metafile
png("mygraph.png")             png file
jpeg("mygraph.jpg")            jpeg file
bmp("mygraph.bmp")             bmp file
postscript("mygraph.ps")       postscript file


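Working with several graphs at once
A new graph overwrites the existing one. To avoid this, open a new graph window with one of the functions below before creating the new graph.
• This lets you keep several graph windows open at the same time.
• help(dev.cur)

Alternatively, after opening the first graph window,
• choose History -> Recording from the menu, then use Previous and Next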

Function Platform

windows() Windows

X11() Unix

quartz() Mac


plot() 함수

plot( ) opens a graph window and plots into it

Building a graph interactively
# Creating a Graph
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")

Options of plot(): see the table that follows


Parameter        Options and description
type =           form of the graph
                 type="p" points
                 type="l" lines
                 type="b" points joined by lines
                 type="o" lines drawn over the points
                 type="h" vertical lines
                 type="s" step graph
xlim =, ylim =   lower and upper limits of the x and y axes, e.g. xlim = c(1,10) or xlim = range(x)
xlab =, ylab =   labels for the x and y axes
main =           main title, placed above the graph
sub =            subtitle, placed below the graph
bg =             background colour of the graph
bty =            shape of the box drawn around the graph
pch =            plotting symbol used for points
lty =            line type: 1 solid, 2 dashed, 3 dotted, 4 dot-dash
col =            colour, e.g. "red", "green", "blue", or a colour number
mar =            margin sizes in the order c(bottom, left, top, right); the default is c(5,4,4,2) + 0.1

[Figure: available pch symbols and lty line types]

Example:
par(mfrow = c(2,2))
plot(x,y, type="b", main = "cosine graph", sub = "type = b")

plot(x,y, type="o", las = 1, bty = "u", sub = "type = o")

plot(x,y, type="h", bty = "7", sub = "type = h")

plot(x,y, type="s", bty = "n", sub = "type = s")


abline(): Straight Lines

abline(a,b) # line with intercept a and slope b

abline(h=y) # horizontal line

abline(v=x) # vertical line

abline(lm.obj) # line fitted in lm.obj

Example:

data(cars)

attach(cars)

par(mfrow=c(2,2))

plot(speed, dist, pch=1); abline(v=15.4)

plot(speed, dist, pch=2); abline(h=43)

plot(speed, dist, pch=3); abline(-14,3)

plot(speed, dist, pch=8); abline(v=15.4); abline(h=43)
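The forms abline(a,b) and abline(lm.obj) can describe the same line. The sketch below fits a regression on the cars data and draws it both ways; the names fit and ab are chosen here for illustration. Because a least-squares line passes through the point of means, the dotted vertical and horizontal lines cross on the fitted line.

```r
data(cars)

fit <- lm(dist ~ speed, data = cars)
ab  <- coef(fit)          # ab[1] is the intercept, ab[2] is the slope

plot(cars$speed, cars$dist, pch = 1)
abline(fit, col = "red")                    # abline(lm.obj)
abline(a = ab[1], b = ab[2], lty = 2)       # the same line as abline(a, b)

# Dotted lines through the mean of each variable
abline(v = mean(cars$speed), h = mean(cars$dist), lty = 3)
```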


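Histograms

The hist(x) function

• x is a numeric vector of the values to plot

• freq=FALSE plots probability densities instead of frequencies

• breaks= sets the number of bins (a single number is treated by R as a suggestion)

Drawback of histograms:

• the result depends strongly on the number of bins.

# Simplest histogram

hist(mtcars$mpg)

# Set the number of bins and the fill color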

hist(mtcars$mpg, breaks=12, col="red")
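The freq=FALSE option is what makes a histogram comparable with a density curve, since the bar areas then sum to 1. A small sketch of this, with the object name h and the titles chosen only for the illustration:

```r
# Density-scaled histogram; hist() also returns the bin information invisibly.
h <- hist(mtcars$mpg, breaks = 12, freq = FALSE, col = "grey",
          main = "MPG: histogram with density curve",
          xlab = "Miles Per Gallon")

# Overlay a kernel density estimate on the same scale.
lines(density(mtcars$mpg), col = "blue", lwd = 2)

length(h$counts)                  # number of bins R actually used
sum(h$density * diff(h$breaks))   # total area of the bars
```

Checking length(h$counts) is a quick way to see how R interpreted the breaks= value.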


Density Plots

Kernel density plots: plot(density(x)), where x is a numeric vector.

# Kernel Density Plot

d <- density(mtcars$mpg)

plot(d) # plots the results

# Filled Density Plot

d <- density(mtcars$mpg)

plot(d, main="Kernel Density of Miles Per Gallon")

polygon(d, col="red", border="blue")


Comparing Groups with Kernel Density

sm.density.compare(x, factor) from the sm package

• x is a numeric vector and factor is the grouping variable.

• It superimposes the kernel density plots of two or more groups.

# Compare MPG distributions for cars with 4,6, or 8 cylinders

library(sm)

attach(mtcars)

# create value labels

cyl.f <- factor(cyl, levels= c(4,6,8),

labels = c("4 cylinder", "6 cylinder", "8 cylinder"))

# plot the density curves

sm.density.compare(mpg, cyl, xlab="Miles Per Gallon")

title(main="MPG Distribution by Car Cylinders")

# add legend via mouse click

colfill<-c(2:(2+length(levels(cyl.f))))

legend(locator(1), levels(cyl.f), fill=colfill)
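If the sm package is not installed, a similar comparison can be approximated with base graphics alone. This is a swapped-in technique rather than what sm.density.compare( ) does internally: it simply estimates one density per group with density( ) and overlays the curves. The names groups, dens and cols are chosen for this sketch.

```r
# One kernel density estimate per cylinder group
groups <- split(mtcars$mpg, mtcars$cyl)
dens   <- lapply(groups, density)

# Axis ranges that cover all curves
xr   <- range(sapply(dens, function(d) range(d$x)))
ymax <- max(sapply(dens, function(d) max(d$y)))

# Empty frame first, then one line per group
plot(xr, c(0, ymax), type = "n",
     xlab = "Miles Per Gallon", ylab = "Density",
     main = "MPG Distribution by Car Cylinders")
cols <- 2:4
for (i in seq_along(dens)) {
  lines(dens[[i]], col = cols[i])
}
legend("topright", legend = paste(names(dens), "cylinder"),
       col = cols, lty = 1)
```

Here the legend is placed at a fixed position, so no mouse click is needed.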


Dot Plots

dotchart(x, labels=)

• x is a numeric vector and labels is a vector of labels for the points.

• groups= names a factor by which the elements of x are grouped.

dotchart(mtcars$mpg,labels=row.names(mtcars),
  cex=.7,main="Gas Mileage for Car Models",
  xlab="Miles Per Gallon")

# Dotplot: grouped, sorted (by mpg within group), and colored (by cylinder)

x <- mtcars[order(mtcars$mpg),] # sort by mpg

x$cyl <- factor(x$cyl)

x$color[x$cyl==4] <- "red"

x$color[x$cyl==6] <- "blue"

x$color[x$cyl==8] <- "darkgreen"

dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,

main="Gas Mileage for Car Models\ngrouped by cylinder",

xlab="Miles Per Gallon", gcolor="black", color=x$color)


Bar Plots

barplot(height), where height is a vector or a matrix.

• If height is a vector: one bar per value, with the bar height given by the value.

• If height is a matrix and beside=FALSE: each bar corresponds to a column of height, and the values in the column give the heights of the stacked "sub-bars".

• If height is a matrix and beside=TRUE: the values in each column are juxtaposed rather than stacked.

• names.arg=(character vector): labels for the bars

• horiz=TRUE: draws a horizontal bar plot

Bar plots are not limited to frequencies; other statistics (means, medians, standard deviations, etc.) can be plotted too.

• Pass the result of aggregate( ) to barplot( ).
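The aggregate( ) route can be sketched as follows: compute the mean mpg for each number of cylinders, then hand the resulting column to barplot( ). The object name means and the titles are assumptions made for the illustration.

```r
# Mean miles per gallon for each cylinder count
means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
means

# One bar per group; the group values become the bar labels
barplot(means$mpg, names.arg = means$cyl,
        main = "Mean MPG by Number of Cylinders",
        xlab = "Number of Cylinders", ylab = "Mean Miles Per Gallon")
```

Replacing FUN = mean with median or sd plots a different statistic in the same way.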


# Simple Bar Plot

counts <- table(mtcars$gear)

barplot(counts, main="Car Distribution", xlab="Number of Gears")

# Simple Horizontal Bar Plot with Added Labels

counts <- table(mtcars$gear)

barplot(counts, main="Car Distribution", horiz=TRUE,

names.arg=c("3 Gears", "4 Gears", "5 Gears"))


Stacked Bar Plot

# Stacked Bar Plot with Colors and Legend

counts <- table(mtcars$vs, mtcars$gear)

barplot(counts, main="Car Distribution by Gears and VS",
  xlab="Number of Gears", col=c("darkblue","red"),
  legend = rownames(counts))


Grouped Bar Plot

# Grouped Bar Plot

counts <- table(mtcars$vs, mtcars$gear)

barplot(counts, main="Car Distribution by Gears and VS",
  xlab="Number of Gears", col=c("darkblue","red"),
  legend = rownames(counts), beside=TRUE)


Line Charts

lines(x, y, type=)

• x and y are numeric vectors of the coordinates of the points to connect

• type= takes one of the following values

Type   Description
p      points
l      lines
o      overplotted points and lines
b, c   points joined by lines ("c" leaves the points empty)
s, S   stair steps
h      histogram-like vertical lines
n      nothing is plotted


The lines( ) function

• cannot produce a graph by itself; it is used after a plot(x, y) command.

• By default, plot( ) plots the (x,y) points.

• With type="n", plot( ) creates the axes, titles, etc. without plotting the points.

• type="c" only looks different from type="b" if the plotting of points is suppressed in the plot( ) command.

Example:

x <- c(1:5); y <- x # create some data

par(pch=22, col="red") # plotting symbol and color

par(mfrow=c(2,4)) # all plots on one page

opts = c("p","l","o","b","c","s","S","h")

for(i in 1:length(opts)){

heading = paste("type=",opts[i])

plot(x, y, type="n", main=heading)

lines(x, y, type=opts[i])

}



Example of the type= options when plot( ) also draws the points:

x <- c(1:5); y <- x # create some data

par(pch=22, col="blue") # plotting symbol and color

par(mfrow=c(2,4)) # all plots on one page

opts = c("p","l","o","b","c","s","S","h")

for(i in 1:length(opts)){

heading = paste("type=",opts[i])

plot(x, y, main=heading)

lines(x, y, type=opts[i])

}


# Create Line Chart

Orange$Tree <- as.numeric(Orange$Tree) # convert factor to numeric

ntrees <- max(Orange$Tree)

xrange <- range(Orange$age); yrange <- range(Orange$circumference)

# set up the plot

plot(xrange, yrange, type="n", xlab="Age (days)", ylab="Circumference (mm)" )

colors <- rainbow(ntrees); linetype <- c(1:ntrees)

plotchar <- seq(18,18+ntrees,1)

# add lines

for (i in 1:ntrees) {

tree <- subset(Orange, Tree==i)

lines(tree$age, tree$circumference, type="b", lwd=1.5,

lty=linetype[i], col=colors[i], pch=plotchar[i])

}

# add a title and subtitle

title("Tree Growth", "example of line plot")

# add a legend

legend(xrange[1], yrange[2], 1:ntrees, cex=0.8, col=colors,

pch=plotchar, lty=linetype, title="Tree")



Pie Charts

pie(x, labels=), where x is a non-negative numeric vector giving the area of each slice

• labels= is a character vector of names for the slices

# Simple Pie Chart

slices <- c(10, 12,4, 16, 8)

lbls <- c("US", "UK", "Australia", "Germany", "France")

pie(slices, labels = lbls, main="Pie Chart of Countries")


Pie Chart with Percentages

# Pie Chart with Percentages

slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels

pie(slices,labels = lbls, col=rainbow(length(lbls)),

main="Pie Chart of Countries")


3D Pie Chart

pie3D( ) from the plotrix package

• draws a 3D exploded pie chart

# 3D Exploded Pie Chart

library(plotrix)
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie3D(slices,labels=lbls,explode=0.1,

main="Pie Chart of Countries ")


Box Plot

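Box plots can be drawn for individual variables or by group.

boxplot(x, data=)

• x is a formula; data= names the data frame

• formula example: y~group

• varwidth=TRUE makes the box widths proportional to the square root of the group sizes

• horizontal=TRUE swaps the axes so that the boxes are drawn horizontally

# Boxplot of MPG by Car Cylinders

boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",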

xlab="Number of Cylinders", ylab="Miles Per Gallon")
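The varwidth= and horizontal= options can be tried on the same data. boxplot( ) also returns the numbers behind the picture, which helps when reading the plot; the object name b below is chosen for this sketch.

```r
# Horizontal boxes whose widths reflect the group sizes
b <- boxplot(mpg ~ cyl, data = mtcars,
             varwidth = TRUE, horizontal = TRUE,
             main = "Car Mileage Data",
             xlab = "Miles Per Gallon", ylab = "Number of Cylinders")

b$n       # number of cars in each group (drives the box widths)
b$stats   # per group: lower whisker, lower hinge, median, upper hinge, upper whisker
```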



# Notched Boxplot of Tooth Growth Against 2 Crossed Factors

# boxes colored for ease of interpretation

boxplot(len~supp*dose, data=ToothGrowth, notch=TRUE,

col=(c("gold","darkgreen")),

main="Tooth Growth", xlab="Supplement and Dose")


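Violin Plots

A violin plot is a hybrid of a boxplot and a kernel density plot.

vioplot( ) from the vioplot package:

library(vioplot)

x1 <- mtcars$mpg[mtcars$cyl==4]
x2 <- mtcars$mpg[mtcars$cyl==6]
x3 <- mtcars$mpg[mtcars$cyl==8]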
vioplot(x1,x2,x3,names=c("4 cyl", "6 cyl", "8 cyl"),col="gold")

title("Violin Plots of Miles Per Gallon")


Scatterplots

The simplest way to draw a scatterplot: plot(x, y)

• x and y are numeric vectors giving the (x,y) points to plot

attach(mtcars)

plot(wt, mpg, main="Scatterplot Example",

xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)


# Add fit lines

abline(lm(mpg~wt), col="red") # regression line (y~x)

lines(lowess(wt,mpg), col="blue") # lowess line (x,y)


scatterplot( ) from the car package offers many additional features.

Examples: fit lines, marginal box plots, conditioning on a factor, and interactive point identification.

# Enhanced Scatterplot of MPG vs. Weight by no. of Car Cylinders

library(car)

scatterplot(mpg ~ wt | cyl, data=mtcars,

xlab="Weight of Car", ylab="Miles Per Gallon",

main="Enhanced Scatter Plot", labels=row.names(mtcars))


Scatterplot Matrices

Example:

pairs(~mpg+disp+drat+wt,data=mtcars,

main="Simple Scatterplot Matrix")


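The lattice package provides options to condition the scatterplot matrix on a factor.

# Scatterplot Matrices from the lattice Package

library(lattice)
super.sym <- trellis.par.get("superpose.symbol") # symbols and colors used for the groups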

splom(mtcars[c(1,3,5,6)], groups=cyl, data=mtcars,

panel=panel.superpose,

key=list(title="Three Cylinder Options", columns=3,

points=list(pch=super.sym$pch[1:3], col=super.sym$col[1:3]),

text=list(c("4 Cylinder","6 Cylinder","8 Cylinder"))))



The car package can also condition the scatterplot matrix on a factor.

Optionally, it can add:

• lowess and linear best fit lines, boxplots, densities, or histograms in the principal diagonal, and rug plots in the margins of the cells.

# Scatterplot Matrix from the car Package

library(car)

# older releases of car name this function scatterplot.matrix( )
scatterplotMatrix(~mpg+disp+drat+wt|cyl,

data=mtcars,

main="Three Cylinder Options")


The gclus package provides options to rearrange the variables so that those with higher correlations are closer to the principal diagonal. It can also color code the cells to reflect the size of the correlations.

# Scatterplot Matrices from the gclus Package

library(gclus)

dta <- mtcars[c(1,3,5,6)] # get data

dta.r <- abs(cor(dta)) # get correlations

dta.col <- dmat.color(dta.r) # get colors

# reorder variables so those with the highest correlation are closest to the diagonal

dta.o <- order.single(dta.r)

cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,

main="Variables Ordered and Colored by Correlation" )



High Density Scatterplots The hexbin(x, y) function in the hexbin package provides

bivariate binning into hexagonal cells (it looks better than it sounds).

# High Density Scatterplot with Binning

library(hexbin)

x <- rnorm(1000)

y <- rnorm(1000)

bin<-hexbin(x, y, xbins=50)

plot(bin, main="Hexagonal Binning")


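Another option for heavily overlapping points is sunflowerplot( ); see help(sunflowerplot). Color transparency is a further option:

# High Density Scatterplot with Color Transparency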

pdf("c:/scatterplot.pdf")

x <- rnorm(1000)

y <- rnorm(1000)

plot(x,y, main="PDF Scatterplot Example",

col=rgb(0,100,0,50,maxColorValue=255), pch=16)

dev.off()
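sunflowerplot( ) is mentioned here without an example. A minimal sketch: the data are rounded so that many points coincide, and each extra observation at a location is drawn as a "petal". The seed, the rounding and the name sf are assumptions made only for this illustration.

```r
set.seed(1)                  # assumption: fixed seed, only to make the picture reproducible
x <- round(rnorm(1000))      # rounding forces many identical (x, y) pairs
y <- round(rnorm(1000))

# Returns (invisibly) the distinct locations and how many points fall on each
sf <- sunflowerplot(x, y, main = "Sunflower Plot Example")

sum(sf$number)               # every observation is counted at exactly one location
```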


3D Scatterplots

scatterplot3d(x, y, z) from the scatterplot3d package.

# 3D Scatterplot

library(scatterplot3d)

attach(mtcars)

scatterplot3d(wt,disp,mpg, main="3D Scatterplot")


# 3D Scatterplot with Vertical Drop Lines and Regression Plane

library(scatterplot3d)

attach(mtcars)

s3d <-scatterplot3d(wt,disp,mpg, pch=16, highlight.3d=TRUE,

type="h", main="3D Scatterplot")

fit <- lm(mpg ~ wt+disp)

s3d$plane3d(fit)