Data Analytics
김재윤, 이성민 (Team Leader), 이용현, 최찬희, 하승도
Contents: Part I

1. Introduction
- Data Analytics Cases
- What is Data Analytics?
- OLTP, OLAP
- ROLAP
- MOLAP
- Column store

2. Applications & Data Management Systems
- MATLAB
- R
- Impala
- Splunk
- HANA

3. Research Trends
- Hyper (ICDE 2011): Combination of OLTP & OLAP
- Starfish (CIDR 2011)
- Crisis Informatics (ICSE 2011)
Contents: Part II

4. Demo 1: Statistical Computing
- MATLAB
- R

5. Demo 2: Analytics on Hadoop
- Pig
- Hive
- Impala

6. Demo 3: Real-time Analytics
- Splunk
Overview

          Statistical  Query       Time-series  Data           Open
          Analytics    Processing  Analytics    Visualization  Source
MATLAB    O            X           O            O              X
R         O            X           O            O              O
Impala    O            X           O
Splunk    X            O           O            O              X
HANA      O            X           X
Demo 1: Statistical Computing
- MATLAB
- R
MATLAB
• Engineering software that provides a numerical computing environment
- Matrix manipulations
- Plotting of functions and data
- Implementation of algorithms
- Creation of user interfaces
- Interfaces with C, C++, Java, Fortran, and Python
MATLAB
• Interface
MATLAB
• Too slow to handle large datasets
MATLAB
• Code example
Demo: Plot
Demo: Data Linking
Demo: Regression
Demo: Polynomial Fitting
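The polynomial-fitting demo slide is an image; the fit it presumably shows (MATLAB's polyfit) can be sketched in plain Python for the degree-2 case by solving the 3x3 normal equations directly. This is an illustrative stdlib sketch with invented data, not the demo's actual code:

```python
def polyfit2(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 via the normal equations."""
    S = [sum(x ** k for x in xs) for k in range(5)]                 # power sums of x
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # moments of y
    # Augmented normal-equation matrix [A^T A | A^T y]
    M = [[S[0], S[1], S[2], T[0]],
         [S[1], S[2], S[3], T[1]],
         [S[2], S[3], S[4], T[2]]]
    # Forward elimination (no pivoting; fine for small, well-conditioned data)
    for i in range(3):
        for j in range(i + 1, 3):
            f = M[j][i] / M[i][i]
            M[j] = [mj - f * mi for mj, mi in zip(M[j], M[i])]
    # Back substitution
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        coef[i] = (M[i][3] - sum(M[i][j] * coef[j] for j in range(i + 1, 3))) / M[i][i]
    return coef  # [a, b, c]

a, b, c = polyfit2([0, 1, 2, 3], [1, 2, 5, 10])  # data lies exactly on y = 1 + x**2
```

For exact data the least-squares solution recovers the generating polynomial, which makes the sketch easy to sanity-check.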
R
• Programming language for statistical computing and graphics
- Widely used among statisticians and data analysts
- Runs on Windows, Mac, and Linux
- Free to use
- Easily extensible through functions
- Provides statistical techniques
- Provides high-quality graphics
- Many third-party libraries
R
• R Language Example
R
• RStudio Example
R
• Vector Example
R
• Matrix Example
R
• Scatter Plot & Visualization Example
R
• Plentiful Library Example
R
• Heatmap Example
R
• Line Graph Example
R
• Linear Regression Example
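The linear-regression demo slide is an image (in R this would be lm(y ~ x)); the underlying closed-form fit, slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x), can be sketched in stdlib Python. Data and function name here are invented for illustration:

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)               # variance numerator of x
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # covariance numerator
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data on the line y = 1 + 2x
```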
Demo 2: Analytics on Hadoop
- Pig
- Hive
- Impala
Analytics on Hadoop
1. Mapper and Reducer programs
- Writing Java programs to analyze data in HDFS

2. SQL-like queries
- Writing queries in a high-level SQL-like language, similar to Oracle or MySQL
Analytics on Hadoop
1. Mapper and Reducer for word count

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new WordCount(), args);
    System.exit(res);
  }

  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "wordcount");
    job.setJarByClass(this.getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

    @Override
    public void map(LongWritable offset, Text lineText, Context context)
        throws IOException, InterruptedException {
      for (String word : WORD_BOUNDARY.split(lineText.toString())) {
        if (word.isEmpty()) {
          continue;
        }
        context.write(new Text(word), one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}
Ref. Cloudera Hadoop Tutorial
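The map/shuffle/reduce flow of the WordCount job above can be simulated locally. A minimal stdlib-Python sketch (the sample lines and function names are invented for illustration, not part of Hadoop):

```python
import re
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit (word, 1) for every word in every input line
    for line in lines:
        for word in re.split(r"\s+", line.strip()):
            if word:
                yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["hello world", "hello hadoop"])))
```

The three functions mirror the Mapper, the framework's shuffle, and the Reducer one-to-one, which is why the Java job needs no explicit grouping code of its own.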
Analytics on Hadoop
2. SQL-like queries for word count
CREATE TABLE doc (text string);

LOAD DATA LOCAL INPATH '/home/Documents/sentiment/Wikipedia.txt'
OVERWRITE INTO TABLE doc;

SELECT word, COUNT(*)
FROM (SELECT explode(split(text, ' ')) AS word FROM doc) w
GROUP BY word;
SQL on Hadoop
1. Pig
- SQL-like scripting language called Pig Latin
- Scripts are automatically translated into MapReduce jobs

2. Hive
- SQL-like query language called HiveQL (HQL)
- Queries are also translated into MapReduce jobs

3. Impala
- Supports most of HiveQL plus additional statements
- Uses its own distributed engine (impalad) instead of MapReduce

All three let users write complex data transformations without knowing Java!
SQL on Hadoop

                  Pig              Hive             Impala
Released Year     2006             2008             2012
Dev. Language     Java             Java             C++
SQL Dialect       Pig Latin        HiveQL           HiveQL (+)
Query Processing  Tuple-at-a-time  Tuple-at-a-time  Block-at-a-time
                  (MapReduce)      (MapReduce)      (impalad)
ODBC/JDBC         Yes              Yes              Yes
Latency           High             High             Low
Suitable Jobs     Batch            Batch            Real-time
Benchmark
• System Environment
Cluster: 13 nodes (1 master + 12 slaves)
CPU: Intel i5
Memory: 32.0GB (each node)
HDD: 5.0TB (each node)
OS: Ubuntu 12.04
Hadoop: 2.3.0
Pig: 0.12.0
Hive: 0.13.1
Impala: 2.1.1
Benchmark
• Data Set
- 1GB of randomly generated sales transactions from TPC-DS
• Schema
Store_Sales: Date_FK, Customer_FK, Item_FK, number, cost, whole_cost, tax
Date_Dim: Date_FK, quarter, day, month, year
Item: Item_FK, color, company
Customer: Customer_FK, name, salutation, country
Benchmark
• Query 1: Average sales cost in the first half of the year

Hive & Impala:
SELECT AVG(ss.ss_ext_wholesale_cost)
FROM date_dim AS d, store_sales AS ss
WHERE d.d_date_sk = ss.ss_sold_date_sk
  AND d.d_qoy < 3;
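The join-filter-aggregate plan that Query 1 expresses can be sketched over in-memory rows. A stdlib-Python illustration; the keys and cost values below are invented sample data, not TPC-DS output:

```python
# d_date_sk -> d_qoy (quarter of year); stand-in for the date_dim table
date_dim = {"d1": 1, "d2": 4}

# (ss_sold_date_sk, ss_ext_wholesale_cost); stand-in for store_sales
store_sales = [("d1", 10.0), ("d1", 20.0), ("d2", 99.0)]

# Join on the date key, keep only the first two quarters, then average
costs = [cost for date_sk, cost in store_sales
         if date_sk in date_dim and date_dim[date_sk] < 3]
avg_cost = sum(costs) / len(costs)
```

Only the two first-half rows survive the filter, so the average is taken over them alone, exactly as the SQL's WHERE clause dictates.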
Benchmark
• Query 1: Average sales cost in the first half of the year

Pig:
ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
    AS (ss_sold_date_sk:chararray, … , ss_net_profit:int);
d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
    AS (d_date_sk:chararray, … , d_current_year:int);
metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
result = FILTER metadata BY d_qoy < 3;
grouped = GROUP result ALL;
avg_sales = FOREACH grouped GENERATE AVG(result.ss_ext_wholesale_cost);
STORE avg_sales INTO 'query1.txt';
Benchmark
• Query 2: Average sales cost on Sundays

Hive & Impala:
SELECT AVG(s.ss_ext_wholesale_cost)
FROM store_sales AS s, date_dim AS d
WHERE d.d_date_sk = s.ss_sold_date_sk
  AND d.d_day_name LIKE 'Sunday';
Benchmark
• Query 2: Average sales cost on Sundays

Pig:
ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
    AS (ss_sold_date_sk:chararray, … , ss_net_profit:int);
d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
    AS (d_date_sk:chararray, … , d_current_year:int);
metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
result = FILTER metadata BY d_day_name == 'Sunday';
grouped = GROUP result ALL;
avg_sales = FOREACH grouped GENERATE AVG(result.ss_ext_wholesale_cost);
STORE avg_sales INTO 'query2.txt';
Benchmark
• Query 3: Bottom 20 customer birth countries ordered by average sales cost on Sundays

Hive & Impala:
SELECT c.c_birth_country, AVG(ss.ss_ext_wholesale_cost) AS avg_sales
FROM store_sales AS ss, customer AS c, date_dim AS d
WHERE c.c_customer_sk = ss.ss_customer_sk
  AND d.d_date_sk = ss.ss_sold_date_sk
  AND d.d_day_name LIKE 'Sunday'
  AND c.c_birth_country != ''
GROUP BY c.c_birth_country
ORDER BY avg_sales
LIMIT 20;
Benchmark
• Query 3: Bottom 20 customer birth countries ordered by average sales cost on Sundays

Pig:
ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
    AS (ss_sold_date_sk:chararray, … , ss_net_profit:int);
d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
    AS (d_date_sk:chararray, … , d_current_year:int);
c = LOAD '/user/user01/customer.csv' USING PigStorage(',')
    AS (c_customer_sk:chararray, … , c_last_review_date:int);
metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
metadata2 = JOIN metadata BY ss::ss_customer_sk, c BY c_customer_sk;
result = FILTER metadata2 BY (d_day_name == 'Sunday') AND (c_birth_country != '');
grouped = GROUP result BY c_birth_country;
avg_table = FOREACH grouped GENERATE group AS birth_country,
    AVG(result.ss_ext_wholesale_cost) AS avg_sales;
ordered = ORDER avg_table BY avg_sales;
limited = LIMIT ordered 20;
STORE limited INTO 'query3.txt';
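Query 3's group-average-order-limit pipeline can be sketched in stdlib Python over hypothetical rows that stand in for the joined, filtered (country, cost) result; all values below are invented:

```python
from collections import defaultdict

# Hypothetical (birth_country, ss_ext_wholesale_cost) rows after joins and filters
rows = [("Korea", 10.0), ("Korea", 20.0), ("Japan", 5.0), ("USA", 40.0)]

totals = defaultdict(lambda: [0.0, 0])   # country -> [sum, count]
for country, cost in rows:
    totals[country][0] += cost
    totals[country][1] += 1

# GROUP BY + AVG
averages = {country: s / n for country, (s, n) in totals.items()}

# ORDER BY avg_sales ascending, LIMIT 20
bottom = sorted(averages.items(), key=lambda kv: kv[1])[:20]
```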
Benchmark
• Our results
Benchmark
• Results from Cloudera documents
Ref. http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
Demo 3: Real-time Analytics
- Splunk
Splunk
An engine for real-time machine data
- Collects, indexes, analyzes, and visualizes machine data to identify problems, patterns, risks, and opportunities, and to drive better decisions for IT and the business

Machine data (unstructured, no predefined schema)
- Logs, application queries, records (billing, call detail, events), clickstreams
Overview of Splunk
Data indexing

Search language
- Pipeline syntax: search | command arguments | command arguments | …
- Subsearch example: sourcetype=syslog [ search login error | return 1 user ]
- Relative time modifiers: [+|-]<integer><unit>@<snap_time_unit>
- Example: error earliest=-1d@d latest=-h@h
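The relative-time syntax above (e.g. -1d@d: go back one day, then snap to the start of that day) can be illustrated with a toy stdlib-Python version. The function name, arguments, and dates are invented for illustration and are not Splunk's API:

```python
from datetime import datetime, timedelta

def relative_time(now, offset, unit, snap):
    """Toy model of a Splunk relative-time modifier like -1d@d:
    apply the signed offset, then snap down to the start of the snap unit."""
    if unit == "d":
        t = now + timedelta(days=offset)
    elif unit == "h":
        t = now + timedelta(hours=offset)
    else:
        raise ValueError("unit not supported in this sketch")
    if snap == "d":
        t = t.replace(hour=0, minute=0, second=0, microsecond=0)
    elif snap == "h":
        t = t.replace(minute=0, second=0, microsecond=0)
    return t

now = datetime(2015, 5, 29, 14, 30)
earliest = relative_time(now, -1, "d", "d")   # like -1d@d
latest = relative_time(now, -1, "h", "h")     # like -h@h (implicit count of 1)
```

The "offset first, snap second" order is the key point: -1d@d from 14:30 lands at midnight of the previous day, not 14:30 minus 24 hours.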
Splunk demo (1): Simple commands using Windows application logs
Splunk demo (2): Foot traffic analytics using Cisco Meraki data
References
[1] Cloudera Hadoop Tutorial, http://www.cloudera.com/content/cloudera/en/documentation/hadoop-tutorial/CDH5/Hadoop-Tutorial/ht_wordount1_source.html, accessed 2015.05.29.
[2] An Introduction to Apache Hive, http://amalgjose.wordpress.com/2013/10/19/an-introduction-to-apache-hive, accessed 2015.05.29.
[3] SQL on Hadoop, Intelligent Data Systems Lab, Seoul National University.
[4] TPC Benchmark Standard Specification, version 1.3.1, Transaction Processing Performance Council, 2015.02.