Data Analytics
김재윤, 이성민 (Team Leader), 이용현, 최찬희, 하승도
Contents: Part I

1. Introduction
- Data Analytics Cases
- What is Data Analytics?
- OLTP, OLAP
- ROLAP
- MOLAP
- Column store

2. Applications & Data Management Systems
- MATLAB
- R
- Impala
- Splunk
- HANA

3. Research Trends
- Hyper (ICDE 2011): Combination of OLTP & OLAP
- Starfish (CIDR 2011)
- Crisis Informatics (ICSE 2011)
Contents: Part II

4. Demo 1: Statistical Computing
- MATLAB
- R

5. Demo 2: Analytics on Hadoop
- Pig
- Hive
- Impala

6. Demo 3: Real-time Analytics
- Splunk
Overview

          Statistical  Query       Time-series  Data           Open
          Analytics    Processing  Analytics    Visualization  Source
MATLAB    O            X           O            O              X
R         O            X           O            O              O
Impala    O            X           O
Splunk    X            O           O            O              X
HANA      O            X           X
Demo 1: Statistical Computing
- MATLAB
- R
MATLAB
• Engineering software that provides a numerical computing environment
- Matrix manipulations
- Plotting of functions and data
- Implementation of algorithms
- Creation of user interfaces
- Interfaces with C, C++, Java, Fortran, and Python
MATLAB
• Interface
MATLAB
• Too slow to handle large datasets
MATLAB
• Code example
Demo: Plot
Demo: Data Linking
Demo: Regression
Demo: Polynomial Fitting
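The polynomial-fitting demo slide is an image; the fit it presumably shows (MATLAB's polyfit) can be sketched in plain Python for the degree-2 case by solving the 3x3 normal equations directly. This is an illustrative stdlib sketch with invented data, not the demo's actual code:

```python
def polyfit2(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 via the normal equations."""
    S = [sum(x ** k for x in xs) for k in range(5)]                 # power sums of x
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # moments of y
    # Augmented normal-equation matrix [A^T A | A^T y]
    M = [[S[0], S[1], S[2], T[0]],
         [S[1], S[2], S[3], T[1]],
         [S[2], S[3], S[4], T[2]]]
    # Forward elimination (no pivoting; fine for small, well-conditioned data)
    for i in range(3):
        for j in range(i + 1, 3):
            f = M[j][i] / M[i][i]
            M[j] = [mj - f * mi for mj, mi in zip(M[j], M[i])]
    # Back substitution
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        coef[i] = (M[i][3] - sum(M[i][j] * coef[j] for j in range(i + 1, 3))) / M[i][i]
    return coef  # [a, b, c]

a, b, c = polyfit2([0, 1, 2, 3], [1, 2, 5, 10])  # data lies exactly on y = 1 + x**2
```

For exact data the least-squares solution recovers the generating polynomial, which makes the sketch easy to sanity-check.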
R
• Programming language for statistical computing and graphics
- Widely used among statisticians and data analysts
- Runs on Windows, Mac, and Linux
- Free to use
- Easily extensible through functions
- Provides statistical techniques
- Provides high-quality graphics
- Many third-party libraries
R
• R Language Example
R
• RStudio Example
R
• Vector Example
R
• Matrix Example
R
• Scatter Plot & Visualization Example
R
• Plentiful Library Example
R
• Heatmap Example
R
• Line Graph Example
R
• Linear Regression Example
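The linear-regression demo slide is an image (in R this would be lm(y ~ x)); the underlying closed-form fit, slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x), can be sketched in stdlib Python. Data and function name here are invented for illustration:

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)               # variance numerator of x
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # covariance numerator
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data on the line y = 1 + 2x
```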
Demo 2: Analytics on Hadoop
- Pig
- Hive
- Impala
Analytics on Hadoop
1. Mapper and Reducer programs
- Writing Java programs to analyze data in HDFS

2. SQL-like queries
- Writing queries in a high-level SQL-like language, similar to Oracle or MySQL
Analytics on Hadoop
1. Mapper and Reducer for word count

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new WordCount(), args);
    System.exit(res);
  }

  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "wordcount");
    job.setJarByClass(this.getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

    @Override
    public void map(LongWritable offset, Text lineText, Context context)
        throws IOException, InterruptedException {
      for (String word : WORD_BOUNDARY.split(lineText.toString())) {
        if (word.isEmpty()) {
          continue;
        }
        context.write(new Text(word), one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}
Ref. Cloudera Hadoop Tutorial
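The map/shuffle/reduce flow of the WordCount job above can be simulated locally. A minimal stdlib-Python sketch (the sample lines and function names are invented for illustration, not part of Hadoop):

```python
import re
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit (word, 1) for every word in every input line
    for line in lines:
        for word in re.split(r"\s+", line.strip()):
            if word:
                yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["hello world", "hello hadoop"])))
```

The three functions mirror the Mapper, the framework's shuffle, and the Reducer one-to-one, which is why the Java job needs no explicit grouping code of its own.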
Analytics on Hadoop
2. SQL-like queries for word count
CREATE TABLE doc (text string);

LOAD DATA LOCAL INPATH '/home/Documents/sentiment/Wikipedia.txt'
OVERWRITE INTO TABLE doc;

SELECT word, COUNT(*)
FROM (SELECT explode(split(text, ' ')) AS word FROM doc) w
GROUP BY word;
SQL on Hadoop
1. Pig
- SQL-like scripting language called Pig Latin
- Scripts are automatically translated into MapReduce jobs

2. Hive
- SQL-like query language called HiveQL (HQL)
- Queries are also translated into MapReduce jobs

3. Impala
- Supports most of HiveQL plus additional statements
- Uses its own distributed engine (impalad) instead of MapReduce

All three let users write complex data transformations without knowing Java!
SQL on Hadoop

                  Pig              Hive             Impala
Released Year     2006             2008             2012
Dev. Language     Java             Java             C++
SQL Dialect       Pig Latin        HiveQL           HiveQL (+)
Query Processing  Tuple-at-a-time  Tuple-at-a-time  Block-at-a-time
                  (MapReduce)      (MapReduce)      (impalad)
ODBC/JDBC         Yes              Yes              Yes
Latency           High             High             Low
Suitable Jobs     Batch            Batch            Real-time
Benchmark
• System Environment
Cluster: 13 nodes (1 master + 12 slaves)
CPU: Intel i5
Memory: 32.0GB (each node)
HDD: 5.0TB (each node)
OS: Ubuntu 12.04
Hadoop: 2.3.0
Pig: 0.12.0
Hive: 0.13.1
Impala: 2.1.1
Benchmark
• Data Set
- 1GB of randomly generated sales transactions from TPC-DS
• Schema
Store_Sales: Date_FK, Customer_FK, Item_FK, number, cost, whole_cost, tax
Date_Dim: Date_FK, quarter, day, month, year
Item: Item_FK, color, company
Customer: Customer_FK, name, salutation, country
Benchmark
• Query 1: Average sales cost in the first half of the year

Hive & Impala:
SELECT AVG(ss.ss_ext_wholesale_cost)
FROM date_dim AS d, store_sales AS ss
WHERE d.d_date_sk = ss.ss_sold_date_sk
  AND d.d_qoy < 3;
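The join-filter-aggregate plan that Query 1 expresses can be sketched over in-memory rows. A stdlib-Python illustration; the keys and cost values below are invented sample data, not TPC-DS output:

```python
# d_date_sk -> d_qoy (quarter of year); stand-in for the date_dim table
date_dim = {"d1": 1, "d2": 4}

# (ss_sold_date_sk, ss_ext_wholesale_cost); stand-in for store_sales
store_sales = [("d1", 10.0), ("d1", 20.0), ("d2", 99.0)]

# Join on the date key, keep only the first two quarters, then average
costs = [cost for date_sk, cost in store_sales
         if date_sk in date_dim and date_dim[date_sk] < 3]
avg_cost = sum(costs) / len(costs)
```

Only the two first-half rows survive the filter, so the average is taken over them alone, exactly as the SQL's WHERE clause dictates.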
Benchmark
• Query 1: Average sales cost in the first half of the year

Pig:
ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
    AS (ss_sold_date_sk:chararray, … , ss_net_profit:int);
d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
    AS (d_date_sk:chararray, … , d_current_year:int);
metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
result = FILTER metadata BY d_qoy < 3;
grouped = GROUP result ALL;
avg_sales = FOREACH grouped GENERATE AVG(result.ss_ext_wholesale_cost);
STORE avg_sales INTO 'query1.txt';
Benchmark
• Query 2: Average sales cost on Sundays

Hive & Impala:
SELECT AVG(s.ss_ext_wholesale_cost)
FROM store_sales AS s, date_dim AS d
WHERE d.d_date_sk = s.ss_sold_date_sk
  AND d.d_day_name LIKE 'Sunday';
Benchmark
• Query 2: Average sales cost on Sundays

Pig:
ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
    AS (ss_sold_date_sk:chararray, … , ss_net_profit:int);
d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
    AS (d_date_sk:chararray, … , d_current_year:int);
metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
result = FILTER metadata BY d_day_name == 'Sunday';
grouped = GROUP result ALL;
avg_sales = FOREACH grouped GENERATE AVG(result.ss_ext_wholesale_cost);
STORE avg_sales INTO 'query2.txt';
Benchmark
• Query 3: Bottom 20 customer birth countries ordered by average sales cost on Sundays

Hive & Impala:
SELECT c.c_birth_country, AVG(ss.ss_ext_wholesale_cost) AS avg_sales
FROM store_sales AS ss, customer AS c, date_dim AS d
WHERE c.c_customer_sk = ss.ss_customer_sk
  AND d.d_date_sk = ss.ss_sold_date_sk
  AND d.d_day_name LIKE 'Sunday'
  AND c.c_birth_country != ''
GROUP BY c.c_birth_country
ORDER BY avg_sales
LIMIT 20;
Benchmark
• Query 3: Bottom 20 customer birth countries ordered by average sales cost on Sundays

Pig:
ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
    AS (ss_sold_date_sk:chararray, … , ss_net_profit:int);
d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
    AS (d_date_sk:chararray, … , d_current_year:int);
c = LOAD '/user/user01/customer.csv' USING PigStorage(',')
    AS (c_customer_sk:chararray, … , c_last_review_date:int);
metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
metadata2 = JOIN metadata BY ss::ss_customer_sk, c BY c_customer_sk;
result = FILTER metadata2 BY (d_day_name == 'Sunday') AND (c_birth_country != '');
grouped = GROUP result BY c_birth_country;
avg_table = FOREACH grouped GENERATE group AS birth_country,
    AVG(result.ss_ext_wholesale_cost) AS avg_sales;
ordered = ORDER avg_table BY avg_sales;
limited = LIMIT ordered 20;
STORE limited INTO 'query3.txt';
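Query 3's group-average-order-limit pipeline can be sketched in stdlib Python over hypothetical rows that stand in for the joined, filtered (country, cost) result; all values below are invented:

```python
from collections import defaultdict

# Hypothetical (birth_country, ss_ext_wholesale_cost) rows after joins and filters
rows = [("Korea", 10.0), ("Korea", 20.0), ("Japan", 5.0), ("USA", 40.0)]

totals = defaultdict(lambda: [0.0, 0])   # country -> [sum, count]
for country, cost in rows:
    totals[country][0] += cost
    totals[country][1] += 1

# GROUP BY + AVG
averages = {country: s / n for country, (s, n) in totals.items()}

# ORDER BY avg_sales ascending, LIMIT 20
bottom = sorted(averages.items(), key=lambda kv: kv[1])[:20]
```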
Benchmark
• Our results
Benchmark
• Results from Cloudera documents
Ref. http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
Demo 3: Real-time Analytics
- Splunk
Splunk
An engine for real-time machine data
- Collects, indexes, analyzes, and visualizes machine data to identify problems, patterns, risks, and opportunities, and to drive better decisions for IT and the business

Machine data (unstructured, no predefined schema)
- Logs, application queries, records (billing, call detail, events), clickstreams
Overview of Splunk
Data indexing

Search language
- Pipeline syntax: search | command arguments | command arguments | …
- Subsearch example: sourcetype=syslog [ search login error | return 1 user ]
- Relative time modifiers: [+|-]<integer><unit>@<snap_time_unit>
- Example: error earliest=-1d@d latest=-h@h
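The relative-time syntax above (e.g. -1d@d: go back one day, then snap to the start of that day) can be illustrated with a toy stdlib-Python version. The function name, arguments, and dates are invented for illustration and are not Splunk's API:

```python
from datetime import datetime, timedelta

def relative_time(now, offset, unit, snap):
    """Toy model of a Splunk relative-time modifier like -1d@d:
    apply the signed offset, then snap down to the start of the snap unit."""
    if unit == "d":
        t = now + timedelta(days=offset)
    elif unit == "h":
        t = now + timedelta(hours=offset)
    else:
        raise ValueError("unit not supported in this sketch")
    if snap == "d":
        t = t.replace(hour=0, minute=0, second=0, microsecond=0)
    elif snap == "h":
        t = t.replace(minute=0, second=0, microsecond=0)
    return t

now = datetime(2015, 5, 29, 14, 30)
earliest = relative_time(now, -1, "d", "d")   # like -1d@d
latest = relative_time(now, -1, "h", "h")     # like -h@h (implicit count of 1)
```

The "offset first, snap second" order is the key point: -1d@d from 14:30 lands at midnight of the previous day, not 14:30 minus 24 hours.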
Splunk demo (1): Simple commands using Windows application logs
Splunk demo (2): Foot traffic analytics using Cisco Meraki data
References
[1] Cloudera Hadoop Tutorial, http://www.cloudera.com/content/cloudera/en/documentation/hadoop-tutorial/CDH5/Hadoop-Tutorial/ht_wordount1_source.html, accessed 2015.05.29.
[2] An Introduction to Apache Hive, http://amalgjose.wordpress.com/2013/10/19/an-introduction-to-apache-hive, accessed 2015.05.29.
[3] SQL on Hadoop, Intelligent Data Systems Lab, Seoul National University.
[4] TPC Benchmark Standard Specification, version 1.3.1, Transaction Processing Performance Council, 2015.02.