©2017 Yunmook Nah

The 4th Industrial Revolution and Big Data Technology

April 07, 2017

Prof. Yunmook Nah, Ph.D. (나연묵)

ymnah@dankook.ac.kr

Professor, Department of Applied Computer Engineering
Chairman, Department of Data Science, Graduate School
Senior Director, Research Institute of Information and Culture Technology
Dankook University

OUTLINE
• Overview
• Data
• Bigdata
• Bigdata technology
• Bigdata issues

OVERVIEW
• 4th industrial revolution
  – Core technology
    • Mobile => Smart phone, wearable devices, vehicles
    • Processing power, storage capacity => Computing equipment, data center, cloud
    • Knowledge => Bigdata
  – Emerging technology
    • AI, robotics, IoT, autonomous vehicles, 3-D printing, nanotech, biotech, materials science, energy storage
    • CPS, AR/VR, etc.

The possibilities of billions of people connected by mobile devices, with unprecedented processing power, storage capacity, and access to knowledge, are unlimited. And these possibilities will be multiplied by emerging technology breakthroughs in fields such as artificial intelligence, robotics, the Internet of Things, autonomous vehicles, 3-D printing, nanotechnology, biotechnology, materials science, energy storage, and quantum computing. [Klaus Schwab]

DATA
• Data: fact
• Information = Processing (Data)
• Knowledge
  – Knowledge discovery = data mining
  – Ontology, semantic network
  – Deep learning, artificial intelligence
• Collection of data = databases (DB)
• Current data + historical data = data warehouse
  – OLAP, BI (Business Intelligence), data mining
• BigData: very large volume of data
• Data about data: metadata

BIGDATA
• Features
  – Data size: terabytes (TB), petabytes (PB), exabytes (EB), or more
  – Examples
    • Server log, Web log, database log, …
    • Search engine data: WebMap
    • Social network data: Text messages
    • Non-formatted data: Text, Video
    • User behavior pattern
    • Traffic data
    • IoT-enabled data
  – Usually handled by batch processing, although real-time processing is the goal

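• Data ownership: who has big data?
  – Internet portal: Google, Yahoo, Naver, Baidu
  – Social networking: Facebook, Twitter, YouTube, Instagram, Kakao
  – E-commerce site: eBay, Amazon, Interpark, Rakuten, Alibaba, etc.
  – Government: Ministry of Security and Public Administration; Ministry of Education; Ministry of Employment and Labor; Ministry of Health and Welfare; Ministry of Trade, Industry and Energy; Ministry of Science, ICT and Future Planning; Seoul Metropolitan Government; Gyeonggi Province; etc. => administrative information sharing, open data release
  – Public agency: Korea Energy Management Corporation
  – Telco: KT, SK Telecom
  – Finance: Financial Supervisory Service, stock exchange, Shinhan Card, Shinhan Bank, etc.
  – Healthcare: Health Insurance Review and Assessment Service, Seoul National University Hospital, Samsung Medical Center, Asan Medical Center, etc.
  – Education: MOOC sites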

• Data related to the 4th industrial revolution
  – Bio: genome data
  – Healthcare: patient record, sensor data, image, video data
  – Robotics: robot-captured data (sensor, video, …)
  – Manufacturing: MES data, IoT-device enabled data
  – Transportation: traffic data (trajectory data)
  – Security: CCTV, 112 (police emergency call) voice data
  – Finance: credit/debit data, stock trading
  – Energy: power generation volume, smart sensor data
  – Distribution: logistics data by RFID
  – Public administration: government-owned data
  – Welfare: National Pension data
  – Defense and public safety: Ministry of National Defense, National Police Agency, National Emergency Management Agency data
  – Agriculture: IoT sensor captured data

NYC taxi 2013 trip data: Start point, end point, timestamps, taxi id, fare, tip amount => 173 million trips anonymized

• BigData source
  – Sensor, CCTV
  – IoT
  – Wearable device: CGM (Continuous Glucose Monitoring)
  – Monitoring tool: EMS, BEMS, …
  – Location information: GPS
  – Korea Expressway Corporation: VDS (Vehicle Detection System), AVI (Automatic Vehicle Identification) system, TCS (Toll Collection System), Hi-Pass system

[Figure: detected screen and sample data]

• Wearable Device Applications (application: product categories)
  – Fitness and Wellness: Sports and Activity Monitors, Fitness and Heart Rate Monitor, Smart Sports Glasses, Smart Clothing, Sleep Sensors, Emotional Measurement
  – Healthcare and Medical: Continuous Glucose Monitoring, ECG Monitoring, Pulse Oximetry, Blood Pressure Monitors, Drug Delivery, Hearing Aids, Wearable Patches, Defibrillators
  – Industrial & Military: Hand-worn Terminals, Augmented Reality Headsets, Smart Clothing
  – Infotainment: Smart Watches, Augmented Reality Headsets, Smart Glasses, Wearable Imaging Devices

[Source: IMS Research, August 2012]

[Images: Adidas heart-rate-monitoring bra; Nike smart sneakers that record exercise intensity]

BIGDATA TECHNOLOGY


From Data to Knowledge
[Figure: processing stages labeled Crawling, Extraction; Cleansing; Classification, Clustering, Regression; Visualization]

• Data collection
  – Log data collection
  – Using relational databases
  – Web crawling
  – Using open API (social data collection)

• Data storing
  – Distributed file system: Hadoop, HDFS
  – Distributed databases: NoSQL, Apache HBase, MongoDB
  – In-memory data management: Redis

• Traffic data collection and storing

[Figure: Historical Traffic Data Management System. Traffic Data Warehouses hold dimension information, aggregate data 1..n, refined data 1..n, and raw historical data. Inputs: FTMS database, ARTIS database, basic information. Functions: raw historical data refinement, refined historical data evaluation, historical traffic data analysis.]

Hadoop 2.0 Ecosystem

• Data processing
  – Distributed parallel processing: MapReduce, YARN
  – SQL on Hadoop: Hive, Tajo, Shark
  – Stream data processing: Storm, Spark
  – MapReduce processing on top of virtual cluster (Bigdata on Cloud)
    • On top of Xen-based virtual cluster
    • On top of Docker-based virtual cluster
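The MapReduce model named in the list above can be illustrated with a minimal single-process sketch. This is not the Hadoop API: the function names and the sample log lines are hypothetical, and a real cluster would run the map and reduce phases in parallel across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every token in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical web-log fragments, used only to exercise the pipeline.
sample_log = ["GET /index GET /about", "POST /login GET /index"]
counts = word_count(sample_log)
```

Keeping the three phases as separate functions mirrors the way a framework distributes them: only `map_phase` and `reduce_phase` carry application logic, while the shuffle is generic.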


Hadoop cluster vs Spark cluster [S.Han & Y.Nah, ICNGC 2017]

Data processing time (4GB memory per node)

Bigdata on Cloud
• Bigdata platform
  – Google MapReduce and Hadoop MapReduce
  – Based on multiple physical nodes
• Virtualization
  – Xen, KVM, VMware, …
  – Support multiple VMs for one physical node
• Cloud computing
  – Amazon Web Services, Microsoft Azure and IBM Bluemix
• Bigdata on cloud (VMs)
  – Hadoop on Xen VMs
  – Hadoop on Xen VMs and Docker containers

MapReduce processing on top of Xen-based vs Docker-based virtual cluster

• TeraGen
  – Docker is 2.71 times faster than Xen
• TeraSort
  – Docker is 2.92 times faster than Xen
• Main reason
  – Docker shares resources by virtualizing at the host operating system level and allocates only the minimum resources each application needs, which maximizes utilization of host resources

TeraGen and TeraSort processing time [H.Chung & Y.Nah, DASFAA 2017]

• Block size 128MB
  – Xen-based virtual cluster is always better than Docker
  – Docker delivers 0.79 times the performance of Xen
• Block size 64MB
  – In some cases Xen is better than Docker
• Main reason
  – For Xen, a larger block size seems to reduce the number of blocks, and with it the amount of metadata and the name node overhead.
  – It therefore seems necessary to choose the block size carefully according to the purpose of the application.
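The link between block size and block count can be checked with simple arithmetic. The sketch below assumes a hypothetical 10 GB file (the slides do not state the data size used in the experiment) and counts how many blocks the name node would have to track at each block size.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to store a file (ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


# Hypothetical file size, chosen only for illustration.
file_size = 10 * 1024 * MB

blocks_at_64mb = block_count(file_size, 64 * MB)
blocks_at_128mb = block_count(file_size, 128 * MB)

print(f"64MB blocks:  {blocks_at_64mb}")
print(f"128MB blocks: {blocks_at_128mb}")
```

Doubling the block size halves the number of block entries, which is consistent with the reduced metadata and name node overhead suggested above.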


Throughput of write operations [H.Chung & Y.Nah, DASFAA 2017]

• Data analysis
  – Machine learning: Spark MLlib
  – Data mining: Apache Mahout
  – Statistical analysis: Regression analysis, R
  – Deep learning: Google TensorFlow, DL4J

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data mining example: Decision tree based classifier (Hunt’s Algorithm)
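One step of decision tree induction on the ten records in the table can be sketched as follows. The slide names Hunt's algorithm but not a split criterion, so the weighted Gini index used here is an assumption. Only the two categorical attributes are compared, and taxable income is left out for brevity.

```python
from collections import Counter

# (Refund, Marital Status, class label) for Tid 1..10, copied from the table.
RECORDS = [
    ("Yes", "Single", "No"),
    ("No", "Married", "No"),
    ("No", "Single", "No"),
    ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"),
    ("No", "Married", "No"),
    ("Yes", "Divorced", "No"),
    ("No", "Single", "Yes"),
    ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]
ATTRIBUTES = {"Refund": 0, "Marital Status": 1}


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(records, attr_index):
    """Impurity after partitioning the records on one attribute."""
    partitions = {}
    for record in records:
        partitions.setdefault(record[attr_index], []).append(record[-1])
    total = len(records)
    return sum(len(p) / total * gini(p) for p in partitions.values())


scores = {name: weighted_gini(RECORDS, i) for name, i in ATTRIBUTES.items()}
best_attribute = min(scores, key=scores.get)
print(scores, best_attribute)
```

A full tree builder would apply the same selection recursively to each partition until every partition contains a single class.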

• Normal distribution:
  – One for each (Ai, cj) pair
• For (Income, Class=No):
  – If Class=No
    • sample mean = 110
    • sample variance = 2975

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

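$$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^{2}}}\, e^{-\frac{(A_i - \mu_{ij})^{2}}{2\sigma_{ij}^{2}}}$$

$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)}\, e^{-\frac{(120 - 110)^{2}}{2(2975)}} = 0.0072$$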

Data mining example: Naïve Bayes Classifier
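The class-conditional likelihood for a continuous attribute can be reproduced numerically. This is a minimal sketch that takes the taxable incomes of the seven Class=No records from the table, estimates the sample mean and sample variance, and evaluates the normal density at Income = 120. The function name `gaussian_likelihood` is illustrative.

```python
import math
import statistics

# Taxable income (in thousands) of the records labeled "No" (Tid 1, 2, 3, 4, 6, 7, 9).
incomes_no = [125, 100, 70, 120, 60, 220, 75]

mean = statistics.mean(incomes_no)
variance = statistics.variance(incomes_no)  # sample variance, divides by n - 1


def gaussian_likelihood(x, mean, variance):
    """Normal density with the given mean and variance, evaluated at x."""
    exponent = -((x - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)


p_income_120_given_no = gaussian_likelihood(120, mean, variance)
print(mean, variance, round(p_income_120_given_no, 4))
```

The naive Bayes classifier would multiply this value by the likelihoods of the other attributes and by the class prior, then compare the result across classes.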

• Hadoop data mining utility Mahout
  – Collaborative Filtering
  – User and Item based recommenders
  – K-Means, Fuzzy K-Means clustering
  – Mean Shift clustering
  – Dirichlet process clustering
  – Latent Dirichlet Allocation
  – Singular value decomposition
  – Parallel Frequent Pattern mining
  – Complementary Naive Bayes classifier
  – Random forest decision tree based classifier
  – High performance Java collections (previously Colt collections)

BIGDATA ISSUES
• Major issues
  – Data ownership
  – Privacy and data release
    • Anonymized for social good -> released data is de-anonymized -> loss of privacy of individuals [Divesh Srivastava, DASFAA 2017]
  – Data sovereignty
    • Our data is being uploaded to foreign clouds
  – Storage, computing power, data center
    • Google and Facebook are building many more data centers

Can find trips starting at “sensitive” locations -> examining one of the clusters ... only one of the five likely drop-off addresses; a search revealed customer’s name ...

Thank You !
