
The 4th Industrial Revolution and BigData Technology

April 07, 2017

Prof. Yunmook Nah, Ph.D. (나연묵)

ymnah@dankook.ac.kr

Professor, Department of Applied Computer Engineering
Chairman, Department of Data Science, Graduate School
Senior Director, Research Institute of Information and Culture Technology
Dankook University

OUTLINE
• Overview
• Data
• Bigdata
• Bigdata technology
• Bigdata issues

OVERVIEW
• 4th industrial revolution
  – Core technology
    • Mobile => Smart phone, wearable devices, vehicles
    • Processing power, storage capacity => Computing equipment, data center, cloud
    • Knowledge => Bigdata
  – Emerging technology
    • AI, robotics, IoT, autonomous vehicles, 3-D printing, nanotech, biotech, materials science, energy storage
    • CPS, AR/VR, etc.

The possibilities of billions of people connected by mobile devices, with unprecedented processing power, storage capacity, and access to knowledge, are unlimited. And these possibilities will be multiplied by emerging technology breakthroughs in fields such as artificial intelligence, robotics, the Internet of Things, autonomous vehicles, 3-D printing, nanotechnology, biotechnology, materials science, energy storage, and quantum computing. [Klaus Schwab]

DATA
• Data: fact
• Information = Processing (Data)
• Knowledge
  – Knowledge discovery = data mining
  – Ontology, semantic network
  – Deep learning, artificial intelligence
• Collection of data = databases (DB)
• Current data + historical data = data warehouse
  – OLAP, BI (Business Intelligence), data mining
• BigData: very large volume of data
• Data about data: metadata

BIGDATA
• Features
  – Data sizes of terabytes, petabytes, exabytes, or more
  – Examples
    • Server log, Web log, database log, …
    • Search engine data: WebMap
    • Social network data: Text messages
    • Unstructured (non-formatted) data: Text, Video
    • User behavior pattern
    • Traffic data
    • IoT-enabled data
  – Usually handled by batch processing, but real-time processing is the goal

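• Data ownership: who has big data?
  – Internet portal: Google, Yahoo, Naver, Baidu
  – Social networking: Facebook, Twitter, YouTube, Instagram, Kakao
  – E-commerce site: eBay, Amazon, Interpark, Rakuten, Alibaba, etc.
  – Government: Ministry of Security and Public Administration; Ministry of Education; Ministry of Employment and Labor; Ministry of Health and Welfare; Ministry of Trade, Industry and Energy; Ministry of Science, ICT and Future Planning; Seoul; Gyeonggi Province; etc. => administrative information sharing, open data release
  – Public agency: Korea Energy Management Corporation
  – Telco: KT, SK Telecom
  – Financial sector: Financial Supervisory Service, the stock exchange, Shinhan Card, Shinhan Bank, etc.
  – Healthcare: Health Insurance Review and Assessment Service, Seoul National University Hospital, Samsung Medical Center, Asan Medical Center, etc.
  – Education: MOOC sites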

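• Data related to the 4th industrial revolution
  – Bio: genome data
  – Healthcare: patient record, sensor data, image, video data
  – Robotics: robot-captured data (sensor, video, …)
  – Manufacturing: MES data, IoT-device enabled data
  – Transportation: traffic data (trajectory data)
  – Security: CCTV, 112 (police emergency call) voice data
  – Finance: credit/debit data, stock trading
  – Energy: power generation volume, smart sensor data
  – Distribution: logistics data by RFID
  – Public administration: government-owned data
  – Welfare: National Pension data
  – Defense and public safety: Ministry of National Defense, National Police Agency, National Emergency Management Agency data
  – Agriculture: IoT sensor captured data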

NYC taxi 2013 trip data: Start point, end point, timestamps, taxi id, fare, tip amount => 173 million trips anonymized

• BigData source
  – Sensor, CCTV
  – IoT
  – Wearable device: CGM (Continuous Glucose Monitoring)
  – Monitoring tool: EMS, BEMS, …
  – Location information: GPS
  – Korea Expressway Corporation: VDS (Vehicle Detection System), AVI (Automatic Vehicle Identification) system, TCS (Toll Collection System), Hi-Pass system

[Figure: detected screen and sample data]

• Wearable Device Applications

Application             Product Categories
Fitness and Wellness    Sports and Activity Monitors; Fitness and Heart Rate Monitor; Smart Sports Glasses; Smart Clothing; Sleep Sensors; Emotional Measurement
Healthcare and Medical  Continuous Glucose Monitoring; ECG Monitoring; Pulse Oximetry; Blood Pressure Monitors; Drug Delivery; Hearing Aids; Wearable Patches; Defibrillators
Industrial & Military   Hand-worn Terminals; Augmented Reality Headsets; Smart Clothing
Infotainment            Smart Watches; Augmented Reality Headsets; Smart Glasses; Wearable Imaging Devices

[Source: IMS Research, 2012.8]

Adidas heart-rate-monitoring bra

Nike smart running shoes that record exercise intensity

BIGDATA TECHNOLOGY

From Data to Knowledge

[Figure: stages from data to knowledge. Labels: Crawling, Extraction, Cleansing, Visualization, Classification, Clustering, Regression]

• Data collection
  – Log data collection
  – Using relational databases
  – Web crawling
  – Using open API (social data collection)
• Data storing
  – Distributed file system: Hadoop, HDFS
  – Distributed databases: NoSQL, Apache HBase, MongoDB
  – In-memory data management: Redis

• Traffic data collection and storing

[Figure: Historical Traffic Data Management System. Labels: FTMS Database, ARTIS Database, Basic Information; Raw Historical Data; Raw Historical Data Refinement; Refined Data 1 to n; Refined Historical Data Evaluation; Historical Traffic Data Analysis; Aggregate Data 1 to n; Traffic Data Warehouses; Dimension Information]

Hadoop 2.0 Ecosystem

• Data processing
  – Distributed parallel processing: MapReduce, YARN
  – SQL on Hadoop: Hive, Tajo, Shark
  – Stream data processing: Storm, Spark
  – MapReduce processing on top of virtual cluster (Bigdata on Cloud)
    • On top of Xen-based virtual cluster
    • On top of Docker-based virtual cluster
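
The MapReduce model listed under distributed parallel processing can be illustrated with a minimal single-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample log lines are hypothetical, chosen only to show the map, shuffle, and reduce steps that a real cluster would distribute across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical web-log fragments, for illustration only.
sample_log = ["GET /index GET /about", "POST /login GET /index"]
counts = word_count(sample_log)
print(counts)
```

In a real deployment the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; the sketch keeps only the data flow.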

Hadoop cluster vs Spark cluster [S.Han & Y.Nah, ICNGC 2017]

Data processing time (4GB memory per node)

Bigdata on Cloud
• Bigdata platform
  – Google MapReduce and Hadoop MapReduce
  – Based on multiple physical nodes
• Virtualization
  – Xen, KVM, VMware, …
  – Support multiple VMs for one physical node
• Cloud computing
  – Amazon Web Services, MS Azure and IBM Bluemix
• Bigdata on cloud (VMs)
  – Hadoop on Xen VMs
  – Hadoop on Xen VMs and Docker containers

MapReduce processing on top of Xen-based vs Docker-based virtual cluster

• TeraGen
  – Docker is 2.71 times faster than Xen
• TeraSort
  – Docker is 2.92 times faster than Xen
• Main reason
  – Docker enables resource sharing by virtualizing the host operating system, and it allocates only the minimum resources to each application, thus maximizing utilization of host resources

TeraGen and TeraSort processing time [H.Chung & Y.Nah, DASFAA 2017]

• Block size 128MB
  – Xen-based virtual cluster is always better than Docker
  – Docker reaches only 0.79 times the performance of Xen
• Block size 64MB
  – There are some cases where Xen is better than Docker
• Main reason
  – For Xen, increasing the block size seems to reduce the number of blocks, the amount of metadata, and the name node overhead.
  – Therefore, the block size should be selected carefully according to the purpose of the application.
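
The relationship between block size and block count mentioned above can be checked with simple arithmetic. The sketch below assumes a hypothetical 10 GB file (the slides do not state the file sizes used in the experiment) and counts how many blocks, and therefore how many block entries on the name node, result from 64MB and 128MB block sizes.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to store a file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


# Hypothetical file size, for illustration only.
file_size = 10 * GB

blocks_64 = block_count(file_size, 64 * MB)
blocks_128 = block_count(file_size, 128 * MB)

print(f"64MB blocks:  {blocks_64}")
print(f"128MB blocks: {blocks_128}")
```

Doubling the block size halves the number of blocks the name node must track for the same file, which is consistent with the explanation offered for the Xen results.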

Throughput of write operations [H.Chung & Y.Nah, DASFAA 2017]

• Data analysis
  – Machine learning: Spark MLlib
  – Data mining: Hadoop Mahout
  – Statistical analysis: Regression analysis, R
  – Deep learning: Google TensorFlow, DL4J

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data mining example: Decision tree based classifier (Hunt’s Algorithm)
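
A Hunt-style tree induction on the table above can be sketched as follows. Several choices here are assumptions, not taken from the slides: the split criterion is weighted Gini impurity, splits are multiway on categorical attributes, Taxable Income is left out to keep the sketch short, and impure leaves fall back to the majority class. The names `gini`, `weighted_gini`, and `hunt` are illustrative.

```python
from collections import Counter

# The ten training records (Refund, Marital Status, Cheat); income omitted.
RECORDS = [
    {"refund": "Yes", "marital": "Single", "cheat": "No"},
    {"refund": "No", "marital": "Married", "cheat": "No"},
    {"refund": "No", "marital": "Single", "cheat": "No"},
    {"refund": "Yes", "marital": "Married", "cheat": "No"},
    {"refund": "No", "marital": "Divorced", "cheat": "Yes"},
    {"refund": "No", "marital": "Married", "cheat": "No"},
    {"refund": "Yes", "marital": "Divorced", "cheat": "No"},
    {"refund": "No", "marital": "Single", "cheat": "Yes"},
    {"refund": "No", "marital": "Married", "cheat": "No"},
    {"refund": "No", "marital": "Single", "cheat": "Yes"},
]


def gini(records):
    """Gini impurity of the class labels in a set of records."""
    counts = Counter(r["cheat"] for r in records)
    total = len(records)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def weighted_gini(records, attr):
    """Impurity after a multiway split on attr, weighted by partition size."""
    total = len(records)
    score = 0.0
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        score += len(subset) / total * gini(subset)
    return score


def hunt(records, attrs):
    """Recursively partition records until each node is pure or attrs run out."""
    labels = [r["cheat"] for r in records]
    if len(set(labels)) == 1:
        return labels[0]  # pure node becomes a leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    best = min(attrs, key=lambda a: weighted_gini(records, a))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({r[best] for r in records}):
        subset = [r for r in records if r[best] == value]
        branches[value] = hunt(subset, remaining)
    return {best: branches}


tree = hunt(RECORDS, ["refund", "marital"])
print(tree)
```

Under these assumptions the root split is Marital Status, because its weighted Gini (0.30) is lower than that of Refund (about 0.34). A tree that also used Taxable Income could separate the remaining mixed records further.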

• Normal distribution:
  – One for each (A_i, c_j) pair
• For (Income, Class=No):
  – If Class=No
    • sample mean = 110
    • sample variance = 2975

[Table: the same ten training records as in the decision tree example, with the class attribute labeled Evade instead of Cheat]

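$$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \, e^{-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}}$$

$$P(\text{Income}=120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \, e^{-\frac{(120-110)^2}{2(2975)}} = 0.0072$$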

Data mining example: Naïve Bayes Classifier
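
The class-conditional density in the example above can be reproduced numerically. The sketch below takes the Taxable Income values of the seven Class=No records from the training table, estimates the sample mean and sample variance, and evaluates the normal density at Income = 120. The function name `gaussian_density` is illustrative.

```python
import math
import statistics

# Taxable Income (in K) of the seven records with Class=No (Tids 1, 2, 3, 4, 6, 7, 9).
incomes_no = [125.0, 100.0, 70.0, 120.0, 60.0, 220.0, 75.0]

mean = statistics.mean(incomes_no)
variance = statistics.variance(incomes_no)  # sample variance (n - 1 denominator)


def gaussian_density(x, mean, variance):
    """Normal probability density with the given mean and variance, evaluated at x."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)


p_income_given_no = gaussian_density(120.0, mean, variance)
print(f"mean={mean}, variance={variance}, P(Income=120 | No)={p_income_given_no:.4f}")
```

The computed mean (110) and sample variance (2975) match the values on the slide, and the density rounds to 0.0072.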

• Hadoop data mining utility Mahout
  – Collaborative Filtering
  – User and Item based recommenders
  – K-Means, Fuzzy K-Means clustering
  – Mean Shift clustering
  – Dirichlet process clustering
  – Latent Dirichlet Allocation
  – Singular value decomposition
  – Parallel Frequent Pattern mining
  – Complementary Naive Bayes classifier
  – Random forest decision tree based classifier
  – High performance Java collections (previously Colt collections)

BIGDATA ISSUES
• Major issues
  – Data ownership
  – Privacy and data release
    • Anonymized for social good -> released data is de-anonymized -> loss of privacy of individuals [Divesh Srivastava, DASFAA 2017]
  – Data sovereignty
    • Our data is being uploaded to foreign clouds
  – Storage, computing power, data center
    • Google and Facebook are building many more data centers

Can find trips starting at “sensitive” locations -> examining one of the clusters ... only one of the five likely drop-off addresses; a search revealed customer’s name ...

Facebook's cold storage facility in Prineville, Oregon. Because the cold storage facility is an archive rather than hot storage, Facebook programmed the Open Vault storage servers to be dormant most of the time.

A tray of hard drives in a cold storage rack (Open Vault).

Thank You !
