POWER Up Your Insights - IBM System Summit Mumbai, August 2016

IBM Systems

Anand Haridass
Chief Engineer, POWER Integrated Solutions (BD&A)
Senior Technical Staff Member
India Systems Development [email protected]

POWER Up Your Insights


Acknowledgement

Sources of these slides are numerous IBM presentations/tutorials/studies

– Thank you


Agenda

• The Big Picture about Big Data
• Hadoop
• Spark
• IBM POWER Systems – Big Data

Volume, Variety, Velocity

Information is THE resource of the 21st Century …

• 2.5 quintillion bytes of data per day
• 90% of data created in the last 2 years
• 35 zettabytes in 2020

Data sources: rich media, weather, consumer, geospatial, Internet of Things, social media, webpages

An unprecedented increase in the use of digital devices is causing a huge amount of data to be generated and captured by businesses. This digital data, also known as Big Data, has the potential to transform businesses and create value.

There will be over 200 billion connected devices

There will be over 12 billion machine-to-machine devices

Machine generated data will be 42% of all data


Merging the Traditional & Big Data approaches

Traditional Approach: Structured & Repeatable
• Business Users: Determine what question to ask
• IT: Structures the data to answer that question
• Examples: Monthly sales reports, Profitability analysis, Customer surveys

Big Data Approach: Iterative & Exploratory
• IT: Delivers a platform to enable creative discovery
• Business: Explores what questions could be asked
• Examples: Brand sentiment, Product strategy, Maximum asset utilization


Big Data : Value from Insights

Value increases from descriptive to cognitive insight:
• Descriptive: What is happening
• Diagnostic: Why did it happen
• Predictive: What could happen
• Prescriptive: What should I do
• Cognitive: What did I learn

Cognitive computing defines systems that learn at scale, reason with purpose and interact with humans naturally. Cognitive systems are probabilistic; this is a core point of difference, as it means they are not programmed but trained. Cognitive systems can generate not just answers to questions but hypotheses, reassured responses and recommendations about more complex and meaningful data.


What is Hadoop?

• Open source project to enable processing of large data sets

• Batch oriented

• Structured, unstructured, semi-structured data

• Written in Java

• Scalable to thousands of machines

• Fault tolerant

• Core components: HDFS, MapReduce, Hadoop Common

Time to read 1 TB of data at a disk read rate of 200 MB/s:
• 1 server, 1 disk: 5000 sec
• 1 server, 10 disks: 500 sec
• 100 servers (x10 disks each): 5 sec
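The read-time figures on this slide follow from simple division. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1,000,000 MB) and perfectly parallel disks with no network or coordination overhead; the function name `scan_seconds` is made up for illustration:

```python
def scan_seconds(data_mb: float, disk_mb_per_s: float,
                 servers: int, disks_per_server: int) -> float:
    """Ideal time to read data_mb when every disk streams its share in parallel."""
    total_disks = servers * disks_per_server
    return data_mb / (disk_mb_per_s * total_disks)


DATA_MB = 1_000_000   # 1 TB in decimal units (assumption)
DISK_MB_PER_S = 200   # per-disk read rate from the slide

for servers, disks in [(1, 1), (1, 10), (100, 10)]:
    seconds = scan_seconds(DATA_MB, DISK_MB_PER_S, servers, disks)
    print(f"{servers} server(s) x {disks} disk(s): {seconds:.0f} sec")
```

Real clusters will not reach this ideal, but the sketch shows why spreading data over many disks is the core idea behind Hadoop's scale-out design.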


Hadoop Basic Flow

• Input Data: logs, social data, devices, DBs
• Extract: data is loaded into HDFS (3 copies of data)
• Map: create key/value pairs
• Shuffle: sort key/value pairs
• Reduce: process data, write output
• Read Results
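The map, shuffle and reduce stages of this flow can be illustrated with a word count. The following is a single-process sketch in plain Python, not Hadoop code; the function names and the sample log lines are made up for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit one (word, 1) key/value pair per word."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group values by key and sort the keys."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records: Iterable[str]) -> Dict[str, int]:
    return reduce_phase(shuffle_phase(map_phase(records)))


logs = ["error disk full", "warn disk slow", "error network down"]  # sample input
counts = run_job(logs)
print(counts)
```

In a real cluster the map and reduce functions run on many nodes against HDFS blocks, and the shuffle moves data across the network; the sketch only shows the data flow between the stages.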


What is Spark?

• Open Source, Apache 2.0, version 1.x

• Written in Scala

• In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)

• Rapid in-memory processing of resilient distributed datasets (RDDs)

• Multiple Workflows

• Multiple Libraries

• Multiple APIs

Fast, flexible engine for big data processing: up to 10x (on disk) to 100x (in memory) faster than MapReduce

• Spark SQL: Spark module for structured data processing using either SQL or a DataFrame API. Provides a common way to access a wide range of data sources.
• Spark Streaming: Micro-batch processing engine that enables applications to process real-time streams of data with latency as low as 0.5 seconds.
• GraphX: API and processing engine for graphs and graph computation.
• MLlib: A collection of machine learning libraries that can run on a distributed cluster.
• SparkR: Enables R programmers to use existing tools (RStudio) while Spark does the actual processing behind the scenes.
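Micro-batching, as used by Spark Streaming, means cutting a continuous stream into small fixed-width time slices and processing each slice as a tiny batch job. A minimal sketch of the idea in plain Python, assuming events arrive as (timestamp in seconds, payload) tuples; the function `micro_batches` and the sample events are hypothetical and this is not the Spark Streaming API:

```python
from typing import Dict, Iterable, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]],
                  interval_s: float = 0.5) -> Dict[int, List[str]]:
    """Group (timestamp, payload) events into fixed-width time batches."""
    batches: Dict[int, List[str]] = {}
    for timestamp, payload in events:
        index = int(timestamp // interval_s)  # which time slice the event falls in
        batches.setdefault(index, []).append(payload)
    return batches


events = [(0.10, "a"), (0.40, "b"), (0.60, "c"), (1.20, "d")]  # sample stream
batches = micro_batches(events)
for index in sorted(batches):
    print(f"batch {index}: {len(batches[index])} event(s)")
```

The batch interval sets a floor on latency, which is why the slide quotes a figure in the sub-second range rather than true per-event processing.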


Apache Spark - Resilient Distributed Dataset (RDD)
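An RDD is a partitioned, read-only collection whose transformations are recorded lazily, so a lost partition can be recomputed from its lineage instead of being restored from a replica. The following toy class sketches that idea in plain Python; `ToyRDD` and its methods are illustrative assumptions and not the Spark API:

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: partitioned, immutable, lazily evaluated."""

    def __init__(self, source: Optional[List[List[int]]] = None,
                 parent: Optional["ToyRDD"] = None,
                 transform: Optional[Callable[[List[int]], List[int]]] = None):
        self._source = source        # partitions of the base dataset
        self._parent = parent        # lineage: the RDD this one derives from
        self._transform = transform  # function applied to one partition

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        return ToyRDD(parent=self, transform=lambda part: [fn(x) for x in part])

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, transform=lambda part: [x for x in part if pred(x)])

    def num_partitions(self) -> int:
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def compute_partition(self, i: int) -> List[int]:
        """Rebuild one partition from lineage, as would happen after a node failure."""
        if self._parent is None:
            return list(self._source[i])
        return self._transform(self._parent.compute_partition(i))

    def collect(self) -> List[int]:
        """Action: only now are the recorded transformations executed."""
        result: List[int] = []
        for i in range(self.num_partitions()):
            result.extend(self.compute_partition(i))
        return result


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens_squared.collect())
print(evens_squared.compute_partition(1))
```

Nothing is computed when `map` and `filter` are called; work happens only in `collect`, and any single partition can be rebuilt on its own.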

• Spark as a Service
• Spark Standalone
• Spark on Hadoop
• Spark with Mesos


Open Data Platform Initiative


• ODPi has an open governance model. Developers form a Technical Steering Committee

• All members have an equal vote on ODPi Core decisions.

• ODPi has a Board of Directors responsible for the financial, legal and promotional aspects of ODPi.

• Non-profit organization accelerating the delivery of Big Data solutions by powering a platform called ODPi Core.
• The ODPi Core focuses on a small but critical set of projects.
• Goal: enable a rapid start and an industry-driven definition.

ODPi Members include: Ampool, Altiscale, ArenaData, AsiaInfo, Capgemini, DataTorrent, EMC, GE, Hortonworks, IBM, Infosys, NEC, Pivotal, PLDT, SAS, Squid Solutions, SyncSort, Telstra, Toshiba, UNIFi, VMware, WANdisco, Xiilab, zData and Zettaset.

ODPi & Apache Software Foundation (ASF)
• ODPi supports the ASF mission.
• ASF provides governance around individual projects without looking at the ecosystem and collections of projects.
• ODPi provides a vendor-led, consistent packaging model and certification for Big Data components as an ecosystem: test once, run anywhere for big data applications.
  – Improves ecosystem interoperability
  – Unlocks customer choice
  – Eliminates wasteful guesswork


Hadoop and Spark Offer Significant Business Benefits

[Chart: value (low to high) against Big Data maturity (low to high), spanning Operations, Data Warehousing, Line of Business and Analytics, and New Business Imperatives]

Lower the Cost of Storage: Warehouse Modernization
• Data lake
• Data offload
• ETL offload
• Queryable archive and staging

Data-Informed Decision Making
• Full dataset analysis (no more sampling)
• Extract value from non-relational data
• 360° view of all enterprise data
• Exploratory analysis and discovery

Business Transformation
• Create new business models
• Risk-aware decision making
• Fight fraud and counter threats
• Optimize operations
• Attract, grow, retain customers


IBM POWER Systems


Driving Innovation Beyond The Chip


Microprocessors alone no longer drive sufficient Cost/Performance improvements

System stack innovations are required to drive Cost/Performance

Market Shifts
• Moore’s law no longer satisfies performance gain
• Numerous IT consumption models
• Mature open software ecosystem

New Open Innovation
• Open Development: open software, open hardware
• Collaboration of thought leaders: simultaneous innovation, multiple disciplines
• Performance of POWER architecture: amplified capability

• Rich software ecosystem
• Spectrum of Power servers
• Multiple hardware options
• Derivative POWER chips

The OpenPOWER Foundation

Technology FAB

I/O Networking Storage

FW Open Source SYS

ODM OEM

SW Linux ISV Open Source

Chip SoC Dev IP Dev


WEB 2.0 Data Center MSP Cloud

120+ members and growing.

The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise, investment, and server-class intellectual property to serve the evolving needs of customers.

Platinum

Members


POWER8 Processor

Bus Interfaces
• Integrated PCIe Gen3
• SMP Interconnect
• CAPI

Accelerators
• Crypto & memory expansion
• Transactional Memory
• Data Move / VM Mobility

[Die diagram: 12 cores, each with its own L2; L3 Cache & Chip Interconnect with 8M L3 regions; two memory controllers; SMP links, accelerators and PCIe at the chip edges]

Caches
• 64 KB data cache (L1)
• 512 KB SRAM L2 per core
• 96 MB eDRAM shared L3
• Up to 128 MB eDRAM L4 (off-chip)

Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 exec pipes
• 2X internal data flows/queues
• Enhanced prefetching

Memory
• Dual memory controllers
• 230 GB/sec sustained bandwidth

Technology
• 22nm SOI, eDRAM, 15 ML, 650 mm2

Energy Management
• On-chip Power Management Micro-controller
• Integrated per-core VRM
• Critical Path Monitors

Reference: IBM Journal of Research and Development, Issue 1, Jan.-Feb. 2015 (on IEEE Xplore)


POWER8 is designed & optimized for Big Data & Analytics

• Processors: flexible, fast execution of analytics algorithms. 4X threads per core vs. x86 (up to 1536 threads per system)
• Memory: large, fast workspace to maximize business insight. ~4X memory bandwidth vs. x86 (up to 16TB of memory)
• Cache: ensure continuous data load for fast responses. 4X more cache vs. x86 (up to 231MB cache per socket)

Optimized for a broad range of big data & analytics workloads, including Industry Solutions: 5X faster
• Supports growth of users, reports and complex queries
• Delivers fast analytics results for real-time decision-making
• Handles large volumes of data for better response times

Yateesh Vusirika – Open Databases SQL / NoSQL – What’s on offer ?

POWER Advantages for Spark

• Streaming and SQL benefit from high thread density and concurrency
  – Processing multiple packets of a stream and different stages of a message stream pipeline
  – Processing multiple rows from a query
• Machine Learning benefits from large caches and memory bandwidth
  – Iterative algorithms on the same data
  – Fewer core pipeline stalls and overall higher throughput
• Graph also benefits from large caches, memory bandwidth and higher thread strength
  – Flexibility to go from 8 SMT threads per core to 4 or 2
  – Manage balance between thread performance and throughput
• Headroom
  – Balanced resource utilization, more efficient scale-out
  – Multi-tenant deployments

Price Performance of Spark on POWER Cloud

Workloads: Machine Learning, SQL, Graph

1.5X price performance*
• Spend 33% less on infrastructure supporting the same amount of workload
• Spend the same on infrastructure but host 50% more workload

* Based on SoftLayer pricing; subject to change
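The two bullets are two readings of the same 1.5X figure. A short sketch of the arithmetic, assuming price performance means workload delivered per unit of infrastructure spend; the function names are made up for illustration:

```python
def cost_saving_for_same_workload(price_perf_ratio: float) -> float:
    """Fraction of spend saved when the workload is held constant."""
    return 1.0 - 1.0 / price_perf_ratio


def extra_workload_for_same_cost(price_perf_ratio: float) -> float:
    """Fraction of additional workload hosted when the spend is held constant."""
    return price_perf_ratio - 1.0


ratio = 1.5  # price-performance advantage quoted on the slide
print(f"{cost_saving_for_same_workload(ratio):.0%} less spend for the same workload")
print(f"{extra_workload_for_same_cost(ratio):.0%} more workload for the same spend")
```

The saving is 1 - 1/1.5, about 33%, while the extra capacity is 1.5 - 1 = 50%, which is why the two percentages differ.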


GPU Use Case Example: Adverse Drug Reaction Prediction built on Spark

Fast and general engine for large-scale data processing

• 25X speed-up for the Building Model stage (using Spark MLlib Logistic Regression)
• Transparent to the Spark application
• Game changer for Personalized Medicine


IBM Big Data on Power Offerings

Stage 1: Prove Value
Offering: Digital Start for Big Data on Power
• Ready access for Power customers
• On Premise or Cloud
• Organization: Line of Business (LOB) or Data Science team
• Limited data investment per project, often <10TB
• Single project, limited use cases

Stage 2: Scale for Multiple Projects
Offering: IBM Data Engine for Hadoop and Spark
• Simplify operations: easy to deploy & manage
• Advanced resource & storage management
• Better resilience for big data
• Spark: 2X better price perf vs x86
• Organization: LOB or Data Science team
• Moderate data investment per project, 50TB to PB
• Many independent use case projects across LOBs

Stage 3: Scale for Mixed Analytics
Offering: IBM Data Engine for Analytics
• Designed for consolidation and mixed analytics workloads: streams, at rest, text
• Lowest $/TB and less than half the storage infrastructure
• Leadership resilience for big data environment
• Adapt and scale to your changing analytics needs
• Organization: IT infrastructure team supporting LOBs and data team
• Significant data investment per project, 1/2 to multi PB
• Multiple use cases with diverse SLAs

POWER Hadoop and Spark Integrated Options

IBM Data Engine for Analytics (Integrated Solution)
• Compute-only servers with shared storage
• Single replica of data
• Newer write-oriented workloads
• Sophisticated scheduler
• POSIX compliant file system
• Ideal for larger deployments

IBM Data Engine for Hadoop and Spark (Integrated Solution)
• Scale-out storage-rich servers
• Three replicas of data
• Traditional read-dominated workloads
• Ideal for simpler workload patterns
• Ideal for smaller deployments


IBM Data Engine for Hadoop and Spark: IDE-HS

OpenPOWER innovation with IBM Open Platform with Apache Hadoop for a high performance, storage dense and fully integrated cluster offering.

[Diagram: three OpenPOWER nodes, each running IOP +]

Spectrum Scale FPO Option
• Internal replicated disk
• POSIX compliant
• Encryption/replication

Opt. Platform Symphony
• Higher utilization
• Shared cluster
• Better throughput

OpenPOWER (POWER8) S812LC
• 2x x86 core performance
• Lowest cost Power HW

Solution
• Pre-assembled/tested cluster
• On-site services
• Lower risk & faster time to value

IBM Open Platform
• Open Hadoop
• Value Add Options

Platform Cluster Mgr.
• Simplified physical cluster management


IBM Data Engine for Analytics: IDEA

A fully integrated solution with software and infrastructure optimized for Big Data & Analytics. Appliance-like but much more versatile!

[Diagram: three POWER8 nodes, each running BigInsights and Spectrum Scale, managed by Platform Cluster Mgr. and Platform Symphony]

Spectrum Scale ESS
• One copy of data
• POSIX compliant
• Erasure coding
• Encryption/replication

POWER8 - S822L
• 2X x86 core performance
• Fewer nodes

IBM Open Platform
• Industry standard Hadoop

Solution
• Grow disk/CPU separately
• Pre-assembled/tested cluster
• On-site services
• Lower risk & faster time to value

Platform Cluster Mgr.
• Simplified physical cluster management

Platform Symphony
• Higher utilization
• Shared cluster
• Better throughput

Data Engine for Analytics offers Independent Scaling of Servers & Storage

[Chart: compute-intensive axis (add more servers) against storage-intensive axis (add more storage)]

• Add servers or storage or both as needed
• Adjust compute to storage ratio as workload needs change
• Standard Hadoop configurations with local storage and triple replica can result in overprovisioned compute to meet the storage demands
• Data Engine for Analytics allows right sizing of compute and storage independently to create an optimized configuration
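The overprovisioning point can be made concrete with a rough sizing calculation. This is a sketch under stated assumptions: the data volume, per-node capacity and compute requirement below are hypothetical inputs chosen for illustration, not figures from the deck, and the function name is made up:

```python
import math


def coupled_nodes(data_tb: float, replicas: int, tb_per_node: float,
                  compute_nodes_needed: int) -> int:
    """Servers required when storage lives inside the compute nodes.

    The cluster must satisfy both the raw capacity (data x replicas)
    and the compute requirement, so the larger of the two wins.
    """
    storage_nodes = math.ceil(data_tb * replicas / tb_per_node)
    return max(storage_nodes, compute_nodes_needed)


# Hypothetical workload: 2000 TB of data, 48 TB of disk per server,
# and a job mix that only needs 20 servers' worth of compute.
data_tb, tb_per_node, compute_needed = 2000, 48, 20

triple_replica = coupled_nodes(data_tb, 3, tb_per_node, compute_needed)
print(f"Coupled, 3 replicas: {triple_replica} servers "
      f"({triple_replica - compute_needed} more than compute requires)")
print(f"Independent scaling: {compute_needed} compute servers, "
      f"storage sized separately for {data_tb} TB")
```

With these inputs, storage demand alone dictates the server count in the coupled model. The sketch ignores erasure-coding overhead and spare capacity on the shared storage side.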

Client Example – IDEA Architecture

Client: Multinational Telecommunications Company
A multinational telecommunication company with over 6M subscribers. Strategic value as they influence the IT decisions in other countries.

Challenges
• Expectations of a Real Time Marketing (RTM) based solution to run event-based campaigns
• Enable event-based marketing, analysing various sources of input data containing information regarding subscribers’ actions
• Dispatch the triggered events to downstream applications such as campaign management, for associated campaign execution

Architecture
• IBM Data Engine for Analytics: 20 x Power S822L, 2 x ESS GL4, Spectrum Scale, PCM
• BigInsights, Streams, SPSS Modeler, SPSS Analytics Server

Solution and Approach
• Solution was to provide a Hadoop-based Big Data platform, integrated with the RTM decision engine, that will enable data monetisation opportunities, including location-based analytics
• Customer was not comfortable with the large number of x86 Data Nodes in a typical Hadoop architecture
• The IBM team designed the Power solution and conducted a technical workshop with the client on the newly redefined Hadoop architecture based on IDEA
• Demonstrated the one-IBM-team value as an integrated approach to the client

Key Client Benefits
• Optimized Big Data deployment architecture with IDEA, using Linux on Power, Elastic Storage Server and Spectrum Scale
• Lower TCO with 4 racks on Power against 12 racks on x86
• More I/O bandwidth with a 40GbE Power network against 10GbE on the x86-based solution

3x fewer racks for a 2 PB Big Data solution: 4 vs. 12


POWER Processor Roadmap

POWER6 Architecture
• 2007: POWER6, 2 cores, 65nm
• 2008: POWER6+, 2 cores, 65nm+; enhanced micro-architecture, enhanced process technology

POWER7 Architecture
• 2010: POWER7, 8 cores, 45nm
• 2012: POWER7+, 8 cores, 32nm; enhanced micro-architecture

POWER6 and POWER7: Focus on Enterprise; Technology and Performance Driven
• High Frequency
• Enhanced RAS
• Dynamic Energy Management
• Large eDRAM L3 Cache
• Optimized VSX
• Enhanced Memory Subsystem

POWER8 Architecture
• 2014: POWER8, 12 cores, 22nm; new micro-architecture, new process technology
• 2016: POWER8 w/ NVLink, 12 cores, 22nm; enhanced micro-architecture with NVLink

POWER9 Architecture
• 2017: P9 SO, 24 cores, 14nm; direct attach memory
• TBD: P9 SU, TBD cores, 14nm; enhanced micro-architecture, buffered memory

POWER8 and POWER9: Focus on Scale-Out and Enterprise; Cost and Acceleration Driven
• Optimized for Data-Centric Workloads
• Integrated PCIe
• CAPI Acceleration / I/O
• Scale-Out Datacenter TCO Optimization
• Scale-up performance Optimization
• Acceleration Enhancements to CAPI and NVLINK
• Modularity for OpenPOWER

Price, performance, feature and ecosystem innovation
• 2018 - 20: P8/9 SO, 10nm - 7nm; existing micro-architecture, foundry technology
• Partner Chip POWER8/9: OpenPOWER ecosystem design targeting partner markets & systems, leveraging modularity

Future
• 2020+: POWER10; new technology, new features and functions
• TBD


What’s in the works ?


GPUs and FPGAs for Compute offload, Machine Learning, Graph and other specialized acceleration

CAPI Flash for Memory consolidation/expansion, and Storage acceleration

RDMA for better latency, better network utilization, lower CPU utilization, lower Memory utilization

OpenPOWER extends the ability to innovate around Spark into the hardware and accelerators.


Backup

Power Systems and NVIDIA GPU Roadmap

• Current GPU attach: Power chip to NVIDIA GPU over PCIe x16, 16+16 GB/s
• Future NVLink GPU attachment: Power chip with NVLink to NVIDIA GPU with NVLink, 40+40 GB/s (80 GB/s Peak*)

[Diagram: each GPU with its graphics memory, each Power chip with its system memory]

• CPU to GPU NVLink enables
  – Easier programming of GPU accelerators
  – Better application throughput
  – Expanded set of accelerated applications
• New Server: Early Shipments in 4Q ‘16
• IBM-NVIDIA NVLink Acceleration Lab
  – Seeking clients now
  – Apply at [email protected]
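The bandwidth figures on this slide translate directly into host-to-GPU copy times. A rough sketch of that arithmetic, assuming the quoted rates are peak per-direction bandwidths and ignoring latency and protocol overhead; the 8 GB payload is a hypothetical value chosen for illustration:

```python
def transfer_seconds(size_gb: float, one_way_gb_per_s: float) -> float:
    """Ideal one-way transfer time at peak link bandwidth."""
    return size_gb / one_way_gb_per_s


PCIE_X16_GB_PER_S = 16.0   # per direction, from the slide (16+16 GB/s)
NVLINK_GB_PER_S = 40.0     # per direction, from the slide (40+40 GB/s)

size_gb = 8.0              # hypothetical working set copied to the GPU
pcie_t = transfer_seconds(size_gb, PCIE_X16_GB_PER_S)
nvlink_t = transfer_seconds(size_gb, NVLINK_GB_PER_S)
speedup = NVLINK_GB_PER_S / PCIE_X16_GB_PER_S

print(f"PCIe x16: {pcie_t:.2f} s, NVLink: {nvlink_t:.2f} s, ratio {speedup:.1f}x")
print(f"NVLink bidirectional peak: {2 * NVLINK_GB_PER_S:.0f} GB/s")
```

Measured throughput will be lower than these peaks, but the ratio indicates why data-heavy GPU workloads stand to gain from the faster link.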


IBM Data Engine for Analytics (IDEA) overview (1 of 2)

Challenges with traditional Hadoop model
• Typical Hadoop solutions use a storage-rich, server-based scale-out design
• Lose control of storage since it’s lumped with compute
• No separate storage capital planning
• Usually no backup, archiving, or disaster recovery facilities
• Usually fewer storage security controls and less auditing
• Disk failures incur a heavy rebuild penalty, consume network resources, and reduce application performance
• Multiple replicas are expensive and superfluous if the workload doesn’t need lots of tasks accessing the same data over and over (read mostly)
• Typically cannot share resources with non-Hadoop workloads
• Cannot reuse existing infrastructure; requires different infrastructure


IBM Data Engine for Analytics (IDEA) overview (2 of 2)

Value Proposition
• Compute-only servers with shared storage
• Single replica of data
• Better suited for workloads that tend to have a significant write component
• Ideal for complex analytic solutions where additional components may need to interact with BigInsights

Composed of several integrated components
• Compute: POWER8 compute nodes, 2 or more
• Storage: IBM Elastic Storage System with advanced distributed and parallel filesystem (Spectrum Scale based)
• FileSystem accessed over network, supported by Spectrum Scale (GPFS) protocol
• Networking: Ethernet and optionally InfiniBand (no Fibre Channel)
• System Software: Linux, Cluster Provisioning and Management using PCM and xCAT
• Middleware: Platform Symphony and BigInsights


Solution Architecture Component Model


Power Hadoop/Spark Solutions

Roll Your Own

IBM Integrated Offerings

IBM Data Engine for Hadoop and Spark (IDE-HS)
• Point solution
• Classic Spark or Hadoop architecture
• Open Platform for Apache Hadoop
• Optional IBM value-adds
• Based on Power LC

IBM Data Engine for Analytics (IDEA)
• Enterprise solution
• Shared external storage (GPFS)
• BigInsights
• Platform Computing for Resource and Cluster Mgt.
• Based on Power L


IBM Data Engine for Analytics Architecture

• Hadoop Management: 2 VMs each
• Edge Nodes (data ingress)
• Data Nodes: 1-2 VMs each
• Physical Cluster Management
• Shared High Bandwidth Storage
• High Speed Network

+ Pre-assembled
+ On-site Services


Physical Cluster Management
• Power S812L, RHEL 6.5, 10 cores, 32 GB memory
• Platform Cluster Manager 4.2 Advanced Edition, xCAT 2.9
• Single LPAR, HA server option available

Hadoop Management
• Power S822L, RHEL 6.5, 24 cores, 256 GB memory
• BigInsights 4.1
• Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition
• Spectrum Scale 4.1 Client
• Minimum 2 servers, 2 LPARs each

Data Nodes
• Power S822L, RHEL 6.5, 24 cores, 256 GB memory
• BigInsights 4.1
• Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition
• Spectrum Scale 4.1 Client
• User-specified number of servers, 1 LPAR each (2 LPARs when running Big SQL)

Shared Storage
• IBM Elastic Storage Server Models GL2, GL4, GL6, GS2, GS4, or GS6
• GSS 2.2 Software, GSS Management/Maintenance
• Spectrum Scale 4.1.1 TL1 Server
• User-specified number of storage servers

Edge Nodes (Opt)
• Power S822L, RHEL 6.5, 24 cores, 256 GB memory
• BigInsights 4.1
• Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition
• Spectrum Scale 4.1 Client
• User-specified number of edge nodes, 1 or 2 LPARs each

Network (Opt)
• Mellanox 10 or 40 Gb RoCE Ethernet or Mellanox 56 Gb InfiniBand


Apache Hadoop Ecosystem


Platform Symphony Differentiation vs Open Source


YARN

• Monitor CPU and Memory only

• XML file to setup prioritization/scheduling strategy

• Limited Pre-emption

  – Available with the FAIR scheduler.
  – Capacity scheduler does not have pre-emption, so 100% elasticity is unwise.

• Cron to change strategy by time-of-day

Platform Symphony

• 50% Faster Time to Insights with MapReduce

• Deeper performance insight with visualizations of 150 machine metrics to support tuning & planning

• GUI based setup/administration & Visual validation of policies

• Advanced Pre-emption

  – Pre-empt the least running jobs
  – Round robin pre-emption

• Different resource strategy by time of day

• Showback Reports


Spectrum Scale will enhance your Hadoop Environment !

Hadoop HDFS
• HDFS NameNode HA added in version 2.0; NameNode HA in active/passive configuration
• Difficulty to ingest data: special tools required
• Single-purpose, Hadoop MapReduce only
• Lacking enterprise readiness
• Large block sizes: poor support for small files
• Compute and storage tightly coupled, leading to very low CPU utilization
• Non-POSIX file system: obscure commands. Does not support in-place updates.

IBM Spectrum Scale
• No single point of failure; distributed metadata in active/active configuration since 1998
• Ingest data using policies for data placement
• Versatile, multi-purpose, hybrid storage (locality and shared)
• Enterprise ready, with support for advanced storage features (encryption, DR, replication, SW RAID etc.)
• Variable block sizes: suited to multiple types of data and metadata access patterns
• Scale compute and storage independently (policy-based ILM)
• POSIX file system: easy to use and manage