IBM Systems
Anand Haridass
Chief Engineer, POWER Integrated Solutions (BD&A)
Senior Technical Staff Member
India Systems Development
[email protected]
POWER Up Your Insights
Acknowledgement
These slides draw on numerous IBM presentations, tutorials, and studies
– Thank you
Agenda
• The Big Picture about Big Data
• Hadoop
• Spark
• IBM POWER Systems – Big Data
Volume, Variety, Velocity
Information is THE resource of the 21st Century …
• 2.5 quintillion bytes of data created per day
• 90% of data created in the last 2 years
• 35 zettabytes in 2020!
Data sources: Rich Media, Weather, Consumer, Geospatial, Internet of Things, Social Media, Webpages
An unprecedented increase in the use of digital devices is causing a huge amount of data to be generated and captured by businesses. This digital data, also known as Big Data, has the potential to transform businesses and create value.
There will be over 200 billion connected devices
There will be over 12 billion machine-to-machine devices
Machine generated data will be 42% of all data
Merging the Traditional & Big Data approaches
Traditional Approach: Structured & Repeatable
• Business users determine what question to ask
• IT structures the data to answer that question
• Examples: monthly sales reports, profitability analysis, customer surveys
Big Data Approach: Iterative & Exploratory
• IT delivers a platform to enable creative discovery
• Business explores what questions could be asked
• Examples: brand sentiment, product strategy, maximum asset utilization
Big Data : Value from Insights
Value increases with each level of insight:
• Descriptive: What is happening
• Diagnostic: Why did it happen
• Predictive: What could happen
• Prescriptive: What should I do
• Cognitive: What did I learn
Cognitive computing defines systems that learn at scale, reason with purpose, and interact with humans naturally. Cognitive systems are probabilistic; this is a core point of difference, as it means they are not programmed but trained. Cognitive systems can generate not just answers to questions but also hypotheses, reassured responses, and recommendations about more complex and meaningful data.
What is Hadoop?
• Open source project to enable processing of large data sets
• Batch oriented
• Structured, unstructured, semi-structured data
• Written in Java
• Scalable to thousands of machines
• Fault tolerant
• Core components: HDFS, MapReduce, Hadoop Common
Example: reading 1 TB of data at a disk read rate of 200 MB/s
• 1 server, 1 disk: 5000 sec
• 1 server, 10 disks: 500 sec
• 100 servers (x 10 disks each): 5 sec
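The scan times above follow from dividing the data size by the aggregate read rate. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes), perfectly parallel reads, and no network or coordination overhead; the function name `scan_seconds` is illustrative only:

```python
def scan_seconds(data_bytes, disks, read_rate_bytes_per_s=200e6):
    """Time to read data_bytes when the read is spread evenly over `disks` disks."""
    return data_bytes / (disks * read_rate_bytes_per_s)


ONE_TB = 1e12  # decimal terabyte, matching the slide's round numbers

configs = {
    "1 server, 1 disk": 1,
    "1 server, 10 disks": 10,
    "100 servers x 10 disks": 100 * 10,
}
for label, disks in configs.items():
    print(f"{label}: {scan_seconds(ONE_TB, disks):.0f} sec")
```

Real clusters will not reach this ideal, but the linear relationship is the reason Hadoop spreads data across many disks and servers.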
Hadoop Basic Flow
Input data: logs, social data, devices, DBs
1. Extract data into HDFS (3 copies of data)
2. Map: create key/value pairs
3. Shuffle: sort key/value pairs
4. Reduce: process data, write output
5. Read results
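The map, shuffle, and reduce stages can be sketched in a few lines of plain Python. This is a single-process toy word count, not the Hadoop API; the function names and the sample log lines are hypothetical and exist only to show what each stage does to the data:

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, value) pair for every word in every input record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single output value."""
    return {key: sum(values) for key, values in grouped}


# Hypothetical input records standing in for log data read from HDFS
logs = ["error disk full", "warning disk slow", "error network down"]
counts = reduce_phase(shuffle_phase(map_phase(logs)))
print(counts)
```

In a real cluster the map and reduce functions run on many machines in parallel and the shuffle moves data across the network; the data flow is the same.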
What is Spark?
• Open Source, Apache 2.0, version 1.x
• Written in Scala
• In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)
• Rapid in-memory processing of resilient distributed datasets (RDDs)
• Multiple Workflows
• Multiple Libraries
• Multiple APIs
Fast, flexible engine for big data processing: 10x (on disk) to 100x (in memory) faster than MapReduce
• Spark SQL: Spark module for structured data processing using either SQL or a DataFrame API. Provides a common way to access a wide range of data sources.
• Spark Streaming: micro-batch processing engine that enables applications to process real-time streams of data with latency as low as 0.5 seconds.
• GraphX: API and processing engine for graphs and graph computation.
• MLlib: collection of machine learning libraries that can run on a distributed cluster.
• SparkR: enables R programmers to use existing tools (RStudio) while Spark does the actual processing behind the scenes.
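The micro-batch idea behind Spark Streaming can be illustrated without Spark itself: incoming events are cut into small fixed-width time windows, and each window is processed as an ordinary batch. The sketch below is a plain-Python toy, not the Spark Streaming API; `micro_batches` and the sample events are hypothetical, and the 0.5 second interval is taken from the latency figure above:

```python
from collections import defaultdict


def micro_batches(events, batch_interval=0.5):
    """Group (timestamp_seconds, payload) events into fixed-width time windows."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        window = int(timestamp // batch_interval)  # index of the window the event falls in
        batches[window].append(payload)
    # Return the non-empty windows in time order
    return [batches[w] for w in sorted(batches)]


# Hypothetical event stream: (arrival time in seconds, payload)
events = [(0.10, "a"), (0.40, "b"), (0.60, "c"), (1.20, "d"), (1.30, "e")]
for batch in micro_batches(events):
    print(len(batch), batch)
```

Because each window is handled as a batch, the batch interval sets a floor on end-to-end latency.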
Apache Spark - Resilient Distributed Dataset (RDD)
Deployment options: Spark as a Service, Spark Standalone, Spark on Hadoop, Spark with Mesos
Open Data Platform Initiative
• ODPi has an open governance model. Developers form a Technical Steering Committee
• All members have an equal vote on ODPi Core decisions.
• ODPi has a Board of Directors responsible for the financial, legal and promotional aspects of ODPi.
• Non-profit organization accelerating the delivery of Big Data solutions by powering a platform called ODPi Core.
• The ODPi Core focuses on a small but critical set of projects
• Goal: enable a rapid start and an industry driven definition
ODPi Members include: Ampool, Altiscale, ArenaData, AsiaInfo, Capgemini, DataTorrent, EMC, GE, Hortonworks, IBM, Infosys, NEC, Pivotal, PLDT, SAS, Squid Solutions, SyncSort, Telstra, Toshiba, UNIFi, VMware, WANdisco, Xiilab, zData and Zettaset.
ODPi & Apache Software Foundation (ASF)
• ODPi supports the ASF mission
• ASF provides governance around individual projects without looking at the ecosystem and collections of projects
• ODPi provides a vendor-led, consistent packaging model and certification for Big Data components as an ecosystem: test once, run anywhere for big data applications
  – Improves ecosystem interoperability
  – Unlocks customer choice
  – Eliminates wasteful guesswork
Hadoop and Spark Offer Significant Business Benefits
[Figure: value (low to high) versus big data maturity (low to high), across Operations, Data Warehousing, Line of Business and Analytics, and New Business Imperatives]
Lower the Cost of Storage
Warehouse Modernization
• Data lake
• Data offload
• ETL offload
• Queryable archive and staging
Data-Informed Decision Making
• Full dataset analysis (no more sampling)
• Extract value from non-relational data
• 360° view of all enterprise data
• Exploratory analysis and discovery
Business Transformation
• Create new business models
• Risk-aware decision making
• Fight fraud and counter threats
• Optimize operations
• Attract, grow, retain customers
IBM POWER Systems
Driving Innovation Beyond The Chip
Microprocessors alone no longer drive sufficient Cost/Performance improvements
System stack innovations are required to drive Cost/Performance
The OpenPOWER Foundation
Market Shifts
• Moore’s law no longer satisfies performance gain
• Numerous IT consumption models
• Mature open software ecosystem
New Open Innovation
• Open development: open software, open hardware
• Collaboration of thought leaders: simultaneous innovation, multiple disciplines
• Performance of POWER architecture: amplified capability
• Rich software ecosystem
• Spectrum of POWER servers
• Multiple hardware options
• Derivative POWER chips
[Figure: ecosystem layers. Chip / SoC (Dev, IP Dev); Technology (FAB); I/O, Networking, Storage; FW (Open Source); SYS (ODM, OEM); SW (Linux, ISV, Open Source); markets: Web 2.0, Data Center, MSP, Cloud]
120+ members and growing
The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise, investment, and server-class intellectual property to serve the evolving needs of customers.
[Figure: Platinum Members]
POWER8 Processor
Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 exec pipe
• 2X internal data flows/queues
• Enhanced prefetching
Caches
• 64 KB data cache (L1)
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
• Up to 128 MB eDRAM L4 (off-chip)
Memory
• Dual memory controllers
• 230 GB/sec sustained bandwidth
Bus Interfaces
• Integrated PCIe Gen3
• SMP interconnect
• CAPI
Accelerators
• Crypto & memory expansion
• Transactional memory
• Data move / VM mobility
Energy Management
• On-chip power management micro-controller
• Integrated per-core VRM
• Critical path monitors
Technology
• 22nm SOI, eDRAM, 15 ML, 650mm2
[Figure: POWER8 die layout showing 12 cores with per-core L2, 8M L3 regions, L3 cache & chip interconnect, memory controllers, SMP links, accelerators, and PCIe]
Source: IBM Journal of Research and Development, Issue 1, Jan.-Feb. 2015, on IEEE Xplore
POWER8 is designed & optimized for Big Data & Analytics
Processors: flexible, fast execution of analytics algorithms
• 4X threads per core vs. x86 (up to 1536 threads per system)
Memory: large, fast workspace to maximize business insight
• ~4X memory bandwidth vs. x86 (up to 16TB of memory)
Cache: ensure continuous data load for fast responses
• 4X more cache vs. x86 (up to 231MB cache per socket)
Optimized for a broad range of big data & analytics workloads:
Industry Solutions
5X Faster
• Supports growth of users, reports and complex queries
• Delivers fast analytics results for real-time decision-making
• Handles large volumes of data for better response times
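The thread and cache figures quoted for POWER8 can be roughly reproduced from the per-chip numbers on the processor slide (12 cores, SMT8, 64 KB L1 data cache and 512 KB L2 per core, 96 MB L3, up to 128 MB L4). The sketch below is one plausible accounting, not IBM's stated derivation; in particular the 16-socket system size is an assumption, chosen because it makes the product match 1536:

```python
CORES_PER_SOCKET = 12    # from the POWER8 processor slide
THREADS_PER_CORE = 8     # SMT8
SOCKETS_PER_SYSTEM = 16  # assumption: not stated in the slides

threads_per_system = SOCKETS_PER_SYSTEM * CORES_PER_SOCKET * THREADS_PER_CORE

# Cache per socket in MB, using only the sizes listed on the processor slide
l1_data_mb = CORES_PER_SOCKET * 64 / 1024   # 64 KB L1 data cache per core
l2_mb = CORES_PER_SOCKET * 512 / 1024       # 512 KB L2 per core
l3_mb = 96                                  # shared eDRAM L3
l4_mb = 128                                 # off-chip eDRAM L4 (maximum)
cache_per_socket_mb = l1_data_mb + l2_mb + l3_mb + l4_mb

print(threads_per_system)
print(cache_per_socket_mb)
```

Under these assumptions the total comes to 230.75 MB, which rounds to the 231 MB quoted above.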
Yateesh Vusirika – Open Databases SQL / NoSQL – What’s on offer ?
POWER Advantages for Spark
• Streaming and SQL benefit from high thread density and concurrency
  – Processing multiple packets of a stream and different stages of a message stream pipeline
  – Processing multiple rows from a query
• Machine Learning benefits from large caches and memory bandwidth
  – Iterative algorithms on the same data
  – Fewer core pipeline stalls and overall higher throughput
• Graph also benefits from large caches, memory bandwidth and higher thread strength
  – Flexibility to go from 8 SMT threads per core to 4 or 2
  – Manage balance between thread performance and throughput
• Headroom
  – Balanced resource utilization, more efficient scale-out
  – Multi-tenant deployments
Price Performance of Spark on POWER Cloud
Workloads: Machine Learning, SQL, Graph
1.5X
• Spend 33% less on infrastructure supporting the same amount of workload
• Spend the same on infrastructure but host 50% more workload
* Based on SoftLayer pricing; subject to change
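The two bullets are the same 1.5X price-performance ratio read in two directions. A minimal sketch of the conversion, with illustrative function names:

```python
def savings_at_same_workload(price_perf_ratio):
    """Fraction of spend saved when running the same workload."""
    return 1 - 1 / price_perf_ratio


def extra_workload_at_same_spend(price_perf_ratio):
    """Fraction of additional workload hosted for the same spend."""
    return price_perf_ratio - 1


ratio = 1.5  # price-performance advantage quoted on the slide
print(f"{savings_at_same_workload(ratio):.0%} less spend")
print(f"{extra_workload_at_same_spend(ratio):.0%} more workload")
```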
GPU Use Case Example: Adverse Drug Reaction Prediction built on Spark
Spark: fast and general engine for large-scale data processing
• 25X speed-up for the model-building stage (using Spark MLlib Logistic Regression)
• Transparent to the Spark application
• Game changer for personalized medicine
IBM Big Data on Power Offerings
Stage 1: Prove Value. Digital Start for Big Data on Power
• Limited data investment per project, often <10TB
• Single project, limited use cases
• Ready access for Power customers
• On premise or cloud
• Organization: Line of Business (LOB) or Data Science team
Stage 2: Scale for Multiple Projects. IBM Data Engine for Hadoop and Spark
• Moderate data investment per project, 50TB to PB
• Many independent use case projects across LOBs
• Simplify operations: easy to deploy & manage
• Advanced resource & storage management
• Better resilience for big data
• Spark: 2X better price performance vs. x86
• Organization: LOB or Data Science team
Stage 3: Scale for Mixed Analytics. IBM Data Engine for Analytics
• Significant data investment per project, 1/2 to multi PB
• Multiple use cases with diverse SLAs
• Designed for consolidation and mixed analytics workloads: streams, at rest, text
• Lowest $/TB and less than half the storage infrastructure
• Leadership resilience for big data environment
• Adapt and scale to your changing analytics needs
• Organization: IT infrastructure team supporting LOBs and data team
POWER Hadoop and Spark Integrated Options
IBM Data Engine for Hadoop and Spark (Integrated Solution)
• Scale-out storage rich servers
• Three replicas of data
• Traditional read dominated workloads
• Ideal for simpler workload patterns
• Ideal for smaller deployments
IBM Data Engine for Analytics (Integrated Solution)
• Compute only servers with shared storage
• Single replica of data
• Newer write oriented workloads
• Sophisticated scheduler
• POSIX compliant file system
• Ideal for larger deployments
IBM Data Engine for Hadoop and Spark: IDE-HS
OpenPOWER innovation with IBM Open Platform with Apache Hadoop for a high performance, storage dense and fully integrated cluster offering.
[Figure: cluster of OpenPOWER nodes, each running IOP +]
OpenPOWER (POWER8) S812LC
• 2x x86 core performance
• Lowest cost Power HW
IBM Open Platform
• Open Hadoop
• Value add options
Spectrum Scale FPO Option
• Internal replicated disk
• POSIX compliant
• Encryption/replication
Optional Platform Symphony
• Higher utilization
• Shared cluster
• Better throughput
Platform Cluster Manager
• Simplified physical cluster management
Solution
• Pre-assembled/tested cluster
• On-site services
• Lower risk & faster time to value
IBM Data Engine for Analytics: IDEA
A fully integrated solution with software and infrastructure optimized for Big Data & Analytics. Appliance-like but much more versatile!
[Figure: cluster of POWER8 nodes, each running BigInsights and Spectrum Scale, with Platform Cluster Manager, Platform Symphony, and Spectrum Scale ESS]
POWER8 S822L
• 2X x86 core performance
• Fewer nodes
IBM Open Platform
• Industry standard Hadoop
Spectrum Scale ESS
• One copy of data
• POSIX compliant
• Erasure coding
• Encryption/replication
Platform Symphony
• Higher utilization
• Shared cluster
• Better throughput
Platform Cluster Manager
• Simplified physical cluster management
Solution
• Grow disk/CPU separately
• Pre-assembled/tested cluster
• On-site services
• Lower risk & faster time to value
Data Engine for Analytics offers Independent Scaling of Servers & Storage
[Figure: compute-intensive axis (add more servers) versus storage-intensive axis (add more storage)]
• Add servers or storage or both as needed
• Adjust compute to storage ratio as workload needs change
• Standard Hadoop configurations with local storage and triple replica can result in overprovisioned compute to meet the storage demands
• Data Engine for Analytics allows right sizing of compute and storage independently to create an optimized configuration
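The overprovisioning effect can be made concrete with a back-of-the-envelope sizing model. The sketch below is illustrative only: the workload figures (2000 TB of data, 20 nodes of compute, 48 TB of raw disk per storage-rich node) are hypothetical, and the model ignores the capacity overhead of the shared storage system itself:

```python
import math


def coupled_nodes(data_tb, compute_nodes_needed, tb_per_node, replicas=3):
    """Nodes required when every node supplies both compute and local storage."""
    storage_nodes = math.ceil(data_tb * replicas / tb_per_node)
    return max(storage_nodes, compute_nodes_needed)


def decoupled_nodes(compute_nodes_needed):
    """Compute nodes required when storage is sized separately on shared storage."""
    return compute_nodes_needed


# Hypothetical workload: storage demand, not compute demand, drives the node count
coupled = coupled_nodes(data_tb=2000, compute_nodes_needed=20, tb_per_node=48)
decoupled = decoupled_nodes(compute_nodes_needed=20)
print(coupled, decoupled)
```

With these numbers the triple-replica design needs 125 servers to hold the data even though 20 would cover the compute, which is the overprovisioning the slide describes.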
Client Example – IDEA Architecture
Client: Multinational Telecommunications Company
A multinational telecommunications company with over 6M subscribers. Strategic value, as they influence the IT decisions in other countries.
Challenges
• Expectations of a Real Time Marketing (RTM) based solution to run event-based campaigns
• Enable event-based marketing, analysing various sources of input data containing information regarding subscribers' actions
• Dispatch the triggered events to downstream applications, such as campaign management, for associated campaign execution
Architecture
• IBM Data Engine for Analytics: 20 x Power S822L, 2 x ESS GL4, Spectrum Scale, PCM
• BigInsights, Streams, SPSS Modeler, SPSS Analytics Server
Solution and Approach
• Solution was to provide a Hadoop-based Big Data platform, integrated with the RTM decision engine, that will enable data monetisation opportunities, including location based analytics
• Customer was not comfortable with the large number of x86 data nodes in a typical Hadoop architecture
• The IBM team designed the Power solution and conducted a technical workshop with the client on the redefined Hadoop architecture based on IDEA
• Demonstrated the value of one integrated IBM team to the client
Key Client Benefits
• Optimized Big Data deployment architecture based on IDEA, with Linux on Power, Elastic Storage Server and Spectrum Scale
• Lower TCO with 4 racks on Power against 12 racks on x86
• More IO bandwidth with a 40GbE Power network against 10GbE on the x86 based solution
3x fewer racks for a 2 PB Big Data solution: 4 vs. 12
POWER Processor Roadmap
Focus on Enterprise: Technology and Performance Driven
POWER6 Architecture
• 2007, POWER6: 2 cores, 65nm; new micro-architecture, new process technology
• 2008, POWER6+: 2 cores, 65nm+; enhanced micro-architecture, enhanced process technology
• High frequency, enhanced RAS, dynamic energy management
POWER7 Architecture
• 2010, POWER7: 8 cores, 45nm; new micro-architecture, new process technology
• 2012, POWER7+: 8 cores, 32nm; enhanced micro-architecture, new process technology
• Large eDRAM L3 cache, optimized VSX, enhanced memory subsystem
Focus on Scale-Out and Enterprise: Cost and Acceleration Driven
POWER8 Architecture
• 2014, POWER8: 12 cores, 22nm; new micro-architecture, new process technology
• 2016, POWER8 w/ NVLink: 12 cores, 22nm; enhanced micro-architecture with NVLink
• Optimized for data-centric workloads, integrated PCIe, CAPI acceleration / I/O
POWER9 Architecture
• 2017, P9 SO: 24 cores, 14nm; new micro-architecture, direct attach memory, new process technology
• TBD, P9 SU: TBD cores, 14nm; enhanced micro-architecture, buffered memory
• Scale-out datacenter TCO optimization, scale-up performance optimization, acceleration enhancements to CAPI and NVLink, modularity for OpenPOWER
Partner Chip POWER8/9 (OpenPOWER ecosystem)
• 2018 - 20, P8/9 SO: 10nm - 7nm; existing micro-architecture, foundry technology
• Design targeting partner markets & systems, leveraging modularity
Future
• 2020+, POWER10: TBD; new micro-architecture, new technology, new features and functions
Price, performance, feature and ecosystem innovation
What’s in the works ?
GPUs and FPGAs for Compute offload, Machine Learning, Graph and other specialized acceleration
CAPI Flash for Memory consolidation/expansion, and Storage acceleration
RDMA for better latency, better network utilization, lower CPU utilization, lower Memory utilization
OpenPOWER extends the ability to innovate around Spark into the hardware and accelerators.
Backup
Power Systems and NVIDIA GPU Roadmap
• Current GPU attach: Power chip to NVIDIA GPU over PCIe x16, 16+16 GB/s
• Future NVLink GPU attachment: Power chip with NVLink to NVIDIA GPU with NVLink, 40+40 GB/s (80 GB/s peak*)
[Figure: system memory attached to the Power chip, graphics memory attached to each GPU]
• CPU to GPU NVLink enables
  – Easier programming of GPU accelerators
  – Better application throughput
  – Expanded set of accelerated applications
• New server: early shipments in 4Q ‘16
• IBM-NVIDIA NVLink Acceleration Lab
  – Seeking clients now
  – Apply at [email protected]
IBM Data Engine for Analytics (IDEA) overview (1 of 2)
Challenges with the traditional Hadoop model
• Typical Hadoop solutions use a storage rich, server based scale-out design
• Lose control of storage since it's lumped with compute
• No separate storage capital planning
• Usually no backup, archiving, or disaster recovery facilities
• Usually fewer storage security controls and less auditing
• Disk failures incur a heavy rebuild penalty, consume network resources, and reduce application performance
• Multiple replicas are expensive and superfluous if the workload doesn’t need lots of tasks accessing the same data over and over (read mostly)
• Typically cannot share resources with non-Hadoop workloads
• Cannot reuse existing infrastructure; requires different infrastructure
IBM Data Engine for Analytics (IDEA) overview (2 of 2)
Value Proposition
• Compute only servers with shared storage
• Single replica of data
• Better suited for workloads that tend to have a significant write component
• Ideal for complex analytic solutions where additional components may need to interact with BigInsights
Composed of several integrated components
• Compute: POWER8 compute nodes, 2 or more
• Storage: IBM Elastic Storage System with advanced distributed and parallel filesystem (Spectrum Scale based)
• Filesystem accessed over the network, supported by the Spectrum Scale (GPFS) protocol
• Networking: Ethernet and optionally InfiniBand (no Fibre Channel)
• System Software: Linux, cluster provisioning and management using PCM and xCAT
• Middleware: Platform Symphony and BigInsights
Solution Architecture Component Model
Power Hadoop/Spark Solutions
Roll Your Own
IBM Integrated Offerings
IBM Data Engine for Hadoop and Spark (IDE-HS)
• Point solution
• Classic Spark or Hadoop architecture
• Open Platform for Apache Hadoop
• Optional IBM value-adds
• Based on Power LC
IBM Data Engine for Analytics (IDEA)
• Enterprise solution
• Shared external storage (GPFS)
• BigInsights
• Platform Computing for resource and cluster management
• Based on Power L
IBM Data Engine for Analytics Architecture
• Hadoop Management: 2 VMs each
• Edge Nodes (data ingress)
• Data Nodes: 1-2 VMs each
• Physical Cluster Management
• Shared High Bandwidth Storage
• High Speed Network
+ Pre-assembled
+ On-site Services
Physical Cluster Management
• Power S812L, RHEL 6.5, 10 cores, 32 GB memory
• Platform Cluster Manager 4.2 Advanced Edition, xCAT 2.9
• Single LPAR, HA server option available
Hadoop Management
• Power S822L, RHEL 6.5, 24 cores, 256 GB memory
• BigInsights 4.1
• Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition
• Spectrum Scale 4.1 client
• Minimum 2 servers, 2 LPARs each
Data Nodes
• Power S822L, RHEL 6.5, 24 cores, 256 GB memory
• BigInsights 4.1
• Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition
• Spectrum Scale 4.1 client
• User specified number of servers, 1 LPAR each; 2 LPARs when running Big SQL
Shared Storage
• IBM Elastic Storage Server models GL2, GL4, GL6, GS2, GS4, or GS6
• GSS 2.2 software, GSS management/maintenance
• Spectrum Scale 4.1.1 TL1 server
• User specified number of storage servers
Edge Nodes (optional)
• Power S822L, RHEL 6.5, 24 cores, 256 GB memory
• BigInsights 4.1
• Platform Symphony 6.1.1, Platform Cluster Manager 4.2 Standard Edition
• Spectrum Scale 4.1 client
• User specified number of edge nodes, 1 or 2 LPARs each
Network (optional)
• Mellanox 10 or 40 Gb RoCE Ethernet or Mellanox 56 Gb InfiniBand
Apache Hadoop Ecosystem
Platform Symphony Differentiation vs Open Source
YARN
• Monitor CPU and memory only
• XML file to set up prioritization/scheduling strategy
• Limited pre-emption
  – Available with FAIR scheduler
  – Capacity scheduler does not have pre-emption, so 100% elasticity is unwise
• Cron to change strategy by time of day
Platform Symphony
• 50% faster time to insights with MapReduce
• Deeper performance insight with visualizations of 150 machine metrics to support tuning & planning
• GUI based setup/administration & visual validation of policies
• Advanced pre-emption
  – Pre-empt the least running jobs
  – Round robin pre-emption
• Different resource strategy by time of day
• Showback reports
Spectrum Scale will enhance your Hadoop Environment !
Hadoop HDFS
• HDFS NameNode HA added in version 2.0; NameNode HA in active/passive configuration
• Difficulty to ingest data: special tools required
• Lacking enterprise readiness
• Large block sizes: poor support for small files
• Compute and storage tightly coupled, leading to very low CPU utilization
• Single-purpose, Hadoop MapReduce only
• Non-POSIX file system: obscure commands. Does not support in-place updates.
IBM Spectrum Scale
• No single point of failure, distributed metadata in active/active configuration since 1998
• Ingest data using policies for data placement
• Enterprise ready with support for advanced storage features (encryption, DR, replication, SW RAID, etc.)
• Variable block sizes: suited to multiple types of data and metadata access patterns
• Scale compute and storage independently (policy based ILM)
• Versatile, multi-purpose, hybrid storage (locality and shared)
• POSIX file system: easy to use and manage