1 © Cloudera, Inc. All rights reserved.
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney @wesmckinn
SF Big Analytics Meetup, 2016-04-05
Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis 2012
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
In process: Python for Data Analysis: 2nd Edition
Coming late 2016 / early 2017
Python + Big Data: The State of things
• See “Python and Apache Hadoop: A State of the Union” from February 17
• Areas where much more work needed
  • Binary file format read/write support (e.g. Parquet files)
  • File system libraries (HDFS, S3, etc.)
  • Client drivers (Spark, Hive, Impala, Kudu)
  • Compute system integration (Spark, Impala, etc.)
Apache Arrow
Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow
Arrow in a Slide
• New Top-level Apache Software Foundation project
  • Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
  • A significant % of the world’s data will be processed through Arrow!
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
Apache Arrow: What is it?
• http://arrow.apache.org
• Not a piece of software, exactly!
• A standardized in-memory representation for columnar data
• Enables
  • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
  • Cheap data interchange amongst systems, little or no serialization
  • Flexible support for complex JSON-like data
• Targets: Impala, Kudu, Parquet, Spark
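To make "complex JSON-like data" in a columnar representation more concrete, here is a minimal sketch of the offsets-plus-values idea that Arrow uses for variable-length lists. It uses only the Python standard library, not an Arrow implementation; the function names `encode_list_column` and `get_row` are hypothetical, and null handling (validity bitmaps) is left out.

```python
from array import array


def encode_list_column(rows):
    """Flatten a column of integer lists into (offsets, values) buffers."""
    offsets = array("i", [0])   # row i spans values[offsets[i]:offsets[i + 1]]
    values = array("q")         # all child values, stored contiguously as int64
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return offsets, values


def get_row(offsets, values, i):
    """Recover row i by slicing the shared values buffer."""
    return values[offsets[i]:offsets[i + 1]].tolist()


offsets, values = encode_list_column([[1, 2], [], [3]])
print(offsets.tolist())  # [0, 2, 2, 3]
print(values.tolist())   # [1, 2, 3]
```

Because every row is just a pair of offsets into one flat buffer, nested data keeps the same contiguous, scan-friendly shape as a plain numeric column.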
Focus on CPU Efficiency
[Figure: the same four rows laid out in a traditional memory buffer and in an Arrow memory buffer]

        session_id   timestamp         source_ip
Row 1   1331246660   3/8/2012 2:44PM   99.155.155.225
Row 2   1331246351   3/8/2012 2:38PM   65.87.165.114
Row 3   1331244570   3/8/2012 2:09PM   71.10.106.181
Row 4   1331261196   3/8/2012 6:46PM   76.102.156.138

Traditional Memory Buffer: values stored row by row (Row 1, Row 2, Row 3, Row 4)
Arrow Memory Buffer: values stored column by column (session_id, timestamp, source_ip)
• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
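The difference between the two buffers can be sketched with the standard library alone. The example below is an illustration under simplifying assumptions, not Arrow's actual layout: it keeps only `session_id` (as int64) and `source_ip` (packed as uint32), drops the timestamp column, and all function names are hypothetical.

```python
import struct
from array import array
from ipaddress import IPv4Address

ROWS = [
    (1331246660, "99.155.155.225"),
    (1331246351, "65.87.165.114"),
    (1331244570, "71.10.106.181"),
    (1331261196, "76.102.156.138"),
]

RECORD = struct.Struct("<qI")  # one row: int64 session_id, then uint32 source_ip


def to_row_buffer(rows):
    """Row-oriented layout: the fields of each record are interleaved."""
    return b"".join(RECORD.pack(sid, int(IPv4Address(ip))) for sid, ip in rows)


def to_column_buffers(rows):
    """Columnar layout: each field lives in its own contiguous buffer."""
    session_ids = array("q", [sid for sid, _ in rows])
    source_ips = array("I", [int(IPv4Address(ip)) for _, ip in rows])
    return session_ids, source_ips


def max_session_id_rowwise(buf):
    """Scanning one field means stepping over every whole record."""
    return max(sid for sid, _ in RECORD.iter_unpack(buf))


def max_session_id_columnar(session_ids):
    """Scanning one field touches only that field's buffer."""
    return max(session_ids)
```

In the row buffer a scan of `session_id` also drags each `source_ip` through the cache; in the columnar form the scan reads one dense run of int64 values, which is the access pattern that cache locality and vectorized operations favor.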
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
[Figure: today, Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark exchange data through repeated "Copy & Convert" steps; with Arrow, the same systems share Arrow memory.]
Big Data Systems: Poor Python IO performance
http://wesmckinney.com/blog/pandas-and-apache-arrow/
Real World Example: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• Written by Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
[Figure: a Feather file contains Arrow array 0, Arrow array 1, …, Arrow array n (Apache Arrow memory) together with Feather metadata (Google flatbuffers).]
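The general shape of such a file (raw column buffers plus a small metadata block that says where each one lives) can be sketched as follows. This is a toy format and explicitly not the Feather format: it uses a JSON footer where Feather uses flatbuffers, and `write_frame` and `read_frame` are hypothetical names.

```python
import json
import os
import struct
from array import array


def write_frame(path, columns):
    """Write each column's raw bytes, then a JSON footer describing them."""
    meta = []
    with open(path, "wb") as f:
        for name, col in columns.items():
            meta.append({"name": name, "typecode": col.typecode,
                         "offset": f.tell(), "length": len(col)})
            col.tofile(f)
        footer = json.dumps(meta).encode("utf-8")
        f.write(footer)
        f.write(struct.pack("<I", len(footer)))  # footer size is the last 4 bytes


def read_frame(path):
    """Read the footer first, then load each column straight from its offset."""
    columns = {}
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        (footer_len,) = struct.unpack("<I", f.read(4))
        f.seek(-4 - footer_len, os.SEEK_END)
        for entry in json.loads(f.read(footer_len)):
            col = array(entry["typecode"])
            f.seek(entry["offset"])
            col.fromfile(f, entry["length"])  # bulk read, no per-value parsing
            columns[entry["name"]] = col
    return columns
```

Since the column bytes on disk are already in their in-memory form, reading is a bulk copy rather than a parse, which is the property that lets read speed approach disk IO speed.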
Real World Example: Feather File Format for Python and R
R

    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)

Python

    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)
Apache Parquet: Binary columnar storage format
• I just became a Parquet committer!
• github.com/apache/parquet-cpp
• Python users will soon be able to read Parquet files via PyArrow
• parquet-cpp <-> PyArrow <-> pandas
Language Bindings
• Target Languages
  • Java (beta)
  • CPP (underway)
  • Python & Pandas (underway)
  • R
  • Julia
• Initial Focus
  • Read a structure
  • Write a structure
  • Manage Memory
pandas and Arrow in context
RPC & IPC: Moving Data Between Systems
RPC
• Avoid Serialization & Deserialization
• Layer TBD: Focused on supporting vectored IO
  • Scatter/gather reads/writes against socket
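Scatter/gather IO against a socket can be sketched with Python's standard `socket.sendmsg` and `socket.recvmsg_into` (available on Unix-like systems). This is only an illustration of the technique, not the Arrow RPC layer, which the slide marks as TBD; `send_columns` and `recv_columns` are hypothetical helpers, and short reads and writes are reported rather than retried.

```python
import socket
from array import array


def send_columns(sock, columns):
    """Gather write: hand every column buffer to the kernel in one call."""
    views = [memoryview(col).cast("B") for col in columns]
    expected = sum(view.nbytes for view in views)
    sent = sock.sendmsg(views)
    if sent != expected:  # a full implementation would send the remainder
        raise RuntimeError("short write")
    return sent


def recv_columns(sock, typecodes, length):
    """Scatter read: fill one preallocated array per column in one call."""
    columns = [array(tc, bytes(array(tc).itemsize * length)) for tc in typecodes]
    views = [memoryview(col).cast("B") for col in columns]
    expected = sum(view.nbytes for view in views)
    received = sock.recvmsg_into(views)[0]
    if received != expected:  # a full implementation would keep reading
        raise RuntimeError("short read")
    return columns
```

The column buffers are never concatenated into an intermediate message on the sending side, and on the receiving side the bytes land directly in the arrays that will be used for computation.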
IPC
• Alpha implementation using memory mapped files
  • Moving data between Python and Drill
• Working on shared allocation approach
  • Shared reference counting and well-defined ownership semantics
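As a rough illustration of IPC through memory-mapped files, the sketch below has one side write a raw int64 column to a file and the other side map it and read values in place. It uses only the standard library and assumes both sides agree on the layout (native-endian int64); it is not the alpha implementation mentioned above, and the function names are hypothetical.

```python
import mmap
from array import array


def write_column(path, values):
    """Producer: dump an int64 column to a file as raw native-endian bytes."""
    with open(path, "wb") as f:
        array("q", values).tofile(f)


def open_column(path):
    """Consumer: map the file and view it as int64 values without copying."""
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return mapped, memoryview(mapped).cast("q")
```

The consumer should call `release()` on the returned view before closing the mapping. That small ordering constraint is a miniature version of the ownership question the shared-allocation work has to answer across processes.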
Executing data science languages in the compute layer
UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase
Python, R, Julia, …?
Real World Example: Python With Spark, Drill, Impala
[Figure: a SQL engine passes input partitions 0 … n - 1 as input to user-supplied Python functions; their outputs become output partitions 0 … n - 1, which go back to a SQL engine.]
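The partition-in, partition-out pattern in the figure can be sketched in a few lines. Here a plain Python loop stands in for the SQL engine, and the user function receives a whole column batch per call instead of one value at a time; `run_udf` and `to_minutes` are hypothetical names, not part of Spark, Drill, or Impala.

```python
from array import array


def run_udf(partitions, func):
    """Stand-in for the engine: call func once per input partition."""
    return [func(partition) for partition in partitions]


def to_minutes(seconds):
    """Example user function: takes and returns a whole column batch."""
    return array("d", (s / 60 for s in seconds))


in_partitions = [array("q", [60, 120]), array("q", [30])]
out_partitions = run_udf(in_partitions, to_minutes)
```

Calling the function once per batch rather than once per row is what makes a shared columnar memory format attractive here: the engine could hand over a partition's buffers without converting each value.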
What’s Next
• Parquet for Python & C++
  • Using Arrow as intermediary
• Available IPC Implementation
• Spark, Drill Integration
  • Faster UDFs, Storage interfaces
Apache Arrow in practice
Get Involved
• Join the community
  • [email protected]
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
Thank you
Wes McKinney @wesmckinn
Views are my own