68
Large-scale array-oriented computing with Python PyCon Taiwan, June 9, 2012 Travis E. Oliphant Friday, June 8, 12

Large-scale Array-oriented Computing with Python

  • Upload
    pycontw

  • View
    111

  • Download
    3

Embed Size (px)

DESCRIPTION

Keynote for Day 1 of PyCon Taiwan 2012, by Travis Oliphant http://tw.pycon.org/2012/speaker/

Citation preview

Page 1: Large-scale Array-oriented Computing with Python

Large-scale array-oriented computing with Python

PyCon Taiwan, June 9, 2012

Travis E. Oliphant

Friday, June 8, 12

Page 2: Large-scale Array-oriented Computing with Python

My Roots

Friday, June 8, 12

Page 3: Large-scale Array-oriented Computing with Python

My RootsImages from BYU Mers Lab

Friday, June 8, 12

Page 4: Large-scale Array-oriented Computing with Python

Science led to Python

Raja Muthupillai

Armando Manduca

Richard Ehman1997

⇢0 (2⇡f)2 Ui (a, f) = [Cijkl (a, f)Uk,l (a, f)],j

Friday, June 8, 12

Page 5: Large-scale Array-oriented Computing with Python

Finding derivatives of 5-d data⌅ = r⇥U

Friday, June 8, 12

Page 6: Large-scale Array-oriented Computing with Python

Scientist at heart

Friday, June 8, 12

Page 7: Large-scale Array-oriented Computing with Python

Python origins.Version Date

0.9.0 Feb. 1991

0.9.4 Dec. 1991

0.9.6 Apr. 1992

0.9.8 Jan. 1993

1.0.0 Jan. 1994

1.2 Apr. 1995

1.4 Oct. 1996

1.5.2 Apr. 1999

http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html

Friday, June 8, 12

Page 8: Large-scale Array-oriented Computing with Python

Python origins.Version Date

0.9.0 Feb. 1991

0.9.4 Dec. 1991

0.9.6 Apr. 1992

0.9.8 Jan. 1993

1.0.0 Jan. 1994

1.2 Apr. 1995

1.4 Oct. 1996

1.5.2 Apr. 1999

http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html

Friday, June 8, 12

Page 9: Large-scale Array-oriented Computing with Python

Brief History

Person Package Year

Jim Fulton Matrix Object in Python

1994

Jim Hugunin Numeric 1995

Perry Greenfield, Rick White, Todd Miller Numarray 2001

Travis Oliphant NumPy 2005

Friday, June 8, 12

Page 10: Large-scale Array-oriented Computing with Python

1999 : Early SciPy emergesDiscussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis

environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998, led to increased interest in 1999.

In response on 15 Jan, 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would

be creating this uber-package which eventually became SciPy

Gaussian quadrature 5 Jan 1999

cephes 1.0 30 Jan 1999

sigtools 0.40 23 Feb 1999

Numeric docs March 1999

cephes 1.1 9 Mar 1999

multipack 0.3 13 Apr 1999

Helper routines 14 Apr 1999

multipack 0.6 (leastsq, ode, fsolve, quad)

29 Apr 1999

sparse plan described 30 May 1999

multipack 0.7 14 Jun 1999

SparsePy 0.1 5 Nov 1999

cephes 1.2 (vectorize) 29 Dec 1999

Plotting??

GistXPLOTDISLINGnuplot

Helping with f2py

Friday, June 8, 12

Page 11: Large-scale Array-oriented Computing with Python

SciPy 2001

Founded in 2001 with Travis Vaught

Eric JonesweaveclusterGA*

Pearu Petersonlinalg

interpolatef2py

Travis Oliphantoptimizesparse

interpolateintegratespecialsignalstats

fftpackmisc

Friday, June 8, 12

Page 12: Large-scale Array-oriented Computing with Python

Community effort• Chuck Harris• Pauli Virtanen• David Cournapeau• Stefan van der Walt• Dag Sverre Seljebotn• Robert Kern• Warren Weckesser• Ralf Gommers• Mark Wiebe• Nathaniel Smith

Friday, June 8, 12

Page 13: Large-scale Array-oriented Computing with Python

Why Python for Technical Computing

• Syntax (it gets out of your way)• Over-loadable operators• Complex numbers built-in early• Just enough language support for arrays• “Occasional” programmers can grok it• Supports multiple programming styles• Expert programmers can also use it effectively• Has a simple, extensible implementation• General-purpose language --- can build a system• Critical mass

Friday, June 8, 12

Page 14: Large-scale Array-oriented Computing with Python

What is wrong with Python?• Packaging is still not solved well (distribute, pip, and

distutils2 don’t cut it)• Missing anonymous blocks • The CPython run-time is aged and needs an overhaul

(GIL, global variables, lack of dynamic compilation support)• No approach to language extension except for

“import hooks” (lightweight DSL need)• The distraction of multiple run-times...• Array-oriented and NumPy not really understood by

most Python devs.

Friday, June 8, 12

Page 15: Large-scale Array-oriented Computing with Python

Putting Science back in Comp Sci

• Much of the software stack is for systems programming --- C++, Java, .NET, ObjC, web- Complex numbers?- Vectorized primitives?

• Array-oriented programming has been supplanted by Object-oriented programming

• Software stack for scientists is not as helpful as it should be

• Fortran is still where many scientists end up

Friday, June 8, 12

Page 16: Large-scale Array-oriented Computing with Python

Array-Oriented Computing

Example1: Fibonacci Numbers

fn = fn�1 + fn�2

f0 = 0f1 = 1

f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, . . .

Friday, June 8, 12

Page 17: Large-scale Array-oriented Computing with Python

Common Python approaches

Recursive Iterative

Algorithm matters!!

Friday, June 8, 12

Page 18: Large-scale Array-oriented Computing with Python

Array-oriented approaches

Using LFilter

Using Formula

Friday, June 8, 12

Page 19: Large-scale Array-oriented Computing with Python

Array-oriented approaches

Friday, June 8, 12

Page 20: Large-scale Array-oriented Computing with Python

NumPy: an Array-Oriented Extension• Data: the array object

– slicing and shaping – data-type map to Bytes

• Fast Math: – vectorization – broadcasting– aggregations

Friday, June 8, 12

Page 21: Large-scale Array-oriented Computing with Python

NumPy Array

shape

Friday, June 8, 12

Page 22: Large-scale Array-oriented Computing with Python

Zen of NumPy

• strided is better than scattered• contiguous is better than strided• descriptive is better than imperative • array-oriented is better than object-oriented• broadcasting is a great idea • vectorized is better than an explicit loop• unless it’s too complicated --- then use Cython/Numba• think in higher dimensions

Friday, June 8, 12

Page 23: Large-scale Array-oriented Computing with Python

More NumPy Demonstration

Friday, June 8, 12

Page 24: Large-scale Array-oriented Computing with Python

Conway’s game of Life

• Dead cell with exactly 3 live neighbors will come to life

• A live cell with 2 or 3 neighbors will survive

• With too few or too many neighbors, the cell dies

Friday, June 8, 12

Page 25: Large-scale Array-oriented Computing with Python

Interesting Patterns emerge

Friday, June 8, 12

Page 26: Large-scale Array-oriented Computing with Python

APL : the first array-oriented language

• Appeared in 1964• Originated by Ken Iverson• Direct descendants (J, K, Matlab) are still

used heavily and people pay a lot of money for them

• NumPy is a descendent APL

J

K Matlab

NumericNumPy

Friday, June 8, 12

Page 27: Large-scale Array-oriented Computing with Python

Conway’s Game of Life

APL

NumPyInitialization

Update Step

Friday, June 8, 12

Page 28: Large-scale Array-oriented Computing with Python

Demo

Python Version

Array-oriented NumPy Version

Friday, June 8, 12

Page 29: Large-scale Array-oriented Computing with Python

Memory using Object-oriented

Object

Attr1

Attr2

Attr3

Object

Attr1

Attr2

Attr3

Object

Attr1

Attr2

Attr3

Object

Attr1

Attr2

Attr3

Object

Attr1

Attr2

Attr3

Object

Attr1

Attr2

Attr3

Friday, June 8, 12

Page 30: Large-scale Array-oriented Computing with Python

Array-oriented (Table) approach

Attr1 Attr2 Attr3

Object1

Object2

Object3

Object4

Object5

Object6

Friday, June 8, 12

Page 31: Large-scale Array-oriented Computing with Python

Benefits of Array-oriented

• Many technical problems are naturally array-oriented (easy to vectorize)

• Algorithms can be expressed at a high-level• These algorithms can be parallelized more

simply (quite often much information is lost in the translation to typical “compiled” languages)

• Array-oriented algorithms map to modern hard-ware caches and pipelines.

Friday, June 8, 12

Page 32: Large-scale Array-oriented Computing with Python

We need more focus on complied array-oriented languages with fast compilers!

Friday, June 8, 12

Page 33: Large-scale Array-oriented Computing with Python

What is good about NumPy?• Array-oriented• Extensive Dtype System (including structures)• C-API• Simple to understand data-structure• Memory mapping• Syntax support from Python• Large community of users• Broadcasting• Easy to interface C/C++/Fortran code

Friday, June 8, 12

Page 34: Large-scale Array-oriented Computing with Python

What is wrong with NumPy

• Dtype system is difficult to extend• Immediate mode creates huge temporaries

(spawning Numexpr)• “Almost” an in-memory data-base comparable

to SQL-lite (missing indexes)• Integration with sparse arrays • Lots of un-optimized parts• Minimal support for multi-core / GPU• Code-base is organic and hard to extend

Friday, June 8, 12

Page 35: Large-scale Array-oriented Computing with Python

Improvements needed• NDArray improvements

• Indexes (esp. for Structured arrays)• SQL front-end• Multi-level, hierarchical labels• selection via mappings (labeled arrays)• Memory spaces (array made up of regions)• Distributed arrays (global array)• Compressed arrays• Standard distributed persistance• fancy indexing as view and optimizations• streaming arrays

Friday, June 8, 12

Page 36: Large-scale Array-oriented Computing with Python

Improvements needed

• Dtype improvements• Enumerated types (including dynamic enumeration)• Derived fields• Specification as a class (or JSON)• Pointer dtype (i.e. C++ object, or varchar)• Finishing datetime• Missing data with bit-patterns• Parameterized field names

Friday, June 8, 12

Page 37: Large-scale Array-oriented Computing with Python

Example of Object-defined Dtype

@np.dtypeclass Stock(np.DType): symbol = np.Str(4) open = np.Int(2) close = np.Int(2) high = np.Int(2) low = np.Int(2) @np.Int(2) def mid(self): return (self.high + self.low) / 2.0

Friday, June 8, 12

Page 38: Large-scale Array-oriented Computing with Python

Improvements needed• Ufunc improvements

• Generalized ufuncs support more than just contiguous arrays

• Specification of ufuncs in Python• Move most dtype “array functions” to ufuncs• Unify error-handling for all computations• Allow lazy-evaluation and remote computation ---

streaming and generator data• Structured and string dtype ufuncs• Multi-core and GPU optimized ufuncs• Group-by reduction

Friday, June 8, 12

Page 39: Large-scale Array-oriented Computing with Python

More Improvements needed• Miscellaneous improvements

• ABI-management• Eventual Move to library (NDLib)?• Integration with LLVM• Sparse dimensions• Remote computation• Fast I/O for CSV and Excel• Out-of-core calculations• Delayed-mode execution

Friday, June 8, 12

Page 40: Large-scale Array-oriented Computing with Python

New Project

Blaze

NumPy

Next Generation NumPyOut-of-core

Distributed Tables

Friday, June 8, 12

Page 41: Large-scale Array-oriented Computing with Python

Blaze Main Features• New ndarray with multiple memory segments• Distributed ndtable which can span the world• Fast, out-of-core algorithms for all functions• Delayed-mode execution: expressions build up

graph which gets executed where the data is• Built-in Indexes (beyond searchsorted)• Built-in labels (data-array)• Sparse dimensions (defined by attributes or

elements of another dimension)• Direct adapters to all data (move code to data)

Friday, June 8, 12

Page 42: Large-scale Array-oriented Computing with Python

Delayed execution

Demo Code Only

Friday, June 8, 12

Page 43: Large-scale Array-oriented Computing with Python

Dimensions defined by Attributes

Day Month Year High Low

15 3 2012 30 20

16 3 2012 35 25

20 3 2012 40 30

21 3 2012 41 29

dim1

Friday, June 8, 12

Page 44: Large-scale Array-oriented Computing with Python

OutlineNDTable

NDArray

Bytes

GFunc DType

Domain

Friday, June 8, 12

Page 45: Large-scale Array-oriented Computing with Python

NDTable (Example)

Proc0 Proc1 Proc2 Proc3

Proc0 Proc1 Proc2 Proc3

Proc0 Proc1 Proc2 Proc3

Proc4 Proc4 Proc4 Proc4

Each Partition:• Remote• Expression• NDArray

Friday, June 8, 12

Page 46: Large-scale Array-oriented Computing with Python

Data URLs

• Variables in script are global addresses (DATA URLs). All the world’s data you can see via web can be in used as part of an algorithm by referencing it as a part of an array.• Dynamically interpret bytes as data-type• Scheduler will push code based on data-type

to the data instead of pulling data to the code.

Friday, June 8, 12

Page 47: Large-scale Array-oriented Computing with Python

Overview

Main Script

Processing Node

Processing Node

Processing Node

Processing Node

Processing Node

Code

Code

Code

Code

Friday, June 8, 12

Page 48: Large-scale Array-oriented Computing with Python

NDArray

• Local ndarray (NumPy++)• Multiple byte-buffers (streaming or random

access)• Variable-length arrays• All kinds of data-types (everything...)• Multiple patterns of memory access possible

(Z-order, Fortran-order, C-order)• Sparse dimensions

Friday, June 8, 12

Page 49: Large-scale Array-oriented Computing with Python

GFunc

• Generalized Function• All NumPy functions• element-by-element• linear algebra• manipulation• Fourier Transform

• Iteration and Dispatch to low-level kernels• Kernels can be written in anything that builds a

C-like interface

Friday, June 8, 12

Page 50: Large-scale Array-oriented Computing with Python

Early Timeline

Date Milestone

July 2012 Pre-alpha release

December 2012 Early Beta Release

June 2013 Version 1.0

Friday, June 8, 12

Page 51: Large-scale Array-oriented Computing with Python

PyData

All computing modules known to work with Blaze will be placed under PyData umbrella of

projects over the coming years.

Friday, June 8, 12

Page 52: Large-scale Array-oriented Computing with Python

Introducing Numba (lots of kernels to write)

Friday, June 8, 12

Page 53: Large-scale Array-oriented Computing with Python

NumPy Users

• Want to be able to write Python to get fast code that works on arrays and scalars

• Need access to a boat-load of C-extensions (NumPy is just the beginning)

PyPy doesn’t cut it for us!

Friday, June 8, 12

Page 54: Large-scale Array-oriented Computing with Python

Dynamic compilationPython

Function

NumPy Runtime

Ufu

ncs

Gen

eral

ized

U

Func

s

Func

tion-

base

d In

dexi

ng

Mem

ory

Filte

rs

Win

dow

K

erne

l Fu

ncs

I/O F

ilter

s

Red

uctio

n Fi

lters

Com

pute

d C

olum

ns

Dynamic Compilation

function pointer

Friday, June 8, 12

Page 55: Large-scale Array-oriented Computing with Python

SciPy needs a Python compiler

integrateoptimize

odespecial

writing more of SciPy at high-level

Friday, June 8, 12

Page 56: Large-scale Array-oriented Computing with Python

Numba -- a Python compiler

• Replays byte-code on a stack with simple type-inference

• Translates to LLVM (using LLVM-py)• Uses LLVM for code-gen• Resulting C-level function-pointer can be

inserted into NumPy run-time• Understands NumPy arrays• Is NumPy / SciPy aware

Friday, June 8, 12

Page 57: Large-scale Array-oriented Computing with Python

NumPy + Mamba = Numba

LLVM 3.1

Intel Nvidia AppleAMD

OpenCLISPC CUDA CLANGOpenMP

LLVM-PY

Python Function Machine Code

Friday, June 8, 12

Page 58: Large-scale Array-oriented Computing with Python

Examples

Friday, June 8, 12

Page 59: Large-scale Array-oriented Computing with Python

Examples

Friday, June 8, 12

Page 60: Large-scale Array-oriented Computing with Python

Software Stack Future?

LLVM

Python

COBJCFORTRAN

R

C++

Plateaus of Code re-use + DSLs

MatlabSQL

TDPL

Friday, June 8, 12

Page 61: Large-scale Array-oriented Computing with Python

How to pay for all this?

Friday, June 8, 12

Page 62: Large-scale Array-oriented Computing with Python

Dual strategy

Blaze

Friday, June 8, 12

Page 63: Large-scale Array-oriented Computing with Python

NumFOCUS

Num(Py) Foundation for Open Code for Usable Science

Friday, June 8, 12

Page 64: Large-scale Array-oriented Computing with Python

NumFOCUS

• Mission• To initiate and support educational programs

furthering the use of open source software in science.

• To promote the use of high-level languages and open source in science, engineering, and math research

• To encourage reproducible scientific research• To provide infrastructure and support for open

source projects for technical computing

Friday, June 8, 12

Page 65: Large-scale Array-oriented Computing with Python

NumFOCUS

• Activites• Sponsor sprints and conferences• Provide scholarships and grants for people using

these tools• Pay for documentation development and basic

course development• Fund continuous integration and build systems• Work with domain-specific organizations• Raise funds from industries using Python and

NumPy

Friday, June 8, 12

Page 66: Large-scale Array-oriented Computing with Python

NumFOCUS

Core Projects

Other Projects (seeking more --- need representatives)

NumPy SciPy IPython Matplotlib

Scikits Image

Friday, June 8, 12

Page 67: Large-scale Array-oriented Computing with Python

NumFOCUS

• Directors• Perry Greenfield• John Hunter• Jarrod Millman• Travis Oliphant• Fernando Perez

• Members• Basically people who donate for now. In time, a

body that elects directors.

Friday, June 8, 12

Page 68: Large-scale Array-oriented Computing with Python

• Large-scale data analysis products• Python and NumPy training• NumPy support and consulting• Rich-client or web user-interfaces• Blaze and PyData Development

Friday, June 8, 12