Pydata2012 NYC

7/30/2019 Pydata2012 NYC

1/35

PyTablesAnon-diskbinarydatacontainer,queryengineandcomputa:onalkernel

FrancescAlted

TutorialforthePyDataConference,October2012,NewYorkCity

7/30/2019 Pydata2012 NYC

2/35

10thanniversaryofPyTablesHi!,

PyTables is a Python package which allows dealing with HDF5 tables. Such

a table is defined as a collection of records whose values are stored in

fixed-length fields. PyTables is intended to be easy-to-use, and tries to

be a high-performance interface to HDF5. To achieve this, the newest

improvements introduced in Python 2.2 (like generators or slots andmetaclasses in new-brand classes) has been used. Pyrex creation extension

tool has been chosen to access the HDF5 library.

This package should be platform independent, but until now I've tested it

only with Linux. It's the first public release (v 0.1), and it is in

alpha state."--FrancescAltedannouncingPyTables0.1,October2002

7/30/2019 Pydata2012 NYC

3/35

Overview

WhatPyTablesis? DatastructuresinPyTables Compressingdata Advancedcapabili:esinPyTables

7/30/2019 Pydata2012 NYC

4/35

Notebooksfortutorial

http://pytables.org/download/PyData2012-NYC.tar.gz

7/30/2019 Pydata2012 NYC

5/35

Whatitis

Abinarydatacontainerforon-disk,structureddata

Canperformopera:onswithdata*on-disk* Basedonthestandardde-factoHDF5format

Freesoware(BSDlicense)

7/30/2019 Pydata2012 NYC

6/35

AboutHDF5

(HierarchicalDataFileversion5)

Aversa&ledatamodelthatcanrepresentcomplexdataobjectsaswellasassociated

metadata

Aportablefileformatwithnolimitonthenumberorsizeofdataobjectsinthecollec:on

Implementsahigh-levelAPIwithC,C++,Fortran90,andJavainterfaces

Freesoware(BSD,MITkindoflicense)

7/30/2019 Pydata2012 NYC

7/35

PyTablesdis:nc:vefeatures

SupportsagoodrangeofcompressorsZlib,bzip2,LZOandBlosc

Powerfulquerycapabili:esforTableobjects,includingindexing

Canperformout-of-coreopera:onsveryefficiently

7/30/2019 Pydata2012 NYC

8/35

Whatitisnot

Notarela:onaldatabasereplacementNotadistributeddatabase

Notextremelysecureorsafe(itsmoreaboutspeed!)

NotamereHDF5wrapper

7/30/2019 Pydata2012 NYC

9/35

DATASTRUCTURES

7/30/2019 Pydata2012 NYC

10/35

Datastructures

Highlevelofflexibilityforstructuringyourdata

Datatypesscalars(numerical&strings),records,enumerated,:me

Tablessupportmul:dimensionalcellsandnestedrecords

Mu:dimensionalarraysVariablelengtharrays

7/30/2019 Pydata2012 NYC

11/35

TheArrayobject

Easytocreatefile.createArray(mygroup,array,numpy_arr)

Shapecannotchange Cannotbecompressed

7/30/2019 Pydata2012 NYC

12/35

TheCArrayobject

Dataisstoredinchunks Eachchunkcanbecompressedindependently Shapecannotchange

7/30/2019 Pydata2012 NYC

13/35

TheEArrayobject

DataisstoredinchunksCanbecompressed

Shapecanchange(eitherenlargedorshrunk) Shapemustbekeptregular

7/30/2019 Pydata2012 NYC

14/35

TheVLArrayobject

Dataisstoredinvariablelengthrows Canbeenlargedorshrunk Datacannotbecompressed

7/30/2019 Pydata2012 NYC

15/35

TheTableobject

Dataisstoredinchunks Canbecompressed Canbeenlargedorshrunk Fieldscannotbeofvariablelength

Col1

(int32)

Col2

(string10)

Col3

(bool)

Col4

(complex64)

Col5

(float32)

7/30/2019 Pydata2012 NYC

16/35

Datasethierarchy

root

group1

table1 table2

group2

array

7/30/2019 Pydata2012 NYC

17/35

Aributes

Metadataaboutdata

table1

DateJul242006

Observa:ons555

CF[0.1,0.3,0.6]

7/30/2019 Pydata2012 NYC

18/35

COMPRESSIONCAPABILITIES

7/30/2019 Pydata2012 NYC

19/35

Whycompression?

Letsyoustoremoredatausingthesamespace

UsesmoreCPU,butCPU:meischeapcomparedwithdiskaccess

Differentcompressorsfordifferentusesbzip2,zlib,lzo,blosc

7/30/2019 Pydata2012 NYC

20/35

WhyBlosc?

7/30/2019 Pydata2012 NYC

21/35

Accelera:ngI/O

Blosc

}

}

Othercompressors

7/30/2019 Pydata2012 NYC

22/35

MemoryaccessvsCPUcycle:me

7/30/2019 Pydata2012 NYC

23/35

Laptopcomputerbackin2005

7/30/2019 Pydata2012 NYC

24/35

Stateoftheartcomputerin2012

(singlenode)

7/30/2019 Pydata2012 NYC

25/35

OUT-OF-CORECOMPUTATIONS

7/30/2019 Pydata2012 NYC

26/35

Opera:ngwithdisk-basedarrays

tables.Exprisanop:mizedevaluatorforexpressionsofdisk-basedarrays.

Itisacombina:onoftheNumexpradvancedcompu:ngcapabili:eswiththehighI/O

performanceofPyTables.

SimilarlytoNumexpr,disk-temporariesareavoided,andmul:-threadedopera:onis

preserved.

7/30/2019 Pydata2012 NYC

27/35

AvoidingtemporarieswithNumexpr

Tables.Exprfollowsthesameapproach,

butwithdiskinsteadofmemory

7/30/2019 Pydata2012 NYC

28/35

Performingout-of-corecomputa:ons

withPyTables

Virtualmachinenumexpr

7/30/2019 Pydata2012 NYC

29/35

ADVANCEDQUERYCAPABILITIES

7/30/2019 Pydata2012 NYC

30/35

Differentquerymodes

Regularquery [ r[c1] for r in table

if r[c2] > 2.1 and r[c3] == True)) ]

In-kernelquery [ r[c1] for r in table.where((c2>2.1)&(c3==True)) ]

Indexedquery table.cols.c2.createIndex() table.cols.c3.createIndex() [ r[c1] for r in table.where((c2>2.1)&(c3==True)) ]

7/30/2019 Pydata2012 NYC

31/35

Regularandin-kernelqueries

7/30/2019 Pydata2012 NYC

32/35

Customizableindexes

7/30/2019 Pydata2012 NYC

33/35

Indexedqueryperformance

7/30/2019 Pydata2012 NYC

34/35

Conceptstotakehome

PyTablesisop:mizedtodealwithdataondisk Mostoftheopera:onsusetheiterator/generatormachineryinPythonthegoalisnottobloatmemorywithdata

Queries,indexesandout-of-coreopera:onsaregoodexamplesoftheabove

7/30/2019 Pydata2012 NYC

35/35

Thankyou!

Ques:ons?

Documents

Pydata2012 NYC