Upload
daniel-mcknight
View
227
Download
0
Embed Size (px)
Citation preview
7/30/2019 Pydata2012 NYC
1/35
PyTablesAnon-diskbinarydatacontainer,queryengineandcomputa:onalkernel
FrancescAlted
TutorialforthePyDataConference,October2012,NewYorkCity
7/30/2019 Pydata2012 NYC
2/35
10thanniversaryofPyTablesHi!,
PyTables is a Python package which allows dealing with HDF5 tables. Such
a table is defined as a collection of records whose values are stored in
fixed-length fields. PyTables is intended to be easy-to-use, and tries to
be a high-performance interface to HDF5. To achieve this, the newest
improvements introduced in Python 2.2 (like generators or slots andmetaclasses in new-brand classes) has been used. Pyrex creation extension
tool has been chosen to access the HDF5 library.
This package should be platform independent, but until now I've tested it
only with Linux. It's the first public release (v 0.1), and it is in
alpha state."--FrancescAltedannouncingPyTables0.1,October2002
7/30/2019 Pydata2012 NYC
3/35
Overview
WhatPyTablesis? DatastructuresinPyTables Compressingdata Advancedcapabili:esinPyTables
7/30/2019 Pydata2012 NYC
4/35
Notebooksfortutorial
http://pytables.org/download/PyData2012-NYC.tar.gz
7/30/2019 Pydata2012 NYC
5/35
Whatitis
Abinarydatacontainerforon-disk,structureddata
Canperformopera:onswithdata*on-disk* Basedonthestandardde-factoHDF5format
Freesoware(BSDlicense)
7/30/2019 Pydata2012 NYC
6/35
AboutHDF5
(HierarchicalDataFileversion5)
Aversa&ledatamodelthatcanrepresentcomplexdataobjectsaswellasassociated
metadata
Aportablefileformatwithnolimitonthenumberorsizeofdataobjectsinthecollec:on
Implementsahigh-levelAPIwithC,C++,Fortran90,andJavainterfaces
Freesoware(BSD,MITkindoflicense)
7/30/2019 Pydata2012 NYC
7/35
PyTablesdis:nc:vefeatures
SupportsagoodrangeofcompressorsZlib,bzip2,LZOandBlosc
Powerfulquerycapabili:esforTableobjects,includingindexing
Canperformout-of-coreopera:onsveryefficiently
7/30/2019 Pydata2012 NYC
8/35
Whatitisnot
Notarela:onaldatabasereplacementNotadistributeddatabase
Notextremelysecureorsafe(itsmoreaboutspeed!)
NotamereHDF5wrapper
7/30/2019 Pydata2012 NYC
9/35
DATASTRUCTURES
7/30/2019 Pydata2012 NYC
10/35
Datastructures
Highlevelofflexibilityforstructuringyourdata
Datatypesscalars(numerical&strings),records,enumerated,:me
Tablessupportmul:dimensionalcellsandnestedrecords
Mu:dimensionalarraysVariablelengtharrays
7/30/2019 Pydata2012 NYC
11/35
TheArrayobject
Easytocreatefile.createArray(mygroup,array,numpy_arr)
Shapecannotchange Cannotbecompressed
7/30/2019 Pydata2012 NYC
12/35
TheCArrayobject
Dataisstoredinchunks Eachchunkcanbecompressedindependently Shapecannotchange
7/30/2019 Pydata2012 NYC
13/35
TheEArrayobject
DataisstoredinchunksCanbecompressed
Shapecanchange(eitherenlargedorshrunk) Shapemustbekeptregular
7/30/2019 Pydata2012 NYC
14/35
TheVLArrayobject
Dataisstoredinvariablelengthrows Canbeenlargedorshrunk Datacannotbecompressed
7/30/2019 Pydata2012 NYC
15/35
TheTableobject
Dataisstoredinchunks Canbecompressed Canbeenlargedorshrunk Fieldscannotbeofvariablelength
Col1
(int32)
Col2
(string10)
Col3
(bool)
Col4
(complex64)
Col5
(float32)
7/30/2019 Pydata2012 NYC
16/35
Datasethierarchy
root
group1
table1 table2
group2
array
7/30/2019 Pydata2012 NYC
17/35
Aributes
Metadataaboutdata
table1
DateJul242006
Observa:ons555
CF[0.1,0.3,0.6]
7/30/2019 Pydata2012 NYC
18/35
COMPRESSIONCAPABILITIES
7/30/2019 Pydata2012 NYC
19/35
Whycompression?
Letsyoustoremoredatausingthesamespace
UsesmoreCPU,butCPU:meischeapcomparedwithdiskaccess
Differentcompressorsfordifferentusesbzip2,zlib,lzo,blosc
7/30/2019 Pydata2012 NYC
20/35
WhyBlosc?
7/30/2019 Pydata2012 NYC
21/35
Accelera:ngI/O
Blosc
}
}
Othercompressors
7/30/2019 Pydata2012 NYC
22/35
MemoryaccessvsCPUcycle:me
7/30/2019 Pydata2012 NYC
23/35
Laptopcomputerbackin2005
7/30/2019 Pydata2012 NYC
24/35
Stateoftheartcomputerin2012
(singlenode)
7/30/2019 Pydata2012 NYC
25/35
OUT-OF-CORECOMPUTATIONS
7/30/2019 Pydata2012 NYC
26/35
Opera:ngwithdisk-basedarrays
tables.Exprisanop:mizedevaluatorforexpressionsofdisk-basedarrays.
Itisacombina:onoftheNumexpradvancedcompu:ngcapabili:eswiththehighI/O
performanceofPyTables.
SimilarlytoNumexpr,disk-temporariesareavoided,andmul:-threadedopera:onis
preserved.
7/30/2019 Pydata2012 NYC
27/35
AvoidingtemporarieswithNumexpr
Tables.Exprfollowsthesameapproach,
butwithdiskinsteadofmemory
7/30/2019 Pydata2012 NYC
28/35
Performingout-of-corecomputa:ons
withPyTables
Virtualmachinenumexpr
7/30/2019 Pydata2012 NYC
29/35
ADVANCEDQUERYCAPABILITIES
7/30/2019 Pydata2012 NYC
30/35
Differentquerymodes
Regularquery [ r[c1] for r in table
if r[c2] > 2.1 and r[c3] == True)) ]
In-kernelquery [ r[c1] for r in table.where((c2>2.1)&(c3==True)) ]
Indexedquery table.cols.c2.createIndex() table.cols.c3.createIndex() [ r[c1] for r in table.where((c2>2.1)&(c3==True)) ]
7/30/2019 Pydata2012 NYC
31/35
Regularandin-kernelqueries
7/30/2019 Pydata2012 NYC
32/35
Customizableindexes
7/30/2019 Pydata2012 NYC
33/35
Indexedqueryperformance
7/30/2019 Pydata2012 NYC
34/35
Conceptstotakehome
PyTablesisop:mizedtodealwithdataondisk Mostoftheopera:onsusetheiterator/generatormachineryinPythonthegoalisnottobloatmemorywithdata
Queries,indexesandout-of-coreopera:onsaregoodexamplesoftheabove
7/30/2019 Pydata2012 NYC
35/35
Thankyou!
Ques:ons?