Pydata2012 NYC

Embed Size (px)

Citation preview

  • 7/30/2019 Pydata2012 NYC

    1/35

    PyTablesAnon-diskbinarydatacontainer,queryengineandcomputa:onalkernel

    FrancescAlted

    TutorialforthePyDataConference,October2012,NewYorkCity

  • 7/30/2019 Pydata2012 NYC

    2/35

    10thanniversaryofPyTablesHi!,

    PyTables is a Python package which allows dealing with HDF5 tables. Such

    a table is defined as a collection of records whose values are stored in

    fixed-length fields. PyTables is intended to be easy-to-use, and tries to

    be a high-performance interface to HDF5. To achieve this, the newest

    improvements introduced in Python 2.2 (like generators or slots andmetaclasses in new-brand classes) has been used. Pyrex creation extension

    tool has been chosen to access the HDF5 library.

    This package should be platform independent, but until now I've tested it

    only with Linux. It's the first public release (v 0.1), and it is in

    alpha state."--FrancescAltedannouncingPyTables0.1,October2002

  • 7/30/2019 Pydata2012 NYC

    3/35

    Overview

    WhatPyTablesis? DatastructuresinPyTables Compressingdata Advancedcapabili:esinPyTables

  • 7/30/2019 Pydata2012 NYC

    4/35

    Notebooksfortutorial

    http://pytables.org/download/PyData2012-NYC.tar.gz

  • 7/30/2019 Pydata2012 NYC

    5/35

    Whatitis

    Abinarydatacontainerforon-disk,structureddata

    Canperformopera:onswithdata*on-disk* Basedonthestandardde-factoHDF5format

    Freesoware(BSDlicense)

  • 7/30/2019 Pydata2012 NYC

    6/35

    AboutHDF5

    (HierarchicalDataFileversion5)

    Aversa&ledatamodelthatcanrepresentcomplexdataobjectsaswellasassociated

    metadata

    Aportablefileformatwithnolimitonthenumberorsizeofdataobjectsinthecollec:on

    Implementsahigh-levelAPIwithC,C++,Fortran90,andJavainterfaces

    Freesoware(BSD,MITkindoflicense)

  • 7/30/2019 Pydata2012 NYC

    7/35

    PyTablesdis:nc:vefeatures

    SupportsagoodrangeofcompressorsZlib,bzip2,LZOandBlosc

    Powerfulquerycapabili:esforTableobjects,includingindexing

    Canperformout-of-coreopera:onsveryefficiently

  • 7/30/2019 Pydata2012 NYC

    8/35

    Whatitisnot

    Notarela:onaldatabasereplacementNotadistributeddatabase

    Notextremelysecureorsafe(itsmoreaboutspeed!)

    NotamereHDF5wrapper

  • 7/30/2019 Pydata2012 NYC

    9/35

    DATASTRUCTURES

  • 7/30/2019 Pydata2012 NYC

    10/35

    Datastructures

    Highlevelofflexibilityforstructuringyourdata

    Datatypesscalars(numerical&strings),records,enumerated,:me

    Tablessupportmul:dimensionalcellsandnestedrecords

    Mu:dimensionalarraysVariablelengtharrays

  • 7/30/2019 Pydata2012 NYC

    11/35

    TheArrayobject

    Easytocreatefile.createArray(mygroup,array,numpy_arr)

    Shapecannotchange Cannotbecompressed

  • 7/30/2019 Pydata2012 NYC

    12/35

    TheCArrayobject

    Dataisstoredinchunks Eachchunkcanbecompressedindependently Shapecannotchange

  • 7/30/2019 Pydata2012 NYC

    13/35

    TheEArrayobject

    DataisstoredinchunksCanbecompressed

    Shapecanchange(eitherenlargedorshrunk) Shapemustbekeptregular

  • 7/30/2019 Pydata2012 NYC

    14/35

    TheVLArrayobject

    Dataisstoredinvariablelengthrows Canbeenlargedorshrunk Datacannotbecompressed

  • 7/30/2019 Pydata2012 NYC

    15/35

    TheTableobject

    Dataisstoredinchunks Canbecompressed Canbeenlargedorshrunk Fieldscannotbeofvariablelength

    Col1

    (int32)

    Col2

    (string10)

    Col3

    (bool)

    Col4

    (complex64)

    Col5

    (float32)

  • 7/30/2019 Pydata2012 NYC

    16/35

    Datasethierarchy

    root

    group1

    table1 table2

    group2

    array

  • 7/30/2019 Pydata2012 NYC

    17/35

    Aributes

    Metadataaboutdata

    table1

    DateJul242006

    Observa:ons555

    CF[0.1,0.3,0.6]

  • 7/30/2019 Pydata2012 NYC

    18/35

    COMPRESSIONCAPABILITIES

  • 7/30/2019 Pydata2012 NYC

    19/35

    Whycompression?

    Letsyoustoremoredatausingthesamespace

    UsesmoreCPU,butCPU:meischeapcomparedwithdiskaccess

    Differentcompressorsfordifferentusesbzip2,zlib,lzo,blosc

  • 7/30/2019 Pydata2012 NYC

    20/35

    WhyBlosc?

  • 7/30/2019 Pydata2012 NYC

    21/35

    Accelera:ngI/O

    Blosc

    }

    }

    Othercompressors

  • 7/30/2019 Pydata2012 NYC

    22/35

    MemoryaccessvsCPUcycle:me

  • 7/30/2019 Pydata2012 NYC

    23/35

    Laptopcomputerbackin2005

  • 7/30/2019 Pydata2012 NYC

    24/35

    Stateoftheartcomputerin2012

    (singlenode)

  • 7/30/2019 Pydata2012 NYC

    25/35

    OUT-OF-CORECOMPUTATIONS

  • 7/30/2019 Pydata2012 NYC

    26/35

    Opera:ngwithdisk-basedarrays

    tables.Exprisanop:mizedevaluatorforexpressionsofdisk-basedarrays.

    Itisacombina:onoftheNumexpradvancedcompu:ngcapabili:eswiththehighI/O

    performanceofPyTables.

    SimilarlytoNumexpr,disk-temporariesareavoided,andmul:-threadedopera:onis

    preserved.

  • 7/30/2019 Pydata2012 NYC

    27/35

    AvoidingtemporarieswithNumexpr

    Tables.Exprfollowsthesameapproach,

    butwithdiskinsteadofmemory

  • 7/30/2019 Pydata2012 NYC

    28/35

    Performingout-of-corecomputa:ons

    withPyTables

    Virtualmachinenumexpr

  • 7/30/2019 Pydata2012 NYC

    29/35

    ADVANCEDQUERYCAPABILITIES

  • 7/30/2019 Pydata2012 NYC

    30/35

    Differentquerymodes

    Regularquery [ r[c1] for r in table

    if r[c2] > 2.1 and r[c3] == True)) ]

    In-kernelquery [ r[c1] for r in table.where((c2>2.1)&(c3==True)) ]

    Indexedquery table.cols.c2.createIndex() table.cols.c3.createIndex() [ r[c1] for r in table.where((c2>2.1)&(c3==True)) ]

  • 7/30/2019 Pydata2012 NYC

    31/35

    Regularandin-kernelqueries

  • 7/30/2019 Pydata2012 NYC

    32/35

    Customizableindexes

  • 7/30/2019 Pydata2012 NYC

    33/35

    Indexedqueryperformance

  • 7/30/2019 Pydata2012 NYC

    34/35

    Conceptstotakehome

    PyTablesisop:mizedtodealwithdataondisk Mostoftheopera:onsusetheiterator/generatormachineryinPythonthegoalisnottobloatmemorywithdata

    Queries,indexesandout-of-coreopera:onsaregoodexamplesoftheabove

  • 7/30/2019 Pydata2012 NYC

    35/35

    Thankyou!

    Ques:ons?