
Joint Interface and Management Review • Tucson, Arizona • May 30th – June 1st, 2012

XLDB Asia 2012 • Beijing, China • June 22-23, 2012

XLDB and the Large Synoptic Survey Telescope

Kian-Tat Lim - 林建达

LSST Data Management System Architect

XLDB Asia 2012


What is LSST?


Proposed telescope to be built in Chile


Large


3.2 gigapixel camera

8.4 meter diameter mirror


Synoptic Survey

Wide: entire visible sky

Fast: image every 15 seconds

Deep: faint and distant objects



Results

− Thousand-frame movie of the sky



Results

− Catalogs


Image Metadata

Moving Objects Catalog

Object Catalog

Source Catalog

Difference Image Source Catalog

Provenance Statistics

Summaries

Calibration Engineering and Facility Database

Lots of databases, but Object and Source (and ForcedSource) are most important and largest.


How Big?

− Tens of billions of Objects
• Hundreds of columns per Object

− Trillions of Sources
• High signal-to-noise observations of Objects
• Dozens of columns per Source

− Tens of trillions of ForcedSources
• All observations of Objects
• 7 columns

− Total space required at end of survey including all overheads, replication, and compression: 35 PB



Queries

− All about an object
− All objects meeting criteria
− All objects near objects meeting criteria
− All objects with interesting time series
− All pairs of objects with similar time series


Criteria may involve 1–30 attributes/columns, not the entire row.
Selectivity on individual attributes may be low.
When interesting objects are identified, may need a large fraction of the row.
Near-neighbor queries involve a self-join on a multi-billion row table, but are spatially localized.
Pairing time series may involve a self-join on a multi-trillion row table!
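The point about spatial locality can be illustrated with a minimal sketch. Because matching pairs are close together, a near-neighbor self-join only needs to compare objects in the same or adjacent grid cells rather than every pair in the table. This is not qserv code: the function name, the flat 2D coordinates (real sky coordinates are spherical), and the radius-sized grid are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product
from math import floor, hypot


def near_neighbor_pairs(points, radius):
    """Return sorted index pairs (i, j), i < j, separated by <= radius.

    points: list of (x, y) tuples in a flat plane (an assumption; the sky
    is a sphere). Cells are radius-sized, so every match for a point lies
    in its own cell or in one of the 8 surrounding cells.
    """
    # Bucket point indices by grid cell.
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(floor(x / radius), floor(y / radius))].append(idx)

    pairs = set()
    for (cx, cy), members in grid.items():
        # Compare each cell only against itself and its neighbors.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for i in members:
                for j in grid.get((cx + dx, cy + dy), ()):
                    if i < j:
                        xi, yi = points[i]
                        xj, yj = points[j]
                        if hypot(xi - xj, yi - yj) <= radius:
                            pairs.add((i, j))
    return sorted(pairs)
```

The work per object depends on the local density rather than on the table size, which is why such a join can stay tractable on a multi-billion row table when the data is clustered spatially.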


Usual Needs

Scalable

Fast

Fault-tolerant

Cost-effective

Open Source



qserv

Prototype system

Demonstrates feasibility

Useful for large-scale Data Challenges

Will be turned into production system during construction


Don’t expect too much. Mostly the work of one person, Daniel Wang.


Supporting ad hoc Queries

− Random small queries
• Indexing and sharding (also key/value)

− Narrow, full-table scans and aggregates
• Vertical partitioning

− Diverse, simultaneous scans
• Shared scans

− qserv may need to support all three
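As a rough illustration of the shared-scan idea, the sketch below answers several filter queries in one pass over the data, so each row is read once regardless of how many scans are active. It is a toy model, not qserv's scheduler: the function name, the dict-based rows, and the predicate-based queries are assumptions made for illustration.

```python
def shared_scan(rows, queries):
    """Answer several filter queries with a single pass over `rows`.

    rows: iterable of dicts, one per table row.
    queries: dict mapping a query name to a predicate function.
    Returns (results, rows_read), where results maps each query name to
    the list of rows its predicate accepted.
    """
    results = {name: [] for name in queries}
    rows_read = 0
    for row in rows:
        rows_read += 1  # each row is read once...
        for name, predicate in queries.items():
            if predicate(row):  # ...and offered to every active query
                results[name].append(row)
    return results, rows_read
```

Running the queries separately would read the table once per query; with a shared scan the read cost stays at one pass, and only the cheaper per-row predicate evaluation grows with the number of queries.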



Architecture

− MPP RDBMS on shared-nothing commodity cluster, with incremental scaling, non-disruptive failure recovery

− Data clustered spatially and by time, partitioned with overlaps

• Two-level partitioning

• 2nd level materialized on-the-fly

• Transparent to end-users

− Selective indices to speed up interactive queries, spatial searches, joins including time series analysis

− Shared scans

− Custom software based on open source: RDBMS (MySQL) + xrootd

• SciSQL: MySQL UDFs for HTM-based spatial indexing
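Partitioning with overlaps can be sketched as follows. Each object belongs to one home chunk, and objects close to a chunk boundary are also copied into the neighboring chunk, so a near-neighbor query within the overlap distance can be answered inside a single chunk. The sketch is deliberately simplified and is not qserv's actual scheme: it uses one level of equal-width stripes in right ascension only, and the function and parameter names are hypothetical.

```python
from math import floor


def assign_chunks(ra_deg, chunk_width, overlap):
    """Return (home_chunk, overlap_chunks) for a right ascension in degrees.

    Chunks are equal-width stripes covering [0, 360), numbered from 0.
    Assumes chunk_width divides 360 evenly and overlap < chunk_width.
    """
    n_chunks = int(360 // chunk_width)
    ra = ra_deg % 360.0
    home = int(floor(ra / chunk_width)) % n_chunks
    offset = ra - home * chunk_width  # distance from the chunk's lower edge

    copies = []
    if offset < overlap:
        # Close to the lower edge: also store in the previous chunk.
        copies.append((home - 1) % n_chunks)
    if chunk_width - offset < overlap:
        # Close to the upper edge: also store in the next chunk.
        copies.append((home + 1) % n_chunks)
    return home, copies
```

The cost of the overlap is extra storage for the duplicated rows; the benefit is that spatially localized joins need no communication between nodes.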


Apologies to Martin Kersten for independently choosing a name close to his SciQL.


Baseline Architecture



Prototype Implementation


Intercepting user queries

Worker dispatch, query fragmentation generation, spatial indexing, query recovery, optimizations, scheduling, aggregation

Communication, replication

Metadata, result cache

MySQL dispatch, shared scanning, optimizations, scheduling

Single node RDBMS

RDBMS-agnostic
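Query fragmentation and aggregation can be sketched as a scatter-gather over per-chunk databases. In the sketch below, an average over the whole table is rewritten into a per-chunk fragment returning SUM and COUNT, which a master then combines. This is only an illustration under stated assumptions: SQLite from the Python standard library stands in for the single-node RDBMS (the slides name MySQL), and the `Object` columns `objectId` and `mag` and both function names are hypothetical.

```python
import sqlite3


def make_worker(rows):
    """Create an in-memory SQLite database holding one chunk of Object rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Object (objectId INTEGER, mag REAL)")
    conn.executemany("INSERT INTO Object VALUES (?, ?)", rows)
    return conn


def distributed_avg_mag(workers, max_mag):
    """Compute AVG(mag) over all chunks for rows with mag < max_mag.

    Per-chunk averages cannot simply be averaged again, so the fragment
    sent to each worker returns SUM and COUNT, which the master combines.
    """
    fragment = "SELECT SUM(mag), COUNT(*) FROM Object WHERE mag < ?"
    total, count = 0.0, 0
    for conn in workers:
        chunk_sum, chunk_count = conn.execute(fragment, (max_mag,)).fetchone()
        if chunk_count:  # SUM is NULL (None) when no rows match
            total += chunk_sum
            count += chunk_count
    return total / count if count else None
```

Because the master only ever sends plain SQL to each worker, the same pattern would work with any single-node RDBMS underneath.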


Large Scale Tests

− Setup
• 150 nodes
• ~10% of DR1 data set: realistically distributed 2 billion objects, 55 billion sources, total ~32 TB

− Tested queries
• Interactive (object retrieval, object time series, spatially restricted filter)
• Scans (full sky filter, densities)
• Joins (near neighbor, sources not near objects)
• Concurrency
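The setup figures above imply a rough per-node load, which the short calculation below works out. It assumes the data is spread evenly over the nodes and that TB means 10^12 bytes; neither is stated on the slide, and the function name is hypothetical.

```python
def per_node_load(nodes, objects, sources, total_bytes):
    """Average per-node load, assuming data is spread evenly across nodes."""
    return {
        "objects_per_node": objects / nodes,
        "sources_per_node": sources / nodes,
        "sources_per_object": sources / objects,
        "gb_per_node": total_bytes / nodes / 1e9,  # decimal gigabytes
    }


# Figures from the test setup: 150 nodes, 2 billion objects,
# 55 billion sources, ~32 TB in total.
load = per_node_load(nodes=150, objects=2e9, sources=55e9, total_bytes=32e12)
```

Under these assumptions each node holds on the order of 13 million objects and a little over 200 GB, with about 27.5 sources per object on average.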


Large Scale Tests

Object retrieval: ~4-9s

Large Scale Tests

Full-sky density: ~3-8m

Large Scale Tests

~10m – 5h

Concurrency Test

Large Scale Tests

Scalability Testing

− Constant data/node



Status

− Cleaning up for end-user testing

− Then adding features:
• Shared scans
• User tables
• Fault tolerance
• Updates
• Query management

− Code available:
git://git.lsstcorp.org/LSST/DMS/qserv.git
https://launchpad.net/scisql



Thoughts on the Future

− qserv
• Incorporate MonetDB back-end

− SciDB
• What about the petabytes of raw image data?
• Perhaps store in an array database
• Cutouts, mosaics, image manipulation become queries
• UDFs for detection, measurement
• Evaluation before end of 2013
