Parallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco

October 13-16, 2016 • Austin, TX

Parallel SQL

Joel Bernstein
Search Engineer, Alfresco [email protected]

Introduction

• Joel Bernstein
• Lucene/Solr Committer
• Search Engineer at Alfresco
• Live and work in NYC

Alfresco

• Open source ECM (Enterprise Content Management)
• Alfresco is a system of record for documents
• Uses Solr for search
• 1800+ customers
• 11 million active user accounts
• Alfresco Solr: document-level access control, eventually consistent, transactional, multi-master, distributed search and faceting (coming in Alfresco 5.1)

Agenda

1. SQL Unleashed (What can it do?)
2. SQL Under the Hood (How does it work?)

SQL Unleashed (In Solr 6.0)

Why SQL?

• Solr has many awesome features.
• But all of these features create complexity.
• Which faceting API to use? When to stream? Which parameters to use for optimal performance?
• The complexity level increases dramatically when distributed joins come into play.
• With SQL we can provide an optimizer to choose the best query plan.

The SQL Interface at a Glance

• SQL over Map/Reduce: supports high cardinality aggregations and distributed joins.
• SQL over Facets: high performance on moderate cardinality aggregations.
• SQL with Solr search predicates
• SQL is fully integrated with SolrCloud

SQL Syntax: Limited and Unlimited SELECT

• select colA, colB from tableB
• select colA, colB from tableB limit 100
• Unlimited selects return the entire result set. Return fields must be DocValues.
• Limited selects can sort by score and retrieve any stored field.

SQL Syntax: ORDER BY

• select a, b from tableB order by a desc, b desc
• Unlimited selects sort the entire result set.

The Predicate: Phrase Searching

• select a, b from tableB where c = 'hello world'
• Searches for the phrase 'hello world' in field c.

The Predicate: Boolean Searching

• select a, b from tableB where c = '(hello world)'
• Adding parens searches for (hello OR world).
• Supports Solr query syntax inside the parens.

The Predicate: Range Query

• select a, b from tableB where c = '[0 TO 100]'

The Predicate: Arbitrary Boolean Clauses

• select a, b from tableB where (c = 'hello world' AND d = '[0 TO 100]')
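The predicate rules on the last four slides can be summarized in a small sketch. This is only an illustration of the rule as the slides describe it (plain values become phrase searches, values wrapped in parens or brackets are passed through as Solr syntax). The function names `to_solr_clause` and `and_clauses` are hypothetical and this is not Solr's actual SQL parser.

```python
def to_solr_clause(field, value):
    """Map a SQL predicate `field = 'value'` to a Solr-style query clause.

    Illustrative rule only (an assumption based on the slides):
    - values starting with "(" or "[" are treated as Solr syntax and passed through
    - any other value is treated as a phrase search and quoted
    """
    if value.startswith("(") or value.startswith("["):
        return f"({field}:{value})"
    return f'({field}:"{value}")'


def and_clauses(*clauses):
    """Combine already-translated clauses with a boolean AND."""
    return "(" + " AND ".join(clauses) + ")"


if __name__ == "__main__":
    print(to_solr_clause("c", "hello world"))
    print(to_solr_clause("c", "(hello world)"))
    print(and_clauses(to_solr_clause("c", "hello world"),
                      to_solr_clause("d", "[0 TO 100]")))
```

The sketch does no escaping of special characters, which a real translator would need.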

SQL Syntax: Select Distinct

• select distinct a, b from tableB
• Map/Reduce implementation: tuples are shuffled to worker nodes, where the distinct operation is performed.
• JSON Facet implementation: the distinct operation is pushed down into the search engine.
• Map/Reduce for high cardinality
• Facet for high QPS

Shuffle vs Push Down

• Shuffling: high cardinality and parallel relational algebra (distributed joins)
• Pushdown (Facet): blazing fast, high QPS, moderate cardinality
• The aggregationMode flag is available with the JDBC driver and HTTP interface [map_reduce or facet]
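As a rough sketch of how the aggregationMode flag might be passed over the HTTP interface, the helper below only builds the request URL and form body; it does not contact a server. The host, port, collection name, and the helper name `sql_request` are assumptions made for illustration. The `stmt` parameter name is taken from the Solr 6 /sql handler as best understood here and should be verified against the Solr version in use.

```python
from urllib.parse import urlencode, parse_qs

VALID_MODES = ("map_reduce", "facet")


def sql_request(base_url, collection, stmt, aggregation_mode="map_reduce"):
    """Build (url, form_body) for a SQL request; nothing is sent.

    base_url and collection are placeholders chosen by the caller.
    """
    if aggregation_mode not in VALID_MODES:
        raise ValueError(f"aggregationMode must be one of {VALID_MODES}")
    url = f"{base_url}/{collection}/sql"
    body = urlencode({"stmt": stmt, "aggregationMode": aggregation_mode})
    return url, body


if __name__ == "__main__":
    url, body = sql_request(
        "http://localhost:8983/solr",  # assumed local Solr address
        "tableB",
        "select distinct a, b from tableB",
        aggregation_mode="facet",
    )
    print(url)
    print(body)
```

Switching `aggregation_mode` between the two values is the only change needed to move a query between the shuffle and push-down implementations in this sketch.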

Aggregations: Stats

• select count(*), sum(a) from tableA
• Uses the StatsComponent under the covers
• Initial release supports count, sum, avg, min, max
• Aggregation logic is always pushed down into the search engine.

Aggregations: GROUP BY

• select a, b, count(*), sum(c) from tableB group by a, b having count(*) > 50 order by sum(c) desc
• Supports complex having clauses: having (count(*) > 50 AND sum(b) < 1000)
• Has a Map/Reduce implementation (shuffle)
• And a JSON Facet implementation (push down)
• Map/Reduce can handle high cardinality, multi-dimension aggregations.

JDBC Driver

• Ships with SolrJ
• Poolable Connection and Statement
• SolrCloud-aware load balancing
• Connection has an aggregationMode switch [map_reduce or facet]

SQL Under the Hood

SQL Parsing

• The Presto SQL Parser handles the parsing
• SQL statements are compiled to TupleStream objects
• The TupleStream is the base interface of the Streaming API
• The Streaming API is a general purpose parallel computing API for SolrCloud

Parallel Computing Framework

• Shuffling
• Worker Collections
• Streaming API
• Streaming Expressions
• Parallel SQL

Shuffling (sorting & partitioning)

• Shuffling is pushed down into the search engine.
• Sorting: the /export handler “stream sorts” entire result sets.
• Partitioning: HashQParserPlugin, a hash partitioning filter. Partitions results on arbitrary fields.
• Tuples (search results) begin streaming instantly to worker nodes. Shuffling never requires a spill to disk.
• All replicas shuffle in parallel for the same query. Allows for massive throughput.

Shuffling (sorting & partitioning)

[Diagram: Client; Worker 1 and Worker 2; Shard 1 and Shard 2, each with Replica 1 and Replica 2. Annotations: each worker is shuffled ½ the result set; tuples are sorted and partitioned on keys.]
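A minimal in-memory sketch of the two shuffle steps, hash partitioning on a key and sorting within each partition, may help make the idea concrete. It is only a model: in SolrCloud the partitioning and sorting happen inside the search engine, and the CRC32 hash, the function name `shuffle`, and the sample field names are assumptions chosen for this illustration, not what HashQParserPlugin uses.

```python
import zlib


def shuffle(tuples, key, num_workers):
    """Partition tuples across workers by hash of `key`, then sort each partition.

    All tuples with the same key value land on the same worker, so each
    worker can aggregate or join its keys without seeing the other partitions.
    """
    partitions = [[] for _ in range(num_workers)]
    for t in tuples:
        # CRC32 is used only because it is deterministic across runs.
        h = zlib.crc32(str(t[key]).encode("utf-8"))
        partitions[h % num_workers].append(t)
    for p in partitions:
        p.sort(key=lambda t: t[key])
    return partitions


if __name__ == "__main__":
    rows = [{"str_s": k, "field_i": i}
            for i, k in enumerate(["b", "a", "c", "a", "b", "d"])]
    for worker_id, part in enumerate(shuffle(rows, "str_s", 2)):
        print(worker_id, part)
```

With two workers, each partition holds roughly half of the keys, which corresponds to the "each worker is shuffled ½ the result set" note on the diagram.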

Worker Collections

• Are generic SolrCloud collections
• Can hold data, or just perform work
• Search results are shuffled to the workers
• Configured with the /stream handler

Streaming API

• Java programming API for the parallel computing framework
• Real-time Map/Reduce and parallel relational algebra
• Abstracts search results as streams of tuples (TupleStream)
• Streams are transformed in parallel by pluggable Decorator streams.
• Parallel transformations include: group by, rollup, union, intersect, complement and join
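The decorator idea can be sketched with Python generators: a base stream yields tuples, and each decorator wraps another stream and transforms it lazily. The Streaming API itself is a Java API, so this is only a conceptual model, and the names `list_stream`, `unique_stream`, and `select_stream` are hypothetical.

```python
def list_stream(tuples):
    """Base stream: yields tuples one at a time (a stand-in for search results)."""
    for t in tuples:
        yield t


def unique_stream(stream, key):
    """Decorator: emits the first tuple for each key value.

    Assumes the wrapped stream is already sorted on `key`, so duplicates
    are adjacent and only the previous key has to be remembered.
    """
    last = object()  # sentinel that never equals a real key value
    for t in stream:
        if t[key] != last:
            yield t
            last = t[key]


def select_stream(stream, fields):
    """Decorator: keeps only the requested fields of each tuple."""
    for t in stream:
        yield {f: t[f] for f in fields}


if __name__ == "__main__":
    rows = [{"a": 1, "b": "x"}, {"a": 1, "b": "y"}, {"a": 2, "b": "z"}]
    pipeline = select_stream(unique_stream(list_stream(rows), "a"), ["a"])
    for t in pipeline:
        print(t)
```

Because each decorator only pulls from the stream it wraps, decorators can be stacked in any order without materializing the whole result set.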

Streaming Expressions

• Contributed by Dennis Gove (Bloomberg)
• String query language and serialization format for the Streaming API
• Streaming Expressions compile to TupleStreams
• TupleStreams serialize to Streaming Expressions

Parallel SQL

• Compiles SQL to a TupleStream
• The TupleStream is serialized to a Streaming Expression and sent to worker nodes.
• Worker nodes translate the Streaming Expression back into a TupleStream.
• Worker nodes open() and read() the TupleStream in parallel. Tuples are returned from each worker.
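The last step, workers reading their streams in parallel and returning tuples to the client, can be modeled roughly as follows. Threads stand in for worker nodes, the per-worker task is a simple count per key, and the client merges the sorted worker outputs. The names `worker` and `run_parallel` are hypothetical, and the sketch assumes the partitions are already sorted on the key and hold disjoint key sets, as they would be after a shuffle.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge
from itertools import groupby


def worker(partition, key):
    """Count tuples per key value in one partition that is sorted on `key`."""
    return [(k, sum(1 for _ in group))
            for k, group in groupby(partition, key=lambda t: t[key])]


def run_parallel(partitions, key):
    """Run one worker per partition, then merge the sorted results."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        results = list(pool.map(lambda p: worker(p, key), partitions))
    # Each worker's output is sorted by key, so a streaming merge
    # preserves the global sort order without re-sorting.
    return list(merge(*results))


if __name__ == "__main__":
    partitions = [
        [{"k": "a"}, {"k": "a"}, {"k": "c"}],  # worker 1's share
        [{"k": "b"}],                          # worker 2's share
    ]
    print(run_parallel(partitions, "k"))
```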

From SQL to Streaming Expression

SQL:

    select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i)
    from collection1
    where text='XXXX'
    group by str_s

Streaming Expression:

    rollup(
        search(collection1,
               q="(text:XXXX)",
               qt="/export",
               fl="str_s, field_i",
               partitionKeys=str_s,
               sort="str_s asc",
               zkHost="127.0.0.1:64149/solr"),
        over=str_s,
        count(*),
        sum(field_i),
        min(field_i),
        max(field_i),
        avg(field_i))
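To show what the rollup in this expression computes, here is a small sketch that aggregates a stream already sorted on the `over` field, using the metric names from the slide as output keys. It is a model of the idea rather than the Solr implementation: the function name `rollup` is reused only for readability, and the sketch collects each bucket's values in memory where a real streaming rollup would more likely keep running accumulators.

```python
from itertools import groupby


def rollup(sorted_tuples, over, field):
    """Aggregate a stream sorted on `over`, emitting one tuple per bucket."""
    for bucket, group in groupby(sorted_tuples, key=lambda t: t[over]):
        values = [t[field] for t in group]
        yield {
            over: bucket,
            "count(*)": len(values),
            f"sum({field})": sum(values),
            f"min({field})": min(values),
            f"max({field})": max(values),
            f"avg({field})": sum(values) / len(values),
        }


if __name__ == "__main__":
    rows = [
        {"str_s": "a", "field_i": 1},
        {"str_s": "a", "field_i": 3},
        {"str_s": "b", "field_i": 10},
    ]
    for t in rollup(rows, "str_s", "field_i"):
        print(t)
```

The sort on str_s in the search expression is what makes this single-pass approach possible: a bucket is complete as soon as the key changes.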

Parallel SQL Shuffle (5 workers, 5 shards, aggregationMode=map_reduce)

[Diagram: Client, /SQL handler, Workers 1 to 5, and Shards 1 to 5, each with Replicas 1 to 3.]

Jira Tickets

• SOLR-7560: Parallel SQL Support
• SOLR-7377: Solr Streaming Expressions
• SOLR-7082: Streaming Aggregation for SolrCloud
• SOLR-7441: Improve overall robustness of the Streaming stack: Streaming API, Streaming Expressions, Parallel SQL

Getting Involved

• SQL is in Trunk
• Releasing with Solr 6
• Streaming API and Streaming Expressions are located in the SolrJ libraries (solrj.io)
• Patches welcome
• Testers and feedback needed

Questions

Thanks!
