60
Approaching Join Index yet another one join algorithm Mikhail Khludnev principal engineer

Mikhail khludnev: approaching-join index for lucene

Embed Size (px)

Citation preview

Approaching Join Indexyet another one join algorithm

Mikhail Khludnevprincipal engineer

PRIVILEGED AND CONFIDENTIAL

• Grid Dynamics is a Silicon Valley-based leading provider of scalable, next-generation commerce technology solutions

• Record of outperformance with Tier 1 retail clients

• Fortune 1000 client relationships

About Me● principal engineer at Grid Dynamics

● spoke at few last LuceneRevolutions

● contributed BlockJoin query parser for Solr - {!parent}

● blogged about it at http://blog.griddynamics.com/

● tried to fix threads at DataImportHandler

http://google.com/+MikhailKhludnev

You are expected to know● how Lucene searches/filters

● how it counts facets

● that there are segments

● what is DocValues

● why to join

● RDBMS joins: nested loop join, sort-merge join and hash join.

I’m expected to know● query-time join

● index-time join

● yet another one join

Lucene/Solr/Elastic Is Strong

● searching

○ filtering

● analytics

○ facets

○ pivots

○ stats

SKU_ID: 13PROD_ID: 1TYPE: SKU

SIZE: 8COLOR:Blue

SKU_ID: 12PROD_ID: 1TYPE: SKU

SIZE: 8COLOR:Green

There is a weakness

● robust joins

○ multiple entities

○ relations

PROD_ID: 1TYPE: PROD

BRAND: NikeNAME: ShoesPRICE: $50

SKU_ID: 11PROD_ID: 1TYPE: SKU

SIZE: 7COLOR: Black

Joins in General

Joins in General

Joins in General

PK=FK

Joins in General

PK=FK

Joins in General

PK=FK

children

Joins in General

1:M

parents

Executing Join Query

q

Executing Join Query

q

Executing Join Query

q

Executing Join Query

qfq

Executing Join Query

qfq

Join in General

parents ∩ join-relation ∩ children

JoinUtil

q

“25”

“17”

“17”

“25”

“25”“56”

“56”

“56”

“25”

“4”

“61”

FK[doc#]

JoinUtil

q

FK[doc#]

“17”

“17”

“25”

“25”“56”

“56”

“56”

“25”

“4”

“61”

“25”

“25”

“17”

...

JoinUtil

q

PK

“1” →△…“17”→△“25”→△...

“25”

“17”

...

“25”

“17”

“17”

“25”

“25”“56”

“56”

“56”

“25”

“4”

“61”

FK[doc#]

JoinUtil

q

“1” →△…“17”→△“25”→△...

“25”

“17”

...

fq“25”

“17”

“17”

“25”

“25”“56”

“56”

“56”

“25”

“4”

“61”

FK[doc#]PK

Block Joindoc#

Block Joindoc#

432

Block Joindoc#1

0

0

1

0

0

1

0

0

1

0

Block Joindoc#1

0

0

1

0

0

1

0

0

1

0

q

Block Joindoc#1

0

0

1

0

0

1

0

0

1

0

qfq

Block Joindoc#1

0

0

1

0

0

1

0

0

1

0

qfq

Comparison

JoinUtil BlockJoin

searching slow fast

reindexing

Comparison

JoinUtil BlockJoin

searching slow fast

reindexing fast slow

Comparisondoc#

JoinUtil BlockJoin

searching slow fast

reindexing fast slow

Comparisondoc#

JoinUtil BlockJoin

searching slow fast

reindexing fast slow

Comparisondoc#

JoinUtil BlockJoin

searching slow fast

reindexing fast slow

Comparison

JoinUtil BlockJoin

searching slow < ? < fast

reindexing fast > ? > slow

Join Index

q

doc#[doc#]

3

6

0

3

10

0

63

Join Index

q

doc#[doc#]

3

6

0

3

10

0

63

Join Index

q

doc#[doc#]

fq 3

6

0

3

10

0

63

Join Index

qfq

doc#[doc#]

2

4

1

6

5 10

9

8

Join Index

q

doc#[doc#]

fq

2

4

1

6

5 10

9

8

Join Index

q

doc#[doc#]

fq

2

4

1

6

5 10

9

8

meanwhile… in LUCENE-6352

GlobalOrdinalsQuery

“25”

“17”

“17”

“25”

“25”“56”

“56”

“56”

“25”

“4”

“61”

SortedDocValues

GlobalOrdinalsQuery

“25”

“17”

“17”

“25”

“25”“56”

“56”

“56”

“25”

“4”

“61”

SortedDocValues

0 “17”

1 “25”

2 “4”

3 “56”

4 “61”

Ordinals0

0

1

1

3

3

3

1

2

4

1

GlobalOrdinalsQuery

q

0

0

1

1

3

3

3

1

2

4

1

1

1

0

0

0

fq

GlobalOrdinalsQuery

q

0

0

1

1

3

3

3

1

2

4

1

1

1

0

0

0

Benchmarking 2.9 M docs

https://github.com/m-khl/lucene-solr/tree/dvjoin-benchmark-5-1

Latency, ms the bigger the worse

7

28

14

BlockJoin(i-time)

Join Index

JoinUtil(q-time)

GloblOrdinals

JoinUtil JoinIndex GlobalOrdinals

BlockJoin

searching slow fast fast fasteranyway

reindexing fast uber slow fast slow

<

doc#[doc#]Indexing is still a problem

2

43

6

103

10

0

63

5 10

9

8

Further Plans 2.0● incremental join-index update

● perhaps just calculate and cache it

● or put to dedicated index

● join in both directions

● calculate optimal execution plan of segments enumeration

● edge case for benchmark

Summary● Joins in General

● JoinUtil vs Block-join vs GlobalOrdinals

● updatable DocValues

● opportunities for improving query-time joins:○ eliminate term enum○ choose lower cardinality side for enumeration○ GlobalOrdinalsJoin

Thanks for Joining us!

https://goo.gl/hjsYZW

Off scope

Joins’ Zoo in Lucene True Joins

● query-time join

○ JoinUtil

○ {!join }

○ {!scorejoin } - SOLR-6234

● index-time join aka block-join {!

parent}

Joins’ Zoo in Lucene Workarounds

● term positions/SpanQueries

● FieldCollapsing/Grouping

● term decoration

○ spatial

● multivalue fields

True Joins● query-time join

○ JoinUtil

○ {!join },

○ {!scorejoin } - SOLR-6234

● index-time join aka block-join {!

parent}

Joins’ Zoo in Lucene Workarounds

● term positions/SpanQueries

● FieldCollapsing/Grouping

● term decoration

○ spatial

● multivalue fields

True Joins● query-time join

○ JoinUtil

○ {!join },

○ {!scorejoin } - SOLR-6234

● index-time join aka block-join {!

parent}

Two phase update problem

Subject: How to openIfChanged the most recent merge?at: [email protected]

JoinUtil● query-time

● indexing is fast

● searching is slow, why?

○ expensive term enum

○ single enumeration order

BlockJoin● index-time

● reindexing whole block is as expensive as mandatory

● searching is darn fast, however

○ can’t reorder child docs

store ref = segment#, doc#put ref to previous and current segment in DVwhen add new segment, join IDs with previous segments- for parent, just ref to all children docnums- for children, add plain field refsToSeg:seg#-

when score parents on some segment- buffer them with the link refs, then - intersect buffered link refs with children query on previous segments- search all segments for refsToSeg:seg#, intersect with children query, obtain

perent ref from DV intersect with buffered