99 Problems, But The Search Ain't One


ElasticSearch is the new kid on the search block. Built on top of Lucene and adhering to the best concepts of the so-called NoSQL movement, ElasticSearch is a distributed, highly available, fast RESTful search engine, ready to be plugged into Web applications.


99 Problems, But The Search Ain’t One

Andrei Zmievski • PHP UK • Feb 25, 2011

who am I?

curl http://localhost:9200/speaker/info/andrei

{"name": "Andrei Zmievski", "projects": ["PHP", "PHP-GTK", "Smarty", "Unicode/i18n"], "likes": ["coding", "beer", "brewing", "photography"], "twitter": "@a", "email": "andrei@zmievski.org"}

what is elasticsearch?

a search engine for the NoSQL generation

domain-driven

distributed

RESTful

Hitchhiker’s Guide to the Galaxy (no, really)

document model

document-oriented

JSON-based

schema-free

based on Lucene

multi-tenancy

distributed, out of the box

engine

nomenclature

index

type

document

_id

node

3 easy steps

1. index

request:

curl -XPOST http://localhost:9200/conf/speaker/1 -d '
{
    "name": "Andrei Zmievski",
    "talk": "99 Problems, but the Search Ain''t One",
    "likes": ["coding", "beer", "photography"],
    "twitter": "a",
    "height": …
}'

response:

{
    "ok": true,
    "_index": "conf",
    "_type": "speaker",
    "_id": "1"
}

2. search

request:

curl http://localhost:9200/conf/speaker/_search?q=beer

response:

{ "took" : …,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : …,
    "hits" : [ {
      "_index" : "conf",
      "_type" : "speaker",
      "_id" : "2",
      "_score" : …,
      "_source" :
{
    "name": "Andrei Zmievski",
    "talk": "99 Problems, but the Search Ain't One",
    "likes": ["coding", "beer", "photography"],
    "twitter": "a",
    "height": …
} } ] } }

search response fields:

hits.total: total number of hits

_index: the index of the doc

_type: the type of the doc

_id: the id of the doc

_score: the hit score

_source: the original source

took: the execution time

3. profit

that’s up to you

demo

distributed model

provides:

performance

resiliency (high-availability)

shards

a portion of the document space

each one is a separate Lucene index

thus, many per-index settings are available

document is sharded by its _id value

but can be assigned (routed) to a shard deterministically
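The routing rule on this slide can be sketched as a hash of the routing value taken modulo the number of shards. This is an illustration only: `zlib.crc32` is a stand-in for whatever hash ElasticSearch uses internally, and `pick_shard` is a hypothetical helper, not part of any client library.

```python
import zlib
from typing import Optional


def pick_shard(doc_id: str, number_of_shards: int,
               routing: Optional[str] = None) -> int:
    """Map a document to a shard number in [0, number_of_shards).

    Assumption: crc32 is a stand-in hash chosen for this sketch only.
    """
    # By default the _id decides the shard; an explicit routing value overrides it.
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % number_of_shards


if __name__ == "__main__":
    for doc_id in ("1", "2", "3"):
        print(doc_id, "->", pick_shard(doc_id, 2))
```

Because the mapping is deterministic, two documents indexed with the same routing value always land on the same shard, which is what makes explicit routing useful.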

zero-conf discovery

zen (multicast and unicast)

cloud (EC2 via API)

auto-routing

master node:

maintains cluster state

reassigns shards if nodes leave/join cluster

any node can serve as the request router

the query is handled via scatter-gather mechanism

replicas

each shard can have 1 or more replicas

# of replicas can be updated dynamically after index creation

replicas can be used for querying in parallel

shard allocation

start with a single node: node 1

PUT /person { "index": { "number_of_shards": 2, "number_of_replicas": 1}}

node 1: person1, person2

start the second node

node 1: person1, person2

node 2: person1, person2

start 2 more nodes

node 1, node 2, node 3, node 4: person1, person2, person1, person2 (shard copies spread across the nodes)

document sharding

node 1, node 2, node 3, node 4: person1, person2, person1, person2

PUT /person/info/1 { … }

hashed to shard 1

replicated

PUT /person/info/2 { … }

hashed to shard 2

replicated

scatter-gather

node 1, node 2, node 3, node 4: person1, person2, person1, person2

GET /person/_search?q=name:thomas
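A toy version of scatter-gather may help here: the receiving node sends the query to every shard, each shard returns its own scored hits, and the results are merged into one ranked list. Everything below (the `SHARDS` data, the scoring rule, the function names) is made up for illustration and is not how Lucene scores documents.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical in-memory "shards": each maps a document id to its name field.
SHARDS: List[Dict[str, str]] = [
    {"1": "thomas anderson", "3": "jane doe"},
    {"2": "thomas mann", "4": "thomas"},
]


def search_shard(shard: Dict[str, str], term: str) -> List[Tuple[float, str]]:
    """Score every matching document on a single shard (toy scoring)."""
    hits = []
    for doc_id, name in shard.items():
        words = name.split()
        if term in words:
            # Toy score: share of the field that matches, not Lucene's formula.
            hits.append((words.count(term) / len(words), doc_id))
    return hits


def scatter_gather(shards: List[Dict[str, str]], term: str,
                   size: int = 10) -> List[Tuple[float, str]]:
    """Send the query to every shard, then merge the partial results."""
    partial = [search_shard(shard, term) for shard in shards]  # scatter
    merged = (hit for hits in partial for hit in hits)         # gather
    return heapq.nlargest(size, merged)                        # global top hits


if __name__ == "__main__":
    print(scatter_gather(SHARDS, "thomas", size=2))
```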

transactional model

per-document consistency

no need to commit/flush

uses write-behind transaction log

write consistency (W) can be controlled

one, quorum, or all
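As a rough sketch of what the three write consistency levels mean, the function below computes how many shard copies would have to be available before a write proceeds. The simple-majority rule for `quorum` is an assumption of this sketch; the exact rule in a given ElasticSearch version may differ, and `required_copies` is a hypothetical name.

```python
def required_copies(consistency: str, number_of_replicas: int) -> int:
    """How many shard copies must be available for a write to proceed."""
    copies = 1 + number_of_replicas  # the primary plus its replicas
    if consistency == "one":
        return 1
    if consistency == "quorum":
        return copies // 2 + 1  # simple majority (assumption of this sketch)
    if consistency == "all":
        return copies
    raise ValueError("unknown consistency level: %r" % consistency)


if __name__ == "__main__":
    for level in ("one", "quorum", "all"):
        print(level, required_copies(level, number_of_replicas=2))
```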

(near) real-time search

1 second refresh rate by default

_refresh API also

index storage

node data considered transient

can be stored in local file system, JVM heap, native OS memory, or FS & memory combination

persistent storage requires a gateway

gateways

persistent store for cluster state and indices

asynchronous, translog-based write strategy

allows full recovery if a cluster restart is needed

supported gateways:

local

shared FS

Hadoop via HDFS

S3

mapping

describes document structure to the search engine

automatically created with sensible defaults

explicit mapping can be provided (generally, a good idea)

can run into merge conflicts

mapping

important meta fields:

_source

_all

_boost

mapping types

simple:

string, integer/long, float/double, boolean, and null

complex:

array, object

sample mapping

document:

{"user":      "derick",
 "title":     "Don’t Panic",
 "tags":      ["profiling", "debugging", "php"],
 "postDate":  "2010-12-22T…",
 "priority":  2}

mapping:

{"post": {
  "properties" : {
    "user": {"type": "string", "index": "not_analyzed"},
    "message": {"type": "string", "boost": …},
    "tags": {"type": "string", "include_in_all": "no"},
    "postDate" : {"type" : "date", "store": "no"},
    "priority" : {"type" : "integer"}
}}}

analyzers

break down (tokenize) and normalize fields during indexing and query strings at search time

analyzer = tokenizer + token filters (0 or more)

Standard Analyzer =
   Standard Tokenizer +
       Standard Token Filter +
       Lowercase Token Filter +
       Stop Token Filter
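The "tokenizer + token filters" composition can be mimicked in a few lines. This is a simplified sketch, not Lucene's implementation: the regular expression, the tiny stop list, and all function names are assumptions made for the example.

```python
import re
from typing import Callable, Iterable, List

TokenFilter = Callable[[Iterable[str]], Iterable[str]]

# Assumption: a tiny stop list, just enough for the demonstration.
STOP_WORDS = {"a", "an", "and", "but", "of", "the", "to"}


def tokenize(text: str) -> List[str]:
    """Tokenizer: split on anything that is not a letter, digit or apostrophe."""
    return re.findall(r"[A-Za-z0-9']+", text)


def lowercase(tokens: Iterable[str]) -> Iterable[str]:
    """Token filter: normalize case."""
    return (token.lower() for token in tokens)


def stop(tokens: Iterable[str]) -> Iterable[str]:
    """Token filter: drop very common words."""
    return (token for token in tokens if token not in STOP_WORDS)


def analyze(text: str, filters: List[TokenFilter]) -> List[str]:
    """Analyzer = one tokenizer followed by zero or more token filters."""
    tokens: Iterable[str] = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return list(tokens)


if __name__ == "__main__":
    print(analyze("99 Problems, But The Search Ain't One", [lowercase, stop]))
```

The order of the filters matters: here `stop` only works because `lowercase` runs first.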

analyzers

analyzers, tokenizers, and filters can be customized

elasticsearch.yml:

index:
  analysis:
    analyzer:
      eulang:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop, asciifolding, porterStem]

mapping:

…
"title": {"type": "string", "analyzer": "eulang"},
…

API

API conventions

append ?pretty=true to get readable JSON

boolean values: false/0/off = false, rest is true

JSONP support via callback parameter
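The boolean convention above is small enough to state as code. A minimal sketch, with `parse_bool` as a hypothetical helper name:

```python
def parse_bool(value: str) -> bool:
    """Interpret a request parameter: false/0/off mean False, the rest True."""
    return value not in ("false", "0", "off")


if __name__ == "__main__":
    for raw in ("false", "0", "off", "true", "1", "yes"):
        print(raw, "->", parse_bool(raw))
```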

API structure

http://host:port/[index]/[type]/[_action/id]

GET http://es:9200/_status

GET http://es:9200/twitter/_status

POST http://es:9200/twitter/tweet/1

GET http://es:9200/twitter/tweet/1

API structure

http://host:port/[index]/[type]/[_action/id]

GET http://es:9200/twitter/tweet/_search

GET http://es:9200/twitter/user/_search

GET http://es:9200/twitter/tweet,user/_search

GET http://es:9200/twitter,facebook/_search

GET http://es:9200/_search
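The URL pattern on these two slides can be captured by a small helper that joins multiple indices or types with commas. `build_url` is a hypothetical function written for this example; it only assembles strings and sends nothing.

```python
from typing import Optional, Sequence


def build_url(base: str,
              indices: Sequence[str] = (),
              types: Sequence[str] = (),
              action: Optional[str] = None,
              doc_id: Optional[str] = None) -> str:
    """Assemble http://host:port/[index]/[type]/[_action/id]."""
    parts = [base.rstrip("/")]
    if indices:
        parts.append(",".join(indices))  # several indices: twitter,facebook
    if types:
        parts.append(",".join(types))    # several types: tweet,user
    if action:
        parts.append(action)
    if doc_id:
        parts.append(doc_id)
    return "/".join(parts)


if __name__ == "__main__":
    print(build_url("http://es:9200", ["twitter"], ["tweet", "user"], "_search"))
    print(build_url("http://es:9200", ["twitter", "facebook"], action="_search"))
    print(build_url("http://es:9200", action="_search"))
```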

_cluster API structure

GET /_cluster/health

GET /_cluster/health/index1,index2

GET /_cluster/nodes/stats

GET /_cluster/nodes/nodeId1,nodeId2/stats

API {core}

index

bulk

delete

delete by query

get

count

search

query

from/size paging

sort

highlighting

selective fields
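The search options listed above all travel in one JSON request body. The sketch below builds such a body as a Python dictionary; the key names follow the request body API of this period as I understand it, so check them against the version you run. The field names come from the speaker document shown earlier, and `build_search` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def build_search(query_text: str, page: int = 0, per_page: int = 10) -> Dict[str, Any]:
    """Build a search body with paging, sorting, highlighting and field selection."""
    return {
        "query": {"query_string": {"query": query_text}},
        "from": page * per_page,                # offset of the first hit
        "size": per_page,                       # number of hits to return
        "sort": [{"height": "desc"}],           # sort instead of ranking by score
        "highlight": {"fields": {"talk": {}}},  # highlight matches in "talk"
        "fields": ["name", "talk"],             # return only these fields
    }


if __name__ == "__main__":
    print(json.dumps(build_search("beer", page=2), indent=2))
```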

API {indices}

create

delete

open/close

get/put/delete mapping

refresh

optimize

snapshot

update settings

analyze

status

flush

API {cluster}

health

state

nodes info

nodes stats

nodes shutdown

Query DSL

term / terms

range

prefix

bool

fuzzy

wildcard

query_string

default_operator

analyzer

phrase_slop

etc

filters

share some similar features with queries (term, range, etc)

why use a filter?

filters

faster than queries

cached (depends on the filter)

the cache is used for different queries against the same filter

no scoring

more useful ones: term, terms, range, prefix, and, or, not, exists, missing, query

facets

provide aggregated data based on the search request

terms, histogram, date histogram, range, statistical, and more
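To make "aggregated data based on the search request" concrete, here is a local imitation of what a terms facet reports: the most frequent values of one field across the hits. It is a plain-Python sketch with a hypothetical function name, not the server-side implementation.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def terms_facet(hits: List[Dict[str, Any]], field: str,
                size: int = 10) -> List[Tuple[str, int]]:
    """Count how often each value of `field` occurs among the hits."""
    counts: Counter = Counter()
    for doc in hits:
        value = doc.get(field, [])
        # A field may hold a single value or an array of values.
        counts.update(value if isinstance(value, list) else [value])
    return counts.most_common(size)


if __name__ == "__main__":
    hits = [
        {"tags": ["php", "debugging"]},
        {"tags": ["php", "profiling"]},
        {"tags": "php"},
    ]
    print(terms_facet(hits, "tags", size=3))
```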

geo search

implemented as filters (and a facet)

geo_distance

geo_bounding_box

geo_polygon
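The decisions behind the first two geo filters can be sketched with basic geometry. The haversine formula below is one common way to compute great-circle distance and is used here as an assumption; the engine's own distance calculation may differ. All names are hypothetical, and the bounding box check ignores boxes that cross the date line.

```python
import math
from typing import Tuple

EARTH_RADIUS_KM = 6371.0
Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(point: Point, center: Point, max_km: float) -> bool:
    """geo_distance-style check: is the point within max_km of the center?"""
    return haversine_km(*point, *center) <= max_km


def within_bounding_box(point: Point, top_left: Point, bottom_right: Point) -> bool:
    """geo_bounding_box-style check (does not handle date-line crossing)."""
    lat, lon = point
    return (bottom_right[0] <= lat <= top_left[0]
            and top_left[1] <= lon <= bottom_right[1])


if __name__ == "__main__":
    london, paris = (51.5074, -0.1278), (48.8566, 2.3522)
    print(round(haversine_km(*london, *paris)), "km")
    print(within_distance(paris, london, 500))
```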

interfaces

REST

including memcached

Java / Groovy

Language clients (REST/Thrift):

pyes, PHP (standalone and symfony), Ruby, Perl

Flume sink implementation

elastica

similar to the other PHP ElasticSearch client

API naming is consistent with Zend Framework

can be extended for new filters, facets, etc

still under development

elastica

example:

// connect to the node and create the index
$es = new Elastica_Client('vm', 9200);
$index = new Elastica_Index($es, 'test');
$index->create(array(), true);

// add one document of type "person"
$type = new Elastica_Type($index, 'person');
$doc = new Elastica_Document(1, array(
    'name' => 'Andrei Zmievski',
    'email' => 'andrei@test.com',
    'username' => 'andrei',
    'bills' => array(2, 3, 5),
));
$type->addDocument($doc);

// run a query string search and print the number of hits
// (a new document becomes searchable after the next refresh, 1 second by default)
$qs = new Elastica_Query_QueryString('andrei');
$query = new Elastica_Query($qs);
$resultSet = $type->search($query);
print $resultSet->count();

data import

ES is not the primary data store (usually)

to import/synchronize data:

write an agent (Gearman, message queues, etc)

use rivers (CouchDB, RabbitMQ, Twitter)
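An import agent typically batches documents instead of indexing them one by one. The sketch below formats a batch as the newline-delimited body used by the bulk API, with one action line followed by one source line per document. It only builds the string; sending it, and the helper name `bulk_body`, are left as assumptions of the example.

```python
import json
from typing import Any, Dict


def bulk_body(index: str, doc_type: str, docs: Dict[str, Dict[str, Any]]) -> str:
    """Format documents as action/source line pairs for a bulk index request."""
    lines = []
    for doc_id, source in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))  # what to do and where
        lines.append(json.dumps(source))  # the document itself
    return "\n".join(lines) + "\n"        # the body ends with a newline


if __name__ == "__main__":
    rows = {
        "1": {"name": "Andrei Zmievski", "likes": ["coding", "beer"]},
        "2": {"name": "Jane Doe", "likes": ["hiking"]},
    }
    print(bulk_body("conf", "speaker", rows), end="")
```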

10 more features

versioning

index aliases

parent/child docs

scripting

dynamic mapping templates

load balancing nodes

plugins

more_like_this

multi_field mapping

percolation
