Solr vs. Elasticsearch Case by Case Alexandre Rafalovitch @arafalov @SolrStart www.solr-start.com

Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN


Presented at Lucene/Solr Revolution 2014


Meet the FRENEMIES

Friends (common)
•  Based on Lucene
•  Full-text search
•  Structured search
•  Queries, filters, caches
•  Facets/stats/enumerations
•  Cloud-ready

Enemies (differences)
•  Download size
•  AdminUI vs. Marvel
•  Configuration vs. Magic
•  Nested documents
•  Chains vs. Plugins
•  Types and Rivers
•  OpenSource vs. Commercial
•  Etc.

* Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.


This used to be Solr (now in Lucene/ES)

•  Field types
•  Dismax/eDismax
•  Many of the analysis filters (WordDelimiterFilter, Soundex, Regex, HTML, kstem, Trim…)
•  Multi-valued field cache
•  ….
(source: http://heliosearch.org/lucene-solr-history/)

•  Disclaimer: Nowadays, Elasticsearch hires awesome Lucene hackers


Basically - sisters

Source: https://www.flickr.com/photos/franzfume/11530902934/

[Bar chart: Solr vs. Elasticsearch compared on Download, Expanded, and First run; axis from 0 to 300]


Solr: Chubby or Rubenesque?

[Bar chart: Solr vs. Elasticsearch+plugins, broken down into Code, Examples, Documentation, ES-Admin, ES-ICU, Extract/Tika, UIMA, Map-Reduce, and Test Framework; axis from 0.00 to 300.00]


Elasticsearch setup

Source: https://www.flickr.com/photos/deborah-is-lola/6815624125/

•  Admin UI: bin/plugin -i elasticsearch/marvel/latest

•  Tika/Extraction: bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/2.4.1

•  ICU (Unicode components): bin/plugin -install elasticsearch/elasticsearch-analysis-icu/2.4.1

•  JDBC River (like DataImportHandler subset): bin/plugin --install jdbc --url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.3.4.4/elasticsearch-river-jdbc-1.3.4.4-plugin.zip

•  JavaScript scripting support: bin/plugin -install elasticsearch/elasticsearch-lang-javascript/2.4.1

•  On each node….
•  Without dependency management (jars = rabbits)


Index a document - Elasticsearch

1.  Set up an index/collection
2.  Define fields and types
3.  Index content (using Marvel Sense):

POST /test1/hello
{
  "msg": "Happy birthday",
  "names": ["Alex", "Mark"],
  "when": "2014-11-01T10:09:08"
}

Alternative:

PUT /test1/hello/id1
{
  "msg": "Happy birthday",
  "names": ["Alex", "Mark"],
  "when": "2014-11-01T10:09:08"
}

An index, type and definitions are created automatically

So, where is our document:

GET /test1/hello/_search
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test1",
        "_type": "hello",
        "_id": "AUmIk4LDF4XvfpxnVJ2g",
        "_score": 1,
        "_source": { "msg": "Happy birthday", "names": [ "Alex", "Mark" ], "when": "2014-11-01T10:09:08" }
      }
    ]
  }
}


Behind the scenes

GET /test1/hello/_search
…..
{
  "_index": "test1",
  "_type": "hello",
  "_id": "AUmIk4LDF4XvfpxnVJ2g",
  "_score": 1,
  "_source": { "msg": "Happy birthday", "names": [ "Alex", "Mark" ], "when": "2014-11-01T10:09:08" }
….

GET /test1/hello/_mapping
{
  "test1": {
    "mappings": {
      "hello": {
        "properties": {
          "msg": { "type": "string" },
          "names": { "type": "string" },
          "when": { "type": "date", "format": "dateOptionalTime" }
        }
      }
    }
  }
}
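The mapping returned by GET /test1/hello/_mapping was inferred from the first document indexed. The block below is a minimal Python sketch of that kind of inference, not Elasticsearch's actual algorithm. The names `infer_type` and `infer_mapping` are hypothetical, and date detection is reduced to a single ISO-8601 pattern as an assumption.

```python
import re

# Simplified stand-in for date detection: one ISO-8601 date-time pattern only.
ISO_DATETIME = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$")


def infer_type(value):
    """Guess a field type from a single JSON value (illustrative only)."""
    if isinstance(value, list):
        # An array is mapped by the type of its elements.
        return infer_type(value[0]) if value else "string"
    if isinstance(value, bool):
        # bool must be checked before int, because bool is a subclass of int.
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str) and ISO_DATETIME.match(value):
        return "date"
    return "string"


def infer_mapping(doc):
    """Build a properties map shaped like the _mapping response on the slide."""
    return {name: {"type": infer_type(value)} for name, value in doc.items()}


doc = {
    "msg": "Happy birthday",
    "names": ["Alex", "Mark"],
    "when": "2014-11-01T10:09:08",
}
mapping = infer_mapping(doc)
```

The point of the sketch is that the first value seen decides the type, which is why the slides later warn that mappings in one index should stay compatible.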


Basic search in Elasticsearch

GET /test1/hello/_search
…..
{
  "_index": "test1",
  "_type": "hello",
  "_id": "AUmIk4LDF4XvfpxnVJ2g",
  "_score": 1,
  "_source": { "msg": "Happy birthday", "names": [ "Alex", "Mark" ], "when": "2014-11-01T10:09:08" }
….

•  GET /test1/hello/_search?q=foobar – no results
•  GET /test1/hello/_search?q=Alex – YES on names?
•  GET /test1/hello/_search?q=alex – YES lower case
•  GET /test1/hello/_search?q=happy – YES on msg?
•  GET /test1/hello/_search?q=2014 – YES???
•  GET /test1/hello/_search?q="birthday alex" – YES
•  GET /test1/hello/_search?q="birthday mark" – NO

Issues:
1.  Where are we actually searching?
2.  Why do lower-case searches work?
3.  What's so special about Alex?


All about _all and why strings are tricky

•  By default, we search in the field _all
•  What's an _all field in Solr terms?

<field name="_all" type="es_string" multiValued="true" indexed="true" stored="false"/>

<copyField source="*" dest="_all"/>

•  And the default mapping for Elasticsearch "string" type is like:

<fieldType name="es_string" class="solr.TextField" multiValued="true" positionIncrementGap="0">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

•  Elasticsearch equivalent to Solr's solr.StrField is: {"type" : "string", "index" : "not_analyzed"}
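The search results from the earlier slide follow from these definitions: every value is copied into _all, lower-cased, and laid out with a position gap of 0, so the last token of one field sits right next to the first token of the next. Here is a rough Python sketch of that behaviour. The regex tokenizer is only an approximation of StandardTokenizer, and all function names are hypothetical.

```python
import re


def analyze(text):
    """Rough stand-in for StandardTokenizer + LowerCaseFilter."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_all_field(doc):
    """Concatenate the tokens of every field value, with a position gap of 0."""
    tokens = []
    for value in doc.values():
        values = value if isinstance(value, list) else [value]
        for item in values:
            tokens.extend(analyze(str(item)))
    return tokens


def term_match(all_tokens, term):
    """True when the lower-cased term is one of the tokens in _all."""
    return term.lower() in all_tokens


def phrase_match(all_tokens, phrase):
    """True when the phrase terms occur at consecutive positions."""
    terms = analyze(phrase)
    width = len(terms)
    return any(
        all_tokens[start:start + width] == terms
        for start in range(len(all_tokens) - width + 1)
    )


doc = {
    "msg": "Happy birthday",
    "names": ["Alex", "Mark"],
    "when": "2014-11-01T10:09:08",
}
all_tokens = build_all_field(doc)
```

With this layout, "birthday alex" matches as a phrase because "birthday" (end of msg) and "alex" (start of names) are adjacent, while "birthday mark" fails because "alex" sits between them. The date contributes "2014" as a plain token, which is why q=2014 finds the document.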


Can Solr do the same kind of magic?

•  curl 'http://localhost:8983/solr/collection1/update/json/docs' -H 'Content-type:application/json' -d @msg.json

curl 'http://localhost:8983/solr/collection1/select'
{
  "responseHeader": { "status": 0, "QTime": 18, "params": {} },
  "response": { "numFound": 1, "start": 0, "docs": [
    {
      "msg": ["Happy birthday"],
      "names": ["Alex", "Mark"],
      "when": ["2014-11-01T10:09:08Z"],
      "_id": "e9af682d-e775-42f2-90a5-c932b5fbb691",
      "_version_": 1484096406012559360
    }]
  }
}

curl 'http://localhost:8983/solr/collection1/schema/fields'
{
  "responseHeader": { "status": 0, "QTime": 1 },
  "fields": [
    {"name": "_all", "type": "es_string", "multiValued": true, "indexed": true, "stored": false},
    {"name": "_id", "type": "string", "multiValued": false, "indexed": true, "required": true, "stored": true, "uniqueKey": true},
    {"name": "_version_", "type": "long", "indexed": true, "stored": true},
    {"name": "msg", "type": "es_string"},
    {"name": "names", "type": "es_string"},
    {"name": "when", "type": "tdates"}
  ]
}

•  Output slightly reformatted


Nearly the same magic

<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <!-- UUIDUpdateProcessorFactory will generate an id if none is present in the incoming document -->
  <processor class="solr.UUIDUpdateProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <arr name="format">
      <str>yyyy-MM-dd'T'HH:mm:ss</str>
      <str>yyyyMMdd'T'HH:mm:ss</str>
    </arr>
  </processor>
  <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
    <str name="defaultFieldType">es_string</str>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Boolean</str>
      <str name="fieldType">booleans</str>
    </lst>
    <lst name="typeMapping">
      <str name="valueClass">java.util.Date</str>
      <str name="fieldType">tdates</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Not quite the same magic:
•  URP chain happens before copyField
    •  Date/Ints are converted first
    •  copyField converts content back to string
    •  _all field also gets copy of _id and _version_
•  All auto-mapped fields HAVE to be multivalued
•  No (ES-style) types, just collections
•  Unable to reproduce cross-field search
•  Still rough around the edges
•  Requires dynamic schema, so adding new types becomes a challenge
•  Auto-mapping is NOT recommended for production
•  Dynamic fields solution is still more mature
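The ordering issue in the list above (the update processor chain runs before copyField) can be illustrated with a small Python model. This is a sketch under stated assumptions, not Solr code: the function names are hypothetical, only the two date formats from the chain are tried, and Python's `str()` stands in for whatever string form Solr actually produces when a converted value is copied into _all.

```python
from datetime import datetime

# The two date formats configured in the chain, in Python strptime notation.
DATE_FORMATS = ["%Y-%m-%dT%H:%M:%S", "%Y%m%dT%H:%M:%S"]


def parse_value(value):
    """Mimic the Parse*Field processors: try boolean, long, double, then date."""
    if not isinstance(value, str):
        return value
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    for caster in (int, float):
        try:
            return caster(value)
        except ValueError:
            pass
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    return value


def run_chain(doc):
    """Apply the parsing steps to every value; every field ends up multi-valued."""
    parsed = {}
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        parsed[name] = [parse_value(item) for item in values]
    return parsed


def copy_to_all(parsed_doc):
    """copyField runs after the chain, so _all receives converted values as strings."""
    all_values = [str(item) for values in parsed_doc.values() for item in values]
    return dict(parsed_doc, _all=all_values)


doc = {"msg": "Happy birthday", "names": ["Alex", "Mark"], "when": "2014-11-01T10:09:08"}
indexed = copy_to_all(run_chain(doc))
```

In this model the text that reaches _all for the date is the re-stringified converted value rather than the original input, which is one reason the Solr result is not an exact copy of the Elasticsearch behaviour.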


Explicit mapping - Solr

•  In schema.xml (or dynamic equivalent)
•  Uses Java Factories
•  Related content (e.g. stopwords) is usually in separate files (recently added: REST-managed)
•  French example:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" />
    <filter class="solr.FrenchLightStemFilterFactory"/>
  </analyzer>
</fieldType>


Explicit mapping - Elasticsearch

•  Created through PUT command
•  Can also be stored in config/default-mapping.json or config/mappings/[index_name]
•  Mappings for all types in one index should be compatible to avoid problems
•  Usually uses predefined mapping names. Has many names, including for languages
•  Explicit mapping is through named cross-references, rather than a stack duplicated in place (like Solr)
•  Related content is usually also in the definition. Sometimes in a file (e.g. stopwords_path – needs to be on all nodes)
•  French example (next slide):


Explicit mapping – Elasticsearch - French

{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles": [ "l", "m", "t", "qu", "n", "s", "j", "d", "c", "jusqu", "quoiqu", "lorsqu", "puisqu" ]
        },
        "french_stop": { "type": "stop", "stopwords": "_french_" },
        "french_keywords": { "type": "keyword_marker", "keywords": [] },
        "french_stemmer": { "type": "stemmer", "language": "light_french" }
      },
      ….
      "analyzer": {
        "french": {
          "tokenizer": "standard",
          "filter": [ "french_elision", "lowercase", "french_stop", "french_keywords", "french_stemmer" ]
        }
      }
    }
  }
}


Default analyzer - Elasticsearch

Indexing
1.  the analyzer defined in the field mapping, else
2.  the analyzer defined in the _analyzer field of the document, else
3.  the default analyzer for the type, which defaults to
4.  the analyzer named default in the index settings, which defaults to
5.  the analyzer named default at node level, which defaults to
6.  the standard analyzer

Query
1.  the analyzer defined in the query itself, else
2.  the analyzer defined in the field mapping, else
3.  the default analyzer for the type, which defaults to
4.  the analyzer named default in the index settings, which defaults to
5.  the analyzer named default at node level, which defaults to
6.  the standard analyzer
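Both lists describe a first-match-wins fallback chain. A minimal Python sketch of that resolution order follows. The function and parameter names are hypothetical and simply mirror the six steps on the slide; this is not Elasticsearch's internal API.

```python
def first_defined(*candidates):
    """Return the first candidate that is not None."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


def index_analyzer(field_mapping=None, doc_analyzer=None, type_default=None,
                   index_default=None, node_default=None):
    """Index-time order: field mapping, document _analyzer field, then the defaults."""
    return first_defined(field_mapping, doc_analyzer, type_default,
                         index_default, node_default, "standard")


def query_analyzer(query=None, field_mapping=None, type_default=None,
                   index_default=None, node_default=None):
    """Query-time order: the query itself, field mapping, then the defaults."""
    return first_defined(query, field_mapping, type_default,
                         index_default, node_default, "standard")
```

For example, `index_analyzer(doc_analyzer="french", index_default="english")` resolves to "french", and calling either function with no arguments falls through to "standard".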


Index many documents – Elasticsearch

POST /test3/entries/_bulk
{ "index": {"_id": "1" } }
{"msg": "Hello", "names": ["Jack", "Jill"]}
{ "index": {"_id": "2" } }
{"msg": "Goodbye", "names": "Jason"}
{ "delete" : {"_id" : "3" } }

NOTE: Rivers (similar to DIH) MAY be deprecated. Use Logstash instead (180 MB on disk, including 2 JRuby runtimes!!!)
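The bulk body is newline-delimited JSON: one action line, followed by a source line for every action except delete. A short Python sketch of building that body follows. The helper name `bulk_body` and the tuple layout are assumptions made for illustration, and the sketch only builds the string without sending it.

```python
import json


def bulk_body(actions):
    """Serialize (action, metadata, source) tuples into newline-delimited JSON."""
    lines = []
    for action, metadata, source in actions:
        lines.append(json.dumps({action: metadata}))
        if source is not None:
            # Delete actions carry no source line.
            lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


actions = [
    ("index", {"_id": "1"}, {"msg": "Hello", "names": ["Jack", "Jill"]}),
    ("index", {"_id": "2"}, {"msg": "Goodbye", "names": "Jason"}),
    ("delete", {"_id": "3"}, None),
]
body = bulk_body(actions)
```

The resulting string can be sent as the body of POST /test3/entries/_bulk with any HTTP client.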


Index many documents - Solr

JSON - simple

[
  { "_id": "1", "msg": "Hello", "names": ["Jack", "Jill"] },
  { "_id": "2", "msg": "Goodbye", "names": "Jason" }
]

JSON – with commands

{
  "add": { "doc": { "_id": "1", "msg": "Hello", "names": ["Jack", "Jill"] } },
  "add": { "doc": { "_id": "2", "msg": "Goodbye", "names": "Jason" } },
  "delete": { "_id":3 }
}

Also:
•  CSV
•  XML
•  XML+XSLT
•  JSON+transform (4.10)
•  DataImportHandler
•  Map-Reduce

External tools
•  Logstash (owned by ES)


Comparing search - Search

•  Same but different
•  Same: vast majority of the features come from Lucene
•  Different: representation of search parameters
•  Solr: URL query with many – cryptic – parameters
•  Elasticsearch:
    •  Search lite: URL query with a limited set of parameters (basic Lucene query)
    •  Query DSL: JSON with multi-leveled structure

[Venn diagram: shared Lucene implementation in the overlap, with "ES only" and "Solr only" on either side]


Search compared – Simple searches

{ "msg": "Happy birthday", "names": ["Alex", "Mark"], "when": "2014-11-01T10:09:08" }

{ "msg": "Happy New Year", "names": ["Jack", "Jill"], "when": "2015-01-01T00:00:01" }

{ "msg": "Goodbye", "names": ["Jack", "Jason"], "when": "2015-06-01T00:00:00" }

Elasticsearch (Marvel Sense GET):
•  /test1/hello/_search – all
•  /test1/hello/_search?q=happy birthday Alex – 2
•  /test1/hello/_search?q=names:Alex – 1

Solr (GET http://localhost:8983/solr/…):
•  /collection1/select – all
•  /collection1/select?q=happy birthday Alex – 2
•  /collection1/select?q=names:Alex – 1


Search Compared – Query DSL

Elasticsearch

GET /test1/hello/_search
{
  "query": {
    "query_string": {
      "fields": ["msg^5", "names"],
      "query": "happy birthday Alex",
      "minimum_should_match": "100%"
    }
  }
}

Solr

…/collection1/select
  ?q=happy birthday Alex
  &defType=dismax
  &qf=msg^5 names
  &mm=100%
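The two requests carry the same three pieces of information (query text, boosted fields, minimum match) in different shapes. The Python sketch below builds both from one set of inputs. The helper names are hypothetical, and the code only constructs the JSON body and the URL-encoded query string without sending anything.

```python
from urllib.parse import urlencode


def boosted(boosts):
    """Render {'msg': 5, 'names': 1} as ['msg^5', 'names']."""
    return [f"{field}^{boost}" if boost != 1 else field
            for field, boost in boosts.items()]


def es_query_string(text, boosts, mm):
    """Elasticsearch Query DSL body shaped like the one on the slide."""
    return {
        "query": {
            "query_string": {
                "fields": boosted(boosts),
                "query": text,
                "minimum_should_match": mm,
            }
        }
    }


def solr_dismax_params(text, boosts, mm):
    """URL-encoded Solr parameters for the same search."""
    return urlencode({
        "q": text,
        "defType": "dismax",
        "qf": " ".join(boosted(boosts)),
        "mm": mm,
    })


boosts = {"msg": 5, "names": 1}
es_body = es_query_string("happy birthday Alex", boosts, "100%")
solr_params = solr_dismax_params("happy birthday Alex", boosts, "100%")
```

Note that the Solr form needs URL encoding for the spaces, the caret, and the percent sign, which the slide omits for readability.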


Search Compared – Query DSL - combo

Elasticsearch

GET /test1/hello/_search
{
  "size" : 1,
  "query": {
    "filtered": {
      "query": { "query_string": { "query": "jack" } },
      "filter": { "range": { "when": { "gte": "now" } } }
    }
  }
}

Solr

…/collection1/select
  ?q=jack
  &fq=when:[NOW TO *]
  &rows=1

Search future entries about Jack. Return only the best one.
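In both engines the scored query and the date restriction are kept apart: Elasticsearch wraps them in a `filtered` query, Solr passes the restriction as `fq`. A small Python sketch of building both requests from the same inputs follows. The helper names are hypothetical and nothing is sent over the network.

```python
from urllib.parse import urlencode


def es_filtered(text, date_field, size):
    """Query DSL body: a scored query_string plus a non-scoring range filter."""
    return {
        "size": size,
        "query": {
            "filtered": {
                "query": {"query_string": {"query": text}},
                "filter": {"range": {date_field: {"gte": "now"}}},
            }
        },
    }


def solr_filtered(text, date_field, rows):
    """Solr parameters: q is scored, fq only restricts the result set."""
    return urlencode({
        "q": text,
        "fq": f"{date_field}:[NOW TO *]",
        "rows": rows,
    })


es_body = es_filtered("jack", "when", 1)
solr_params = solr_filtered("jack", "when", 1)
```

Here `size` and `rows` play the same role, limiting the response to the single best-scoring document.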


Parent/Child structures

Elasticsearch

Inner objects
•  Mapping: Object
•  Dynamic mapping (default)
•  NOT separate Lucene docs
•  Map to flattened multivalued fields
•  Search matches against value from ANY of inner objects

{ "followers.age": [19, 26], "followers.name": [alex, lisa] }

Nested objects
•  Mapping: nested
•  Explicit mapping
•  Lucene block storage
•  Inner documents are hidden
•  Cannot return inner docs only
•  Can do nested & inner

Parent and Child
•  Mapping: _parent
•  Explicit references
•  Separate documents
•  In-memory join
•  SLOW

Solr

Nested objects
•  Lucene block storage
•  All documents are visible
•  Child JSON is less natural
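The inner-objects behaviour (flattening into multivalued fields, with matches coming from ANY inner object) is easy to see in a small model. The Python sketch below is illustrative only: the function names are hypothetical, and `naive_match` simply checks each criterion independently, the way a search over flattened fields effectively does.

```python
def flatten_inner_objects(field, objects):
    """Flatten a list of inner objects into dotted, multivalued fields."""
    flattened = {}
    for obj in objects:
        for key, value in obj.items():
            flattened.setdefault(f"{field}.{key}", []).append(value)
    return flattened


def naive_match(flattened, field, **criteria):
    """Check every criterion on its own, ignoring which inner object it came from."""
    return all(
        value in flattened.get(f"{field}.{key}", [])
        for key, value in criteria.items()
    )


followers = [{"age": 19, "name": "alex"}, {"age": 26, "name": "lisa"}]
flat = flatten_inner_objects("followers", followers)

# No single follower is both 19 and named lisa, yet the flattened form matches.
cross_match = naive_match(flat, "followers", age=19, name="lisa")
```

This cross-object match is the problem that nested objects (Lucene block storage) avoid, at the cost of an explicit mapping.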


Cloud deployment – quick take

1.  General concepts are similar:
    •  Node discovery
    •  Sharding
    •  Replication
    •  Routing
2.  Implementations are very, very different (layer above Lucene)
3.  Solr uses Apache Zookeeper
4.  Elasticsearch has its own algorithms
5.  No time to discuss
6.  Let's focus on the critical path: Node discovery/cloud-state management
7.  Use a 3rd party analysis: Kyle Kingsbury's Jepsen tests


Jepsen test of Zookeeper

Use Zookeeper. It’s mature, well-designed, and battle-tested.


Jepsen test of Elasticsearch

If you are an Elasticsearch user (as I am): good luck.


Innovator’s dilemma

•  Solr's usual attitude
    •  An amazingly useful product for many different uses
    •  And wants everybody to know it
    •  …Right in the collection1 example
    •  “You will need all this eventually, might as well learn it first”
•  Elasticsearch is small and shiny (“trust us, the magic exists”)
•  Elasticsearch + Logstash + Kibana => power-punch triple combo
•  Especially when comparing to Solr (and not another commercial solution)
•  Feature release process
    •  Elasticsearch: kimchy: “LGTM” (Looks good to me)
    •  Solr: full Apache process around it
•  Solr – needs to buckle down and focus on onboarding experience
•  Solr is getting better (e.g. listen to SolrCluster podcast of October 24, 2014)


Solr vs. Elasticsearch Case by Case Alexandre Rafalovitch

www.solr-start.com

@arafalov @SolrStart