Hippo meetup: enterprise search with Solr and elasticsearch

Presentation used at the Hippo meetup about enterprise search, which took place in Amsterdam. The talk started with a general introduction to search with Lucene, then covered scaling with Solr and the distributed-search problems that elasticsearch successfully addresses.

Text of Hippo meetup: enterprise search with Solr and elasticsearch

  • 1. 15th January 2013, Hippo meetup. Luca Cavanna, Software developer & Search consultant at Trifork. [email protected] - @lucacavanna

2. Trifork (aka Jteam/Dutchworks/Orange11)
  • Focus areas:
      • Big data & Search
      • Mobile
      • Custom solutions
      • Knowledge (GOTO Amsterdam)
  • Hippo partner
  • Hippo related search projects: working on

3. Agenda
  • Search introduction
  • Lucene foundation
  • Why do we need Solr or elasticsearch?
  • Scaling with Solr
  • Elasticsearch distributed nature
  • Elasticsearch features

4. Apache Lucene
  • High-performance, full-featured text search engine library written entirely in Java
  • It indexes documents as collections of fields
  • A field is a string based key-value pair
  • What data structure does it use under the hood?

5. Inverted index
  Documents:
      1  The old night keeper keeps the keep in the town
      2  In the big old house in the big old gown.
      3  The house in the town had the big old keep
      4  Where the old night keeper never did sleep.
      5  The night keeper keeps the keep in the night
      6  And keeps in the dark and sleeps in the light.

  Index (freq = number of documents containing the term):
      term     freq  posting list
      and      1     6
      big      2     2 3
      dark     1     6
      did      1     4
      gown     1     2
      had      1     3
      house    2     2 3
      in       5     1 2 3 5 6
      keep     3     1 3 5
      keeper   3     1 4 5
      keeps    3     1 5 6
      light    1     6
      never    1     4
      night    3     1 4 5
      old      4     1 2 3 4
      sleep    1     4
      sleeps   1     6
      the      6     1 2 3 4 5 6
      town     2     1 3
      where    1     4

6. Inverted index
  • Indexing: text analysis (tokenization, lowercasing and more)
  • The inverted index can contain more data: term offsets and more
  • The inverted index itself doesn't contain the text for displaying the search results

7. Indexing
  • Lucene writes indexes as segments
  • Segments are not modifiable: write-once
  • Each segment is a searchable mini index
  • Each segment contains:
      • Inverted index
      • Stored fields
      • ...and more

8. Indexing: the commit operation
  • Documents are searchable only after a commit!
  • Commit also gives durability
  • The most expensive operation in Lucene!!!

9. Near-real-time search (since Lucene 2.9, exposed in Solr 4.0)
  • With the Lucene near-real-time API you don't need a commit to make new documents searchable
  • Less expensive than commit
  • Doesn't guarantee durability though
  • Exposed as soft commit in Solr 4.0

10. Lucene code example: indexing data

      import java.io.File;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.Field;
      import org.apache.lucene.document.FieldType;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.index.IndexWriterConfig;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.util.Version;

      IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
              new StandardAnalyzer(Version.LUCENE_40));
      // The index is written to the "data" directory on disk
      Directory directory = FSDirectory.open(new File("data"));
      IndexWriter writer = new IndexWriter(directory, config);

      Document document = new Document();

      // id: indexed as a single untokenized term and stored for display
      FieldType idFieldType = new FieldType();
      idFieldType.setIndexed(true);
      idFieldType.setStored(true);
      idFieldType.setTokenized(false);
      document.add(new Field("id", "id-1", idFieldType));

      // title: indexed (analyzed) and stored for display
      FieldType titleFieldType = new FieldType();
      titleFieldType.setIndexed(true);
      titleFieldType.setStored(true);
      document.add(new Field("title", "This is the title", titleFieldType));

      // description: indexed only, so it is searchable but not returned
      FieldType descriptionFieldType = new FieldType();
      descriptionFieldType.setIndexed(true);
      document.add(new Field("description", "This is the description", descriptionFieldType));

      writer.addDocument(document);
      // close() commits pending changes and releases the write lock
      writer.close();

11. Lucene code example: querying and showing results

      import java.io.File;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.document.Document;
      import org.apache.lucene.index.DirectoryReader;
      import org.apache.lucene.index.IndexReader;
      import org.apache.lucene.index.IndexableField;
      import org.apache.lucene.queryparser.classic.QueryParser;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.Query;
      import org.apache.lucene.search.ScoreDoc;
      import org.apache.lucene.search.TopDocs;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.util.Version;

      // "title" is the default field; queryAsString holds the user's query text
      QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
              new StandardAnalyzer(Version.LUCENE_40));
      Query query = queryParser.parse(queryAsString);

      Directory directory = FSDirectory.open(new File("data"));
      IndexReader indexReader = DirectoryReader.open(directory);
      IndexSearcher indexSearcher = new IndexSearcher(indexReader);

      // Fetch the 10 best-scoring hits
      TopDocs topDocs = indexSearcher.search(query, 10);
      System.out.println("Total hits: " + topDocs.totalHits);
      for (ScoreDoc hit : topDocs.scoreDocs) {
          // Only stored fields are available on the loaded document
          Document document = indexSearcher.doc(hit.doc);
          for (IndexableField field : document) {
              System.out.println(field.name() + ": " + field.stringValue());
          }
      }
      indexReader.close();

12. What's missing?
  • A common way to represent documents
  • Interface to send documents to (HTTP)
  • A way to represent queries
  • Interface to send queries to (HTTP)
  • Configuration
  • Caching
  • Distributed infrastructure
  • And more...

13. Enterprise search servers

14. Scaling: why?
  • The more concurrent searches you run, the slower they get
  • Indexing and searching on the same machine will substantially harm search performance
      • Segment merging can be a CPU/IO intensive operation
      • Disk cache invalidation
  • Fail over

15. Solr replication example

16. Solr replication (pull approach)
  • Master-slave based solution
      • Single machine for indexing data (master)
      • Multiple machines for querying (slaves)
  • Master is not aware of the slaves
  • Slave is aware of the master
  • Load balancer responsible for balancing the query requests
  • What about real-time search? No way!

17. SolrCloud
  • A set of new distributed capabilities in Solr
  • Uses Apache ZooKeeper as a system of record for the cluster state, for central configuration, and for leader election
  • Whatever server (shard) you send data to, the documents get distributed over the shards
  • A shard can be a leader or a replica and contains a subset of the data
  • Easily scale up by adding new Solr nodes

18. elasticsearch
  • Distributed search engine built on top of Lucene
  • Apache 2 license
  • Written in Java
  • RESTful
  • Created and mainly developed by Shay Banon
  • A company behind it:
  • Regular releases
      • Latest release 0.20.2

19. elasticsearch
  • Schemaless
      • Uses defaults and automatic type guessing
      • Custom mappings may be defined if needed
  • JSON oriented
  • Multi tenancy
      • Multiple indexes per node, multiple types per index
  • Designed to be distributed from the beginning
  • Almost everything is available as API (including configuration)
  • Wide range of administration APIs

20. elasticsearch distributed terminology
  • Node: a running instance of elasticsearch which belongs to a cluster (usually one node per server)
  • Cluster: one or more nodes with the same cluster name
  • Shard: a single Lucene instance. A low-level worker unit managed by elasticsearch. An index is split into one or more shards.
  • Index: a logical namespace which points to one or more shards
      • Your code won't deal directly with a shard, only with an index
      • But an index is composed of one or more Lucene indexes (one per shard)

21. elasticsearch distributed terminology
  • More shards:
      • improve indexing performance
      • increase data distribution (depends on # of nodes)
      • Watch out: each shard has a cost as well!
  • More replicas:
      • increase failover
      • improve querying performance

22. Transaction Log
  • Indexed docs are fully persistent
      • No need for a Lucene IndexWriter#commit
  • Managed using a transaction log / WAL
  • Full single node durability (kill -9)
  • Utilized when doing hot relocation of shards
  • Periodically flushed (calling IW#commit)
  • Durability and real-time search together!

23. Index - Shards & Replicas
  [Diagram: a client and two empty nodes]

      curl -XPUT localhost:9200/hippo -d '{
        "index" : {
          "number_of_shards" : 2,
          "number_of_replicas" : 1
        }
      }'

24. Index - Shards & Replicas

      Node                 Node
      Shard 0 (primary)    Shard 0 (replica)
      Shard 1 (replica)    Shard 1 (primary)

      curl -XPUT localhost:9200/hippo -d '{
        "index" : {
          "number_of_shards" : 2,
          "number_of_replicas" : 1
        }
      }'

25. Indexing - 1
  • Automatic sharding, push replication
  [Diagram: same two-node layout as slide 24]

      curl -XPUT localhost:9200/hippo/users/1 -d '{
        "name" : {
          "first" : "Luca",
          "last" : "Cavanna"
        }
      }'

26. Indexing - 2
  [Diagram: same two-node layout as slide 24]

      curl -XPUT localhost:9200/hippo/users/2 -d '{
        "name" : {
          "first" : "Jeroen",
          "last" : "Reijn"
        }
      }'

27. Search - 1
  • Scatter / gather search
  [Diagram: same two-node layout as slide 24]

      curl -XGET 'localhost:9200/hippo/_search?q=luca'

28. Search - 2
  • Automatic balancing between replicas
  [Diagram: same two-node layout as slide 24]

      curl -XGET 'localhost:9200/hippo/_search?q=luca'

29. Search - 3
  • Automatic failover
  [Diagram: same two-node layout as slide 24, with a failure marked on shard 1]

      curl -XGET 'localhost:9200/hippo/_search?q=luca'

30. Adding a node
  • Hot reallocation of shards to the new node

      Node                 Node
      Shard 0 (primary)    Shard 0 (replica)
      Shard 1 (replica)    Shard 1 (primary)

31. Adding a node
  • Hot reallocation of shards to the new node
  [Diagram: a third, empty node joins the two nodes of slide 30]

32. Adding a node
  • Hot reallocation of shards to the new node
  [Diagram: three nodes. Shard 0: primary, replica, replica. Shard 1: replica, primary.]

33. Node failure
  [Diagram: three nodes. Shard 0: primary, replica. Shard 1: replica, primary.]

34. Node failure - 1
  • Replicas can automatically become primaries
  [Diagram: two nodes. Shard 0 (primary), Shard 1 (primary).]

35. Node failure - 2
  • Shards are automatically assigned and do hot recovery

      Node                 Node
      Shard 0 (replica)    Shard 0 (primary)
      Shard 1 (primary)    Shard 1 (replica)

36. Dynamic Replicas
  [Diagram: three nodes. Shard 0: primary, replica.]

      curl -XPUT localhost:9200/hippo -d '{
        "index" : {
          "number_of_shards" : 1,
          "number_of_replicas" : 1
        }
      }'

37. Dynamic Replicas
  [Diagram: three nodes. Shard 0: primary, replica, replica.]

      curl -XPUT localhost:9200/hippo/_settings -d '{
        "index" : {
          "number_of_replicas" : 2
        }
      }'

38. Indexing (Push) - ElasticSearch
  • Documents added through push requests
  • Full JSON object representation of documents supported
      • Embedded objects 1st class
      • Parent / child and versioning
  • Near-realtime index refreshing available
  • Realtime get supported

      {
        "name": "Luca Cavanna",
        "location": {
          "city": "Amsterdam",
          "country": "The Netherlands"
        }
      }

39. Indexing (Pull) - ElasticSearch
  • Data flows from sources using Rivers
      • Continues to add data as it flows
      • Can be added, removed, configured dynamically
  • Out-of-the-box support for CouchDB, Twitter (implemented by the es team)
  • Community implementations for DBs, other NoSQL and Solr

40. Searching - ElasticSearch
  • Search request in request body
  • Powerful and extensible Query DSL
  • Separation of query and filters
  • Named filters allowing tracking of which documents matched which filters
  • By default storing the source of each document (_source field)
  • Catch-all feature enabled by default (_all field)
  • Sorting of results
  • Highlighting, faceting, boosting... and more
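
To make the "search request in request body" and "separation of query and filters" points concrete, here is a minimal sketch in plain Java that assembles such a request body. It is not taken from the slides. It assumes the `filtered` query of the 0.20-era Query DSL, which wraps a scored `query` and a non-scored `filter` (later elasticsearch versions express this with a `bool` query and a `filter` clause). The class name, the helper names and the field names `name.first` and `location.city` are hypothetical, borrowed from the example documents on slides 25 and 38.

```java
public class SearchRequestSketch {

    // Wraps a value in double quotes, escaping backslashes and quotes
    // so the result is a valid JSON string literal for simple inputs.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds a request body with two separate parts:
    //  - "query":  full-text match on the first name (contributes to scoring)
    //  - "filter": exact term on the city (only restricts the result set)
    // Assumption: the 0.20-era "filtered" query; field names are hypothetical.
    static String buildSearchBody(String firstName, String city) {
        return "{"
            + "\"query\":{\"filtered\":{"
            + "\"query\":{\"match\":{\"name.first\":" + quote(firstName) + "}},"
            + "\"filter\":{\"term\":{\"location.city\":" + quote(city) + "}}"
            + "}}}";
    }

    public static void main(String[] args) {
        String body = buildSearchBody("luca", "amsterdam");
        // Print the equivalent curl call against the "hippo" index from the slides
        System.out.println("curl -XGET 'localhost:9200/hippo/_search' -d '" + body + "'");
    }
}
```

The term filter value is lowercase because a term filter is matched against the indexed terms without analysis, and the default analysis lowercases text at index time. A real client would normally use a JSON library or the elasticsearch Java API rather than string concatenation.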