ELASTICSEARCH MAKE YOUR SOFTWARE SMARTER! OLEKSIY PANCHENKO / #PIVORAK / 2015


Page 1: Elastic pivorak

ELASTICSEARCH

MAKE YOUR SOFTWARE SMARTER!

OLEKSIY PANCHENKO / #PIVORAK / 2015

Page 2: Elastic pivorak

MY NAME IS…

Oleksiy Panchenko
Software engineer, Lohika

E-mail: [email protected]: oleskiyp

LinkedIn: https://ua.linkedin.com/in/opanchenko

Page 3: Elastic pivorak

AGENDA
• Introduction. What is it all about?
• Jump start Elastic. Demo time
• Architecture and deployment. Why is Elasticsearch elastic?
• Case studies. 4 real-life projects
• Query API in depth + Demo
• Using Elastic in Rails applications. Approaches and tools
• Kinda summary
• Q & A

Page 4: Elastic pivorak

[ ELASTIC MORNING @ LOHIKA ]

Page 5: Elastic pivorak

INTRODUCTION
WHAT IS IT ALL ABOUT?

Page 6: Elastic pivorak

HOW TO MAKE YOUR SITE SEARCHABLE?

http://www.imbusstop.com/wp-content/uploads/2015/02/websites.png

Page 7: Elastic pivorak

• Google search
• Why not use plain vanilla SQL? RDBMS rocks! select * from books join authors on … where …
• Sphinx (hello Craigslist, Habrahabr, The Pirate Bay, 1C); Xapian
• Lucene family: Apache Lucene, Elasticsearch, Apache Solr, Amazon CloudSearch, …

Page 8: Elastic pivorak

WHO HAS EVER USED ELASTICSEARCH/SOLR/SPHINX?

http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png

Page 9: Elastic pivorak

LUCENE AS A CORE
• Lucene = low-level Java library (JAR) which implements search functionality
• Lucene stores its index as a local binary file
• Can be used in both web and standalone applications (desktop, mobile)
• Implemented in Java, ports to other languages available
• Initial version: 1999
• Apache project since 2001
• Latest stable release: 5.3.1 (September 24, 2015)

Page 10: Elastic pivorak

LUCENE AS A CORE
• Lucene was originally written in 1999 by Doug Cutting (creator of Hadoop and Nutch; currently Chief Architect at Cloudera); it later became the basis of the open-source web search engine Nutch

http://www.china-cloud.com/uploads/allimg/121018/54-12101P92R1U7.jpg

Page 11: Elastic pivorak

MORE ABOUT SEARCH ENGINES

Riak Search

Page 12: Elastic pivorak

TIME TO TALK ABOUT ELASTICSEARCH

https://www.elastic.co/products/elasticsearch

Near Real-Time Data (NRT)

Full-Text Search
Multilingual search, geolocation, fuzzy search, did-you-mean suggestions, autocomplete

Page 13: Elastic pivorak


High Availability

Multitenancy

Distributed, Horizontally Scalable

Page 14: Elastic pivorak


Document-Oriented

Schema-Free

Conflict Management
Optimistic Concurrency Control

Page 15: Elastic pivorak


Apache 2 Open Source License

Awesome documentation

Large community

Developer-Friendly, RESTful API
Client libraries available for many programming languages and frameworks.

Page 16: Elastic pivorak

ELASTICSEARCH USERS

https://www.elastic.co/use-cases
https://en.wikipedia.org/wiki/Elasticsearch#Users

Page 17: Elastic pivorak

ELASTICSEARCH – PAST & PRESENT
• 2004. Shay Banon (aka Kimchy) started working on Compass – distributed and scalable Java Search Engine on top of Lucene
• 2010. Initial release of ES
• Latest stable release: 1.7.2 (September 14, 2015)
• 2.0 to be released in November
• 500K downloads per month
• https://github.com/elastic/elasticsearch

http://opensource.hk/sites/default/files/u1/shay-banon.jpg

Page 18: Elastic pivorak

ELASTICSEARCH AS A COMPANY
• 2012. Elasticsearch BV; Funding: $104M in 3 rounds, 100+ employees
• https://www.elastic.co/
• Product portfolio:
– Elasticsearch, Logstash, Kibana (ELK stack)
– Watcher
– Shield
– Marvel
– es-hadoop
– Found

Page 19: Elastic pivorak

JUMP START ELASTIC

DEMO TIME

Page 20: Elastic pivorak

INSTALLATION & CONFIGURATION
• Prerequisites:
– JDK 6 or above (recommended: JDK 8)
– RAM: min. 2 GB (recommended: 16–64 GB for production)
– CPU: number of cores over clock rate
– Disks: recommended SSD
• Homebrew, apt, yum: apt-get install elasticsearch
• Download (ZIP, TAR, DEB, RPM): https://www.elastic.co/downloads/elasticsearch
• Installation is absolutely straightforward and easy: https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html

Page 21: Elastic pivorak

LET’S TALK ABOUT TERMINOLOGY
Index ~ DB Schema
Type ~ DB Table
Document ~ Record, JSON object
Mapping ~ Schema definition in RDBMS
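To make the terminology concrete, here is a minimal sketch of how the four terms show up when talking to Elasticsearch 1.x over REST, where a document is addressed as /{index}/{type}/{id}. The index name `library`, the type `book`, the field names and the helper `document_url` are hypothetical and chosen only for illustration.

```python
import json


def document_url(base, index, doc_type, doc_id):
    """Build the REST address of one document: /{index}/{type}/{id}."""
    return "{}/{}/{}/{}".format(base.rstrip("/"), index, doc_type, doc_id)


# Hypothetical names: index "library" (~ DB schema), type "book" (~ DB table).
url = document_url("http://localhost:9200", "library", "book", 1)

# A document (~ record) is a plain JSON object.
document = {"title": "Kobzar", "author": "Taras Shevchenko", "year": 1840}

# A mapping (~ schema definition) declares the field types of one type.
mapping = {
    "book": {
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "year": {"type": "integer"},
        }
    }
}

print("PUT", url)
print(json.dumps(document))
```

The sketch only builds the request pieces; sending them requires a running node and an HTTP client.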

Page 22: Elastic pivorak

DEMO #1

http://www.telikin.com/cms/images/shocked_senior_computer_user.jpg

Page 23: Elastic pivorak

ARCHITECTURE AND DEPLOYMENT
WHY IS ELASTICSEARCH ELASTIC?

Page 24: Elastic pivorak

Cluster: One or more nodes which share the same cluster name

Node: Running instance of Elasticsearch which belongs to a cluster

Shard: A portion of data – single Lucene instance. Default: 5 shards in an index

Primary Shard: Master copy of data

Replica Shard: Exact copy of a primary shard. Default: 1 replica

Page 25: Elastic pivorak

SINGLE-NODE CLUSTER

Node: shards 0 1 2 3 4

A hash function assigns each document to a shard:

{ "id": "123", "name": "John", … }
{ "id": "124", "name": "Patricia", … }
{ "id": "125", "name": "Scott", … }
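The routing step on this slide can be sketched as `shard = hash(routing) % number_of_primary_shards`, which is the formula given in the Elasticsearch guide. The hash below (`zlib.crc32`) is only a stand-in chosen for illustration and is not the hash Elasticsearch uses internally; the function name `shard_for` is hypothetical as well.

```python
import zlib

NUMBER_OF_PRIMARY_SHARDS = 5  # the default mentioned on the terminology slide


def shard_for(doc_id, number_of_shards=NUMBER_OF_PRIMARY_SHARDS):
    """Pick a shard for a document id.

    crc32 is a stand-in hash (an assumption for this sketch); only the
    "hash modulo number of primary shards" idea mirrors Elasticsearch.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


for doc_id in ("123", "124", "125"):
    print(doc_id, "-> shard", shard_for(doc_id))
```

Because the function is deterministic, any node can work out which shard holds a given document without a lookup table.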

Page 26: Elastic pivorak

TWO-NODE CLUSTER

Node 1: 0 1 R2 3 R4

Node 2: R0 R1 2 R3 4

* Ability to ‘route’ indexes to particular nodes (tag-based, e.g.: ‘strong’, ‘medium’, ‘weak’)

Page 27: Elastic pivorak

BENEFITS OF SHARDING
• Take advantage of multi-core CPUs (one shard is a single Lucene instance; the shards on a node can be searched in parallel threads within the node's JVM)
• Horizontal scalability. Dynamic rebalancing
• Fault tolerance and cluster resilience
• NB! The number of primary shards cannot be changed on the fly; changing it requires a full reindex
• Max number of documents per shard: 2,147,483,519 – imposed by Lucene
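Two of the points above can be checked with a little arithmetic. The sketch below assumes the same "hash modulo number of shards" routing idea, with `zlib.crc32` as a stand-in hash (not the one Elasticsearch uses), and shows why changing the shard count invalidates existing placements. It also shows where the per-shard limit comes from: Lucene's cap is the largest signed 32-bit integer minus 128.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    # Stand-in hash, an assumption for illustration only.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


# Route 1000 hypothetical ids with 5 shards, then again with 6 shards.
doc_ids = [str(i) for i in range(1000)]
moved = [d for d in doc_ids if shard_for(d, 5) != shard_for(d, 6)]
print("{} of {} documents would land on a different shard".format(
    len(moved), len(doc_ids)))

# Lucene's per-index document limit: Integer.MAX_VALUE - 128.
MAX_DOCS_PER_SHARD = 2**31 - 1 - 128
print(MAX_DOCS_PER_SHARD)
```

Since many documents would map to a different shard under the new count, the data has to be reindexed rather than left in place.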

Page 28: Elastic pivorak

ELASTICSEARCH NODE TYPES
• Data node: node.data = true
• Master node: node.master = true
• Communication client: http.enabled = true
• TCP ports 9200 (ext), 9300 (int)
• A node can play 2 or 3 roles at the same time
• Multicast discovery (true by default): discovery.zen.ping.multicast.enabled
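As a rough illustration of how the `node.master` and `node.data` flags combine in Elasticsearch 1.x, the sketch below names the role each combination implies. The function `describe_node` and its wording are hypothetical; only the two setting names come from the slide.

```python
def describe_node(master, data):
    """Name the role implied by the node.master / node.data settings."""
    if master and data:
        return "master-eligible data node"
    if master:
        return "dedicated master-eligible node"
    if data:
        return "data-only node"
    # Neither flag set: the node only routes requests and merges results.
    return "client node"


# Both settings are true unless configured otherwise.
print(describe_node(master=True, data=True))
print(describe_node(master=False, data=False))
```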

Page 29: Elastic pivorak

DEPLOYMENT DIAGRAM

Page 30: Elastic pivorak

INDEXING A DOCUMENT

https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html

Page 31: Elastic pivorak

RETRIEVING A DOCUMENT

https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-read.html

• In terms of retrieving documents, primary and replica shards are equivalent: data can be read from either primary or replica shard

Page 32: Elastic pivorak

DISTRIBUTED SEARCH
• Given a search query, retrieve the 10 most relevant results

https://www.elastic.co/guide/en/elasticsearch/guide/current/_query_phase.html
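The query phase linked above can be sketched as follows: every shard returns its own local top N, and the coordinating node merges those candidates into the global top N. The shard contents, scores and function names below are made up for illustration.

```python
import heapq


def search_shard(shard_docs, size):
    """Query phase on one shard: return its local top `size` (score, id) pairs."""
    return heapq.nlargest(size, shard_docs)


def distributed_search(shards, size=10):
    """Coordinating node: gather per-shard candidates and keep the global top."""
    candidates = []
    for shard_docs in shards:
        candidates.extend(search_shard(shard_docs, size))
    return heapq.nlargest(size, candidates)


# Hypothetical (score, document id) pairs held by three shards.
shards = [
    [(0.9, "a"), (0.2, "b"), (0.7, "c")],
    [(0.8, "d"), (0.1, "e")],
    [(0.95, "f"), (0.5, "g"), (0.3, "h")],
]
top3 = distributed_search(shards, size=3)
print(top3)
```

Each shard has to return `size` candidates because, in the worst case, all of the best results live on a single shard.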

Page 33: Elastic pivorak

http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg

Page 34: Elastic pivorak

CASE STUDIES
4 REAL-LIFE PROJECTS

http://vignette1.wikia.nocookie.net/fallout/images/9/9d/FNV_Rake.png/revision/latest?cb=20140618212609&path-prefix=ru

Page 35: Elastic pivorak

GENERAL INFO
• 4 projects, ~2 years
• RDBMS (MySQL, PostgreSQL) as a primary data storage
• Both on-premise Elasticsearch installation (AWS, MS Azure) and SaaS (Bonsai @ Heroku)
• 1 or 2 instances in a cluster
• Data volume: Gigabytes; millions of documents
• Back-end: Java, Ruby

Page 36: Elastic pivorak

#1. SOCIAL INFLUENCER MARKETING PLATFORM

http://www.nclurbandesign.org/wp-content/uploads/2015/05/blog-pic-b2c.jpg

Page 37: Elastic pivorak

• Document types: Blog Posts, Bloggers (Influencers)
• Elasticsearch usage:
– search and rank Influencers by category, keywords, tags, location, audience, influence
– search blog posts by keywords etc.
• Amount of data:
– Influencers: hundreds of thousands
– Blog Posts: millions
• ES cluster size: 2 instances
• Technology stack: Java, MySQL, DynamoDB, AWS
• Considered alternatives: Sphinx, Apache Solr

Page 38: Elastic pivorak

#2. JOB SITE

http://www.roberthalf.com/sites/default/files/Media_Root/Images/RH-Images/Using-a-job-search-site.jpg

Page 39: Elastic pivorak

• Document types: Job Postings, Jobseekers
• Find relevant jobs
– Simple one-click search
– Advanced search (title, keywords, industry, location/distance, salary, requirements)
• Elasticsearch as a Recommendation Engine. Recommend jobs based on: previously applied/viewed jobs, location, distance, schedule etc.
• 2 types of recommendations:
– Side banner (You also might be interested in…)
– E-mail subscriptions every 2 weeks
• Find appropriate candidates by location, requirements (experience, education, languages), salary expectations

Page 40: Elastic pivorak

SEARCH QUERIES
• No fixed document structure (jobs from different providers)
• Full-text search
• Fuzzy search
• Geolocation (distance)
• Weighted search: boosted search clauses
• Dynamic scripting (MVEL until v1.4.0, then Groovy)

Page 41: Elastic pivorak

SOME MORE FACTS
• Amount of data:
– Job postings: ~1M
– Applicants: ~20K
• Cluster size: 2 ‘medium’ EC2 instances
• Technology stack:
– Ruby on Rails
– Elasticsearch, PostgreSQL, Redis
– Heroku + add-ons, AWS (S3, EC2)
– Lots of 3rd party APIs and integrations

Page 42: Elastic pivorak

LESSONS LEARNED
• On-premise deployment (EC2) vs. SaaS (Bonsai @ Heroku)
• Dynamic scripting
• PostgreSQL as a backup search engine sucks

Page 43: Elastic pivorak

#3. CAR TRADING

http://bigskybeetles.com/wp-content/uploads/2014/12/restored-beetle-car.png

Page 44: Elastic pivorak

PARSING ADS

Price

$3900

Page 45: Elastic pivorak

1996 VW PASSAT SEDAN B4 TDI TURBO DIESEL 44+MPG
WAT???
• Fuzzy Search (Levenshtein Distance Algorithm) used to parse ads and classify cars
• Elasticsearch index contains dictionary (Year, Make, Model, Trim)
• Used in conjunction with other approaches: regular expressions, dictionaries of synonyms (VW → Volkswagen, Chevy → Chevrolet), normalization (e.g. LX-370 → LX370)
• Algorithm approach:
– Parse Year (1996)
– Search most relevant Make (VW, volkswagon → Volkswagen)
– Search most relevant Model (Passat) for Make = Volkswagen, Year = 1996
– Search most relevant Trim (TDi 4dr Sedan)
• Parsing quality: 90%

https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-fuzzy-query.html
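The "search most relevant Make" step can be illustrated with a plain Levenshtein distance in Python. This is a sketch of the idea only, not the project's code and not Elasticsearch's fuzzy query itself; the dictionaries `MAKES` and `SYNONYMS`, the function `closest_make` and the threshold `max_distance=2` are assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical dictionaries standing in for the index of makes and synonyms.
MAKES = ["Volkswagen", "Volvo", "Chevrolet"]
SYNONYMS = {"vw": "Volkswagen", "chevy": "Chevrolet"}


def closest_make(token, max_distance=2):
    """Resolve a token from an ad to a known make, or None if nothing is close."""
    token = token.lower()
    if token in SYNONYMS:
        return SYNONYMS[token]
    best = min(MAKES, key=lambda make: levenshtein(token, make.lower()))
    return best if levenshtein(token, best.lower()) <= max_distance else None


print(closest_make("volkswagon"))
print(closest_make("VW"))
```

The distance threshold keeps unrelated tokens such as a model name from being forced onto the nearest make.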

Page 46: Elastic pivorak

#4. [NDA]

http://cdn.4glaza.ru/images/products/large/0/bresser-junior-loupe-2x-4x-dop6.jpg

Page 47: Elastic pivorak

SOME UNCOVERED INFO
• Check documents against duplicate content
• Shingle analysis (commonly used by copywriters and SEO experts)
– I have a dream that one day this nation will rise up and live…
– Normalization: have dream that this nation will rise up and live…
– Splitting a text into shingles (n-grams), n = 3..10:
have dream that
dream that this
that this nation
this nation will
…
– Replacement of look-alike characters: latin ‘c’, cyrillic ‘с’
• Custom or standard ES implementation of Shingle analysis

https://en.wikipedia.org/wiki/W-shingling
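A minimal sketch of shingle-based duplicate detection, under stated assumptions: the stop-word set is made up so that the output reproduces the shingles shown on the slide, and similarity is measured as the Jaccard resemblance of the two shingle sets, as in w-shingling. It is not the project's implementation and does not use Elasticsearch's shingle token filter.

```python
import re

# Assumption: a tiny stop-word list picked to reproduce the slide's shingles.
STOP_WORDS = {"i", "a", "one", "day"}


def normalize(text):
    """Lowercase, strip punctuation and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def shingles(words, n=3):
    """All runs of n consecutive words, as a set of strings."""
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(a, b, n=3):
    """Jaccard similarity of the shingle sets of two texts."""
    sa, sb = shingles(normalize(a), n), shingles(normalize(b), n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


original = "I have a dream that one day this nation will rise up and live"
copy = "I have a dream that one day this nation will rise up"

print(sorted(shingles(normalize(original))))
print(resemblance(original, copy))
```

A resemblance close to 1.0 suggests duplicated content; the cut-off to act on is a product decision and is not given in the slides.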

Page 48: Elastic pivorak

QUERY API IN DEPTH
+ DEMO

Page 49: Elastic pivorak

FILTERS VS. QUERIES
As a general rule, filters should be used:
• for binary yes/no searches
• for queries on exact values

Filters are much faster than queries because they do not compute a relevance score
Filters are usually great candidates for caching

27 Filters available (Elasticsearch 1.7.1)

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html

Page 50: Elastic pivorak

QUERIES VS. FILTERS
As a general rule, queries should be used instead of filters:
• for full text search
• where the result depends on a relevance score

Common approach: Filter as many records as possible, then query them.

38 Queries available (Elasticsearch v 1.7.1)

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
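The "filter first, then query" approach maps onto the `filtered` query of the Elasticsearch 1.x Query DSL, which wraps an exact-value filter around a scored full-text query. The sketch below only builds the request body; the field names `city` and `description` and the helper `filtered_search` are hypothetical.

```python
import json


def filtered_search(text, city):
    """Build a 1.x-style request body: exact-match filter plus scored query."""
    return {
        "query": {
            "filtered": {
                # Yes/no condition on an exact value: no scoring, cacheable.
                "filter": {"term": {"city": city}},
                # Full-text part: contributes the relevance score.
                "query": {"match": {"description": text}},
            }
        }
    }


body = filtered_search("ruby developer", "lviv")
print(json.dumps(body, indent=2))
```

The body would be sent to the `_search` endpoint of an index; that step is left out because it needs a running cluster.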

Page 51: Elastic pivorak

DEMO #2

http://www.socialtalent.co/wp-content/uploads/blog-content/computer-user-confused.jpg

Page 52: Elastic pivorak

SOME THEORY BEHIND RELEVANCE SCORING
full AND text AND search AND (elasticsearch OR lucene)

• Term Frequency: How often does the term appear in the document?

• Inverse Document Frequency: How often does the term appear in all documents in the collection?

• Field-length norm: How long is the field?

• TF, FLN etc. are calculated and stored at index time

https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

http://blog.qbox.io/optimizing-search-results-in-elasticsearch-with-scoring-and-boosting
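The three factors can be written down using the formulas from the scoring-theory page of the Elasticsearch guide linked above: tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(numTerms). The sketch below is a simplification; the product in `term_weight` is an assumption for illustration and leaves out the boosts and normalization factors of Lucene's full scoring function.

```python
import math


def tf(frequency):
    """Term frequency: more occurrences in the field raise the weight."""
    return math.sqrt(frequency)


def idf(num_docs, doc_freq):
    """Inverse document frequency: common terms get a lower weight."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """Field-length norm: a match in a short field counts for more."""
    return 1 / math.sqrt(num_terms)


def term_weight(frequency, num_docs, doc_freq, num_terms):
    # Simplified combination of the three factors (an assumption of this sketch).
    return tf(frequency) * idf(num_docs, doc_freq) * field_length_norm(num_terms)


# A rare term in a short field outweighs a common term in a long field.
print(term_weight(frequency=1, num_docs=1000, doc_freq=9, num_terms=4))
print(term_weight(frequency=1, num_docs=1000, doc_freq=499, num_terms=100))
```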

Page 53: Elastic pivorak

MORE COOL FEATURES
• Indexing attachments: MS Office, ePub, PDF (Apache Tika)
• Autocomplete suggestion:

• Did-you-mean suggestion:

• Highlight results:

Page 54: Elastic pivorak

SEARCH IMAGES

https://www.theloopyewe.com/shop/search/cd/0-100~75-90-50~18-12-12/g/59A9BAC5/
https://github.com/kzwang/elasticsearch-image

Page 55: Elastic pivorak

USING ELASTIC IN RAILS APPLICATIONS
APPROACHES AND TOOLS

Page 56: Elastic pivorak

ELASTICSEARCH-RUBY
• https://github.com/elastic/elasticsearch-ruby
• Includes two packages: elasticsearch-transport + elasticsearch-api
• Client for connecting to an Elasticsearch cluster
• Ruby API for the Elasticsearch REST API
• Various extensions and utilities

Page 57: Elastic pivorak

ELASTICSEARCH-RAILS
• https://github.com/elastic/elasticsearch-rails
• Includes three packages: elasticsearch-model + elasticsearch-persistence + elasticsearch-rails
• ActiveModel integration with adapters for ActiveRecord and Mongoid
• Enumerable-based wrapper for search results; ActiveRecord::Relation-based wrapper for returning search results as records
• Support for Kaminari and WillPaginate pagination
• Convenience methods for (re)creating the index, setting up mappings, indexing documents, …
• Rake tasks for importing data from application models

Page 58: Elastic pivorak

MY WAY (RAILS 4 APP)
Gemfile

config/environments/production.rb

Page 59: Elastic pivorak

MY WAY (RAILS 4 APP)
job.rb

Page 60: Elastic pivorak

MY WAY (RAILS 4 APP)
job.rb

Page 61: Elastic pivorak

MY WAY (RAILS 4 APP)
job.rb

Page 62: Elastic pivorak

ELASTICSEARCH SEARCH QUERY

Page 63: Elastic pivorak

MY WAY (RAILS 4 APP)
job_helper.rb

Page 64: Elastic pivorak

MY WAY (RAILS 4 APP)
job_helper.rb

Page 65: Elastic pivorak

MY WAY (RAILS 4 APP)
elasticsearch.rake

Page 66: Elastic pivorak

KINDA SUMMARY

Page 67: Elastic pivorak

ELASTICSEARCH DRAWBACKS
• No transaction support. Elasticsearch is not a database.
• No joins, constraints and other RDBMS features
• Durability and consistency issues, data loss:
– https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0
– https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html

Page 68: Elastic pivorak

PERFORMANCE?

http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/
http://solr-vs-elasticsearch.com/

• Apache Solr can be faster than ES in search-only scenarios while Elasticsearch usually outperforms Solr when doing writes and reads concurrently
• Sphinx is faster at indexing (up to 15MB/s per core)
• Performance issues can usually be fixed by horizontal scaling

Page 69: Elastic pivorak

SUMMARY
• ES is not a silver bullet but a really, really powerful tool
• Elasticsearch is not an RDBMS and is not supposed to act as a database. Choose your tools properly. Leverage the synergy of DB + ES
• Elasticsearch is dead simple at the start but might get sophisticated later as you go
• Kick off easily, then hire a good DevOps engineer for best results
• The ecosystem around Elasticsearch is just amazing
• Give it a try – it can bring a lot of value to your product and your CV ;)

http://www.aperfectworld.org/clipart/gestures/rockhard11.png

Page 70: Elastic pivorak

QUESTIONS?

http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png

Page 71: Elastic pivorak

THANK YOU!

http://conveyancingderby.co/wp-content/uploads/2011/07/cat-card.jpg

Page 72: Elastic pivorak

USEFUL LINKS
• Elasticsearch: https://www.elastic.co/products/elasticsearch
• Extended presentation about Elasticsearch and its ecosystem: https://www.youtube.com/watch?v=GL7xC5kpb-c
• Scripts for the demos: https://github.com/opanchenko/morning-at-lohika-ELK