ELASTICSEARCH
MAKE YOUR SOFTWARE SMARTER!
OLEKSIY PANCHENKO / #PIVORAK / 2015
MY NAME IS…
Oleksiy Panchenko
Software engineer, Lohika
E-mail: [email protected]: oleskiyp
LinkedIn: https://ua.linkedin.com/in/opanchenko
AGENDA
• Introduction. What is it all about?
• Jump start Elastic. Demo time
• Architecture and deployment. Why is Elasticsearch elastic?
• Case studies. 4 real-life projects
• Query API in depth + Demo
• Using Elastic in Rails applications. Approaches and tools
• Kinda summary
• Q & A
[ ELASTIC MORNING @ LOHIKA ]
INTRODUCTION
WHAT IS IT ALL ABOUT?
HOW TO MAKE YOUR SITE SEARCHABLE?
http://www.imbusstop.com/wp-content/uploads/2015/02/websites.png
• Google search
• Why not use plain vanilla SQL? RDBMS rocks! select * from books join authors on … where …
• Sphinx (hello Craigslist, Habrahabr, The Pirate Bay, 1C); Xapian
• Lucene family: Apache Lucene, Elasticsearch, Apache Solr, Amazon CloudSearch, …
WHO HAS EVER USED ELASTICSEARCH/SOLR/SPHINX?
http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
LUCENE AS A CORE
• Lucene = low-level Java library (JAR) which implements search functionality
• Lucene stores its index as a local binary file
• Can be used in both web and standalone applications (desktop, mobile)
• Implemented in Java, ports to other languages available
• Initial version: 1999
• Apache project since 2001
• Latest stable release: 5.3.1 (September 24, 2015)
LUCENE AS A CORE
• Lucene was originally written in 1999 by Doug Cutting (creator of Hadoop and Nutch; currently Chief Architect at Cloudera); it later became the core of the open-source web search engine Nutch
http://www.china-cloud.com/uploads/allimg/121018/54-12101P92R1U7.jpg
MORE ABOUT SEARCH ENGINES
Riak Search
TIME TO TALK ABOUT ELASTICSEARCH
https://www.elastic.co/products/elasticsearch
Near Real-Time Data (NRT)
Full-Text Search
Multilingual search, geolocation, fuzzy search, did-you-mean suggestions, autocomplete
High Availability
Multitenancy
Distributed, Horizontally Scalable
Document-Oriented
Schema-Free
Conflict Management
Optimistic Concurrency Control
Apache 2 Open Source License
Awesome documentation
Large community
Developer-Friendly, RESTful API
Client libraries available for many programming languages and frameworks.
ELASTICSEARCH USERS
https://www.elastic.co/use-cases
https://en.wikipedia.org/wiki/Elasticsearch#Users
ELASTICSEARCH – PAST & PRESENT
• 2004. Shay Banon (aka Kimchy) started working on Compass – distributed and scalable Java search engine on top of Lucene
• 2010. Initial release of ES
• Latest stable release: 1.7.2 (September 14, 2015)
• 2.0 to be released in November
• 500K downloads per month
• https://github.com/elastic/elasticsearch
http://opensource.hk/sites/default/files/u1/shay-banon.jpg
ELASTICSEARCH AS A COMPANY
• 2012. Elasticsearch BV; Funding: $104M in 3 rounds, 100+ employees
• https://www.elastic.co/
• Product portfolio:
– Elasticsearch, Logstash, Kibana (ELK stack)
– Watcher
– Shield
– Marvel
– es-hadoop
– Found
JUMP START ELASTIC
DEMO TIME
INSTALLATION & CONFIGURATION
• Prerequisites:
– JDK 7 or above (recommended: JDK 8)
– RAM: min. 2 GB (recommended: 16–64 GB for production)
– CPU: number of cores over clock rate
– Disks: recommended SSD
• Homebrew, apt, yum: apt-get install elasticsearch
• Download (ZIP, TAR, DEB, RPM): https://www.elastic.co/downloads/elasticsearch
• Installation is absolutely straightforward and easy: https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html
LET’S TALK ABOUT TERMINOLOGY
Index ~ DB Schema
Type ~ DB Table
Document ~ Record, JSON object
Mapping ~ Schema definition in RDBMS
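To make the analogy a bit more concrete, here is a minimal Python sketch of how a single document is addressed in the Elasticsearch 1.x REST API: index, type and document ID form the URL path, and the document itself is the JSON body. The host, the index name `library`, the type `book` and the fields are hypothetical, and nothing is actually sent over the network.

```python
import json


def document_url(host, index, doc_type, doc_id):
    """Build the REST address of a single document: /index/type/id."""
    return "{}/{}/{}/{}".format(host.rstrip("/"), index, doc_type, doc_id)


# Hypothetical example data, not taken from the talk.
url = document_url("http://localhost:9200", "library", "book", 1)
body = json.dumps({"title": "Example Book", "year": 2015}, sort_keys=True)

# Indexing the document would be an HTTP PUT of `body` to `url`.
print("PUT", url)
print(body)
```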
DEMO #1
http://www.telikin.com/cms/images/shocked_senior_computer_user.jpg
ARCHITECTURE AND DEPLOYMENT
WHY IS ELASTICSEARCH ELASTIC?
Cluster: One or more nodes which share the same cluster name
Node: Running instance of Elasticsearch which belongs to a cluster
Shard: A portion of data – single Lucene instance. Default: 5 shards in an index
Primary Shard: Master copy of data
Replica Shard: Exact copy of a primary shard. Default: 1 replica
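SINGLE-NODE CLUSTER
[Diagram: each document passes through a hash function that assigns it to one of the primary shards 0–4, all hosted on a single node]
{ "id": "123", "name": "John", … }
{ "id": "124", "name": "Patricia", … }
{ "id": "125", "name": "Scott", … }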
TWO-NODE CLUSTER
Node 1: 0 1 R2 3 R4
Node 2: R0 R1 2 R3 4
(R = replica shard)
* Ability to ‘route’ indexes to particular nodes (tag-based, e.g.: ‘strong’, ‘medium’, ‘weak’)
BENEFITS OF SHARDING
• Take advantage of multi-core CPUs (one shard is a single Lucene instance; each shard is searched in its own thread)
• Horizontal scalability. Dynamic rebalancing
• Fault tolerance and cluster resilience
• NB! The number of primary shards cannot be changed on the fly; changing it requires full reindexing
• Max number of documents per shard: 2,147,483,519 – imposed by Lucene
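The hash-based placement of documents and the fixed shard count are two sides of the same rule: a document goes to shard `hash(routing value) % number_of_primary_shards`. The Python sketch below illustrates this under stated assumptions: `zlib.crc32` is only a stand-in for the hash function Elasticsearch really uses, and the document IDs are made up.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Pick a primary shard: hash(routing value) % number_of_primary_shards."""
    # zlib.crc32 is a stand-in; Elasticsearch uses its own hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


doc_ids = [str(i) for i in range(123, 1123)]  # 1000 hypothetical document IDs

# With the default of 5 shards every document maps to exactly one shard.
placement_5 = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}

# With 6 shards the same documents would map to different shards, so data
# already stored could not be located any more without reindexing.
placement_6 = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}
moved = sum(1 for d in doc_ids if placement_5[d] != placement_6[d])

print("shard of document 123 with 5 shards:", placement_5["123"])
print("documents that would change shard (5 -> 6):", moved, "of", len(doc_ids))
```

Because the same formula is used both when indexing and when retrieving a document, any change to the divisor silently invalidates the location of existing documents.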
ELASTICSEARCH NODE TYPES
• Data node: node.data = true
• Master node: node.master = true
• Communication client: http.enabled = true
• TCP ports 9200 (ext), 9300 (int)
• A node can play 2 or 3 roles at the same time
• Multicast discovery (true by default): discovery.zen.ping.multicast.enabled
DEPLOYMENT DIAGRAM
INDEXING A DOCUMENT
https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html
RETRIEVING A DOCUMENT
https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-read.html
• In terms of retrieving documents, primary and replica shards are equivalent: data can be read from either primary or replica shard
DISTRIBUTED SEARCH
• Given a search query, retrieve the 10 most relevant results
https://www.elastic.co/guide/en/elasticsearch/guide/current/_query_phase.html
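The query phase can be pictured as scatter-gather: every shard computes its own top results, and the coordinating node merges them into a global top list. Below is a minimal Python sketch of that idea with hypothetical scores and document IDs; it leaves out the follow-up fetch phase, in which the full documents are loaded.

```python
import heapq


def shard_top_k(shard_hits, k):
    """Query phase on one shard: return its k best (score, doc_id) pairs."""
    return heapq.nlargest(k, shard_hits)


def coordinate(shards, k=10):
    """Coordinating node: merge the per-shard results into the global top k."""
    candidates = []
    for hits in shards:
        candidates.extend(shard_top_k(hits, k))
    return heapq.nlargest(k, candidates)


# Hypothetical data: 3 shards, each holding (score, doc_id) pairs.
shards = [
    [(0.9, "a1"), (0.2, "a2"), (0.5, "a3")],
    [(0.7, "b1"), (0.8, "b2")],
    [(0.1, "c1"), (0.95, "c2"), (0.3, "c3")],
]

top = coordinate(shards, k=3)
print(top)
```

Each shard has to return k candidates, not k divided by the number of shards, because all of the best hits may happen to live on one shard.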
http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg
CASE STUDIES
4 REAL-LIFE PROJECTS
http://vignette1.wikia.nocookie.net/fallout/images/9/9d/FNV_Rake.png/revision/latest?cb=20140618212609&path-prefix=ru
GENERAL INFO
• 4 projects, ~2 years
• RDBMS (MySQL, PostgreSQL) as a primary data storage
• Both on-premise Elasticsearch installation (AWS, MS Azure) and SaaS (Bonsai @ Heroku)
• 1 or 2 instances in a cluster
• Data volume: gigabytes; millions of documents
• Back-end: Java, Ruby
#1. SOCIAL INFLUENCER MARKETING PLATFORM
http://www.nclurbandesign.org/wp-content/uploads/2015/05/blog-pic-b2c.jpg
• Document types: Blog Posts, Bloggers (Influencers)
• Elasticsearch usage:
– search and rank Influencers by category, keywords, tags, location, audience, influence
– search blog posts by keywords etc.
• Amount of data:
– Influencers: hundreds of thousands
– Blog Posts: millions
• ES cluster size: 2 instances
• Technology stack: Java, MySQL, DynamoDB, AWS
• Considered alternatives: Sphinx, Apache Solr
#2. JOB SITE
http://www.roberthalf.com/sites/default/files/Media_Root/Images/RH-Images/Using-a-job-search-site.jpg
• Document types: Job Postings, Jobseekers
• Find relevant jobs
– Simple one-click search
– Advanced search (title, keywords, industry, location/distance, salary, requirements)
• Elasticsearch as a Recommendation Engine. Recommend jobs based on: previously applied/viewed jobs, location, distance, schedule etc.
• 2 types of recommendations:
– Side banner (You also might be interested in…)
– E-mail subscriptions every 2 weeks
• Find appropriate candidates by location, requirements (experience, education, languages), salary expectations
• No fixed document structure (jobs from different providers)
• Full-text search
• Fuzzy search
• Geolocation (distance)
• Weighted search: boosted search clauses
• Dynamic scripting (MVEL until v1.4.0, then Groovy)
SEARCH QUERIES
SOME MORE FACTS
• Amount of data:
– Job postings: ~1M
– Applicants: ~20K
• Cluster size: 2 ‘medium’ EC2 instances
• Technology stack:
– Ruby on Rails
– Elasticsearch, PostgreSQL, Redis
– Heroku + add-ons, AWS (S3, EC2)
– Lots of 3rd party APIs and integrations
LESSONS LEARNED
• On-premise deployment (EC2) vs. SaaS (Bonsai @ Heroku)
• Dynamic scripting
• PostgreSQL as a backup search engine sucks
#3. CAR TRADING
http://bigskybeetles.com/wp-content/uploads/2014/12/restored-beetle-car.png
PARSING ADS
Price: $3900
1996 VW PASSAT SEDAN B4 TDI TURBO DIESEL 44+MPG
WAT???
• Fuzzy search (Levenshtein distance algorithm) used to parse ads and classify cars
• Elasticsearch index contains dictionary (Year, Make, Model, Trim)
• Used in conjunction with other approaches: regular expressions, dictionaries of synonyms (VW → Volkswagen, Chevy → Chevrolet), normalization (e.g. LX-370 → LX370)
• Algorithm approach:
– Parse Year (1996)
– Search most relevant Make (VW, volkswagon → Volkswagen)
– Search most relevant Model (Passat) for Make = Volkswagen, Year = 1996
– Search most relevant Trim (TDi 4dr Sedan)
• Parsing quality: 90%
https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-fuzzy-query.html
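The "search most relevant Make" step can be sketched in plain Python as a synonym lookup followed by a Levenshtein-distance match against a dictionary. This only illustrates the technique and is not the project's code: in the project the fuzzy matching is delegated to Elasticsearch, and the `MAKES` list, the `SYNONYMS` table and the `max_distance` threshold below are assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical dictionaries; the real ones live in an Elasticsearch index.
SYNONYMS = {"vw": "volkswagen", "chevy": "chevrolet"}
MAKES = ["Volkswagen", "Chevrolet", "Volvo", "Toyota"]


def closest_make(token, max_distance=2):
    """Return the dictionary make closest to token, or None if too far off."""
    word = SYNONYMS.get(token.lower(), token.lower())
    best = min(MAKES, key=lambda make: levenshtein(word, make.lower()))
    return best if levenshtein(word, best.lower()) <= max_distance else None


for token in ["VW", "volkswagon", "Passat"]:
    print(token, "->", closest_make(token))
```

The distance threshold keeps tokens that are not makes at all (such as the model name) from being forced onto the nearest dictionary entry.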
#4. [NDA]
http://cdn.4glaza.ru/images/products/large/0/bresser-junior-loupe-2x-4x-dop6.jpg
SOME UNCOVERED INFO
• Check documents against duplicate content
• Shingle analysis (commonly used by copywriters and SEO experts)
– Original text: I have a dream that one day this nation will rise up and live…
– Normalization: short words are dropped, leaving: have dream that this nation will …
– Splitting a text into shingles (n-grams), n = 3..10:
have dream that
dream that this
that this nation
this nation will
…
– Replacement of look-alike characters: Latin ‘c’ ↔ Cyrillic ‘с’
• Custom or standard ES implementation of shingle analysis
https://en.wikipedia.org/wiki/W-shingling
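A minimal Python sketch of shingle-based duplicate detection follows. It rests on two assumptions that are not stated on the slide: normalization drops every word shorter than 4 characters (which happens to reproduce the shingles listed above), and two texts are compared by the Jaccard similarity of their shingle sets, as in the W-shingling article. The slightly altered `copy` text is made up.

```python
import re


def normalize(text, min_length=4):
    """Lowercase, strip punctuation, drop words shorter than min_length."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_length]


def shingles(words, n=3):
    """Return the set of n-word shingles (word n-grams)."""
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(text_a, text_b, n=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    a = shingles(normalize(text_a), n)
    b = shingles(normalize(text_b), n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "I have a dream that one day this nation will rise up and live"
copy = "I have a dream that some day this nation will rise up and live!"

print(sorted(shingles(normalize(original))))
print(resemblance(original, copy))
```

A resemblance close to 1.0 flags a likely duplicate; where to put the cut-off is a tuning decision that depends on the content.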
QUERY API IN DEPTH
+ DEMO
FILTERS VS. QUERIES
As a general rule, filters should be used:
• for binary yes/no searches
• for queries on exact values
Filters are usually much faster than queries because they skip relevance scoring
Filters are usually great candidates for caching
27 filters available (Elasticsearch 1.7.1)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html
QUERIES VS. FILTERS
As a general rule, queries should be used instead of filters:
• for full-text search
• where the result depends on a relevance score
Common approach: filter out as many records as possible, then query the rest.
38 queries available (Elasticsearch 1.7.1)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
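The "filter first, then query" approach maps onto the `filtered` query of the Elasticsearch 1.x query DSL. The Python sketch below only builds such a request body as a dictionary; the field names `title` and `category` and the search values are hypothetical.

```python
import json


def filtered_search(text, category, size=10):
    """Build an ES 1.x 'filtered' query: filter first, then score the rest."""
    return {
        "size": size,
        "query": {
            "filtered": {
                # Exact-value, yes/no condition: no scoring, cacheable.
                "filter": {"term": {"category": category}},
                # Full-text condition: contributes to the relevance score.
                "query": {"match": {"title": text}},
            }
        },
    }


body = filtered_search("ruby developer", "it")
print(json.dumps(body, indent=2, sort_keys=True))
```

The resulting JSON would be sent as the body of a `_search` request; only documents that pass the `term` filter are scored by the `match` query.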
DEMO #2
http://www.socialtalent.co/wp-content/uploads/blog-content/computer-user-confused.jpg
SOME THEORY BEHIND RELEVANCE SCORING
full AND text AND search AND (elasticsearch OR lucene)
• Term Frequency: How often does the term appear in the document?
• Inverse Document Frequency: How often does the term appear in all documents in the collection?
• Field-length norm: How long is the field?
• TF, FLN (field-length norm) etc. are calculated and stored at index time
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
http://blog.qbox.io/optimizing-search-results-in-elasticsearch-with-scoring-and-boosting
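The three factors can be written down in a few lines of Python, using the formulas of Lucene's classic TF/IDF similarity as described in the scoring-theory guide linked above. The final product is a simplified per-term weight and not the complete practical scoring function; the sample numbers are made up.

```python
import math


def term_frequency(occurrences):
    """tf = sqrt(number of times the term appears in the field)."""
    return math.sqrt(occurrences)


def inverse_document_frequency(num_docs, doc_frequency):
    """idf = 1 + ln(numDocs / (docFreq + 1)): rare terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_frequency + 1.0))


def field_length_norm(num_terms):
    """norm = 1 / sqrt(number of terms in the field): short fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


def term_weight(occurrences, num_docs, doc_frequency, num_terms):
    """Simplified weight of one term in one field: tf * idf * norm."""
    return (term_frequency(occurrences)
            * inverse_document_frequency(num_docs, doc_frequency)
            * field_length_norm(num_terms))


# Hypothetical collection of 1000 documents, 16-term field, term appears twice.
rare = term_weight(occurrences=2, num_docs=1000, doc_frequency=9, num_terms=16)
common = term_weight(occurrences=2, num_docs=1000, doc_frequency=499, num_terms=16)
print("rare term weight:  ", rare)
print("common term weight:", common)
```

With everything else equal, the term that occurs in fewer documents gets the higher weight, which is what lets a rare word like "lucene" matter more than a common one like "search".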
MORE COOL FEATURES
• Indexing attachments: MS Office, ePub, PDF (Apache Tika)
• Autocomplete suggestion
• Did-you-mean suggestion
• Highlight results
SEARCH IMAGES
https://www.theloopyewe.com/shop/search/cd/0-100~75-90-50~18-12-12/g/59A9BAC5/
https://github.com/kzwang/elasticsearch-image
USING ELASTIC IN RAILS APPLICATIONS
APPROACHES AND TOOLS
ELASTICSEARCH-RUBY
• https://github.com/elastic/elasticsearch-ruby
• Includes two packages: elasticsearch-transport + elasticsearch-api
• Client for connecting to an Elasticsearch cluster
• Ruby API for the Elasticsearch REST API
• Various extensions and utilities
ELASTICSEARCH-RAILS
• https://github.com/elastic/elasticsearch-rails
• Includes three packages: elasticsearch-model + elasticsearch-persistence + elasticsearch-rails
• ActiveModel integration with adapters for ActiveRecord and Mongoid
• Enumerable-based wrapper for search results; ActiveRecord::Relation-based wrapper for returning search results as records
• Support for Kaminari and WillPaginate pagination
• Convenience methods for (re)creating the index, setting up mappings, indexing documents, …
• Rake tasks for importing data from application models
MY WAY (RAILS 4 APP)
Gemfile
config/environments/production.rb
MY WAY (RAILS 4 APP)
job.rb
ELASTICSEARCH SEARCH QUERY
MY WAY (RAILS 4 APP)
job_helper.rb
MY WAY (RAILS 4 APP)
elasticsearch.rake
KINDA SUMMARY
ELASTICSEARCH DRAWBACKS
• No transaction support. Elasticsearch is not a database.
• No joins, constraints and other RDBMS features
• Durability and consistency issues, data loss:
– https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0
– https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
PERFORMANCE?
http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/
http://solr-vs-elasticsearch.com/
• Apache Solr can be faster than ES in search-only scenarios, while Elasticsearch usually outperforms Solr when doing writes and reads concurrently
• Sphinx is faster at indexing (up to 15 MB/s per core)
• Performance issues can usually be fixed by horizontal scaling
SUMMARY
• ES is not a silver bullet, but it is a really powerful tool
• Elasticsearch is not an RDBMS and is not supposed to act as a database. Choose your tools properly. Leverage the synergy of DB + ES
• Elasticsearch is dead simple at the start but can get sophisticated as you go
• Kick off easily, then hire a good DevOps engineer for best results
• The ecosystem around Elasticsearch is just amazing
• Give it a try – it can bring a lot of value to your product and your CV ;)
http://www.aperfectworld.org/clipart/gestures/rockhard11.png
QUESTIONS?
http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
THANK YOU!
http://conveyancingderby.co/wp-content/uploads/2011/07/cat-card.jpg
USEFUL LINKS
• Elasticsearch: https://www.elastic.co/products/elasticsearch
• Extended presentation about Elasticsearch and its ecosystem: https://www.youtube.com/watch?v=GL7xC5kpb-c
• Scripts for the demos: https://github.com/opanchenko/morning-at-lohika-ELK