Managing Your Content With Elasticsearch
Samantha Quiñones / @ieatkillerbees
http://twitter.com/ieatkillerbees
About Me
Software Engineer & Data Nerd since 1997
Doing media stuff since 2012
Principal @ AOL since 2014
@ieatkillerbees
http://samanthaquinones.com
What We'll Cover
Intro to Elasticsearch
CRUD
Creating Mappings
Analyzers
Basic Querying & Searching
Scoring & Relevance
Aggregations Basics
But First
Download - https://www.elastic.co/downloads/elasticsearch
Clone - https://github.com/squinones/elasticsearch-tutorial.git
What is Elasticsearch?
Near real-time (documents are available for search quickly after being indexed) search engine powered by Lucene
Clustered for H/A and performance via federation with shards and replicas
What's It Used For?
Logging (we use Elasticsearch to centralize traffic logs, exception logs, and audit logs)
Content management and search
Statistical analysis
Installing Elasticsearch
$ curl -L -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.1.1/elasticsearch-2.1.1.tar.gz
$ tar -zxvf elasticsearch*
$ cd elasticsearch-2.1.1/bin
$ ./elasticsearch
Connecting to Elasticsearch
Via Java, there are two native clients which connect to an ES cluster on port 9300
Most commonly, we access Elasticsearch via its HTTP API
HTTP API
curl -X GET "http://localhost:9200/?pretty"
Data Format
Elasticsearch is a document-oriented database
All operations are performed against documents (object graphs expressed as JSON)
Analogues
Elasticsearch  MySQL     MongoDB
Index          Database  Database
Type           Table     Collection
Document       Row       Document
Field          Column    Field
Index Madness
Index is an overloaded term.
As a verb, to index a document is to store it in an index. This is analogous to an SQL INSERT operation.
As a noun, an index is a collection of documents.
Fields within a document have inverted indexes, similar to how a column in an SQL table may have an index.
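To make the inverted-index idea concrete, here is a minimal Python sketch. It assumes a naive lowercase, split-on-whitespace tokenizer; the function name and sample documents are hypothetical, and Lucene's real analyzers and index structures are considerably more sophisticated.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it.

    Tokenization is a naive lowercase + whitespace split (an assumption
    for illustration, not how Elasticsearch analyzers actually work).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents keyed by ID
docs = {
    1: "PHP for loop",
    2: "Python for beginners",
    3: "PHP array sorting",
}

inverted = build_inverted_index(docs)
print(inverted["php"])  # IDs of documents containing "php"
print(inverted["for"])  # IDs of documents containing "for"
```

Looking up a term is then a single dictionary access rather than a scan of every document, which is roughly why full-text search on an indexed field is fast.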
Indexing Our First Document
curl -X PUT "http://localhost:9200/test_document/test/1" -d '{ "name": "test_name" }'
Retrieving Our First Document
curl -X GET "http://localhost:9200/test_document/test/1"
Let's Look at Some Stack Overflow Posts!
$ vi queries/bulk_insert_so_data.json
Bulk Insert
curl -X PUT "http://localhost:9200/_bulk" --data-binary "@queries/bulk_insert_so_data.json"
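The bulk endpoint expects newline-delimited JSON: an action line followed by the document source, with every line (including the last) terminated by a newline. The sketch below builds such a body in Python. The type name "post" and the sample fields are assumptions for illustration; the actual contents of queries/bulk_insert_so_data.json are not shown in these slides.

```python
import json


def build_bulk_body(index, doc_type, docs):
    """Return a bulk request body: one action line plus one source line per doc.

    Every line, including the last, must end with a newline.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


# Hypothetical documents; the real tutorial data file may use other fields
sample_docs = [
    ("1", {"title": "How do I write a for loop in PHP?", "tags": ["php"], "score": 5}),
    ("2", {"title": "Flask routing basics", "tags": ["python", "flask"], "score": 2}),
]

body = build_bulk_body("stack_overflow", "post", sample_docs)
print(body, end="")
```

Writing this string to a file and passing it with --data-binary keeps the newlines intact, which is why the curl command above uses that flag rather than -d.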
First Search
curl -X GET "http://localhost:9200/stack_overflow/_search"
Query String Searches
curl -X GET "http://localhost:9200/stack_overflow/_search?q=title:php"
Query DSL
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title" : "php" } } }'
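The same Query DSL request can be assembled from code. This is a sketch using only the Python standard library; the helper name build_search_request is hypothetical, and the request is built but not sent so it can be inspected without a running node.

```python
import json
import urllib.request


def build_search_request(index, field, text, host="http://localhost:9200"):
    """Build (but do not send) a POST request carrying a match query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url="{}/{}/_search".format(host, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("stack_overflow", "title", "php")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually run the search against a local node, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Because the query is an ordinary dictionary before it is serialized, larger queries can be composed programmatically instead of by string concatenation.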
Compound Queries
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "filtered": { "query" : { "query_string" : { "default_field" : "title", "query" : "(php OR python) AND (flask OR laravel)" } }, "filter": { "range": { "score": { "gt": 3 } } } } } }'
Full-Text Searching
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title" : "php loop" } } }'
Relevancy
When searching (in query context), results are scored by a relevancy algorithm
Results are presented in order from highest to lowest score
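As a rough illustration of score-ordered results, the toy Python sketch below scores each title by how often the query terms appear in it and sorts the hits from highest to lowest. This is a deliberately simplified stand-in, not Lucene's actual relevance formula; the function names and sample titles are hypothetical.

```python
def score(query, text):
    """Toy relevance: count how often each query term occurs in the text."""
    terms = text.lower().split()
    return sum(terms.count(q) for q in query.lower().split())


def search(query, docs):
    """Return (score, title) pairs for matching docs, highest score first."""
    scored = [(score(query, title), title) for title in docs]
    hits = [pair for pair in scored if pair[0] > 0]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


# Hypothetical titles
titles = [
    "php loop over array",
    "python list comprehension",
    "php array sorting",
]

for points, title in search("php loop", titles):
    print(points, title)
```

A title matching both terms outranks one matching only "php", and a title matching neither is dropped from the results.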
Phrase Searching
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title": { "query": "for loop", "type": "phrase" } } } }'
Highlighting Searches
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title": { "query": "for loop", "type": "phrase" } } }, "highlight": { "fields" : { "title" : {} } } }'
Aggregations
Run statistical operations over your data
Also near real-time!
Complex aggregations are abstracted away behind simple interfaces, so you don't need to be a statistician
Analyzing Tags
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 0, "aggs": { "all_tags": { "terms": { "field": "tags", "size": 0 } } } }'
Nesting Aggregations
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 0, "aggs": { "all_tags": { "terms": { "field": "tags", "size": 0 }, "aggs": { "avg_score": { "avg": { "field": "score"} } } } } }'
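Conceptually, a terms aggregation with a nested avg aggregation buckets documents by tag and then averages the score within each bucket. The in-memory Python sketch below mimics that result shape on a few hypothetical posts; it illustrates the idea only and says nothing about how Elasticsearch computes it internally.

```python
from collections import defaultdict


def tags_with_avg_score(docs):
    """Group docs by tag; return {tag: (doc_count, avg_score)}."""
    scores_by_tag = defaultdict(list)
    for doc in docs:
        for tag in doc["tags"]:
            scores_by_tag[tag].append(doc["score"])
    return {
        tag: (len(scores), sum(scores) / len(scores))
        for tag, scores in scores_by_tag.items()
    }


# Hypothetical posts with the two fields the aggregation uses
posts = [
    {"tags": ["php", "laravel"], "score": 4},
    {"tags": ["php"], "score": 2},
    {"tags": ["python"], "score": 5},
]

buckets = tags_with_avg_score(posts)
for tag, (count, avg) in sorted(buckets.items()):
    print(tag, count, avg)
```

A post with several tags contributes to every matching bucket, which mirrors how a multi-valued field behaves in a terms aggregation.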
Break Time!
Under the Hood
Elasticsearch is designed from the ground up to run in a distributed fashion.
Indices (collections of documents) are partitioned into shards.
Shards can be stored on a single node or spread across multiple nodes.
Shards are balanced across the cluster to improve performance
Shards are replicated for redundancy and high availability
What is a Cluster?
One or more nodes (servers) that work together to
serve a dataset that exceeds the capacity of a single server
provide federated indexing (writes) and searching (reads)
provide H/A through sharing and replication of data
What are Nodes?
Individual servers within a cluster
Can provide indexing and searching capabilities
What is an Index?
An index is logically a collection of documents, roughly analogous to a database in MySQL
An index is in reality a namespace that points to one or more physical shards which contain data
When indexing a document, if the specified index does not exist, it will be created automatically
What are Shards?
Low-level units that hold a slice of available data
A shard represents a single instance of Lucene and is a fully functional, self-contained search engine
Shards are either primary or replicas and are assigned to nodes
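Which shard holds a given document is decided by hashing a routing value (by default the document ID) and taking the remainder modulo the number of primary shards. The Python sketch below shows the shape of that calculation under stated assumptions: zlib.crc32 is a stand-in hash chosen for illustration, and pick_shard is a hypothetical helper, so the shard numbers it prints will not match a real cluster.

```python
import zlib


def pick_shard(routing, number_of_primary_shards=5):
    """Choose a primary shard as hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash for illustration; Elasticsearch uses its
    own hash function, so real shard numbers will differ.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


for doc_id in ["1", "2", "3", "42"]:
    print(doc_id, "-> shard", pick_shard(doc_id))
```

Because the result depends on the number of primary shards, that number is fixed when an index is created; changing it later would send existing document IDs to different shards.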
What is Replication?
Shards can have replicas
Replicas primarily provide redundancy for when shards/nodes fail
A replica should not be allocated on the same node as the shard it replicates
Default Topology
5 primary shards per index
1 replica per shard
Clustering & Replication
[Diagram: two nodes (NODE, NODE) holding ten shards between them: R1 P2 P3 R2 R3 P4 R5 P1 R4 P5]
Cluster Health
curl -X GET "http://localhost:9200/_cluster/health"
curl -X GET "http://localhost:9200/_cat/health?v"
_cat API
Display human-readable information about parts of the ES system
Provides some limited documentation of functions
aliases
> $ http GET ':9200/_cat/aliases?v'
alias        index               filter routing.index routing.search
posts        posts_561729df8ce4e *      -             -
posts.public posts_561729df8ce4e *      -             -
posts.write  posts_561729df8ce4e -      -             -
Display all configured aliases
allocation
> $ http GET ':9200/_cat/allocation?v'
shards disk.used disk.avail disk.total disk.percent host
33     2.6gb     21.8gb     24.4gb     10           host1
33     3gb       21.4gb     24.4gb     12           host2
34     2.6gb     21.8gb     24.4gb     10           host3
Show how many shards are allocated per node, with disk utilization info
count
> $ http GET ':9200/_cat/count?v'
epoch      timestamp count
1453790185 06:36:25  182763
> $ http GET ':9200/_cat/count/posts?v'
epoch      timestamp count
1453790467 06:41:07  164169
> $ http GET ':9200/_cat/count/posts.public?v'
epoch      timestamp count
1453790472 06:41:12  164169
Display a count of documents in the cluster, or a specific index
fielddata
> $ http -b GET ':9200/_cat/fielddata?v'
id                     host  ip            node  total site_id published
7tjeJNY3TMajqRkmYsJyrA host1 10.97.183.146 node1 1.1mb 170.1kb 996.5kb
__xrpsKAQW6yyCY8luLQdQ host2 10.97.180.138 node2 1.6mb 329.3kb 1.3mb
bdoNNXHXRryj22YqjnqECw host3 10.97.181.190 node3 1.1mb 154.7kb 991.7kb
Shows how much memory is allocated to fielddata (metadata used for sorts)
health
> $ http -b GET ':9200/_cat/health?v'
epoch      timestamp cluster              status node.total node.data shards pri relo init unassign pending_tasks
1453829723 17:35:23  ampehes_prod_cluster green  3          3         100    50  0    0    0        0
indices
> $ http -b GET 'eventhandler-prod.elasticsearch.amppublish.aws.aol.com:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
green