Managing Your Content with Elasticsearch

  • Managing Your Content With Elasticsearch

    Samantha Quiñones / @ieatkillerbees

  • About Me

    Software Engineer & Data Nerd since 1997

    Doing media stuff since 2012

    Principal @ AOL since 2014

    @ieatkillerbees

  • What We'll Cover

    Intro to Elasticsearch


    Creating Mappings


    Basic Querying & Searching

    Scoring & Relevance

    Aggregations Basics

  • But First

    Download -

    Clone -

  • What is Elasticsearch?

    Near real-time (documents are available for search quickly after being indexed) search engine powered by Lucene

    Clustered for H/A and performance via federation with shards and replicas

  • What's It Used For?

    Logging (we use Elasticsearch to centralize traffic logs, exception logs, and audit logs)

    Content management and search

    Statistical analysis

  • Installing Elasticsearch

    $ curl -L -O

    $ tar -zxvf elasticsearch*

    $ cd elasticsearch-2.1.1/bin

    $ ./elasticsearch

  • Connecting to Elasticsearch

    Via Java, there are two native clients which connect to an ES cluster on port 9300

    Most commonly, we access Elasticsearch via HTTP API


    curl -X GET "http://localhost:9200/?pretty"

  • Data Format

    Elasticsearch is a document-oriented database

    All operations are performed against documents (object graphs expressed as JSON)

  • Analogues

    Elasticsearch MySQL MongoDB

    Index Database Database

    Type Table Collection

    Document Row Document

    Field Column Field

  • Index Madness

    Index is an overloaded term.

    As a verb, to index a document is to store it in an index. This is analogous to an SQL INSERT operation.

    As a noun, an index is a collection of documents.

    Fields within a document have inverted indexes, similar to how a column in an SQL table may have an index.
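
    The inverted index mentioned above can be pictured as a map from each term to the documents that contain it. The following is a minimal sketch of that idea, not Lucene's actual data structure; the sample titles and the whitespace tokenizer are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenizer; real analyzers do much more.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample titles, not taken from the workshop data set.
titles = {
    1: "PHP for loop",
    2: "Python for loop",
    3: "PHP array sort",
}
title_index = build_inverted_index(titles)
print(title_index["php"])   # [1, 3]
print(title_index["loop"])  # [1, 2]
```

    Looking up a term is then a single dictionary access, which is why searching a field is fast even when the index holds many documents.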

  • Indexing Our First Document

    curl -X PUT "http://localhost:9200/test_document/test/1" -d '{ "name": "test_name" }'

  • Retrieving Our First Document

    curl -X GET "http://localhost:9200/test_document/test/1"

  • Let's Look at Some Stack Overflow Posts!

    $ vi queries/bulk_insert_so_data.json

  • Bulk Insert

    curl -X PUT "http://localhost:9200/_bulk" --data-binary "@queries/bulk_insert_so_data.json"
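
    The file passed to _bulk is newline-delimited JSON: an action line followed by the document source, repeated for every document. Here is a rough sketch of how such a body could be generated; the type name "post" and the sample fields are assumptions, and the workshop's own bulk file may be laid out differently.

```python
import json


def to_bulk_body(index, doc_type, docs):
    """Render (doc_id, source) pairs as newline-delimited JSON for _bulk."""
    lines = []
    for doc_id, source in docs:
        # Action line: tells Elasticsearch where the next document goes.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(source))
    # The bulk body has to end with a newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical documents; "post" is an assumed type name.
posts = [
    (1, {"title": "PHP for loop", "score": 4}),
    (2, {"title": "Python for loop", "score": 7}),
]
bulk_body = to_bulk_body("stack_overflow", "post", posts)
print(bulk_body, end="")
```

    Each JSON object must stay on one line, which is why the curl command uses --data-binary: it sends the file without stripping the newlines.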

  • First Search

    curl -X GET "http://localhost:9200/stack_overflow/_search"

  • Query String Searches

    curl -X GET "http://localhost:9200/stack_overflow/_search?q=title:php"

  • Query DSL

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title" : "php" } } }'

  • Compound Queries

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "filtered": { "query" : { "match" : { "title" : "(php OR python) AND (flask OR laravel)" } }, "filter": { "range": { "score": { "gt": 3 } } } } } }'

  • Full-Text Searching

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title" : "php loop" } } }'

  • Relevancy

    When searching (in query context), results are scored by a relevancy algorithm

    Results are presented in order from highest to lowest score
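
    To get a feel for how a relevancy score orders results, here is a toy ranking function. It only borrows the term-frequency and inverse-document-frequency ideas; it is a simplified illustration and not the formula Lucene applies, which includes further factors such as field-length normalization. The titles are made up.

```python
import math


def tf_idf_scores(query, docs):
    """Score each document for the query with a simplified TF-IDF sum."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    total = len(tokenized)
    scores = {}
    for doc_id, terms in tokenized.items():
        score = 0.0
        for term in query.lower().split():
            tf = terms.count(term)
            if tf == 0:
                continue
            # Number of documents that contain the term at least once.
            df = sum(1 for t in tokenized.values() if term in t)
            # Rarer terms weigh more; the +1 keeps the weight positive.
            idf = 1.0 + math.log(total / df)
            score += math.sqrt(tf) * idf
        if score > 0:
            scores[doc_id] = score
    # Highest score first, mirroring how search results are ordered.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample titles.
titles = {
    1: "php for loop",
    2: "python for loop",
    3: "php array sort",
}
ranked = tf_idf_scores("php loop", titles)
print(ranked[0][0])  # 1, the only title containing both query terms
```

    Documents 2 and 3 each match one of the two terms, so they still appear in the results but score lower than document 1.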

  • Phrase Searching

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title": { "query": "for loop", "type": "phrase" } } } }'
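
    The difference from a plain match query is that a phrase query requires the terms to appear next to each other and in the same order. A minimal sketch of that rule, assuming a simple whitespace tokenizer rather than a real analyzer:

```python
def matches_phrase(text, phrase):
    """Return True when the phrase terms appear consecutively and in order."""
    terms = text.lower().split()
    wanted = phrase.lower().split()
    width = len(wanted)
    # Slide a window of the phrase length across the text's terms.
    return any(
        terms[start:start + width] == wanted
        for start in range(len(terms) - width + 1)
    )


print(matches_phrase("PHP for loop syntax", "for loop"))  # True
print(matches_phrase("loop for each item", "for loop"))   # False
```

    The second title contains both words, so a plain match query would return it, but the phrase query would not.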

  • Highlighting Searches

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title": { "query": "for loop", "type": "phrase" } } }, "highlight": { "fields" : { "title" : {} } } }'
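
    In the response, each hit gains a highlight section in which the matching terms are wrapped in tags (<em> by default). The following sketch imitates that wrapping on a plain string; it is an illustration of the output shape only, not how Elasticsearch's highlighters are implemented.

```python
import re


def highlight(text, terms, pre="<em>", post="</em>"):
    """Wrap whole-word, case-insensitive occurrences of terms in tags."""
    if not terms:
        return text
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    # Keep the original casing of the matched text inside the tags.
    return pattern.sub(lambda match: pre + match.group(0) + post, text)


print(highlight("PHP for loop", ["for", "loop"]))
# PHP <em>for</em> <em>loop</em>
```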

  • Aggregations

    Run statistical operations over your data

    Also near real-time!

    Complex aggregations are abstracted away behind simple interfaces; you don't need to be a statistician

  • Analyzing Tags

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 0, "aggs": { "all_tags": { "terms": { "field": "tags", "size": 0 } } } }'

  • Nesting Aggregations

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 0, "aggs": { "all_tags": { "terms": { "field": "tags", "size": 0 }, "aggs": { "avg_score": { "avg": { "field": "score"} } } } } }'
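
    Conceptually, the nested aggregation first groups posts into one bucket per tag and then computes the average score inside each bucket. The sketch below does the same thing in memory over made-up posts; the flattened result shape is a simplification of the real response.

```python
from collections import defaultdict


def tags_with_avg_score(posts):
    """Bucket posts by tag, then compute doc_count and average score per bucket."""
    buckets = defaultdict(list)
    for post in posts:
        # A post with several tags lands in several buckets.
        for tag in post["tags"]:
            buckets[tag].append(post["score"])
    results = [
        {"key": tag, "doc_count": len(scores), "avg_score": sum(scores) / len(scores)}
        for tag, scores in buckets.items()
    ]
    # Largest buckets first; ties are broken by tag name for a stable order.
    return sorted(results, key=lambda bucket: (-bucket["doc_count"], bucket["key"]))


# Hypothetical posts, not taken from the workshop data set.
posts = [
    {"tags": ["php", "loops"], "score": 4},
    {"tags": ["python", "loops"], "score": 8},
    {"tags": ["php"], "score": 2},
]
for bucket in tags_with_avg_score(posts):
    print(bucket)
```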

  • Break Time!

  • Under the Hood

    Elasticsearch is designed from the ground-up to run in a distributed fashion.

    Indices (collections of documents) are partitioned into shards.

    Shards can be stored on a single node or spread across multiple nodes.

    Shards are balanced across the cluster to improve performance

    Shards are replicated for redundancy and high availability

  • What is a Cluster?

    One or more nodes (servers) that work together to

    serve a dataset that exceeds the capacity of a single server

    provide federated indexing (writes) and searching (reads)

    provide H/A through sharing and replication of data

  • What are Nodes?

    Individual servers within a cluster

    Can provide indexing and searching capabilities

  • What is an Index?

    An index is logically a collection of documents, roughly analogous to a database in MySQL

    An index is in reality a namespace that points to one or more physical shards which contain data

    When indexing a document, if the specified index does not exist, it will be created automatically

  • What are Shards?

    Low-level units that hold a slice of available data

    A shard represents a single instance of Lucene and is a fully functional, self-contained search engine

    Shards are either primary or replicas and are assigned to nodes
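
    A document is routed to a primary shard by hashing a routing value (the document ID by default) and taking the remainder after dividing by the number of primary shards. The sketch below shows that arithmetic; zlib.crc32 is only a stand-in chosen for illustration and is not the hash function Elasticsearch uses, so the shard numbers it produces will not match a real cluster.

```python
import zlib


def pick_shard(doc_id, number_of_primary_shards=5):
    """Choose a primary shard number for a document ID."""
    # Stand-in hash for illustration only.
    digest = zlib.crc32(str(doc_id).encode("utf-8"))
    return digest % number_of_primary_shards


for doc_id in ("1", "2", "3"):
    print(doc_id, "-> shard", pick_shard(doc_id))
```

    Because the shard count is part of the formula, changing the number of primary shards would send existing IDs to different shards, which is why that number is fixed when an index is created.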

  • What is Replication?

    Shards can have replicas

    Replicas primarily provide redundancy for when shards/nodes fail

    A replica should not be allocated on the same node as the primary shard it replicates

  • Default Topology

    5 primary shards per index

    1 replica per shard
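
    To see what the default topology means for placement, here is a deliberately simple round-robin sketch that spreads 5 primaries and their replicas over three nodes while keeping each replica off its primary's node. The node names are hypothetical, and the real allocator weighs many more factors than this.

```python
def allocate(primaries=5, replicas=1, nodes=("node1", "node2", "node3")):
    """Place primaries round-robin and each replica on a different node."""
    if len(nodes) <= replicas:
        # Simplification: a real cluster would leave such replicas unassigned.
        raise ValueError("need more nodes than replicas per shard")
    layout = {node: [] for node in nodes}
    for shard in range(1, primaries + 1):
        home = (shard - 1) % len(nodes)
        layout[nodes[home]].append(f"P{shard}")
        for copy in range(1, replicas + 1):
            # Offsetting from the primary's node keeps copies apart.
            layout[nodes[(home + copy) % len(nodes)]].append(f"R{shard}")
    return layout


for node, shards in allocate().items():
    print(node, shards)
```

    With these defaults the cluster holds 10 shards in total, and losing any single node still leaves one copy of every shard available.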

  • Clustering & Replication

    [Diagram: primary shards P1-P5 and replica shards R1-R5 distributed across nodes]

  • Cluster Health

    curl -X GET "http://localhost:9200/_cluster/health"

    curl -X GET "http://localhost:9200/_cat/health?v"
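
    The status field in the health response is green, yellow, or red. The rule behind those colors can be sketched as follows; the list-of-dicts input is a made-up representation for illustration and not the shape of the actual API response.

```python
def cluster_status(shards):
    """Derive a health color from shard assignment.

    shards: list of dicts with boolean keys 'primary' and 'assigned'.
    """
    # Any unassigned primary means some data is unavailable.
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"
    # All primaries are assigned, but at least one replica is not.
    if any(not s["assigned"] for s in shards):
        return "yellow"
    return "green"


# A single-node setup: the primary is assigned, its replica has nowhere to go.
single_node = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": False},
]
print(cluster_status(single_node))  # yellow
```

    This is why a fresh single-node install with the default topology reports yellow: its replicas cannot be placed on the same node as their primaries.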

  • _cat API

    Display human-readable information about parts of the ES system

    Provides some limited documentation of functions
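
    Because _cat output with ?v is a whitespace-aligned table with a header row, it is easy to consume from scripts. A small parsing sketch, assuming no cell is empty or contains spaces; the sample text is the count output shown in this deck.

```python
def parse_cat_table(text):
    """Turn '?v' style _cat output into a list of row dictionaries."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    # Assumes every row has one value per header column.
    return [dict(zip(header, line.split())) for line in lines[1:]]


sample = """\
epoch      timestamp count
1453790185 06:36:25  182763
"""
rows = parse_cat_table(sample)
print(rows[0]["count"])  # 182763
```

    All values come back as strings, so numeric columns such as count need an explicit int() before doing arithmetic on them.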

  • aliases

    > $ http GET ':9200/_cat/aliases?v'

    alias        index               filter routing.index routing.search
    posts        posts_561729df8ce4e *      -             -
    posts.public posts_561729df8ce4e *      -             -
    posts.write  posts_561729df8ce4e -      -             -

    Display all configured aliases

  • allocation

    > $ http GET ':9200/_cat/allocation?v'

    shards disk.used disk.avail disk.total disk.percent host
        33     2.6gb     21.8gb     24.4gb           10 host1
        33       3gb     21.4gb     24.4gb           12 host2
        34     2.6gb     21.8gb     24.4gb           10 host3

    Show how many shards are allocated per node, with disk utilization info

  • count

    > $ http GET ':9200/_cat/count?v'

    epoch      timestamp count
    1453790185 06:36:25  182763

    > $ http GET ':9200/_cat/count/posts?v'

    epoch      timestamp count
    1453790467 06:41:07  164169

    > $ http GET ':9200/_cat/count/posts.public?v'

    epoch      timestamp count
    1453790472 06:41:12  164169

    Display a count of documents in the cluster, or a specific index

  • fielddata

    > $ http -b GET ':9200/_cat/fielddata?v'

    id                     host  ip node  total site_id published
    7tjeJNY3TMajqRkmYsJyrA host1    node1 1.1mb 170.1kb 996.5kb
    __xrpsKAQW6yyCY8luLQdQ host2    node2 1.6mb 329.3kb 1.3mb
    bdoNNXHXRryj22YqjnqECw host3    node3 1.1mb 154.7kb 991.7kb

    Shows how much memory is allocated to fielddata (metadata used for sorts)

  • health

    > $ http -b GET ':9200/_cat/health?v'

    epoch      timestamp cluster              status node.total node.data shards pri relo init unassign pending_tasks
    1453829723 17:35:23  ampehes_prod_cluster green           3         3    100  50    0    0        0             0

  • indices

    > $ http -b GET ''

    health status index pri rep docs.count docs.deleted store.size
    green
