Managing Your Content with Elasticsearch

  • Managing Your Content With Elasticsearch

    Samantha Quiñones / @ieatkillerbees

    http://twitter.com/ieatkillerbees

  • About Me

    Software Engineer & Data Nerd since 1997

    Doing media stuff since 2012

    Principal @ AOL since 2014

    @ieatkillerbees

    http://samanthaquinones.com

  • What We'll Cover

    Intro to Elasticsearch

    CRUD

    Creating Mappings

    Analyzers

    Basic Querying & Searching

    Scoring & Relevance

    Aggregations Basics

  • But First

    Download - https://www.elastic.co/downloads/elasticsearch

    Clone - https://github.com/squinones/elasticsearch-tutorial.git

  • What is Elasticsearch?

    Near real-time (documents are available for search quickly after being indexed) search engine powered by Lucene

    Clustered for H/A and performance via federation with shards and replicas

  • What's it Used For?

    Logging (we use Elasticsearch to centralize traffic logs, exception logs, and audit logs)

    Content management and search

    Statistical analysis

  • Installing Elasticsearch

    $ curl -L -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.1.1/elasticsearch-2.1.1.tar.gz

    $ tar -zxvf elasticsearch*

    $ cd elasticsearch-2.1.1/bin

    $ ./elasticsearch

  • Connecting to Elasticsearch

    Via Java, there are two native clients which connect to an ES cluster on port 9300

    Most commonly, we access Elasticsearch via the HTTP API

  • HTTP API

    curl -X GET "http://localhost:9200/?pretty"

  • Data Format

    Elasticsearch is a document-oriented database

    All operations are performed against documents (object graphs expressed as JSON)

  • Analogues

    Elasticsearch MySQL MongoDB

    Index Database Database

    Type Table Collection

    Document Row Document

    Field Column Field

  • Index Madness

    Index is an overloaded term.

    As a verb, to index a document is to store a document in an index. This is analogous to an SQL INSERT operation.

    As a noun, an index is a collection of documents.

    Fields within a document have inverted indexes, similar to how a column in an SQL table may have an index.
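
    The inverted index mentioned above can be illustrated with a minimal Python sketch. It is a simplification for intuition only: Lucene's real structures also track positions and frequencies, and the sample titles here are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to the sorted IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization; a real analyzer does much more.
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Hypothetical titles, not taken from the tutorial data set.
titles = {
    1: "PHP for loop",
    2: "Python for loop",
    3: "PHP arrays",
}
title_index = build_inverted_index(titles)
print(title_index["php"])
print(title_index["loop"])
```

    Looking up a term is then a dictionary access rather than a scan over every document, which is why searches on an indexed field are fast.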

  • Indexing Our First Document

    curl -X PUT "http://localhost:9200/test_document/test/1" -d '{ "name": "test_name" }'

  • Retrieving Our First Document

    curl -X GET "http://localhost:9200/test_document/test/1"

  • Let's Look at Some Stack Overflow Posts!

    $ vi queries/bulk_insert_so_data.json

  • Bulk Insert

    curl -X PUT "http://localhost:9200/_bulk" --data-binary "@queries/bulk_insert_so_data.json"
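
    The bulk endpoint expects newline-delimited JSON: an action line followed by the document source, with a trailing newline at the end. A minimal sketch of building such a body in Python follows; the type name "post" and the sample documents are assumptions, not the contents of the tutorial file.

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build a newline-delimited bulk body: one action line, then one source line per document."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical documents; field names follow the queries used later in the slides.
sample_docs = [
    (1, {"title": "PHP for loop", "score": 4, "tags": ["php"]}),
    (2, {"title": "Python decorators", "score": 7, "tags": ["python"]}),
]
bulk_body = build_bulk_body("stack_overflow", "post", sample_docs)
print(bulk_body, end="")
```

    The printed text could be saved to a file and sent with --data-binary as in the curl command above; --data-binary matters because it preserves the newlines.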

  • First Search

    curl -X GET "http://localhost:9200/stack_overflow/_search"

  • Query String Searches

    curl -X GET "http://localhost:9200/stack_overflow/_search?q=title:php"

  • Query DSL

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title" : "php" } } }'

  • Compound Queries

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "filtered": { "query" : { "match" : { "title" : "(php OR python) AND (flask OR laravel)" } }, "filter": { "range": { "score": { "gt": 3 } } } } } }'
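
    The filtered query combines a scoring query with a non-scoring filter. In Elasticsearch 2.x the same combination can also be written as a bool query with must and filter clauses. The sketch below builds that request body in Python; the helper name and the simplified match text are assumptions for illustration.

```python
import json

def query_with_filter(query_clause, filter_clause):
    """Wrap a scoring query and a non-scoring filter in a bool query."""
    return {"query": {"bool": {"must": query_clause, "filter": filter_clause}}}

search_body = query_with_filter(
    {"match": {"title": "php laravel"}},   # contributes to the relevance score
    {"range": {"score": {"gt": 3}}},       # only includes or excludes documents
)
print(json.dumps(search_body, indent=2))
```

    The printed JSON can be passed to curl with -d, the same way as the other request bodies in these slides.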

  • Full-Text Searching

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title" : "php loop" } } }'

  • Relevancy

    When searching (in query context), results are scored by a relevancy algorithm

    Results are presented in order from highest to lowest score
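
    To give a feel for how a relevancy score can arise, here is a simplified TF/IDF sketch in Python. It is not Elasticsearch's actual scoring function, which also applies field-length norms and other factors; the sample titles are hypothetical.

```python
import math

def rank_by_tf_idf(docs, query):
    """Return (doc_id, score) pairs sorted from highest to lowest score."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    num_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        score = 0.0
        for term in query.lower().split():
            term_freq = tokens.count(term)
            doc_freq = sum(1 for other in tokenized.values() if term in other)
            if term_freq:
                tf = math.sqrt(term_freq)                          # more occurrences, higher score
                idf = 1.0 + math.log(num_docs / (doc_freq + 1.0))  # rarer terms weigh more
                score += tf * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical titles for illustration.
example_titles = {1: "php for loop", 2: "php arrays", 3: "python for loop examples"}
ranking = rank_by_tf_idf(example_titles, "php loop")
print(ranking)
```

    Document 1 matches both query terms and therefore ranks first, while the documents matching only one term share a lower score.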

  • Phrase Searching

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title": { "query": "for loop", "type": "phrase" } } } }'
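
    A phrase search differs from a plain match in that the terms must appear next to each other and in order. The following Python sketch shows that idea on whitespace-split tokens; it is an illustration under simplifying assumptions, not how Lucene implements phrase queries.

```python
def phrase_match(text, phrase):
    """Return True if the phrase's tokens appear consecutively and in order in the text."""
    tokens = text.lower().split()
    wanted = phrase.lower().split()
    last_start = len(tokens) - len(wanted)
    return any(tokens[i:i + len(wanted)] == wanted for i in range(last_start + 1))

# Hypothetical titles for illustration.
print(phrase_match("PHP for loop question", "for loop"))
print(phrase_match("Loop for beginners", "for loop"))
```

    The second title contains both words but in the wrong order, so a plain match query would find it while a phrase query would not.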

  • Highlighting Searches

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title": { "query": "for loop", "type": "phrase" } } }, "highlight": { "fields" : { "title" : {} } } }'

  • Aggregations

    Run statistical operations over your data

    Also near real-time!

    Complex aggregations are abstracted away behind simple interfaces; you don't need to be a statistician

  • Analyzing Tags

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 0, "aggs": { "all_tags": { "terms": { "field": "tags", "size": 0 } } } }'

  • Nesting Aggregations

    curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 0, "aggs": { "all_tags": { "terms": { "field": "tags", "size": 0 }, "aggs": { "avg_score": { "avg": { "field": "score"} } } } } }'
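
    Conceptually, the nested aggregation groups documents into one bucket per tag and then averages the score inside each bucket. A minimal in-memory Python sketch of that computation follows; the sample posts are hypothetical, and the output shape only loosely mirrors an Elasticsearch response.

```python
from collections import defaultdict

def avg_score_by_tag(docs):
    """Bucket documents by tag, then compute the average score per bucket."""
    scores_by_tag = defaultdict(list)
    for doc in docs:
        for tag in doc["tags"]:
            scores_by_tag[tag].append(doc["score"])
    buckets = [
        {"key": tag, "doc_count": len(scores), "avg_score": sum(scores) / len(scores)}
        for tag, scores in scores_by_tag.items()
    ]
    # The terms aggregation orders buckets by document count, largest first, by default.
    return sorted(buckets, key=lambda bucket: bucket["doc_count"], reverse=True)

# Hypothetical posts for illustration.
posts = [
    {"tags": ["php", "laravel"], "score": 4},
    {"tags": ["php"], "score": 2},
    {"tags": ["python"], "score": 9},
]
tag_buckets = avg_score_by_tag(posts)
print(tag_buckets[0])
```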

  • Break Time!

  • Under the Hood

    Elasticsearch is designed from the ground-up to run in a distributed fashion.

    Indices (collections of documents) are partitioned into shards.

    Shards can be stored on a single node or spread across multiple nodes.

    Shards are balanced across the cluster to improve performance

    Shards are replicated for redundancy and high availability

  • What is a Cluster?

    One or more nodes (servers) that work together to

    serve a dataset that exceeds the capacity of a single server

    provide federated indexing (writes) and searching (reads)

    provide H/A through sharding and replication of data

  • What are Nodes?

    Individual servers within a cluster

    Can provide indexing and searching capabilities

  • What is an Index?

    An index is logically a collection of documents, roughly analogous to a database in MySQL

    An index is in reality a namespace that points to one or more physical shards which contain data

    When indexing a document, if the specified index does not exist, it will be created automatically

  • What are Shards?

    Low-level units that hold a slice of available data

    A shard represents a single instance of Lucene and is a fully functional, self-contained search engine

    Shards are either primary or replicas and are assigned to nodes
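
    A document lands on a primary shard by hashing a routing value (by default the document ID) and taking the remainder modulo the number of primary shards. The Python sketch below shows the idea; zlib.crc32 is a stand-in chosen for illustration, since Elasticsearch uses its own hash function, so the shard numbers printed here will not match a real cluster.

```python
import zlib

def pick_shard(routing_value, number_of_primary_shards=5):
    """Choose a primary shard as hash(routing) modulo the number of primary shards."""
    # crc32 is an assumption used only to get a stable hash for the example.
    hashed = zlib.crc32(str(routing_value).encode("utf-8"))
    return hashed % number_of_primary_shards

for doc_id in ("1", "2", "3", "42"):
    print(doc_id, "-> shard", pick_shard(doc_id))
```

    Because the result depends on the number of primary shards, that number cannot be changed after an index is created without reindexing.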

  • What is Replication?

    Shards can have replicas

    Replicas primarily provide redundancy for when shards/nodes fail

    A replica should not be allocated on the same node as the primary shard it replicates

  • Default Topology

    5 primary shards per index

    1 replica per shard
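
    With the defaults above, an index has 5 primaries plus 5 replicas, 10 shard copies in total. The Python sketch below places them round-robin across nodes while keeping each replica off its primary's node. It is a toy allocator under stated assumptions (the function and node names are hypothetical), not Elasticsearch's actual allocation algorithm.

```python
def plan_allocation(num_primaries=5, num_replicas=1, nodes=("node1", "node2")):
    """Place primaries round-robin and each replica on a different node than its primary."""
    layout = {node: [] for node in nodes}
    for shard in range(1, num_primaries + 1):
        primary_node = (shard - 1) % len(nodes)
        layout[nodes[primary_node]].append(f"P{shard}")
        for replica in range(1, num_replicas + 1):
            if replica >= len(nodes):
                # No node left that does not already hold a copy: leave it unassigned.
                break
            replica_node = (primary_node + replica) % len(nodes)
            layout[nodes[replica_node]].append(f"R{shard}")
    return layout

layout = plan_allocation()
for node, shards in layout.items():
    print(node, shards)
```

    On a single node the replicas have nowhere to go and stay unassigned, which is why a one-node cluster with default settings does not report full health.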

  • Clustering & Replication

    [Diagram: two NODE boxes holding the shards R1 P2 P3 R2 R3 P4 R5 P1 R4 P5]

  • Cluster Health

    curl -X GET "http://localhost:9200/_cluster/health"

    curl -X GET "http://localhost:9200/_cat/health?v"

  • _cat API

    Display human-readable information about parts of the ES system

    Provides some limited documentation of functions

  • aliases

    > $ http GET ':9200/_cat/aliases?v'
    alias        index               filter routing.index routing.search
    posts        posts_561729df8ce4e *      -             -
    posts.public posts_561729df8ce4e *      -             -
    posts.write  posts_561729df8ce4e -      -             -

    Display all configured aliases

  • allocation

    > $ http GET ':9200/_cat/allocation?v'
    shards disk.used disk.avail disk.total disk.percent host
    33     2.6gb     21.8gb     24.4gb     10           host1
    33     3gb       21.4gb     24.4gb     12           host2
    34     2.6gb     21.8gb     24.4gb     10           host3

    Show how many shards are allocated per node, with disk utilization info

  • count

    > $ http GET ':9200/_cat/count?v'
    epoch      timestamp count
    1453790185 06:36:25  182763

    > $ http GET :9200/_cat/count/posts?v
    epoch      timestamp count
    1453790467 06:41:07  164169

    > $ http GET :9200/_cat/count/posts.public?v
    epoch      timestamp count
    1453790472 06:41:12  164169

    Display a count of documents in the cluster, or a specific index

  • fielddata

    > $ http -b GET ':9200/_cat/fielddata?v'
    id                     host  ip            node  total site_id published
    7tjeJNY3TMajqRkmYsJyrA host1 10.97.183.146 node1 1.1mb 170.1kb 996.5kb
    __xrpsKAQW6yyCY8luLQdQ host2 10.97.180.138 node2 1.6mb 329.3kb 1.3mb
    bdoNNXHXRryj22YqjnqECw host3 10.97.181.190 node3 1.1mb 154.7kb 991.7kb

    Shows how much memory is allocated to fielddata (in-memory field values used for sorting and aggregations), per node and per field

  • health

    > $ http -b GET ':9200/_cat/health?v'
    epoch      timestamp cluster              status node.total node.data shards pri relo init unassign pending_tasks
    1453829723 17:35:23  ampehes_prod_cluster green  3          3         100    50  0    0    0        0
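
    Because _cat output with ?v is a header row followed by whitespace-separated values, it is easy to consume from a script. A minimal Python sketch, using the health output above as sample input, might look like this; the parser is hypothetical and assumes no cell is empty or contains spaces.

```python
def parse_cat_table(text):
    """Turn _cat-style columnar text (header row first) into a list of dicts."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

sample_health = """\
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks
1453829723 17:35:23 ampehes_prod_cluster green 3 3 100 50 0 0 0 0
"""
health_rows = parse_cat_table(sample_health)
print(health_rows[0]["status"], health_rows[0]["shards"])
```

    All values come back as strings, so numeric columns such as shards need an explicit int() before comparison.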

  • indices

    > $ http -b GET 'eventhandler-prod.elasticsearch.amppublish.aws.aol.com:9200/_cat/indices?v'
    health status index pri rep docs.count docs.deleted store.size pri.store.size
    green
