Elasticsearch intro

  • ELASTICSEARCH INTRO
    Tom Chen [email protected]

  • ABOUT ME engineer @ iF+ TechArt

    full stack / CTO @ House123

    engineer @ Trend Micro

  • Management

  • Interactive

  • tornado, raspberry pi

  • ... ok !

  • elasticsearch introduction

  • Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must-haves, Elasticsearch gives you the ability to move easily beyond simple full-text search. Through its robust set of APIs and query DSLs, plus clients for the most popular programming languages, Elasticsearch delivers on the near limitless promises of search technology.

  • store data
    search
    scalable

  • Database

  • DB NoSQL

  • Search

  • SQL
    SELECT * FROM table WHERE field LIKE '%querystring%';

  • elasticsearch
    field:'querystring'

    or

    { "match": { "field": "querystring" } }

  • elasticsearch
    Lucene Query Parser Syntax

    or

    Elasticsearch Query DSL

    http://lucene.apache.org/core/4_10_3/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description

    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
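The two syntaxes above express the same search as different request bodies. A minimal sketch of both payloads as they would be sent to `/<index>/_search` (the variable names here are illustrative, not from the slides):

```python
import json

# Lucene query string syntax, wrapped in a query_string query:
query_string_body = {
    "query": {
        "query_string": {"query": "field:querystring"}
    }
}

# The equivalent Query DSL match query:
dsl_body = {
    "query": {
        "match": {"field": "querystring"}
    }
}

# Either body would be POSTed as JSON to /<index>/_search.
print(json.dumps(dsl_body))
```

The query-string form is handy for ad-hoc searches; the DSL form composes better once queries grow filters, boosts, and aggregations.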

  • behind the scene (indexing)

    "Set the shape to semi-transparent by calling set_trans(5)"

    set the shape to semi transparent by calling set_trans 5

    standard tokenizer (Unicode Standard Annex #29)
    lowercase token filter
    stop token filter

    fields and query string are analyzed
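The analysis chain above (tokenize, lowercase, optionally drop stopwords) can be sketched in a few lines of plain Python. This is a toy simulation to show the token stream, not Elasticsearch's actual implementation:

```python
import re

STOPWORDS = {'the', 'to', 'by'}  # tiny illustrative stopword list

def analyze(text, remove_stopwords=False):
    # Standard-tokenizer-like step: split on word boundaries;
    # "set_trans" and "5" survive whole, "semi-transparent" splits in two.
    tokens = re.findall(r'\w+', text)
    # Lowercase token filter
    tokens = [t.lower() for t in tokens]
    # Stop token filter (optional)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(analyze("Set the shape to semi-transparent by calling set_trans(5)"))
# -> ['set', 'the', 'shape', 'to', 'semi', 'transparent',
#     'by', 'calling', 'set_trans', '5']
```

The same function runs over both stored fields and the query string, which is why "semi-transparent" and "semi transparent" end up matching.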

  • behind the scene (searching)

    set the shape to semi transparent by calling set_trans 5

    fields and query string are analyzed

    semi-transparent

    semi transparent

  • what about Chinese?
    use an analyzer that is friendly to Chinese

    elasticsearch-analysis-mmseg

    MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm

    mmseg4j

    https://github.com/medcl/elasticsearch-analysis-mmseg

    https://code.google.com/p/mmseg4j/

    http://technology.chtsai.org/mmseg/

  • what about Chinese?
    use an analyzer that is friendly to Chinese

    elasticsearch-analysis-smartcn
    https://github.com/elasticsearch/elasticsearch-analysis-smartcn

  • Aggregations (Facets)

  • POST /cars/transactions/_bulk
    { "index": {}}
    { "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
    { "index": {}}
    { "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
    { "index": {}}
    { "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
    { "index": {}}
    { "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
    { "index": {}}
    { "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
    { "index": {}}
    { "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
    { "index": {}}
    { "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
    { "index": {}}
    { "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

    http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_aggregation_test_drive.html

  • GET /cars/transactions/_search?search_type=count
    {
        "aggs" : {
            "colors" : {
                "terms" : {
                    "field" : "color"
                }
            }
        }
    }

    http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_aggregation_test_drive.html

  • {
        ...
        "hits": { "hits": [] },
        "aggregations": {
            "colors": {
                "buckets": [
                    { "key": "red", "doc_count": 4 },
                    { "key": "blue", "doc_count": 2 },
                    { "key": "green", "doc_count": 2 }
                ]
            }
        }
    }

    http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_aggregation_test_drive.html
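The terms aggregation above is essentially a group-by-and-count over the `color` field. A local sketch in plain Python over the same eight documents (not an Elasticsearch call), with ties broken by term to mirror the bucket order in the response:

```python
from collections import Counter

cars = [
    {"price": 10000, "color": "red",   "make": "honda"},
    {"price": 20000, "color": "red",   "make": "honda"},
    {"price": 30000, "color": "green", "make": "ford"},
    {"price": 15000, "color": "blue",  "make": "toyota"},
    {"price": 12000, "color": "green", "make": "toyota"},
    {"price": 20000, "color": "red",   "make": "honda"},
    {"price": 80000, "color": "red",   "make": "bmw"},
    {"price": 25000, "color": "blue",  "make": "ford"},
]

counts = Counter(car["color"] for car in cars)
# Sort by doc_count descending, then by key, like the buckets above
buckets = [{"key": color, "doc_count": n}
           for color, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
print(buckets)
# -> [{'key': 'red', 'doc_count': 4}, {'key': 'blue', 'doc_count': 2},
#     {'key': 'green', 'doc_count': 2}]
```

The point of doing this server-side is that Elasticsearch computes the buckets across all shards without shipping documents to the client.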

  • more like this
    geolocation

  • so

  • code XD

  • https://github.com/yychen/estest

  • wordpress

    https://github.com/yychen/estest

  • GOAL: make a wordpress blog searchable

    https://github.com/yychen/estest

  • requests + lxml (XPath)
    pyelasticsearch
    elasticsearch

    tornado as the host

    https://github.com/yychen/estest

  • #!/usr/bin/env python
    from pyelasticsearch import ElasticSearch
    from pyelasticsearch.exceptions import ElasticHttpNotFoundError

    from settings import HOST, INDEX, DOCTYPE

    index_settings = {
        'mappings': {
            DOCTYPE: {
                'properties': {
                    'title': {'type': 'string', 'analyzer': 'mmseg', 'boost': 1.5,
                              'term_vector': 'with_positions_offsets'},
                    'url': {'type': 'string', 'index': 'not_analyzed'},
                    'content': {'type': 'string', 'analyzer': 'mmseg', 'boost': 0.7,
                                'term_vector': 'with_positions_offsets'},
                    'categories': {'type': 'nested', 'properties': {
                        'url': {'type': 'string', 'index': 'not_analyzed'},
                        'name': {'type': 'string', 'index': 'not_analyzed'},
                    }},
                }
            }
        }
    }

    es = ElasticSearch(HOST)
    try:
        es.delete_index(INDEX)
    except ElasticHttpNotFoundError:
        # No index found
        pass
    es.create_index(INDEX, settings=index_settings)

    mapping
    https://github.com/yychen/estest

  • requests + lxml

    https://github.com/yychen/estest 1

  • requests + lxml

    title

    categories

    content

    url

    https://github.com/yychen/estest 2

  • requests + lxml

    item = {
        'url': u'http://yychen.joba.cc/dev/archives/164',
        'content': u'\u96d6\u7136\u4e4b...',
        'categories': [
            {'link': 'http://yychen.joba.cc/dev/archives/category/django', 'name': 'django'},
            {'link': 'http://yychen.joba.cc/dev/archives/category/python', 'name': 'python'},
            {'link': 'http://yychen.joba.cc/dev/archives/category/web', 'name': 'web'},
        ],
        'title': 'Django 1.7 Migration',
    }

    from pyelasticsearch import ElasticSearch

    es = ElasticSearch(HOST)
    es.index(INDEX, DOCTYPE, doc=item, id=item['url'])

    https://github.com/yychen/estest 3

  • requests + lxml

    def main():
        url = u'http://yychen.joba.cc/dev/archives/164'
        es = ElasticSearch(HOST)
        for i in range(20):
            item, url = get_page(url)

            if not url:
                print '\033[1;33mWe\'ve reached the end, breaking...\033[m'
                break

            # put it into es
            print 'Indexing \033[1;37m%s\033[m (%s)...' % (item['title'], item['url'])
            es.index(INDEX, DOCTYPE, doc=item, id=item['url'])

    https://github.com/yychen/estest

  • requests + lxml

    def get_page(url):
        # store the to-be-indexed document to item
        item = {
            'categories': [],
        }

        page = requests.get(url)
        # page.encoding = 'utf-8'
        html = etree.HTML(page.text)

        try:
            prev_url = html.xpath('//a[@rel="prev"]/@href')[0]
        except IndexError:
            # We reached the end
            return None, None

        title_parts = html.xpath('//h1//text()')
        content_parts = html.xpath('//div[@class="post-bodycopy cf"]//text()')
        categories = html.xpath('//a[@rel="category tag"]')

        item['url'] = url
        item['title'] = process_tags(title_parts)
        item['content'] = process_tags(content_parts)

        # Process the categories
        for category in categories:
            _cat = {}
            _cat['link'] = category.xpath('./@href')[0]
            _cat['name'] = category.xpath('./text()')[0]
            item['categories'].append(_cat)

        return item, prev_url

    https://github.com/yychen/estest
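The `process_tags` helper used above is not shown on the slides. A plausible, purely hypothetical implementation would flatten the list of text fragments that an XPath `//text()` query returns into one clean string:

```python
# Hypothetical stand-in for the slides' process_tags helper: strip each
# text fragment from the XPath result, drop empties, join with spaces.
def process_tags(parts):
    cleaned = [p.strip() for p in parts]
    return ' '.join(p for p in cleaned if p)

print(process_tags(['  Django 1.7 ', '\n', 'Migration  ']))
# -> Django 1.7 Migration
```

Whatever the real helper does, collapsing whitespace like this keeps the indexed `title` and `content` fields free of layout noise from the scraped HTML.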

  • tornado

    class SearchHandler(tornado.web.RequestHandler):
        def post(self):
            dsl = {
                'query': {
                    'bool': {
                        'should': [
                            {'match': {'content': self.get_argument('q')}},
                            {'match': {'title': self.get_argument('q')}},
                        ]
                    }
                },
                'highlight': {
                    'pre_tags': [''],
                    'post_tags': [''],
                    'fields': {
                        'content': {'no_match_size': 150, 'number_of_fragments': 1},
                        'title': {'no_match_size': 150, 'number_of_fragments': 0},
                    }
                }
            }

            results = es.search(dsl, index=INDEX, doc_type=DOCTYPE)
            hits = results['hits']['hits']
            self.write(json.dumps(hits))

    https://github.com/yychen/estest
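The handler builds a `bool`/`should` query that matches the same term against both `title` and `content`, so a hit in either field scores. Factoring that out into a standalone helper makes the shape easy to see (`build_dsl` is a name chosen here, not from the slides):

```python
# Build the same bool/should query body the tornado handler constructs.
def build_dsl(q):
    return {
        'query': {
            'bool': {
                'should': [
                    {'match': {'content': q}},
                    {'match': {'title': q}},
                ]
            }
        }
    }

dsl = build_dsl('django')
print(dsl['query']['bool']['should'])
# -> [{'match': {'content': 'django'}}, {'match': {'title': 'django'}}]
```

With `should` clauses, a document matching both fields outscores one matching only `content`; the `boost` values set in the mapping (title 1.5, content 0.7) tilt ranking further toward title hits.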

  • tada~~~
    https://github.com/yychen/estest

  • GOOD!

  • !

    https://github.com/yychen/estest

  • demo at search.joba.cc

  • :D

  • tip

  • in the mapping, mark keyword-like fields

  • e.g. url, title, id, category etc.

  • index keyword fields as not_analyzed so they match exactly
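That mapping tip, sketched as a properties fragment in the style of the demo's index settings (field names here are illustrative):

```python
# Keyword-like fields get 'index': 'not_analyzed' so they are stored
# and matched verbatim instead of being run through an analyzer.
keyword_field = {'type': 'string', 'index': 'not_analyzed'}

properties = {
    'url': keyword_field,
    'id': keyword_field,
    'category': keyword_field,
}

print(properties['url'])
# -> {'type': 'string', 'index': 'not_analyzed'}
```

This is why the demo's mapping marks `url` and the nested category `name`/`url` fields as `not_analyzed`: analyzing an identifier would split it into fragments and break exact lookups.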

  • Conclusion


  • photo by derek_b on Flickr

    Any questions?

  • Thank you :D