ELASTICSEARCH INTRO
Tom Chen 陳炯廷
[email protected]
ABOUT ME
• engineer @ iF+ TechArt 當若科技藝術
• full stack / CTO @ House123
• engineer @ Trend Micro
Curation & Project Management
| exhibition planning
| space planning
| project management
| partner liaison / execution coordination
| budget estimation / spreadsheets
Interactive Design
| human-computer interaction design
| display hardware
| interactive content production
| custom turnkey projects
| industrial-grade electromechanical control / system integration
| structural design / construction
| live-action games
So lately I've actually been working more with Tornado and Raspberry Pi…
But... that's OK!
elasticsearch introduction
Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must haves, Elasticsearch gives you the ability to move easily beyond simple full-text search. Through its robust set of APIs and query DSLs, plus clients for the most popular programming languages, Elasticsearch delivers on the near limitless promises of search technology.
• store data
• search
• scalable
It's a bit like a database, but not quite the same.
You can use it as a DB or as a NoSQL store,
but it depends on what you are trying to achieve.
Let's start with its strengths.
Strength #1: Search
Typical SQL:
SELECT * FROM table WHERE field LIKE '%querystring%';
Elasticsearch:
field:'querystring'
or
{ "match": { "field": "querystring" } }
Elasticsearch:
Lucene Query Parser Syntax or the Elasticsearch Query DSL
http://lucene.apache.org/core/4_10_3/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
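As a sketch, the two forms above express the same search: a Lucene query string, or an equivalent `match` query in the Query DSL. The field name, query text, and the index name in the comment are placeholders, not real data:

```python
import json

# Lucene Query Parser syntax: one compact string
lucene_query = "field:querystring"

# Equivalent Query DSL: a structured JSON body
dsl_query = {
    "query": {
        "match": {"field": "querystring"}
    }
}

# With a client such as pyelasticsearch you would send the DSL body to a
# running node, e.g. es.search(dsl_query, index='myindex')  (hypothetical index)
print(json.dumps(dsl_query))
```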
behind the scene (indexing)
"Set the shape to semi-transparent by calling set_trans(5)"
set the shape to semi transparent by calling set_trans 5
standard tokenizer (Unicode Standard Annex #29)
lowercase token filter
stop token filter
fields and query string are analyzed
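The tokenize-then-filter pipeline above can be imitated in plain Python. This is a rough stand-in for the standard analyzer, not the real Lucene implementation (note the standard analyzer's stopword list is empty by default, so "the" survives):

```python
import re

STOPWORDS = set()  # standard analyzer ships with an empty stopword list by default

def analyze(text):
    """Roughly imitate: standard tokenizer -> lowercase filter -> stop filter."""
    # \w+ splits on punctuation/whitespace but keeps underscores, so
    # "semi-transparent" becomes two tokens while "set_trans" stays whole
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(analyze("Set the shape to semi-transparent by calling set_trans(5)"))
# → ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set_trans', '5']
```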
behind the scene (searching)
set the shape to semi transparent by calling set_trans 5
fields and query string are analyzed
semi-transparent
semi transparent
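Because the query string is analyzed with the same chain as the indexed fields, a query for "semi-transparent" yields the tokens "semi" and "transparent", which match the document. A toy inverted-index sketch (same naive analyzer assumption as before):

```python
import re

def analyze(text):
    # naive stand-in for the standard analyzer
    return re.findall(r"\w+", text.lower())

# Build a toy inverted index: token -> set of doc ids
docs = {1: "Set the shape to semi-transparent by calling set_trans(5)"}
index = {}
for doc_id, text in docs.items():
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)

def search(query):
    # a doc matches if it contains every token of the analyzed query
    token_sets = [index.get(t, set()) for t in analyze(query)]
    return set.intersection(*token_sets) if token_sets else set()

print(search("semi-transparent"))   # → {1}
```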
What about Chinese?
Use an analyzer that is friendly to Chinese.
elasticsearch-analysis-mmseg
MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm
mmseg4j
https://github.com/medcl/elasticsearch-analysis-mmseg
https://code.google.com/p/mmseg4j/
http://technology.chtsai.org/mmseg/
elasticsearch-analysis-smartcn
https://github.com/elasticsearch/elasticsearch-analysis-smartcn
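Once one of these plugins is installed, you point a field's mapping at its analyzer. A minimal mapping sketch (the index and field names here are made up; the analyzer name `mmseg` follows the mmseg plugin):

```python
# Mapping fragment telling Elasticsearch to analyze the "content" field
# with the mmseg analyzer instead of the standard one.
# "blog" and "post" are hypothetical index/type names.
mapping = {
    'mappings': {
        'post': {
            'properties': {
                'content': {'type': 'string', 'analyzer': 'mmseg'},
            }
        }
    }
}
# es.create_index('blog', settings=mapping)  # with a pyelasticsearch client
```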
Strength #2: Aggregations (Facets)
They let you build things like this:
POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_aggregation_test_drive.html
GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "colors": {
      "terms": { "field": "color" }
    }
  }
}
{
  ...
  "hits": { "hits": [] },
  "aggregations": {
    "colors": {
      "buckets": [
        { "key": "red", "doc_count": 4 },
        { "key": "blue", "doc_count": 2 },
        { "key": "green", "doc_count": 2 }
      ]
    }
  }
}
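A terms aggregation is essentially a server-side group-and-count. The same buckets can be reproduced client-side in plain Python with `collections.Counter`, using the eight demo cars from the bulk request above:

```python
from collections import Counter

# The eight demo cars from the bulk request above (dates omitted)
cars = [
    {"price": 10000, "color": "red",   "make": "honda"},
    {"price": 20000, "color": "red",   "make": "honda"},
    {"price": 30000, "color": "green", "make": "ford"},
    {"price": 15000, "color": "blue",  "make": "toyota"},
    {"price": 12000, "color": "green", "make": "toyota"},
    {"price": 20000, "color": "red",   "make": "honda"},
    {"price": 80000, "color": "red",   "make": "bmw"},
    {"price": 25000, "color": "blue",  "make": "ford"},
]

# Equivalent of the "terms" aggregation on the "color" field
buckets = Counter(car["color"] for car in cars)
print(buckets["red"], buckets["blue"], buckets["green"])   # → 4 2 2
```

Of course the point of doing it inside Elasticsearch is that the counting happens where the data lives, across all shards, without pulling documents to the client.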
There are a few other nice features too:
• more like this
• geolocation
• …
but I won't go into them here.
so…
Let's look at some code XD
https://github.com/yychen/estest
Lots of people have a WordPress blog
GOAL: build your own search for your WordPress blog
• requests + lxml (XPath) for the crawler
• pyelasticsearch to talk to Elasticsearch
• tornado to host the web page
#!/usr/bin/env python
from pyelasticsearch import ElasticSearch
from pyelasticsearch.exceptions import ElasticHttpNotFoundError
from settings import HOST, INDEX, DOCTYPE

index_settings = {
    'mappings': {
        DOCTYPE: {
            'properties': {
                'title': {'type': 'string', 'analyzer': 'mmseg', 'boost': 1.5,
                          'term_vector': 'with_positions_offsets'},
                'url': {'type': 'string', 'index': 'not_analyzed'},
                'content': {'type': 'string', 'analyzer': 'mmseg', 'boost': 0.7,
                            'term_vector': 'with_positions_offsets'},
                'categories': {'type': 'nested', 'properties': {
                    'url': {'type': 'string', 'index': 'not_analyzed'},
                    'name': {'type': 'string', 'index': 'not_analyzed'},
                }},
            }
        }
    }
}

es = ElasticSearch(HOST)
try:
    es.delete_index(INDEX)
except ElasticHttpNotFoundError:
    # No index found
    pass
es.create_index(INDEX, settings=index_settings)
Build the mapping
Write the crawler with requests + lxml
Use the newest post as the entry point (the first page)
To reach the next page, follow this link
Fields to extract:
• title
• categories
• content
• url
An extracted item looks like:

item = {
    'url': u'http://yychen.joba.cc/dev/archives/164',
    'content': u'\u96d6\u7136\u4e4b...',
    'categories': [
        {'link': 'http://yychen.joba.cc/dev/archives/category/django', 'name': 'django'},
        {'link': 'http://yychen.joba.cc/dev/archives/category/python', 'name': 'python'},
        {'link': 'http://yychen.joba.cc/dev/archives/category/web', 'name': 'web'},
    ],
    'title': 'Django 1.7 Migration',
}
from pyelasticsearch import ElasticSearch
es = ElasticSearch(HOST)
es.index(INDEX, DOCTYPE, doc=item, id=item['url'])
def main():
    url = u'http://yychen.joba.cc/dev/archives/164'
    es = ElasticSearch(HOST)
    for i in range(20):
        item, url = get_page(url)

        if not url:
            print '\033[1;33mWe\'ve reached the end, breaking...\033[m'
            break

        # put it into es
        print 'Indexing \033[1;37m%s\033[m (%s)...' % (item['title'], item['url'])
        es.index(INDEX, DOCTYPE, doc=item, id=item['url'])
def get_page(url):
    # store the to-be-indexed document to item
    item = {
        'categories': [],
    }

    page = requests.get(url)
    # page.encoding = 'utf-8'
    html = etree.HTML(page.text)

    try:
        prev_url = html.xpath('//a[@rel="prev"]/@href')[0]
    except IndexError:
        # We reached the end
        return None, None

    title_parts = html.xpath('//h1//text()')
    content_parts = html.xpath('//div[@class="post-bodycopy cf"]//text()')
    categories = html.xpath('//a[@rel="category tag"]')

    item['url'] = url
    item['title'] = process_tags(title_parts)
    item['content'] = process_tags(content_parts)

    # Process the categories
    for category in categories:
        _cat = {}
        _cat['link'] = category.xpath('./@href')[0]
        _cat['name'] = category.xpath('./text()')[0]
        item['categories'].append(_cat)

    return item, prev_url
tornado
class SearchHandler(tornado.web.RequestHandler):
    def post(self):
        dsl = {
            'query': {
                'bool': {
                    'should': [
                        {'match': {'content': self.get_argument('q')}},
                        {'match': {'title': self.get_argument('q')}},
                    ]
                }
            },
            'highlight': {
                'pre_tags': ['<em>'],
                'post_tags': ['</em>'],
                'fields': {
                    'content': {'no_match_size': 150, 'number_of_fragments': 1},
                    'title': {'no_match_size': 150, 'number_of_fragments': 0},
                }
            }
        }

        results = es.search(dsl, index=INDEX, doc_type=DOCTYPE)
        hits = results['hits']['hits']
        self.write(json.dumps(hits))
tada~~~
GOOD!
Fork it, then channel your hacker spirit and write a search engine for your own blog!
Then take what we just built a little further
search.joba.cc (the service is offline now; it was for demo only)
Very, very nice :D
One small tip
Create a "keyword" field in the mapping
and throw everything into it,
e.g. url, title, any ids, categories, etc.
Then a single search on that keyword field catches everything.
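One way to implement this tip is the `copy_to` mapping parameter: each field's values are also indexed into one catch-all field. This is a sketch; the index, type, and field names here are hypothetical:

```python
# Mapping sketch: every field is also copied into a catch-all "keyword"
# field via "copy_to", so one query on "keyword" searches everything.
# "post" and the field names are made-up examples.
mapping = {
    'mappings': {
        'post': {
            'properties': {
                'keyword': {'type': 'string'},   # the catch-all field
                'title':    {'type': 'string', 'copy_to': 'keyword'},
                'url':      {'type': 'string', 'index': 'not_analyzed',
                             'copy_to': 'keyword'},
                'category': {'type': 'string', 'copy_to': 'keyword'},
            }
        }
    }
}

# Searching then only needs one field:
query = {'query': {'match': {'keyword': 'django'}}}
```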
Conclusion
After today, everyone knows how to draw the horse!
photo by derek_b on Flickr
Any questions?
Thank you :D