ELASTICSEARCH INTRO
Tom Chen 陳炯廷
[email protected]
ABOUT ME
• engineer @ iF+ TechArt 當若科技藝術
• full stack / CTO @ House123
• engineer @ Trend Micro
Curation & Project Management
| exhibition planning
| space planning
| project management
| partner liaison / execution coordination
| budget estimation / spreadsheets
Interactive Design
| human-computer interaction design
| display hardware
| interactive content production
| custom turnkey projects
| industrial-grade electromechanical control / system integration
| structural design / construction
| live-action games
So lately I've actually been working more with Tornado and Raspberry Pi…
But... that's OK!
elasticsearch introduction
Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must haves, Elasticsearch gives you the ability to move easily beyond simple full-text search. Through its robust set of APIs and query DSLs, plus clients for the most popular programming languages, Elasticsearch delivers on the near limitless promises of search technology.
• store data
• search
• scalable
It's a bit like a database, but not quite the same.
You can use it as a DB or as a NoSQL store,
but it depends on what you are trying to achieve.
Let's start with its strengths.
Strength #1: Search
Typical SQL:
SELECT * FROM table WHERE field LIKE '%querystring%';
Elasticsearch:
field:'querystring'
or
{ "match": { "field": "querystring" } }
Elasticsearch:
Lucene Query Parser Syntax or the Elasticsearch Query DSL
http://lucene.apache.org/core/4_10_3/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
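As a sketch, the two forms above express the same search: a Lucene query string, or an equivalent `match` query in the Query DSL. The field name, query text, and the index name in the comment are placeholders, not real data:

```python
import json

# Lucene Query Parser syntax: one compact string
lucene_query = "field:querystring"

# Equivalent Query DSL: a structured JSON body
dsl_query = {
    "query": {
        "match": {"field": "querystring"}
    }
}

# With a client such as pyelasticsearch you would send the DSL body to a
# running node, e.g. es.search(dsl_query, index='myindex')  (hypothetical index)
print(json.dumps(dsl_query))
```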
behind the scene (indexing)
"Set the shape to semi-transparent by calling set_trans(5)"
set the shape to semi transparent by calling set_trans 5
standard tokenizer (Unicode Standard Annex #29)
lowercase token filter
stop token filter
fields and query string are analyzed
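The tokenize-then-filter pipeline above can be imitated in plain Python. This is a rough stand-in for the standard analyzer, not the real Lucene implementation (note the standard analyzer's stopword list is empty by default, so "the" survives):

```python
import re

STOPWORDS = set()  # standard analyzer ships with an empty stopword list by default

def analyze(text):
    """Roughly imitate: standard tokenizer -> lowercase filter -> stop filter."""
    # \w+ splits on punctuation/whitespace but keeps underscores, so
    # "semi-transparent" becomes two tokens while "set_trans" stays whole
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(analyze("Set the shape to semi-transparent by calling set_trans(5)"))
# → ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set_trans', '5']
```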
behind the scene (searching)
set the shape to semi transparent by calling set_trans 5
fields and query string are analyzed
semi-transparent
semi transparent
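Because the query string is analyzed with the same chain as the indexed fields, a query for "semi-transparent" yields the tokens "semi" and "transparent", which match the document. A toy inverted-index sketch (same naive analyzer assumption as before):

```python
import re

def analyze(text):
    # naive stand-in for the standard analyzer
    return re.findall(r"\w+", text.lower())

# Build a toy inverted index: token -> set of doc ids
docs = {1: "Set the shape to semi-transparent by calling set_trans(5)"}
index = {}
for doc_id, text in docs.items():
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)

def search(query):
    # a doc matches if it contains every token of the analyzed query
    token_sets = [index.get(t, set()) for t in analyze(query)]
    return set.intersection(*token_sets) if token_sets else set()

print(search("semi-transparent"))   # → {1}
```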
What about Chinese?
Use an analyzer that is friendly to Chinese.
elasticsearch-analysis-mmseg
MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm
mmseg4j
https://github.com/medcl/elasticsearch-analysis-mmseg
https://code.google.com/p/mmseg4j/
http://technology.chtsai.org/mmseg/
elasticsearch-analysis-smartcn
https://github.com/elasticsearch/elasticsearch-analysis-smartcn
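Once one of these plugins is installed, you point a field's mapping at its analyzer. A minimal mapping sketch (the index and field names here are made up; the analyzer name `mmseg` follows the mmseg plugin):

```python
# Mapping fragment telling Elasticsearch to analyze the "content" field
# with the mmseg analyzer instead of the standard one.
# "blog" and "post" are hypothetical index/type names.
mapping = {
    'mappings': {
        'post': {
            'properties': {
                'content': {'type': 'string', 'analyzer': 'mmseg'},
            }
        }
    }
}
# es.create_index('blog', settings=mapping)  # with a pyelasticsearch client
```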
Strength #2: Aggregations (Facets)
They let you build things like this:
POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_aggregation_test_drive.html
GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "colors": {
      "terms": { "field": "color" }
    }
  }
}
{
  ...
  "hits": { "hits": [] },
  "aggregations": {
    "colors": {
      "buckets": [
        { "key": "red", "doc_count": 4 },
        { "key": "blue", "doc_count": 2 },
        { "key": "green", "doc_count": 2 }
      ]
    }
  }
}
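A terms aggregation is essentially a server-side group-and-count. The same buckets can be reproduced client-side in plain Python with `collections.Counter`, using the eight demo cars from the bulk request above:

```python
from collections import Counter

# The eight demo cars from the bulk request above (dates omitted)
cars = [
    {"price": 10000, "color": "red",   "make": "honda"},
    {"price": 20000, "color": "red",   "make": "honda"},
    {"price": 30000, "color": "green", "make": "ford"},
    {"price": 15000, "color": "blue",  "make": "toyota"},
    {"price": 12000, "color": "green", "make": "toyota"},
    {"price": 20000, "color": "red",   "make": "honda"},
    {"price": 80000, "color": "red",   "make": "bmw"},
    {"price": 25000, "color": "blue",  "make": "ford"},
]

# Equivalent of the "terms" aggregation on the "color" field
buckets = Counter(car["color"] for car in cars)
print(buckets["red"], buckets["blue"], buckets["green"])   # → 4 2 2
```

Of course the point of doing it inside Elasticsearch is that the counting happens where the data lives, across all shards, without pulling documents to the client.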
There are a few other nice features too:
• more like this
• geolocation
• …
but I won't go into them here.
so…
Let's look at some code XD
https://github.com/yychen/estest
Lots of people have a WordPress blog
GOAL: build your own search for your WordPress blog
• requests + lxml (XPath) for the crawler
• pyelasticsearch to talk to Elasticsearch
• tornado to host the web page
#!/usr/bin/env python
from pyelasticsearch import ElasticSearch
from pyelasticsearch.exceptions import ElasticHttpNotFoundError
from settings import HOST, INDEX, DOCTYPE

index_settings = {
    'mappings': {
        DOCTYPE: {
            'properties': {
                'title': {'type': 'string', 'analyzer': 'mmseg', 'boost': 1.5,
                          'term_vector': 'with_positions_offsets'},
                'url': {'type': 'string', 'index': 'not_analyzed'},
                'content': {'type': 'string', 'analyzer': 'mmseg', 'boost': 0.7,
                            'term_vector': 'with_positions_offsets'},
                'categories': {'type': 'nested', 'properties': {
                    'url': {'type': 'string', 'index': 'not_analyzed'},
                    'name': {'type': 'string', 'index': 'not_analyzed'},
                }},
            }
        }
    }
}

es = ElasticSearch(HOST)
try:
    es.delete_index(INDEX)
except ElasticHttpNotFoundError:
    # No index found
    pass
es.create_index(INDEX, settings=index_settings)
Build the mapping
Write the crawler with requests + lxml
Use the newest post as the entry point (the first page)
To reach the next page, follow this link
Fields to extract:
• title
• categories
• content
• url
An extracted item looks like:

item = {
    'url': u'http://yychen.joba.cc/dev/archives/164',
    'content': u'\u96d6\u7136\u4e4b...',
    'categories': [
        {'link': 'http://yychen.joba.cc/dev/archives/category/django', 'name': 'django'},
        {'link': 'http://yychen.joba.cc/dev/archives/category/python', 'name': 'python'},
        {'link': 'http://yychen.joba.cc/dev/archives/category/web', 'name': 'web'},
    ],
    'title': 'Django 1.7 Migration',
}
from pyelasticsearch import ElasticSearch
es = ElasticSearch(HOST)
es.index(INDEX, DOCTYPE, doc=item, id=item['url'])
def main():
    url = u'http://yychen.joba.cc/dev/archives/164'
    es = ElasticSearch(HOST)
    for i in range(20):
        item, url = get_page(url)

        if not url:
            print '\033[1;33mWe\'ve reached the end, breaking...\033[m'
            break

        # put it into es
        print 'Indexing \033[1;37m%s\033[m (%s)...' % (item['title'], item['url'])
        es.index(INDEX, DOCTYPE, doc=item, id=item['url'])
def get_page(url):
    # store the to-be-indexed document to item
    item = {
        'categories': [],
    }

    page = requests.get(url)
    # page.encoding = 'utf-8'
    html = etree.HTML(page.text)

    try:
        prev_url = html.xpath('//a[@rel="prev"]/@href')[0]
    except IndexError:
        # We reached the end
        return None, None

    title_parts = html.xpath('//h1//text()')
    content_parts = html.xpath('//div[@class="post-bodycopy cf"]//text()')
    categories = html.xpath('//a[@rel="category tag"]')

    item['url'] = url
    item['title'] = process_tags(title_parts)
    item['content'] = process_tags(content_parts)

    # Process the categories
    for category in categories:
        _cat = {}
        _cat['link'] = category.xpath('./@href')[0]
        _cat['name'] = category.xpath('./text()')[0]
        item['categories'].append(_cat)

    return item, prev_url
tornado
class SearchHandler(tornado.web.RequestHandler):
    def post(self):
        dsl = {
            'query': {
                'bool': {
                    'should': [
                        {'match': {'content': self.get_argument('q')}},
                        {'match': {'title': self.get_argument('q')}},
                    ]
                }
            },
            'highlight': {
                'pre_tags': ['<em>'],
                'post_tags': ['</em>'],
                'fields': {
                    'content': {'no_match_size': 150, 'number_of_fragments': 1},
                    'title': {'no_match_size': 150, 'number_of_fragments': 0},
                }
            }
        }

        results = es.search(dsl, index=INDEX, doc_type=DOCTYPE)
        hits = results['hits']['hits']
        self.write(json.dumps(hits))
tada~~~
GOOD!
Fork it, then channel your hacker spirit and write a search engine for your own blog!
Then take what we just built a little further
search.joba.cc (the service is offline now; it was for demo only)
Very, very nice :D
One small tip
Create a "keyword" field in the mapping
and throw everything into it,
e.g. url, title, any ids, categories, etc.
Then a single search on that keyword field catches everything.
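One way to implement this tip is the `copy_to` mapping parameter: each field's values are also indexed into one catch-all field. This is a sketch; the index, type, and field names here are hypothetical:

```python
# Mapping sketch: every field is also copied into a catch-all "keyword"
# field via "copy_to", so one query on "keyword" searches everything.
# "post" and the field names are made-up examples.
mapping = {
    'mappings': {
        'post': {
            'properties': {
                'keyword': {'type': 'string'},   # the catch-all field
                'title':    {'type': 'string', 'copy_to': 'keyword'},
                'url':      {'type': 'string', 'index': 'not_analyzed',
                             'copy_to': 'keyword'},
                'category': {'type': 'string', 'copy_to': 'keyword'},
            }
        }
    }
}

# Searching then only needs one field:
query = {'query': {'match': {'keyword': 'django'}}}
```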
Conclusion
After today, everyone knows how to draw the horse!
photo by derek_b on Flickr
Any questions?
Thank you :D