Search engines can solve far more challenges than a search box suggests. Perhaps you have a search problem without being aware of it? Elasticsearch, an open source search engine built on Lucene, is getting more and more attention, not only because it is excellent at solving typical search problems, but also because it can be used for analytics and "big data" challenges. The talk gives an overview of what search engines are good at, related problems you will run into, and how Elasticsearch can help, as well as how it fits into your technology stack. It is not a tutorial, but with a relatively high tempo and examples of realistic complexity it gives an overview of what is possible. We round off with how Elasticsearch can be classified among the multitude of "NoSQL" databases.
Who?
Co-founder of Found AS
7+ years of search, 2+ years of Elasticsearch
Manages hundreds of Elasticsearch clusters
Wednesday, September 11, 13
Agenda
0. Elasticsearch
1. Use cases
2. Lingo
3. Data structures
4. Text processing
5. Elasticsearch
6. NOSQL?
Elasticsearch
Open source
Real-time search and analytics
Schema-free
Based on Lucene
$ curl localhost:9200/sample_index/sample_message -XPOST -d '{
  "user": { "name": "DEVOPS_BORAT" },
  "followers": 42000,
  "location": { "lat": 56.78, "lon": 12.34 },
  "tags": [ "questionable", "funny" ],
  "message": "1+1=2 only in legacy system. In modern distributed database with eventual consistent is 1+1=1.",
  "retweets": 123
}'
{"ok":true,"_index":"sample_index","_type":"sample_message","_id":"rjs9KSmPRnqhvs7QjgxJJw","_version":1}
$ curl localhost:9200/sample_index/sample_message/_search -XPOST -d '{
  "query": { "match": { "message": "consistent" } }
}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 1,
    "max_score" : 0.076713204,
    "hits" : [ {
      "_index" : "sample_index",
      "_type" : "sample_message",
      "_id" : "rjs9KSmPRnqhvs7QjgxJJw",
      "_score" : 0.076713204,
      "_source" : {
        "user": { "name": "DEVOPS_BORAT" },
        "message": "1+1=2 only in legacy system. In modern distributed database with eventual consistent is 1+1=1.",
        "retweets": 123,
        ...
      }
    } ]
  }
}
{
  "sample_index" : {
    "sample_message" : {
      "properties" : {
        "followers" : { "type" : "long" },
        "location" : {
          "properties" : {
            "lat" : { "type" : "double" },
            "lon" : { "type" : "double" }
          }
        },
        "message" : { "type" : "string" },
        "retweets" : { "type" : "long" },
        "tags" : { "type" : "string" },
        "user" : {
          "properties" : {
            "name" : { "type" : "string" }
          }
        }
      }
    }
  }
}
{"id"=>12296272736,
"text"=>
"An early look at Annotations:
http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453",
"created_at"=>"Fri Apr 16 17:55:46 +0000 2010",
"in_reply_to_user_id"=>nil,
"in_reply_to_screen_name"=>nil,
"in_reply_to_status_id"=>nil,
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"screen_name"=>"twitterapi",
"name"=>"Twitter API",
"description"=>
"The Real Twitter API. I tweet about API changes, service issues and
happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",
"url"=>"http://apiwiki.twitter.com",
"location"=>"San Francisco, CA",
"profile_background_color"=>"c1dfee",
"profile_background_image_url"=>
"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",
"profile_background_tile"=>false,
"profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",
"profile_link_color"=>"0000ff",
"profile_sidebar_border_color"=>"87bc44",
"profile_sidebar_fill_color"=>"e0ff92",
"profile_text_color"=>"000000",
"created_at"=>"Wed May 23 06:01:13 +0000 2007",
"contributors_enabled"=>true,
"favourites_count"=>1,
"statuses_count"=>1628,
"friends_count"=>13,
"time_zone"=>"Pacific Time (US & Canada)",
"utc_offset"=>-28800,
"lang"=>"en",
"protected"=>false,
"followers_count"=>100581,
"geo_enabled"=>true,
"notifications"=>false,
"following"=>true,
"verified"=>true},
"contributors"=>[3191321],
"geo"=>nil,
"coordinates"=>nil,
"place"=>
{"id"=>"2b6ff8c22edd9576",
"url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",
"name"=>"SoMa",
"full_name"=>"SoMa, San Francisco",
"place_type"=>"neighborhood",
"country_code"=>"US",
"country"=>"The United States of America",
"bounding_box"=>
{"coordinates"=>
[[[-122.42284884, 37.76893497],
[-122.3964, 37.76893497],
[-122.3964, 37.78752897],
[-122.42284884, 37.78752897]]],
"type"=>"Polygon"}},
"source"=>"web"}
The tweet's unique ID. These IDs are roughly sorted & developers should treat them as opaque (http://bit.ly/dCkppc).
Text of the tweet. Consecutive duplicate tweets are rejected. 140 character max (http://bit.ly/4ud3he).
Tweet's creation date.
DEPRECATED
The ID of an existing tweet that this tweet is in reply to. Won't be set unless the author of the referenced tweet is mentioned.
The screen name & user ID of replied to tweet author.
Truncated to 140 characters. Only possible from SMS.
The author of the tweet. This embedded object can get out of sync.
The author's user ID.
The author's user name.
The author's screen name.
The author's biography.
The author's URL.
The author's "location". This is a free-form text field, and there are no guarantees on whether it can be geocoded.
Rendering information for the author. Colors are encoded in hex values (RGB).
The creation date for this account.
Whether this account has contributors enabled (http://bit.ly/50npuu).
Number of favorites this user has.
Number of tweets this user has.
Number of users this user is following.
The timezone and offset (in seconds) for this user.
The user's selected language.
Whether this user is protected or not. If the user is protected, then this tweet is not visible except to "friends".
Number of followers for this user.
Whether this user has geo enabled (http://bit.ly/4pFY77).
DEPRECATED in this context
Whether this user has a verified badge.
The geo tag on this tweet in GeoJSON (http://bit.ly/b8L1Cp).
The contributors' (if any) user IDs (http://bit.ly/50npuu).
DEPRECATED
The place associated with this Tweet (http://bit.ly/b8L1Cp).
The place ID
The URL to fetch a detailed polygon for this place
The printable names of this place
The type of this place - can be a "neighborhood" or "city"
The country this place is in
The bounding box for this place
The application that sent this tweet
Map of a Twitter Status Object
Raffi Krikorian <[email protected]> 18 April 2010
user:
  name: DEVOPS_BORAT
message: "1+1=2 only in legacy system. In modern distributed database with eventual consistent is 1+1=1."
location:
  lon: 12.34
  lat: 56.78
followers: 42000
retweets: 123
tags: [questionable, funny]
Analysis
whitespace
The quick brown fox had a day off
whitespace-tokenizer
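A minimal sketch of what a whitespace tokenizer does to the sentence above. The function name `whitespace_tokenize` is made up for illustration and is not Elasticsearch's implementation; it only shows that the text is split on whitespace and nothing else (case is kept as-is).

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace; no lowercasing, no stemming."""
    return text.split()


tokens = whitespace_tokenize("The quick brown fox had a day off")
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'had', 'a', 'day', 'off']
```

Note that "The" keeps its capital letter, so a search for "the" would only match if a lowercase step is added to the analysis chain.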
Filter / Query
Filter: boolean match
Query: match with a score
Can be composed of other queries
"Search"
The whole information need
Query, filters, facets, pagination, ...
Inverted index
"If you don't find it in the index, look very carefully through the entire catalog."
–Sears, Roebuck, and Co., Consumers' Guide 1897
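To make the idea of an inverted index concrete, here is a small sketch that maps each term to the documents containing it. The sample documents and the function name `build_inverted_index` are assumptions for illustration; Lucene's real data structures are far more elaborate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents
docs = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "Quick thinking fox",
}
index = build_inverted_index(docs)
print(index["fox"])          # [1, 3]
print(index["quick"])        # [1, 3]
print(index.get("cat", []))  # []
```

A lookup is a single dictionary access on the term, which is why a term that was never indexed cannot be found without scanning everything.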
AbstractEnterpriseSingletonProxyFactoryBean
xkcd.com/292
camelCase
AbstractSingletonProxyFactoryBean
camelCase-tokenizer
lowercase
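The camelCase tokenizer and lowercase steps above could be sketched as follows. The regular expression and function names are assumptions chosen for illustration, not the analyzer configuration used in the talk.

```python
import re

# Either a run of capitals not followed by a lowercase letter (an acronym),
# or an optional capital followed by lowercase letters, or a run of digits.
CAMEL_CASE_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def camel_case_tokenize(identifier):
    """Split an identifier at camelCase boundaries."""
    return CAMEL_CASE_PATTERN.findall(identifier)


def lowercase_filter(tokens):
    """Lowercase every token so matching becomes case-insensitive."""
    return [token.lower() for token in tokens]


tokens = camel_case_tokenize("AbstractSingletonProxyFactoryBean")
print(tokens)
# ['Abstract', 'Singleton', 'Proxy', 'Factory', 'Bean']
print(lowercase_filter(tokens))
# ['abstract', 'singleton', 'proxy', 'factory', 'bean']
```

With these index terms, a search for "factory" can find the class without any wildcard matching.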
Prefix problems!
Prefix problems
*suffix xiffus*
(60.6384, 6.5017) u4u8gyykk
123 {1-hundreds, 12-tens, 123} (simplified)
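The three examples above share one trick: rewrite the problem so that it becomes a prefix lookup. A rough sketch, where the `rev:` marker, the unit labels and all function names are hypothetical and only mirror the simplified notation on the slide:

```python
def reversed_term(term):
    """Index term for suffix search: the reversed term, tagged with a made-up 'rev:' marker."""
    return "rev:" + term[::-1]


def suffix_query_as_prefix(suffix):
    """A suffix query becomes a prefix query against the reversed terms."""
    return "rev:" + suffix[::-1]


def geohash_prefixes(geohash):
    """Every prefix of a geohash names a larger cell containing the point."""
    return [geohash[:i] for i in range(1, len(geohash) + 1)]


UNIT_NAMES = {1: "tens", 2: "hundreds", 3: "thousands"}


def numeric_prefix_terms(number):
    """Index a non-negative integer at several precisions, e.g. 123 -> 1-hundreds, 12-tens, 123."""
    digits = str(number)
    terms = []
    for i in range(1, len(digits)):
        dropped = len(digits) - i
        terms.append(f"{digits[:i]}-{UNIT_NAMES.get(dropped, f'1e{dropped}')}")
    terms.append(digits)
    return terms


indexed = reversed_term("FactoryBean")
print(indexed.startswith(suffix_query_as_prefix("Bean")))  # True
print(geohash_prefixes("u4u8gyykk")[:3])                   # ['u', 'u4', 'u4u']
print(numeric_prefix_terms(123))                           # ['1-hundreds', '12-tens', '123']
```

A range such as 100 to 199 can then be answered by the single term `1-hundreds` instead of enumerating a hundred exact values.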
Elasticsearch
Distributed
Cluster of nodes
Self-coordinating
Mapping
So far
Inverted indexes
Text processing
Index terms
Mappings
Index templates
xkcd.com/208
?q={!boost b=div(popularity,price) v=$qq}
 &qq={!dismax qf=desc^2,review}cheap
 &bq={!lucene df=keywords}lucene solr java
 &fq={!geofilt sfield=location pt=10.312,-20.556 d=3.5}
 &fq={!term f=$ff v=$vv}&ff=keywords&vv=solr
 &sort=query(keywords:lame) asc, score desc
Filters
Cached as bitmaps
Compact
Very fast
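A rough sketch of why bitmap-cached filters are compact and fast: each filter is one bit per document, and combining filters is a bitwise operation. Python integers stand in for the bitmaps here; the sample documents, cache keys and helper names are assumptions for illustration only.

```python
def build_filter_bitmap(docs, predicate):
    """Return an int whose bit i is set when docs[i] matches the predicate."""
    bitmap = 0
    for i, doc in enumerate(docs):
        if predicate(doc):
            bitmap |= 1 << i
    return bitmap


def matching_ids(bitmap):
    """List the positions of the set bits, i.e. the matching document numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical documents
docs = [
    {"tags": ["funny"], "retweets": 123},
    {"tags": ["serious"], "retweets": 5},
    {"tags": ["funny", "questionable"], "retweets": 2},
    {"tags": [], "retweets": 500},
]

# Compute each filter once and keep the bitmap around for reuse.
filter_cache = {
    "tag:funny": build_filter_bitmap(docs, lambda d: "funny" in d["tags"]),
    "retweets>=100": build_filter_bitmap(docs, lambda d: d["retweets"] >= 100),
}

both = filter_cache["tag:funny"] & filter_cache["retweets>=100"]
either = filter_cache["tag:funny"] | filter_cache["retweets>=100"]
print(matching_ids(both))    # [0]
print(matching_ids(either))  # [0, 2, 3]
```

Since a filter only answers yes or no per document, no scores need to be stored, which is what makes this representation possible.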
term: className: "InternalFrameInternalFrameTitlePaneInternalFrameTitlePaneMaximizeButtonWindowNotFocusedState"
Filters
Use filters when you can ...
... and queries when you need ranking.
Facets
Summarize the entire result set
Filters + facets are the basis for analytics use
Faceting options
Terms
Histograms
Date histograms
Geo distance
Statistical distribution
Filters/Queries
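Two of the facet types listed above, terms and histograms, can be sketched in a few lines. This only illustrates what the facets compute over a result set; the documents and the function names are assumptions, and Elasticsearch computes these from its field caches rather than from raw documents.

```python
from collections import Counter


def terms_facet(docs, field):
    """Count how many times each term occurs in a multi-valued field."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.get(field, []))
    return counts


def histogram_facet(docs, field, interval):
    """Count documents per bucket, where each bucket spans `interval` values."""
    counts = Counter()
    for doc in docs:
        bucket = doc[field] // interval * interval
        counts[bucket] += 1
    return dict(sorted(counts.items()))


# Hypothetical result set
results = [
    {"tags": ["funny"], "retweets": 123},
    {"tags": ["serious"], "retweets": 5},
    {"tags": ["funny", "questionable"], "retweets": 2},
    {"tags": [], "retweets": 500},
]

print(terms_facet(results, "tags").most_common(1))  # [('funny', 2)]
print(histogram_facet(results, "retweets", 100))    # {0: 2, 100: 1, 500: 1}
```

Both functions touch every document in the result set, which hints at why facets cost CPU and memory on large indexes.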
Facets
Resource-intensive
CPU + memory
Important to have enough memory
Caches
Filter caches
Field caches: facets, etc.
Page cache
There are two hard things in computer science:
cache invalidation, naming things, and off-by-one errors.
Caches
Now you are thinking with...
Per segment
New segments do not invalidate old ones
Important for (near) real time
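A toy model of per-segment caching, to show why adding a new segment leaves existing cache entries valid. The `Segment` and `Index` classes are stand-ins invented for this sketch and do not reflect Lucene's or Elasticsearch's actual classes.

```python
class Segment:
    """An immutable batch of documents with its own filter cache."""

    def __init__(self, docs):
        self.docs = tuple(docs)
        self.filter_cache = {}  # filter name -> bitmap (int)
        self.computations = 0   # how often a filter was actually evaluated

    def filter_bitmap(self, name, predicate):
        if name not in self.filter_cache:
            self.computations += 1
            bitmap = 0
            for i, doc in enumerate(self.docs):
                if predicate(doc):
                    bitmap |= 1 << i
            self.filter_cache[name] = bitmap
        return self.filter_cache[name]


class Index:
    """New documents arrive as new segments; old segments never change."""

    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        self.segments.append(Segment(docs))

    def count(self, name, predicate):
        return sum(
            bin(segment.filter_bitmap(name, predicate)).count("1")
            for segment in self.segments
        )


def is_english(doc):
    return doc["lang"] == "en"


index = Index()
index.add_segment([{"lang": "en"}, {"lang": "no"}])
print(index.count("lang:en", is_english))  # 1

index.add_segment([{"lang": "en"}])        # new data becomes a new segment
print(index.count("lang:en", is_english))  # 2
print([s.computations for s in index.segments])  # [1, 1]
```

The first segment's filter was evaluated only once even though the index grew, so only the new segment pays the cost of warming its cache.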
PostgreSQL
Verifies resource usage
Safe >> fast
Uses disk if it has to
Elasticsearch trusts you
Built for speed
What could possibly go wrong?
OutOfMemoryError
Woah there
I ate all the memories
Your cluster may or may not work any more
NOSQL?
Fast, not robust
Document database
Schema-flexible
No transactions
Easy to scale/distribute
Naive leader election
No auth/authz
?
Slides and relevant links at
found.no/jz13
(Try hosted Elasticsearch free for 6 months)
Solr meetup in the community room tomorrow!
Image credits
Nails – Adam Rosenberg
Map of Westeros
Elephant, Roy Costello
Wingsuit, Richard Schneider