Practical Elasticsearch - Real World Use Cases
Me?
• Itamar Syn-Hershko / @synhershko
• Lucene.NET PMC and lead committer
• Microsoft MVP
• RavenDB
– X-Core developer
– “RavenDB in Action” author
– Consulting Partner
An index
Elasticsearch
• Powered by Apache Lucene
• Open-source
• Rapid growth
• High profile users world-wide
REST API
• Indexes
• Types
• IDs
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user": "synhershko",
    "post_date": "2013-05-30T14:12:12",
    "message": "trying out Elastic Search",
    "followers": 3,
    "registered": true
}'
Full-Text Search
Full-text Search 101: The inverted index
6 documents to index:
1. The old night keeper keeps the keep in the town
2. In the big old house in the big old gown.
3. The house in the town had the big old keep
4. Where the old night keeper never did sleep.
5. The night keeper keeps the keep in the night
6. And keeps in the dark and sleeps in the light.
The index: dictionary and posting lists
Term      Documents
and       <6>
big       <2> <3>
dark      <6>
did       <4>
gown      <2>
had       <3>
house     <2> <3>
in        <1> <2> <3> <5> <6>
keep      <1> <3> <5>
keeper    <1> <4> <5>
keeps     <1> <5> <6>
light     <6>
never     <4>
night     <1> <4> <5>
old       <1> <2> <3> <4>
sleep     <4>
sleeps    <6>
the       <1> <2> <3> <4> <5> <6>
town      <1> <3>
where     <4>
Example from: Justin Zobel, Alistair Moffat, Inverted files for text search engines, ACM Computing Surveys (CSUR), v.38 n.2, p.6-es, 2006
User queries for “keeper”
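The lookup for “keeper” can be sketched in a few lines of Python. This is only an illustration of a dictionary with posting lists built from the six example documents; the names `DOCS`, `build_index` and `search` are made up for this sketch and have nothing to do with how Lucene stores its index on disk.

```python
import re
from collections import defaultdict

# The six example documents, keyed by document number.
DOCS = {
    1: "The old night keeper keeps the keep in the town",
    2: "In the big old house in the big old gown.",
    3: "The house in the town had the big old keep",
    4: "Where the old night keeper never did sleep.",
    5: "The night keeper keeps the keep in the night",
    6: "And keeps in the dark and sleeps in the light.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map every term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


INDEX = build_index(DOCS)


def search(term):
    """Look a single term up in the dictionary and return its posting list."""
    return INDEX.get(term.lower(), [])


print(search("keeper"))  # [1, 4, 5]
```

A query is answered by one dictionary lookup, without scanning the documents themselves.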
Term Normalization
• Lowercasing
• Stop words (such as “and”, “in”, “the”)
– Not best practice anymore
• Stemming
– Porter stemmer
– s-stemmer
• Relevance++
• SizeOnDisk--
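A rough sketch of what these normalization steps do to the six example documents. The stop word list and the plural-stripping rules below are simplifying assumptions made for this illustration (the rules are in the spirit of an s-stemmer, not a faithful copy of any Lucene analyzer), and all names are hypothetical.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The old night keeper keeps the keep in the town",
    2: "In the big old house in the big old gown.",
    3: "The house in the town had the big old keep",
    4: "Where the old night keeper never did sleep.",
    5: "The night keeper keeps the keep in the night",
    6: "And keeps in the dark and sleeps in the light.",
}

# Assumption: a tiny stop word list chosen for this example only.
STOP_WORDS = {"and", "in", "the"}


def s_stem(word):
    """Strip simple English plural endings (simplified s-stemmer rules)."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word


def normalize(text):
    """Lowercase, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [s_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


INDEX = build_index(DOCS)


def search(term):
    """Normalize the query term the same way as the indexed text."""
    terms = normalize(term)
    return INDEX.get(terms[0], []) if terms else []


print(len(INDEX))       # 15 terms instead of 20
print(search("keeps"))  # [1, 3, 5, 6]
```

“keeps” and “keep” now share one posting list, so a query for either form finds all four documents, and the dictionary shrinks from 20 terms to 15. The query has to go through the same normalization as the indexed text, otherwise the terms would not line up.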
Full-Text Search
Your data store
How hard is it to get search right, anyway?
Relevance
• Precision: the fraction of the retrieved documents that are relevant
• Recall: the fraction of the relevant documents that are retrieved
• Order of results
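Both measures are easy to compute once you know which documents were relevant. A minimal sketch follows; the document ids in the usage lines are made up for illustration.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical example: the engine returned 4 documents, 3 were relevant,
# and only 2 of the relevant ones made it into the results.
returned = [1, 3, 5, 6]
wanted = [1, 4, 5]
print(precision(returned, wanted))  # 0.5
print(recall(returned, wanted))
```

Returning more documents tends to raise recall and lower precision, which is why the two are usually looked at together.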
Challenges with search
• Relevance
• Getting the tokens right
– Tokenization
– Stemming
• Multi-lingual content
– Or other cross-cutting search concerns
• Tolerance
Real-time Analytics
Queue (Redis)
“Shippers”
“Indexer”
Scaling out
Moar use cases!
#1: Real-Time Alerting System
Percolation
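Percolation turns search around: the queries are stored ahead of time and every incoming document is matched against them, which is what makes it useful for alerting. The toy below only shows the idea in memory; the alert names and the "all terms must be present" matching rule are assumptions for this sketch and not the Elasticsearch percolator API.

```python
import re

# Hypothetical registered queries: alert name -> terms that must all appear.
REGISTERED_QUERIES = {
    "auth-alert": {"login", "failed"},
    "disk-alert": {"disk", "full"},
}


def percolate(text, queries=REGISTERED_QUERIES):
    """Return the names of all stored queries that match the document."""
    terms = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(name for name, required in queries.items()
                  if required <= terms)


print(percolate("Login failed for user frank"))  # ['auth-alert']
print(percolate("Everything is fine"))           # []
```

Each new document triggers the alerts whose stored queries it satisfies, so nothing has to poll the index.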
#2: Smarter query parsing
Matching inexact queries
• Phrase slop
– “Bridge of London” -> “London Bridge”
• Word-level edit distance with fuzzy queries
– ditsance -> distance
– color -> colour
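The two fuzzy examples above are each a single edit apart, assuming a swap of two adjacent characters counts as one edit. A small sketch of that distance (optimal string alignment, a restricted form of Damerau-Levenshtein); the function name is made up for this illustration.

```python
def edit_distance(a, b):
    """Edit distance where insert, delete, substitute and adjacent swap cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ts" <-> "st".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("ditsance", "distance"))  # 1
print(edit_distance("color", "colour"))       # 1
```

Without the transposition rule, “ditsance” would be two edits away from “distance”, so how swaps are counted decides whether a tight fuzziness limit still catches common typos.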
#3: Offline Classification
Structuring the unstructured
• Record linkage
– Bag of words model
– “More Like This” functionality
• NLP
• Entity extraction
#4: Everything is searchable
Geo-spatial search
• Distance
• Shape interactions
• Multiple algorithms
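Distance filtering is the simplest of these. A hedged sketch using the haversine formula on a spherical Earth follows; the radius constant, the city coordinates and the function names are assumptions for this example, and it says nothing about which algorithm Elasticsearch picks internally.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius, Earth treated as a sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def within(center, points, radius_km):
    """Names of the points that lie within radius_km of center."""
    lat, lon = center
    return sorted(name for name, (plat, plon) in points.items()
                  if haversine_km(lat, lon, plat, plon) <= radius_km)


# Approximate coordinates, for illustration only.
LONDON = (51.5074, -0.1278)
PLACES = {"Paris": (48.8566, 2.3522), "New York": (40.7128, -74.0060)}

print(within(LONDON, PLACES, 500))  # ['Paris']
```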
Geo-spatial search
http://cs.stanford.edu/people/karpathy/deepimagesent
Deep Visual-Semantic Alignments for Generating Image Descriptions
#5: Anomaly detection
The Significant Terms Aggregation
Uncommonly common
Mark Harwood’s talk at
http://www.infoq.com/presentations/elasticsearch-revealing-uncommonly-common
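“Uncommonly common” means a term shows up in the subset you are looking at far more often than in the data as a whole. The sketch below scores terms by absolute change times relative change in frequency. That scoring rule, the sample events and all names are assumptions for illustration; it is not claimed to be the exact heuristic the aggregation uses.

```python
from collections import Counter


def significance(fg_count, fg_total, bg_count, bg_total):
    """Score a term by how much more frequent it is in the foreground set."""
    fg_rate = fg_count / fg_total
    bg_rate = bg_count / bg_total
    if bg_rate == 0 or fg_rate <= bg_rate:
        return 0.0
    return (fg_rate - bg_rate) * (fg_rate / bg_rate)


def significant_terms(foreground, background):
    """Rank the foreground terms that are unusually common, best first."""
    fg = Counter(t for doc in foreground for t in set(doc))
    bg = Counter(t for doc in background for t in set(doc))
    scored = []
    for term, fg_count in fg.items():
        score = significance(fg_count, len(foreground),
                             bg[term], len(background))
        if score > 0:
            scored.append((term, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical tokenized events; the foreground is the "error" subset.
EVENTS = [
    ["error", "timeout", "db"],
    ["error", "timeout", "db"],
    ["info", "login", "ok"],
    ["info", "login", "ok"],
    ["info", "logout", "ok"],
    ["info", "login", "ok"],
    ["error", "login", "failed"],
    ["info", "logout", "ok"],
]
ERRORS = [event for event in EVENTS if "error" in event]

for term, score in significant_terms(ERRORS, EVENTS):
    print(term, round(score, 3))
```

“login” appears in the error subset but is even more common overall, so it scores zero and drops out, while “timeout” and “db” stand out.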
#6: Debugging a distributed system
Queue (Redis)
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
   at AjaxControlToolkit.ToolkitScriptManager.GetScriptCombineAttributes(Assembly assembly)
   at AjaxControlToolkit.ToolkitScriptManager.IsScriptCombinable(ScriptEntry scriptEntry)
   at AjaxControlToolkit.ToolkitScriptManager.OnResolveScriptReference(ScriptReferenceEventArgs e)
   at System.Web.UI.ScriptManager.RegisterScripts()
   at System.Web.UI.ScriptManager.OnPagePreRenderComplete(Object sender, EventArgs e)
   at System.Web.UI.Page.OnPreRenderComplete(EventArgs e)
   at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
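Raw lines like the access log entry become far more useful once they are split into fields before indexing. A minimal sketch, assuming the Apache combined log format; the regular expression and the field names are choices made for this example, not a fixed schema.

```python
import json
import re
from datetime import datetime

# Assumption: the line follows the Apache "combined" log format.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Turn one access log line into a dict ready to be indexed as JSON."""
    match = COMBINED.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    when = datetime.strptime(doc.pop("time"), "%d/%b/%Y:%H:%M:%S %z")
    doc["timestamp"] = when.isoformat()
    doc["status"] = int(doc["status"])
    doc["size"] = 0 if doc["size"] == "-" else int(doc["size"])
    return doc


LINE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" '
        '"Mozilla/4.08 [en] (Win98; I ;Nav)"')

print(json.dumps(parse_line(LINE), indent=2))
```

With status, path and timestamp as separate fields, questions like "which URLs returned errors in the last hour" become plain filters and aggregations instead of text matching.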
#7: Distributed git storage
• PoC in C# using libgit2sharp
• https://github.com/synhershko/libgit2sharp.Elasticsearch
• Kudos @nulltoken
Putting this to practice
• Search on your data
– Data doesn’t have to be structured to be queried
• Use your logs to gain insight
– Metrics
– Establish a baseline
– Investigate unexpected / unfamiliar behaviors
Thank you. Questions?
Itamar Syn-Hershko
http://code972.com
@synhershko