elasticsearch in production: lessons learned

22
ElasticSearch in Production lessons learned Anne Veling, ApacheCon EU, November 6, 2012

Upload: beyondtrees

Post on 06-May-2015

46.631 views

Category:

Technology


3 download

DESCRIPTION

With Proquest Udini, we have created the worlds largest online article store, and aim to be the center for researchers all over the world. We connect to a 700M solr cluster for search, but have recently also implemented a search component with ElasticSearch. We will discuss how we did this, and how we want to use the 30M index for scientific citation recognition. We will highlight lessons learned in integrating ElasticSearch in our virtualized EC2 environments, and challenges aligning with our continuous deployment processes.

TRANSCRIPT

Page 1: ElasticSearch in Production: lessons learned

ElasticSearch in Production

lessonslearned

Anne Veling, ApacheCon EU, November 6, 2012

Page 2: ElasticSearch in Production: lessons learned

agenda

Introduction

ElasticSearch

Udini

Upcoming Tool

Lessons Learned

Page 3: ElasticSearch in Production: lessons learned

introduction

Anne Veling, @anneveling

Self-employed contractorSoftware Architect

Agile process management

Performance optimization

Lucene/SOLR/ElasticSearch implementations & training

Page 4: ElasticSearch in Production: lessons learned

ElasticSearch

Apache Lucene

Started in 2010 by Shay Banon

Open Source – Apache License

A company was formed in 2012: ElasticSearchTraining, support and development

Careful feature developmentvs. build because you can

@kimchy

Page 5: ElasticSearch in Production: lessons learned

ElasticSearch

ScalableDistributed, Node Discovery

Automatic sharding

Query distribution

RESTful, HTTP APIWith API wrappers for Ruby, Java, Scala, …

JSON in, JSON out

Document ModelMaps book.author.lastName

“schemaless” -> field type recognition

Keeps source, keeps ‘version’ number

Page 6: ElasticSearch in Production: lessons learned

ElasticSearch

Integrated facetingWith statistical aggregates (sum/avg/…) for free

Field types and analyzersString, numerics, geo, attachment, …

Arrays, subdocuments, nested documents

Integrated shardingRouting and alias

Cross-index searching / multi-document type

Page 7: ElasticSearch in Production: lessons learned

udini.proquest.com

ProQuest

The World’s Article Store

StackAmazon EC2

Scala with Unfiltered

MongoDB, ElasticSearch

Page 8: ElasticSearch in Production: lessons learned

architecture

mongo DB

ElasticSear

ch

Udini

SOLR

Summon

PDF pipeline

FulfillmentProviders

Page 9: ElasticSearch in Production: lessons learned

SOLR at Udini

Connecting to Summon API700M SOLR Cluster

In Udini, we serve a subset of 160M full text articles

Including fulfillment mechanisms

PDF and HTML5 viewing and annotation

Page 10: ElasticSearch in Production: lessons learned

ElasticSearch at Udini

Local index to search your articles

Many small user libraries, searching only locallyUser-id as sharding key

Include key in all queries

Page 11: ElasticSearch in Production: lessons learned

Exciting new product

Developing for ProQuest

Exciting new research tool for scientific researchers

Creating a large ElasticSearch index for journal article canonicalization

Currently in private beta, launching in the coming months

Page 12: ElasticSearch in Production: lessons learned

Lessons Learned

Very fast indexing

Bulk indexing ftwSet up without replicas (replicas = 0, not 1)

Play with bulk size

Simple write to disk and CURL it in, is very fast

1M records in 40sfor f in ${BATCH_DIR}/batch-*.jsondo echo "about to index $f" curl --silent --show-error --request POST

--data-binary @$f localhost:9200/_bulk > /dev/null echodone

Page 13: ElasticSearch in Production: lessons learned

Lessons learned

Schema(less)?

Automatic field type recognitionCan miss types

Strict about types #duh

Mapping of subfields (doc.title vs doc.publication.title)

Version dependent

In realitySchema still needed

Mapping changes still non trivial

Page 14: ElasticSearch in Production: lessons learned

Lessons learned

Learn to trust ElasticSearchAnalyzers: do not pretokenize queries yourself…

Difference between “term” and “text” type queriestokenized or not

ElasticSearch probably already does what you want it to do

Search for it

Try it

Page 15: ElasticSearch in Production: lessons learned

Lessons learned

Issues with automated testing and node discovery/startup

Start/stop hundreds of times during Jenkins test jobs or development boxes

Takes time

Locally sometimes picks up previous versions

Memory issues: ElasticSearch manages a large part of its memory outside of the heap

Do not simply increase -Xmx

Page 16: ElasticSearch in Production: lessons learned

Lessons learned

New tools every month

waitForYellowStatus

Aliases, routing allow for clever control

Page 17: ElasticSearch in Production: lessons learned

API

ElasticSearch is new, connection libraries still in infancy, documentation growing

Issues using the Java API in Scala

Happy with Scalastic nowsynchronous

asynchronous

bulk prepare https://github.com/bsadeh/scalastic

Page 18: ElasticSearch in Production: lessons learned

#nodb

ElasticSearch used as a full nosql datastore?

Using “version” and optimistic locking scheme

Could replace MongoDb in our setup

ElasticSearch is actually a store optimized for getting stuff out, not for getting stuff in

With free faceting

Who needs multi-table transactions anyway?

Page 19: ElasticSearch in Production: lessons learned

SOLR vs ElasticSearch

SOLRWell-known, many tools, extensions

Feels clunky to configure

Manual document to lucene mapping

Replication and indexing in a cluster non-trivial

Old school ;-)

ElasticSearchNew kid on the block

Very easy to configure

Handles document to lucene mapping

Horizontally scalable

Easy replication

But: shard key

New school

Page 20: ElasticSearch in Production: lessons learned

search evolution• Custom indexers

• Inverted index• Segment merges

• Custom analyzers• Faceting

• Configuration of analyzers• Faceting, Geospatial

• Document mapping• Sub-document queries• Replication

• JSON document input• Faceting, complex queries just work

Page 21: ElasticSearch in Production: lessons learned

conclusions

ElasticSearch benefitsEasy to setup

Very clever architecture

DrawbacksVery new software, tool support limited

But lots of movement

Change sharding in a full index non-trivial

ElasticSearchClever architecture, fast, stable

Does exactly what you need

Page 22: ElasticSearch in Production: lessons learned

thank you

[email protected]

@anneveling

Are you still using Solr?Come on, it’s 2012 already ;-)