AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Mahdi Ben Hamida - SignalFx
11/30/2016
DEV307
How to Scale and Operate Elasticsearch on AWS
What to Expect from the Session
• Elasticsearch (ES) usage at SignalFx
• What do we use ES for?
• How is ES deployed on AWS?
• Backup/restore of ES on Amazon S3
• Important ES/AWS metrics to monitor; what to alert on
• ES capacity planning
• Zero-downtime re-sharding
• SignalFx metadata storage architecture overview
• Scaling up and zero-downtime re-sharding on AWS
Elasticsearch at SignalFx
ES Usage
• Ad-hoc queries
• Auto-complete
• Full-text search
Cluster Size
• 4 clusters in production on Amazon EC2
• Biggest cluster
  • 54 data nodes, 3 master nodes, 6 client nodes deployed across 3 AZs
  • Over 1.3 billion unique documents
  • 10+ TB of data
  • 270 shards (primaries + replicas)
  • Sustained 75 QPS, 1K index operations/sec
ES Deployment on AWS
• Dockerized ES 2.3/1.7 clusters. Orchestration done using MaestroNG
• Biggest cluster
  • Data nodes: i2.2xlarge – 16 GB heap (61 GB total)
  • Master nodes: m3.large – 2 GB heap (7.5 GB total)
  • Client nodes: m3.xlarge – 10 GB heap (15 GB total)
• ES rack awareness to distribute the primary and 2 replicas of each shard across 3 Availability Zones
Backup/Restore
• Made easy using the AWS Cloud plugin:
  PUT _snapshot/s3-repo
  { "type": "s3", "settings": { "bucket": "signalfx-es-backups", "region": "us-east" } }
• Incremental backups
• Un-versioned S3 bucket
• VPC S3 endpoint to avoid bandwidth constraints
• Instance profiles for authentication to S3
• Cron job for hourly snapshots and weekly rotation
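A minimal sketch of what the hourly snapshot and weekly rotation job could compute, assuming snapshots are named after the UTC hour they were taken. Only the repository name `s3-repo` comes from the slide; the `hourly-` prefix, the helper names and the 7-day retention window are illustrative assumptions. The functions build the snapshot API paths instead of sending them, so the rotation logic can be checked offline.

```python
from datetime import datetime, timedelta, timezone

REPO = "s3-repo"      # repository name from the slide
PREFIX = "hourly-"    # hypothetical naming scheme
FMT = "%Y%m%d%H"      # one snapshot per hour


def snapshot_name(now):
    """Name of the snapshot taken at `now`."""
    return PREFIX + now.strftime(FMT)


def create_request(now):
    """(method, path) of the snapshot-creation call."""
    return ("PUT", f"/_snapshot/{REPO}/{snapshot_name(now)}")


def expired_snapshots(names, now, keep=timedelta(days=7)):
    """Names of snapshots older than the retention window."""
    cutoff = now - keep
    old = []
    for name in names:
        if not name.startswith(PREFIX):
            continue  # ignore snapshots this job did not create
        taken = datetime.strptime(name[len(PREFIX):], FMT)
        if taken.replace(tzinfo=timezone.utc) < cutoff:
            old.append(name)
    return old


def delete_requests(names, now):
    """(method, path) pairs that remove the expired snapshots."""
    return [("DELETE", f"/_snapshot/{REPO}/{n}")
            for n in expired_snapshots(names, now)]
```

Because snapshots are incremental, deleting an old snapshot only removes the files that no remaining snapshot still references.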
ES Monitoring & Alerting
Key Performance Metrics
Key Detectors
• High CPU usage, low free disk space
• Sustained high heap usage
• Master node availability
• Cluster state (green/yellow/red)
• Unassigned shards
• Thread pool rejections (search, bulk, index are the most critical)
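A rough sketch of how some of these detectors could be evaluated from the `_cluster/health` and `_nodes/stats` responses. The function name, the 85% heap threshold and the `prev_rejected` bookkeeping are assumptions made for illustration; a real detector would also require the heap condition to hold over a time window ("sustained") rather than on a single sample.

```python
def evaluate_detectors(health, node_stats, heap_limit=85, prev_rejected=None):
    """Return alert messages derived from cluster health and node stats.

    `health` is a `_cluster/health` response and `node_stats` a
    `_nodes/stats` response, both already parsed into dicts.
    """
    alerts = []
    if health["status"] != "green":
        alerts.append(f"cluster status is {health['status']}")
    if health["unassigned_shards"] > 0:
        alerts.append(f"{health['unassigned_shards']} unassigned shards")

    prev_rejected = prev_rejected or {}
    for node_id, stats in node_stats["nodes"].items():
        heap = stats["jvm"]["mem"]["heap_used_percent"]
        if heap >= heap_limit:
            alerts.append(f"{node_id}: heap at {heap}%")
        for pool in ("search", "bulk", "index"):
            rejected = stats["thread_pool"][pool]["rejected"]
            # `rejected` is a cumulative counter, so alert on growth only
            if rejected > prev_rejected.get((node_id, pool), 0):
                alerts.append(f"{node_id}: {pool} thread pool rejections")
    return alerts
```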
Always Test your ES Detectors/Alerts
Elasticsearch Capacity Planning
Capacity Factors
• Indexing
  • CPU/IO utilization can be considerable
  • Merges are CPU/IO intensive. Improved in ES 2.0
• Queries
  • CPU load
  • Memory load
ES Sharding & Scale-up
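[Diagram: an index with two shards (0 and 1) shown in three layouts. Two nodes: node-1 holds 1P and 0R, node-2 holds 0P and 1R. Four nodes: one shard copy per node (1P, 0P, 0R, 1R on node-1 to node-4). Six nodes: a second replica of each shard (0R, 1R) is added on node-5 and node-6.]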
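The layouts on this slide come down to simple arithmetic, sketched here with hypothetical helper names. Once every node holds a single shard copy, further nodes can only be used by adding replicas. Replicas spread query load, but every write still goes to the primary and all of its replicas, which is why indexing capacity eventually requires more primaries.

```python
def shard_copies(primaries, replicas):
    """Total shard copies: each primary plus its replicas."""
    return primaries * (1 + replicas)


def copies_per_node(primaries, replicas, nodes):
    """Average number of shard copies each data node holds."""
    return shard_copies(primaries, replicas) / nodes


def max_useful_nodes(primaries, replicas):
    """Past one copy per node, extra data nodes hold nothing of this index."""
    return shard_copies(primaries, replicas)
```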
Sizing Shards
• Create an index with one shard
• Simulate what you expect your indexing load to be – measure CPU/IO load, find where it breaks
• Do the same with queries
• Determine disk consumption (average document size)
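Once the single-shard limits are measured, the number of primaries can be estimated with back-of-the-envelope arithmetic. This is a sketch under stated assumptions: the function and parameter names are made up, and it assumes indexing throughput and disk usage scale roughly linearly with the number of primaries. Query capacity is left out on purpose, since by default each query fans out to every shard and so does not divide the same way.

```python
import math


def primaries_needed(index_rate, shard_index_rate,
                     total_docs, avg_doc_bytes, shard_max_bytes):
    """Estimate the primary shard count from single-shard measurements.

    index_rate:       expected documents indexed per second
    shard_index_rate: rate at which the one-shard test index broke down
    shard_max_bytes:  largest shard size you are willing to operate
    """
    by_indexing = math.ceil(index_rate / shard_index_rate)
    by_disk = math.ceil(total_docs * avg_doc_bytes / shard_max_bytes)
    return max(by_indexing, by_disk, 1)
```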
Zero-downtime Re-sharding
Why Re-shard?
• Required if you can’t scale up indexing by adding more nodes
• If the index is read-only, you could implement a simpler approach using aliases
• If the index is being written to, it’s more complicated
SignalFx’s Metadata Storage Architecture
[Diagram components: service-A, metabase-client, mb-server-1, metabase-1, write-topic, index-topic, C* (Cassandra)]
(1) enqueue write
(2) dequeue write
(3) write to C*
(4) enqueue index
(5) dequeue index
(6) read from C*
(7) index document
Index Re-sharding Process
• Pre-requisites
• Phase 1: create target index
• Phase 2: bulk re-indexing
• Phase 3: double writing & change reconciliation
• Phase 4: testing new index
• Phase 5: complete re-sharding process
Pre-requisite 1: readers query from an alias
[Diagram: three readers query the alias myindex, which points to the index myindex_v1.]
Pre-requisite 2: indexing state + generation number
[Diagram: the indexer writes to myindex_v1.]
Indexer state: generation: 42; extra: <null>; current: myindex_v1
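A minimal sketch of the indexing state shown on this slide and of how an indexer might use it. The `generation`, `extra` and `current` fields come from the slide, and the `_generation` document field appears on later slides; the class, the function and the in-memory `indices` dict standing in for Elasticsearch are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexingState:
    generation: int
    current: str                 # index that receives all writes
    extra: Optional[str] = None  # second index while double writing

    def targets(self) -> List[str]:
        """Indices every write must go to."""
        return [self.current] + ([self.extra] if self.extra else [])


def index_document(state, doc, indices):
    """Stamp `doc` with the current generation and write it to all targets."""
    stamped = dict(doc, _generation=state.generation)
    for name in state.targets():
        indices.setdefault(name, {})[doc["id"]] = stamped
    return stamped
```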
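Phase 1: create new index with updated mappings
[Diagram: the new, empty index myindex_v2 is created next to myindex_v1; the indexer still writes to myindex_v1 only.]
Indexer state: generation: 42; extra: <null>; current: myindex_v1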
Phase 2: increment generation, then start bulk re-indexing of older generations
[Diagram: documents with _generation <= 42 are copied from myindex_v1 to myindex_v2.]
Indexer state: generation: 43; extra: <null>; current: myindex_v1
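During this step, documents may get added/updated (or deleted*)
[Diagram: while documents with _generation <= 42 are copied to myindex_v2, the indexer keeps writing to myindex_v1; documents created or updated during the copy are stamped with generation 43.]
Indexer state: generation: 43; extra: <null>; current: myindex_v1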
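Index state at the end of the bulk indexing
[Diagram: myindex_v2 holds the bulk-copied documents; myindex_v1 also holds the documents stamped with generation 43 that were written during the copy.]
Indexer state: generation: 43; extra: <null>; current: myindex_v1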
Phase 3 – (a): enable double writing & bump generation
[Diagram: the indexer now writes to both myindex_v1 and myindex_v2; the generation-43 documents still exist only in myindex_v1.]
Indexer state: generation: 44; extra: myindex_v2; current: myindex_v1
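Phase 3 – (b): re-index documents at generation 43
[Diagram: new writes, stamped with generation 44, land in both indices while the generation-43 documents start being copied from myindex_v1 to myindex_v2.]
Indexer state: generation: 44; extra: myindex_v2; current: myindex_v1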
Phase 3 – (c): re-index documents at generation 43
[Diagram, three animation frames: the remaining generation-43 documents are copied to myindex_v2 step by step, while double-written generation-44 documents keep arriving in both indices.]
Indexer state: generation: 44; extra: myindex_v2; current: myindex_v1
Phase 3 – (e): perfect sync of both indices
[Diagram: myindex_v1 and myindex_v2 contain the same documents at the same generations.]
Indexer state: generation: 44; extra: myindex_v2; current: myindex_v1
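The reconciliation step of Phase 3 can be sketched with in-memory dicts standing in for the two indices. The `_generation` field comes from the slides; the function names and the rule of never overwriting a target document that carries a higher generation are assumptions about how conflicts might be resolved, not a description of SignalFx's implementation.

```python
def reindex_generation(source, target, generation):
    """Copy documents stamped with `generation` from source to target.

    A target document with a higher generation was double-written after
    the source copy was read, so it is left untouched (assumed rule).
    Returns the number of documents copied.
    """
    copied = 0
    for doc_id, doc in source.items():
        if doc["_generation"] != generation:
            continue
        existing = target.get(doc_id)
        if existing is not None and existing["_generation"] > generation:
            continue
        target[doc_id] = dict(doc)
        copied += 1
    return copied


def in_sync(source, target):
    """True when both indices hold the same documents."""
    return source == target
```

With double writing already enabled, every write made after the generation bump reaches both indices, so only the documents stamped during the bulk copy (generation 43 on the slides) need this pass.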
Phase 4: A/B testing of the new index
[Diagram: readers query through the alias myindex; the indexer keeps double writing to myindex_v1 and myindex_v2 while the new index is tested.]
Indexer state: generation: 44; extra: myindex_v2; current: myindex_v1
Phase 4: swap read alias (or swap back!)
[Diagram: the alias myindex used by the readers is switched from myindex_v1 to myindex_v2; double writing continues, so the alias can be switched back.]
Indexer state: generation: 44; extra: myindex_v2; current: myindex_v1
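The alias swap maps onto a single `POST /_aliases` request whose actions are applied atomically, so readers never see a moment without an index behind the alias. The helper below is a sketch that only builds the request body; its name is made up, while the index and alias names are the ones on the slides.

```python
def swap_alias_actions(alias, old_index, new_index):
    """Request body for POST /_aliases that moves `alias` in one step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Swap readers to the new index; swapping back is the same call reversed.
forward = swap_alias_actions("myindex", "myindex_v1", "myindex_v2")
rollback = swap_alias_actions("myindex", "myindex_v2", "myindex_v1")
```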
Phase 5: switch write index, bump generation, stop double writing
[Diagram: the indexer writes only to myindex_v2; new documents are stamped with generation 45.]
Indexer state: generation: 45; extra: <null>; current: myindex_v2
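The indexer state transitions across the phases can be summarized as three small functions. The state fields and the generation numbers match the slides; the function names and the use of plain dicts are illustrative assumptions.

```python
def start_bulk_reindex(state):
    """Phase 2: bump the generation before copying older generations."""
    return dict(state, generation=state["generation"] + 1)


def enable_double_writing(state, new_index):
    """Phase 3 (a): write to both indices and bump the generation."""
    return dict(state, generation=state["generation"] + 1, extra=new_index)


def complete_resharding(state):
    """Phase 5: the extra index becomes current; double writing stops."""
    return {
        "generation": state["generation"] + 1,
        "extra": None,
        "current": state["extra"],
    }


state = {"generation": 42, "extra": None, "current": "myindex_v1"}
state = start_bulk_reindex(state)                    # generation 43
state = enable_double_writing(state, "myindex_v2")   # generation 44
state = complete_resharding(state)                   # generation 45
```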
Handling Failures
• Bulk re-indexing can fail (and it does); you don’t want to restart from scratch
  • Use a “partition” field
  • Migrate partition ranges
• Deletions could be a problem. We handle that by writing “deletion markers” instead of deleting, then cleaning them up afterwards
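A sketch of resumable migration by partition ranges. It assumes every document carries a numeric field named `partition` (the field name, the helpers and the in-memory `done` set are hypothetical); in practice the set of completed ranges would be persisted so that a failed run can resume where it stopped.

```python
def partition_ranges(num_partitions, step):
    """Split [0, num_partitions) into half-open ranges of at most `step`."""
    return [(lo, min(lo + step, num_partitions))
            for lo in range(0, num_partitions, step)]


def range_query(lo, hi, max_generation):
    """Query body selecting one partition range of the older generations."""
    return {"query": {"bool": {"must": [
        {"range": {"partition": {"gte": lo, "lt": hi}}},
        {"range": {"_generation": {"lte": max_generation}}},
    ]}}}


def migrate(ranges, done, migrate_range):
    """Migrate every range not yet in `done`, recording each success.

    If `migrate_range` raises, the ranges already completed stay in
    `done`, so a retry skips them instead of starting from scratch.
    """
    for r in ranges:
        if r in done:
            continue
        migrate_range(r)
        done.add(r)
```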
Performance Considerations
• Migrate using partition ranges to avoid holding segments for a long time
• Add temporary nodes to handle the load
• Disable refreshes on the target index (so worth it!)
• Start with no replicas (or one, just in case)
• Avoid “hot” shards by sorting on a field (a timestamp, for example)
• Use throttling controls to limit indexing load
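Two of these points translate directly into index settings, and throttling can be as simple as pacing bulk batches. The sketch below builds bodies for `PUT /<index>/_settings` using the `refresh_interval` and `number_of_replicas` settings; the helper names, the restore defaults (a 1s refresh, 2 replicas) and the pacing formula are assumptions chosen for illustration.

```python
def bulk_load_settings(replicas=0):
    """Settings for the target index while bulk re-indexing."""
    return {"index": {"refresh_interval": "-1",  # -1 disables refreshes
                      "number_of_replicas": replicas}}


def restore_settings(refresh_interval="1s", replicas=2):
    """Settings to apply once the bulk copy is finished."""
    return {"index": {"refresh_interval": refresh_interval,
                      "number_of_replicas": replicas}}


def delay_for(batch_size, elapsed, max_docs_per_sec):
    """Seconds to sleep after a batch so the rate stays under the limit.

    `elapsed` is how long the batch took to index, in seconds.
    """
    return max(0.0, batch_size / max_docs_per_sec - elapsed)
```

A re-indexing loop would call `time.sleep(delay_for(...))` after each bulk request, with `max_docs_per_sec` exposed as a control that can be adjusted while the migration runs.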
Thank you!
Sign up for a free trial at signalfx.com
Remember to complete your evaluations!