Transcript
Page 1: Getting 100B Metrics to Disk

GETTING 100B METRICS TO DISK

Jonathan Thurman - Site Reliability Engineer @jthurman42

194B

http://www.flickr.com/photos/meteopassione/9157134653/

Page 2: Getting 100B Metrics to Disk

NEW RELIC

• Performance Monitoring

• Web Apps

• Mobile Apps

• Servers

• Databases, Caches & More…

• Software Analytics

Page 3: Getting 100B Metrics to Disk

OKAY, YOU COLLECT DATA

• 194 Billion Metrics

• 100,000 req/sec

• 2 Gbps Inbound

• 216 Terabytes

• All backed by MySQL

http://www.flickr.com/photos/bobsfever/6658919861/

Page 4: Getting 100B Metrics to Disk

HOW WE GOT HERE

http://www.flickr.com/photos/auvet/853157494/

Page 5: Getting 100B Metrics to Disk

BUILDING BLOCKS

• Hosted Environment

• Xen Virtual Machines

• Data storage

• ATA over Ethernet

• SATA drives

• MySQL 5.0

• Single Ruby on Rails Application

http://www.flickr.com/photos/riekhavoc/4648423297/

Page 6: Getting 100B Metrics to Disk

SHARDING FROM INCEPTION

• Account Information

• Read heavy

• Single HA Instance

• Agent Data

• Write heavy

• 8 shards based on AccountId

http://www.flickr.com/photos/erikb/48221952/

Page 7: Getting 100B Metrics to Disk

TALE OF TWO MODELS

• Ruby on Rails

• class ShardData < ActiveRecord::Base

• Look up shard for Account

• Override ConnectionHandler

http://www.flickr.com/photos/jungle_boy/140279885/

Page 8: Getting 100B Metrics to Disk
Page 9: Getting 100B Metrics to Disk

TRIBBLES TABLES

• Metric table name contains

• AccountID

• Year and Julian Day

• Resolution

• ts_72_13221_1h

• Currently ~200k tables per DB
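
As a rough illustration of the naming scheme above, the sketch below builds and parses a table name. The layout `ts_<account id>_<2-digit year + day of year>_<resolution>` is inferred from the single example `ts_72_13221_1h`; the function names are hypothetical and do not come from the talk.

```python
from datetime import date, datetime


def metric_table_name(account_id, day, resolution):
    """Build a per-account, per-day metric table name.

    Assumed layout (inferred from the slide's example, not confirmed):
    ts_<account id>_<2-digit year><3-digit day of year>_<resolution>
    """
    return f"ts_{account_id}_{day.strftime('%y%j')}_{resolution}"


def parse_metric_table_name(name):
    """Split a table name back into (account_id, day, resolution)."""
    prefix, account, yyddd, resolution = name.split("_")
    if prefix != "ts":
        raise ValueError(f"not a metric table: {name}")
    day = datetime.strptime(yyddd, "%y%j").date()
    return int(account), day, resolution


# Day 221 of 2013 is 9 August 2013.
print(metric_table_name(72, date(2013, 8, 9), "1h"))
```

Under this reading, the example table holds account 72's one-hour data for 9 August 2013.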

http://www.flickr.com/photos/15942690@N00/4571141076/

Page 10: Getting 100B Metrics to Disk

BINGE AND PURGE

• Purging data

• DELETE FROM …

• DROP TABLE …

• innodb_file_per_table

• innodb_lazy_drop_table (pre 5.5.30-30.2)
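
With one table per account per day, purging can be a matter of dropping whole tables rather than deleting rows, and with `innodb_file_per_table` each dropped table frees its own file on disk. The sketch below shows one way to pick expired tables and emit the statements. The table-name layout is the one inferred from the earlier example, and the retention window and function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def expired_tables(table_names, today, retention_days):
    """Return the tables whose day suffix is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in table_names:
        # Assumed name layout: ts_<account>_<yyddd>_<resolution>
        _, _, yyddd, _ = name.split("_")
        if datetime.strptime(yyddd, "%y%j").date() < cutoff:
            expired.append(name)
    return expired


def drop_statements(table_names):
    """One DROP TABLE per expired table; no row-by-row DELETE is needed."""
    return [f"DROP TABLE IF EXISTS `{name}`;" for name in table_names]


tables = ["ts_72_13221_1h", "ts_72_13230_1h"]
old = expired_tables(tables, today=date(2013, 8, 20), retention_days=7)
for statement in drop_statements(old):
    print(statement)
```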

http://www.flickr.com/photos/exalthim/2261294871/

Page 11: Getting 100B Metrics to Disk

http://www.flickr.com/photos/davidmonro/8331755849/

http://www.flickr.com/photos/heliocentric/1571127347/

http://www.flickr.com/photos/aigle_dore/6225535459/

Page 12: Getting 100B Metrics to Disk

GROWING PAINS

http://www.flickr.com/photos/aigle_dore/5626285743/

Page 13: Getting 100B Metrics to Disk

MULTIPLE POINTS OF FAILURE

• Single shard slows down

• App servers wait for response

• DB connection pool becomes full

• Site goes down

http://www.flickr.com/photos/boston_public_library/8204384670/

Page 14: Getting 100B Metrics to Disk

SHARD GUARD

• Monitor all databases

• Identify shard status:

• Bad? Mark as “wedged”

• Good? Clear “wedged” flag

• ShardData checks status!
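
A minimal sketch of the idea, assuming a monitor that reports health per shard and a data-access layer that checks the flag before querying. The class and function names are hypothetical; the talk does not show the real implementation, which lives in the Rails `ShardData` model.

```python
class ShardWedgedError(Exception):
    """Raised instead of waiting on a shard that is known to be unhealthy."""


class ShardGuard:
    def __init__(self):
        self._wedged = set()

    def report(self, shard_id, healthy):
        """Called by the monitor after each health check."""
        if healthy:
            self._wedged.discard(shard_id)  # good: clear the flag
        else:
            self._wedged.add(shard_id)      # bad: mark as wedged

    def is_wedged(self, shard_id):
        return shard_id in self._wedged


def run_on_shard(guard, shard_id, query_fn):
    """Fail fast when the shard is wedged; otherwise run the query."""
    if guard.is_wedged(shard_id):
        raise ShardWedgedError(f"shard {shard_id} is wedged")
    return query_fn()
```

Failing fast keeps app servers from holding connections open against a slow shard, which is the failure chain described on the previous slide.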

http://www.flickr.com/photos/mac_filko/5486980804/

Page 15: Getting 100B Metrics to Disk

STABILITY AND PERFORMANCE

• Degraded performance

• New Accounts => Shard 9!

• Old accounts remain as-is

http://www.flickr.com/photos/ejpphoto/7823027272/

Page 16: Getting 100B Metrics to Disk

DATA COLLECTION

• Rails isn’t great for data collection

• Ruby isn’t great either…

• Rewritten in Java using Jetty

http://www.flickr.com/photos/autograt/224540606/

Page 17: Getting 100B Metrics to Disk

CACHE IS KING

• Buffered, not queued

• RAM is cheaper than I/O

• Get creative with batch processing

http://www.flickr.com/photos/epsos/8474532085/

Page 18: Getting 100B Metrics to Disk

INSERT INTO (SELECT …

• Select rows and re-process

• Cache last hour in Java’s Heap

• Write a journal and post-process it

http://www.flickr.com/photos/esoteric_13/4741001804/

Page 19: Getting 100B Metrics to Disk

READ / WRITE PROBLEM

• Sequential Inserts

• Batched in 5k chunks

• Optimize for Throughput

• Must complete < 1 minute
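
The batching step can be sketched as below. Only the 5k chunk size comes from the slide; the helper names and the `write_batch` callback (which would issue one multi-row insert per chunk) are assumptions for illustration.

```python
from itertools import islice


def chunked(rows, size=5000):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_batches(rows, write_batch, size=5000):
    """Send rows to write_batch() one chunk at a time; return the batch count."""
    count = 0
    for batch in chunked(rows, size):
        write_batch(batch)
        count += 1
    return count


sizes = [len(batch) for batch in chunked(range(12001))]
print(sizes)
```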

Page 20: Getting 100B Metrics to Disk

READ / WRITE PROBLEM

• Scattered Reads

• Optimized for Latency

• Unique Covering Indexes

Page 21: Getting 100B Metrics to Disk

MOVE TO HARDWARE

• Instant performance!

• Just add…

• Datacenter - Chicago, US

• Servers - Dell

• Storage - Direct Attached

• Time - About 6 months

http://www.flickr.com/photos/zebble/9621007/

Page 22: Getting 100B Metrics to Disk

SPINNING RUST

• Dell MD1200 shelves

• 8 Disks per shelf

• RAID 5 virtual disk

• Dedicated Hot-spare

http://www.flickr.com/photos/walkn/5472536812/

Page 23: Getting 100B Metrics to Disk

THE GREAT EXPANSE

• MD1200s support 12 disks

• Add four more!

• Online RAID expansion

http://www.flickr.com/photos/aigle_dore/5853807037/

Page 24: Getting 100B Metrics to Disk

#FAIL

• “On-line” expansion, not so much

• Added second 4 disk RAID 5

• LVM Concatenation for space

http://www.flickr.com/photos/fireflythegreat/2845637227/

Page 25: Getting 100B Metrics to Disk

NEED MORE CAPACITY

• Tight on disk space

• Performance not an issue

• New Accounts => Shard 10!

• Old Accounts as-is

http://www.flickr.com/photos/seandreilinger/6289721616/

Page 26: Getting 100B Metrics to Disk
Page 27: Getting 100B Metrics to Disk

SHARD PITFALLS

http://www.flickr.com/photos/21206761@N00/469110140/

Page 28: Getting 100B Metrics to Disk

MIGRATION PROBLEM

• Accounts cannot move

• Not all tables have the shard key

• Rails defaults to auto-increment IDs

• Massive primary key collisions

• Punt and move the metrics
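
To illustrate why auto-increment IDs block account moves, the sketch below merges rows from two shards. Each shard numbers its rows from 1, so a merge keyed on `id` alone collides, while a key that includes the shard key does not. That only helps for tables that carry the shard key, which the slide notes is not all of them. The row shape and function names are hypothetical.

```python
def merge_by_id(*shards):
    """Naive merge keyed on the auto-increment id; later rows overwrite earlier ones."""
    merged = {}
    collisions = 0
    for rows in shards:
        for row in rows:
            if row["id"] in merged:
                collisions += 1
            merged[row["id"]] = row
    return merged, collisions


def merge_by_composite_key(*shards):
    """Merge keyed on (account_id, id); needs the shard key on every row."""
    merged = {}
    for rows in shards:
        for row in rows:
            merged[(row["account_id"], row["id"])] = row
    return merged


shard_a = [{"id": 1, "account_id": 72}, {"id": 2, "account_id": 72}]
shard_b = [{"id": 1, "account_id": 90}, {"id": 2, "account_id": 90},
           {"id": 3, "account_id": 90}]

merged, collisions = merge_by_id(shard_a, shard_b)
print(len(merged), collisions)
print(len(merge_by_composite_key(shard_a, shard_b)))
```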

http://www.flickr.com/photos/tzafrir/125380911/

Page 29: Getting 100B Metrics to Disk

BREAKING UP IS HARD TO DO

• Agent Databases

• Metadata / Notes / Errors

• Timeslice Databases

• Time-series metric data

• 1 Minute and 1 Hour resolution

http://www.flickr.com/photos/rsepulveda/4275236049/

Page 30: Getting 100B Metrics to Disk
Page 31: Getting 100B Metrics to Disk

RESOURCE POOLS

• Distributed by Shard Key

• Distribution can CHANGE

• Lookup table, not hash

• Data can be MOVED
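
A lookup table differs from a hash in that the mapping is stored data rather than a computed value, so it can be changed per account. The sketch below is one way to express that, assuming an in-memory dictionary in place of whatever table the real system uses. `ShardDirectory` and its methods are hypothetical names.

```python
class ShardDirectory:
    """Maps each shard key (here, an account id) to a shard via a lookup table."""

    def __init__(self, default_shard):
        self._shard_for_account = {}
        self.default_shard = default_shard  # where new accounts land

    def shard_for(self, account_id):
        """Return the account's shard, assigning new accounts to the current default."""
        return self._shard_for_account.setdefault(account_id, self.default_shard)

    def move(self, account_id, new_shard):
        """Repoint an account once its data has been copied to new_shard."""
        self._shard_for_account[account_id] = new_shard


directory = ShardDirectory(default_shard=9)
directory.shard_for(72)          # account 72 is assigned to shard 9
directory.default_shard = 10     # new accounts now go to shard 10
directory.shard_for(73)          # account 73 is assigned to shard 10
directory.move(72, 10)           # account 72 can still be moved later
```

With a pure hash of the shard key, changing the number of shards would remap existing accounts; with a lookup table, existing assignments stay put until they are explicitly moved.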

http://www.flickr.com/photos/dclark3996/4971906528/

Page 32: Getting 100B Metrics to Disk

BACKUPS

• Custom mysqldump wrapper

• Based on business need

• Backup per table

• Ignore tables to be purged
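
The wrapper itself is not shown in the talk. As a sketch of the per-table idea, the function below builds one `mysqldump` command per table and skips tables already scheduled for purge. The database name, output directory and function name are hypothetical; `--single-transaction` and `--result-file` are standard `mysqldump` options, chosen here as an assumption.

```python
import os


def backup_commands(database, tables, tables_to_purge, out_dir):
    """Return one mysqldump argv per table, skipping tables due to be purged."""
    skip = set(tables_to_purge)
    commands = []
    for table in tables:
        if table in skip:
            continue  # no point backing up data that is about to be dropped
        result_file = os.path.join(out_dir, f"{table}.sql")
        commands.append([
            "mysqldump",
            "--single-transaction",
            f"--result-file={result_file}",
            database,
            table,
        ])
    return commands


for argv in backup_commands("shard9", ["ts_72_13221_1h", "ts_72_13230_1h"],
                            ["ts_72_13221_1h"], "/backups"):
    print(" ".join(argv))
```

Each argv list could be handed to `subprocess.run` by the caller; the sketch only builds the commands.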

http://www.flickr.com/photos/usdagov/6896218334/

Page 33: Getting 100B Metrics to Disk

EVOLUTION

http://www.flickr.com/photos/pfsullivan_1056/3485953405/

Page 34: Getting 100B Metrics to Disk

SSD REVOLUTION

• 600GB Intel 320 SSDs

• Dell MD1220 Direct Attached shelf

• Disks are no longer the bottleneck

• Inserts in Read-optimized order are “fast enough”

Page 35: Getting 100B Metrics to Disk

YOU CAN USE SSD WITH DATABASES

• 6 of 420 drives RMA’d

• March 2012 to Aug 2013

• Average 180TB lifetime writes

• 91% wear remaining
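
The figures above can be put into perspective with a little arithmetic. The projection assumes wear grows linearly with bytes written, which is a simplification and not a claim made in the talk.

```python
drives_total = 420
drives_rma = 6
avg_writes_tb = 180        # average lifetime writes per drive
wear_remaining_pct = 91    # average wear remaining

# Share of drives returned over the March 2012 to Aug 2013 window.
rma_rate_pct = 100 * drives_rma / drives_total

# If 9% of rated wear corresponds to 180 TB written, a linear projection
# gives the writes at which wear would reach 100%.
wear_used_pct = 100 - wear_remaining_pct
projected_endurance_tb = avg_writes_tb * 100 / wear_used_pct

print(round(rma_rate_pct, 2), projected_endurance_tb)
```

Under that assumption, about 1.4% of drives were returned and the average drive would tolerate roughly 2,000 TB of writes.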

http://www.flickr.com/photos/joeshlabotnik/3584172834/

Page 36: Getting 100B Metrics to Disk

REDUNDANT ARRAY OF EXPENSIVE DISKS

• Rebuilds under load > 4 hours

• Migrated to RAID 60

• 2 x 12 disk span

• Ditch the Hot-spares
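
For a sense of the capacity trade-off, RAID 60 stripes across RAID 6 spans, and each span gives up two disks' worth of space to parity. The sketch below works that out for the 2 x 12 layout. The 600 GB drive size is taken from the earlier SSD slide and is an assumption here, as is the function name.

```python
def raid60_usable_disks(spans, disks_per_span):
    """Data-bearing disks in a RAID 60 set: each RAID 6 span loses two disks to parity."""
    if disks_per_span < 4:
        raise ValueError("a RAID 6 span needs at least 4 disks")
    return spans * (disks_per_span - 2)


usable = raid60_usable_disks(spans=2, disks_per_span=12)
usable_gb = usable * 600  # assumes 600 GB drives

print(usable, usable_gb)
```

Each span survives two simultaneous drive failures, which is what makes running without dedicated hot-spares more tolerable.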

http://www.flickr.com/photos/mbk/27640225/

Page 37: Getting 100B Metrics to Disk

XFS TUNING

• mkfs.xfs -s size=4096

• options

• noatime

• nobarrier

• inode64

• logbsize=256k

http://www.flickr.com/photos/rocketlass/5169004165/

Page 38: Getting 100B Metrics to Disk

SHARD GUARD PART DEUX

• Protect all the things!

• Kill UI queries over 75 seconds

• Kill background queries over 1 hour

• Yes, all of them

• No really, kill them, now
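
A sketch of the kill policy, assuming process-list rows have already been fetched (for example from `SHOW PROCESSLIST`) and converted to dictionaries. Telling UI queries from background queries by database user is an assumption; the talk does not say how the real tool classifies them. Only the 75 second and 1 hour limits come from the slide.

```python
UI_LIMIT_SECONDS = 75
BACKGROUND_LIMIT_SECONDS = 3600


def queries_to_kill(processlist, ui_users):
    """Return KILL statements for queries that have exceeded their limit."""
    statements = []
    for row in processlist:
        if row["command"] != "Query":
            continue  # skip idle (Sleep) connections and other non-query threads
        if row["user"] in ui_users:
            limit = UI_LIMIT_SECONDS
        else:
            limit = BACKGROUND_LIMIT_SECONDS
        if row["time"] > limit:
            statements.append(f"KILL QUERY {row['id']};")
    return statements


rows = [
    {"id": 1, "user": "ui", "command": "Query", "time": 80},
    {"id": 2, "user": "ui", "command": "Query", "time": 10},
    {"id": 3, "user": "jobs", "command": "Query", "time": 4000},
    {"id": 4, "user": "jobs", "command": "Query", "time": 600},
    {"id": 5, "user": "ui", "command": "Sleep", "time": 9000},
]
print(queries_to_kill(rows, ui_users={"ui"}))
```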

http://www.flickr.com/photos/chiky/7194089194/

Page 39: Getting 100B Metrics to Disk

IF YOU DON’T BELIEVE ME …

• Delayed Job

• Long running background query

• InnoDB History List Traversal

Page 40: Getting 100B Metrics to Disk

TO INFINITY AND BEYOND

http://www.flickr.com/photos/temma2/1149223191/

Page 41: Getting 100B Metrics to Disk

HARDWARE V2

• Dell R620

• 2 x Intel E5-2690 @ 2.90GHz

• 96GB RAM

• MD1220 Storage Shelf

• 800GB Intel SSD S3500

http://www.flickr.com/photos/tnarik/2590037637/

Page 42: Getting 100B Metrics to Disk

CONTINUOUS IMPROVEMENT

• EXT4 / ZFS / XFS

• RAID Card vs HBA

• Percona Server 5.6

• Multiple MySQL Instances

• Databases per Service

http://www.flickr.com/photos/shawnclover/8555834230/

Page 43: Getting 100B Metrics to Disk

JOIN THE TEAM NewRelic.com/jobs

