the smugmug tale presentation
Post on 29-May-2018
8/9/2019 The SmugMug Tale Presentation
1/33
The SmugMug Tale
http://cmac.smugmug.com/gallery/2504559%23131487110
-
Premium photo & video sharing.
Bootstrapped in '02.
$10M+ as of '07.
Profitable.
No debt.
Top 400 website.
Doubling yearly.
Who are we?
-
Premium means more and better.
Unlimited storage.
Unlimited bandwidth.
Big photos (48Mpix). 500M+ of them.
Big video (1920x1080p).
Lots of photos per page.
Super fast.
Our challenge
-
LAMP(hp).
x86 (mostly AMD) on Linux (~300 4+ core hosts?)
4 datacenters: 2 x SV, 1 x VA, 1 x SEA
2 Ops guys. :)
Majority of boxes are diskless.
Consume lots of cloud services (S3, EC2, etc).
Architecture overview
-
Binary data (photos, video, etc):
Stored in Amazon's S3. PBs.
Akamai fronts for caching and acceleration.
Structured data (Database, etc):
MySQL (InnoDB mostly).
4+ cores, 64GB, >2TB storage
Memcached fronts for caching.
Storage
-
Photo & video processing / encoding:
Handled in Amazon EC2.
Totally autonomous scaling (SkyNet)
Customer facing:
Diskless web boxes (PXE boot)
Scaled up *and* out MySQL
Memcached ~1TB
Compute
-
Super-fast CDN:
Reads often already close to customer.
More than just a CDN:
HTML/AJAX/etc inspection for pre-fetch
Anticipate requests and get data to within low ms
Optimal data path to SmugMug
DNS latency reduction
$$$ but worth it. Get what you pay for.
Secret Weapon: Akamai
-
Screaming fast.
~1TB of data stored.
>96% hit rate
Contains MySQL row data, avoid SELECTs
Misc other data cached, but MySQL biggest win
Fall back on MySQL for cold data
Secret Weapon: memcached
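The read path on this slide (row data served from cache, MySQL only for cold data) is the usual cache-aside pattern. A minimal sketch, assuming plain dicts as stand-ins for memcached and MySQL; the class and method names are hypothetical, not SmugMug's code:

```python
# Cache-aside sketch: one dict stands in for memcached, another for MySQL.
# All names here are hypothetical, for illustration only.

class RowCache:
    def __init__(self, database):
        self.database = database   # stand-in for MySQL: {pkey: row}
        self.cache = {}            # stand-in for memcached
        self.hits = 0
        self.misses = 0

    def get_row(self, pkey):
        row = self.cache.get(pkey)
        if row is not None:
            self.hits += 1
            return row
        # Cold data: fall back on the database, then warm the cache.
        self.misses += 1
        row = self.database.get(pkey)
        if row is not None:
            self.cache[pkey] = row
        return row

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


db = {1: {"title": "sunset"}, 2: {"title": "harbor"}}
rows = RowCache(db)
rows.get_row(1)   # miss, loaded from the database
rows.get_row(1)   # hit
rows.get_row(1)   # hit
rows.get_row(2)   # miss
print(rows.hit_rate())
```

Tracking hits and misses alongside the lookup is what makes a figure like the >96% hit rate observable in the first place.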
-
Most important technology at SmugMug.
Super dependent on replication:
Performance
Reliability / High Availability
No MySQL data loss in >7 years.
No JOINs. (Or lots of 4.x+ features, either)
Vertically partitioned, not horizontally (no shards)
Secret Weapon: MySQL
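Vertical partitioning means each table lives whole on one database cluster, so the table name alone picks the cluster (sharding would instead pick by row key). A rough sketch of that routing; the table and cluster names are made up for illustration:

```python
# Vertical partitioning sketch: every row of a table lives on one cluster.
# Table and cluster names are hypothetical.

TABLE_TO_CLUSTER = {
    "users": "db-accounts",
    "sessions": "db-accounts",
    "photos": "db-photos",
    "comments": "db-community",
}

def cluster_for(table):
    """Return the cluster that owns every row of `table`."""
    try:
        return TABLE_TO_CLUSTER[table]
    except KeyError:
        raise ValueError(f"no cluster assigned for table {table!r}")

def share_cluster(*tables):
    """True if all tables sit on one cluster. When they do not, the
    application has to combine the rows itself instead of using a JOIN."""
    return len({cluster_for(t) for t in tables}) == 1

print(cluster_for("photos"), share_cluster("users", "photos"))
```

With tables spread this way, a no-JOIN, denormalized query style costs little, since cross-cluster queries would have to be assembled in application code anyway.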
-
Most important technology at SmugMug.
Huge thanks to Heikki, Oracle, Percona and Google!
Running 1.0.3+patches in production.
Big performance gains with recent releases.
Secret Weapon: InnoDB
-
Crazy concentration of talent under one roof.
Best MySQL dollars we've ever spent.
Helped us out of a major bind
Have you heard of the back_log mysqld setting?
Me neither. Hope you never do. Percona had.
Helped build, integrate, and test InnoDB patches.
Secret Weapon: Percona
-
We care about write latency above all.
Well, ok, maybe data integrity. ;)
Scaling reads easy: replication and memcached.
Replication needs to stay current (
-
Mostly SELECT pkey FROM table WHERE index;
On cache miss, SELECT * FROM table WHERE pkey;
UPDATEs/DELETEs mostly on single rows by pkey
Easy memcached expiration.
Easy slave-delay tracking.
Very denormalized.
No JOINs or complex SELECTs.
OLTP benchmark imperfect. Time for sysbench-web?
MySQL query details
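The query pattern above can be sketched end to end: an index lookup that returns only primary keys, a per-row fetch that goes to the database only on a cache miss, and a single-row update that expires exactly one cache key. This is an illustration under assumptions: sqlite3 stands in for MySQL, a dict for memcached, and the `photos` table and its columns are hypothetical.

```python
import sqlite3

# sqlite3 stands in for MySQL and a dict for memcached; the `photos`
# table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE photos (id INTEGER PRIMARY KEY, album_id INTEGER, title TEXT)"
)
conn.execute("CREATE INDEX photos_album ON photos (album_id)")
conn.executemany(
    "INSERT INTO photos (id, album_id, title) VALUES (?, ?, ?)",
    [(1, 10, "sunset"), (2, 10, "harbor"), (3, 11, "bridge")],
)

cache = {}

def photo_ids_in_album(album_id):
    # Step 1: SELECT pkey FROM table WHERE index
    rows = conn.execute(
        "SELECT id FROM photos WHERE album_id = ? ORDER BY id", (album_id,)
    )
    return [r["id"] for r in rows]

def photo(pkey):
    # Step 2: try the cache; on a miss, SELECT * FROM table WHERE pkey
    key = f"photos:{pkey}"
    if key not in cache:
        row = conn.execute("SELECT * FROM photos WHERE id = ?", (pkey,)).fetchone()
        if row is None:
            return None
        cache[key] = dict(row)
    return cache[key]

def rename_photo(pkey, title):
    # Single-row UPDATE by pkey, so exactly one cache key needs expiring.
    conn.execute("UPDATE photos SET title = ? WHERE id = ?", (title, pkey))
    cache.pop(f"photos:{pkey}", None)

album = [photo(pk) for pk in photo_ids_in_album(10)]
rename_photo(2, "harbor at dusk")
print(photo(2)["title"])
```

Because every cached entry is keyed by table and primary key, a write never has to guess which cached query results it invalidated.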
-
Better filesystem:
CentOS Linux shop (lots of expertise).
MySQL is storage intensive (iops, size, etc).
ext3 old and busted. fsck, well, sucks.
ext4 already old and busted. :(
Want good volume management.
Serialized writes (non-parallel). Ugh.
MySQL Issues: Filesystems
-
We run Linux.
ZFS doesn't run on Linux.
Crap.
The REAL Issue
-
Unknown state on crash:
Did *.info get written at commit?
Or is it *2 months* out of date?
Bringing TB+ slaves online quickly.
Backups using LVM/ZFS a pain.
Keeping up with master.
Single thread for replication SQL.
Master promotion kludgy.
MySQL Issues: Replication
-
Transactional replication patches:
Slave always in known state.
Either ok to bring back up or CHANGE MASTER.
Safe to take snapshots anytime, no effort.
Safe to use innodb_flush_log_at_trx_commit=2
InnoDB only. Stopgap. Global trx IDs better.
Using in pre-production. Production next week?
Replication solutions
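Why a transactional position leaves the slave "always in known state" can be shown with a toy model. The slides do not describe how the patches work internally, so this is only a sketch of the general idea, assuming the replication position is committed in the same transaction as the replicated change. sqlite3 stands in for an InnoDB slave, and the `replication_position` table is a hypothetical name.

```python
import sqlite3

# Sketch of the idea only: sqlite3 stands in for an InnoDB slave, and the
# `replication_position` table is hypothetical, not the patch's schema.
slave = sqlite3.connect(":memory:")
slave.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, title TEXT)")
slave.execute(
    "CREATE TABLE replication_position "
    "(id INTEGER PRIMARY KEY CHECK (id = 1), pos INTEGER)"
)
slave.execute("INSERT INTO replication_position VALUES (1, 0)")
slave.commit()

def apply_event(pos, sql, params, crash_before_commit=False):
    """Apply one replicated statement and record its position atomically."""
    try:
        slave.execute(sql, params)
        slave.execute(
            "UPDATE replication_position SET pos = ? WHERE id = 1", (pos,)
        )
        if crash_before_commit:
            raise RuntimeError("simulated crash")
        slave.commit()
    except RuntimeError:
        # After a crash, neither the row nor the new position survives.
        slave.rollback()

def current_position():
    return slave.execute("SELECT pos FROM replication_position").fetchone()[0]

apply_event(1, "INSERT INTO photos VALUES (?, ?)", (1, "sunset"))
apply_event(2, "INSERT INTO photos VALUES (?, ?)", (2, "harbor"),
            crash_before_commit=True)
# Position and data still agree, so resuming after current_position() is safe.
apply_event(2, "INSERT INTO photos VALUES (?, ?)", (2, "harbor"))
print(current_position())
```

With a separate *.info file, the data and the recorded position can disagree after a crash; here they cannot, which is also what makes an arbitrary snapshot usable as a new slave.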
-
Toro aka S7410.
NAS storage with a few twists.
2 x Quad-Core Opteron + 64GB RAM
100GB Readzilla SSD
2 x 18GB Writezilla SSD. 20K write iops.
22 x 1TB 7200rpm HDD
Clustered HA configuration.
Secret Weapon: Sushi
-
ZFS on Linux!
SSD is here!
SSD performance is cheap!
Consume via NFS, iSCSI, CIFS, HTTP, FTP, etc.
Massive flexibility - no more DAS.
Fishworks interface is a dream.
Analytics is a game changer.
Mmm, Toro tastes good.
-
Initial sticker shock - $80K?! $142K clustered?!
No one pays list price. Whew.
Startup Essentials. Double-whew.
Paradigm shift. Biggest whew!
DAS -> NAS
So much IO, in theory, can stack lots of clients.
In practice, can stack *lots* of clients.
We now have 5 clustered configs. :)
Sushi's quite reasonable
-
Crazy fast. 9.6K iops, 4.5K under 43us, 8K under 166us
Sushi served fast
-
Scalable. 15K 4k write iops w/16 threads.
Low latency. ~250us @ 3K iops, ~700us @ 10K
Sushi served fast
[Chart: fio write benchmark, 4K write iops (axis 0 to 20000) vs. threads (1, 2, 4, 8, 16, 32)]
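The latency figures quoted for the fio run (~250us @ 3K iops, ~700us @ 10K) can be related to concurrency with Little's law: average requests in flight = throughput x latency. A quick worked calculation; the function name is just for illustration:

```python
def outstanding_ios(iops, latency_us):
    """Little's law: average requests in flight = arrival rate x time in system."""
    return iops * latency_us / 1_000_000

print(outstanding_ios(3_000, 250))    # under one request in flight on average
print(outstanding_ios(10_000, 700))   # several requests in flight on average
```

Read this way, higher iops at these latencies needs more requests kept in flight, which fits the slide's pairing of peak write iops with a thread count.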
-
So fast, we're stacking like crazy.
5 different MySQL workloads on single clustered Toro.
8 slaves on single Toro.
Each used to have 15K disks + write cache.
Lots of excess io and space capacity still.
Compression for free (no client CPU usage)
Crazy fast
~1.5X ratio across TBs of InnoDB
Sushi today
-
Backups a breeze.
Automatic snapshots every n minutes / hours / days.
No need to LOCK / shutdown / STOP SLAVE / etc
Rollback anytime. Skip bad SQL statements.
New slave? Click snapshot. Click clone. Done.
Slaves share unchanged data on disk and in RAM.
Future bright: clone + de-dupe = insanely efficient.
Sushi today
-
DTrace on Linux!
Never had analytics on storage before.
Vendor used to say: "Um, we dunno. Buy more spindles?"
Now I know all.
Vendor now says: "What does Analytics say?"
Drill down on everything. Correlate anything.
God-like power.
Analyzing sushi
-
NFSv3 (rather than v4)
16KB record size in ZFS (InnoDB)
Mirrored (RAID1+0) disks w/striped Logzilla
MySQL concurrency bound - can't use all the I/O
If compressing, use LZJB.
In theory, can optimize InnoDB:
doublewrite = 0, checksums = 0. ZFS does these.
In practice, no big gain with our workload.
MySQL on Toro so far
-
Replication *.info files not synced over NFS
Found a slave with *2 month old* info files
Transactional replication to the rescue!
NFS locking and InnoDB
Warnings on the Net. No hard data.
Actively researching. What's the problem?
MySQL on Toro problems
-
10GbE for reduced latency?
Actively testing this.
Driver tuning required. Defaults for throughput.
Cards (Intel) & switches (Arista) cheap & fast
Less than $500/port.
Copper twinax SFP+ cables cheap. Optical XFP not.
$50 vs $1000+
Toro doesn't support SFP+ cards yet. :(
Even faster?
-
Everything runs better on Toro. :)
Revision control.
Stateless Linux mounts.
Email.
Developer home directories.
Built-in, automatic replication for multi-site backups.
Photo and video serving?
Kitchen sink on Toro
-
100% SSD.
Still too $$ for TB+ installs.
Even better InnoDB.
Community on fire. Oracle/MySQL accepting patches!
Multi-threaded replication.
Preview release is out. Yes!
New storage engines
PBXT, Falcon, Maria, oh my!
The future?
-
MySQL is a crown jewel.
Not a gateway drug to Oracle. Different customers.
Kill btrfs. GPL ZFS.
MySQL and InnoDB under one roof = opportunity.
OpenStorage is a game changer. Don't kill it.
Listen to your new communities.
I'm busy. I'm up here because this is important.
Oracle wishlist
-
Thanks!
Blog: http://blogs.smugmug.com/don
Twitter: DonMacAskill
Email: don@smugmug.com
Percona Conference: Upstairs :)