Crossing the Production Barrier: Development at Scale
TRANSCRIPT
![Page 2: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/2.jpg)
The world’s handmade marketplace
platform for people to sell homemade goods, crafts, and vintage goods
![Page 3: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/3.jpg)
![Page 7: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/7.jpg)
1.5B+ page views / mo.
895MM sales in 2012
42MM unique visitors/mo.
850K shops / 200 countries
![Page 8: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/8.jpg)
big cluster, 20 shards and adding 5 more
![Page 14: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/14.jpg)
60K+ queries/sec avg
4TB InnoDB buffer pool
20TB+ data stored
99.99% queries under 1ms
~1.2Gbps outbound (plain text)
over 40% increase from last year in QPS (25K last year)
additional 30K moving over from Postgres
1/3 RAM not dedicated to the pool (OS, disk, network buffers, etc)
![Page 15: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/15.jpg)
50+ MySQL servers / 800 CPUs
Server Spec
HP DL 380 G7
96GB RAM
24 Core
16 spindles / 1TB RAID 10
16 x 146GB
![Page 16: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/16.jpg)
The Problem
been around since ’05, hit this a few years ago, every big company probably has this issue
![Page 17: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/17.jpg)
DATA
sync prod to dev, until prod data gets too big
http://www.flickr.com/photos/uwwresnet/6280880034/sizes/l/in/photostream/
![Page 20: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/20.jpg)
Some Approaches
subsets of data
generated data
subsets have to end somewhere (a shop has favorites that are connected to people, connected to shops, etc)
generated data can be time consuming to fake
![Page 21: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/21.jpg)
But...
but there is a problem with both of those approaches
![Page 22: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/22.jpg)
Edge Cases
what about testing edge cases, difficult to diagnose bugs?
hard to model the same data set that produced a user facing bug
http://www.flickr.com/photos/sovietuk/141381675/sizes/l/in/photostream/
![Page 23: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/23.jpg)
Perspective
another issue is testing problems at scale, complex and large gobs of data
real social network ecosystem can be difficult to generate (favorites, follows) (activity feed, “similar items” search gives better results)
http://www.flickr.com/photos/donsolo/2136923757/sizes/l/in/photostream/
![Page 24: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/24.jpg)
Prod → Dev ?
what most people do before data gets too big
almost 2 days to sync 20TB over a 1Gbps link, 5 hrs over 10Gbps
bringing the prod dataset to dev was expensive: hardware/maintenance, keeping parity with prod, and applying schema changes would take at least as long
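A quick back-of-the-envelope check of those sync times, as a sketch only. It assumes decimal units, the full line rate, and no protocol overhead, so real transfers would be slower; `transfer_hours` is a made-up helper name.

```python
def transfer_hours(data_terabytes: float, link_gbps: float) -> float:
    """Hours needed to move data_terabytes over a link_gbps link at full line rate."""
    bits = data_terabytes * 1e12 * 8       # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9)     # link speed in bits per second
    return seconds / 3600


for gbps in (1, 10):
    hours = transfer_hours(20, gbps)
    print(f"{gbps:>2} Gbps: {hours:.1f} h ({hours / 24:.2f} days)")
```

At 1Gbps this comes out to roughly 44 hours, which matches "almost 2 days"; at 10Gbps it is a bit under 4.5 hours before any overhead.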
![Page 26: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/26.jpg)
Use Production (sometimes)
so we did what we saw as the last resort - used production
not for greenfield development, more for mature features and diagnosing bugs
we still have a dev database but the data is sparse and unreliable
![Page 27: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/27.jpg)
goes without saying this can be dangerous
also difficult to do right, we’ve been working on this for a year
http://www.flickr.com/photos/stuckincustoms/432361985/sizes/l/in/photostream/
![Page 28: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/28.jpg)
Approach
two big things: cultural and technical
![Page 29: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/29.jpg)
Solve Culture Issues First
part of figuring this out was exhausting all other options
getting buy-in from major stakeholders
![Page 30: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/30.jpg)
Two “Simple” Technical Issues
![Page 31: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/31.jpg)
step 0:
failure recovery
![Page 32: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/32.jpg)
step 1:
make it safe
how to have test data in production, prevent stupid mistakes
![Page 36: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/36.jpg)
phased rollout
read-only
r/w dev shard only
full r/w
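The three rollout phases can be pictured as a gate on writes coming from dev. This is a minimal sketch, not the mechanism used in the talk; `Phase`, `write_allowed`, and the shard name `"dev"` are hypothetical names chosen for illustration.

```python
from enum import Enum


class Phase(Enum):
    READ_ONLY = 1       # phase 1: no writes from dev at all
    RW_DEV_SHARD = 2    # phase 2: writes allowed, but only to the dev shard
    FULL_RW = 3         # phase 3: writes allowed to any shard


def write_allowed(phase: Phase, target_shard: str, dev_shard: str = "dev") -> bool:
    """Return True if a write from a dev machine may go to target_shard."""
    if phase is Phase.READ_ONLY:
        return False
    if phase is Phase.RW_DEV_SHARD:
        return target_shard == dev_shard
    return True


print(write_allowed(Phase.RW_DEV_SHARD, "shard_1"))  # write to a prod shard in phase 2
```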
![Page 37: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/37.jpg)
How?
how did we do it?
![Page 38: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/38.jpg)
Quick Overview
high level view
http://www.flickr.com/photos/h-k-d/7852444560/sizes/o/in/photostream/
![Page 39: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/39.jpg)
tickets index
shard 1 shard 2 shard N
![Page 40: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/40.jpg)
tickets: Unique IDs
![Page 41: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/41.jpg)
index: Shard Lookup
![Page 42: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/42.jpg)
shard 1, shard 2, ... shard N: Store/Retrieve Data
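The roles of the three components (tickets for unique IDs, index for shard lookup, shards for the data itself) can be sketched with an in-memory toy. The class and method names are hypothetical, and the slides do not say how a shard is picked for a new object, so here the caller passes it in.

```python
import itertools


class TicketServer:
    """Hands out unique, increasing IDs."""

    def __init__(self):
        self._counter = itertools.count(1)

    def next_id(self) -> int:
        return next(self._counter)


class ShardedStore:
    def __init__(self, shard_names):
        self.tickets = TicketServer()
        self.index = {}                                    # object id -> shard name
        self.shards = {name: {} for name in shard_names}   # shard name -> rows

    def create(self, row: dict, shard: str) -> int:
        object_id = self.tickets.next_id()     # 1. unique ID from tickets
        self.index[object_id] = shard          # 2. record the shard in the index
        self.shards[shard][object_id] = row    # 3. store the data on that shard
        return object_id

    def fetch(self, object_id: int) -> dict:
        shard = self.index[object_id]          # shard lookup
        return self.shards[shard][object_id]   # retrieve from that shard


store = ShardedStore(["shard_1", "shard_2"])
shop_id = store.create({"name": "example shop"}, "shard_2")
print(shop_id, store.index[shop_id], store.fetch(shop_id))
```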
![Page 43: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/43.jpg)
introducing....
dev shard
a shard used for the initial writes of data created from the dev environment
![Page 45: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/45.jpg)
tickets index
shard 1 shard 2 shard N
DEV shard
![Page 46: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/46.jpg)
shard 1 shard 2 shard N
DEV shard
www.etsy.com www.goulah.vm
Initial Writes
![Page 49: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/49.jpg)
mysql proxy
![Page 50: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/50.jpg)
proxy hits all of the shards/index/tickets
http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html
![Page 53: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/53.jpg)
dangerous/unnecessary queries
(DEV) etsy_rw@jgoulah [test]> select * from fred_test;
ERROR 9001 (E9001): Selects from tables must have where clauses
-- filter dangerous queries - (queries without a WHERE)
-- remove unnecessary queries - (instead of DELETE, have a flag, ALTER statements don’t run from dev)
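A rough sketch of that kind of filtering, written in Python for illustration rather than as the actual proxy script. `check_query` and `RejectedQuery` are hypothetical names; only the SELECT error text comes from the slide, the other messages are invented. The keyword matching is deliberately naive (a WHERE inside a string literal or comment would fool it).

```python
import re


class RejectedQuery(Exception):
    pass


def check_query(sql: str) -> None:
    """Raise RejectedQuery for statements a dev proxy should refuse."""
    statement = sql.strip().rstrip(";").strip()
    keyword = statement.split(None, 1)[0].upper() if statement else ""
    has_where = re.search(r"\bWHERE\b", statement, re.IGNORECASE) is not None
    has_from = re.search(r"\bFROM\b", statement, re.IGNORECASE) is not None

    if keyword == "SELECT" and has_from and not has_where:
        raise RejectedQuery("Selects from tables must have where clauses")
    if keyword in ("UPDATE", "DELETE") and not has_where:
        raise RejectedQuery(f"{keyword} without a WHERE clause")  # invented message
    if keyword == "ALTER":
        raise RejectedQuery("ALTER statements don't run from dev")  # invented message


for query in ("select * from fred_test;", "SELECT id FROM fred_test WHERE id = 1"):
    try:
        check_query(query)
        print("allowed: ", query)
    except RejectedQuery as error:
        print("rejected:", query, "->", error)
```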
![Page 54: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/54.jpg)
known in/egress funnel
we know where all of the queries from dev originate from
http://www.flickr.com/photos/medevac71/4875526920/sizes/l/in/photostream/
![Page 55: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/55.jpg)
explicitly enabled
% dev_proxy on
Dev-Proxy config is now ON. Use 'dev_proxy off' to turn it off.
Not on all the time
![Page 56: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/56.jpg)
visual notifications
![Page 57: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/57.jpg)
notify engineers they are using the proxy, this is read-only mode
![Page 58: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/58.jpg)
read/write mode
![Page 59: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/59.jpg)
read-write mode, needed for login and other things that write data
![Page 60: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/60.jpg)
stealth data
hiding data from users (favorites go on dev and prod shard, making sure test user/shops don’t show up)
http://www.flickr.com/photos/davidyuweb/8063097077/sizes/h/in/photostream/
![Page 61: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/61.jpg)
Security
http://www.flickr.com/photos/sidelong/3878741556/sizes/l/in/photostream/
![Page 63: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/63.jpg)
PCI
off-limits
token exchange only, locked down for most people
![Page 64: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/64.jpg)
anomaly detection
another part of our security setup is detection
![Page 65: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/65.jpg)
logging
the basis of anomaly detection is log collection
![Page 73: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/73.jpg)
2013-04-22 18:05:43 485370821 devproxy --
/* DEVPROXY source=10.101.194.19:40198
uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361
[htSp8458VmHlC] [etsy_index_B] [browse.php] */
SELECT id FROM table;
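date: 2013-04-22 18:05:43
thread id: 485370821
source ip: 10.101.194.19:40198
unique id generated by proxy: c309e8db-ca32-4171-9c4a-6c37d9dd3361
app request id: htSp8458VmHlC
dest. shard: etsy_index_B
script: browse.php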
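For log collection to feed anomaly detection, entries like the one above need to be split into fields. The following is a sketch of one way to do that with a regular expression; the pattern and the field names in `LOG_PATTERN` are assumptions based only on the single sample entry shown on the slide.

```python
import re

# Assumed layout, inferred from the one sample entry on the slide.
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<thread_id>\d+)\s+devproxy\s+--\s+"
    r"/\*\s*DEVPROXY\s+source=(?P<source>\S+)\s+"
    r"uuid=(?P<uuid>\S+)\s+"
    r"\[(?P<request_id>[^\]]+)\]\s+\[(?P<shard>[^\]]+)\]\s+\[(?P<script>[^\]]+)\]\s*\*/\s*"
    r"(?P<query>.*)",
    re.DOTALL,
)


def parse_entry(entry: str) -> dict:
    """Split one devproxy log entry into its named fields."""
    match = LOG_PATTERN.match(entry.strip())
    if match is None:
        raise ValueError("not a devproxy log entry")
    fields = match.groupdict()
    fields["query"] = fields["query"].strip()
    return fields


sample = """2013-04-22 18:05:43 485370821 devproxy --
/* DEVPROXY source=10.101.194.19:40198
uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361
[htSp8458VmHlC] [etsy_index_B] [browse.php] */
SELECT id FROM table;"""

entry = parse_entry(sample)
print(entry["source"], entry["shard"], entry["script"])
```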
![Page 74: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/74.jpg)
![Page 75: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/75.jpg)
login-as
(read only, logged w/ reason for access)
![Page 76: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/76.jpg)
![Page 77: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/77.jpg)
![Page 78: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/78.jpg)
reason is recorded and reviewed
![Page 79: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/79.jpg)
Recovery
![Page 83: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/83.jpg)
sources of restore data
Hadoop
Backups
Delayed Slaves
![Page 84: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/84.jpg)
Delayed Slaves
pt-slave-delay watches a slave and starts and stops its replication SQL thread as necessary to hold it at least as far behind the master as you request
http://www.flickr.com/photos/xploded/141295823/sizes/o/in/photostream/
![Page 88: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/88.jpg)
Delayed Slaves
4 hour delay behind master
produce row based binary logs
allow for quick recovery
role of the delayed slave
also source of BCP (business continuity planning - prevention and recovery of threats)
![Page 89: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/89.jpg)
pt-slave-delay --daemonize \
  --pid /var/run/pt-slave-delay.pid --log /var/log/pt-slave-delay.log \
  --delay 4h --interval 1m --nocontinue
last 3 options most important:
--delay 4h - stay 4 hours behind the master
--interval 1m - how frequently it should check whether the slave should be started or stopped
--nocontinue - don’t continue replication normally on exit
user/pass eliminated for brevity
![Page 92: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/92.jpg)
R/W R/W
Slave
Shard Pair
pt-slave-delay
row based binlogs
![Page 93: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/93.jpg)
R/W R/W
Slave
Shard Pair
Parse/Transform
HDFS
Vertica
in addition, can use slaves to send data to other stores for offline queries
1) parse each binlog file to generate a sequence file of row changes
2) apply the row changes to a previous set for the latest version
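Step 2, applying row changes to a previous set to get the latest version, can be sketched as follows. The change-record shape (`op`, `pk`, `row`) is a hypothetical stand-in for whatever the binlog parsing step emits; the slides do not describe the real format.

```python
def apply_row_changes(snapshot: dict, changes: list) -> dict:
    """Apply an ordered list of row changes to a snapshot keyed by primary key."""
    latest = dict(snapshot)                  # leave the previous snapshot untouched
    for change in changes:
        op, key = change["op"], change["pk"]
        if op in ("insert", "update"):
            # merge the changed columns over whatever was there before
            latest[key] = {**latest.get(key, {}), **change["row"]}
        elif op == "delete":
            latest.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return latest


previous = {1: {"name": "old shop", "state": "active"}}
changes = [
    {"op": "update", "pk": 1, "row": {"name": "new shop"}},
    {"op": "insert", "pk": 2, "row": {"name": "second shop", "state": "active"}},
]
print(apply_row_changes(previous, changes))
```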
![Page 94: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/94.jpg)
something bad happens...
bad query is run (bad update, etc)
http://www.flickr.com/photos/focalintent/1332072795/sizes/o/in/photostream/
![Page 98: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/98.jpg)
A  B  Slave
Before Restoration....
1) stop delayed slave replication
2) pull side A
3) stop master-master replication
master.info should be pointing to the right place
step 2 could be flipping physical box (for faster recovery such as index servers)
![Page 99: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/99.jpg)
> SHOW SLAVE STATUS
Relay_Log_File: dbslave-relay.007178
Relay_Log_Pos: 8666654
on delayed slave
get the relay position
![Page 100: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/100.jpg)
mysql> show relaylog events in "dbslave-relay.007178" from 8666654 limit 1\G
*************************** 1. row *******************
   Log_name: dbslave-relay.007178
        Pos: 8666654
 Event_type: Query
  Server_id: 1016572
End_log_pos: 8666565
       Info: use `etsy_shard`; /* [CVmkWxhD7gsatX8hLbkDoHk29iKo] [etsy_shard_001_B] [/your/activity/index.php] */ UPDATE `news_feed_stats` SET `time_last_viewed` = 1366406780, `update_time` = 1366406780 WHERE `owner_id` = 30793071 AND `owner_type_id` = 2 AND `feed_type` = 'owner'
2 rows in set (0.00 sec)
on delayed slave
show relaylog events will show statements from the relay log
pass the relay log and the position to start from
![Page 101: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/101.jpg)
filter bad queries
cycle through all the logs, analyze Query events
rotate events - next log file
last relay log points to binlog master (server_id is the master’s, binlog coord matches master_log_file/pos)
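Cycling through the events to find the bad query could look roughly like this. It is a sketch that assumes the events have already been fetched, in order, as dictionaries keyed by the `SHOW RELAYLOG EVENTS` column names; `first_bad_event` and the `is_bad` predicate are hypothetical.

```python
def first_bad_event(events, is_bad):
    """Return (log_name, pos) of the first Query event flagged by is_bad, else None."""
    for event in events:
        if event["Event_type"] != "Query":
            continue                      # Rotate and other event types are skipped here
        if is_bad(event["Info"]):
            return event["Log_name"], event["Pos"]
    return None


events = [
    {"Log_name": "relay.000001", "Pos": 4, "Event_type": "Query",
     "Info": "UPDATE t SET a = 1 WHERE id = 7"},
    {"Log_name": "relay.000001", "Pos": 120, "Event_type": "Rotate",
     "Info": "relay.000002;pos=4"},
    {"Log_name": "relay.000002", "Pos": 4, "Event_type": "Query",
     "Info": "UPDATE t SET a = 0"},
]
print(first_bad_event(events, lambda info: "WHERE" not in info.upper()))
```

Everything before the returned position is safe to apply; the flagged statement itself is what has to be kept off the restored data.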
http://www.flickr.com/photos/chriswaits/6607823843/sizes/l/in/photostream/
![Page 106: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/106.jpg)
After Delayed Slave Data Is Restored....
1) stop mysql on A and slave
2) copy data files to A
3) restart B to A replication, let A catch up to B
4) restart A to B replication, put A back in, then pull B
(diagram: A, B, Slave)
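The four steps can be written down as an ordered dry-run checklist, which is easier to review than a diagram. This is only a sketch: the function name `restore_plan` and the host arguments are hypothetical, the action strings are descriptions rather than real commands, and placing the master.info check between the copy and the replication restart is an assumption.

```python
from typing import List, Tuple


def restore_plan(a: str, b: str, slave: str) -> List[Tuple[str, str]]:
    """Return (host, action) pairs in the order the slide lists the steps."""
    return [
        # 1) stop mysql on A and slave
        (a, "stop mysql"),
        (slave, "stop mysql"),
        # 2) copy data files to A
        (a, f"copy data files from {slave}"),
        # assumed position for the check mentioned in the notes
        (a, "check that master.info points to the right place"),
        # 3) restart B to A replication, let A catch up to B
        (a, f"restart replication from {b} and wait until caught up"),
        # 4) restart A to B replication, put A back in, then pull B
        (b, f"restart replication from {a}"),
        (a, "put back in"),
        (b, "pull"),
    ]


if __name__ == "__main__":
    for host, action in restore_plan("A", "B", "slave"):
        print(f"{host}: {action}")
```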
![Page 107: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/107.jpg)
Other Forms of Recovery
Migrate Single Object (user/shop/etc)
Hadoop Deltas
Backup + Binlogs
migrate object from delayed slave (similar to shard migration)
can generate deltas from hadoop
if the delayed slave has already “played” the bad data, go from last night’s backup (slower)
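One way to read the notes above is as a small decision rule: the delayed slave is only useful while it has not yet replayed the bad write. The sketch below assumes the two timestamps are known; the function name, parameters and return strings are hypothetical, and the Hadoop deltas option is left out because the notes do not say when it is preferred.

```python
from datetime import datetime


def choose_recovery(
    bad_write_at: datetime,
    delayed_slave_applied_up_to: datetime,
    single_object: bool,
) -> str:
    """Pick a recovery path from the options listed on the slide."""
    if delayed_slave_applied_up_to >= bad_write_at:
        # The delayed slave has already "played" the bad data.
        return "backup + binlogs"
    if single_object:
        # Only one user/shop/etc. is affected.
        return "migrate object from delayed slave"
    return "restore from delayed slave"
```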
![Page 108: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/108.jpg)
Use Cases
what are some use cases?
http://www.flickr.com/photos/seatbelt67/502255276/sizes/o/in/photostream/
![Page 109: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/109.jpg)
user reports a bug...
when a user files a bug, I can trace the code for the exact page they're on right from my dev machine
![Page 110: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/110.jpg)
testing “dry” writes
testing how the application runs a “dry” write: in r/o mode, an exception is thrown with the exact query it would have attempted to run, the values it tried to use, etc.
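A rough illustration of the “dry” write idea, using Python's built-in sqlite3 module in place of the real database layer. The `ReadOnlyConnection` wrapper and `DryWriteError` exception are hypothetical names, and detecting writes by their leading SQL verb is a simplifying assumption, not the mechanism described in the talk.

```python
import sqlite3


class DryWriteError(Exception):
    """Raised in read-only mode, carrying the write that was attempted."""

    def __init__(self, sql, params):
        super().__init__(f"read-only mode: would have run {sql!r} with {params!r}")
        self.sql = sql
        self.params = params


class ReadOnlyConnection:
    """Pass reads through to the wrapped connection and refuse writes."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, conn):
        self._conn = conn

    def execute(self, sql, params=()):
        if sql.lstrip().upper().startswith(self.WRITE_VERBS):
            # Report the exact query and values instead of running them.
            raise DryWriteError(sql, tuple(params))
        return self._conn.execute(sql, params)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE listings (id INTEGER, title TEXT)")
    ro = ReadOnlyConnection(conn)
    try:
        ro.execute("INSERT INTO listings VALUES (?, ?)", (1, "mug"))
    except DryWriteError as err:
        print(err)
```

The exception keeps the query and its values as attributes, so a test can assert on what the application tried to write without anything reaching the database.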
![Page 111: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/111.jpg)
search ads campaign consistency
starting campaigns and maintaining consistency for the entire ad system is nearly impossible in dev
search ads data is stored in more than a dozen DB tables, and state changes are driven by a combination of browsers triggering ads, sellers managing their campaigns, and a slew of crons running anywhere from once every 5 minutes to once a month
e.g. to test pausing campaigns that run out of money mid-day, we can pull large numbers of campaigns from prod and operate on those to verify that the data will still be consistent
![Page 112: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/112.jpg)
google product listing ads
GPLA is where we syndicate our listings to Google to be used in Google product search ads
we can test edge cases in GPLA syndication where it would be difficult to recreate the state in dev
![Page 113: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/113.jpg)
testing prototypes
features like similar items search give better results in production because of the amount of data; this allowed us to test the quality of the listings a prototype was displaying
![Page 114: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/114.jpg)
performance testing
need a real data set to test pages like treasury search with lots of threads/avatars/etc
the dev data is too sparse: xhprof traces don’t mean anything, and missing avatars change the perf characteristics
![Page 115: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/115.jpg)
hadoop generated datasets
dataset produced from hadoop (recommendations for users, or statistics about usage)
since hadoop runs on prod data, the output is for prod users/listings/shops, so we have to check it against prod
syncing it to dev would fill the dev dbs, and the data wouldn’t line up (b/c it’s prod data)
![Page 116: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/116.jpg)
browse slices
browse slices have complex population logic, so it’s easier to test an experiment against prod data
![Page 117: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/117.jpg)
not enough listings to populate the narrower subcategories, and it just takes too long
![Page 118: Crossing the Production Barrier: Development at Scale](https://reader034.vdocument.in/reader034/viewer/2022052617/54543634b1af9f90228b49d9/html5/thumbnails/118.jpg)
Thank You
etsy.com/jobs
We’re hiring