1,000,000 daily users and no cache (splash 2011)
DESCRIPTION
Online games pose a few interesting challenges for their backend: a single user generates one HTTP call every few seconds, and the balance between data reads and writes is close to 50/50, which makes a write-through cache or other common scaling approaches less effective. We started from a rather classic Ruby on Rails application and, as traffic grew, gradually changed it to meet the required performance. When small changes were no longer enough, we turned parts of our data persistence layer inside out, migrating from SQL to NoSQL without taking downtimes longer than a few minutes. Follow the problems we hit, how we diagnosed them, and how we got around limitations. See which tools we found useful and which other lessons we learned by running the system with a team of just two developers, without a sysadmin or operations team as support.
TRANSCRIPT
Who is that guy?
Jesper Richter-Reichhelm
Twitter: @jrirei
Head of Engineering
wooga
Berlin, Germany
wooga is #3 game developer on Facebook
Wooga has dedicated game teams
Coming soon
Flash client sends state changes to backend
Flash client Ruby backend
Social games need to scale quite a bit
400 million PIs / month
14 billion requests / month
100,000 DB operations / second
50,000 DB updates / second
no cache
A journey to 1,000,000 daily users
Start of the journey
6 weeks of pain
Paradise
Conclusion
October 2009: wooga’s first simulation game
Instead of PHP we used Ruby
Our database was MySQL
even user ids | odd user ids
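A minimal sketch of what the even/odd split could look like in Ruby. The slide only shows the two labels; the shard names and the `shard_for` helper are hypothetical.

```ruby
# Hypothetical shard names; the slide only says "even user ids" / "odd user ids".
USER_SHARDS = { 0 => :db_even, 1 => :db_odd }.freeze

# Returns the shard that holds all rows for the given user.
def shard_for(user_id)
  USER_SHARDS.fetch(user_id % USER_SHARDS.size)
end

shard_for(42) # even id, routed to :db_even
shard_for(7)  # odd id, routed to :db_odd
```

Because the shard depends only on the user id, every request for one user touches a single database.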
And we went into the cloud
Master-slave replication for DBs worked fine
<diagram: lb, 3 app servers, 2 dbs>
We added a few application servers over time
<diagram: lb, 9 app servers, 2 dbs>
250K daily users and no problems
<chart: daily users over time>
Life was good
Life was good and I went on a nice vacation
<picture: Jesper in slot canyon>
TO DO
Our bane: MySQL hiccups
<chart: MySQL hiccups>
A journey to 1,000,000 daily users
Start of the journey
6 weeks of pain
Paradise
Conclusion
SQL queries generated by RubyAMF gem
AMF responses to Flash client
Wrong config...
... so associated data was included, too
=> Easy to fix
More traffic using the same cluster
<diagram: lb, 9 app servers, 2 dbs>
Config tweaks brought us to 300K DAU
<chart: daily users over time>
Config fixes
ActiveRecord’s checks caused 20% extra DB load
Checking connection state
MySQL process list full of ‘status’ calls
=> Fixed by 1 line of code
I/O on MySQL masters was still the bottleneck
New Relic: 60% of all UPDATEs on ‘tiles’ table
Tiles are part of the core game loop
Core game loop
1) plant
2) wait
3) harvest
We started to shard on model, too
Adding new shards
1) Set up new masters as slaves of old ones
2) Start using new masters
3) Cut replication
4) Truncate
<diagram: old master, old slave, new master, new slave>
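Sharding by model on top of sharding by user id could be routed roughly as sketched below. This assumes the tiles databases are also split by even and odd user ids, which the slides suggest ("2 + 2") but do not state. All connection names and the `shard_for_model` helper are hypothetical.

```ruby
# Hypothetical connection names: the tiles model gets its own pair of masters,
# every other model stays on the original pair.
MODEL_SHARDS = {
  tiles:   %i[tiles_db_even tiles_db_odd],
  default: %i[db_even db_odd]
}.freeze

# Picks the shard group by model first, then the shard inside it by user id.
def shard_for_model(model, user_id)
  group = MODEL_SHARDS.fetch(model, MODEL_SHARDS[:default])
  group[user_id % group.size]
end

shard_for_model(:tiles, 3) # tiles of an odd user, routed to :tiles_db_odd
shard_for_model(:users, 4) # any other model of an even user, routed to :db_even
```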
4 DB masters and a few more servers
<diagram: lb, app servers, 2 tiles dbs, 2 dbs>
Sharding by model brought us to 400K DAU
<chart: daily users over time>
Shard by model
We improved our MySQL setup
RAID-0 of EBS volumes
Using XtraDB
Tweaking my.cnf
Sharding gem circumvented AR’s internal cache
ActiveRecord caches SQL queries...
... only in our development environment!
=> Fixed by 2 lines of code
I/O still was not fast enough
If 2 + 2 is not enough, ...
… perhaps 4 + 4 masters will do?
It’s no fun to handle 8+8 MySQL DBs
<diagram: lb, app servers, 4 tiles dbs, 4 dbs>
At 500K DAU we were at a dead end
<chart: daily users over time>
I/O remained the bottleneck for MySQL UPDATEs
Each DB master could do about 1000 DB writes/s.
That’s not enough!
Pick the right tool for the job!
Redis is fast but goes beyond simple key/value
Redis is a key-value store
Hashes, Sets, Sorted Sets, Lists
Atomic operations like set, get, increment
50,000 transactions/s on EC2
Writes are as fast as reads
Shelf tiles: An ideal candidate for using Redis
Shelf tiles:
{ plant1 => 184,
  plant2 => 141,
  plant3 => 130,
  plant4 => 112,
  … }
Redis Hash
HGETALL
HGET
HSET
HINCRBY
…
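To illustrate how the four hash commands fit the shelf tiles data, here is a small in-memory stand-in written in plain Ruby. It is not a Redis client: the `TinyHashStore` class and the `shelf_tiles:42` key layout are assumptions for illustration. It also keeps integers, whereas a real Redis returns field values as strings.

```ruby
# In-memory stand-in for the four hash commands named on the slide.
class TinyHashStore
  def initialize
    @data = Hash.new { |store, key| store[key] = {} }
  end

  def hset(key, field, value)
    @data[key][field] = value
  end

  def hget(key, field)
    @data[key][field]
  end

  def hgetall(key)
    @data[key].dup
  end

  # Adds `by` to the field (missing fields count as 0) and returns the new value.
  def hincrby(key, field, by)
    @data[key][field] = @data[key].fetch(field, 0) + by
  end
end

store = TinyHashStore.new
key = "shelf_tiles:42" # hypothetical key layout: one hash per user
store.hset(key, "plant1", 184)
store.hset(key, "plant2", 141)
store.hincrby(key, "plant1", -1) # change one field without touching the rest
store.hget(key, "plant1")        # => 183
store.hgetall(key)               # => {"plant1"=>183, "plant2"=>141}
```

The point of the hash layout is that a single field can be read or incremented without loading or rewriting the whole record.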
Migrate on the fly when accessing new model
Migrate on the fly - but only once
true if id could be added, else false
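The "only once" guard can be sketched with a set whose add operation reports whether the id was new, which is the behaviour the slide describes. The code below uses Ruby's `Set` as a stand-in for a shared set in the data store. The hashes standing in for the SQL table and the new store, and the `migrate_once` helper, are hypothetical.

```ruby
require "set"

# Stand-in for a shared set of already-migrated user ids.
class MigratedIds
  def initialize
    @ids = Set.new
  end

  # true if id could be added, else false
  def add(id)
    !@ids.add?(id).nil?
  end
end

MIGRATED = MigratedIds.new
SQL_TILES = { 42 => { "plant1" => 184 } } # hypothetical legacy rows
REDIS_TILES = {}                          # hypothetical new store

# Copies a user's tiles on first access; later calls are no-ops.
def migrate_once(user_id)
  return false unless MIGRATED.add(user_id)

  REDIS_TILES[user_id] = SQL_TILES.fetch(user_id, {}).dup
  true
end

migrate_once(42) # first access: data is copied
migrate_once(42) # second access: skipped
```

In this sketch the id is claimed before the copy, so two concurrent requests cannot both migrate the same user. A production version would also need to handle a failure between the two steps.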
Typical migration throughput over 3 days
Migrate on the fly - and clean up later
1. Let migration run until everything cools down
2. Migrate the rest manually
3. Remove migration code
4. Wait until no fallback necessary
5. Remove SQL table
A journey to 1,000,000 daily users
Start of the journey
6 weeks of pain
Paradise (or not?)
Conclusion
Again: Tiles are part of the core game loop
Core game loop
1) plant
2) wait
3) harvest
Size matters for migrations
Migration check overload
Migration only on startup
Overlooked an edge case
Only migrate 1% of users
Continue if everything is ok
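One simple way to limit a migration to a percentage of users is a deterministic bucket per user id, sketched below. The slides do not say how the 1% was actually selected, so the modulo rule, the `MIGRATION_PERCENT` constant and the `migrate_user?` helper are assumptions.

```ruby
MIGRATION_PERCENT = 1 # raise step by step while everything stays ok

# Deterministic bucket 0..99 per user, so a user stays in or out between requests.
def migrate_user?(user_id, percent = MIGRATION_PERCENT)
  user_id % 100 < percent
end

# With ids 1..10_000 exactly 100 users (1%) are selected.
selected = (1..10_000).count { |id| migrate_user?(id) }
```

Raising the percentage only adds users; nobody who was already selected drops out again.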
In-memory DBs don’t like to dump to disk
Dumping to disk
SAVE is blocking
BGSAVE needs free RAM
Latency increase by 100%
=> BGSAVE on slaves every 15 minutes
Redis replication starts with a BGSAVE
BGSAVE on master
Slave imports dumped file
=> No RAM means no new slaves
Redis had a memory fragmentation problem
24 GB
44 GB
in 8 days
Redis had a memory fragmentation problem
24 GB
38 GB
in 3 days
If MySQL is a truck
Fast enough
Disk based
Robust
If MySQL is a truck, Redis is a race car
Super fast
RAM based
Fragile
Big and static data in MySQL, rest goes to Redis
60 GB data
50% writes
256 GB data
10% writes
http://www.flickr.com/photos/erix/245657047/
Lots of boxes, but automation helps a lot!
<diagram: 2 lbs, app servers, 5 redis, 5 dbs>
We reached 1 million daily users!
<chart: daily users over time>
1,000,000 - Big party!
We started archiving inactive users
<chart: daily users over time>
50% DB reduction
We even survived a complete data center loss
<chart: daily users over time>
EBS no more!
We improved our MySQL schema on-the-fly
<chart: daily users over time>
30% DB reduction
Will we reach 2 million daily users?
<chart: daily users over time>
A journey to 1,000,000 daily users
Start of the journey
6 weeks of pain
Paradise (or not?)
Conclusion
You do not know the future
Plan ahead
Learn
Adapt
Evolution every week
EVOLUTION of software
<chart: daily users over time>
Evolution every week, Revolution if necessary
REVOLUTION of software
<chart: daily users over time>
Each new game is a revolution
Coming soon
Works for teams and for companies
Thank you!
Jesper Richter-Reichhelm
@jrirei
slideshare.net/wooga
wooga.com/jobs