DevOpsDays Silicon Valley 2014 - The Game of Operations
DESCRIPTION
Operating online games is fun and challenging. Games are some of the spikiest workloads around, and real-time really means *real-time*. Randy shares many of the DevOps techniques his team has put into practice at KIXEYE: Cloud infrastructure, Service teams, and DevOps Culture. He talks about elastic workloads, micro-services, configuration automation, and a common service "chassis". He further discusses the organizational and technical disciplines of team autonomy, internal vendor-customer relationships, and, of course, "you build it, you run it"!
TRANSCRIPT
The Game of Operations and
The Operation of Games
Randy Shoup @randyshoup
linkedin.com/in/randyshoup
DevOpsDays Silicon Valley, June 28 2014
Background
CTO at KIXEYE
• Real-time strategy games for web and mobile
Director of Engineering for Google App Engine
• World’s largest Platform-as-a-Service
Chief Engineer at eBay
• Multiple generations of eBay’s real-time search infrastructure
1973: Xerox PARC and SuperPaint
en.wikipedia.org/wiki/SuperPaint
www.computerhistory.org/collections/catalog/X1001.89B
Real-Time Strategy Games are …
• Real-time
• Spiky
• Computationally-intensive
• Constantly evolving
• Constantly pushing boundaries
Technically and operationally demanding
Operating Games: Goals
Player Fun
• If players aren’t playing, we don’t have a business
• If players aren’t having fun, we don’t have a business for long
• Fun includes game mechanics, feature set, uptime, performance
Developer Productivity and Satisfaction
• We are a vendor; the studios are our customers
• Must be *strictly better* than the alternatives of build, buy, borrow
Cost Efficiency
• More output for less
The Game of Operations
Cloud
• All studios and services moving to AWS
• Strong focus on automation
Services
• Small, focused teams
• Clean, well-defined interface to customers
DevOps Culture
• One team across development and ops
The Game of Operations
Cloud
Services
DevOps Culture
Why Cloud? (The Obvious)
Provisioning Speed
• Minutes, not weeks
• Autoscaling in response to load
Near-Infinite Capacity
• No need to predict and plan for growth
• No need to defensively overprovision
Pay For What You Use
• No “utilization risk” from owning / renting
• If it’s not in use, spin it down
Why Cloud? (The Less Obvious)
Instance Shaping
• Instance shapes to fit most parts of the solution space (compute-intensive, IO-intensive, etc.)
• If one shape does not fit, try another
Service Quality
• Amazon and Google know how to run data centers
• Battle-tested and highly automated
• World-class networking, both cluster fabric and external peering
Why Cloud? (Fundamental Forces)
Economics
• Nearly impossible to beat Google / Amazon buying power or operating efficiencies
• 2010s in computing are like 1910s in electric power
Developer Adoption
• It Just Works™
• Makes it easy to fall in love with infrastructure
“Soon it will be just as common to run your own data center as it is to run your own electric power generation”
-- me
Autoscaling
Games are very spiky
• Very unpredictable
• Huge variability between peak and trough
Hits are self-reinforcing
Automation Work at KIXEYE
Resilient Clients
• Clients back off in response to latency
• Clients continue gameplay despite network disruption
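One way a client might back off in response to latency is sketched below. This is an illustration only, not KIXEYE's client code: the class name `LatencyBackoff`, the latency budget, and the delay limits are all assumptions. Each consecutive slow response doubles the ceiling on the pause before the next request, random jitter spreads clients out, and a fast response resets the pause to zero.

```python
import random


class LatencyBackoff:
    """Hypothetical client-side helper: pause longer while the service is slow."""

    def __init__(self, latency_budget=0.25, base_delay=0.1, max_delay=30.0):
        self.latency_budget = latency_budget  # seconds considered "healthy" (assumed)
        self.base_delay = base_delay          # first pause ceiling, in seconds (assumed)
        self.max_delay = max_delay            # upper bound on any pause (assumed)
        self.slow_streak = 0                  # consecutive slow responses seen

    def record(self, observed_latency):
        """Update the streak from the latency of the latest response."""
        if observed_latency > self.latency_budget:
            self.slow_streak += 1
        else:
            self.slow_streak = 0

    def delay(self, rng=random.random):
        """Return how long to wait before the next request, in seconds."""
        if self.slow_streak == 0:
            return 0.0
        ceiling = min(self.max_delay,
                      self.base_delay * 2 ** (self.slow_streak - 1))
        # Jitter: pick a random point below the ceiling so that many
        # clients do not retry in lockstep.
        return ceiling * rng()
```

A game client would call `record()` after each response and sleep for `delay()` before the next request, continuing local gameplay in the meantime.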
Elastic Services
• Services grow / shrink based on load
• Service Cluster == AWS Auto Scale Group
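The grow / shrink decision can be sketched as a pure sizing function. In practice an AWS Auto Scaling group makes this decision through its scaling policies; the function below only illustrates the arithmetic, and its name, parameters, and default values are assumptions rather than anything from the talk.

```python
def desired_instances(load_rps, per_instance_rps, target_pct=60,
                      min_size=2, max_size=100):
    """Hypothetical sizing rule for an elastic service cluster.

    load_rps:         current requests per second across the cluster
    per_instance_rps: requests per second one instance handles at 100%
    target_pct:       utilization each instance should run at (assumed 60%)
    """
    usable_rps = per_instance_rps * target_pct
    # Integer ceiling division: smallest instance count that keeps
    # every instance at or below the target utilization.
    needed = -(-(load_rps * 100) // usable_rps)
    # Clamp to the group's bounds, as an Auto Scaling group does.
    return max(min_size, min(max_size, needed))
```

With these assumed defaults, a quiet period falls back to the minimum size and a spike is capped at the maximum, which is where the cost and capacity limits of the group are enforced.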
Automation Work at KIXEYE
Build / Deploy Pipeline
• One button
• Puppet -> Packer -> AMI -> Asgard
• Zero-downtime red-black deployment
• Futures: canarying, auto-rollback
Manageability
• Puppet for configuration management
• Flume -> ElasticSearch / Kibana for logging
• Shinken -> PagerDuty for monitoring and alerting
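The red-black deployment step in the pipeline can be illustrated with a small in-memory model. This is a sketch of the general technique, not Asgard's API: `Cluster`, `Router`, `red_black_deploy`, and `rollback` are hypothetical names, and the health check is passed in as a plain function. The new cluster is launched next to the live one, traffic moves only after it passes the health check, and the old cluster is kept on standby so a rollback is a single swap.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Cluster:
    version: str
    size: int = 0  # number of running instances


@dataclass
class Router:
    live: Optional[Cluster] = None     # cluster currently taking traffic
    standby: Optional[Cluster] = None  # previous cluster, kept for rollback


def red_black_deploy(router: Router, new: Cluster, size: int,
                     is_healthy: Callable[[Cluster], bool]) -> bool:
    """Bring up `new` beside the live cluster, then switch traffic to it."""
    new.size = size              # launch alongside; live cluster keeps serving
    if not is_healthy(new):
        new.size = 0             # tear down; users never saw the new version
        return False
    # Switch traffic in one step and keep the old cluster for rollback.
    router.standby, router.live = router.live, new
    return True


def rollback(router: Router) -> bool:
    """Send traffic back to the previous cluster, if one is on standby."""
    if router.standby is None:
        return False
    router.live, router.standby = router.standby, router.live
    return True
```

Canarying and auto-rollback, listed as futures on the slide, would extend this by shifting a fraction of traffic first and calling `rollback` automatically when health degrades.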
The Game of Operations
Cloud
Services
DevOps Culture
Service Teams
• Give teams autonomy
  • Freedom to choose technology, methodology, working environment
  • Responsibility for the results of those choices
• Hold them accountable for *results*
  • Give a team a goal, not a solution
  • Let team own the best way to achieve the goal
KIXEYE Service Chassis
• Goal: “chassis” for building scalable game services
• Minimal resources, minimal direction
  • 3 people x 1 month
  • Consider building on NetflixOSS
Team exceeded expectations
• Co-developed chassis, transport layer, service template, build pipeline, red-black deployment, etc.
• Operability and manageability from the beginning
• 15 minutes from no code to running service in AWS (!)
• Open-sourced at github.com/kixeye
Micro-Services
• Single-purpose
• Simple, well-defined interface
• Modular and independent
• Small teams
• Autonomy and responsibility
[Diagram: five services labeled A through E]
Transition to Service Relationships
Vendor – Customer Relationship
• Friendly and cooperative, but structured
• Clear ownership and division of responsibility
• Customer can choose to use service or not (!)
Service-Level Agreement (SLA)
• Promise of service levels by the provider
• Customer needs to be able to rely on the service, like a utility
Transition to Service Relationships
Charging and Cost Allocation
• Charge customers for *usage* of the service
• Aligns economic incentives of customer and provider
• Motivates both sides to optimize
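Charging by usage can be sketched as a proportional split of a service's cost across its customers. The talk does not describe KIXEYE's actual charging model, so the function name, the use of integer cents, and the studio names in the usage note are all assumptions.

```python
def allocate_cost(total_cents, usage_by_customer):
    """Split a service's cost across customers in proportion to their usage.

    total_cents:       cost of running the service for the period, in cents
    usage_by_customer: mapping of customer name -> usage units (e.g. requests)
    """
    total_usage = sum(usage_by_customer.values())
    if total_usage == 0:
        return {name: 0 for name in usage_by_customer}

    # Floor of each proportional share, in whole cents.
    shares = {name: total_cents * used // total_usage
              for name, used in usage_by_customer.items()}

    # Hand out the cents lost to rounding, largest remainder first,
    # so the shares always add up to the total.
    leftover = total_cents - sum(shares.values())
    by_remainder = sorted(
        usage_by_customer,
        key=lambda name: (total_cents * usage_by_customer[name]) % total_usage,
        reverse=True,
    )
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares
```

For example, `allocate_cost(10000, {"studio_a": 300, "studio_b": 100})` charges the hypothetical `studio_a` three quarters of the bill. A customer that cuts its usage sees its share fall, and a provider that cuts the total cost lowers every customer's bill, which is the incentive alignment the slide describes.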
The Game of Operations
Cloud
Services
DevOps Culture
One Team (!)
• Act as one team across development, product, operations, etc.
• Solve problems instead of blaming and pointing fingers
• Political games are not as fun as real-time strategy games
Everyone Is Responsible for Prod
• Everyone’s incentives are aligned
• Everyone is strongly motivated to have solid instrumentation and monitoring
“DevOps is a reorg”
– Adrian Cockcroft
Blame-Free Post-Mortems
Learn from mistakes and improve
• What did you do -> What did you learn
• Take emotion and personalization out of it
Post-mortem After Every Incident
• Document exactly what happened
• What went right
• What went wrong
Blame-Free Post-Mortems
Open and Honest Discussion
• What contributed to the incident?
• What could we have done better?
Engineers compete to take responsibility (!)
“Failure is not falling down but refusing to get back up”
– Theodore Roosevelt
Transition to DevOps
Organization
• Studios make user-visible games
• Services provide common endpoints
Training / Retraining
• Common bootcamp
• Train devs as Ops, Ops as devs
Transition On-call
• Use primary / secondary on-call as apprenticeship
“You Build It, You Run It”
– Everyone
Recap: The Game of Operations
Cloud
Services
DevOps
Come Join Us!
DevOps Whiskey Tasting, July 22
333 Bush St., San Francisco
kixeyeloveswhiskey.eventbrite.com
Hiring in SF, Seattle, Victoria, Brisbane, Amsterdam
www.kixeye.com/jobs