Riding the N Train: How We Dismantled Groupon's Ruby on Rails Monolith

Post on 06-May-2015

Category:

Technology

DESCRIPTION

This is a story about how Groupon's business was changing and our technology couldn't keep up. We rewrote the website using Node.js and, along the way, changed our company and its culture.

TRANSCRIPT

Riding the N(ode) Train: Dismantling the Monoliths

Tuesday, December 3, 2013

Sean McCullough – Engineer at Groupon @mcculloughsean

Part I

Broken Architecture and a Changing Business

Business in Early 2012

Page 3

Architecture in 2012

Page 4

Leading the Mobile Commerce Revolution

[Chart: Mobile Transaction Mix, Monthly, January 2011 to September 2013 (% of transactions)]

Page 5

Product Engineering was Stuck

We couldn’t build features fast enough

We wanted to build features world-wide

Mobile and Web weren’t at feature parity

Page 6

Part II

The Rewrite

Page 7

The Rewrite

Page 8

The Rewrite

Should ...

• be built on APIs for consistent contract with mobile

• make it easy to hire developers

• allow for teams to work at their own pace

• allow teams to deploy their own code

• allow for global design changes

• have out-of-the-box I18n/L10n support

• be optimized for our read-heavy traffic pattern

• be small

Page 9

How do we…?

• Deploy

• Authorize Users

• Share Sessions

• Route to different applications

• Manage distributed ops

• QA the whole site

Page 10

We Tried This Before and Failed

• Rolled out a new site design in our monolith

• Too many things changed all at once

• Hard to evaluate performance of each feature

Page 11

New Platform Evaluation

We evaluated:

• Node

• MRI Ruby/Rails, MRI Ruby/Sinatra

• JRuby/Rails, JRuby/Sinatra

• MRI Ruby/Sinatra + EventMachine

• Java/Play, Java/Vert.x

• Python+Twisted

• PHP

Page 12

Why Node?

• Vibrant community

• NPM!

• Easy to hire JavaScript developers

• Met our minimum viable performance requirements

• Easy scaling (process model)
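The "process model" bullet refers to scaling Node by running one single-threaded process per CPU core. A minimal sketch of that idea with Node's built-in `cluster` module might look like the following. The helper names `workerCount` and `start` and the port number are assumptions for illustration, not Groupon's actual setup.

```javascript
const cluster = require('cluster');
const http = require('http');
const os = require('os');

// Hypothetical helper: use the requested worker count if it is a positive
// integer, otherwise fall back to one worker per CPU core (at least one).
function workerCount(requested) {
  if (Number.isInteger(requested) && requested > 0) return requested;
  return Math.max(1, os.cpus().length);
}

// Hypothetical entry point: the primary process forks workers and replaces
// any that exit; each worker runs its own HTTP server on the shared port.
function start(port, requested) {
  // cluster.isPrimary exists on newer Node versions; isMaster on older ones.
  const isPrimary =
    cluster.isPrimary !== undefined ? cluster.isPrimary : cluster.isMaster;

  if (isPrimary) {
    const n = workerCount(requested);
    for (let i = 0; i < n; i += 1) cluster.fork();
    // Keep capacity steady: start a new worker whenever one dies.
    cluster.on('exit', () => cluster.fork());
  } else {
    http
      .createServer((req, res) => {
        res.end('handled by pid ' + process.pid + '\n');
      })
      .listen(port);
  }
}
```

Calling `start(8000)` from an entry script would fork one worker per core. Adding capacity then means adding processes or machines rather than tuning threads.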

Page 13

The First App

Page 14

Growing Pains

Page 15

Poking Holes in our Infrastructure

• Longevity Test over two days

• Try to root out memory leaks

• Talking only to non-production systems

Page 16

Poking Holes in our Infrastructure

Within 2 hours we had a major site outage

Page 17

Poking Holes in our Infrastructure

• SSL termination on our hardware load balancer drove its CPU to 100%

• Production systems were using the same load balancer as test and development systems

Page 18

Lessons Learned

• You will run into problems with Node

• You will find problems with your infrastructure

• Don’t panic!

Page 19

The Second App

• Looking for the next page

• Chose the “Browse” page

• Recently Built

• Built using mostly Backbone

• Experienced team of JS developers

Page 20

The Second App

Page 21

The Second App

New Problems:

• User authentication

• More service calls

• Complicated routing

• More traffic

• Needed to share look and feel

Page 22

The Second App

• Cultural problems

• Change of workflow

• Feedback loop fell apart

3 rewrites

6 months to launch

Page 23

Shared Layout

Maintain consistent look and feel across site:

• Distribute layout as library

• Use ESIs for top/bottom of page

• Apps are called through a “chrome service”

• Fetch templates from service

Page 24

Groupon Interface Guidelines

Page 25

Layout Service

• Uses semantic versioning

• Roll forward with bug fixes

• Stay locked on a specific version

• Enable Site-Wide Experiments

Page 26
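The difference between "roll forward with bug fixes" and "stay locked on a specific version" can be sketched as a small version resolver. This is an illustration under assumptions, not the Layout Service's real API: the function `resolveLayoutVersion`, the `1.2.x` request syntax, and the version numbers are all hypothetical.

```javascript
// Parse "MAJOR.MINOR.PATCH"; a patch of "x" means "any patch release".
function parseVersion(text) {
  const m = /^(\d+)\.(\d+)\.(\d+|x)$/.exec(text);
  if (!m) throw new Error('unsupported version: ' + text);
  return {
    major: Number(m[1]),
    minor: Number(m[2]),
    patch: m[3] === 'x' ? null : Number(m[3]),
  };
}

// Hypothetical resolver: given the published layout versions and an app's
// request, return the version to serve, or null if nothing matches.
//   "1.2.x" -> roll forward to the newest bug-fix release of 1.2
//   "1.2.0" -> stay locked on exactly 1.2.0
function resolveLayoutVersion(available, requested) {
  const want = parseVersion(requested);
  const matches = available
    .map(parseVersion)
    .filter((v) => v.patch !== null) // published versions must be concrete
    .filter(
      (v) =>
        v.major === want.major &&
        v.minor === want.minor &&
        (want.patch === null || v.patch === want.patch)
    )
    .sort((a, b) => b.patch - a.patch); // newest patch first
  if (matches.length === 0) return null;
  const best = matches[0];
  return best.major + '.' + best.minor + '.' + best.patch;
}
```

With published versions `1.2.0`, `1.2.3`, and `1.3.0`, an app requesting `1.2.x` would receive `1.2.3`, while an app requesting `1.2.0` would stay on `1.2.0` until it chooses to move.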

Layout Service

Page 27

Layout Service

Page 28

Routing Service

Page 29

The Big Push… or There’s No Going Back

Page 30

• Decided to get the whole company to move at once

• Supporting two platforms is hard – rip off the band-aid!

• End of June 2012 - move to I-Tier by September 1st

The Big Push… or There’s No Going Back

Page 31

• ~150 developers

• Global effort

• Feature freeze – A/B testing against mostly the same features

Part III

It Worked!

Page 32

95% Consumer Traffic On Node

Page 33

Sustained US Traffic Over 120k RPM

Page 34

Our Pages Got Faster

Page 35

It Worked!

Page 36

Success?

Page 37

• Moving to a new platform is not a straight line

• Solving for old problems

• Solving for new problems

• Culture shift

Next Steps

Page 38

• Streaming responses for better performance

• Better resiliency to outages… circuit breakers, brownouts

• Distributed Tracing

• International

• Open Source

New I-Tier apps as we build new teams, products, ideas.

Latest technologies to help us drive our business.
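The circuit breakers listed under Next Steps are a general resiliency pattern: after repeated failures, stop calling a struggling backend for a while and fail fast instead. A minimal sketch of the pattern follows. The class name, option names, and thresholds are assumptions for illustration and do not describe Groupon's implementation.

```javascript
// Minimal circuit breaker sketch (hypothetical names and defaults).
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetAfterMs = 5000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold; // consecutive failures before opening
    this.resetAfterMs = resetAfterMs; // how long to fail fast before a trial call
    this.now = now; // injectable clock, useful in tests
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  // closed: calls pass through; open: calls fail fast;
  // half-open: the next call is a trial that decides what happens next.
  get state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.resetAfterMs ? 'half-open' : 'open';
  }

  // Run an async service call through the breaker.
  async call(fn) {
    const state = this.state;
    if (state === 'open') throw new Error('circuit open: failing fast');
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (state === 'half-open' || this.failures >= this.failureThreshold) {
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

A page handler could wrap each backend call in `breaker.call(...)` and render a degraded page when the breaker rejects, so one slow service does not take the whole page down.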

Q&A
