Slash n: Tech Talk Track 2 – Website Architecture: Mistakes & Learnings - Siddhartha Reddy

Flipkart Website Architecture: Mistakes & Learnings. Siddhartha Reddy, Architect, Flipkart

Upload: slashn

Post on 05-Dec-2014

Category:

Technology


TRANSCRIPT

Page 1: Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy

Flipkart Website Architecture

Mistakes & Learnings

Siddhartha Reddy, Architect, Flipkart

Page 2:

June 2007

Page 3:

November 2007

Page 4:

December 2012

Page 5:

www.flipkart.com

• Started in 2007
• Current architecture from mid-2010
• Evolution of the architecture presented as:
  Issue[1] -> RCA[2] -> Actions -> Learnings

• [1] Issue: Website is “slow”
• [2] RCA = Root Cause Analysis

Page 6:

INFANCY (2007 – MID-2010)
Surviving & reacting to the environment

Page 7:

Website is “slow”!

Page 8:

RCA

• Why?
  – MySQL queries taking too long
• Why?
  – Too many queries
  – Many slow queries
  – Queries locking tables
• Why?
  – Capacity
• Hmm…

Page 9:

Fixing it

• Get beefier servers (the obvious)
• Separate master_db, slave_db
  – Writes go to master_db
  – Reads from slave_db
  – Critical reads from master_db

[Diagram: before, a single MySQL server handles both reads and writes. After, a MySQL master takes the writes and replicates to a MySQL slave that serves the reads.]
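
The routing rule on this slide can be sketched in a few lines. This is only an illustration, not Flipkart's code: `ReplicatedRouter` and `pick` are hypothetical names, and plain strings stand in for real database connections.

```python
import itertools


class ReplicatedRouter:
    """Choose master or slave for a statement (hypothetical helper, not from the talk)."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; fall back to the master if none are given.
        self._slaves = itertools.cycle(slaves or [master])

    def pick(self, sql, critical=False):
        """Return the connection that should run `sql`."""
        is_read = sql.lstrip().lower().startswith(("select", "show"))
        if is_read and not critical:
            return next(self._slaves)
        # Writes, and reads that cannot tolerate replication lag, go to the master.
        return self.master


# Strings stand in for real connections in this sketch.
router = ReplicatedRouter(master="master_db", slaves=["slave_db"])
read_target = router.pick("SELECT title FROM products WHERE id = 1")
write_target = router.pick("UPDATE products SET price = 10 WHERE id = 1")
critical_target = router.pick("SELECT stock FROM inventory WHERE id = 1", critical=True)
```

The `critical` flag mirrors the "critical reads from master_db" bullet: a read that must see the latest write skips the slave because replication may lag.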

Page 10:

Learning from it

• Scale out database reads by distributing load across systems

• Isolate database writes from reads
  – Writes are (usually) more critical

Page 11:

Website is “slow”! (Again)

Page 12:

RCA

• Why?
  – MySQL queries taking too long (on slave_db)
• Why?
  – Too many queries
  – Many slow queries
• Why?
  – Queries from analytics / reporting and other backend jobs
• Urm…

Page 13:

Fixing it

• Analytics / reporting DB (archival_db)
  – Use MyISAM: optimized for reads
  – Additional indexes for quicker reporting

[Diagram: before, the MySQL master takes website writes and replicates to one MySQL slave that serves both website reads and analytics reads. After, the master replicates to two slaves: Slave 1 serves website reads and Slave 2 serves analytics reads.]

Page 14:

Learning from it

• Isolate the databases being used for serving website traffic from those being used for analytics/reporting

• Isolate systems being used by production website from those being used for background processing

Page 15:

BABY (2010 – 2011)
Learning the basics

Page 16:

Website is “slow”!

Page 17:

RCA

• Why?
• How?
  – Instrumentation

Page 18:

RCA - 1

• Why?
  – Logging a lot
  – PHP processes blocking on writing logs

[Diagram: Request1 -> Process1, Request2 -> Process2 and Request3 -> Process3 all write to a single log file. Process1 and Process2 are waiting while Process3 is writing.]

Page 19:

RCA - 2

• Why?
  – Service Oriented Architecture (SOA)
  – Too many calls to remote services per request
    • Creating fresh connection for each call
    • All the calls are made in serial order

[Diagram: Receive request -> Connect to Service1 -> Request Service1 -> Connect to Service2 -> Request Service2 -> Send response]

Page 20:

RCA - 3

• Why?
  – Configurability
  – Fetch a lot of “config” from database for serving each request

[Diagram: Receive request -> Fetch Config1 -> Fetch Config2 -> Fetch Config3 -> Fetch Config4 -> Send response, with each fetch going to the database]

Page 21:

RCA – 1,2,3

• Why?
  – Logging a lot
  – SOA
  – Configurability
• Why?
  – PHP’s process model
• Argh!

Page 22:

Fixing it

• fk-w3-agent
  – Simple Java “middleware” daemon
  – Deployed on each web server
  – PHP communicates to it through local socket
  – Hosts pluggable “handlers”

Page 23:

fk-w3-agent: LoggingHandler

[Diagram: before, Process1, Process2 and Process3 (serving Request1, Request2 and Request3) each write directly to the log file. After, the processes hand their log entries to fk-w3-agent, which writes to the log file in an async / buffered manner.]
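
The talk does this with a separate Java daemon reached over a local socket. The same async / buffered idea can be shown in a single process with Python's standard `logging.handlers.QueueHandler` and `QueueListener`. This is a loose analogy under that assumption, not the fk-w3-agent design; `make_async_logger` is a hypothetical helper.

```python
import io
import logging
import logging.handlers
import queue


def make_async_logger(name, target_handler):
    """Return (logger, listener). Log calls only enqueue; a background thread writes."""
    log_queue = queue.Queue(-1)  # maxsize <= 0 means an unbounded buffer
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the sketch's output out of the root logger
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    # The listener drains the queue on its own thread and calls the real handler.
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()
    return logger, listener


# An in-memory stream stands in for the shared log file.
stream = io.StringIO()
logger, listener = make_async_logger("web", logging.StreamHandler(stream))
logger.info("request served")  # returns as soon as the record is queued
listener.stop()                # drains the queue, then joins the writer thread
written = stream.getvalue()
```

The request path never waits on the log file: only the single writer thread touches it, so there is no contention between request handlers.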

Page 24:

fk-w3-agent: ServiceHandler(s)

[Diagram: before, Receive request -> Connect to Service1 -> Request Service1 -> Connect to Service2 -> Request Service2 -> Send response. After, Receive request -> Call fk-w3-agent -> Send response, with fk-w3-agent talking to Service1 and Service2.]
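
The slides do not say what a ServiceHandler does internally. Presumably it addresses the two problems named in RCA - 2: a fresh connection per call, and calls made in serial order. The sketch below assumes that and shows only the "not serial" half, using a long-lived thread pool. `call_service` is a hypothetical stand-in that sleeps instead of doing network I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Long-lived pool: created once and reused across requests, in the same spirit
# as reusing connections instead of opening a fresh one for each call.
POOL = ThreadPoolExecutor(max_workers=8)


def call_service(name, latency_s):
    """Hypothetical remote call; the sleep simulates network and service time."""
    time.sleep(latency_s)
    return f"{name}: ok"


def fetch_all(calls):
    """Issue every (name, latency) call concurrently; return results in input order."""
    futures = [POOL.submit(call_service, name, latency) for name, latency in calls]
    return [future.result() for future in futures]


start = time.perf_counter()
results = fetch_all([("Service1", 0.3), ("Service2", 0.3)])
elapsed = time.perf_counter() - start  # roughly 0.3 s rather than 0.6 s in serial
```

With concurrent calls the request waits for the slowest service rather than for the sum of all of them.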

Page 25:

fk-w3-agent: ConfigHandler

[Diagram: before, Receive request -> Fetch Config1 -> Fetch Config2 -> Fetch Config3 -> Fetch Config4 -> Send response, with each fetch going to the database. After, Receive request -> Fetch all config from fk-w3-agent -> Send response, with fk-w3-agent polling the database and caching the config.]
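
A rough sketch of the "poll and cache" idea follows. `ConfigCache`, its `loader` callable and the polling interval are all assumptions made for illustration; the talk does not describe the handler's interface.

```python
import threading


class ConfigCache:
    """Serve config from memory and refresh it from `loader` on a fixed interval."""

    def __init__(self, loader, interval_s=30.0):
        self._loader = loader        # callable returning a dict with all config
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._config = loader()      # initial load, so reads never see an empty cache
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()

    def _poll(self):
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._interval_s):
            fresh = self._loader()
            with self._lock:
                self._config = fresh

    def get_all(self):
        """Return a copy of the cached config without touching the database."""
        with self._lock:
            return dict(self._config)

    def stop(self):
        self._stop.set()
        self._thread.join()


db_reads = []


def load_from_db():
    """Stand-in for the database query; records how often it is called."""
    db_reads.append(1)
    return {"Config1": "a", "Config2": "b"}


cache = ConfigCache(load_from_db, interval_s=3600)
for _ in range(100):  # 100 "requests" served from memory
    config = cache.get_all()
cache.stop()
```

Database load now depends on the polling interval instead of on request volume, at the cost of config changes taking up to one interval to show up.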

Page 26:

Learning from it

• PHP: good for frontend and templating
  – Gives a lot of agility
  – Limiting process model
    • Hurdle for high performance
• Java: stability and performance
• Horses for courses

Page 27:

Website is “slow”! (Again)

Page 28:

RCA

• Why?
  – PHP processes taking up too much time
  – PHP processes taking up too much CPU
• Why?
  – Product info deserialization taking up time/CPU
  – View construction taking up time/CPU

Page 29:

Fixing it

• Caching!
• Cache fully constructed pages
  – For a few minutes
  – Only for highly trafficked pages (Homepage)
• Cache PHP serialized Product objects
  – ~20 million objects
  – Memcache
• Yeah! But…
  – Add caching => add complexity

Page 30:

Caching: Complications (1)

• “Caching fully constructed pages”
• But parts of pages still need to be dynamic
  – Example: Logged-in user’s name
• Impossible to do effective bucket testing
  – Or at least makes it prohibitively complex

Page 31:

Caching: Complications (2)

• “Caching PHP serialized Product objects”
• Without caching:
  getProductInfo() -> Fetch from CMS
• With caching, cache hit:
  getProductInfo() -> Fetch from Cache
• With caching, cache miss:
  getProductInfo() -> Fetch from Cache -> Fetch from CMS -> Set in Cache
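
The three paths above collapse into one read-through function. In this sketch plain dictionaries stand in for the CMS and for Memcache, and `ProductStore` is a hypothetical wrapper; only the control flow is taken from the slide.

```python
class ProductStore:
    """Read-through cache in front of a CMS (dicts stand in for both systems)."""

    def __init__(self, cms):
        self.cms = cms          # product_id -> product info, the source of truth
        self.cache = {}         # stand-in for Memcache
        self.cms_fetches = 0    # counts how often the slow path runs

    def get_product_info(self, product_id):
        # 1. Fetch from Cache
        info = self.cache.get(product_id)
        if info is not None:
            return info  # cache hit
        # 2. Cache miss: Fetch from CMS
        self.cms_fetches += 1
        info = self.cms[product_id]
        # 3. Set in Cache, so the next call takes the fast path
        self.cache[product_id] = info
        return info


store = ProductStore({"p1": {"title": "Book"}})
first = store.get_product_info("p1")   # miss: goes to the CMS
second = store.get_product_info("p1")  # hit: served from the cache
```

The complication the slide points at is visible here: the miss path does strictly more work than the uncached path, and every caller now depends on the cache being kept consistent with the CMS.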

Page 32:

Caching: Complications (3)

• TTL: ∞ (i.e. no invalidation)
• Pro-actively repopulate products in the cache
  – Receive “notifications” about product updates
    • Notification Server: pushes notifications raised by CMS
• Use a persistent, distributed cache
  – Memcache => Membase, Couchbase
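
A minimal sketch of "no TTL, repopulate on notification", again with dictionaries standing in for the CMS and the cache. `NotifiedProductCache` and `on_product_updated` are hypothetical names; the real notification format is not given in the talk.

```python
class NotifiedProductCache:
    """Cache whose entries never expire and are refreshed only by update notifications."""

    def __init__(self, cms):
        self.cms = cms
        self.cache = dict(cms)  # pre-populated; there is no TTL and no eviction

    def on_product_updated(self, product_id):
        """Notification handler: re-fetch the product from the CMS and overwrite it."""
        self.cache[product_id] = self.cms[product_id]

    def get_product_info(self, product_id):
        # Reads never fall through to the CMS in this model.
        return self.cache[product_id]


cms = {"p1": {"price": 100}}
products = NotifiedProductCache(cms)

cms["p1"] = {"price": 90}                          # product changes in the CMS
stale = products.get_product_info("p1")["price"]   # still the old value
products.on_product_updated("p1")                  # notification arrives
fresh = products.get_product_info("p1")["price"]
```

Correctness now rests on every update producing a notification, and a lost cache means a full repopulation. That is one reading of why the slide asks for a persistent, distributed cache.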

Page 33:

Learning from it

• Caching is a powerful tool for performance optimization

• Caching adds complexities
  – Reduced by keeping cache close to data source
  – Think deeply about TTL, invalidation

• Use caching to go from “acceptable performance” to “awesome performance”
  – Don’t rely on it to get to “acceptable performance”

Page 34:

KID (2012)
Growing up

Page 35:

Website is “slow”!

Page 36:

RCA

• Why?
  – Search-service is slow (or Reviews-service is slow or Recommendations-service is slow)
• But why is the rest of the website slow?
  – Requests to the slow service are blocking processing threads
• Eh?!

Page 37:

Let’s do some math

• Let’s say
  – Mean (or median) response time: 100 ms
  – 8-core server
  – All requests are CPU bound
• Throughput: 80 requests per second (rps)
• Let’s also say
  – 95th Percentile response time: 1000 ms
    • Call them “bad requests”
• 4 bad requests in a second
  – Throughput down to 44 rps
• 8 bad requests in a second?
  – Throughput down to 8 rps
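
The arithmetic behind these numbers can be reproduced with a small model. It assumes, as the slide does, that each request pins one core for its whole duration; `throughput_rps` is a hypothetical helper and only makes sense while the bad requests do not outnumber the cores.

```python
def throughput_rps(cores, normal_ms, bad_requests, bad_ms=1000):
    """Requests completed in a one-second window on `cores` CPU-bound workers.

    Assumes bad_requests <= cores, so every bad request can finish in the window.
    """
    core_ms_total = cores * 1000           # core-milliseconds available per second
    core_ms_bad = bad_requests * bad_ms    # each bad request pins one core for bad_ms
    core_ms_left = max(core_ms_total - core_ms_bad, 0)
    # Completed requests = the bad ones plus whatever normal ones fit in the rest.
    return bad_requests + core_ms_left // normal_ms


healthy = throughput_rps(cores=8, normal_ms=100, bad_requests=0)    # 80
degraded = throughput_rps(cores=8, normal_ms=100, bad_requests=4)   # 44
saturated = throughput_rps(cores=8, normal_ms=100, bad_requests=8)  # 8
```

Four bad requests are only about 9% of the 44 requests served, yet they halve the throughput, because each one holds a core ten times longer than a normal request.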

Page 38:

Fixing it

• Aggressive timeouts for all service calls
  – Isolate impact of a slow service
    • only to pages that depend on it
• Very aggressive timeouts for non-critical services
  – Example: Recommendations
    • On a Product page, Search results page etc.
    • Not on My Recommendations page
• Load non-critical parts of pages through AJAX
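
One way to picture "aggressive timeouts with a fallback" is the sketch below. The service functions, the timeout values and the `call_with_timeout` helper are all invented for illustration; a real client would normally also set a socket-level timeout so that the slow call itself is abandoned.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

POOL = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run `fn` on the pool; give up waiting after `timeout_s` and return `fallback`."""
    future = POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Only the caller is released; the worker thread still runs to completion.
        return fallback


def slow_recommendations():
    """Hypothetical non-critical service having a bad day."""
    time.sleep(0.5)
    return ["r1", "r2"]


def fast_search():
    """Hypothetical critical service responding normally."""
    return ["hit1"]


page = {
    # Critical part of the page: a normal timeout.
    "search": call_with_timeout(fast_search, 1.0, fallback=[]),
    # Good-to-have part: a very aggressive timeout, rendered empty on failure.
    "recommendations": call_with_timeout(slow_recommendations, 0.05, fallback=[]),
}
```

The page is still served without its recommendations block, which is the kind of degradation the slide accepts for non-critical services.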

Page 39:

Learning from it

• Isolate the impact of poorly performing services / systems

• Isolate the required from the good-to-have

Page 40:

Website is “slow”! (Again)

Page 41:

RCA

• Why?
  – Load average of web servers has spiked
• Why?
  – Requests per second has spiked
    • From 1000 rps to 1500 rps
• Why?
  – Large number of notifications of product information updates

Page 42:

Fixing it

• Separate cluster for receiving product info update notifications from the cluster that serves users

• Admission control: Don’t let a system receive more requests than it can handle
  – Throttling

• Batch the notifications
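
Admission control and batching can each be sketched in a few lines. The in-flight limit, the batch size and the names `AdmissionController` and `batch` are assumptions for illustration; the talk does not say how throttling was implemented.

```python
import threading


class AdmissionController:
    """Reject work beyond `max_in_flight` instead of letting it queue up."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self):
        # Non-blocking acquire: False means "shed this request now".
        return self._slots.acquire(blocking=False)

    def release(self):
        """Call when an admitted request has finished."""
        self._slots.release()


def batch(items, size):
    """Group notifications into lists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


controller = AdmissionController(max_in_flight=2)
admitted = [controller.try_admit() for _ in range(3)]  # the third is rejected
controller.release()                                   # one request finishes
admitted_later = controller.try_admit()                # a slot is free again

batches = batch(["p1", "p2", "p3", "p4", "p5"], size=2)
```

Rejecting early keeps the receiving system inside its capacity no matter how eager the client is, and batching turns many small notifications into fewer requests.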

Page 43:

Learning from it

• Isolate the systems serving internal requests from those serving production traffic

• Admission control to ensure that a system is isolated from the over-enthusiasm of a client

• Look at the granularity at which we’re working

Page 44:

TEENAGER
Increasing complexity

Page 45:

Page 46:

THANK YOU

Page 47:

Mistake?

• Sub-optimal decision
  – Not all information/scenarios considered
  – Insufficient information
  – Built for a different scenario
• Due to focus on “functional” aspects
• A mistake is a mistake
  – … in retrospect