techtalktrack2 sid-final-130207111143-phpapp02

Flipkart Website Architecture: Mistakes & Learnings. Siddhartha Reddy, Architect, Flipkart

Upload: karthik-murugesan

Post on 05-Dec-2014

TRANSCRIPT

Page 1

Flipkart Website Architecture

Mistakes & Learnings

Siddhartha Reddy
Architect, Flipkart

Page 2

June 2007

Page 3

November 2007

Page 4

December 2012

Page 5

www.flipkart.com

• Started in 2007
• Current Architecture from mid 2010
• Evolution of the architecture presented as…

Issue[1] → RCA[2] → Actions → Learnings

• [1] Issue: Website is “slow”
• [2] RCA = Root Cause Analysis

Page 6

INFANCY (2007 – MID-2010)
Surviving & reacting to the environment

Page 7

Website is “slow”!

Page 8

RCA

• Why?
  – MySQL queries taking too long
• Why?
  – Too many queries
  – Many slow queries
  – Queries locking tables
• Why?
  – Capacity
• Hmm…

Page 9

Fixing it

• Get beefier servers (the obvious)
• Separate master_db, slave_db
  – Writes go to master_db
  – Reads from slave_db
  – Critical reads from master_db

[Diagram: before, a single MySQL server handles both reads and writes. After, writes go to the MySQL master, which replicates to a MySQL slave that serves reads.]
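
A minimal sketch of the routing rule on this slide, written in Python purely for illustration (the site itself is described as PHP). The class and method names are hypothetical, and `master_db` / `slave_db` are plain strings standing in for real connections. Sending critical reads to the master is presumably a guard against replication lag; the slide does not spell out the reason.

```python
import random


class ReadWriteRouter:
    """Pick a database for each query (hypothetical helper, not Flipkart's code)."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)

    def pick(self, is_write, critical=False):
        # Writes and critical reads always go to the master.
        if is_write or critical or not self.slaves:
            return self.master
        # Ordinary reads are spread across the slaves.
        return random.choice(self.slaves)


router = ReadWriteRouter("master_db", ["slave_db"])
print(router.pick(is_write=True))                  # master_db
print(router.pick(is_write=False))                 # slave_db
print(router.pick(is_write=False, critical=True))  # master_db
```

Adding more slaves to the list scales out reads without touching the call sites.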

Page 10

Learning from it

• Scale out database reads by distributing load across systems

• Isolate database writes from reads
  – Writes are (usually) more critical

Page 11

Website is “slow”! (Again)

Page 12

RCA

• Why?
  – MySQL queries taking too long (on slave_db)
• Why?
  – Too many queries
  – Many slow queries
• Why?
  – Queries from analytics / reporting and other backend jobs
• Urm…

Page 13

Fixing it

• Analytics / reporting DB (archival_db)
  – Use MyISAM: optimized for reads
  – Additional indexes for quicker reporting

[Diagram: before, the MySQL master takes website writes and replicates to one MySQL slave, which serves both website reads and analytics reads. After, the master replicates to two slaves: Slave 1 serves website reads and Slave 2 serves analytics reads.]

Page 14

Learning from it

• Isolate the databases that serve website traffic from those used for analytics / reporting

• Isolate the systems used by the production website from those used for background processing

Page 15

BABY (2010 – 2011)
Learning the basics

Page 16

Website is “slow”!

Page 17

RCA

• Why?
• How?
  – Instrumentation

Page 18

RCA - 1

• Why?
  – Logging a lot
  – PHP processes blocking on writing logs

[Diagram: three requests are handled by three PHP processes that share one log file. While one process is writing to the file, the other two are waiting.]

Page 19

RCA - 2

• Why?
  – Service Oriented Architecture (SOA)
  – Too many calls to remote services per request
    • Creating fresh connection for each call
    • All the calls are made in serial order

[Diagram: Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response]

Page 20

RCA - 3

• Why?
  – Configurability
  – Fetch a lot of “config” from database for serving each request

[Diagram: Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 → Send response, with every fetch going to the Database]

Page 21

RCA – 1,2,3

• Why?
  – Logging a lot
  – SOA
  – Configurability
• Why?
  – PHP’s process model
• Argh!

Page 22

Fixing it

• fk-w3-agent
  – Simple Java “middleware” daemon
  – Deployed on each web server
  – PHP communicates to it through local socket
  – Hosts pluggable “handlers”

Page 23

fk-w3-agent: LoggingHandler

[Diagram: before, the three PHP processes each write directly to the log file. After, the three processes hand their log lines to fk-w3-agent, which writes them to the log file (async / buffered).]
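
The "async / buffered" idea can be sketched as a queue drained by a single background writer, so request handlers only enqueue and never wait on file I/O. This is an illustration in Python, not the Java LoggingHandler from the talk; `BufferedLogger`, its `sink` callable, and `batch_size` are hypothetical names.

```python
import queue
import threading


class BufferedLogger:
    """Accept log lines without blocking callers; one thread does the writing."""

    def __init__(self, sink, batch_size=100):
        self._queue = queue.Queue()
        self._sink = sink              # callable that receives a list of lines
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, line):
        # Enqueue only: the caller returns immediately.
        self._queue.put(line)

    def close(self):
        # A None sentinel tells the worker to flush and exit.
        self._queue.put(None)
        self._worker.join()

    def _drain(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is None:
                break
            batch.append(item)
            # Flush when the batch is full or nothing else is waiting.
            if len(batch) >= self._batch_size or self._queue.empty():
                self._sink(batch)
                batch = []
        if batch:
            self._sink(batch)


written = []
logger = BufferedLogger(written.extend, batch_size=2)
for i in range(5):
    logger.log(f"request {i} served")
logger.close()
print(len(written))  # 5
```

In a real deployment the sink would append to the log file; a list is used here so the sketch runs anywhere.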

Page 24

fk-w3-agent: ServiceHandler(s)

[Diagram: before, Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response. After, Receive request → Call fk-w3-agent → Send response, with fk-w3-agent talking to Service1 and Service2.]
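
The slides do not say how the ServiceHandler is implemented. One plausible reading of the diagram is that the agent issues the service calls concurrently instead of one after another. A minimal Python sketch under that assumption, with hypothetical names and lambdas standing in for real service clients:

```python
from concurrent.futures import ThreadPoolExecutor


def call_services_in_parallel(calls, max_workers=8):
    """Run every zero-argument callable in `calls` concurrently.

    `calls` maps a service name to a callable; the result maps each
    name to that callable's return value.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in calls.items()}
        return {name: future.result() for name, future in futures.items()}


results = call_services_in_parallel({
    "service1": lambda: "reviews payload",
    "service2": lambda: "search payload",
})
print(results["service1"])  # reviews payload
```

With this shape the total wait is roughly that of the slowest call rather than the sum of all calls. A long-lived daemon could also keep connections open between requests, which a PHP process that starts fresh per request cannot easily do.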

Page 25

fk-w3-agent: ConfigHandler

[Diagram: before, Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 → Send response, with every fetch going to the Database. After, Receive request → Fetch all config from fk-w3-agent → Send response, while fk-w3-agent polls the Database and caches the result.]
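
"Poll and cache" can be sketched as a background thread that reloads all config on a timer while request handlers read an in-memory snapshot. The Python below is an illustration only; `PolledConfigCache`, `load_all`, and the 30-second interval are assumptions, not details from the talk.

```python
import threading


class PolledConfigCache:
    """Serve config from memory; reload it from the source on a timer."""

    def __init__(self, load_all, interval_s=30.0):
        self._load_all = load_all        # callable returning a dict of all config
        self._interval_s = interval_s
        self._snapshot = load_all()      # initial load before serving requests
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()

    def refresh(self):
        # Replace the whole snapshot in one assignment.
        self._snapshot = self._load_all()

    def _poll(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self.refresh()

    def get_all(self):
        # No database round trip on the request path.
        return self._snapshot

    def stop(self):
        self._stop.set()
        self._thread.join()


config = PolledConfigCache(lambda: {"homepage_banner": "on"}, interval_s=30.0)
print(config.get_all()["homepage_banner"])  # on
config.stop()
```

The trade-off is staleness: a config change becomes visible only after the next poll.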

Page 26

Learning from it

• PHP: good for frontend and templating
  – Gives a lot of agility
  – Limiting process model
    • Hurdle for high performance
• Java: stability and performance
• Horses for courses

Page 27

Website is “slow”! (Again)

Page 28

RCA

• Why?
  – PHP processes taking up too much time
  – PHP processes taking up too much CPU
• Why?
  – Product info deserialization taking up time/CPU
  – View construction taking up time/CPU

Page 29

Fixing it

• Caching!
• Cache fully constructed pages
  – For a few minutes
  – Only for highly trafficked pages (Homepage)
• Cache PHP serialized Product objects
  – ~20 million objects
  – Memcache
• Yeah! But…
  – Add caching => add complexity

Page 30

Caching: Complications (1)

• “Caching fully constructed pages”
  – But parts of pages still need to be dynamic
    • Example: Logged-in user’s name
  – Impossible to do effective bucket testing
    • Or at least makes it prohibitively complex

Page 31

Caching: Complications (2)

• “Caching PHP serialized Product objects”
  – Without caching: getProductInfo() → Fetch from CMS
  – With caching, cache hit: getProductInfo() → Fetch from Cache
  – With caching, cache miss: getProductInfo() → Fetch from Cache → Fetch from CMS → Set in Cache
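
The hit and miss paths above are the familiar cache-aside pattern. A minimal Python sketch, assuming a plain dict in place of Memcache and a hypothetical `fetch_from_cms` callable in place of the real CMS client:

```python
def get_product_info(product_id, cache, fetch_from_cms):
    """Return product info, consulting the cache before the CMS."""
    product = cache.get(product_id)
    if product is not None:
        return product                    # cache hit: no CMS call
    product = fetch_from_cms(product_id)  # cache miss: fall back to the CMS
    cache[product_id] = product           # set in cache for later requests
    return product


cms_calls = []


def fetch_from_cms(product_id):
    # Stand-in for the real CMS lookup; records each call for the demo.
    cms_calls.append(product_id)
    return {"id": product_id, "title": "Example product"}


cache = {}
get_product_info("p1", cache, fetch_from_cms)  # miss: goes to the CMS
get_product_info("p1", cache, fetch_from_cms)  # hit: served from the cache
print(len(cms_calls))  # 1
```

Note that the miss path is slower than having no cache at all, since it pays for the cache lookup, the CMS fetch, and the cache write.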

Page 32

Caching: Complications (3)

• TTL: ∞ (i.e. no invalidation)
• Pro-actively repopulate products in the cache
  – Receive “notifications” about product updates
    • Notification Server: pushes notifications raised by CMS
• Use a persistent, distributed cache
  – Memcache => Membase, Couchbase
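
With no expiry, cache entries stay correct only if every update overwrites them. A minimal Python sketch of a notification consumer that does this, assuming notifications carry product IDs; the function name, the dict cache, and `fetch_from_cms` are hypothetical.

```python
def repopulate_on_notifications(product_ids, cache, fetch_from_cms):
    """Overwrite the cached copy of every product named in the notifications.

    Returns the IDs that were refreshed, each at most once.
    """
    refreshed = []
    # dict.fromkeys drops duplicate IDs while keeping their order.
    for product_id in dict.fromkeys(product_ids):
        cache[product_id] = fetch_from_cms(product_id)
        refreshed.append(product_id)
    return refreshed


cms = {"p1": {"price": 100}, "p2": {"price": 250}}
cache = {"p1": {"price": 90}}  # stale entry that never expires on its own

print(repopulate_on_notifications(["p1", "p2", "p1"], cache, cms.__getitem__))
# ['p1', 'p2']
print(cache["p1"]["price"])  # 100
```

Because entries are rewritten rather than deleted, readers never take the slow miss path for a product that has already been cached.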

Page 33

Learning from it

• Caching is a powerful tool for performance optimization
• Caching adds complexities
  – Reduced by keeping cache close to data source
  – Think deeply about TTL, invalidation
• Use caching to go from “acceptable performance” to “awesome performance”
  – Don’t rely on it to get to “acceptable performance”

Page 34

KID (2012)
Growing up

Page 35

Website is “slow”!

Page 36

RCA

• Why?
  – Search-service is slow (or Reviews-service is slow or Recommendations-service is slow)
• But why is the rest of the website slow?
  – Requests to the slow service are blocking processing threads
• Eh?!

Page 37

Let’s do some math

• Let’s say
  – Mean (or median) response time: 100 ms
  – 8-core server
  – All requests are CPU bound
• Throughput: 80 requests per second (rps)
• Let’s also say
  – 95th Percentile response time: 1000 ms
    • Call them “bad requests”
• 4 bad requests in a second
  – Throughput down to 44 rps
• 8 bad requests in a second?
  – Throughput down to 8 rps
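
The figures above follow if each bad request ties up one core for its full 1000 ms and the remaining core time is spent on 100 ms requests. A short Python check of that arithmetic; the function name and the one-second window are assumptions made for illustration.

```python
def throughput_rps(cores, normal_ms, bad_requests, bad_ms=1000):
    """Requests completed in one second on `cores` CPU-bound workers."""
    budget_ms = cores * 1000                   # total core time in one second
    bad_done = min(bad_requests, budget_ms // bad_ms)
    remaining_ms = budget_ms - bad_done * bad_ms
    return bad_done + remaining_ms // normal_ms


for bad in (0, 4, 8):
    print(bad, throughput_rps(cores=8, normal_ms=100, bad_requests=bad))
# 0 80
# 4 44
# 8 8
```

Four slow requests (5% of the normal 80) are enough to cut throughput by 45%, which is why a single slow service can drag down pages that never call it.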

Page 38

Fixing it

• Aggressive timeouts for all service calls
  – Isolate impact of a slow service
    • Only to pages that depend on it
• Very aggressive timeouts for non-critical services
  – Example: Recommendations
    • On a Product page, Search results page etc.
    • Not on My Recommendations page
• Load non-critical parts of pages through AJAX
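
A minimal Python sketch of a timeout with a fallback for a non-critical service. The helper names, the 50 ms budget, and the simulated slow service are illustrative assumptions. Note that this version only stops waiting: the worker thread still runs until the slow call returns, so a production client would also set timeouts on the underlying connection.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(service_call, timeout_s, fallback):
    """Return the service result, or `fallback` if it takes too long."""
    future = _pool.submit(service_call)
    try:
        return future.result(timeout=timeout_s)
    except CallTimeout:
        # Give up waiting so the page can render without this section.
        return fallback


def slow_recommendations():
    # Stand-in for a recommendations service having a bad moment.
    time.sleep(0.3)
    return ["item-1", "item-2"]


print(call_with_timeout(slow_recommendations, 0.05, []))  # []
print(call_with_timeout(lambda: ["item-3"], 1.0, []))     # ['item-3']
```

On a product page the empty list simply means no recommendations block is shown; on a page whose whole purpose is recommendations, a longer timeout would be appropriate.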

Page 39

Learning from it

• Isolate the impact of poorly performing services / systems

• Isolate the required from the good-to-have

Page 40

Website is “slow”! (Again)

Page 41

RCA

• Why?
  – Load average of web servers has spiked
• Why?
  – Requests per second has spiked
    • From 1000 rps to 1500 rps
• Why?
  – Large number of notifications of product information updates

Page 42

Fixing it

• Separate cluster for receiving product info update notifications from the cluster that serves users

• Admission control: Don’t let a system receive more requests than it can handle
  – Throttling

• Batch the notifications
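
Throttling and batching can each be sketched in a few lines. The Python below uses a token bucket for admission control, which is one common choice and not necessarily what Flipkart used; the class, the helper, and the rates are illustrative assumptions.

```python
import time


class TokenBucket:
    """Admit at most `rate_per_s` requests per second, with a small burst."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self._rate = rate_per_s
        self._capacity = capacity
        self._tokens = float(capacity)
        self._clock = clock
        self._last = clock()

    def try_admit(self):
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # caller should reject or defer the request


def in_batches(items, size):
    """Split a list of notifications into lists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


print(in_batches(["p1", "p2", "p3", "p4", "p5"], 2))
# [['p1', 'p2'], ['p3', 'p4'], ['p5']]
```

Batching reduces the number of requests a burst of updates generates, while the bucket caps whatever still arrives, so the receiving cluster sees a bounded load either way.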

Page 43

Learning from it

• Isolate the systems serving internal requests from those serving production traffic

• Admission control to ensure that a system is isolated from the over-enthusiasm of a client

• Look at the granularity at which we’re working

Page 44

TEENAGER
Increasing complexity

Page 45
Page 46

THANK YOU

Page 47

Mistake?

• Sub-optimal decision
  – Not all information/scenarios considered
  – Insufficient information
  – Built for a different scenario
• Due to focus on “functional” aspects
• A mistake is a mistake
  – … in retrospect