Large-scale Web Application Development: Server Side
Tancred Lindholm, D.Sc.
tancred(a.)google.com
Contents
- Web Applications
- Serving Web Applications at Scale
  - Environment characteristics
  - Requirements and Best Practices
- Web Application Infrastructure
  - Distributed coordination: Chubby
  - Storage: GFS and BigTable
  - Processing (batch): MapReduce
Web Application Overview
Here: Web Application = an interactive web page that closely resembles a traditional application (UI-wise)
Architecture = Client/Server (it's back from the 90s...)
- Client in the user's browser
- Server in a data center
- Client-Server communication using HTTP
Client side heavily guided -- constrained by the W3C stack
- HTML/XML for structure
- CSS for styling
- JavaScript for programming
- JavaScript "libraries": XHR, DOM
- Topic for next week
Serving Web Applications at Scale
- One Server vs. a Data Center (distribution?)
- One Engineer vs. Lots of Engineers (usability?)
Large-scale Computing
Internet-scale computing:
- Millions of users
- 24/7 availability: no maintenance breaks!
- Petabytes of data (1 PB = 60 years of video)
- Thousands of CPUs
- Although efficiently used, still lots of energy
- Cost of software inefficiencies can be huge
Google's Hardware Philosophy[1]
Truckloads of low-cost machines
What does this imply?
- Software must tolerate failure
- Which particular machine an application runs on should not matter
- No special machines, just 2 or 3 flavors
Truckloads of engineers... :)
An isolated engineer writes his own system, in his own noncanonical way, in his own favorite fringe language
... and it works great!
A group of 1000 engineers does the same
... and there's no end to the confusion!
As an engineer you want APIs that use familiar concepts and hold no surprises
- especially when you use, on average, 1 new API per day
Survival Strategies
- KISS
- Knowledge and Quality
- Data-driven engineering
- Monitoring
- Managing continuous evolution
- Did I already mention KISS?
KISS
KISS = Keep It Simple, Stupid!
Simplicity is built by introducing good interfaces to (complex) functionality
- a consistent, minimal metaphor for the functionality
- what is the problem that the interface solves?
- be careful of layers that add nothing
- iteration with real use helps narrow the design down
What is simple?
- usually your peers know; you as the implementor do not
- building functionality x makes you too much of an expert, and thus unsuited to designing its interface?
Knowledge and Quality
Writing optimally performing software often requires cross-layer understanding
- engineers that know the stack, not just a slice
A culture of engineers passionate about software
- debating, code reviews, taking things apart, testing, caring about the “right thing” exposes code to much analysis
- “Given enough eyeballs, all bugs are shallow” (Linus’ Law)
Data-Driven Engineering
Theory and practice may clash at large scales
- in theory, there is no need to handle the most common path separately
  - in practice, it usually makes sense
- in theory, use the best big-O algorithm
  - in practice, simpler & slightly slower may win!
Know your input data!
- Size, extreme values: while the median is 1k, what if you hit that 1G item?
- Quality of data: your superfast HTML parser is useless if it requires valid input and the corpus is 90% invalid “street HTML”!
Data-Driven Engineering
“When in doubt, mapreduce”
Large-scale computing judo: use the large capacity to quickly get data on a large number of inputs
Micro-benchmarking
- Prototype, feed in realistic data, identify hotspots early (a minimal sketch follows below)
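As a minimal sketch of the idea in Python (parse_record and the synthetic sample are invented for illustration): time the suspected hotspot in isolation, then profile the whole prototype with realistic data.

import cProfile
import timeit

def parse_record(line):
    # Candidate hotspot: naive field splitting of one input record.
    return line.strip().split("\t")

# Feed realistic data: include the extremes, not just the median case.
sample = ["key\t" + "x" * n for n in (10, 1_000, 1_000_000)] * 100

# Micro-benchmark the suspected hotspot in isolation...
print(timeit.timeit(lambda: [parse_record(l) for l in sample], number=10))

# ...then profile the whole prototype to find the hotspots you did not expect.
cProfile.run("[parse_record(l) for l in sample]")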
Monitoring
Your code is up and running, all is well…
… or is it?
- is the code acceptably fast?
- what is the failure rate?
  - redundancy sometimes masks excessive failures at the cost of performance
- what is the resource use -- resources cost!
Gathering stats about code
- helps isolate problems
- provides data for the next steps (see the counter sketch below)
Measuring can be challenging
- how can property x be measured?
- are you really measuring the right thing?
Needed: the mindset of an empirical scientist!
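One low-tech way to gather such stats is to export named counters from the serving code and let a collector scrape them; a minimal thread-safe sketch in Python (all names, including do_work, are invented for illustration):

import threading
import time
from collections import Counter

class Stats:
    """Named counters exported by serving code; a collector scrapes snapshot()."""
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def incr(self, name, delta=1):
        with self._lock:
            self._counts[name] += delta

    def snapshot(self):
        with self._lock:
            return dict(self._counts)

STATS = Stats()

def do_work(request):
    ...  # the actual serving logic (stub)

def handle_request(request):
    STATS.incr("requests")
    start = time.monotonic()
    try:
        do_work(request)
    except Exception:
        STATS.incr("failures")  # failure rate = failures / requests
        raise
    finally:
        STATS.incr("latency_ms_total", int((time.monotonic() - start) * 1000))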
Continuous Evolution
Current system up and running 24/7, user happiness at level x
Challenge: deploy feature y to yield happiness level x+1 without degraded user experience in between
Nowadays software is continuously evolving
- Seems especially true of cloud software
- Meaningful question: What version of Office are you using?
- Meaningless question: What version of Google Docs are you using?!
Some strategies to follow…
Continuous Evolution Strategies
Be incremental, i.e., deploy in small steps
- even the most trivial step can be very challenging due to configurations, security, volume, data quality, ...
Provide a fallback
Don't fix what ain't broken
KISS
Also, you should remember KISS
Things to look out for:
- Abstractions that don't solve a problem
- Are you using concepts you made up yourself?
- "This is a special case, so the rules do not apply..."
The Chubby Lock Service for Loosely-Coupled Distributed Systems
Publication: "The Chubby lock service for loosely-coupled distributed systems", Mike Burrows, Google Inc. [3]
Chubby Outline
- Distributed consensus
- Chubby
- Design space
- Service operations
- Locking
- Cases of use and misuse
Distributed Consensus
How can n unreliable machines reach agreement?
- e.g., select a master among themselves, agree on the order of events, etc.
- agree on file contents
Actually impossible in a fully asynchronous system (Fischer, Lynch & Paterson, 1985)
Paxos is one algorithm used in practice
Properties
1. Agreement = everyone learns a proposed value
2. Nontriviality = only proposed values are learned
3. Consistency = only one value is learned
4. Liveness = the value will eventually be learned
- Relies on clocks for liveness, thus not fully asynchronous (a toy sketch follows)
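To make the properties concrete, here is a minimal single-decree Paxos sketch in Python. It runs in one process with no failures, retries, or persistence; a real implementation adds messaging, durable state, and timeouts for liveness.

class Acceptor:
    def __init__(self):
        self.promised = -1           # highest proposal number promised
        self.accepted = (-1, None)   # (number, value) of last accepted proposal

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return self.accepted     # report any previously accepted value
        return None                  # reject

    def accept(self, n, v):
        # Phase 2b: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, v)
            return True
        return False

def propose(acceptors, n, value):
    """One proposer round; returns the chosen value, or None (retry with higher n)."""
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) <= len(acceptors) // 2:
        return None                  # no majority of promises
    # Consistency: if some acceptor already accepted a value, adopt the value
    # of the highest-numbered accepted proposal instead of our own.
    _, val = max(promises)
    v = val if val is not None else value
    acks = sum(a.accept(n, v) for a in acceptors)
    return v if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="node-3 is master"))  # -> node-3 is master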
Chubby
Google's lock service that helps clients reach distributed agreement
- Primary use case: coordinating the selection of a master among nodes
- Reliable
- Available
- Simple interface
Typical setup size ("Chubby cell"): ~10k machines on a ~gigabit LAN
Used by GFS and BigTable
Design space
A lock service rather than a Paxos library
- developers do not plan for high availability; when the time comes, locking is simpler to start using
- a lock-based interface is more familiar to programmers (and maybe misleading)
Read and write small files
- there's a need for a mechanism for advertising consensus results → manage small files
Coarse-grained use
- few lock acquisitions, held over longer times
- lots of readers, few writers
- fine-grained locking has different load characteristics; it could be built on top of Chubby
Service Interface
A file system interface!
Hierarchical names, e.g., /ls/foo/wombat/pouch
- significantly reduced the effort needed to write basic browsing and name space manipulation tools
- reduced the need to educate casual Chubby users
Not to be confused with an ordinary Unix file system:
- small files
- no moves, no links
- ACLs not inherited
- ephemeral files disappear when no longer used
Primary use: each file can be used as a lock and to hold the agreed-upon data (a hedged sketch follows)
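In pseudo-code, the canonical master-election pattern looks roughly like this; note that the chubby module and every call on it, as well as serve_as_master and follow, are hypothetical stand-ins, not Chubby's real API:

import chubby  # hypothetical client library, for illustration only

cell = chubby.connect("our-cell")
f = cell.open("/ls/our-cell/our-service/master")  # the file acts as the lock

if f.try_acquire():                        # exclusive (writer) lock
    f.set_contents("master=host42:8080")   # advertise the agreed-upon result
    serve_as_master()                      # hypothetical application code
else:
    master = f.get_contents()              # everyone else learns who won
    follow(master)                         # hypothetical application code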
Locking
A name can be held in exclusive (write) mode or shared (read) mode
Example:
- /foo write-locked: only the writer can read & write
- /foo read-locked: readers can read, nobody can write
Locks are not mandatory; you may read/write without holding them
- why? breaking mandatory locks is not feasible in a huge system
Locking: sequences
Consider the case where
1. A acquires L, initiates operation O1 on T
2. A fails; B acquires L before O1 reaches T
3. O1 arrives at T, and is executed outside the intended lock!
Chubby solution 1 (sketched below)
- a sequence number is associated with the lock, e.g., seq(L) = 1000
- O1 carries seq(L); T can read the current lock sequence
- if the operation's sequence matches, the lock is valid
Chubby solution 2
- for less Chubby-aware code
- prevent L from being locked for a duration if A fails
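A toy sketch of solution 1 in Python, with invented names: the server simply rejects any operation that carries a lock sequence older than the newest one it has observed.

class Server:
    """Server T validates each operation against the lock sequence it carries."""
    def __init__(self):
        self.latest_seq = 0    # newest seq(L) observed so far

    def execute(self, op, op_seq):
        if op_seq < self.latest_seq:
            raise RuntimeError(f"{op} rejected: seq {op_seq} < {self.latest_seq}")
        self.latest_seq = op_seq
        # ... perform op, knowing its lock was still valid when checked

t = Server()
t.execute("op-by-B", op_seq=1001)   # B holds L with seq(L) = 1001
t.execute("O1", op_seq=1000)        # A's delayed O1 -> raises: rejected as stale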
Cases of Use and Misuse
While Chubby was designed as a lock service, its most popular use has been as a name server.
- DNS's TTL-based caching does not scale with lots of clients across many names (this was a serious problem); Chubby's cache invalidation protocol does, since load scales with actual changes.
Misuse by uninformed(?) engineers
- without caching, a simple infinite loop around e.g. a file open could be disastrous
- initially no quotas --> larger and larger files were added... (now a 256 kB limit)
- combated by design reviews
TCP congestion control conflicted with client lease renewal, so UDP had to be used
The Google File System[2]
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
SOSP 2003
(Original slides by Alex Moshchuk)
Motivation
Google needed a good distributed file system
- Redundant storage of massive amounts of data on cheap and unreliable computers
Why not use an existing file system?
- Google's problems are different from anyone else's
- Different workload and design priorities (next slide)
GFS is designed for Google apps and workloads
Google apps are designed for GFS
Assumptions
High component failure rates
- Inexpensive commodity components fail all the time
“Modest” number of HUGE files
- Just a few million
- Each is 100 MB or larger; multi-GB files typical
Files are write-once, mostly appended to
- Perhaps concurrently
Large streaming reads
- High sustained throughput favored over low latency
GFS Design Decisions
Files stored as chunks
- Fixed size (64 MB)
Reliability through replication
- Each chunk replicated across 3+ chunkservers
Single master to coordinate access, keep metadata
- Simple centralized management
No data caching
- Little benefit due to large data sets, streaming reads
Familiar interface, but customize the API
- Simplify the problem; focus on Google apps
- Add snapshot and record append operations
Single master
From distributed systems we know this is a:
- Single point of failure
- Scalability bottleneck
GFS solutions
- Still use a single master (simplicity...)
- Shadow master
- Minimize master involvement
  - never move data through it; use it only for metadata, and cache metadata at clients
  - large chunk size
  - master delegates authority to primary replicas in data mutations (chunk leases)
Simple, and good enough!
Mutations
Mutation = write or append
- must be done for all replicas
Goal: minimize master involvement
Lease mechanism:
- master picks one replica as primary; gives it a “lease” for mutations
- primary defines a serial order of mutations
- all replicas follow this order
Data flow decoupled from control flow
Atomic record append
GFS “speciality”
- Client specifies the data
- GFS appends it to the file atomically, at least once
  - GFS picks the offset
  - works for concurrent writers
Used heavily by Google apps
- e.g., for files that serve as multiple-producer/single-consumer queues (a hedged sketch follows)
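A hedged sketch of that queue pattern, written against an invented GFS-like client API (gfs, record_append, read_records, and the record layout are illustrative, not the real interface):

import gfs  # hypothetical GFS client library, for illustration only

queue = gfs.open("/queues/crawl-output")

def produce(record):
    # Many producers may call this concurrently; GFS picks the offset and
    # writes each record atomically, at least once.
    queue.record_append(record)

def consume(process):
    # The single consumer scans forward. "At least once" permits duplicates,
    # so records carry a unique id and we de-duplicate on it.
    seen = set()
    for record in queue.read_records():
        if record.id not in seen:
            seen.add(record.id)
            process(record)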
Fault Tolerance
High availability
- fast recovery: master and chunkservers restartable in a few seconds
- chunk replication: default 3 replicas
- shadow masters
Data integrity
- checksum every 64 KB block in each chunk (sketched below)
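The per-block checksumming is easy to picture; a minimal Python sketch using CRC32 (the actual checksum algorithm and on-disk layout are not specified here):

import zlib

BLOCK = 64 * 1024  # checksum granularity within a chunk

def checksum_chunk(data: bytes) -> list:
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def block_ok(data: bytes, block_no: int, sums: list) -> bool:
    # On mismatch, serve the read from another replica and re-replicate.
    block = data[block_no * BLOCK:(block_no + 1) * BLOCK]
    return zlib.crc32(block) == sums[block_no]

chunk = b"x" * (3 * BLOCK + 17)  # a chunk slightly over 3 blocks
sums = checksum_chunk(chunk)
assert all(block_ok(chunk, i, sums) for i in range(len(sums)))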
Deployment in Google
- Many GFS clusters
- Hundreds/thousands of storage nodes each
- Managing petabytes of data
- GFS is under BigTable, etc.
Data Storage: BigTable[1]
What is BigTable?
- “Rows and columns” (i.e., a table) for storing data, sparse
- In reality, a distributed, persistent, multi-level sorted map
Full slide set: http://carfield.com.hk/document/distributed/6DeanGoogle.pdf
Video: http://code.google.com/edu/parallel/index.html#_distrib_storage
Multi-dimensional sparse sorted map
- (row, column, timestamp) => value
- Columns grouped into locality groups (a toy sketch of the map model follows)
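Stripped of distribution, persistence, and the SSTable machinery, the model is literally a map; a toy Python sketch of the lookup structure (the row and column names follow the paper's web-table example):

# BigTable's model as a plain in-memory map: (row, column, timestamp) -> value
table = {}

def put(row, col, ts, value):
    table[(row, col, ts)] = value

def lookup(row, col):
    """Return the newest value for (row, col): the highest timestamp wins."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if (r, c) == (row, col)]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", ts=5, value="<html>… (version at t=5)")
put("com.cnn.www", "contents:", ts=6, value="<html>… (version at t=6)")
print(lookup("com.cnn.www", "contents:"))  # -> the t=6 version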
BigTable Data Model[1]
Why BigTable?
Table model more powerful than just key and value
- locality groups provide optimization
- row as the unit of atomicity
- compression, esp. along the timestamp dimension
Simpler than a full DBMS
- Clients participate in optimization
- No complex transactions
- No complex queries
Fully customizable -- if you are Google ;)
Some BigTable Features[1]
- Single-row transactions: easy to do read/modify/write operations (a hedged sketch follows)
- Locality groups: segregate columns into different files
- In-memory columns: random access to small items
- Suite of compression techniques: per locality group
- Bloom filters: avoid seeks for non-existent data
- Replication: eventual-consistency replication across datacenters, between multiple BigTable serving setups (master/slave & multi-master)
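For instance, a read/modify/write counter bump might look like this against an invented client API (bigtable, row_transaction, and the column name are illustrative, not the real interface):

import bigtable  # hypothetical client library, for illustration only

t = bigtable.open("webtable")

# Atomic within a single row -- BigTable offers no cross-row transactions.
with t.row_transaction("com.cnn.www") as row:
    n = int(row.get("anchor:count") or 0)
    row.set("anchor:count", str(n + 1))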
BigTable Usage[1]
- 500+ BigTable cells
- Largest cells manage 6000+ TB of data, 3000+ machines
- Busiest cells sustain >500,000 ops/second 24 hours/day, and peak much higher
MapReduce
Based on Google EDU slides at: http://code.google.com/edu/submissions/ucsandiego-mapreduce/index.html
Motivation: Large Scale Data Processing[2]
Want to process lots of data (> 1 TB)
Want to parallelize across hundreds/thousands of CPUs
… want to make this easy
(what's the name of that principle again...?)
MapReduce[2][1]
- Automatic parallelization & distribution
- Fault-tolerant
- Provides status and monitoring tools
- Clean, simple abstraction for programmers
Programming Model
Borrows from functional programming
Users implement an interface of two functions:
- map (in_key, in_value) -> (out_key, intermediate_value) list
- reduce (out_key, intermediate_value list) -> out_value list
map
Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g., (filename, line).
map() produces one or more intermediate values along with an output key from the input.
reduce
After the map phase is over, all the intermediate values for a given output key are combined together into a list.
reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).
Example: most common words
[Diagram: word-count dataflow through map, shuffle, and reduce]
- input documents, e.g., "Hello World!", "The world is the place we …"
- map: document_id: text -> list of <word, frequency>, e.g., hello: 1, the: 2, world: 2
- shuffle: gather frequencies by word, e.g., the: [1, 2, 2]
- reduce: sum the list of frequencies, e.g., the: 1+2+2 = 5
- output, e.g., … foo: 42, the: 5, quux: 43 …
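The same example as runnable Python, staying close to the two-function interface; the shuffle step is simulated locally (this is a toy driver, not Google's implementation):

from collections import defaultdict

def map_fn(doc_id, text):
    # document_id: text -> list of <word, frequency-in-this-document>
    counts = defaultdict(int)
    for word in text.lower().split():
        counts[word.strip("!,.?")] += 1
    return list(counts.items())

def reduce_fn(word, freqs):
    # sum the list of frequencies gathered for this word
    return [(word, sum(freqs))]

docs = {1: "Hello World!", 2: "The world is the place we call home"}

grouped = defaultdict(list)          # shuffle: gather frequencies by word
for doc_id, text in docs.items():
    for word, n in map_fn(doc_id, text):
        grouped[word].append(n)

result = [out for w, fs in grouped.items() for out in reduce_fn(w, fs)]
print(sorted(result))  # [('call', 1), ..., ('the', 2), ('world', 2)]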
Parallelism
map() functions run in parallel, creating different intermediate values from different input data sets
reduce() functions also run in parallel, each working on a different output key
All values are processed independently
Bottleneck: the reduce phase can’t start until the map phase is completely finished (see the parallel sketch below)
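Because every map() call and every reduce() key is independent, the toy driver above parallelizes mechanically. A sketch using local processes (the real system distributes tasks across machines, but the structure, including the barrier between phases, is the same); it assumes map_fn and reduce_fn from the previous sketch are defined at module top level so worker processes can import them:

from collections import defaultdict
from multiprocessing import Pool

def run_wordcount(docs):
    with Pool() as pool:
        mapped = pool.starmap(map_fn, docs.items())   # map tasks in parallel
        grouped = defaultdict(list)                   # barrier: shuffle by key
        for pairs in mapped:
            for word, n in pairs:
                grouped[word].append(n)
        reduced = pool.starmap(reduce_fn, grouped.items())  # reduce in parallel
    return [out for outs in reduced for out in outs]

if __name__ == "__main__":
    print(run_wordcount({1: "Hello World!", 2: "The world is the place we call home"}))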
Fault Tolerance
Master detects worker failures
- Re-executes completed & in-progress map() tasks
- Re-executes in-progress reduce() tasks
Master notices that particular input key/values cause crashes in map(), and skips those values on re-execution
Effect: Can work around bugs in third-party libraries!
MapReduce Conclusions
- MapReduce has proven to be a useful abstraction
- Greatly simplifies large-scale computations at Google
- The functional programming paradigm can be applied to large-scale applications
- Fun to use: focus on the problem, let the library deal with the messy details
References
[1] Jeff Dean (Google Fellow), "Handling Large Datasets at Google: Current Systems and Future Directions"
[2] Google Code University: “Problem Solving on Large Scale Clusters”
[3] Burrows, M., "The Chubby lock service for loosely-coupled distributed systems", Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI), 2006
Q&A