Large-scale Web Application Development: Server Side
Tancred Lindholm, D.Sc.
tancred(a.)google.com
Contents
- Web Applications
- Serving Web Applications at Scale
  - Environment characteristics
  - Requirements and Best Practices
- Web Application Infrastructure
  - Distributed coordination: Chubby
  - Storage: GFS and BigTable
  - Processing (batch): MapReduce
Web Application Overview
Here: Web Application = an interactive web page that closely resembles a traditional application (UI-wise)
Architecture = Client/Server (it's back from the 90s...)
- Client in the user's browser
- Server in a data center
- Client-Server communication using HTTP
Client side heavily guided -- constrained by the W3C stack
- HTML/XML for structure
- CSS for styling
- JavaScript for programming
- JavaScript "libraries": XHR, DOM
- Topic for next week
Serving Web Applications at Scale
- One Server vs. a Data Center (distribution?)
- One Engineer vs. Lots of Engineers (usability?)
Large-scale Computing
Internet-scale computing:
- Millions of users
- 24/7 availability: no maintenance breaks!
- Petabytes of data (1 PB = 60 years of video)
- Thousands of CPUs
- Although efficiently used, still lots of energy
- Cost of software inefficiencies can be huge
Google's Hardware Philosophy[1]
Truckloads of low-cost machines
What does this imply?
- Software must tolerate failure
- Which particular machine an application runs on should not matter
- No special machines, just 2 or 3 flavors
Truckloads of engineers... :)
An isolated engineer writes his own system, in his own noncanonical way, in his own favorite fringe language
... and it works great!
A group of 1000 engineers does the same
... and there's no end to the confusion!
As an engineer you want APIs that use familiar concepts and hold no surprises
- especially when you use, on average, 1 new API per day
Survival Strategies
- KISS
- Knowledge and Quality
- Data-driven engineering
- Monitoring
- Managing continuous evolution
- Did I already mention KISS?
KISS
KISS = Keep It Simple, Stupid!
Simplicity is built by introducing good interfaces to (complex) functionality
- a consistent, minimal metaphor for the functionality
- what is the problem that the interface solves?
- be careful of layers that add nothing
- iteration with real use helps narrow the design down
What is simple?
- usually your peers know; you as the implementor do not
- building functionality x makes you too much of an expert, and thus unsuited to designing its interface?
Knowledge and Quality
Writing optimally performing software often requires cross-layer understanding
- engineers that know the stack, not just a slice
A culture of engineers passionate about software
- debating, code reviews, taking things apart, testing, caring about the “right thing” exposes code to much analysis
- “Given enough eyeballs, all bugs are shallow” (Linus’ Law)
Data-Driven Engineering
Theory and practice may clash at large scales
- in theory, there is no need to handle the most common path separately
  - in practice, it usually makes sense
- in theory, use the best big-O algorithm
  - in practice, simpler & slightly slower may win!
Know your input data!
- Size, extreme values: while the median is 1k, what if you hit that 1G item?
- Quality of data: your superfast HTML parser is useless if it requires valid input and the corpus is 90% invalid “street HTML”!
Data-Driven Engineering
“When in doubt, mapreduce”
Large-scale computing judo: use the large capacity to quickly get data on a large number of inputs
Micro-benchmarking
- Prototype, feed in realistic data, identify hotspots early (a minimal sketch follows below)
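As a minimal sketch of the idea in Python (parse_record and the synthetic sample are invented for illustration): time the suspected hotspot in isolation, then profile the whole prototype with realistic data.

import cProfile
import timeit

def parse_record(line):
    # Candidate hotspot: naive field splitting of one input record.
    return line.strip().split("\t")

# Feed realistic data: include the extremes, not just the median case.
sample = ["key\t" + "x" * n for n in (10, 1_000, 1_000_000)] * 100

# Micro-benchmark the suspected hotspot in isolation...
print(timeit.timeit(lambda: [parse_record(l) for l in sample], number=10))

# ...then profile the whole prototype to find the hotspots you did not expect.
cProfile.run("[parse_record(l) for l in sample]")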
Monitoring
Your code is up and running, all is well…
… or is it?
- is the code acceptably fast?
- what is the failure rate?
  - redundancy sometimes masks excessive failures at the cost of performance
- what is the resource use -- resources cost!
Gathering stats about code
- helps isolate problems
- provides data for the next steps (see the counter sketch below)
Measuring can be challenging
- how can property x be measured?
- are you really measuring the right thing?
Needed: the mindset of an empirical scientist!
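One low-tech way to gather such stats is to export named counters from the serving code and let a collector scrape them; a minimal thread-safe sketch in Python (all names, including do_work, are invented for illustration):

import threading
import time
from collections import Counter

class Stats:
    """Named counters exported by serving code; a collector scrapes snapshot()."""
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def incr(self, name, delta=1):
        with self._lock:
            self._counts[name] += delta

    def snapshot(self):
        with self._lock:
            return dict(self._counts)

STATS = Stats()

def do_work(request):
    ...  # the actual serving logic (stub)

def handle_request(request):
    STATS.incr("requests")
    start = time.monotonic()
    try:
        do_work(request)
    except Exception:
        STATS.incr("failures")  # failure rate = failures / requests
        raise
    finally:
        STATS.incr("latency_ms_total", int((time.monotonic() - start) * 1000))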
Continuous Evolution
Current system up and running 24/7, user happiness at level x
Challenge: deploy feature y to yield happiness level x+1 without degraded user experience in between
Nowadays software is continuously evolving
- Seems especially true of cloud software
- Meaningful question: What version of Office are you using?
- Meaningless question: What version of Google Docs are you using?!
Some strategies to follow…
Continuous Evolution Strategies
Be incremental, i.e., deploy in small steps
- even the most trivial step can be very challenging due to configurations, security, volume, data quality, ...
Provide a fallback
Don't fix what ain't broken
KISS
Also, you should remember KISS
Things to look out for:
- Abstractions that don't solve a problem
- Are you using concepts you made up yourself?
- "This is a special case, so the rules do not apply..."
The Chubby Lock Service for Loosely-Coupled Distributed Systems
Publication: "The Chubby lock service for loosely-coupled distributed systems", Mike Burrows, Google Inc. [3]
Chubby Outline
- Distributed consensus
- Chubby
- Design space
- Service operations
- Locking
- Cases of use and misuse
Distributed Consensus
How can n unreliable machines reach agreement?
- e.g., select a master among themselves, agree on the order of events, etc.
- agree on file contents
Actually impossible in a fully asynchronous system (Fischer, Lynch & Paterson, 1985)
Paxos is one algorithm used in practice
Properties
1. Agreement = everyone learns a proposed value
2. Nontriviality = only proposed values are learned
3. Consistency = only one value is learned
4. Liveness = the value will eventually be learned
- Relies on clocks for liveness, thus not fully asynchronous (a toy sketch follows)
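To make the properties concrete, here is a minimal single-decree Paxos sketch in Python. It runs in one process with no failures, retries, or persistence; a real implementation adds messaging, durable state, and timeouts for liveness.

class Acceptor:
    def __init__(self):
        self.promised = -1           # highest proposal number promised
        self.accepted = (-1, None)   # (number, value) of last accepted proposal

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return self.accepted     # report any previously accepted value
        return None                  # reject

    def accept(self, n, v):
        # Phase 2b: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, v)
            return True
        return False

def propose(acceptors, n, value):
    """One proposer round; returns the chosen value, or None (retry with higher n)."""
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) <= len(acceptors) // 2:
        return None                  # no majority of promises
    # Consistency: if some acceptor already accepted a value, adopt the value
    # of the highest-numbered accepted proposal instead of our own.
    _, val = max(promises)
    v = val if val is not None else value
    acks = sum(a.accept(n, v) for a in acceptors)
    return v if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="node-3 is master"))  # -> node-3 is master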
Chubby
Google's lock service that helps clients reach distributed agreement
- Primary use case: coordinating the selection of a master among nodes
- Reliable
- Available
- Simple interface
Typical setup size ("Chubby cell"): ~10k machines on a ~gigabit LAN
Used by GFS and BigTable
Design space
A lock service rather than a Paxos library
- developers do not plan for high availability; when the time comes, locking is simpler to start using
- a lock-based interface is more familiar to programmers (and maybe misleading)
Read and write small files
- there's a need for a mechanism for advertising consensus results → manage small files
Coarse-grained use
- few lock acquisitions, held over longer times
- lots of readers, few writers
- fine-grained locking has different load characteristics; it could be built on top of Chubby
Service Interface
A file system interface!
Hierarchical names, e.g., /ls/foo/wombat/pouch
- significantly reduced the effort needed to write basic browsing and name space manipulation tools
- reduced the need to educate casual Chubby users
Not to be confused with an ordinary Unix file system:
- small files
- no moves, no links
- ACLs not inherited
- ephemeral files disappear when no longer used
Primary use: each file can be used as a lock and to hold the agreed-upon data (a hedged sketch follows)
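In pseudo-code, the canonical master-election pattern looks roughly like this; note that the chubby module and every call on it, as well as serve_as_master and follow, are hypothetical stand-ins, not Chubby's real API:

import chubby  # hypothetical client library, for illustration only

cell = chubby.connect("our-cell")
f = cell.open("/ls/our-cell/our-service/master")  # the file acts as the lock

if f.try_acquire():                        # exclusive (writer) lock
    f.set_contents("master=host42:8080")   # advertise the agreed-upon result
    serve_as_master()                      # hypothetical application code
else:
    master = f.get_contents()              # everyone else learns who won
    follow(master)                         # hypothetical application code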
Locking
A name can be held in exclusive (write) mode or shared (read) mode
Example:
- /foo write-locked: only the writer can read & write
- /foo read-locked: readers can read, nobody can write
Locks are not mandatory; you may read/write without holding them
- why? breaking mandatory locks is not feasible in a huge system
Locking: sequences
Consider the case where
1. A acquires L, initiates operation O1 on T
2. A fails; B acquires L before O1 reaches T
3. O1 arrives at T, and is executed outside the intended lock!
Chubby solution 1 (sketched below)
- a sequence number is associated with the lock, e.g., seq(L) = 1000
- O1 carries seq(L); T can read the current lock sequence
- if the operation's sequence matches, the lock is valid
Chubby solution 2
- for less Chubby-aware code
- prevent L from being locked for a duration if A fails
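A toy sketch of solution 1 in Python, with invented names: the server simply rejects any operation that carries a lock sequence older than the newest one it has observed.

class Server:
    """Server T validates each operation against the lock sequence it carries."""
    def __init__(self):
        self.latest_seq = 0    # newest seq(L) observed so far

    def execute(self, op, op_seq):
        if op_seq < self.latest_seq:
            raise RuntimeError(f"{op} rejected: seq {op_seq} < {self.latest_seq}")
        self.latest_seq = op_seq
        # ... perform op, knowing its lock was still valid when checked

t = Server()
t.execute("op-by-B", op_seq=1001)   # B holds L with seq(L) = 1001
t.execute("O1", op_seq=1000)        # A's delayed O1 -> raises: rejected as stale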
Cases of Use and Misuse
While Chubby was designed as a lock service, its most popular use has been as a name server.
- DNS's TTL-based caching does not scale with lots of clients across many names (this was a serious problem); Chubby's cache invalidation protocol does, since load scales with actual changes.
Misuse by uninformed(?) engineers
- without caching, a simple infinite loop around e.g. a file open could be disastrous
- initially no quotas --> larger and larger files were added... (now a 256 kB limit)
- combated by design reviews
TCP congestion control conflicted with client lease renewal, so UDP had to be used
The Google File System[2]
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
SOSP 2003
(Original slides by Alex Moshchuk)
Motivation
Google needed a good distributed file system
- Redundant storage of massive amounts of data on cheap and unreliable computers
Why not use an existing file system?
- Google's problems are different from anyone else's
- Different workload and design priorities (next slide)
GFS is designed for Google apps and workloads
Google apps are designed for GFS
Assumptions
High component failure rates
- Inexpensive commodity components fail all the time
“Modest” number of HUGE files
- Just a few million
- Each is 100 MB or larger; multi-GB files typical
Files are write-once, mostly appended to
- Perhaps concurrently
Large streaming reads
- High sustained throughput favored over low latency
GFS Design Decisions
Files stored as chunks
- Fixed size (64 MB)
Reliability through replication
- Each chunk replicated across 3+ chunkservers
Single master to coordinate access, keep metadata
- Simple centralized management
No data caching
- Little benefit due to large data sets, streaming reads
Familiar interface, but customize the API
- Simplify the problem; focus on Google apps
- Add snapshot and record append operations
Single master
From distributed systems we know this is a:
- Single point of failure
- Scalability bottleneck
GFS solutions
- Still use a single master (simplicity...)
- Shadow master
- Minimize master involvement
  - never move data through it; use it only for metadata, and cache metadata at clients
  - large chunk size
  - master delegates authority to primary replicas in data mutations (chunk leases)
Simple, and good enough!
Mutations
Mutation = write or append
- must be done for all replicas
Goal: minimize master involvement
Lease mechanism:
- master picks one replica as primary; gives it a “lease” for mutations
- primary defines a serial order of mutations
- all replicas follow this order
Data flow decoupled from control flow
Atomic record append
GFS “speciality”
- Client specifies the data
- GFS appends it to the file atomically, at least once
  - GFS picks the offset
  - works for concurrent writers
Used heavily by Google apps
- e.g., for files that serve as multiple-producer/single-consumer queues (a hedged sketch follows)
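A hedged sketch of that queue pattern, written against an invented GFS-like client API (gfs, record_append, read_records, and the record layout are illustrative, not the real interface):

import gfs  # hypothetical GFS client library, for illustration only

queue = gfs.open("/queues/crawl-output")

def produce(record):
    # Many producers may call this concurrently; GFS picks the offset and
    # writes each record atomically, at least once.
    queue.record_append(record)

def consume(process):
    # The single consumer scans forward. "At least once" permits duplicates,
    # so records carry a unique id and we de-duplicate on it.
    seen = set()
    for record in queue.read_records():
        if record.id not in seen:
            seen.add(record.id)
            process(record)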
Fault Tolerance
High availability
- fast recovery: master and chunkservers restartable in a few seconds
- chunk replication: default 3 replicas
- shadow masters
Data integrity
- checksum every 64 KB block in each chunk (sketched below)
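The per-block checksumming is easy to picture; a minimal Python sketch using CRC32 (the actual checksum algorithm and on-disk layout are not specified here):

import zlib

BLOCK = 64 * 1024  # checksum granularity within a chunk

def checksum_chunk(data: bytes) -> list:
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def block_ok(data: bytes, block_no: int, sums: list) -> bool:
    # On mismatch, serve the read from another replica and re-replicate.
    block = data[block_no * BLOCK:(block_no + 1) * BLOCK]
    return zlib.crc32(block) == sums[block_no]

chunk = b"x" * (3 * BLOCK + 17)  # a chunk slightly over 3 blocks
sums = checksum_chunk(chunk)
assert all(block_ok(chunk, i, sums) for i in range(len(sums)))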
Deployment in Google
- Many GFS clusters
- Hundreds/thousands of storage nodes each
- Managing petabytes of data
- GFS is under BigTable, etc.
Data Storage: BigTable[1]
What is BigTable?
- “Rows and columns” (i.e., a table) for storing data, sparse
- In reality, a distributed, persistent, multi-level sorted map
Full slide set: http://carfield.com.hk/document/distributed/6DeanGoogle.pdf
Video: http://code.google.com/edu/parallel/index.html#_distrib_storage
Multi-dimensional sparse sorted map
- (row, column, timestamp) => value
- Columns grouped into locality groups (a toy sketch of the map model follows)
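Stripped of distribution, persistence, and the SSTable machinery, the model is literally a map; a toy Python sketch of the lookup structure (the row and column names follow the paper's web-table example):

# BigTable's model as a plain in-memory map: (row, column, timestamp) -> value
table = {}

def put(row, col, ts, value):
    table[(row, col, ts)] = value

def lookup(row, col):
    """Return the newest value for (row, col): the highest timestamp wins."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if (r, c) == (row, col)]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", ts=5, value="<html>… (version at t=5)")
put("com.cnn.www", "contents:", ts=6, value="<html>… (version at t=6)")
print(lookup("com.cnn.www", "contents:"))  # -> the t=6 version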
BigTable Data Model[1]
Why BigTable?
Table model more powerful than just key and value
- locality groups provide optimization
- row as the unit of atomicity
- compression, esp. along the timestamp dimension
Simpler than a full DBMS
- Clients participate in optimization
- No complex transactions
- No complex queries
Fully customizable -- if you are Google ;)
Some BigTable Features[1]
- Single-row transactions: easy to do read/modify/write operations (a hedged sketch follows)
- Locality groups: segregate columns into different files
- In-memory columns: random access to small items
- Suite of compression techniques: per locality group
- Bloom filters: avoid seeks for non-existent data
- Replication: eventual-consistency replication across datacenters, between multiple BigTable serving setups (master/slave & multi-master)
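For instance, a read/modify/write counter bump might look like this against an invented client API (bigtable, row_transaction, and the column name are illustrative, not the real interface):

import bigtable  # hypothetical client library, for illustration only

t = bigtable.open("webtable")

# Atomic within a single row -- BigTable offers no cross-row transactions.
with t.row_transaction("com.cnn.www") as row:
    n = int(row.get("anchor:count") or 0)
    row.set("anchor:count", str(n + 1))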
BigTable Usage[1]
- 500+ BigTable cells
- Largest cells manage 6000+ TB of data, 3000+ machines
- Busiest cells sustain >500,000 ops/second 24 hours/day, and peak much higher
MapReduce
Based on Google EDU slides at: http://code.google.com/edu/submissions/ucsandiego-mapreduce/index.html
Motivation: Large Scale Data Processing[2]
Want to process lots of data (> 1 TB)
Want to parallelize across hundreds/thousands of CPUs
… want to make this easy
(what's the name of that principle again...?)
MapReduce[2][1]
- Automatic parallelization & distribution
- Fault-tolerant
- Provides status and monitoring tools
- Clean, simple abstraction for programmers
Programming Model
Borrows from functional programming
Users implement an interface of two functions:
- map (in_key, in_value) -> (out_key, intermediate_value) list
- reduce (out_key, intermediate_value list) -> out_value list
map
Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g., (filename, line).
map() produces one or more intermediate values along with an output key from the input.
reduce
After the map phase is over, all the intermediate values for a given output key are combined together into a list.
reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).
Example: most common words
[Diagram: word-count dataflow through map, shuffle, and reduce]
- input documents, e.g., "Hello World!", "The world is the place we …"
- map: document_id: text -> list of <word, frequency>, e.g., hello: 1, the: 2, world: 2
- shuffle: gather frequencies by word, e.g., the: [1, 2, 2]
- reduce: sum the list of frequencies, e.g., the: 1+2+2 = 5
- output, e.g., … foo: 42, the: 5, quux: 43 …
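The same example as runnable Python, staying close to the two-function interface; the shuffle step is simulated locally (this is a toy driver, not Google's implementation):

from collections import defaultdict

def map_fn(doc_id, text):
    # document_id: text -> list of <word, frequency-in-this-document>
    counts = defaultdict(int)
    for word in text.lower().split():
        counts[word.strip("!,.?")] += 1
    return list(counts.items())

def reduce_fn(word, freqs):
    # sum the list of frequencies gathered for this word
    return [(word, sum(freqs))]

docs = {1: "Hello World!", 2: "The world is the place we call home"}

grouped = defaultdict(list)          # shuffle: gather frequencies by word
for doc_id, text in docs.items():
    for word, n in map_fn(doc_id, text):
        grouped[word].append(n)

result = [out for w, fs in grouped.items() for out in reduce_fn(w, fs)]
print(sorted(result))  # [('call', 1), ..., ('the', 2), ('world', 2)]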
Parallelism
map() functions run in parallel, creating different intermediate values from different input data sets
reduce() functions also run in parallel, each working on a different output key
All values are processed independently
Bottleneck: the reduce phase can’t start until the map phase is completely finished (see the parallel sketch below)
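Because every map() call and every reduce() key is independent, the toy driver above parallelizes mechanically. A sketch using local processes (the real system distributes tasks across machines, but the structure, including the barrier between phases, is the same); it assumes map_fn and reduce_fn from the previous sketch are defined at module top level so worker processes can import them:

from collections import defaultdict
from multiprocessing import Pool

def run_wordcount(docs):
    with Pool() as pool:
        mapped = pool.starmap(map_fn, docs.items())   # map tasks in parallel
        grouped = defaultdict(list)                   # barrier: shuffle by key
        for pairs in mapped:
            for word, n in pairs:
                grouped[word].append(n)
        reduced = pool.starmap(reduce_fn, grouped.items())  # reduce in parallel
    return [out for outs in reduced for out in outs]

if __name__ == "__main__":
    print(run_wordcount({1: "Hello World!", 2: "The world is the place we call home"}))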
Fault Tolerance
Master detects worker failures
- Re-executes completed & in-progress map() tasks
- Re-executes in-progress reduce() tasks
Master notices that particular input key/values cause crashes in map(), and skips those values on re-execution
Effect: Can work around bugs in third-party libraries!
MapReduce Conclusions
- MapReduce has proven to be a useful abstraction
- Greatly simplifies large-scale computations at Google
- The functional programming paradigm can be applied to large-scale applications
- Fun to use: focus on the problem, let the library deal with the messy details
References
[1] Jeff Dean (Google Fellow), "Handling Large Datasets at Google: Current Systems and Future Directions"
[2] Google Code University: “Problem Solving on Large Scale Clusters”
[3] Burrows, M., "The Chubby lock service for loosely-coupled distributed systems", Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI), 2006
Q&A