The Google File System

Upload: keola

Post on 20-Jan-2016



TRANSCRIPT

  • The Google File System

  • Design considerations
    - Built from cheap commodity hardware
    - Expect large files: 100 MB to many GB
    - Support large streaming reads and small random reads
    - Support large, sequential file appends
    - Support producer-consumer queues for many-way merging and file atomicity
    - Sustain high bandwidth by writing data in bulk

  • Interface
    - Resembles a standard file system: hierarchical directories and pathnames
    - Usual operations: create, delete, open, close, read, and write
    - Additionally:
      - Snapshot: copy a file or directory tree
      - Record append: multiple clients can append data to the same file concurrently
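The operations listed above can be summarized as an interface. The signatures below are purely illustrative (GFS's real client interface is a userspace library, and these Python names are assumptions, not its API); they show how record append differs from an ordinary write: the client supplies only data, and the system chooses and returns the offset.

```python
# Hypothetical signatures for the client-visible operations named above.
# Names and types are illustrative assumptions, not GFS's actual API.

from abc import ABC, abstractmethod

class GFSClient(ABC):
    @abstractmethod
    def create(self, path: str) -> None: ...
    @abstractmethod
    def delete(self, path: str) -> None: ...
    @abstractmethod
    def open(self, path: str) -> int: ...
    @abstractmethod
    def close(self, handle: int) -> None: ...
    @abstractmethod
    def read(self, handle: int, offset: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, handle: int, offset: int, data: bytes) -> None: ...
    # Beyond the usual operations:
    @abstractmethod
    def snapshot(self, src: str, dst: str) -> None:
        """Copy a file or directory tree."""
    @abstractmethod
    def record_append(self, handle: int, data: bytes) -> int:
        """Append atomically at least once; GFS picks and returns the offset."""
```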

  • Architecture (diagram: clients, a single master, and multiple chunkservers)

  • Chunk size
    - 64 MB, much larger than typical file system block sizes
    - Advantages of a large chunk size:
      - Reduces interaction between client and master: a client can perform many operations on a given chunk
      - Reduces network overhead by keeping a persistent TCP connection
      - Reduces the size of metadata stored on the master, so the metadata can reside in memory
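A quick back-of-the-envelope calculation shows why the large chunk size lets the metadata fit in memory. The 64 bytes of metadata per chunk is an assumed round figure (the GFS paper cites less than 64 bytes per 64 MB chunk):

```python
# Why 64 MB chunks keep the master's metadata in RAM.
# Assumes ~64 bytes of metadata per chunk (an approximation).

CHUNK_SIZE = 64 * 2**20          # 64 MB
META_PER_CHUNK = 64              # bytes per chunk (assumption)

def master_metadata_bytes(total_data_bytes: int) -> int:
    """Metadata the master must hold for a given amount of file data."""
    chunks = -(-total_data_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * META_PER_CHUNK

# 1 PiB of file data needs only ~1 GiB of chunk metadata:
print(master_metadata_bytes(2**50) / 2**30, "GiB of metadata")
```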

  • Metadata (1)
    - Three major types are stored:
      - Namespaces (file and chunk identifiers)
      - Mapping from files to chunks
      - Location of each chunk's replicas
    - In-memory data structures:
      - Metadata is stored in memory
      - Periodically scanning the entire state is easy and efficient

  • Metadata (2)
    - Chunk locations:
      - The master does not keep a persistent record of chunk locations
      - Instead, it simply polls chunkservers at startup and periodically thereafter (heartbeat messages)
      - Because chunkservers fail, it is hard to keep a persistent record of chunk locations
    - Operation log:
      - The master maintains a historical record of critical metadata changes (namespace and mapping)
      - For reliability and consistency, the operation log is replicated on multiple remote machines
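The two metadata slides above can be condensed into one data-model sketch. The class and field names are hypothetical, not Google's code; the point is which state is persisted (namespace and file-to-chunk mapping, via the operation log) versus rebuilt from heartbeats (chunk locations):

```python
# Minimal sketch of the master's in-memory metadata (hypothetical types).
# Chunk locations are deliberately NOT persisted: they are rebuilt from
# chunkserver heartbeats, while namespace/mapping changes go to the op log.

from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    # file pathname -> ordered list of chunk IDs (persisted via operation log)
    file_to_chunks: dict = field(default_factory=dict)
    # chunk ID -> set of chunkserver addresses (volatile; from heartbeats)
    chunk_locations: dict = field(default_factory=dict)
    op_log: list = field(default_factory=list)

    def create_file(self, path: str) -> None:
        self.op_log.append(f"create {path}")   # logged before applying
        self.file_to_chunks[path] = []

    def heartbeat(self, server: str, chunks: list) -> None:
        # Rebuild chunk locations from what the chunkserver reports.
        for cid in chunks:
            self.chunk_locations.setdefault(cid, set()).add(server)

m = MasterMetadata()
m.create_file("/logs/web-00")
m.heartbeat("cs1:7000", [42])
m.heartbeat("cs2:7000", [42])
print(sorted(m.chunk_locations[42]))   # ['cs1:7000', 'cs2:7000']
```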

  • Client operations: Write (1)
    - Some chunkserver is primary for each chunk
      - The master grants a lease to the primary (typically for 60 sec.)
      - Leases are renewed using periodic heartbeat messages between master and chunkservers
    - The client asks the master for the primary and secondary replicas of each chunk
    - The client sends data to the replicas in a daisy chain
      - Pipelined: each replica forwards data as it receives it
      - Takes advantage of full-duplex Ethernet links


  • Client operations: Write (2) (diagram)

  • Client operations: Write (3)
    - All replicas acknowledge the data write to the client
    - The client sends a write request to the primary
    - The primary assigns a serial number to the write request, providing ordering
    - The primary forwards the write request with the same serial number to the secondaries
    - The secondaries all reply to the primary after completing the write
    - The primary replies to the client
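The ordering step above is the crux: the primary's serial numbers impose a single order on concurrent writes, and every replica applies writes in that order. A minimal sketch (class and method names are assumptions, not GFS source):

```python
# Sketch of primary-assigned serial ordering. The primary stamps each write,
# forwards the same serial number to secondaries, and replies to the client
# only after all secondaries acknowledge, so all replicas end up identical.

import itertools

class Secondary:
    def __init__(self):
        self.applied = []
    def apply(self, sn: int, data: bytes) -> bool:
        self.applied.append((sn, data))     # apply in the primary's order
        return True                         # acknowledge to the primary

class Primary:
    def __init__(self, secondaries):
        self.serial = itertools.count(1)
        self.secondaries = secondaries
        self.applied = []

    def write(self, data: bytes) -> int:
        sn = next(self.serial)              # the total order is decided here
        self.applied.append((sn, data))     # apply locally
        acks = [s.apply(sn, data) for s in self.secondaries]
        assert all(acks)                    # reply to client only after all ack
        return sn

s1, s2 = Secondary(), Secondary()
p = Primary([s1, s2])
p.write(b"a"); p.write(b"b")
print(s1.applied == s2.applied == p.applied)  # True: identical order everywhere
```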


  • Client interaction with servers
    - Issues control (metadata) requests to the master server
    - Issues data requests directly to chunkservers
    - Caches metadata, but does no caching of data
      - No consistency difficulties among clients
      - Streaming reads (read once) and append writes (write once) don't benefit much from caching at the client


  • Decoupling
    - Data flow and control flow are separated
    - Use the network efficiently: pipeline data linearly along a chain of chunkservers
    - Goals:
      - Fully utilize each machine's network bandwidth
      - Avoid network bottlenecks and high-latency links
    - The pipeline is carefully chosen, assuming distances can be estimated from IP addresses
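Pipelining makes the transfer time roughly additive rather than multiplicative in the number of replicas: pushing B bytes through a chain of R replicas takes about B/T + R·L, where T is link throughput and L is per-hop latency, because each replica forwards bytes as soon as it receives them. The default values below follow the GFS paper's example (100 Mbps links, ~1 ms hop latency):

```python
# Ideal pipelined transfer time: B/T + R*L.
# T defaults to 100 Mbps (in bytes/sec), L to 1 ms per hop.

def pipelined_transfer_secs(B, R, T=100e6 / 8, L=1e-3):
    """Time to push B bytes through a chain of R replicas."""
    return B / T + R * L

# 1 MB to 3 replicas over 100 Mbps links: about 83 ms, dominated by B/T.
print(round(pipelined_transfer_secs(1_000_000, 3), 3))
```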


  • Atomic record appends
    - GFS appends data to a file at least once atomically (as a single continuous sequence of bytes)
    - Extra logic implements this behavior:
      - The primary checks whether appending the data would exceed the current chunk boundary
      - If so, it pads the chunk out to the chunk size and tells the client to retry on the next chunk
      - When the data fits in the chunk, the primary writes it and tells the replicas to write at the exact same offset
    - The primary keeps the replicas in sync with itself
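The primary's boundary check above reduces to a small decision function. This is a sketch under assumed names (the real logic also handles retries and replica failures, which produce the "at least once" semantics):

```python
# Sketch of the primary's record-append decision: if the record would cross
# the 64 MB chunk boundary, pad the chunk and make the client retry on the
# next chunk; otherwise pick the offset all replicas must write at.

CHUNK_SIZE = 64 * 2**20

def record_append(chunk_used: int, record_len: int):
    """Return ('retry', padded_size) or ('write', offset)."""
    if chunk_used + record_len > CHUNK_SIZE:
        return ("retry", CHUNK_SIZE)     # pad to full size; new chunk next time
    return ("write", chunk_used)         # every replica writes at this offset

print(record_append(10, 100))                   # ('write', 10)
print(record_append(CHUNK_SIZE - 50, 100))      # ('retry', 67108864)
```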

  • Master operations: Logging
    - The master has all metadata information; lose it, and you've lost the file system!
    - The master logs all client requests to disk sequentially
    - It replicates log entries to remote backup servers
    - It replies to a client only after the log entries are safe on disk on itself and the backups
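The commit rule above is essentially write-ahead logging for metadata: acknowledge a mutation only once its log record is durable locally and on every backup. A minimal in-memory sketch (class names are assumptions; real durability would involve fsync and the network):

```python
# Sketch of the master's commit rule: log locally, replicate to backups,
# and only then report success to the client.

class Backup:
    def __init__(self):
        self.log = []
    def replicate(self, record: str) -> bool:
        self.log.append(record)          # stand-in for a durable remote write
        return True

class OpLog:
    def __init__(self, backups):
        self.local, self.backups = [], backups

    def commit(self, record: str) -> bool:
        self.local.append(record)        # sequential local log (write-ahead)
        ok = all(b.replicate(record) for b in self.backups)
        return ok                        # reply to the client only if True

b1, b2 = Backup(), Backup()
log = OpLog([b1, b2])
print(log.commit("create /logs/x"))      # True
print(b1.log == b2.log == log.local)     # True: every replica has the record
```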


  • Master operations: What if the master reboots?
    - Replays the log from disk
      - Recovers namespace (directory) information
      - Recovers the file-to-chunk-ID mapping
    - Asks chunkservers which chunks they hold
      - Recovers the chunk-ID-to-chunkserver mapping
    - If a chunkserver has an older chunk version, the replica is stale
      - The chunkserver was down at lease renewal
    - If a chunkserver has a newer chunk version, the master adopts its version number
      - The master may have failed while granting the lease
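The version-number comparison in the last two bullets can be sketched as a single reconciliation function (names and representation are assumptions, not GFS source):

```python
# Sketch of stale-replica detection on master restart: compare each
# chunkserver's reported chunk version with the master's record.
# Older -> stale (the server missed a lease grant and its mutations);
# newer -> the master failed after granting a lease, so adopt the version.

def reconcile_version(master_versions: dict, chunk_id: int, reported: int) -> str:
    known = master_versions.get(chunk_id, 0)
    if reported < known:
        return "stale"                        # replica missed a mutation
    if reported > known:
        master_versions[chunk_id] = reported  # master crashed mid-grant
        return "adopted"
    return "current"

versions = {7: 3}
print(reconcile_version(versions, 7, 2))   # stale
print(reconcile_version(versions, 7, 4))   # adopted
print(versions[7])                         # 4
```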


  • Master operations: Where to put a chunk
    - Prefer chunkservers with below-average disk space utilization
    - Limit the number of recent creations on each chunkserver, to avoid heavy write traffic
    - Spread chunks across multiple racks
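The three placement criteria above compose naturally into one heuristic. The data model below (per-server `util` and `recent_creates` fields, a flat rack label) is an assumption for illustration, not GFS's actual bookkeeping:

```python
# Sketch of the placement heuristic: prefer below-average disk utilization,
# cap recent creations, and never put two replicas on the same rack.

def place_replicas(servers, n=3, max_recent=5):
    """servers: list of dicts with 'name', 'rack', 'util', 'recent_creates'."""
    avg_util = sum(s["util"] for s in servers) / len(servers)
    candidates = sorted(
        (s for s in servers
         if s["util"] <= avg_util and s["recent_creates"] < max_recent),
        key=lambda s: s["util"])
    chosen, racks = [], set()
    for s in candidates:
        if s["rack"] not in racks:       # spread replicas across racks
            chosen.append(s["name"])
            racks.add(s["rack"])
        if len(chosen) == n:
            break
    return chosen

servers = [
    {"name": "cs1", "rack": "r1", "util": 0.20, "recent_creates": 0},
    {"name": "cs2", "rack": "r2", "util": 0.25, "recent_creates": 0},
    {"name": "cs3", "rack": "r3", "util": 0.30, "recent_creates": 0},
    {"name": "cs4", "rack": "r1", "util": 0.90, "recent_creates": 0},
    {"name": "cs5", "rack": "r2", "util": 0.10, "recent_creates": 9},  # too hot
]
print(place_replicas(servers))   # ['cs1', 'cs2', 'cs3']
```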


  • Master operations: Re-replication and rebalancing
    - Re-replication occurs when the number of available chunk replicas falls below a user-defined limit
    - When can this occur?
      - A chunkserver becomes unavailable
      - Corrupted data in a chunk
      - A disk error
      - An increased replication limit
    - Rebalancing is done periodically by the master
      - The master examines the current replica distribution and moves replicas to even out disk space usage
      - Its gradual nature avoids swamping a chunkserver
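Detecting which chunks need re-replication reduces to comparing live replica counts against the per-chunk goal; ordering the work by deficit (most-deficient first) is one reasonable prioritization, sketched here under an assumed representation:

```python
# Sketch: find chunks whose live replica count fell below the replication
# goal, most-deficient first -- a plausible order for the master's repair work.

def rereplication_queue(replicas: dict, goal: dict, live: set):
    """replicas: chunk -> list of servers; goal: chunk -> target count."""
    deficits = {}
    for cid, servers in replicas.items():
        alive = sum(1 for s in servers if s in live)
        if alive < goal.get(cid, 3):          # default goal: 3 replicas
            deficits[cid] = goal.get(cid, 3) - alive
    return sorted(deficits, key=deficits.get, reverse=True)

replicas = {1: ["a", "b", "c"], 2: ["a", "b"], 3: ["c"]}
live = {"a", "c"}                       # chunkserver "b" just died
print(rereplication_queue(replicas, {}, live))   # [2, 3, 1]: 2 and 3 most urgent
```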


  • Garbage collection
    - Lazy: update the master log immediately, but do not reclaim resources for a while (the lazy part)
    - A deleted file is first renamed to a hidden name, not removed
    - Periodic scans remove renamed files older than a few days (user-defined)
    - Orphans: chunks that exist on chunkservers but for which the master has no metadata
      - A chunkserver can freely delete these
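The rename-then-reclaim scheme can be sketched in a few lines (hypothetical names; the hidden-name convention and the three-day retention below are stand-ins for the user-configurable values):

```python
# Sketch of lazy deletion: "delete" just renames the file to a hidden name
# and records a timestamp; a periodic scan reclaims anything older than the
# retention window.

import time

RETENTION_SECS = 3 * 24 * 3600        # "a few days" (user-configurable)

class Namespace:
    def __init__(self):
        self.files = {}               # path -> deletion timestamp (None = live)

    def create(self, path):
        self.files[path] = None

    def delete(self, path):
        # Rename to a hidden name instead of reclaiming immediately.
        self.files[f".deleted.{path}"] = time.time()
        del self.files[path]

    def gc_scan(self, now):
        # Periodic scan: reclaim hidden files past the retention window.
        for path, ts in list(self.files.items()):
            if ts is not None and now - ts > RETENTION_SECS:
                del self.files[path]

ns = Namespace()
ns.create("/logs/x")
ns.delete("/logs/x")
ns.gc_scan(now=time.time())                   # too young: still recoverable
print(".deleted./logs/x" in ns.files)         # True
ns.gc_scan(now=time.time() + 4 * 24 * 3600)   # past retention: reclaimed
print(ns.files)                               # {}
```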


  • Fault tolerance
    - Fast recovery: the master and chunkservers are designed to restore their state and restart in seconds
    - Chunk replication: each chunk is replicated on multiple chunkservers on different racks
      - The replication factor can be adjusted per user demand for reliability
    - Master replication: the operation log (the historical record of critical metadata changes) is replicated on multiple machines

  • Conclusion
    - GFS is a distributed file system that supports large-scale data-processing workloads on commodity hardware
    - GFS occupies a different point in the design space:
      - Component failures are treated as the norm
      - Optimized for huge files
    - GFS provides fault tolerance by replicating data and through fast, automatic recovery (chunk replication)
    - GFS has a simple, centralized master that does not become a bottleneck

    GFS is a successful file system and an important tool that enables Google to continue innovating on its ideas.
