
1

Cloud Computing

Lectures 11, 12 and 13

Cloud Storage

2014-2015

2

Up until now…

• Introduction

• Definition of Cloud Computing

• Grid Computing

• Content Distribution Networks

• Cycle-Sharing

• Distributed Scheduling

• Map Reduce

3

Outline

• Components of Cloud Platforms

• Storage Types

• Storage Products

• Cloud File Systems

• Cloud Object Storage

4

Components of Cloud Computing

Platforms

Data Storage

Execution Model

Programming Model

Monitoring

•How to program an application?

•How is the platform viewed?

•Which abstraction is accessible: VM? API? Framework?

•Which operations can I perform?

•How are my data stored and accessed?

•Monitoring: How can I evaluate the state of executions/nodes/data...?

5

Major Cloud Platforms

• Apache Hadoop

• Amazon Web Services

• Google App Engine

• Microsoft Azure

• OpenStack

6

Storage Types

• Cloud platforms offer a range of storage types with different search, streaming and indexing capabilities.

• File System:

• Hierarchical organization, files, permission, streaming data,...

• Object Storage:

• Direct Program <-> Storage interaction

• Object ID indexing

• Tables (NoSQL DB):

• records and tables

• Search

• No relational model

• Relational Databases:

• Full relational model

• Conventional services

• We will see that the categories are becoming blurred...

7

Storage Products (i)

• File System

• Hadoop File System / Google File System

• Object/Byte Storage

• Amazon S3

• MS Azure Blobs

• Table

• Hadoop HBase / Google Big Table (AppEngine Datastore)

• Amazon Simple DB

• MS Azure Tables

• Hadoop Hive

• Yahoo PNUTS

• Relational Databases

• Amazon RDS

• SQL Azure

8

Cloud File System: HDFS/GFS

• Distributed File System

• Reimplementation of the Google File System (GFS).

• Runs on clusters of generic machines.

• HDFS is tuned for:

• Very large files.

• Streaming access.

• Generic hardware.

• Key to scalability: data operations don’t go through the central server (the namenode).

9

Blocks

• Blocks simplify space management: allocation and replication are done per block, and a file may grow almost indefinitely.

• Evolution:

• Disk blocks: 512 bytes

• File system blocks: 2, 4, 8 KB

• HDFS blocks: 64 MB

• Large, contiguous 64 MB blocks minimize seek overhead relative to transfer time.

• A file smaller than one block does not occupy a full block: it only uses the space it actually needs.

10

Namenode

• Manages the file system name space: folder hierarchy, name uniqueness,…

• Maintains the folder tree and the metadata in 2 files: namespace image and edit log.

• HDFS cannot operate without the namenode.

• Files can be written, read, renamed and deleted.

• It is not possible to:

• Write in the middle of a file.

• Write concurrently to the same file.

• Fault tolerance mechanism: atomic replication to another machine.

11

Datanode

• Manages a set of blocks.

• Processes clients’ or the namenode’s read/write requests.

• Periodically notifies the namenode of the blocks it holds.

• If a block’s replication factor drops below the configured value, a new replica is created.

12

Permissions

• Permissions in HDFS are similar to UNIX:

• user, group and other

• read, write and execute

• As the user is very often remote, any username presented by a remote node is trusted. Therefore, protection is weak.

• Permissions are geared more towards managing a group of users in the cluster than towards security.

13

Consistency Model

• Formalization of the visibility of read and write operations.

• After an operation call finishes, who sees what, and when?

• HDFS model: there is no guarantee that the last block has been written unless sync() is called.

14

Error Checking

• Block correctness is checked using a checksum function (CRC32).

• At file creation:

• The client calculates the checksum for each 512-byte chunk.

• The datanode stores the checksums.

• At file access:

• The client reads the data and the checksums from the datanode.

• If the check fails, it tries other replicas.

• Periodically, the datanode checks the checksums of its own blocks.
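
As a rough illustration of this per-chunk checksum scheme, the sketch below computes and verifies CRC32 values for 512-byte chunks using java.util.zip.CRC32. The class and method names are made up for the example; this is not HDFS code.

import java.util.Arrays;
import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int CHUNK = 512; // checksum granularity used in the slide

    // Compute one CRC32 value per 512-byte chunk of the block.
    static long[] checksums(byte[] block) {
        int n = (block.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            crc.update(block, i * CHUNK, Math.min(CHUNK, block.length - i * CHUNK));
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read: recompute and compare against the stored checksums;
    // a mismatch means this replica is corrupt and another one should be tried.
    static boolean verify(byte[] block, long[] stored) {
        return Arrays.equals(checksums(block), stored);
    }
}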

15

Reading

• Client contacts the namenode to get the list of the datanodes with the file’s blocks (stored in memory).

• Receives an FSDataInputStream that transparently chooses the best datanode, opens and closes connections to the datanodes, requests block locations from the namenode, repeats operations if necessary and logs failed datanodes.
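
A minimal read sketch using the Hadoop FileSystem API (org.apache.hadoop.fs); the path is a placeholder and the configuration is assumed to point at a running HDFS cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class HdfsRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);      // talks to the namenode
        // open() returns an FSDataInputStream that fetches block locations
        // from the namenode and streams each block from the closest datanode.
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"))) {
            byte[] buf = new byte[4096];
            int read;
            while ((read = in.read(buf)) > 0) {
                System.out.write(buf, 0, read);
            }
            System.out.flush();
        }
    }
}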

16

Reading

17

Choosing Nodes: Distance

• Nodes choose the closest sources of data.

• Assumes a tree-structured organization.

• Distance is the number of hops between the tree nodes.

• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)

• distance(/d1/r1/n1, /d1/r1/n2) = 2 (processes on the same rack)

• distance(/d1/r1/n1, /d1/r2/n3) = 4 (processes on different racks)

• distance(/d1/r1/n1, /d2/r3/n4) = 6 (processes on different datacentres)

18

Distance Between Nodes

19

Writing (+ creating)

• The client asks the namenode to create the new file; the namenode checks permissions and name uniqueness. If this succeeds, the client receives an FSOutputStream.

• The namenode provides a set of datanodes for replication.

• Block write requests are kept in a data queue.

• Unconfirmed block write requests are kept in an ack queue.
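
A minimal write sketch using the Hadoop FileSystem API again; note that current Hadoop versions expose the stream as FSDataOutputStream (the slide’s FSOutputStream), and the path and contents are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class HdfsWrite {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() asks the namenode to register the new file (permission and
        // uniqueness checks) and returns a stream that pipelines block writes
        // to the datanodes chosen by the namenode.
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            out.hsync(); // force the current block to the datanodes (cf. sync() on the consistency slide)
        }
    }
}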

20

Writing

21

Writing

• If a datanode fails during the write, the client changes the block id so that the corrupted replica can be identified and deleted later.

• By default, the write is considered done as soon as one of the replicas is successfully written. The other replicas are written asynchronously.

22

Command Line Tool

• hadoop fs (see the example invocations after this list):

• ls

• mkdir

• rm

• rmr

• put

• copyToLocal

• copyFromLocal
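
Some typical invocations of the tool; the paths and file names are illustrative.

hadoop fs -ls /user/alice
hadoop fs -mkdir /user/alice/logs
hadoop fs -put access.log /user/alice/logs/
hadoop fs -copyToLocal /user/alice/logs/access.log ./access.log
hadoop fs -rm /user/alice/logs/access.log
hadoop fs -rmr /user/alice/logs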

23

Cloud Object Store:

Amazon Simple Storage Service (S3)

24

S3

• Amazon’s persistent object storage system.

• Implementation based on the Dynamo system (SOSP 2007).

• Accessible using HTTP through several interfaces, e.g. REST and SOAP.

25

Dynamo: Intuition

• CAP Theorem: Consistency, Availability and Partition tolerance - pick two!

• At Amazon: availability = the clients’ trust.

• It cannot be sacrificed.

• In large data centres there are going to be frequent faults:

• The possibility of a partition has to be accounted for.

• Most data services tolerate small inconsistencies:

• Relaxed consistency ==> eventual consistency.

26

Consistency Models

• Strong Consistency: Once a write operation has finished for the requester, any subsequent read will return the value that was written.

• Weak Consistency: The system does not guarantee that subsequent accesses return the written value. Some condition must hold for the written value to be returned (a time interval, an access to a synchronization variable,…). The period between the write finishing and the value becoming visible is called the inconsistency window.

• Eventual Consistency: The system guarantees that, if there are no new writes, the updates will eventually become visible to all clients (e.g. DNS): a DNS name update is propagated between zones until all clients see the new value.

27

Variants of Eventual Consistency

• Causal Consistency: Two causally related writes (A happens before B) cannot lead to B being written before A. There are no guarantees regarding write operations that are not causally related.

• Read-your-writes Consistency: After a process A writes a value, A’s subsequent reads always return that value (or a newer one); a particular case of causal consistency.

• Session Consistency: A practical implementation of the previous model. All operations are done in the context of a session. During the session, the system guarantees “read-your-writes”. After certain faults, the session is ended and the guarantee restarts with a new session.

• Monotonic Reads Consistency: Once a process has seen a value, its subsequent reads never return an older value.

• Monotonic Writes Consistency: The system guarantees that writes by the same process are applied in order. Systems that do not provide this guarantee are rare and notoriously hard to program.

28

Dynamo Assumptions

• Interaction Model:

• Reads and writes of whole objects, identified by unique IDs.

• Binary objects of up to 5 GB.

• No operations spanning multiple objects.

• ACID properties (Atomicity, Consistency, Isolation, Durability):

• Atomicity/Isolation: writes of whole objects.

• Durability: replicated writes.

• Only consistency is not strong.

• Efficiency:

• Optimize for the 99.9th percentile.

29

Design Decisions

• Incremental Scalability:

• Adding nodes has to be simple.

• Load balancing and support for heterogeneity:

• The system must distribute the requests.

• And support nodes with different characteristics.

• Solution: organize the nodes in a Chord-like DHT.

30

Design Decisions

• Symmetry:

• All nodes are equally responsible peers.

• Decentralization:

• Avoid single points of failure.

31

Dynamo: Design Decisions

Problem / Technique / Advantage:

• Partitioning: consistent hashing. Advantage: incremental scalability.

• Write availability: vector clocks with conflict resolution of writes at read time. Advantage: version size does not depend on the update rate.

• Temporary faults: relaxed (sloppy) quorum and hinted handoff. Advantage: high availability and durability.

• Permanent faults: anti-entropy with Merkle trees. Advantage: replicas are synchronized asynchronously.

• Membership and fault detection: gossip-based membership protocol. Advantage: maintains symmetry and avoids a centralized directory.

32

Dynamo: API

• Two operations:

• put(key, context, object)

• key: object ID.

• context: vector clocks and object’s history.

• object: data to be written.

• get(key)
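
Dynamo is an internal Amazon system, so the following Java interface is only an illustrative rendering of these two operations; all type names are hypothetical.

import java.util.List;

// Hypothetical rendering of Dynamo's interface: a get() may return several
// divergent versions plus the context (vector clocks) needed to reconcile
// them in a later put().
interface DynamoStore {
    Result get(byte[] key);
    void put(byte[] key, Context context, byte[] object);

    final class Result {
        final List<byte[]> versions;   // one entry per divergent replica version
        final Context context;         // opaque vector-clock summary
        Result(List<byte[]> versions, Context context) {
            this.versions = versions;
            this.context = context;
        }
    }

    final class Context { /* opaque: vector clocks and object history */ }
}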

33

Partitioning and Replication

• Uses consistent hashing.

• Similar to Chord:

• Each node has an id in the key space.

• Nodes are arranged in a ring.

• Data are stored in the node with the lowest key that is larger than the object’s key.

• Replication:

• Each object is replicated on the N nodes that follow the node associated with the object.

34

The Chord Ring with Replication

35

Virtual Nodes

• Problem: few nodes, or heterogeneous nodes, lead to bad load balancing.

• Dynamo solution:

• Use virtual nodes.

• Each physical node holds several “virtual node” tickets.

• More powerful machines can have more tickets.

• “Virtual node” tickets are distributed randomly.
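
The sketch below puts the last two slides together: consistent hashing with virtual-node tickets and a preference list of N distinct physical successors. The hash choice and class names are illustrative, not Dynamo’s actual implementation.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

public class Ring {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // ring position -> physical node
    private final int replicas; // N

    Ring(int replicas) { this.replicas = replicas; }

    // More powerful machines can be given more tickets.
    void addNode(String node, int tickets) {
        for (int t = 0; t < tickets; t++) {
            ring.put(hash(node + "#" + t), node);
        }
    }

    // The object lives on the first node clockwise from its key and is
    // replicated on the next N-1 *distinct* physical nodes.
    List<String> preferenceList(String key) {
        Set<String> nodes = new LinkedHashSet<>();
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        for (String n : tail.values()) { if (nodes.size() < replicas) nodes.add(n); }
        for (String n : ring.values()) { if (nodes.size() < replicas) nodes.add(n); } // wrap around
        return new ArrayList<>(nodes);
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}

For example, ring.addNode("A", 8) followed by ring.addNode("B", 16) gives B twice as many tickets as A, and preferenceList("some-key") returns the N physical nodes responsible for that key.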

36

Data Versions

• Nodes for writing and reading are selected based on load.

• So, we have eventual consistency:

• There may be different versions written on different replicas.

• Conflict resolution is made when reading and not when writing.

• Syntactic Reconciliation:

• Some changes can be made automatically. For formats with clearly identifiable parts and operations (e.g. mail file).

• Semantic Reconciliation:

• The user must decide.

• Divergence is uncommon. For all read operations:

• 99.94% - 1 version;

• 0.00057% - 2 versions;

• 0.00047% - 3 versions;

• 0.00009% - 4 versions.

• Timeout:

• After a number of generations without writing, versions are discarded.

37

Vector Clocks (i)

• Represents time in a distributed system without clock synchronization.

• Replaces physical time with causality.

• A vector clock is a list of (node, counter) pairs.

• If every position of the vector clock of an event A is less than or equal to the corresponding position for an event B, and at least one is strictly smaller, then A happened before B: there is a causal chain of events from A to B.
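
A small Java sketch of this comparison rule; the class is illustrative, but this is the kind of clock Dynamo carries in the put/get context.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>(); // node -> counter

    // The node that coordinates a write increments its own entry.
    void increment(String node) {
        counters.merge(node, 1L, Long::sum);
    }

    // A happened before B iff every counter in A is <= the one in B
    // and at least one is strictly smaller. If neither happened before
    // the other, the versions are concurrent (divergent replicas).
    boolean happenedBefore(VectorClock other) {
        boolean strictlySmaller = false;
        Set<String> nodes = new HashSet<>(counters.keySet());
        nodes.addAll(other.counters.keySet());
        for (String n : nodes) {
            long a = counters.getOrDefault(n, 0L);
            long b = other.counters.getOrDefault(n, 0L);
            if (a > b) return false;
            if (a < b) strictlySmaller = true;
        }
        return strictlySmaller;
    }

    boolean concurrentWith(VectorClock other) {
        return !happenedBefore(other) && !other.happenedBefore(this);
    }
}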

38

Vector Clocks (ii)

(Figure: vector clock example; the horizontal axis is real time)

39

Object Versions

• If we assign a vector clock timestamp to all object versions we can detect divergent replicas.

• Example:

• X, Y and Z are servers with replicas of object D.

• D5 is a semantic reconciliation performed by the user.

40

Executing get() and put()

• For good performance, two possibilities:

• Route requests through a load balancer that chooses the node based on the load:

• Creates a bottleneck.

• Use a client side library to choose the node where to send the request (which will be the coordinator):

• Requires recompiling the client. Probably irrelevant in AWS.

• Then the coordinator executes the quorum reads or writes.

41

Read/Write Operations

• Dynamo supports reads and writes using a quorum model, so an operation does not have to wait for all the replicas.

• Let R and W be the number of replicas that must synchronously take part in a read and in a write, respectively.

• If R + W > N we have a quorum-based system: the set of replicas used for a write always overlaps with the set used for a read.

• It is then impossible to read an object without contacting at least one replica that saw the latest write.

• Example: with N = 3, R = 2 and W = 2 (a common Dynamo configuration), R + W = 4 > 3.

• Latency is determined by the slowest node in the R (or W) set. Therefore, to improve performance, one lowers R or W.
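
A rough sketch of a coordinator-side quorum read under these rules: it contacts all N nodes in the preference list but returns after the first R answers, so the slowest replicas stop dominating latency. Error handling and reconciliation of divergent versions are omitted, and the Replica interface is hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QuorumRead {
    interface Replica { byte[] read(byte[] key) throws Exception; }

    // Ask all N replicas in the preference list, but return as soon as
    // R of them have answered.
    static List<byte[]> read(List<Replica> preferenceList, byte[] key, int r) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(preferenceList.size());
        try {
            CompletionService<byte[]> cs = new ExecutorCompletionService<>(pool);
            for (Replica rep : preferenceList) {
                cs.submit(() -> rep.read(key));
            }
            List<byte[]> versions = new ArrayList<>();
            while (versions.size() < r) {
                versions.add(cs.take().get()); // blocks until the next replica answers
            }
            return versions; // divergent versions would be reconciled afterwards
        } finally {
            pool.shutdownNow();
        }
    }
}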

42

Sloppy Quorum

• To ensure availability, Dynamo uses a “sloppy quorum”.

• Each data item has a preference list: a list of N nodes spanning multiple machines and data centers.

• Operations are performed not on the N “home” replicas but on the first N healthy nodes of the preference list.

43

Tolerating Temporary Faults:

Hinted Handoff

• Assuming N = 3. If A is unavailable or fails when we write, send a replica to D.

• D marks the replica as temporary and returns the data to A as soon as it recovers.

• Replicas are chosen from a preference list of nodes.

• Preference lists always span multiple datacenters for fault tolerance.

44

Membership and Fault Detection

• Ring Membership:

• At startup, use an external entry point to avoid partitioned rings.

• Gossip asynchronously to update the DHT: exchange membership lists with a random node every 2 seconds.

• Fault Detection:

• Faults are detected by neighbours via periodic messages with a timeout on the reply.

45

Permanent Faults

• When a hinted replica (one holding writes that belong to another node) is considered failed, the data is synchronized with the new replica using Merkle trees.

46

Merkle Trees

• Accelerates synchronization between nodes by comparing trees of hashes.

• Each tree node stores a hash of its children.

• It makes it very easy to identify what needs to be exchanged.

• The update can be asynchronous:

• An out-of-date tree is not serious.

47

Merkle Trees: Dynamo

• Each node has a set of keys.

• All objects are leaves of the Merkle tree.

• Replicas periodically exchange the top of their Merkle trees.

• If the hashes differ, they recursively exchange the hashes of lower nodes.
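
A compact sketch of the idea, assuming both replicas build their trees over the same key ranges; the structure and hash choice are illustrative, not Dynamo’s code.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

public class MerkleNode {
    final String hash;
    final MerkleNode left, right;
    final int lo, hi; // key range covered by this node (indices into the leaf list)

    private MerkleNode(String hash, MerkleNode left, MerkleNode right, int lo, int hi) {
        this.hash = hash; this.left = left; this.right = right; this.lo = lo; this.hi = hi;
    }

    // Leaves hash the objects of a key range; each inner node hashes the
    // concatenation of its children's hashes.
    static MerkleNode build(List<byte[]> objects, int lo, int hi) {
        if (hi - lo == 1) {
            return new MerkleNode(sha1(objects.get(lo)), null, null, lo, hi);
        }
        int mid = (lo + hi) / 2;
        MerkleNode l = build(objects, lo, mid);
        MerkleNode r = build(objects, mid, hi);
        return new MerkleNode(sha1((l.hash + r.hash).getBytes(StandardCharsets.UTF_8)), l, r, lo, hi);
    }

    // Two replicas compare roots; only where hashes differ do they descend,
    // so the key ranges that need to be exchanged are found quickly.
    static void diff(MerkleNode a, MerkleNode b, List<int[]> outOfSyncRanges) {
        if (a.hash.equals(b.hash)) return;
        if (a.left == null) { outOfSyncRanges.add(new int[]{a.lo, a.hi}); return; }
        diff(a.left, b.left, outOfSyncRanges);
        diff(a.right, b.right, outOfSyncRanges);
    }

    private static String sha1(byte[] data) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte x : MessageDigest.getInstance("SHA-1").digest(data)) {
                sb.append(String.format("%02x", x & 0xff));
            }
            return sb.toString();
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}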

48

Back to S3

• Additional issues when compared to Dynamo:

• Access to S3 is controlled by an ACL based on the clients’ AWS identity and checked with their secret key.

• Occasionally, some S3 calls fail and must be repeated. Programs accessing S3 should take this into account.

• Dynamo replication is performed between data centers.

• This large scale replication has some lag.
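
A minimal retry helper along these lines, with exponential backoff; the attempt count and delays are arbitrary choices for the example.

import java.util.concurrent.Callable;

public class S3Retry {
    // Retry an S3 call a few times with exponential backoff, since
    // occasional failures are expected and should simply be repeated.
    static <T> T withRetries(Callable<T> s3Call, int maxAttempts) throws Exception {
        long backoffMs = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                return s3Call.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) throw e;   // give up after the last attempt
                Thread.sleep(backoffMs);
                backoffMs *= 2;                        // 100 ms, 200 ms, 400 ms, ...
            }
        }
    }
}

It would wrap any S3 call, for instance withRetries(() -> s3Service.getObject(bucket, "myobj"), 3) with the JetS3t service shown a few slides below.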

49

Service Level Agreements

• Hosting contracts and cloud platforms, like S3, include SLAs.

• Very often described as average, median and/or variances of response times:

• Extreme cases are always problematic.

• Amazon optimizes for 99.9% of the requests:

• Example: 300 ms response time for 99.9% of the requests below a peak rate of 500 requests per second.

50

Buckets and Objects

• S3 data are stored as Dynamo objects.

• Operations on objects are:

– PUT, GET, DELETE, HEAD (get metadata)

• Objects can be grouped in buckets.

• Buckets are used for delimiting namespaces:

• http://mybucket.s3.amazonaws.com/myobj

• http://s3.amazonaws.com/mybucket/myobj

51

S3: REST GET

Sample Request

GET /my-image.jpg HTTP/1.1

Host: bucket.s3.amazonaws.com

Date: Wed, 28 Oct 2009 22:32:00 GMT

Authorization: AWS 02236Q3V0WHVSRW0EXG2:0RQf4/cRonhpaBX5sCYVf1bNRuU=

Sample Response

HTTP/1.1 200 OK

x-amz-id-2: eftixk72aD6Ap51TnqcoF8eFidJG9Z/2mkiDFu8yU9AS1ed4OpIszj7UDNEHGran

x-amz-request-id: 318BC8BC148832E5

Date: Wed, 28 Oct 2009 22:32:00 GMT

Last-Modified: Wed, 12 Oct 2009 17:50:00 GMT

ETag: "fba9dede5f27731c9771645a39863328"

Content-Length: 434234

Content-Type: text/plain

Connection: close

Server: AmazonS3

[434234 bytes of object data]

See http://s3.amazonaws.com/doc/s3-developer-guide/RESTAuthentication.html

52

S3: REST PUT

Sample Request

PUT /my-image.jpg HTTP/1.1
Host: myBucket.s3.amazonaws.com
Date: Wed, 12 Oct 2009 17:50:00 GMT
Authorization: AWS 15B4D3461F177624206A:xQE0diMbLRepdf3YB+FIEXAMPLE=
Content-Type: text/plain
Content-Length: 11434
Expect: 100-continue
[11434 bytes of object data]

Sample Response

HTTP/1.1 100 Continue
HTTP/1.1 200 OK
x-amz-id-2: LriYPLdmOdAiIfgSm/F1YsViT1LW94/xUQxMsF7xiEb1a0wiIOIxl+zbwZ163pt7
x-amz-request-id: 0A49CE4060975EAC
x-amz-version-id: 43jfkodU8493jnFJD9fjj3HHNVfdsQUIFDNsidf038jfdsjGFDSIRp
Date: Wed, 12 Oct 2009 17:50:00 GMT
ETag: "fbacf535f27731c9771645a39863328"
Content-Length: 0
Connection: close
Server: AmazonS3

53

S3: REST in Java

public void createBucket() throws Exception {
    // S3 timestamp pattern.
    String fmt = "EEE, dd MMM yyyy HH:mm:ss ";
    SimpleDateFormat df = new SimpleDateFormat(fmt, Locale.US);
    df.setTimeZone(TimeZone.getTimeZone("GMT"));

    // Data needed for signature
    String method = "PUT";
    String contentMD5 = "";
    String contentType = "";
    String date = df.format(new Date()) + "GMT";
    String bucket = "/onjava";

    // Generate signature
    StringBuffer buf = new StringBuffer();
    buf.append(method).append("\n");
    buf.append(contentMD5).append("\n");
    buf.append(contentType).append("\n");
    buf.append(date).append("\n");
    buf.append(bucket);
    String signature = sign(buf.toString());

    // Connection to s3.amazonaws.com
    HttpURLConnection httpConn = null;
    URL url = new URL("http", "s3.amazonaws.com", 80, bucket);
    httpConn = (HttpURLConnection) url.openConnection();
    httpConn.setDoInput(true);
    httpConn.setDoOutput(true);
    httpConn.setUseCaches(false);
    httpConn.setDefaultUseCaches(false);
    httpConn.setAllowUserInteraction(true);
    httpConn.setRequestMethod(method);
    httpConn.setRequestProperty("Date", date);
    httpConn.setRequestProperty("Content-Length", "0");
    String AWSAuth = "AWS " + keyId + ":" + signature;
    httpConn.setRequestProperty("Authorization", AWSAuth);

    // Send the HTTP PUT request.
    int statusCode = httpConn.getResponseCode();
    if ((statusCode / 100) != 2) {
        // Deal with S3 error stream.
        InputStream in = httpConn.getErrorStream();
        String errorStr = getS3ErrorCode(in);
    }
}

54

S3: REST in JetS3t

String awsAccessKey = "YOUR_AWS_ACCESS_KEY";

String awsSecretKey = "YOUR_AWS_SECRET_KEY";

AWSCredentials awsCredentials =

new AWSCredentials(awsAccessKey, awsSecretKey);

S3Service s3Service = new RestS3Service(awsCredentials);

S3Bucket euBucket = s3Service.createBucket("eu-bucket", S3Bucket.LOCATION_EUROPE);
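
Continuing the snippet above, uploading an object with the same JetS3t service looks roughly like this; the key and contents are placeholders.

S3Object helloObject = new S3Object("hello.txt", "Hello from JetS3t");
s3Service.putObject(euBucket, helloObject);
System.out.println("Stored object " + helloObject.getKey() + " in " + euBucket.getName());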

55

Windows Azure

56

Azure Storage (i)

• Volatile storage:

• Instance disk

• Memory cache

• Persistent Storage:

• Windows Azure Storage:

• Blobs (objects)

• Tables

• Queues

• SQL Azure:

• Relational DB

57

Azure Storage (ii)

• Service is accessible via Web Services or libraries on top of these (C#, VB, Java).

• Blobs, Tables and Queues are stored in partitions.

• Partitions are the replication and load balancing unit. Blobs and queues are not sharded. Tables may be.

• All partitions have 3 replicas.

• Partitions are represented in a DFS as one or more extents (contiguous files) of up to 1GB.

58

Blobs

• A blob is a <name, object> pair.

• Allows storage of objects from a few bytes up to 50 GB.

• Blobs are stored in containers.

• There is no hierarchy in blob storage, but it can be simulated because names may contain “/”s.

• URL schema: http://<StorageAccount>.blob.core.windows.net/<Container>/<BlobName>

59

Operations on Blobs

• Put: creating

• Get: reading

• Set: updating

• Delete: eliminating

• Lease: 1 minute locking.

60

Next Time...

• Storage in Cloud Platforms
