Outline for Today’s Lecture
Administrative:
Objective: Peer-to-peer file systems
• Mechanisms employed
• Issues
• Some examples
The Security Environment: Threats
Security goals and threats
Intruders
Common Categories
1. Casual prying by nontechnical users
2. Snooping by insiders
3. Determined attempts to make trouble (or for personal gain)
4. Commercial or military espionage
Accidental Data Loss
Common Causes
1. Acts of God – fires, floods, wars
2. Hardware or software errors – CPU malfunction, bad disk, program bugs
3. Human errors – data entry, wrong tape mounted, rm *
Reliability Mechanisms (Redundancy)
• Replication of data, geographically distributed
– As simple as backups
– First-class replication (Coda)
– Voting schemes
• Error detection/correction
– Erasure codes (encode n blocks into more than n blocks, of which r suffice to recover the content of the original n)
– Parity bits, checksums
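The parity idea above can be sketched as a minimal erasure code: one XOR parity block over n data blocks lets any single lost block be rebuilt. This is a toy Python sketch (all names are hypothetical); real erasure codes such as Reed–Solomon tolerate more losses.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def encode(data_blocks):
    """Append one parity block: the XOR of all data blocks."""
    return data_blocks + [xor_blocks(data_blocks)]

def recover(blocks, missing_index):
    """Rebuild the single missing block by XOR-ing the survivors."""
    survivors = [b for i, b in enumerate(blocks) if i != missing_index]
    return xor_blocks(survivors)

data = [b"AAAA", b"BBBB", b"CCCC"]
coded = encode(data)                    # 3 data blocks + 1 parity block
assert recover(coded, 1) == b"BBBB"     # any one lost block can be rebuilt
```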
Basics of Cryptography
Relationship between the plaintext and the ciphertext
• Secret-key crypto is also called symmetric-key crypto
– If keys are long enough, good algorithms exist
– The secret key must be shared by both parties
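To illustrate the symmetric property (the same shared key both encrypts and decrypts), here is a toy XOR stream cipher in Python. This is not a secure algorithm; it only shows why both parties must hold the same key.

```python
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # Symmetric: c = m XOR k and m = c XOR k, so one function does both.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

key = b"shared-secret"                  # known to sender AND receiver
ct = xor_cipher(b"attack at dawn", key)
assert xor_cipher(ct, key) == b"attack at dawn"   # decrypt with same key
```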
Secret-Key Cryptography
Public-Key Cryptography
• All users pick a public key/private key pair
– publish the public key
– keep the private key unpublished
• Public key is (usually*) the encryption key
• Private key is (usually*) the decryption key
• RSA
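A sketch of the RSA idea with deliberately tiny textbook primes (real keys are thousands of bits; all numbers here are for illustration only):

```python
# Toy RSA: key generation, encryption, decryption (tiny primes, insecure)
p, q = 61, 53
n = p * q                   # modulus, part of both keys
phi = (p - 1) * (q - 1)     # Euler's totient of n
e = 17                      # public exponent, coprime with phi
d = pow(e, -1, phi)         # private exponent: modular inverse (Python 3.8+)

m = 65                      # a message, encoded as a number < n
c = pow(m, e, n)            # encrypt with the PUBLIC key (e, n)
assert pow(c, d, n) == m    # decrypt with the PRIVATE key (d, n)
```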
One-Way Functions
• Function such that, given the formula for f(x)
– it is easy to evaluate y = f(x)
• But given y
– it is computationally infeasible to find x
• Example: hash functions – produce a fixed-size result
– MD5
– SHA
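Hash functions like these are available in Python's hashlib; whatever the input size, the output size is fixed. (Note that MD5 is considered broken for collision resistance today and survives mainly as a checksum.)

```python
import hashlib

digest_md5 = hashlib.md5(b"hello").hexdigest()      # 128-bit result
digest_sha = hashlib.sha256(b"hello").hexdigest()   # 256-bit result

assert len(digest_md5) == 32    # 16 bytes, hex-encoded
assert len(digest_sha) == 64    # 32 bytes, hex-encoded
# Any input size maps to the same fixed output size:
assert len(hashlib.sha256(b"x" * 10_000).hexdigest()) == 64
```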
Digital Signatures
• Computing a signature block
– The hash is fixed length – apply the private key as the encryption key*
• What the receiver gets
– Use the public key as the decryption key* on the signature block to recover the hash
– Compute the hash of the document part
– Do these match?
• Assumes E(D(x)) = x, whereas we usually want D(E(x)) = x
• The public key must be known to the receiver somehow – certificate
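A toy sketch of the scheme described above: hash the document, apply the private key to the hash, and let the receiver reverse it with the public key and compare. Toy RSA numbers, for illustration only.

```python
import hashlib

# Toy RSA key pair (tiny primes, insecure, illustration only)
p, q = 61, 53
n, phi = p * q, (p - 1) * (q - 1)
e, d = 17, pow(17, -1, phi)

def sign(doc: bytes) -> int:
    h = int.from_bytes(hashlib.sha256(doc).digest(), "big") % n
    return pow(h, d, n)          # "encrypt" the hash with the PRIVATE key

def verify(doc: bytes, sig: int) -> bool:
    h = int.from_bytes(hashlib.sha256(doc).digest(), "big") % n
    return pow(sig, e, n) == h   # "decrypt" with PUBLIC key, compare hashes

doc = b"pay Alice $10"
sig = sign(doc)
assert verify(doc, sig)
# A tampered document would (with overwhelming probability) change the
# hash, so verify() would fail.
```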
Distributing Public Keys
• Certificate authority
– Trusted 3rd party
– Their public key is well known
• Send name and public key, digitally signed by the CA
Byzantine Generals Problem
Reaching consensus among geographically separated (distributed) players when some of them are compromised.
• Generals of army units need to agree on a common plan of attack (consensus)
• Traitorous generals will lie (faulty or malicious)
• Generals communicate by sending messages directly general-to-general through runners between units (they won't all see the same intel)
• Solutions let all loyal generals reach consensus, in spite of liars (up to some fraction of the generals being bad)
Solution with Digital Signatures
• Iteratively execute “rounds” of message exchanges
• As each message passes by, the receiving general digitally signs it and forwards it on.
• Each General maintains the set of orders received
• Inconsistent orders indicate traitor
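The steps above can be sketched as follows. This is a minimal model, not the full signed-messages protocol: signatures are modeled as unforgeable name tags appended during relay, and a loyal general that ever sees two different signed orders from the same commander knows there is a traitor and falls back to a default. All names are hypothetical.

```python
def relay(order, chain, signer):
    """Forward an order, appending the signer's (unforgeable) signature."""
    return (order, chain + [signer])

def decide(received):
    """received: the set of signed orders collected at one general."""
    orders = {order for order, _ in received}
    if len(orders) == 1:
        return orders.pop()   # all orders consistent: obey
    return "RETREAT"          # inconsistency exposes a traitor: use the default

# A traitorous commander tells one lieutenant ATTACK and the other RETREAT.
# Honest lieutenants forward what they received, so both collect both orders
# and reach the same (safe) decision.
msgs_at_L1 = [relay("ATTACK", [], "cmd"), relay("RETREAT", ["cmd"], "L2")]
msgs_at_L2 = [relay("RETREAT", [], "cmd"), relay("ATTACK", ["cmd"], "L1")]
assert decide(msgs_at_L1) == decide(msgs_at_L2) == "RETREAT"
```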
Peer-to-peer File Systems
Problems with Centralized Storage Server Farms
• Weak availability
– Susceptible to point failures and DoS attacks
• Management overhead
– Data often manually partitioned to obtain scale
– Management and maintenance are a large fraction of cost
• Per-application design (e.g., GoogleOS)
– High hurdle for new applications
• Don't leverage the advent of powerful clients
– Limits scalability and availability
Slides from Shenker and Stoica, UCB
What is a P2P system?
• A distributed system architecture:
– No centralized control
– Nodes are symmetric in function
• Large number of (perhaps) server-quality nodes
• Enabled by technology improvements
[Diagram: symmetric nodes connected to one another across the Internet]
P2P as Design Style
• Resistant to DoS and failures
– Safety in numbers, no single point of attack or failure
• Self-organizing
– Nodes insert themselves into the structure
– Need no manual configuration or oversight
• Flexible: nodes can be
– Widely distributed or co-located
– Powerful hosts or low-end PCs
– Trusted or unknown peers
Issues
• Goal is to have no centralized server and to utilize desktop-level idle resources.
• Trust – privacy, security, data integrity
– Using untrusted hosts -- crypto solutions
• Availability
– Using lower "quality" resources -- replication
– Using machines that may regularly go off-line
• Fairness – freeloaders who just use and don't contribute any resources
– Using voluntarily contributed resources -- use economic incentives
What Interface?
• Challenge for P2P systems: finding content
– Many machines; must find the one that holds the file
• Essential task: Lookup(key)
– Given a key, find the host (IP) that has the file with that key
• Higher-level interface: Put()/Get()
– Easy to layer on top of Lookup()
– Allows the application to ignore details of storage
• System looks like one hard disk
– Good for some apps, not for others
Distributed Hash Tables vs Unstructured P2P
• DHTs good at:
– exact match for "rare" items
• DHTs bad at:
– keyword search, etc. [can't construct a DHT-based Google]
– tolerating extreme churn
• Gnutella etc. good at:
– general search
– finding common objects
– very dynamic environments
• Gnutella etc. bad at:
– finding "rare" items
DHT Layering
[Diagram: a distributed application calls put(key, data) and get(key) → data on the distributed hash table, which sits atop a lookup service — lookup(key) → node IP address — spanning many nodes]
• Application may be distributed over many nodes
• DHT distributes data storage over many nodes
Two Crucial Design Decisions
• Technology for infrastructure: P2P
– Take advantage of powerful clients
– Decentralized
– Nodes can be desktop machines or server quality
• Choice of interface: lookup and hash table
– Lookup(key) returns the IP of the host that "owns" the key
– Put()/Get() is the standard hash-table interface
– Some flexibility in interface (no strict layers)
A DHT in Operation: Overlay
[Diagram: overlay of nodes, each holding a local key-value (K, V) table]
A DHT in Operation: put()
[Diagram: put(K1, V1) is routed hop by hop through the overlay to the node responsible for K1, which stores the pair (K1, V1)]
A DHT in Operation: get()
[Diagram: get(K1) is routed through the overlay to the node storing (K1, V1), which returns the value]
Key Requirement
• All puts and gets for a particular key must end up at the same machine
– Even in the presence of failures and new nodes (churn)
• This depends on the DHT routing algorithm
– Must be robust and scalable
DHTs
• Examples
– CAN
– Chord
– Pastry
– Tapestry
– Used in BitTorrent and the Coral CDN
• Keyspace partitioning
– Ownership of keys is split among participating nodes
– A node has an ID and owns keys "close" to its ID by some distance function
• Hash the filename to get the key
• Routing in the overlay
– Forward to a node with a closer ID, or else the key is mine
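The keyspace partitioning and routing rules above can be sketched with consistent hashing: hash node names and filenames into one circular ID space, and route each key to the node whose ID is its closest clockwise successor. This is a simplified single-hop sketch (all names hypothetical); real DHTs route through O(log n) hops using partial routing tables.

```python
import hashlib
from bisect import bisect_left

def h(s: str) -> int:
    """Map names into a 1-D circular ID space (here, 32 bits of SHA-1)."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 2**32

class Ring:
    def __init__(self, node_names):
        self.ids = sorted(h(n) for n in node_names)
        self.by_id = {h(n): n for n in node_names}

    def owner(self, filename: str) -> str:
        """Route a key to the node whose ID is its clockwise successor."""
        key = h(filename)
        i = bisect_left(self.ids, key) % len(self.ids)  # wrap the circle
        return self.by_id[self.ids[i]]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
# Every lookup for the same filename lands on the same node:
assert ring.owner("paper.pdf") == ring.owner("paper.pdf")
```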
PASTRY Overlay Network
[Diagram: a lookup for key k is routed around the ID circle to the owning node]
• Nodes are assigned 1-dimensional IDs in the hash space at random (e.g., a hash of the IP address)
• Each node has log n neighbors and maintains a routing table
• A lookup with fileID k is routed to the live node with nodeID closest to k
PAST
• Rice Univ. and MSR Cambridge UK
• Based on an Internet-based overlay
• Not traditional file system semantics
• A file is associated with a fileID upon insertion into PAST and can have k replicas
– fileID is a secure hash of the filename, the owner's public key, and a random salt
– The k nodes whose nodeIDs are "closest" to the most significant bits of the fileID store the replicas
• Instead of directory lookup, retrieve a file by knowing its fileID
• DHash replicates each key/value pair at the nodes after it on the circle
• It's easy to find replicas
• Put(k, v) goes to all of them
• Get(k) reads from the closest
Data Availability via Replication
[Diagram: Chord-style ring of nodes N5–N110; key K19 is stored at its successor N20 and replicated at the following nodes N32 and N40]
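The DHash-style replication just illustrated can be sketched as follows: each key lives at its successor plus the next r-1 nodes on the circle, so a get() still succeeds after the primary fails. This is a simplified single-process model; all names are hypothetical.

```python
import hashlib
from bisect import bisect_left

def h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 2**32

class ReplicatedRing:
    """Store each key at its successor plus the next r-1 nodes."""
    def __init__(self, nodes, r=3):
        self.ids = sorted(h(n) for n in nodes)
        self.r = r
        self.store = {i: {} for i in self.ids}   # per-node local tables

    def successors(self, key):
        i = bisect_left(self.ids, h(key)) % len(self.ids)
        return [self.ids[(i + j) % len(self.ids)] for j in range(self.r)]

    def put(self, key, value):            # put(k, v) goes to all replicas
        for node in self.successors(key):
            self.store[node][key] = value

    def get(self, key):                   # get(k) reads the closest live replica
        for node in self.successors(key):
            if key in self.store[node]:
                return self.store[node][key]
        return None

ring = ReplicatedRing(["n1", "n2", "n3", "n4", "n5"], r=3)
ring.put("K19", "block-19")
primary = ring.successors("K19")[0]
del ring.store[primary]["K19"]            # the primary loses the block
assert ring.get("K19") == "block-19"      # a later successor still serves it
```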
First Live Successor Manages Replicas
[Diagram: ring of nodes N5–N110; when the node holding Block 19 fails, the first live successor takes over, using the copy kept at a following node (N68)]
Other P2P FS examples
Farsite
• Microsoft Research – intended to look like NTFS
• Desktops on a LAN (not Internet-scale)
• Three roles: client, member of a directory group, file host
• Directory metadata is managed by Byzantine replication
• File hosts store encrypted, replicated file data
• The directory group stores a secure hash of content to validate the authenticity of a file
• Multiple namespace tree roots, with a namespace certificate provided by a CA
• File performance via local caching under a leasing system
LOCKSS
• Lots of Copies Keeps Stuff Safe (HPLabs, Stanford, Harvard, Intel)
• Library application for L-O-N-G term archival of digital library content (deals with bit rot, obsolescence of formats, malicious users).
• Continuous audit and repair of replicas, based on taking polls of sites with copies of the content (comparing digests of content and repairing my copy if it differs from the consensus).
• Rate limiting and churn of voter lists deter attackers from compromising enough copies to force a malicious "repair".
Sampled Poll
• Each peer holds, for every preserved Archival Unit
– a reference list of peers it has discovered
– a friends list of peers its operator knows externally
– a history of interactions with others (balance of contributions)
• Periodically (faster than the rate of storage failures)
– The poller takes a sample of the peers in its reference list
– It invites them to vote: send a hash of their replica
• It compares the votes with its local copy
– Overwhelming agreement (> 70%): sleep blissfully
– Overwhelming disagreement (< 30%): repair
– Too close to call: raise an alarm
• To repair, the peer gets the copy of somebody who disagreed and then reevaluates the same votes
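The sampled poll can be sketched as follows, using the 70%/30% thresholds from the slide. This is a simplified model (all names hypothetical); the real LOCKSS protocol adds proofs of effort, nonces, and rate limiting.

```python
import hashlib
import random

def digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def sampled_poll(my_copy, reference_list, sample_size=5, rng=random):
    """One audit round: poll a sample of peers about one Archival Unit."""
    voters = rng.sample(reference_list, min(sample_size, len(reference_list)))
    votes = [peer_digest for _, peer_digest in voters]  # each vote is a hash
    agree = sum(v == digest(my_copy) for v in votes) / len(votes)
    if agree > 0.7:
        return "sleep"    # overwhelming agreement: copy is healthy
    if agree < 0.3:
        return "repair"   # overwhelming disagreement: fetch a copy, re-check
    return "alarm"        # too close to call: possible attack in progress

good = b"archival unit v1"
peers = [(f"peer{i}", digest(good)) for i in range(10)]
assert sampled_poll(good, peers) == "sleep"
assert sampled_poll(b"bit-rotted copy", peers) == "repair"
```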
Churn of Voter Lists
• Reference List
– Take out voters, so that the next poll is based on a different group
– Replenish with some "strangers" and some "friends"
• Strangers: accepted nominees proposed by voters who agree with the poll outcome
• Friends: from the friends list
• The measure of favoring friends is called friend bias
• History
– The poller owes its voters a vote (for their future polls)
– Detected misbehavior is penalized in the victim's history