criteria overview - university of waterlooa78khan/cs446/additional... · 2011. 7. 27. · • an...

Post on 23-Aug-2020


file synchronizer

data synchronizer

distributed file system

version control

Criteria Overview

Fitness for Purpose

• safety is job #1

• don’t lose or corrupt data

• even in the presence of hardware failures

• performance

• low bandwidth connections

• cross-platform

Fitness for Future

Modifiability

• Not expecting much change.

• Problem is well-defined and formally specified.

• Perhaps support for new file systems.

• Non-hierarchical FS?

Reusability

• Not designed for reusability.

• Can call the binary.

• Build a data synchronizer?

Cost of Production

• An academic project

• Supporting thousands of users

• Willing to spend time on interesting things

• Written in OCaml

• strongly-typed functional language

• close to formalism

Cost of Operation

• Not a primary design consideration

• In practice requires a server

• DropBox’s competitive advantages:

• DropBox provides the server

• DropBox has a simpler UI

Main Alternatives

Trace-Based

• work from a log of edits

• used by:

• distributed DBs

• middleware

• git

State-Based

• work from the current state of the data

• used by:

• unison

• dropbox

• subversion

Safety

Invariant #1

At every moment, each path in each replica has either

1. its original contents (i.e., no change at all has been made to this path), or

2. its correct final contents (i.e., the value that the user expected to be propagated from the other replica).

Invariant #2

At every moment, the information stored on disk about Unison's private state can be either

1. unchanged, or

2. updated to reflect those paths that have been successfully synchronized.

How to atomically replace a file?


• Most file systems do not have an atomic replace operation.

1. Create a tmp file with the new contents

2. Delete the old file

3. Rename the tmp file to the proper name
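The three steps above can be sketched in Python. This is a minimal illustration, not Unison's code (Unison is written in OCaml): the function name `replace_file` is hypothetical, and steps 2 and 3 are collapsed into a single `os.replace` call, which overwrites the destination in one rename where the platform supports it.

```python
import os
import tempfile


def replace_file(path: str, data: bytes) -> None:
    """Replace the contents of `path` via a temporary file (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Step 1: create a tmp file with the new contents, in the same directory
    # so that the final rename does not cross file systems.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the new contents to disk first
        # Steps 2 and 3 (assumption: done as one call): rename the tmp file
        # over the old one.
        os.replace(tmp, path)
    except BaseException:
        # On failure, remove the tmp file and leave the original untouched.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Until the rename happens, the path still holds its original contents, which is the property Invariant #1 asks for.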

Caveat

The above is almost true: there are occasionally brief periods where it is not (and, because of shortcomings of the Posix filesystem API, cannot be). In particular, when Unison is copying a file onto a directory or vice versa, it must first move the original contents out of the way. If Unison gets interrupted during one of these periods, some manual cleanup may be required. In this case, a file called DANGER.README will be left in your home directory. On its next run, Unison will warn you about it.

Update Detection Alternatives

Trivial Detection

• Always say every file has been modified

• Requires comparing the state of every file

• Expensive if files are large and link is slow

Exact Detection

• Keep a local copy of the data at the time of last synchronization (Subversion does this)

• Doubles the disk space

• May also be computationally expensive

Modtime Check

• Works in theory but not in practice

• In *nix, renaming a file does not change its modtime

• changes the modtime of the parent dir

• directory modtime may be changed for other reasons

• renaming a file near the root will make the whole tree look dirty

INode + Modtime

• Dirty =

• inode changed or

• modtime > last synchronization time

• INodes contain file metadata: size, permissions, owner, group, etc.

• but not file names

• Ok for Posix systems, but not for Windows
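The inode + modtime rule can be sketched as follows. This is an illustration under assumptions, not Unison's implementation: `Stamp`, `stamp`, and `is_dirty` are hypothetical names, and the archive is assumed to store the inode number recorded at the last synchronization.

```python
import os
from typing import NamedTuple


class Stamp(NamedTuple):
    """What a (hypothetical) archive remembers about a path."""
    inode: int
    mtime: float


def stamp(path: str) -> Stamp:
    """Read the current inode number and modification time of `path`."""
    st = os.stat(path)
    return Stamp(st.st_ino, st.st_mtime)


def is_dirty(path: str, archived: Stamp, last_sync_time: float) -> bool:
    """Dirty = inode changed, or modtime later than the last synchronization."""
    current = stamp(path)
    return current.inode != archived.inode or current.mtime > last_sync_time
```

A file replaced by renaming another file over it keeps the other file's modtime, so the modtime test alone would miss it; the inode comparison catches that case.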

Online Detection

• Listen to file system events

• What DropBox does

• Easier to implement at the user level on Windows than on *nix

• Some wrappers for Unison try to do this

Reconciliation

What does the user expect?

Synchronization (a simple example)

A synchronizer should propagate changes...

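[Figure, easy case (non-conflicting changes): both replicas start from a directory with a = f and b = g. One replica changes a to f’, the other changes b to g’. After synchronization both replicas hold a = f’ and b = g’.]

... as long as they do not conflict:

[Figure, easy case (conflicting changes): one replica holds a = f’ and b = g’, the other holds a = f and b = g’’. After synchronization both replicas hold a = f’, and each keeps its own version of b.]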

A more interesting example

If a file gets renamed on one side and modified on the other, what should the synchronizer do?

[Figure: both replicas start with a = f and b = g. The first replica renames b to c; the second modifies b to g’. The synchronized result is marked “???”.]

Three reasonable possibilities:

1. Copy old version with new name (c); report a conflict for old name (b)

[Result: the first replica holds a = f and c = g; the second holds a = f, b = g’, and c = g.]

2. Modify the file in the first replica and move it in the second

[Result: both replicas hold a = f and c = g’.]

3. Do nothing (report a conflict)

Another unclear case

Suppose a file is created on one side and its parent directory is deleted on the other side...

[Figure: a directory d holding several files; in one replica a new file is created inside d, while in the other replica d is deleted.]

What should happen?

1. Nothing; a conflict should be reported

2. The siblings should be deleted from the second replica, leaving just the file and its parent directory

3. The siblings and parent directory should all be deleted from the second replica; the file should be moved to a special “orphanage” and the user alerted

Specification

• A, B are filesystems

• p is the path to be synchronized

• returns “new” file systems with synchronized contents

Conflicts

Key question: What is a conflict?

Our answer: A conflict occurs when the two replicas do not agree (at some path), and both have been changed.

Formally, we say there is a conflict at path $p$ if

“$A$ and $B$ are different at $p$”

and “$A$ has been changed at (or below) $p$”

and “$B$ has been changed at (or below) $p$”


Core Specification

Each run of a file synchronizer takes filesystems $O$, $A$, and $B$ as inputs and yields new filesystems $A'$ and $B'$ as outputs. A run is said to be acceptable if, for all paths $p$:

(1) if $A(p) \neq O(p)$, then $A'(p) = A(p)$

if $B(p) \neq O(p)$, then $B'(p) = B(p)$

(“don’t overwrite user changes”)

(2) if $A(p) \neq A'(p)$, then $A'(p) = B(p)$

if $B(p) \neq B'(p)$, then $B'(p) = A(p)$

(“only change replicas by (completely) propagating user changes”)

(3) if there is a conflict at $p$, then $A'(p) = A(p)$ and $B'(p) = B(p)$

(“don’t change (at or below) conflicting paths”)

A synchronizer implementation is correct if all its runs are acceptable.
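One possible reading of the three conditions, written as an executable check. Everything here is an assumption made for illustration: filesystems are modelled as dictionaries from path to contents, `o` stands for the filesystem as of the last synchronization, `a` and `b` are the two replicas, `a2` and `b2` are the outputs, and "at or below" is read as "any entry whose path equals `p` or starts with `p/`". The function names are hypothetical.

```python
from typing import Dict, Optional

FS = Dict[str, str]  # assumed model: path -> contents


def at(fs: FS, p: str) -> Optional[str]:
    """Contents at path p, or None if the path is absent."""
    return fs.get(p)


def below(fs: FS, p: str) -> FS:
    """All entries at or below path p."""
    prefix = p.rstrip("/") + "/"
    return {q: c for q, c in fs.items() if q == p or q.startswith(prefix)}


def conflict(o: FS, a: FS, b: FS, p: str) -> bool:
    """Replicas differ at p, and both have been changed at (or below) p."""
    return (at(a, p) != at(b, p)
            and below(a, p) != below(o, p)
            and below(b, p) != below(o, p))


def acceptable(o: FS, a: FS, b: FS, a2: FS, b2: FS) -> bool:
    """Check conditions (1)-(3) for every path that occurs anywhere."""
    paths = set(o) | set(a) | set(b) | set(a2) | set(b2)
    for p in paths:
        # (1) don't overwrite user changes
        if at(a, p) != at(o, p) and at(a2, p) != at(a, p):
            return False
        if at(b, p) != at(o, p) and at(b2, p) != at(b, p):
            return False
        # (2) only change replicas by propagating user changes
        if at(a, p) != at(a2, p) and at(a2, p) != at(b, p):
            return False
        if at(b, p) != at(b2, p) and at(b2, p) != at(a, p):
            return False
        # (3) don't change (at or below) conflicting paths
        if conflict(o, a, b, p):
            if below(a2, p) != below(a, p) or below(b2, p) != below(b, p):
                return False
    return True
```

Under this reading, a run that does nothing (`a2 == a` and `b2 == b`) is always acceptable, which matches the "do nothing (report a conflict)" option in the unclear cases.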


Architecture

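[Figure: a client and a server, each with its own file system (Client FS, Server FS), communicating by rpc over ssh.]

System Architecture

[Figure: components are the user interface, two update detectors (one per replica, each with its replica and archive), the reconciler, and the transport agent.]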

Performance

The RSync Algorithm

Extensions

Synchronizing Multiple replicas

We’ve treated just the two-replica case in this specification (and in our implementation).

Pairwise synchronization can be used to keep 3-5 replicas in sync. Just synchronize successive pairs in a star or ring topology.

For synchronizing more replicas, both specification and implementation can be extended straightforwardly... iff we require that all replicas participate in every synchronization.

For synchronizing many replicas, we need to deal with the fact that only a subset may participate in any given sync. Problems become significantly trickier. (Need something like version vectors.)
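To make the version-vector remark concrete, here is a minimal sketch of how two version vectors are compared. It is a generic illustration, not something Unison implements: a vector is assumed to map a replica name to the number of updates that replica has made, and the name `compare` is hypothetical.

```python
from typing import Dict

VersionVector = Dict[str, int]  # assumed model: replica name -> update count


def compare(v1: VersionVector, v2: VersionVector) -> str:
    """Classify v1 relative to v2; missing entries count as zero."""
    replicas = set(v1) | set(v2)
    le = all(v1.get(r, 0) <= v2.get(r, 0) for r in replicas)
    ge = all(v1.get(r, 0) >= v2.get(r, 0) for r in replicas)
    if le and ge:
        return "equal"
    if le:
        return "older"       # v2 has seen everything v1 has: safe to propagate v2
    if ge:
        return "newer"       # v1 has seen everything v2 has: safe to propagate v1
    return "concurrent"      # neither dominates: a conflict


print(compare({"x": 2, "y": 1}, {"x": 2, "y": 3}))
print(compare({"x": 3, "y": 1}, {"x": 2, "y": 3}))
```

The "concurrent" outcome plays the role of a conflict when only a subset of replicas takes part in a given synchronization.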


Data Synchronization

• Should fit in the formalism

• As long as the data is hierarchical

• Just extend the notion of path to records
