New Protocols for Remote File Synchronization Based on Erasure Codes
Utku Irmak
Svilen Mihaylov
Torsten Suel
Polytechnic University
Outline
Introduction and Common Applications
Problem Formalization
Contributions
An Approach Based on Erasure Codes
  A Simple Multi-Round Protocol
  An Efficient Single-Round Protocol
  A Practical Protocol Based on Erasure Codes
Implementation Overview
Preliminary Results
Conclusions
Introduction
Remote File Synchronization Problem: how to update an outdated version of a file over a network with a minimal amount of communication
When the versions are very similar, the total data transmitted should be significantly smaller than the file size
[Figure: Machine A (current version) and Machine B (outdated version)]
Common Applications
Synchronization of User Files
  Synchronization between different machines that may only be connected over a slow network (e.g., a home and a work machine)
  Both rsync and unison are widely used tools
Web and FTP Site Mirroring
  Significant similarities between successive versions
  Including sites distributing new versions of software
  rsync is widely used
Common Applications
Content Distribution Networks
  File synchronization is a natural approach for updating content replicated at the network edge
Web Access over Slow Links
  A user revisiting a webpage may already have a previous version in the browser cache
  It would be desirable to avoid transmitting the entire page again
  This idea is implemented in rproxy, which uses the rsync algorithm
Problem Formalization
We have two files (strings) over some alphabet: fnew (the current file) and fold (the outdated file)
We have two machines, C (the client) and S (the server), connected by a communication link
C only has a copy of fold, and S only has a copy of fnew
Goal: design a protocol between the two parties that results in C holding a copy of fnew, while minimizing the total communication cost
Problem Formalization
The communication cost should depend on the degree of similarity between the two files. Candidate distance measures:
  The Hamming distance
  The edit distance
  The edit distance with block moves
We focus mainly on the edit distance with block moves. We assume that each block move operation adds 3 to the distance, while every other operation adds 1
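As a small illustration of this cost model, the sketch below tallies the cost of a given edit script, charging 3 per block move and 1 per other operation. The operation names (`"insert"`, `"delete"`, `"replace"`, `"move"`) are hypothetical labels chosen for this example; the slides do not name them. The distance itself is the minimum such cost over all scripts that transform one file into the other, so the cost of any particular script is only an upper bound.

```python
# Cost model from the slide: a block move adds 3, any other operation adds 1.
BLOCK_MOVE_COST = 3
OTHER_COST = 1


def script_cost(ops):
    """Return the total cost of an edit script given as a list of operation names.

    The names are illustrative assumptions; only "move" is treated specially.
    """
    return sum(BLOCK_MOVE_COST if op == "move" else OTHER_COST for op in ops)


# One insertion, one deletion and one block move: 1 + 1 + 3.
example_script = ["insert", "delete", "move"]
print(script_cost(example_script))
```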
Problem Formalization
We focus on single-round protocols between client and server
Single-round protocols can be more easily integrated into existing tools that currently rely on rsync
Multiple rounds are undesirable in many scenarios involving small files or large latencies
Multi-round protocols can introduce other complications due to state that may have to be kept at the server for best performance
Assumptions
The collection consists of unstructured files
We are not concerned with issues of consistency between synchronization steps
We consider a simple two-party scenario where it is known which files need to be updated and which version is the current one
Contributions
We describe a new approach to single-round file synchronization based on erasure codes
We derive a protocol that communicates at most O(k lg(n) lg(n/k)) bits on files whose edit distance with block moves is at most k
We derive another, practical algorithm and an optimized implementation that achieves very promising improvements over rsync
A Simple Multi-Round Protocol
The protocol runs in a number of rounds
In the first round, the server partitions the file into blocks of size bmax and sends a hash (MD5) for each block
The client attempts to match the received hashes against all possible alignments in the outdated file
The client responds with a bit vector that tells the server which of the hashes were matched
The server repeats the process, with smaller blocks, for the blocks whose hashes did not find a match
Once block size bmin is reached, the server sends all the unmatched blocks
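The rounds described above can be simulated in a single process. The following is a minimal sketch under stated assumptions, not the authors' implementation: unmatched blocks are split in half each round, the client checks alignments by brute force rather than with a rolling hash, and the function name `multi_round_sync`, its parameters, and the byte accounting are all hypothetical.

```python
import hashlib


def block_hash(data: bytes, hash_bytes: int = 16) -> bytes:
    """MD5 of a block, optionally truncated to hash_bytes bytes."""
    return hashlib.md5(data).digest()[:hash_bytes]


def multi_round_sync(f_old: bytes, f_new: bytes, b_max: int, b_min: int,
                     hash_bytes: int = 16):
    """Simulate the protocol; return (file rebuilt by the client, bytes sent)."""
    n = len(f_new)
    recovered = {}  # offset in f_new -> block content known to the client
    sent = 0        # rough count of bytes exchanged in both directions
    size = b_max
    pending = [(off, min(size, n - off)) for off in range(0, n, size)]
    while pending:
        if size <= b_min:
            # Minimum block size reached: server sends unmatched blocks verbatim.
            for off, length in pending:
                recovered[off] = f_new[off:off + length]
                sent += length
            break
        # Server -> client: one hash per pending block.
        hashes = [block_hash(f_new[off:off + length], hash_bytes)
                  for off, length in pending]
        sent += len(hashes) * hash_bytes
        # Client: hash every alignment of f_old for each block length in play.
        index = {}
        for length in {length for _, length in pending}:
            for pos in range(len(f_old) - length + 1):
                key = (length, block_hash(f_old[pos:pos + length], hash_bytes))
                index[key] = pos
        # Client -> server: one bit per hash (matched or not).
        sent += (len(pending) + 7) // 8
        next_pending = []
        for (off, length), h in zip(pending, hashes):
            pos = index.get((length, h))
            if pos is not None:
                recovered[off] = f_old[pos:pos + length]
            else:
                half = length // 2
                for part in ((off, half), (off + half, length - half)):
                    if part[1] > 0:
                        next_pending.append(part)
        pending = next_pending
        size //= 2
    return b"".join(recovered[off] for off in sorted(recovered)), sent


old = bytes(range(256)) * 4
new = old[:500] + b"XYZ" + old[500:]  # a single small insertion
rebuilt, cost = multi_round_sync(old, new, b_max=128, b_min=8)
print(rebuilt == new, cost, len(new))
```

The brute-force alignment search keeps the sketch short but is slow on large files; a practical tool would use a rolling checksum for that step.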
A Simple Multi-Round Protocol
Given two files with edit distance with block moves of k, suppose we choose:
  bmax = the next power of 2 smaller than n/k
  bmin = lg(n)
  hash size = 4 lg(n) bits
Lemma: If we partition fnew into some number of blocks, then at most k of these blocks do not occur in fold
  Hence, on each level, at most k hashes do not find a match
Then the algorithm transmits at most O(k lg(n) lg(n/k)) bits and correctly updates the file with probability at least 1-1/n
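To make these parameter choices concrete, the sketch below evaluates them for given n and k and derives a rough cap on the hash traffic. It rests on assumptions that go beyond the slide: "next smaller power of 2" is read as the largest power of 2 not exceeding n/k, lg(n) is rounded up to an integer, each level is assumed to carry at most 2k hashes (the children of at most k unmatched blocks), and the function name `protocol_parameters` is hypothetical.

```python
import math


def protocol_parameters(n: int, k: int) -> dict:
    """Evaluate the slide's parameter choices for file size n and distance bound k.

    Assumes n >= 2 and 1 <= k <= n.
    """
    lg_n = math.ceil(math.log2(n))
    b_max = 1 << ((n // k).bit_length() - 1)  # largest power of 2 <= n/k
    b_min = lg_n
    hash_bits = 4 * lg_n
    # Count the levels on which hashes are sent (block size still above b_min).
    levels = 0
    size = b_max
    while size > b_min:
        levels += 1
        size //= 2
    # Rough cap: at most 2k hashes of hash_bits bits on each level.
    hash_bits_bound = 2 * k * hash_bits * levels
    return {
        "b_max": b_max,
        "b_min": b_min,
        "hash_bits": hash_bits,
        "levels": levels,
        "hash_bits_bound": hash_bits_bound,
    }


# A 1 MiB file whose versions differ by at most 16 operations.
print(protocol_parameters(2**20, 16))
```

The number of levels is about lg(n/k) and each hash has O(lg(n)) bits, which is where the O(k lg(n) lg(n/k)) shape of the bound comes from.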
An Efficient Single-Round Protocol
First, we define the complete multi-round algorithm, which sends hashes for all blocks
Second, we briefly describe systematic erasure codes
Erasure Code
An erasure code encodes k source data items of size s into n > k encoded items of the same size s
If up to n-k of the encoded items are lost, the source items can still be recovered from the remaining ones
A systematic erasure code is one where the encoded data items consist of the k source items plus n-k additional items
Figure by Luigi Rizzo
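One way to see how a systematic erasure code can work is a small Reed-Solomon-style construction over a prime field. The slides do not say which code is used, so the following is only an illustrative sketch: items are integers below the prime `P`, the k source items are treated as values of a polynomial of degree below k, and the names `encode`, `decode` and `interpolate_at` are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
P = 2**31 - 1  # a prime; every data item must be an integer in [0, P)


def interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(source, n):
    """Systematic encoding: the k source items followed by n - k extra items."""
    k = len(source)
    points = list(enumerate(source))  # source item i is the value at x = i
    return list(source) + [interpolate_at(points, x) for x in range(k, n)]


def decode(received, k):
    """Recover the k source items from any k surviving items.

    received maps the index of each surviving encoded item to its value.
    """
    if len(received) < k:
        raise ValueError("fewer than k items survived")
    points = list(received.items())[:k]
    return [interpolate_at(points, x) for x in range(k)]


source = [10, 20, 30, 40]                 # k = 4 source items
coded = encode(source, 7)                 # n = 7 encoded items
survivors = {i: coded[i] for i in (1, 4, 5, 6)}  # items 0, 2 and 3 were lost
print(decode(survivors, 4) == source)
```

Because the code is systematic, a receiver that loses nothing can read the source items directly from the first k positions without any decoding.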
An Efficient Single-Round Protocol
Any hash value sent in the complete multi-round algorithm that would not be sent in the simple multi-round algorithm is not transmitted
An Efficient Single-Round Protocol
Any hash value that would be sent by the simple multi-round algorithm is also not sent to the client, but considered lost
An Efficient Single-Round Protocol
On each level there can be at most 2k lost blocks
The client can recreate the entire level of hashes by using the 2k erasure hashes to recover the lost hashes
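The recovery step for one level can be sketched as follows. This is an illustration under assumptions, not the protocol's actual encoding: hashes are truncated to 3 bytes so they fit in a small prime field, the erasure hashes are extra evaluations of the polynomial through the level's hashes, and the names `erasure_hashes`, `recover_level` and `small_hash` are hypothetical. The client is assumed to already know the hashes of blocks it could compute itself, and to treat the remaining ones as lost.

```python
import hashlib

P = 2**31 - 1  # prime field size; hash values must be below P


def interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def small_hash(block: bytes) -> int:
    """A 24-bit hash value (truncated MD5), small enough to fit in the field."""
    return int.from_bytes(hashlib.md5(block).digest()[:3], "big")


def erasure_hashes(level_hashes, m):
    """Server side: m extra items computed from all hashes of one level."""
    count = len(level_hashes)
    points = list(enumerate(level_hashes))
    return [interpolate_at(points, x) for x in range(count, count + m)]


def recover_level(known, extra, count):
    """Client side: rebuild all count hashes of a level.

    known maps position -> hash for the hashes the client already has;
    extra is the list of erasure hashes received from the server.
    """
    points = list(known.items()) + [(count + i, y) for i, y in enumerate(extra)]
    if len(points) < count:
        raise ValueError("more hashes lost than erasure hashes received")
    return [interpolate_at(points[:count], x) for x in range(count)]


f_new = b"0123456789abcdefghijklmnopqrstuv"
level = [small_hash(f_new[i:i + 4]) for i in range(0, len(f_new), 4)]
k = 1
extra = erasure_hashes(level, 2 * k)   # server sends 2k erasure hashes
lost = {2, 5}                          # positions the client cannot compute itself
known = {i: h for i, h in enumerate(level) if i not in lost}
print(recover_level(known, extra, len(level)) == level)
```

The server never needs to learn which positions are lost, which is what lets the whole exchange fit in a single message.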
An Efficient Single-Round Protocol
Theorem: Given a bound k on the edit distance between fold and fnew, the erasure-based file synchronization algorithm correctly updates fold to fnew with probability at least 1-1/n, using a single message of O(k lg(n) lg(n/k)) bits
We note that there are highly efficient single-message protocols for estimating the file distance k
Another property of the protocol is that, by broadcasting a single message, the current version can be communicated to several clients that have different outdated versions
A Practical Protocol Based on Erasure Codes
The previous protocol has two main shortcomings:
  The protocol requires us to estimate an upper bound on the file distance k. An underestimate would make recovery impossible at the client
  More importantly, the algorithm does not support compression of unmatched literals
To address these problems, we design another erasure-based algorithm that works better in practice
A Practical Protocol Based on Erasure Codes
The hashes are sent from client to server
For level i, m_i erasure hashes are sent
The server identifies the common blocks and then sends the unmatched literals in compressed form
Implementation Overview
We included three additional optimizations over rsync:
1) We replace the gzip algorithm used for transmission of the unmatched literals and match tokens with an optimized delta compressor
  The server now transmits the resulting delta and a bit vector to allow the client to create the same reference file
Implementation Overview
3) We integrate decomposable hashes:
This technique allows the hash of a child block to be computed from the hashes of its parent and sibling, halving the number of erasure hashes transmitted
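The decomposable-hash property can be illustrated with a simple polynomial hash. The slides do not specify the hash function, so this is a hypothetical example rather than the one used in the implementation: with H(s) computed as a polynomial in a base `B` modulo a prime `M`, the hash of a parent block equals H(left) * B^len(right) + H(right), so either child hash follows from the parent hash and the sibling hash. Python 3.8 or later is assumed for the modular inverse via `pow`.

```python
M = (1 << 61) - 1  # a Mersenne prime used as the modulus (illustrative choice)
B = 257            # base of the polynomial hash (illustrative choice)


def poly_hash(data: bytes) -> int:
    """Polynomial hash: sum of data[i] * B^(len - 1 - i), taken mod M."""
    h = 0
    for byte in data:
        h = (h * B + byte) % M
    return h


def right_child_hash(h_parent: int, h_left: int, right_len: int) -> int:
    """Hash of the right child from the hashes of its parent and left sibling."""
    return (h_parent - h_left * pow(B, right_len, M)) % M


def left_child_hash(h_parent: int, h_right: int, right_len: int) -> int:
    """Hash of the left child from the hashes of its parent and right sibling."""
    inverse_shift = pow(pow(B, right_len, M), -1, M)
    return (h_parent - h_right) * inverse_shift % M


parent = b"file synchronization"
left, right = parent[:10], parent[10:]
h_parent = poly_hash(parent)
print(right_child_hash(h_parent, poly_hash(left), len(right)) == poly_hash(right))
print(left_child_hash(h_parent, poly_hash(right), len(right)) == poly_hash(left))
```

With such a hash, only one hash per pair of sibling blocks has to be communicated once the parent's hash is known, which is consistent with the halving mentioned above.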
2) We make a better choice of the number of bits per hash:
We assume some upper bound on the probability of a collision, say 1/2^d for some d, and then use lg(n)+lg(y)+d bits per hash, where
n is the file size
y is the total number of hashes sent from client to server
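A quick worked example of this formula: the sketch below evaluates lg(n)+lg(y)+d for concrete values. Rounding each logarithm up to a whole bit is an assumption of this example, and the function name `bits_per_hash` is hypothetical. A plausible reading of the formula is a union bound: each of the y hashes may be compared against up to n positions, so roughly n*y comparisons must each collide with probability at most 2^-(lg n + lg y + d) for the total to stay near 2^-d.

```python
import math


def bits_per_hash(n: int, y: int, d: int) -> int:
    """Bits per hash for file size n, y hashes in total, and collision bound 1/2^d.

    Each logarithm is rounded up to a whole number of bits (an assumption).
    """
    return math.ceil(math.log2(n)) + math.ceil(math.log2(y)) + d


# A 1 MiB file, 1024 hashes sent, collision probability bound of 1/2^10.
print(bits_per_hash(2**20, 2**10, 10))  # 20 + 10 + 10 = 40
```

For comparison, a full MD5 digest is 128 bits, so choosing the length this way can shrink each transmitted hash considerably.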
Preliminary Results
For the experiments we used the gcc and emacs datasets, consisting of versions 2.7.0 and 2.7.1 of gcc and versions 19.28 and 19.29 of emacs
Conclusions
We have described a new approach to remote file synchronization based on erasure codes
Using this approach, we derived a single-round protocol that is feasible and communication-efficient with respect to a common file distance measure