
2258 IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 64, NO. 6, JUNE 2016

Synchronizing Files From a Large Number of Insertions and Deletions

Frederic Sala, Student Member, IEEE, Clayton Schoeny, Student Member, IEEE, Nicolas Bitouzé, Student Member, IEEE, and Lara Dolecek, Senior Member, IEEE

Abstract— Developing efficient algorithms to synchronize between different versions of files is an important problem with numerous applications. We consider the interactive synchronization protocol introduced by Yazdi and Dolecek, based on an earlier synchronization algorithm by Venkataramanan et al. Unlike preceding synchronization algorithms, Yazdi and Dolecek's algorithm is specifically designed to handle a number of deletions linear in the length of the file. We extend this algorithm in three ways. First, we handle nonbinary files. Second, these files contain symbols chosen according to nonuniform distributions. Finally, the files are modified by both insertions and deletions. We take into consideration the collision entropy of the source and refine the matching graph developed by Yazdi and Dolecek by appropriately placing weights on the matching graph edges. We compare our protocol with the widely used synchronization software rsync, and with the synchronization protocol by Venkataramanan et al. In addition, we provide tradeoffs between the number of rounds of communication and the total amount of bandwidth required to synchronize the two files under various implementation choices of the baseline algorithm. Finally, we show the robustness of the protocol under imperfect knowledge of the properties of the edit channel, which is the expected scenario in practice.

Index Terms— Two-way communication, deletion channel, insertions and deletions, synchronization, edits, coding for synchronization, rsync, practical protocols.

I. INTRODUCTION

CONSIDER two users, A and B, with ownership of files X and Y, respectively. File Y is a modified version of file X, where the modifications are modeled by symbol insertions and deletions. For example, let

X = 101ᴰ10101ᴰ0ᴰ1001 and Y = 11ᴵ0101010010ᴵ.

Manuscript received October 2, 2014; revised April 7, 2015, October 30, 2015, and March 30, 2016; accepted March 31, 2016. Date of publication April 8, 2016; date of current version June 14, 2016. This research was supported in part by the NSF GRFP and NSF grants CCF-1162501 and CCF-1527130. This paper was presented at the IEEE International Symposium on Information Theory, Istanbul, Turkey, July 2013 [1] and the IEEE Allerton Conference on Communications, Control, and Computing, Monticello, IL, USA, October 2013 [2]. The associate editor coordinating the review of this paper and approving it for publication was T. M. Duman.

F. Sala, C. Schoeny, and L. Dolecek are with the Electrical Engineering Department, University of California at Los Angeles, Los Angeles, CA 90095 USA (e-mail: [email protected]; [email protected]; [email protected]).

N. Bitouzé is with the Department of Electronics, TELECOM Bretagne, École Nationale Supérieure des Télécommunications de Bretagne, Plouzané 29200, France, and also with the Laboratory for Robust Information Systems, University of California at Los Angeles, Los Angeles, CA 90095 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCOMM.2016.2552175

Here, file Y was formed by deleting three symbols from file X and inserting two symbols into the resulting string. The deleted symbols are marked D and insertions are marked I.

In this work, our goal is to develop efficient algorithms that allow user B to synchronize file Y to file X. That is, we wish to introduce a protocol that user B may use to reconstruct file X from file Y with suitably low probability of error. We allow for interactive communication: users A and B may communicate back and forth. Naturally, we seek to minimize the overall communication bandwidth, measured by the total amount of information exchanged by the two users, while simultaneously maintaining a low probability of reconstruction error. We refer to such algorithms as synchronization algorithms. Observe that synchronizing strings can be viewed as a special case of the general problem of object reconciliation [3].

Synchronization algorithms find numerous areas of use, including data storage, file sharing, source code control systems, and cloud applications. For example, cloud storage services such as Dropbox synchronize between local copies and cloud backups each time users make changes to local versions. Similarly, synchronization tools are necessary in mobile devices. Specialized synchronization algorithms are used for video and sound editing. Synchronization tools are also capable of performing data deduplication. We also note that synchronization protocols may be applied to the problem of DNA sequencing, potentially improving on traditional approaches such as "shotgun" sequencing.

A. Prior Work

The earliest literature on the synchronization channel did not explicitly deal with interactive communication. These initial works focused on studying one-way codes capable of correcting insertions and deletions. The two seminal works are by Dobrushin [4] and Levenshtein [5], which study the synchronization channel from the information-theoretic and coding-theoretic points of view, respectively. In [5], Levenshtein showed that a class of codes called Varshamov-Tenengolts (VT) codes are capable of correcting a single insertion or a single deletion. Such codes were originally introduced to correct asymmetric errors, rather than synchronization errors. For the single insertion or single deletion setting, the family of VT codes with checksum parameter a = 0 is known to have asymptotically optimal rate (as the code length goes to infinity) and is conjectured to be optimal in all cases. A non-binary generalization of

0090-6778 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


SALA et al.: SYNCHRONIZING FILES FROM A LARGE NUMBER OF INSERTIONS AND DELETIONS 2259

the VT codes which corrects a single symbol insertion or deletion was introduced by Tenengolts in [6].
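To make the single-deletion guarantee concrete, the following sketch (an illustration, not the decoding algorithm used in the paper) checks the VT checksum Σ i·xᵢ mod (n+1) and corrects one deletion by brute force. Since no two distinct codewords of the same VT class can produce the same shortened string, any reinsertion whose checksum matches must be the transmitted codeword.

```python
def vt_checksum(x):
    """VT checksum: sum of i * x_i over 1-indexed positions, mod (n + 1)."""
    return sum((i + 1) * b for i, b in enumerate(x)) % (len(x) + 1)

def vt_correct_deletion(y, n, a=0):
    """Recover the unique length-n codeword of VT_a(n) from which one bit
    of y was deleted, by trying every possible reinsertion."""
    assert len(y) == n - 1
    for pos in range(n):
        for b in (0, 1):
            cand = y[:pos] + [b] + y[pos:]
            if vt_checksum(cand) == a:
                return cand          # uniqueness of VT decoding makes this safe
    return None

# Codeword in VT_0(8): checksum 1*1 + 8*1 = 9 ≡ 0 (mod 9).
x = [1, 0, 0, 0, 0, 0, 0, 1]
y = x[:3] + x[4:]                    # delete the fourth bit
assert vt_correct_deletion(y, 8, 0) == x
```

Brute force costs O(n) candidate checks; Levenshtein's actual decoder is linear-time as well, but the exhaustive version above is the shortest way to see the uniqueness property at work.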

Finding codes capable of correcting more than one synchronization error is a notoriously difficult problem. Constructions are given in, for example, [7] and [8]. However, such codes have low rates. For example, the rate of the s-deletion-correcting Helberg code [7] is lower than that of the s-deletion-correcting repetition code. More results are available if certain restrictions are imposed on the types of synchronization errors due to the channel. For example, if errors are always repetitions (insertions of symbols identical to the previous symbol), Dolecek and Anantharam [9] present a code construction based on a generalization of the VT codes which is rate-optimal. More information about such results is available in surveys on the synchronization channel [10], [11].

Another problem of interest is to develop codes which are effective against insertions and deletions occurring with a certain probability (rather than codes which guarantee the correction of a prescribed number of errors). Such codes, introduced in [12], are based on a concatenated scheme using a nonlinear inner ("watermark") code and a non-binary low-density parity-check (LDPC) outer code. Improvements to the scheme were suggested in [13].

All of the previously described works have focused on the one-way communication setting. For the interactive communication setting, Orlitsky [14] gave fundamental bounds on the rate and complexity for a given edit distance. This work was followed by a series of papers such as [15], [16], and [17], which give explicit protocols for interactive communication. Recently, a work by Venkataramanan et al. [18] developed a low-complexity scheme based on a divide-and-conquer strategy. The idea in [18] is to divide the edited file into substrings sufficiently short to ensure that each substring contains only a single edit. By edit, here and throughout the remainder of the paper, we refer to insertions and deletions. In this case, the VT codes may be applied to correct the synchronization error. It was shown that for the case where the total number of edits is fixed, the scheme is order-optimal. The authors generalize their results in [19], where the protocol is modified to deal with more general types of errors (e.g., bursts, substitution errors) and limited rounds of communication.

We are interested in the case where the number of edits is proportional to the length of the file. This is the setting we observe in typical applications. In a recent work by Ma et al. [20], achievability bounds were given on the overhead necessary for synchronization from deletions. The setting here is for a number of edits linear in the file length and where the deletions may be viewed as the output of a Markov process. In particular, the choice of whether or not to delete a particular bit follows a two-state process (delete and do not delete), thus modeling burst deletions. In our previous work [21], we proposed a practical interactive protocol for synchronization in the presence of a large number of edits. The proposed protocol is order-optimal and has polynomial complexity. In [21], the synchronization protocol was introduced for the case where the source file X is binary and uniformly distributed, and may only be affected by deletions. Other works dealing with this

setting include [22], which develops an algorithm for PDA synchronization using fast set reconciliation, and [23], which deals with interactive communications and coding for a very large number of symbol changes.

In addition, there exist practical synchronization tools not based on coding-theoretic ideas. One of the most popular is rsync, a UNIX-based synchronization tool [24] built on an algorithm which uses two hashes, one strong and one weak. The weaker hash is an easy-to-compute "rolling" hash. The edited file is split into segments of length k, and the rolling hash is applied to each such segment. The rolling hash is also applied to all consecutive segments of length k in the original file, and these hashes are compared in order to "match" segments in the original and edited files. Matching segments are checked for equality using the more powerful hash. If this hash fails, the entire segment is sent to the edited version of the file.
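The "rolling" property is what makes checking every offset affordable: the weak hash of the next window is computed from the previous one in O(1). The sketch below is a simplified rsync-style two-part checksum in the spirit of Adler-32 (an illustration only, not the actual rsync implementation).

```python
MOD = 1 << 16

def weak_hash(block):
    """Two-part weak checksum over a byte window (rsync-style)."""
    s1 = sum(block) % MOD
    # s2 weights earlier bytes more, so it is position-sensitive.
    s2 = sum((len(block) - i) * b for i, b in enumerate(block)) % MOD
    return (s1, s2)

def roll(h, out_byte, in_byte, k):
    """Slide a length-k window one byte right in O(1):
    drop out_byte from the front, append in_byte at the back."""
    s1, s2 = h
    s1 = (s1 - out_byte + in_byte) % MOD
    s2 = (s2 - k * out_byte + s1) % MOD
    return (s1, s2)

data = b"synchronization"
k = 4
h = weak_hash(data[0:k])
for i in range(1, len(data) - k + 1):
    h = roll(h, data[i - 1], data[i - 1 + k], k)
    assert h == weak_hash(data[i:i + k])   # rolled hash matches a fresh hash
```

Because the weak hash is cheap but collision-prone, rsync confirms each weak match with a strong hash before trusting it, exactly as described above.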

B. Our Contributions

In the present work, we further develop and evaluate the protocol we introduced in [21]. We generalize the protocol to allow for insertions and deletions rather than just deletions. Furthermore, we let source symbols be non-binary and we allow for general i.i.d. source distributions (rather than the uniform distribution we worked with in [21]). We also evaluate the performance of our algorithm, comparing it against the method introduced in [18] and against practical tools such as rsync.

The remainder of the paper is organized as follows. Section II formalizes our problem statement and describes the structure of the proposed protocol. Sections III and IV respectively focus on the two main building blocks of our protocol: the matching module and the edit recovery module. Finally, Section V compares our protocol with the one introduced in [18] and with the rsync tool, discusses modifications that can improve the performance of our protocol, and shows the robustness of our protocol against imperfect knowledge of the parameters of the channel.

II. BACKGROUND AND PROTOCOL OVERVIEW

In this section we review the protocol introduced in our previous works [1], [2], and [21].

A. Preliminaries

We introduce some notation and tools that we need to describe and prove our results. Throughout this paper, all logs are base 2. We denote by X_t^t′ the substring X_t, ..., X_t′ of X from indices t to t′. Because we no longer consider only uniform sources, we need an information-theoretic tool to measure the likelihood of two independent substrings being equal (our protocol relies on matching pivots in X and Y):

Definition 1 (Rényi [25]): The collision entropy of a discrete random variable Z ∼ μ(z) is defined by

H₂(Z) ≜ −log E(μ(Z)) = −log Σ_z μ(z)².  (1)


The collision entropy is related to the probability that there is a collision between two i.i.d. samples Z, Z′ ∼ μ(z) by Pr{Z = Z′} = 2^−H₂(Z). In our context, because X is i.i.d., H₂(X_t) does not depend on t, and we therefore write it as H₂. Because X is i.i.d., the collision probability between two distinct substrings of X of equal length l is Pr{X_t^{t+l−1} = X_t′^{t′+l−1}} = 2^{−l·H₂}.
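These two identities are easy to check numerically. The snippet below (a small illustration; the 4-ary source distribution is hypothetical) computes H₂ and verifies that the collision probability of one symbol is 2^−H₂, and of an l-symbol substring is 2^{−l·H₂}.

```python
import math

def collision_entropy(mu):
    """H2(Z) = -log2( sum_z mu(z)^2 ) for a discrete distribution mu."""
    return -math.log2(sum(p * p for p in mu))

# Example (hypothetical) source distribution over a 4-ary alphabet.
mu = [0.5, 0.25, 0.125, 0.125]
H2 = collision_entropy(mu)

# Single-symbol collision probability Pr{Z = Z'} for i.i.d. Z, Z' ~ mu.
p_collision = sum(p * p for p in mu)
assert abs(2 ** (-H2) - p_collision) < 1e-12

# For i.i.d. substrings of length l, independence across positions
# gives Pr{collision} = p_collision^l = 2^(-l * H2).
l = 5
assert abs(p_collision ** l - 2 ** (-l * H2)) < 1e-12
```

For the uniform binary source of [21], H₂ = 1 bit, recovering the familiar 2^−l matching probability; skewed sources have smaller H₂, so pivots must be longer to remain distinctive.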

The edit channel is a channel where the input string may be affected by insertions (where spurious symbols are inserted into the input string) and deletions (where input symbols are removed).

Let X be the length-n input to the edit channel and Y be its output. We describe the effects of the edit channel on input string X through the edit pattern. The edit pattern E = E₁, E₂, ..., E_r is defined so that the output Y is obtained from X as follows, where j indexes the current file symbol (initially, j = 1). For 1 ≤ t ≤ r, where n ≤ r < ∞,

• If E_t = 0, X_j is transmitted and the process moves on to symbol X_{j+1},

• If E_t = −1, X_j is deleted and the process moves on to symbol X_{j+1},

• If E_t = 1, a new symbol drawn from the alphabet with distribution μ(x) is inserted.

To see how this works, take X and Y defined over the alphabet {0, 1, 2, 3} with

X = 00ᴰ122133ᴰ10 and Y = 012 01ᴵ 2 3ᴵ 1 0310ᴵ 310.

We see that Y is derived from X by deleting two symbols and inserting seven: the deleted symbols of X are marked D, and the inserted runs of Y are marked I. Here, an edit pattern describing these changes is

E = (0, −1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, −1, 0, 0).
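The edit-pattern semantics can be checked mechanically. The sketch below replays E on X (an illustration; in the actual channel inserted symbols are drawn randomly from μ, while here they are supplied explicitly so that the worked example above is reproduced exactly).

```python
def apply_edit_pattern(x, pattern, inserted):
    """Replay an edit pattern on input x:
    0 = transmit current symbol, -1 = delete it, 1 = insert next symbol
    from `inserted` (deterministic stand-in for a draw from mu)."""
    ins = iter(inserted)
    j, y = 0, []
    for e in pattern:
        if e == 0:
            y.append(x[j]); j += 1      # transmit, advance in x
        elif e == -1:
            j += 1                      # delete, advance in x
        else:
            y.append(next(ins))         # insert, do not advance in x
    return y

X = [0, 0, 1, 2, 2, 1, 3, 3, 1, 0]
E = [0, -1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, -1, 0, 0]
Y = apply_edit_pattern(X, E, inserted=[0, 1, 3, 0, 3, 1, 0])
assert Y == [0, 1, 2, 0, 1, 2, 3, 1, 0, 3, 1, 0, 3, 1, 0]
```

Note that E has r = 17 entries for n = 10 input symbols: each of the 10 symbols consumes one 0 or −1 entry, and the seven insertions each add an extra 1 entry.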

We also need the following concentration theorem:

Theorem 1 (Hoeffding [26]): Consider i.i.d. random variables Z₁, Z₂, ..., Z_l that take values in an interval of length I. Let the expected value of the random variables E(Z_j) be M. Then, for every ε > 0,

Pr{ |Σ_{j=1}^{l} Z_j − lM| ≥ lε } ≤ 2 exp(−2lε²/I²).  (2)
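As a quick sanity check of the bound (a simulation sketch with illustrative parameters, not part of the paper), the following compares the empirical deviation frequency of a Bernoulli(1/2) sum against the right-hand side of (2).

```python
import math
import random

def hoeffding_bound(l, eps, interval):
    """Right-hand side of (2): 2 * exp(-2 * l * eps^2 / I^2)."""
    return 2 * math.exp(-2 * l * eps ** 2 / interval ** 2)

rng = random.Random(0)
l, eps, M = 200, 0.2, 0.5          # Z_j ~ Bernoulli(1/2), so I = 1, M = 1/2
trials = 2000
deviations = sum(
    abs(sum(rng.random() < M for _ in range(l)) - l * M) >= l * eps
    for _ in range(trials)
)
# Empirical frequency of a large deviation should not exceed the bound.
assert deviations / trials <= hoeffding_bound(l, eps, 1.0)
```

For these parameters the bound is 2e^−16 ≈ 2·10⁻⁷, which is why the paper can afford to union-bound over many pivots and segments while keeping the total failure probability exponentially small.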

B. Problem Statement

The problem that we seek to address is the following: Let user A own a file X of length |X| = n with symbols from the alphabet X drawn according to the distribution μ(x). Let user B own file Y which is the output of an edit channel with input X. Here, insertions and deletions occur with probabilities βᵢ and β_d, respectively. By this, we mean that the probability of entering the insertion state is βᵢ and the probability of entering the deletion state is β_d in Fig. 1. The inserted symbols are drawn with the same distribution μ(x) as the symbols in X. Our goal is to create a communication protocol operating between users A and B through a two-way, error-free channel so that B is able to recover file X with a negligible error probability for small β = βᵢ + β_d and large n.

For our theoretical results, we will use the following conventions throughout the remainder of the paper. First, we assume that the source collision entropy H₂ is a fixed positive constant.

Fig. 1. Edit channel for our setup. The states in the process are the n original file symbols. A file symbol may experience a symbol insertion with probability βᵢ. In this case, the process remains in the same state. The process moves on to the next state (the following file symbol) by either deleting the current file symbol (with probability β_d) or transmitting it (with probability 1 − βᵢ − β_d).

We prove our theorem for small β = βᵢ + β_d, by which we mean that there exists some β_{H₂} > 0 such that our results hold for all 0 < β < β_{H₂}. Note that there is a different range of β for each different constant H₂. Now, for each β, we require sufficiently large n. By this, we mean that there is a positive n_β (which is different for each β) such that our results hold for all n > n_β. In other words, we will prove our result for β ∈ (0, β_{H₂}) and, for each such β, for n ∈ (n_β, ∞).

We will also rely on the notation O(·), o(·), Ω(·). When such notation is applied to a function of β but not n, we use the following definitions. In any expression involving β, H₂ is treated as a constant.

• f(β) = O(g(β)) if there exist constants d > 0 and β* > 0 such that for all 0 < β < β*, |f(β)| ≤ d|g(β)|,

• f(β) = o(g(β)) if for all fixed d > 0, there exists β* > 0 such that for all 0 < β < β*, |f(β)| ≤ d|g(β)|.

When the notation is used to denote a function of n or of both n and β, β acts as a constant, and thus the notation operates solely on n. We use this convention since our results hold in an interval of n that is a function of β (that is, n ∈ (n_β, ∞)). Then, we write

• f(n) = O(g(n)) if there exist constants d > 0 and n* > 0 such that for all n > n*, |f(n)| ≤ d|g(n)|,

• f(n) = o(g(n)) if for all fixed d > 0, there exists n* > 0 such that for all n > n*, |f(n)| ≤ d|g(n)|,

• f(n) = Ω(g(n)) if and only if g(n) = O(f(n)).

Note that our final β_{H₂} and n_β are selected so that β_{H₂} < β* and n_β > n* for all the β*'s and n*'s from the O(·), o(·), Ω(·) expressions that we use. In cases where there is the potential for confusion, such as o(1), we will make clear which of β and n is being referred to.

In our initial work [21], we studied the problem under the following simplifying assumptions: a binary alphabet X = {0, 1} was used; more specifically, the file X was generated by an i.i.d. Bernoulli source with parameter 1/2. In addition, we allowed only for deletions, so that the insertion probability βᵢ = 0. Under this model, we proved the following theorem:

Theorem 2 [21]: In the binary, deletion-only, uniform distribution case (q = 2), there exists a deterministic synchronization protocol between users A and B on a two-way, error-free channel, that on average transmits O(nβ_d log(1/β_d)) bits and generates an estimate X̂ = X̂(1), ..., X̂(n) of X at user B, such that Pr{X̂(i) ≠ X(i)} ≤ 2^−Ω(n) for every 1 ≤ i ≤ n.


Fig. 2. Illustration of the synchronization protocol. The original string X is broken up into segment substrings Sᵢ, colored in green, and pivot substrings Pᵢ, shown in red. User A sends the pivot strings to the matching module, which matches them in the edited string Y as P_{i_j}. Between the matched pivots are the segments Fᵢ. It is the goal of the edit recovery module to synchronize these strings to the Sᵢ. The results are sent to the error-correcting code (ECC) decoder module, which corrects errors introduced in the first two modules and produces the final reconstructed X̂.

The result in Theorem 2 is optimal up to a multiplicative factor, as shown in [21]. This property was proven by applying the achievability bound in [20] to the given problem setting, giving a lower bound on the number of bits exchanged of n(β_d log(1/β_d) + O(β_d)). Now we generalize the preceding theorem for general alphabets X with |X| = Q and q = ⌈log₂(Q)⌉, allow for insertions, and let the X(t) be drawn according to the distribution μ(x). Recall that H₂ refers to the collision entropy of X(t). We have the resulting theorem:

Theorem 3: In the problem setting involving files selected according to i.i.d. (not necessarily uniform) distributions over arbitrary alphabets with fixed collision entropy H₂ > 0, affected by insertions and deletions, there exists a deterministic synchronization protocol between users A and B on a two-way, error-free channel, that on average transmits O((nq/H₂)(βᵢ + β_d) log(1/(βᵢ + β_d))) bits and generates an estimate X̂ = X̂(1), ..., X̂(n) of X at user B, such that Pr{X̂(i) ≠ X(i)} ≤ 2^−Ω(n) for every 1 ≤ i ≤ n.

Let us comment on this result. First, as previously discussed, our result is for β = βᵢ + β_d in a regime (0, β_{H₂}) such that β is sufficiently small compared to H₂. In particular, β log(1/β) ≪ H₂. The required number of bits to send the entire file is given by O(nq). For large n and in the regime of interest, our result significantly improves on this naive approach, since the (β/H₂) log(1/β) term in the bandwidth of our algorithm is very small.
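To see how small this factor is, consider illustrative numbers (hypothetical parameters chosen for this sketch, not the paper's experimental values):

```python
import math

def bandwidth_ratio(beta, H2):
    """(beta / H2) * log2(1 / beta): the protocol's bandwidth relative to
    the naive O(nq) cost of simply sending the entire file."""
    return (beta / H2) * math.log2(1 / beta)

# A 1% total edit rate with H2 = 1 (e.g., a uniform binary source):
r = bandwidth_ratio(0.01, 1.0)
assert r < 0.07                           # roughly 6.6% of the naive cost

# The ratio shrinks as beta -> 0, even though log(1/beta) grows:
assert bandwidth_ratio(0.001, 1.0) < r
```

The ratio also scales as 1/H₂: a low-entropy (highly skewed) source makes pivots harder to match uniquely, inflating the bandwidth accordingly.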

We will prove Theorem 3 in two steps. First, we exhibit a synchronization protocol with the desired characteristics. The following subsection (II-C) is dedicated to this task. Afterwards, we derive a result (Lemma 1) which computes the bandwidth of the protocol and its final error probability. Lemma 1 will depend on a more involved series of results, which we introduce in Sections III and IV.

C. Protocol Description

In the following we briefly explain the protocol (extending the algorithm introduced in [21]) which satisfies the requirements of Theorem 3. Our protocol consists of three modules, as shown in Figure 2.

We divide the string X into segment substrings Sᵢ and pivot substrings Pᵢ:

X = S1, P1, S2, P2, . . . , Sk−1, Pk−1, Sk .

Segment substrings have length L_S, which is selected to be significantly larger than the length of the pivot substrings L_P, and in such a way that the expected number of edits within L_S symbols is on the order of 1. Pivot strings, on the other hand, are selected to be short enough that with high probability they do not contain any edits, but long enough to ensure that there are very few copies of the pivots in the overall string. In particular, in this work, we take L_P = O((1/H₂) log(1/β)) and L_S = 1/β. In short, user A sends pivot strings to user B and user B attempts to find occurrences of these pivots in Y. After successfully matching pivots in Y, user B splits Y into substrings according to the locations of pivot strings in Y and then tries to recover from the insertions and deletions within each substring by exchanging recovery bits with user A. The insertion/deletion recovery algorithm for each segment is based on the work of Venkataramanan et al. in [18].


After the successful completion of the first two steps, user A sends the parity check bits of a systematic error-correcting code in order to correct the errors created in the first two steps. We show that the three steps of the protocol generate an estimate of X with an exponentially small error probability while exchanging the desired number of bits between users A and B. A more precise explanation of these protocol steps follows:

1) Matching Module: The purpose of the matching module is to locate substrings of Y that correspond to the pivot strings Pᵢ from X. There are three possible outcomes for each of the Pᵢ's. The first outcome is that the matching module is able to successfully find the match corresponding to Pᵢ in Y. The second outcome takes place when the matching module erroneously finds a match which does not correspond to the transmitted Pᵢ. The last outcome occurs when the matching module is not able to find a match for Pᵢ in Y.

Suppose that the matching module finds (possibly incorrect) matches for pivots P_{i_1}, ..., P_{i_{k′−1}}, where k′ ≤ k. Based on these found matches, the matching module divides string Y into substrings as follows:

Y = F₁, P_{i_1}, F₂, P_{i_2}, ..., F_{k′−1}, P_{i_{k′−1}}, F_{k′}.

The indices of matched pivots are then sent back to user A, which accordingly divides string X into

X = F̃₁, P_{i_1}, F̃₂, P_{i_2}, ..., F̃_{k′−1}, P_{i_{k′−1}}, F̃_{k′}.
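A minimal sketch of pivot matching by left-to-right scanning appears below. This is an illustration only: the paper's matching module uses a weighted matching graph rather than this greedy scan, but the scan already exhibits all three outcomes (a correct match, a spurious match, or no match at all).

```python
def find_pivot_matches(y, pivots):
    """Greedily locate each pivot as a substring of y, scanning left to
    right; returns (pivot_index, position) pairs for pivots found.
    Pivots with no occurrence to the right of the previous match are
    simply skipped (the 'no match' outcome)."""
    matches, start = [], 0
    for idx, p in enumerate(pivots):
        for pos in range(start, len(y) - len(p) + 1):
            if y[pos:pos + len(p)] == p:
                matches.append((idx, pos))
                start = pos + len(p)   # later pivots must match further right
                break
    return matches

y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
pivots = [[0, 1, 1], [2, 2], [1, 0, 0, 1]]   # the second pivot has no match
assert find_pivot_matches(y, pivots) == [(0, 2), (2, 6)]
```

The collision-entropy analysis above is exactly what controls the second outcome: a pivot of length L_P spuriously matches an unrelated window with probability about 2^{−L_P·H₂}.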

2) Edit Recovery Module: The goal of the edit recovery module is to correct all insertions and deletions in the segment substrings. The matching module has divided the problem of synchronizing the long string Y with X into multiple simpler problems of synchronizing an F_j with the corresponding F̃_j. For each of these problems, users A and B interactively communicate the necessary information for recovering from insertions and deletions within each F_j by following the synchronization protocol introduced in [18]. Note that this protocol deals with binary strings only. We therefore generalize this algorithm for non-binary strings by applying the work of Tenengolts [6]. Following this step, the module forms the string

X̂ = F̂₁, P_{i_1}, F̂₂, P_{i_2}, ..., F̂_{k′−1}, P_{i_{k′−1}}, F̂_{k′}.

Here, each F̂_j is the estimate of F̃_j obtained after correcting the edits in F_j. The resulting string is then sent to the next module. Notice that the output X̂ of the edit recovery module has the same length as X, and that furthermore, the pivots that were properly matched by the matching module are aligned in both strings: this motivates the use of "classical" coding theory to correct residual errors.

3) ECC Decoder Module: This is the final module, whose task is to correct potential errors created in the first two modules. There are two types of such errors. First, it is possible that the matching module detects a pivot Pᵢ at a wrong position in Y. If this occurs, substrings F_j and F_{j+1} most likely differ from F̃_j and F̃_{j+1}, respectively, by a very large number of edits. This is a regime in which the Venkataramanan et al. synchronization protocol is not designed to operate. Hence, F̂_j and F̂_{j+1} may be different from F̃_j and F̃_{j+1}, respectively. The other source of error is that, even between two properly matched pivots, the synchronization protocol in [18] is not error-free, and there is a small probability that hash collisions cause errors in the edit recovery module. For the sake of conciseness, we do not further describe this module and focus on the matching module and edit recovery module, making sure that the rate of the residual errors at the output of these first two modules is low enough to be corrected by a suitable code without consuming too much bandwidth.

Now we show that the total bandwidth consumed by our synchronization protocol is O((nq/H_2) β log(1/β)) and that our final error probability is 2^{−Θ(n)}. These estimates, which we prove in Lemma 1, complete the proof of Theorem 3.

Lemma 1: On average, the total number of transmitted bits of the proposed synchronization protocol is O((nq/H_2) β log(1/β)). The protocol's final estimate X̂ has error probability Pr{X̂(i) ≠ X(i)} ≤ 2^{−Θ(n)} for every 1 ≤ i ≤ n.

Proof: According to Lemma 2 in Section III, there exists a matching module that matches k' = (1 − L_P β − 2β + o(β))k pivots with at most βk pivot mismatches for L_P = (1/H_2)(20 + 6 log(1/β)), with probability 1 − 2^{−Θ(n)}. Now, with probability at most 2^{−Θ(n)}, the matching module produces many incorrect matches. However, even in cases with many bad matches, it is easy to see that the edit recovery and ECC decoder modules consume an amount of bandwidth that is polynomial in n. Since these cases have exponentially low probability, for large enough n their contribution to the average bandwidth goes to zero. Therefore, for the remainder of the discussion, we consider only the "good" cases (with many good pivot matches and few mismatches) given by Lemma 2.

The first step is to compute the number of pivots k used in the matching module. We have that

k = ⌈(n + L_P)/(L_S + L_P)⌉ = ⌈(n + (1/H_2)(20 + 6 log(1/β))) / (1/β + (1/H_2)(20 + 6 log(1/β)))⌉.

Now, since β < 1, log(1/β) = o(1/β). Moreover, since H_2 is a constant, the denominator can be replaced with 1/β + o(1/β). Similarly, the numerator can be written as n + o(1/β). Now, since all of these o(1/β) terms are positive for β < 1, we may write that

nβ/2 < n/(1/β + o(1/β)) ≤ k ≤ (n + o(1/β))/(1/β + o(1/β)) + 1 < nβ + 2.

Here, we used the properties that o(1/β) < 1/β and that, since the o(1/β) terms are positive, n/(1/β + o(1/β)) < n/(1/β). Thus, since k is bounded between nβ/2 and nβ + 2, it is certainly O(nβ). The transmission of one symbol requires q bits, so that in the first step, the number of bits consumed in sending the


pivots is given by q(k − 1)L_P = q · O(nβ) · O((1/H_2) log(1/β)) = O((nq/H_2) β log(1/β)).
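As a quick numeric sanity check of the bounds just derived, the following sketch computes k for illustrative parameter values. The choices H_2 = q = 6 (a uniform source over Q = 64 symbols) and base-2 logarithms are our assumptions for illustration, not values taken from the paper.

```python
from math import ceil, log2

def num_pivots(n, beta, H2):
    """k from the proof of Lemma 1, with L_P = (20 + 6*log(1/beta))/H2
    and segment length L_S = 1/beta."""
    L_P = (20 + 6 * log2(1 / beta)) / H2
    L_S = 1 / beta
    return ceil((n + L_P) / (L_S + L_P))

# Check the derived bounds n*beta/2 < k < n*beta + 2 for sample values.
n, beta = 50000, 0.006
k = num_pivots(n, beta, H2=6)
assert n * beta / 2 < k < n * beta + 2
```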

Next, the indices of the matched pivots are sent from user B to user A, at a cost of no more than k < nβ + 2 bits. Now we must evaluate the bandwidth needed by the edit recovery module, the second step of our algorithm. The average bandwidth cost of synchronizing each of the substrings F̃_j is at most

δ_j (4q + (6q(1 + 1/H_2) + 4) log|F̃_j| + 10),

where δ_j represents the number of insertions and deletions (edits) in F̃_j, |F̃_j| is the length of F̃_j, and we used a central delimiter with parameter c = 3. This follows from Theorem 4 in Section IV.

The total synchronization cost for this stage is given by no more than

nβ + 2 + E[ Σ_{j=1}^{k'} δ_j (4q + (6q + 6q/H_2 + 4) log|F̃_j| + 10) ].

The main challenge here is to compute E(δ_j log|F̃_j|). This is done by means similar to the equivalent statement in [21]. We have that E(δ_j log|F̃_j|) = Σ_ℓ Pr{|F̃_j| = ℓ} E[δ_j log|F̃_j| | |F̃_j| = ℓ]. Note that E[δ_j log|F̃_j| | |F̃_j| = ℓ] = E[δ_j log ℓ]. Now, our edit channel implies that the average number of insertions in a string of length ℓ is given by ℓ/(1 − β_i) − ℓ = ℓβ_i/(1 − β_i) and the average number of deletions is given by ℓβ_d/(1 − β_i), giving an average total number of edits of ℓ(β_i + β_d)/(1 − β_i) = ℓβ/(1 − β_i) = 2ℓβ/(2 − β). Thus,

E[δ_j log ℓ] = (2β/(2 − β)) ℓ log ℓ.

With this, we may write

E(δ_j log|F̃_j|) = E[ (2β/(2 − β)) |F̃_j| log|F̃_j| ].

Now, we can estimate this term using logic identical to [21, Appendix 1]. We have that E[(2β/(2 − β)) |F̃_j| log|F̃_j|] ≤ 16 + 8 log(1/β). The logic of the proof is identical; the only difference is that we must show (2β/(2 − β))(L_S + L_P) ≤ 2, which is clearly true for small β.
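The channel computation above can be checked symbolically. This snippet (our own illustration, using exact rational arithmetic) verifies that the per-length expected edit count matches the closed form 2ℓβ/(2 − β) used in the proof, for the symmetric case β_i = β_d = β/2.

```python
from fractions import Fraction

def expected_edits(ell, beta):
    """Expected number of edits in a length-ell substring of Y under the
    edit channel with beta_i = beta_d = beta/2: insertions
    ell*beta_i/(1 - beta_i) plus deletions ell*beta_d/(1 - beta_i)."""
    bi = Fraction(beta) / 2
    bd = Fraction(beta) / 2
    ins = ell * bi / (1 - bi)
    dels = ell * bd / (1 - bi)
    return ins + dels

# The closed form used in the proof: 2*ell*beta / (2 - beta).
beta, ell = Fraction(1, 100), 500
assert expected_edits(ell, beta) == 2 * ell * beta / (2 - beta)
```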

We write the total bandwidth of the edit recovery module as no more than

nβ + 2 + E[ Σ_{j=1}^{k'} δ_j (4q + (6q + 6q/H_2 + 4) log|F̃_j| + 10) ]
≤ nβ + 2 + E[ Σ_{j=1}^{k'} δ_j ] (4q + 10) + k'(6q(1 + 1/H_2) + 4)(16 + 8 log(1/β)).

The right-hand term can be estimated by taking k' ≤ nβ + 2 and is indeed O((nq/H_2) β log(1/β)). We note that E[Σ_{j=1}^{k'} δ_j] is the average number of edits in X, which is 2nβ/(2 − β). Putting all of these expressions together, we conclude that the number of bits used in the second module is at most

nβ + 2 + (8q + 20) nβ/(2 − β) + O((nq/H_2) β log(1/β)),

which is indeed O((nq/H_2) β log(1/β)).

As we will see in Section IV, the algorithm exits the second stage with a probability of error that is at most ζ = O(β). By applying a random permutation to the input of the ECC module (and the inverse permutation to the output of the ECC decoder), we can eliminate non-uniformity in the remaining errors. Therefore, we view these errors as having been produced by a Q-ary symmetric channel with probability of error at most O(β), split among Q − 1 symbols. The number of parity check symbols required for user A to send to user B to recover from these errors is

n H_Q(1 − ζ, ζ/(Q − 1), ζ/(Q − 1), ..., ζ/(Q − 1)) = O(nβ log_Q(1/β)),

where H_Q refers to the Q-ary entropy function. Since each symbol takes q bits and log Q = q, the bandwidth in bits is then O(nβ log(1/β)). Adding up the bandwidth consumed by all three stages, we conclude that our synchronization protocol's bandwidth is O((nq/H_2) β log(1/β)), as desired.

As described above, under the setup given by Lemma 2 (many good pivot matches and few bad matches), with probability 1 − 2^{−Θ(n)} we can recover from all errors in the final reconstructed file. In the remainder of the cases, the output error probability of any particular bit of the estimate X̂ is at most 2^{−Θ(n)}. This concludes our proof. □

In the following sections, we give further detail on each of the modules.

III. THE MATCHING MODULE

Adapting the matching module from [21] to our scenario, in which edits are not limited to deletions, is not straightforward and is one of our main contributions. This section characterizes the common goal of the matching module in both the deletions-only scenario and in ours, describes our algorithm, and explains why that algorithm had to be fundamentally modified to account for the presence of insertions.

A. Good and Bad Matches

For all i from 1 to k − 1, the matching module considers P_i, the i-th pivot.¹ We respectively denote by p_i and p̄_i the indices of its first and last symbols in X. The matching module then computes the list of all occurrences of P_i in Y. The goal is to identify which of these occurrences is the correct instance of P_i. We say that a substring Y_t^{t+L_P−1} of Y is the good match of P_i if:

• the edit pattern E_{p_i+o}^{p̄_i+o} corresponding to pivot i is all-zero, and
• Y_t is the symbol X_{p_i} (that is, the symbol X_{p_i} was transmitted, not edited, and can be found in Y at position t).

¹We make a slight abuse of notation by allowing strings like P_i or Y_t^{t+L_P−1} to represent both the content of the string and its location within X or Y, depending on context.


Fig. 3. Simple visualization of the matching module. Each green horizontal line contains one layer (a copy of the string Y). In layer i, we try to match the i-th pivot string in Y. These matches are shown in light red; good matches are shown in darker red. The edges between the vertices corresponding to matches are darker for higher probability (lower weight). Good paths are those with low total weight (in this case, the path that includes the dark red pivots 1, 3, 4, 5, 6).

Thus the subsequent symbols of Y_t^{t+L_P−1} are the symbols that resulted from the transmission of the rest of X_{p_i}^{p̄_i}.

Every occurrence of P_i in Y that is not the good match is called a bad match. Notice that a pivot has no good match if at least one of its symbols was edited. As a consequence, even when the substring Y_t^{t+L_P−1} is at the correct location and equals P_i, it is not a good match if there was an edit within the substring. For example, suppose X_{p_i} = X_{p_i−1}. If X_{p_i} was deleted, then the corresponding substring of Y is a bad match. If X_{p_i−1} was deleted and there were no other edits in P_i, then the corresponding substring of Y is a good match. We remark that this classification allows us to ignore certain corner cases without altering the main result.

B. The Matching Graph

In both the deletions-only scenario and in ours, the goal of the matching module is to find, at once, the good matches of as many pivots as possible.

In the presence of deletions only, as considered in [21], one can argue that if a certain number of symbols occur between two pivots in X, then at most as many symbols occur between their good matches in Y, since there cannot be insertions. One can therefore define a system of binary constraints that answer questions of the type "can this match of P_i and this match of P_{i'} simultaneously be good matches?". This system of constraints is nicely represented by an acyclic graph whose vertices are matches, and in which two vertices are connected if and only if they correspond to matches that can simultaneously be good matches. Maximal paths on this graph represent combinations of matches that can be simultaneously good, and it was proven in [21] that with high probability, the longest such path represents a combination of matches that includes a large number of good matches.

In our situation, because insertions as well as deletions can happen, there is no similar logical constraint. The distance in X between two pivots can be greater than, equal to, or less than the distance in Y between their good matches. We can, however, approximately quantify the likelihood of two matches being simultaneously good, given an estimate of β_i and β_d (for instance, if the insertion and deletion rates are very low, there is only an extremely small probability that the distance between two pivots is much greater or much smaller than that between their good matches). We therefore use a graph with the same vertex set as before, where edges are now weighted by the probability that the matches corresponding to their endpoints are simultaneously good. The matching problem thus turns from a constraint-satisfaction problem in the deletions-only case into an optimization problem. An example of the matching graph is depicted in Fig. 3.

We now formally define our matching graph G = (V, E). The set of vertices V is partitioned into k + 1 layers ℒ_0, ..., ℒ_k:

• For 1 ≤ i ≤ k − 1, each vertex v in ℒ_i corresponds to a match of P_i in Y;
• ℒ_0 = {v_0} and ℒ_k = {v_k}, where v_0 and v_k respectively correspond to the fact that the beginning of X matches that of Y, and that the end of X matches that of Y.

We further define, for each layer ℒ_i, ℒ_i^g and ℒ_i^b as the set of good vertices and the set of bad vertices in ℒ_i, respectively, where good and bad vertices are the vertices corresponding to good and bad matches. We consider v_0 and v_k to be good vertices. Note that, according to our definition of a good match, ℒ_i^g contains at most one vertex.

For 1 ≤ i ≤ k − 1 and for a vertex v ∈ ℒ_i, we let v̲ and v̄ respectively denote the first and last indices in Y of the match of P_i that corresponds to vertex v. We also set v̄_0 = 0 and v̲_k = |Y| + 1.

Consider two vertices u ∈ ℒ_i and v ∈ ℒ_j with i < j. We introduce two quantities:

D(u, v) = v̲ − ū − 1,
δ(u, v) = D(u, v) − (( j − i − 1)L_P + ( j − i)L_S). (3)

The quantity δ(u, v) is the number of net edits (the number of insertions minus the number of deletions) that must have occurred between P_i and P_j for u and v to be good vertices. Intuitively speaking, if for instance β_i = β_d ≪ 1, the expected value of δ(u, v) is 0, and a high value of |δ(u, v)| would indicate that u and v cannot simultaneously be good vertices.

With this in mind, we define the set of edges E of our (oriented) graph G. There is an edge from vertex u ∈ ℒ_i to vertex v ∈ ℒ_j if and only if i < j. We assign the following weight to the edge u → v:

w(u, v) = |δ(u, v)| + ( j − i − 1)W, (4)

where W is a positive constant to be fixed later. The first term of w(u, v) gives high weight to paths that represent an edit pattern with many edits, while the second term penalizes edges that skip one or several layers (i.e., edges from ℒ_i to ℒ_j where the number j − i − 1 of layers between layer i and layer j is nonzero).
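Equations (3) and (4) translate directly into code. The sketch below (with helper names of our own choosing) computes δ(u, v) and w(u, v) from the last index ū of u's match and the first index v̲ of v's match:

```python
def delta(u_bar, v_start, i, j, L_P, L_S):
    """Net-edit count delta(u, v) per (3), where
    D(u, v) = v_start - u_bar - 1."""
    D = v_start - u_bar - 1
    return D - ((j - i - 1) * L_P + (j - i) * L_S)

def weight(u_bar, v_start, i, j, L_P, L_S, W):
    """Edge weight w(u, v) per (4): |delta(u, v)| plus a penalty of W
    for each skipped layer."""
    return abs(delta(u_bar, v_start, i, j, L_P, L_S)) + (j - i - 1) * W
```

For adjacent layers (j = i + 1) with exactly one unedited segment of L_S symbols between the two matches, δ(u, v) = 0 and the edge has weight 0, as expected.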

C. Theoretical Properties of the Matching Graph

In the following, we assume β_i = β_d = β/2 for the sake of simplicity. (Similar results can be derived for β_i ≠ β_d.) We define R ≜ L_P β. Then,


Lemma 2: Let k' = (1 − L_P β − 2β + o(β))k and take L_P = (1/H_2)(20 + 6 log(1/β)). Then, the matching module matches k' pivot strings (k' ≤ k) with at most βk pivot mismatches. This occurs with probability 1 − 2^{−Θ(n)}.

We will prove Lemma 2 by building up a series of results.

Lemma 3: With probability 1 − 2^{−Θ(n)}, there are (1 − R + o(β))k pivots with a good match in Y.

Proof: For any pivot P_i, the probability that P_i has a good match in Y is (at least) the probability that no symbol of P_i is edited: (1 − β)^{L_P}. Now, we have from the binomial theorem that

(1 − β)^{L_P} = 1 − L_P β + Σ_{i=2}^{L_P} C(L_P, i)(−1)^i β^i,

where C(L_P, i) denotes the binomial coefficient.

We show by induction that C(L_P, i) β^i is o(β) for i ≥ 2. The base case is i = 2. We have C(L_P, 2) β^2 = O((1/H_2) log(1/β))^2 β^2. Thus, there exists d > 0 such that

C(L_P, 2) β^2 ≤ ((d/H_2) β log(1/β))^2,

for sufficiently small β. Now, since for small β, (log(1/β))^2 is much smaller than 1/β, we may write, for all fixed d' > 0 and sufficiently small β, that

C(L_P, 2) β^2 ≤ ((d/H_2) β log(1/β))^2 < d'(d/H_2)^2 β^2 (1/β) = d'(d/H_2)^2 β.

Now, since d/H_2 is a constant, we conclude that C(L_P, 2) β^2 is o(β).

Next, assume that C(L_P, i) β^i is o(β). Then, C(L_P, i+1) β^{i+1} = ((L_P − i)/(i + 1)) C(L_P, i) β^{i+1} ≤ L_P β C(L_P, i) β^i = L_P β · o(β). However, L_P = O((1/H_2) log(1/β)), which is o(1/β), so that L_P β is indeed smaller than 1, giving us C(L_P, i+1) β^{i+1} = L_P β · o(β) = o(β). This concludes the argument.

Thus, (1 − β)^{L_P} = 1 − L_P β + o(β) = 1 − R + o(β). Therefore, the expected value of the number of pivots with a good match in Y is (1 − R + o(β))k.

The remainder of the proof is an application of Theorem 1. Take each Z_i to be an indicator random variable for the match of a pivot: Z_i = 1 for a good match and Z_i = 0 for a bad match. In Theorem 1, we take ℓ = k and M = (1 − R + o(β)). Note that the interval length I = 1 for indicator variables. Next, we take ε = o(β). Plugging these terms into Theorem 1 and using the fact (shown in the proof of Lemma 1) that k ≥ nβ/2, we may write that the probability of a number of good matches between Mk − o(β)k and Mk + o(β)k is at least

1 − 2 exp(−2 o(β)^2 k) = 1 − 2 exp(−2 o(β)^2 (nβ/2)).

Switching logarithm bases, we get a probability of at least

1 − 2^{−o(β)^2 nβ}.

Now, recall that when we use notation such as Θ(n), β is treated as a constant, so that we can replace o(β)^2 nβ with Θ(n). Next, the integers in the range [Mk − o(β)k, Mk + o(β)k] = [(1 − R)k, (1 − R + 2o(β))k] are all in the set of integers that can be written (1 − R + o(β))k, and the proof is concluded. □

We now fix the constant W to 1/β. The next two results are Lemma 4 and Lemma 5. Lemma 4 upper bounds the weight of P_g, the path connecting good vertices, defined below. Lemma 5 gives a bound on the number of good and bad vertices on low-weight paths.

Below, we use Lemma 3 to bound the weight of the path P_g joining all good vertices, which can be formally defined as the path v_0 = v_{i_0}^g → v_{i_1}^g → v_{i_2}^g → ... → v_{i_{k'−1}}^g → v_{i_{k'}}^g = v_k (with i_0 = 0 and i_{k'} = k) such that:

• for all j with 0 ≤ j ≤ k', v_{i_j}^g ∈ ℒ_{i_j}^g;
• for all j with 0 ≤ j ≤ k' − 1, every layer ℒ_{i'} with i' between i_j and i_{j+1} is such that ℒ_{i'}^g = ∅.

Lemma 4: With probability 1 − 2^{−Θ(n)},

w(P_g) ≤ (1 + RW + O(β))k. (5)

Proof: We only need to show that (5) is satisfied with probability 1 − 2^{−Θ(n)} when there are (R − o(β))k pivots with no good match in Y: using Lemma 3 and the union bound will then conclude the proof.

We write the weight of the path P_g as the sum of the weights of its edges, and we then decompose this sum using (4) (the first term is the number of net edits, while the second corresponds to the contribution of skipped layers):

w(P_g) ≤ Σ_{j=0}^{k'−1} w(v_{i_j}^g, v_{i_{j+1}}^g)
≤ Σ_{j=0}^{k'−1} |δ(i_j, i_{j+1})| + Σ_{j=0}^{k'−1} (i_{j+1} − i_j − 1)W
≤ Σ_{i=0}^{k−1} |δ(i, i + 1)| + (k − k')W, (6)

where the last step comes from the triangle inequality

|δ(i_j, i_{j+1})| = |Σ_{i=i_j}^{i_{j+1}−1} δ(i, i + 1)| ≤ Σ_{i=i_j}^{i_{j+1}−1} |δ(i, i + 1)|. (7)

Using Theorem 1 with ε = β^2, we have, with probability 1 − 2^{−Θ(n)} (again, in Θ(n), β is treated as a constant),

Σ_{i=0}^{k−1} |δ(i, i + 1)| ≤ (β + β^2)n = (1 + O(β))k. (8)

Furthermore, Lemma 3 bounds the number (k − k') of pivots with no good match in Y by (R − o(β))k. Hence, as W = 1/β, the following holds:

w(P_g) ≤ ((R − o(β))W + 1 + O(β))k = (1 + RW + O(β))k. (9)

□

The next result, Lemma 5, bounds the number of good and bad pivot matches on a low-weight path. Combined with Lemma 4, which upper bounds the weight of the path P_g, it justifies the fact that the lowest-weight path in G has a large number of good vertices and few bad vertices. Observe that Lemma 5 and Lemma 4 together imply Lemma 2.

Lemma 5: For a random string X and a random edit pattern, with L_P = (1/H_2)(20 + 6 log(1/β)), if we pick any path Q from v_0 to v_k with weight w(Q) ≤ (1 + RW + O(β))k, then with probability at least 1 − 2^{−Θ(n)} the path Q has at least (1 − R − 2β + o(β))k good vertices and at most βk bad vertices.

The proof of Lemma 5 is deferred to the Appendix.

Since there exists a path of weight (1 + RW + O(β))k (Lemma 4), and since with very high probability any path in G of weight lower than or equal to (1 + RW + O(β))k has a large number of good vertices and a small number of bad vertices, we implement our matching module as a search for the shortest path (that is, the lowest-weight path) in G.

Next, we briefly comment on the complexity of finding a shortest path on the matching graph. As described, we perform a lowest-weight path search in G with Dijkstra's algorithm. In the context of our problem, the worst-case complexity is O(|V|^2). There are k = ⌈(n + L_P)/(L_S + L_P)⌉ < nβ + 2 layers in our graph, and the average number of instances of a pivot in each layer is given by 2^{−H_2 L_P} |Y| = 2^{−log(O(1/β))} n(1 − β_d)/(1 − β_i) = O(βn). Therefore, |V| = O(n^2 β^2), and the complexity of our matching module algorithm is upper bounded by O(|V|^2) = O(n^4 β^4).
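To make the search concrete, here is a minimal sketch of the lowest-weight path computation. Since the matching graph is a layered DAG, a dynamic program in layer order finds the same path as Dijkstra's algorithm; the data layout (a list of match start positions per layer, with sentinel layers for v_0 and v_k) is our own choice for illustration.

```python
def shortest_match_path(layers, L_P, L_S, W):
    """Lowest-weight v_0 -> v_k path in the matching graph.

    layers[i] lists the start indices in Y of the matches of P_i;
    layers[0] = [0] and layers[-1] = [len(Y) + 1] are the sentinels.
    Returns (total weight, path as a list of (layer, start) pairs).
    """
    k = len(layers) - 1
    cost = {(0, layers[0][0]): 0.0}
    back = {(0, layers[0][0]): None}
    for j in range(1, k + 1):
        for v in layers[j]:
            for i in range(j):
                for u in layers[i]:
                    if (i, u) not in cost:
                        continue
                    # last index of u's match (sentinel v_0 carries no pivot)
                    u_bar = u if i == 0 else u + L_P - 1
                    d = (v - u_bar - 1) - ((j - i - 1) * L_P + (j - i) * L_S)
                    w = abs(d) + (j - i - 1) * W  # edge weight per (4)
                    c = cost[(i, u)] + w
                    if c < cost.get((j, v), float("inf")):
                        cost[(j, v)] = c
                        back[(j, v)] = (i, u)
    # backtrack from v_k
    node = (k, layers[k][0])
    total, path = cost[node], []
    while node is not None:
        path.append(node)
        node = back[node]
    return total, path[::-1]
```

For example, with L_P = 2, L_S = 4, a clean layout places the two pivot matches at Y-positions 5 and 11, and a decoy bad match at 8 in layer 1; the program selects the all-good path of weight 0 rather than the decoy.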

The following section describes the second module of our protocol: the edit recovery module.

IV. THE EDIT RECOVERY MODULE

Once the pivots have been matched, the problem of synchronizing the two potentially large files can be split into independent synchronization problems for which the expected number of edits is low (since it is controlled by the choice of the segment length). We therefore use the algorithm from [18], which works well in this scenario. This section briefly describes that algorithm.

A. Synchronizing From a Single Edit

If two strings differ by only a single insertion or deletion, they can be synchronized in a single round of communication and at a low cost in terms of bandwidth. This is achieved by using the non-binary Varshamov-Tenengolts (VT) codes introduced in [6]. The VT codes are a family of codes that correct a single insertion or deletion:

Definition 2 (cf. [6]): For all 0 ≤ ρ < Q and 0 ≤ σ < n, the VT code VT_{ρ,σ}(n, Q) is the set of all Q-ary vectors (a_1, ..., a_n) such that

Σ_{i=1}^{n} a_i ≡ ρ (mod Q),
Σ_{i=1}^{n} (i − 1)s_i ≡ σ (mod n), (10)

where the sequence (s_1, ..., s_n) is defined by s_1 = 1 and, for all 2 ≤ i ≤ n,

s_i = 0 if a_i ≤ a_{i−1}, and s_i = 1 if a_i > a_{i−1}. (11)

Since VT_{ρ,σ}(n, Q) corrects a single insertion or deletion for any ρ and σ, synchronizing X with Y when they differ by only one insertion or deletion can be done by computing the values ρ and σ for which X ∈ VT_{ρ,σ}(n, Q) and transmitting these values, then decoding Y in VT_{ρ,σ}(n, Q). We thereafter refer to these values of ρ and σ as the VT-checks of X and write VTC(X) = (ρ, σ). Synchronization therefore costs one round of communication and log n + log Q bits.
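As an illustration, the VT-checks of Definition 2 can be computed in a few lines (`vt_checks` is our own helper name; x is a list of symbols in {0, ..., Q − 1}):

```python
def vt_checks(x, Q):
    """VT-checks (rho, sigma) of a Q-ary string x, per Definition 2."""
    n = len(x)
    rho = sum(x) % Q
    # s_1 = 1; for i >= 2, s_i = 1 iff a_i > a_{i-1}, see (11)
    s = [1] + [1 if x[i] > x[i - 1] else 0 for i in range(1, n)]
    # sum of (i - 1) * s_i over 1-indexed i, i.e. index * s[index] here
    sigma = sum(i * s[i] for i in range(n)) % n
    return rho, sigma
```

User A would transmit these two values (log Q + log n bits); user B then decodes its own string in VT_{ρ,σ}(n, Q).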

B. Synchronizing From Several Edits

In the case where two strings differ by more than a single insertion or deletion, we cannot use the VT codes directly. The protocol from [18] uses a divide-and-conquer approach to isolate edits and correct them using the VT codes. This is achieved recursively in the following manner:

1) If the strings to synchronize are of equal length (either there was no edit, or there were several edits with as many insertions as deletions), user A sends a hash of its string to user B. User B compares this hash to the hash of its own string and declares that synchronization of this subproblem is complete if they match; otherwise it moves to step 3.

2) If the lengths of the strings to synchronize differ by exactly one (either there was exactly one edit, or there were several edits with one insertion more or fewer than deletions), user A sends the VT-checks as well as a hash of its string to user B. User B then decodes its own string using the appropriate VT code and computes the hash of the resulting string. If the hashes match, once again we declare that synchronization of this subproblem is complete; otherwise we move to step 3.

3) In this case, the strings to synchronize for this subproblem differ by more than a single edit and cannot be synchronized directly with VT codes. We therefore divide this subproblem into two new subproblems by sending a number c of central symbols of user A's string to user B to be used as a delimiter. User B attempts to match these symbols around the center of its own string.

Fig. 4 illustrates this process. L and L' respectively denote the length of the string at user A and of that at user B.
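A minimal sketch of this recursion is given below. It follows the three steps above, with one deliberate simplification: wherever the real protocol would send VT-checks (and hashes) to correct a single edit, the sketch simply transmits A's substring, so it traces the divide-and-conquer structure (hash comparison, central delimiter, recursion on both sides) rather than the exact bandwidth. All function names are ours.

```python
import hashlib

def _h(s):
    # Short hash standing in for the protocol's hash exchange.
    return hashlib.sha256(s.encode()).hexdigest()[:8]

def sync(a, b, c=3, stats=None):
    """Return B's reconstruction of A's string a from B's string b."""
    if stats is None:
        stats = {"symbols_sent": 0}
    # Step 1: equal lengths -- compare hashes; done if they agree.
    if len(a) == len(b) and _h(a) == _h(b):
        return b
    # Step 2 (simplified): lengths differ by one -- the real protocol
    # sends VT-checks here; this sketch just sends A's substring.
    # Also the base case: too short to split around a c-symbol delimiter.
    if abs(len(a) - len(b)) == 1 or len(a) <= 2 * c:
        stats["symbols_sent"] += len(a)
        return a
    # Step 3: split around a central delimiter of c symbols, matched in b
    # as close as possible to b's own center.
    mid = len(a) // 2
    delim = a[mid:mid + c]
    target = len(b) // 2
    candidates = [i for i in range(len(b) - c + 1) if b[i:i + c] == delim]
    if not candidates:
        stats["symbols_sent"] += len(a)  # no match: fall back to sending a
        return a
    pos = min(candidates, key=lambda i: abs(i - target))
    left = sync(a[:mid], b[:pos], c, stats)
    right = sync(a[mid + c:], b[pos + c:], c, stats)
    return left + delim + right
```

Running `sync` on a pair of strings that differ by one deletion and one insertion returns A's string exactly, since every branch either verifies a segment by hash or transmits it.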

Notice that one could use the algorithm from [18] directly, without a matching module, but that algorithm is designed to work for a number of edits that is o(n/log n); it will therefore not perform as well if one fixes the edit rates and increases the file length too much.

Finally, we characterize the algorithm in terms of bandwidth and error probability. This result was proved in [1] and is similar to the proofs in [18]. To keep our paper concise, we do not include the proof and refer the reader to [1] for further details.

Theorem 4 [1]: Let P be an i.i.d. sequence of length L over 𝒳, and let H_2 be the collision entropy of P_t for all t. Let P̃ be a sequence obtained from P through δ_i insertions and δ_d deletions, for a total of δ = δ_i + δ_d = o(L/log L) edits.² For any parameter c > 1, there exists an interactive

²We use the definition of little o applied to L, that is, f(L) = o(g(L)) if for all d > 0 there exists L' > 0 such that for all L > L', |f(L)| ≤ d|g(L)|.


Fig. 4. Example of a run of the edit recovery module. At the first iteration (top line), because L' − L > 1, B requests a central delimiter from A. This delimiter is matched on B's string, and the algorithm continues recursively on both sides of the delimiter. Parts of the strings that are considered synchronized are grayed out.

synchronization protocol that produces an estimate P̂ of P from P̃, such that:

1) The probability that the protocol does not synchronize correctly is

Pr{P̂ ≠ P} ≤ δ log L / L^c. (12)

2) If N_{A→B} and N_{B→A} respectively denote the number of bits sent from the encoder to the decoder, and from the decoder to the encoder, then their expected values are such that

E(N_{A→B}) < δ(4q + (2qc + 2qc/H_2 + 4) log L),
E(N_{B→A}) < 10δ. (13)

3) The probability that the algorithm terminates after r rounds is at least (1 − (δ + 1)2^{−r})^δ. The expected number of rounds taken by the protocol to terminate is therefore approximately 4 + 2 log δ.

After the completion of the edit recovery module, synchronization has been performed; however, there may be residual errors. Here again, we follow the logic of [21]. The number of errors due to pivot mismatch is upper bounded by O(β). The error due to the edit recovery module (caused by, for example, hash collisions) is, according to Theorem 4, δ log L_S / L_S^3 for a central delimiter with parameter c = 3, which contributes an error term of o(β). This is a theoretical characterization; below we discuss a practical choice of coding scheme to deal with residual errors.

In [21], in the deletions-only setting, this module used LDPC codes. Since the above theoretical analysis is agnostic to the specific type of error-correcting code used, we allow for the application of other classes of codes.

Reed-Solomon codes are a class of maximum-distance separable codes, with d_min = n' − k + 1, of blocklength n' over GF(Q), capable of correcting (n' − k)/2 symbol errors. A Reed-Solomon code is an appropriate choice here since in our insertion/deletion process, the typical symbol error is not characterized by a single-bit error (as would be the case for a Gray-coded data set with Gaussian noise). The input to the ECC module is interleaved; thus our Reed-Solomon code yields an error-free output when the residual error rate at the output of the edit recovery module is lower than (n' − k)/(2k). Table II provides performance results for one example of a Reed-Solomon code.
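The error-correcting capability quoted above follows directly from the minimum distance of an MDS code; a small illustration (not tied to any particular RS implementation):

```python
def rs_correctable(n_prime, k):
    """Symbol errors correctable by an MDS [n', k] code:
    d_min = n' - k + 1, hence t = floor((n' - k) / 2)."""
    return (n_prime - k) // 2

# The [63, 61, 3] Reed-Solomon code over GF(64) used in Table II
# corrects a single symbol error per block.
assert rs_correctable(63, 61) == 1
```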

This concludes our theoretical analysis of the synchronization protocol. In the next section, we present experimental results based on our software implementation of the protocol.

V. EXPERIMENTAL RESULTS

In this section, we report experimental results for our protocol. We compare it with rsync and with the synchronization protocol from [18], and we discuss several variations of our protocol that achieve different tradeoffs between a low bandwidth and a low number of rounds of communication. One way to compare such tradeoffs is to use a cost function, for instance by fixing a positive number γ and defining the cost C as

C = N_bits + γ N_rounds, (14)

where N_bits is the total bandwidth consumption in bits and N_rounds is the number of rounds required to synchronize. This type of cost function is a simple illustrative example; there are many other possibilities.
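As a toy illustration of (14) with made-up numbers (not measurements from our experiments), the preferred configuration flips as γ grows:

```python
def cost(n_bits, n_rounds, gamma):
    """Cost C = N_bits + gamma * N_rounds, per (14)."""
    return n_bits + gamma * n_rounds

# Hypothetical runs: a low-bandwidth/many-rounds configuration vs. the
# reverse. Which one is "cheaper" depends entirely on the chosen gamma.
low_bw, many_rounds = 10000, 20
high_bw, few_rounds = 14000, 6
assert cost(low_bw, many_rounds, gamma=100) < cost(high_bw, few_rounds, gamma=100)
assert cost(low_bw, many_rounds, gamma=500) > cost(high_bw, few_rounds, gamma=500)
```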

A. Comparison With rsync

We compare our scheme against rsync for varying edit probabilities. We report our results in Fig. 5 for a file length of n = 50000 symbols; however, our conclusions would remain the same for a wide range of file lengths, since the bandwidth consumption of both rsync and our protocol is (almost) linear in n.

Our protocol shows a significant improvement in terms of the number of bits transmitted, using only 10% of the bandwidth consumed by rsync at β = 0.006, for instance.


Fig. 5. Comparison of the number of bits transmitted by our scheme versus rsync, as a function of the edit rate β = β_i + β_d. At an edit rate of β = 0.006, we improve on rsync by a factor of 10. The files used are i.i.d. and uniform with alphabet size Q = 52 and length n = 50000 symbols (300000 bits). The pivot length is 5, while the segment length is β^{−1}.

The comparison was performed on i.i.d. files with symbols drawn uniformly from an alphabet of size Q = 52 (e.g., lower-case and upper-case Roman letters), and we used pivot length L_P = 5 and segment length L_S = 1/β.

B. Comparison With Protocol by Venkataramanan et al.

We compare the bandwidth and number of rounds used by our synchronization protocol with those required by the protocol proposed in [18], on files with lengths from 20000 to 100000 symbols with β_i = β_d = 0.01, and we report the results in Table I.

Our scheme is shown under four configurations. In all of them, if the edit recovery module has to synchronize two substrings that are so small that sending a central delimiter, VT-checks, and hashes would use more bandwidth than sending the substring itself, the substring is sent directly.

• "Our scheme, L_S = 100" is our scheme with segment length 100.
• "Our scheme, L_S = 200" is our scheme with segment length 200.
• "Our scheme, pre-send delim." is our scheme under the following modification: whenever the edit recovery module is asked to synchronize two strings whose lengths differ by at most one, it not only sends a hash (and VT-checks if the lengths differ by exactly one), but also sends a central delimiter. If the hashes match, it declares the strings to be synchronized, the central delimiter is discarded, and we have wasted a bit of bandwidth. However, if the hashes do not match, we have saved one round of communication because we have already sent a central delimiter. We use segment length 100.
• "Our scheme, stop at 12" is our scheme with segment length 100, where, when round 12 is reached, the edit recovery module simply sends all the substrings that have not yet been synchronized, completing the synchronization right then. We chose round 12 based on the experimental results that are explained in more detail at the end of this section and presented in Fig. 7.

For each file length, a thousand pairs (X, Y) of files were generated, and we report, for the five different setups, the bandwidth (median and worst case) and the number of rounds (median and worst case) used to synchronize.

On average, our scheme requires a slightly higher, albeit comparable, bandwidth compared to that used by [18]. However, our scheme synchronizes in far fewer rounds of communication. Choosing a higher segment length reduces the gap between the bandwidth consumption of the two protocols even further, at the expense of typically about two additional rounds (which remains much lower than the number of rounds required by [18], especially as the file length increases).

There are mainly two scenarios that cause an increase inbandwidth and round consumption for both our synchroniza-tion protocol and that from [18]:

• When two edits occur very close to each other, it maytake a number of rounds comparable to the logarithm ofthe segment length (for our protocol) or to that of the filelength (for [18]) to isolate them so that the VT codes cancorrect them.

• When a central delimiter suffers from an edit, and the protocol manages to find another occurrence of that central delimiter nearby in Y, so that the substrings of X and Y on the left and on the right of that central delimiter and of its match differ by a relatively large number of edits.

Both situations are relatively uncommon, but they may occur, and our scheme is better able to deal with them thanks to the fact that our matching module isolates independent runs of the protocol from [18]: in situations undesirable for the protocol from [18], only a small portion of the overall run of our protocol is affected. Therefore, the worst-case performance over a thousand synchronizations is much better for our protocol than for that from [18], even more so for large file lengths. Its contribution to better worst-case performance, as well as the fact that it reduces the number of rounds required to isolate edits, justifies the use of our Matching Module.
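To make the first scenario concrete, the number of rounds needed to isolate two nearby edits grows with the logarithm of the span being bisected. The sketch below is a simplification (it assumes each round halves the search span, which only approximates the actual recursion of either protocol); it contrasts bisecting within a segment of length 100 against bisecting within a whole file of 50000 symbols:

```python
from math import ceil, log2

def rounds_to_isolate(span: int) -> int:
    """Rounds of halving needed before two adjacent edits end up in
    separate substrings: roughly log2 of the span being bisected."""
    return ceil(log2(span))

# Our protocol bisects within a segment; the protocol from [18]
# bisects within the whole file.
segment_rounds = rounds_to_isolate(100)     # segment length 100
file_rounds = rounds_to_isolate(50000)      # file length 50000 symbols
print(segment_rounds, file_rounds)
```

Under this halving model, isolating the edit pair costs about 7 rounds per segment versus about 16 rounds for whole-file bisection, which is why the gap widens with file length.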

Table II provides insight into the importance of the ECC module. In these experiments, we used a [n′, k, d]Q = [63, 61, 3]64 Reed-Solomon code (length 63, dimension 61, single-error-correcting), with a file length of n = 15000 symbols (90000 bits), an alphabet size of Q = 64, and 10000 trials. In order to observe a nonzero error rate, we use a lower hash length, LH, which increases the probability of hash collisions. Note that any run of our protocol with residual errors counts as a frame error (the bit error rate is much smaller). The table clearly shows that for a minimal increase in bandwidth, we can drastically lower the frame error rate. In addition, due to the lack of a matching module, the scheme from [18] contains a higher number of errors. Its higher frame error rate is due to an increased number of rounds and a larger average segment length.
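The 245-symbol threshold quoted in Table II follows from simple arithmetic: the file is split into roughly n/k Reed-Solomon codewords, each of which corrects up to (d − 1)/2 = 1 symbol error. A quick check (the one-codeword-per-k-symbols layout is our reading of the setup, not spelled out in the text):

```python
from math import floor

# Parameters from Table II.
n = 15000                    # file length in symbols
n_prime, k, d = 63, 61, 3    # [63, 61, 3] Reed-Solomon code over GF(64)

t = (d - 1) // 2             # each codeword corrects t = 1 symbol error
correctable = floor(n * (n_prime - k) / (2 * k))

# Consistency check: full codewords covering the file, times t errors each.
assert correctable == (n // k) * t
print(correctable)           # 245
```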

TABLE I

COMPARISON OF THE BANDWIDTH AND NUMBER OF ROUNDS USED BY OUR SCHEME AND BY THAT OF VENKATARAMANAN et al. [18]. BOTH THE MEDIAN AND THE WORST CASE (OVER A THOUSAND TRIALS) ARE SHOWN, FOR FILE LENGTHS n RANGING FROM 20000 TO 100000 SYMBOLS. THE ALPHABET HAS SIZE Q = 64, AND THE FILES ARE i.i.d. WITH UNIFORM SYMBOL DISTRIBUTION. THE EDIT CHANNEL HAS EDIT PROBABILITIES βi = βd = 0.01, AND THE PIVOTS HAVE LENGTH 5. THE LOWEST COST FOR EACH SCENARIO IS REPORTED IN BOLD FONT.

TABLE II

COMPARISON OF THE AVERAGE BANDWIDTH, AVERAGE NUMBER OF ROUNDS, AND FRAME ERROR RATES FOR BOTH OUR SCHEME AND THE SCHEME FROM [18], WITH AND WITHOUT THE ECC MODULE. WE USE A [n′, k, d]Q = [63, 61, 3]64 REED-SOLOMON CODE. IN OUR SIMULATION, THE FILE LENGTH IS n = 15000 SYMBOLS (90000 BITS), AND THE ALPHABET SIZE IS Q = 64. NOTE THAT WE SIMULATE FOR TWO DIFFERENT VALUES OF THE HASH LENGTH LH, 10 BITS AND 16 BITS. IF THE NUMBER OF RESIDUAL ERRORS COMING INTO THE ECC MODULE IS FEWER THAN ⌊n(n′ − k)/(2k)⌋ = 245 SYMBOLS, THEN THE OUTPUT OF THE ECC MODULE IS ERROR FREE. THE FILES ARE UNIFORM i.i.d., THE EDIT PROBABILITIES ARE βi = βd = 0.01, AND THE PIVOT LENGTH IS 5. EACH SCENARIO WAS SIMULATED WITH 10000 TRIALS.

Using a well-chosen fixed-round stopping criterion (sending every substring not yet synchronized when a given round is reached) provides an excellent tradeoff: the average bandwidth remains close to that used by [18], while the number of rounds is controlled (which is especially valuable for higher file lengths and/or for outlier cases). In Fig. 6, we show the bandwidth consumption of our algorithm under fixed-round stopping criteria (with n = 50000, βi = βd = 0.01, and Q = 64). Stopping at round 10 only increases the bandwidth required to synchronize by less than one percent, while saving up to about 15 rounds of communication in bad cases (around 4 rounds are saved in typical scenarios). We also plot a genie-aided stopping criterion that decides for each subproblem whether the edit recovery module should send the entire substring at once and terminate, or whether it should send the requested information (delimiters, VT-checks, and/or hashes). The decision is based on the minimization of the total bandwidth consumption. This genie-aided stopping criterion reduces the bandwidth consumption by almost 30% compared to the basic scenario, and looking for good heuristics to approach it is therefore a promising direction for future work.

In Fig. 7, we show the impact of fixed-round stopping criteria on the cost of synchronization, where the cost is C = Nbits + γ Nrounds for γ ranging from 1000 to 8000. This justifies the choice of round 12 as our stopping criterion in Table I.
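Choosing the stopping round for a given γ is a one-dimensional minimization of C = Nbits + γ·Nrounds. The sketch below uses made-up bandwidth figures (loosely inspired by the scale of Fig. 6, not measured data) to show how the optimal round shifts as γ grows:

```python
# Hypothetical bandwidth (in bits) if the protocol stops at round r:
# stopping later sends less all-at-once data, so bandwidth decreases in r,
# while the number of rounds used is r itself.
bandwidth_if_stop = {8: 92000, 10: 74000, 12: 71000, 14: 70500, 16: 70200}

def best_stopping_round(gamma: float) -> int:
    """Round r minimizing the cost C = Nbits + gamma * Nrounds."""
    return min(bandwidth_if_stop, key=lambda r: bandwidth_if_stop[r] + gamma * r)

print(best_stopping_round(1000), best_stopping_round(8000))
```

With these illustrative numbers, a small γ (bandwidth-dominated cost) favors a later stopping round, while a large γ (round-dominated cost) pulls the optimum earlier, mirroring the arrows in Fig. 7.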

Fig. 6. Bandwidth spent to synchronize with different stopping criteria, for a file length n = 50000 symbols (300000 bits) on an alphabet of size Q = 64, with insertion and deletion rates βi = βd = 0.01. The red curve shows the bandwidth spent using fixed-round stopping criteria, while the blue line shows the bandwidth that would be spent with an optimal stopping criterion (a genie decides, independently for each subproblem, the best time to stop in order to minimize bandwidth consumption). The fixed-round stopping criterion for round 10 results in an average 74k bits of bandwidth, and it remains within a percent of 70k bits for rounds 13 and afterwards. The genie-aided criterion uses 51k bits on average.

Fig. 7. Cost to synchronize with different stopping criteria, for a file length n = 50000 symbols (300000 bits) on an alphabet of size Q = 64, with insertion and deletion rates βi = βd = 0.01. Four cost functions of the form C = Nbits + γ Nrounds are considered, where Nbits and Nrounds respectively denote the bandwidth and number of rounds used to synchronize, and γ is a parameter that we plot for four values between 1000 and 8000. The optimal rounds at which to stop for γ = 1000 and γ = 8000 are indicated by arrows.

If we modify our protocol to pre-send a central delimiter whenever a hash is sent (sometimes along with VT-checks), we observe similar consequences as when stopping at a specific round: the bandwidth consumption is higher than under the unmodified version of the protocol, but the number of rounds required to synchronize is reduced. For small file lengths, this approach synchronizes using a particularly low number of rounds without costing much additional bandwidth.

Since residual errors can always be dealt with using an error-correcting code regardless of whether our scheme or that of [18] is used, we chose to compare the two protocols without that additional step. The average error rates of our protocol were lower than those of [18] for all three file lengths.

C. Robustness to Imperfect Knowledge of the Edit Channel

It is likely that, in practice, the parameters βi and βd of the edit channel are not perfectly known to the users. Let us assume that the users only have estimates β̂i and β̂d of βi and βd. In this scenario, our protocol will use a segment length 1/(β̂i + β̂d), and the weights of the edges on the matching graph will also diverge from their ideal values. The pivot length remains largely unchanged if the estimates are not orders of magnitude away from the real values of the insertion and deletion rates, since it varies logarithmically with these estimates.

In Fig. 8, we show the average bandwidth and number of rounds of communication used to synchronize a file Y with a file X of length n = 60000 symbols, when βi = βd = 0.01 are unknown to the users, for estimates of these rates ranging from 6e−4 to 0.033. The lowest amount of bandwidth is reached when the estimates are correct. The number of rounds is approximately logarithmic in the segment length and therefore does not vary much when the estimates of the insertion and deletion rates are within a few times larger or smaller than the real rates. This demonstrates the robustness of our scheme to imperfect knowledge of the channel.
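The parameter choices driving this robustness are easy to restate: the segment length is 1/(β̂i + β̂d), and the pivot length follows the choice LP = (1/H2)(20 + 6 log(1/β)) from the Appendix. The sketch below uses H2 = 6 bits as the collision entropy of a uniform alphabet of size 64 (an illustrative assumption); note how halving or doubling the estimates barely moves the pivot length:

```python
from math import ceil, log2

def segment_length(beta_i_hat: float, beta_d_hat: float) -> int:
    """Segment length used by the protocol: roughly one segment per expected edit."""
    return round(1 / (beta_i_hat + beta_d_hat))

def pivot_length(beta_hat: float, h2: float = 6.0) -> int:
    """Pivot length grows only logarithmically in 1/beta; the constants
    follow the Appendix choice L_P = (1/H2)(20 + 6 log(1/beta)).
    h2 = 6 bits is the collision entropy of a uniform 64-ary source."""
    return ceil((20 + 6 * log2(1 / beta_hat)) / h2)

print(segment_length(0.01, 0.01))           # 50
print(pivot_length(0.005), pivot_length(0.01), pivot_length(0.02))
```

A factor-of-two error in the estimates changes the pivot length by at most one symbol here, consistent with the flat region around the true rates in Fig. 8.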

In Fig. 9, we assume that user A knows how to jointly optimize the segment length and the fixed-round stopping criterion described in the previous section (in our experiment, we searched for the best pair of parameters in a grid).

Fig. 8. Variations of the bandwidth and number of rounds required to synchronize, as functions of the estimates of the insertion and deletion rates. The file length is n = 60000 symbols (360000 bits), the alphabet size is Q = 52, with insertion and deletion rates βi = βd = 0.01. We set the pivot length to 5 and the hash length to 10.

Fig. 9. Variations of the cost function when synchronizing with imperfect knowledge of the edit probabilities, for four values of the γ parameter of the cost function. The file length is n = 60000 symbols (360000 bits), the alphabet size is Q = 52, with insertion and deletion rates βi = βd = 0.01. We set the pivot length to 5 and the hash length to 10.

We then generate pairs of files for the users to synchronize, with file length n = 60000 symbols and with insertion and deletion rates βi = βd = 0.01, and measure the cost of synchronization when the users are given estimates of these rates and use the segment length and fixed-round stopping criterion that are optimal for these estimates. While the optimal costs are reached when the estimates are correct, estimating βi and βd as twice as large or twice as small as their actual values has almost no impact on the cost.

VI. CONCLUSION

In this paper, we considered the interactive synchronization protocol introduced by Yazdi and Dolecek [21], extended to non-binary, non-uniform files, with insertions as well as deletions. In order to achieve this extension of the protocol, we took into consideration the collision entropy of the source and refined the matching graph from [21] by appropriately weighting its edges.


We showed that (on i.i.d. files with i.i.d. edits) our protocol outperforms the widely used synchronization software rsync. It also outperforms the synchronization protocol by Venkataramanan et al. [18], not directly in terms of the bandwidth required to synchronize, but in the number of rounds required and in terms of resilience to outlier cases.

We observed the impact of the addition of a stopping criterion. Stopping criteria improve the performance of our protocol by providing tradeoffs between bandwidth and rounds of communication, but also by ensuring that outlier cases do not end up consuming a large number of rounds. Notably, the bandwidth used with the genie-aided stopping criterion is a major improvement compared to simple criteria: developing heuristic stopping criteria that attempt to approach what the genie would do may therefore improve our protocol even further.

Finally, we demonstrated the robustness of our protocol under imperfect knowledge of the properties of the edit channel, which is the expected scenario in practice.

One of the most important parts of our ongoing work is to adapt our protocol to real files and edit patterns (in particular, without the i.i.d. requirement). Many conclusions drawn in the i.i.d. case may not apply to real files: for instance, poorly chosen pivots may occur a very large number of times in the file (e.g., a common word in a text file, or a sequence of spaces in a well-indented source code file). On the other hand, using stopping criteria at a low number of rounds provides an even larger benefit than in the i.i.d. case, as it coincidentally reduces the bandwidth required to synchronize. We report our preliminary results on practical examples in [27].

APPENDIX

Below we provide the proof of Lemma 5. The basic idea of the proof is to examine the paths with many bad pivot matches (that is, large weight associated with the path) and to show that the probability of such paths is very low. This is similar to the proof in [21], with a major difference: we are now concerned with path weights rather than path lengths (as in the deletions-only case, the edges are not weighted). The path weight criterion adds further complexity to our combinatorial arguments, as will be seen below.

Proof: We first find an upper bound on the probability of existence of a path Q from v0 to vk with αk bad vertices (α is a constant denoting the fraction of bad pivot matches) and with weight upper-bounded by (recall that W = 1/β)

w(Q) ≤ (1 + RW + O(β))k, (15)

when α > β. We will then use this bound to show that the overall probability that such a path Q has more than βk bad vertices is upper-bounded by 2^{−Ω(n)}. A valid path Q must have at least (1 − R − β + o(β))k vertices, otherwise its weight would be more than (R + β + o(β))kW = (1 + RW + o(1))k (here, the o(1) refers to an expression in β). Using this lower bound on the total number of vertices with the upper bound on the number of bad vertices, we will prove that the number of good vertices on Q is at least (1 − R − 2β + o(β))k, with probability at least 1 − 2^{−Ω(n)}.

We denote by Qg the set of indices j such that v_{i_j} and v_{i_j+1} are good vertices, and by Qb the set of indices j such that either v_{i_j} or v_{i_j+1} is a bad vertex.

Case 1: β ≤ α ≤ 1/4. Let us first consider a fixed value of α such that β ≤ α ≤ 1/4. Following the analysis from the proof of [21, Th. 5] and adapting it to our general scenario, one can show that when α is fixed and k′ = |Qg| + |Qb| is set to (1 − R − β + o(β))k, the number of ways to simultaneously choose Qg and Qb can be upper-bounded by

( (1 − R − β + o(β))k choose (1 − R − β − α + o(β))k ) · ( (R + β + α + o(β))k choose αk )
≈ 2^{k((1−R−β)H(α/(1−R−β)) + (R+β+α)H(α/(R+β+α)) + o(β))}
≤ 2^{αk(5+2 log(1/β))}. (16)

Here, H is the binary entropy function (not to be confused with the collision entropy H2). In our case, unlike in the deletions-only case from [21], k′ may be larger than (1 − R − β + o(β))k. However, the number of ways to simultaneously choose Qg and Qb is a decreasing function of k′ when k′ is in the regime above (1 − R − β + o(β))k, and is therefore still upper-bounded by 2^{αk(5+2 log(1/β))}.

Again, a valid path Q must have at least (1 − R + o(β))k vertices. Therefore, summing (16) for k′ from (1 − R + o(β))k to k, we conclude that the number of ways to pick the layers for path Q with αk bad vertices and weight lower than (1 + RW + O(β))k is upper-bounded by

((R + o(β))k) · 2^{αk(5+2 log(1/β))} ≤ 2^{αk(6+2 log(1/β))}, (17)

for n (and therefore k) large enough.

Once we have fixed Qg and Qb, we count how many potential paths can match the following constraints:

• The weight of the path is at most (1 + RW + O(β))k,
• The good vertices and bad vertices are on layers that match Qg and Qb.

For now, we have no constraint on whether the potential vertices of these paths correspond to actual matched pivots in Y; rather, we are interested in the distances between those vertices. The distance D(u, v) = v − u − 1 that the edge u → v spans in Y is hereafter referred to as the length of that edge. The number of potential paths Q is determined by the number of combinations of lengths the edges can take while satisfying the two constraints. Since the lengths of the edges in Qg are fixed (X and the edit pattern are fixed, so the length of a good edge can be determined by the number of layers it spans and the number of net edits that occur between these layers), we only focus on the lengths of the edges in Qb.

The sum wb of the weights of these bad edges is upper-bounded by

wb = Σ_{j∈Qb} w(v_{i_j}, v_{i_j+1}) ≤ w(Q) ≤ (1 + RW + O(β))k. (18)

Furthermore, for a fixed value of wb, we upper-bound the number of combinations of the (non-negative) weights of the bad edges under the constraint that their sum is wb: that number is

( wb choose |Qb| ) ≤ ( wb choose 2αk ) ≤ (wb e/(2αk))^{2αk} ≤ ((1 + RW + O(β))e/(2α))^{2αk}
= 2^{2αk log((1+RW+O(β))e/(2α))}
≤ 2^{2αk(1/ln 2 + log(O(log(1/β))/(2β)))}
≤ 2^{2αk(1+2 log(1/β))}, (19)

where the first inequality comes from the fact that |Qb| ≤ 2αk and that we chose α so that 2αk ≤ wb/2, the second inequality is the generic upper bound on binomial coefficients (a choose b) ≤ (ae/b)^b, the fourth inequality uses RW = (LPβ)(1/β) = LP = O((1/H2) log(1/β)) = O(log(1/β)) (since H2 is a constant w.r.t. β), and the fifth inequality holds for β sufficiently small.
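The generic binomial bound invoked in the second inequality, (a choose b) ≤ (ae/b)^b, is easy to sanity-check numerically on a few arbitrary pairs:

```python
from math import comb, e

def binom_upper_bound(a: int, b: int) -> float:
    """Generic bound C(a, b) <= (a*e/b)**b used in the proof of Lemma 5."""
    return (a * e / b) ** b

# A few arbitrary sample pairs; the bound holds for all 0 < b <= a.
for a, b in [(100, 10), (1000, 50), (63, 2)]:
    assert comb(a, b) <= binom_upper_bound(a, b)
print("bound holds on all samples")
```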

Once the individual weights of the bad edges are fixed, there are only two possible values for the length of each of the bad edges, or just one if that weight is 0 (when w(u, v) is known, it fixes |δ(u, v)|, and the length of the edge u → v can therefore take only up to two values depending on the sign of δ(u, v)). There are up to 2αk such edges, and therefore at most 2^{2αk(1+2 log(1/β))} · 2^{2αk} = 2^{4αk(1+log(1/β))} valid combinations of the lengths of the bad edges. Summing over all possible values for wb from 0 to (1 + RW + O(β))k, the number of potential paths for a given choice of Qg and Qb is upper-bounded by (1 + RW + O(β))k · 2^{4αk(1+log(1/β))} ≤ 2^{αk(5+4 log(1/β))}, where that last inequality holds for n (and thus k) large enough.

We use this last result with (17) to finally upper-bound the number of potential paths with αk bad vertices and weight lower than (1 + RW + O(β))k by 2^{αk(11+6 log(1/β))}. Each of these potential paths is valid if there is a bad match of a pivot at the location that corresponds to each of the bad vertices. This happens with probability 2^{−αk H2 LP} (where 2^{−H2 LP} is the collision probability for a single pivot, and collisions are independent due to the i.i.d. setting of our model), and therefore the probability that there exists a path Q with αk bad vertices and weight w(Q) ≤ (1 + RW + O(β))k is upper-bounded by

2^{αk(11+6 log(1/β) − H2 LP)}. (20)

Case 2: 1/4 ≤ α ≤ 1. We follow the same process as in the first case. The number of ways to choose Qg and Qb can be upper-bounded by 3^k ≤ 2^{2k} (each of the k layers can be in Qg, in Qb, or in neither).

The sum wb of the weights of the bad edges of a path Q with w(Q) ≤ (1 + RW + O(β))k is upper-bounded by (18), and therefore the number of ways to choose those weights is less than 2^k. Once the weights are chosen, there are up to two possibilities for the length of each bad edge, so that the number of ways to choose the lengths and their corresponding weights is upper-bounded by 2^{2k}.

Overall, there are therefore up to 2^{4k} potential paths. Since each of these potential paths corresponds to an actual path in the graph with probability 2^{−αk H2 LP}, the probability that there is at least one such path is upper-bounded by 2^{k(4−α H2 LP)}.

Conclusion of the proof: We now put the two cases for the range of α together to find an upper bound on the probability of existence of a path Q with more than βk bad vertices and weight w(Q) ≤ (1 + RW + O(β))k:

Σ_{(αk)=βk}^{k/4} 2^{αk(11+6 log(1/β) − H2 LP)} + Σ_{(αk)=k/4}^{k} 2^{k(4−α H2 LP)}, (21)

where we use (αk) as the sum index for ease of presentation. Taking LP = (1/H2)(20 + 6 log(1/β)) ensures that:

• Since β ≤ α, αk(11 + 6 log(1/β) − H2 LP) ≤ −9αk ≤ −9βk,
• For α > 1/4, k(4 − α H2 LP) ≤ k(4 − α(20 + 6 log(1/β))) ≤ k(4 − 5 − (3/2) log(1/β)) ≤ −k.

The first sum is therefore upper-bounded by (k/4) · 2^{−9βk} and the second by (3k/4) · 2^{−k}, and thus the total is 2^{−Ω(n)}, which yields the result. □
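The two exponent bounds obtained from this choice of LP can be verified by direct arithmetic (β = 0.01 below is an arbitrary illustrative value; the Case 1 exponent equals −9 for every β):

```python
from math import log2

def h2_lp(beta: float) -> float:
    """H2 * L_P under the choice L_P = (1/H2)(20 + 6 log(1/beta))."""
    return 20 + 6 * log2(1 / beta)

beta = 0.01

# Case 1 exponent per (alpha*k): 11 + 6 log(1/beta) - H2*L_P collapses to -9.
exp1 = 11 + 6 * log2(1 / beta) - h2_lp(beta)
assert abs(exp1 - (-9)) < 1e-9

# Case 2 exponent per k, at the boundary alpha = 1/4:
# 4 - (1/4)(20 + 6 log(1/beta)) = -1 - (3/2) log(1/beta) <= -1.
exp2 = 4 - 0.25 * h2_lp(beta)
assert exp2 <= -1
print(exp1, exp2)
```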

REFERENCES

[1] N. Bitouzé and L. Dolecek, “Synchronization from insertions and deletions under a non-binary, non-uniform source,” in Proc. IEEE Int. Symp. Inf. Theory, Istanbul, Turkey, Jul. 2013, pp. 2930–2934.

[2] N. Bitouzé, F. Sala, S. M. S. Tabatabaei Yazdi, and L. Dolecek, “A practical framework for efficient file synchronization,” in Proc. IEEE 51st Allerton Conf. Commun., Control, Comput., Monticello, IL, USA, Oct. 2013, pp. 1213–1220.

[3] M. Mitzenmacher and G. Varghese, “The complexity of object reconciliation, and open problems related to set difference and coding,” in Proc. IEEE 50th Allerton Conf. Commun., Control, Comput., Monticello, IL, USA, Oct. 2012, pp. 1126–1132.

[4] R. L. Dobrushin, “Shannon’s theorems for channels with synchronization errors,” Problems Inf. Transmiss., vol. 3, no. 4, pp. 18–36, 1967.

[5] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Sov. Phys.—Dokl., vol. 10, no. 8, pp. 707–710, Feb. 1966.

[6] G. Tenengolts, “Nonbinary codes, correcting single deletion or insertion (Corresp.),” IEEE Trans. Inf. Theory, vol. IT-30, no. 5, pp. 766–769, Sep. 1984.

[7] A. S. J. Helberg and H. C. Ferreira, “On multiple insertion/deletion correcting codes,” IEEE Trans. Inf. Theory, vol. 48, no. 1, pp. 305–308, Jan. 2002.

[8] K. A. S. Abdel-Ghaffar, F. Paluncic, H. C. Ferreira, and W. A. Clarke, “On Helberg’s generalization of the Levenshtein code for multiple deletion/insertion error correction,” IEEE Trans. Inf. Theory, vol. 58, no. 3, pp. 1804–1808, Mar. 2012.

[9] L. Dolecek and V. Anantharam, “Repetition error correcting sets: Explicit constructions and prefixing methods,” SIAM J. Discrete Math., vol. 23, no. 4, pp. 2120–2146, Jan. 2010.

[10] M. Mitzenmacher, “A survey of results for deletion channels and related synchronization channels,” Probab. Surv., vol. 6, pp. 1–33, Jun. 2009.

[11] H. Mercier, V. K. Bhargava, and V. Tarokh, “A survey of error-correcting codes for channels with symbol synchronization errors,” IEEE Commun. Surveys Tuts., vol. 12, no. 1, pp. 87–96, Feb. 2010.

[12] M. C. Davey and D. J. C. MacKay, “Reliable communication over channels with insertions, deletions, and substitutions,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 687–698, Feb. 2001.

[13] J. A. Briffa and H. G. Schaathun, “Improvement of the Davey–MacKay construction,” in Proc. IEEE Int. Symp. Inf. Theory Appl., Auckland, New Zealand, Dec. 2008, pp. 1–4.

[14] A. Orlitsky, “Interactive communication of balanced distributions and of correlated files,” SIAM J. Discrete Math., vol. 6, no. 4, pp. 548–564, 1993.

[15] A. Orlitsky and K. Viswanathan, “Practical protocols for interactive communication,” in Proc. IEEE Int. Symp. Inf. Theory, Washington, DC, USA, Jun. 2001, p. 115.

[16] G. Cormode, M. Paterson, S. C. Sahinalp, and U. Vishkin, “Communication complexity of document exchange,” in Proc. ACM-SIAM Symp. Discrete Algorithms, San Francisco, CA, USA, Jan. 2000, pp. 197–206.

[17] A. V. Evfimievski, “A probabilistic algorithm for updating files over a communication link,” in Proc. ACM-SIAM Symp. Discrete Algorithms, San Francisco, CA, USA, Jan. 1998, pp. 300–305.

[18] R. Venkataramanan, H. Zhang, and K. Ramchandran, “Interactive low-complexity codes for synchronization from deletions and insertions,” in Proc. IEEE 48th Allerton Conf. Commun., Control, Comput., Monticello, IL, USA, Sep./Oct. 2010, pp. 1412–1419.

[19] R. Venkataramanan, V. Narasimha Swamy, and K. Ramchandran, “Efficient interactive algorithms for file synchronization under general edits,” in Proc. 51st Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Monticello, IL, USA, Oct. 2013, pp. 1226–1233.

[20] N. Ma, K. Ramchandran, and D. Tse, “Efficient file synchronization: A distributed source coding approach,” in Proc. IEEE Int. Symp. Inf. Theory, St. Petersburg, Russia, Jul./Aug. 2011, pp. 583–587.

[21] S. M. S. Tabatabaei Yazdi and L. Dolecek, “A deterministic polynomial-time protocol for synchronizing from deletions,” IEEE Trans. Inf. Theory, vol. 60, no. 1, pp. 397–409, Jan. 2014.

[22] D. Starobinski, A. Trachtenberg, and S. Agarwal, “Efficient PDA synchronization,” IEEE Trans. Mobile Comput., vol. 2, no. 1, pp. 40–51, Jan./Mar. 2003.

[23] M. Braverman and A. Rao, “Towards coding for maximum errors in interactive communication,” in Proc. 43rd Annu. ACM Symp. Theory Comput., 2011, pp. 159–166.

[24] A. Tridgell, “Efficient algorithms for sorting and synchronization,” Ph.D. dissertation, Dept. Comput. Sci., Austral. Nat. Univ., Canberra, Australia, 2000.

[25] A. Rényi, “On measures of entropy and information,” in Proc. Berkeley Symp. Math., Statist. Probab., 1960, pp. 547–561.

[26] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” J. Amer. Statist. Assoc., vol. 58, no. 301, pp. 13–30, Mar. 1963.

[27] C. Schoeny, “Efficient file synchronization,” M.S. thesis, Dept. Elect. Eng., Univ. California, Los Angeles, Los Angeles, CA, USA, 2014.

Frederic Sala (S’13) received the B.S.E. degree in electrical engineering from the University of Michigan, Ann Arbor, in 2010, and the M.S. degree in electrical engineering from the University of California at Los Angeles (UCLA), in 2013, where he is currently pursuing the Ph.D. degree in electrical engineering. He is associated with the LORIS and CoDESS Laboratories at UCLA.

His research interests include information theory and coding with a focus on error-correction codes, including applications to synchronization and data storage in nonvolatile memories. He is a recipient of the NSF Graduate Research Fellowship and the UCLA Edward K. Rice Outstanding Masters Student Award.

Clayton Schoeny (S’09) received the B.S. and M.S. degrees in electrical engineering from the University of California at Los Angeles, in 2012 and 2014, respectively, where he is currently pursuing the Ph.D. degree with the Electrical Engineering Department. He has industry experience with The Aerospace Corporation, DIRECTV, and SPAWAR.

His research interests include coding theory and information theory, and he is associated with the LORIS and CoDESS Laboratories. He is a recipient of the Henry Samueli Excellence in Teaching Award.

Nicolas Bitouzé (S’10) was born in Alençon, France, in 1985. He received the M.Sc. degree in computer science from the University of Rennes 1, and the Magistère Informatique et Télécommunications degree from the École Normale Supérieure de Cachan, Rennes, France, in 2008. He has been associated with the Department of Electronics, TELECOM Bretagne, Brest, France, and with the LORIS Laboratory, University of California at Los Angeles. His research interests are in the area of communication theory, coding theory, and information theory.

Lara Dolecek (S’05–M’10–SM’12) received the B.S. (Hons.), M.S., and Ph.D. degrees in electrical engineering and computer sciences, and the M.A. degree in statistics from the University of California at Berkeley (UC Berkeley). She was a Post-Doctoral Researcher with the Laboratory for Information and Decision Systems, Massachusetts Institute of Technology. She is currently an Associate Professor with the Electrical Engineering Department, University of California at Los Angeles. Her research interests span coding and information theory, graphical models, statistical algorithms, and computational methods, with applications to emerging systems for data storage, processing, and communication. She received the 2007 David J. Sakrison Memorial Prize for the most outstanding doctoral research in the Department of Electrical Engineering and Computer Sciences, UC Berkeley. She received the IBM Faculty Award (2014), the Northrop Grumman Excellence in Teaching Award (2013), the Intel Early Career Faculty Award (2013), the University of California Faculty Development Award (2013), the Okawa Research Grant (2013), the NSF CAREER Award (2012), and the Hellman Fellowship Award (2011). With her research group, she also received the Best Paper Award from the IEEE GLOBECOM 2015 conference. She also serves as an Associate Editor of the IEEE TRANSACTIONS ON COMMUNICATIONS.