
Thesis Proposal

Coding for Synchronization Errors

Amirbehshad Shahrasbi

Computer Science Department
Carnegie Mellon University

Thesis Committee:
Bernhard Haeupler, Chair
Venkatesan Guruswami
Madhu Sudan (Harvard)
Rashmi Vinayak
Sergey Yekhanin (Microsoft Research)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.


Abstract

Coding theory is the study of algorithms and techniques that facilitate reliable information transmission over noisy mediums, mostly through combinatorial objects called error-correcting codes. Following the inspiring works of Shannon and Hamming, a sophisticated and extensive body of research on error correcting codes has led to a deep and detailed theoretical understanding as well as practical implementations that have helped fuel the digital revolution. Error-correcting codes can be found in essentially all modern communication, computation, and data storage systems. While being remarkably successful in understanding the theoretical limits and trade-offs of reliable communication under errors and erasures, the coding theory literature significantly lags behind when it comes to overcoming errors that concern the timing of communications. In particular, the study of correcting synchronization errors, i.e., symbol insertions and deletions, while initially introduced by Levenshtein in the 60s, has significantly fallen behind our highly sophisticated knowledge of codes for Hamming-type errors.

This thesis investigates coding against synchronization errors under a variety of models and attempts to understand trade-offs between different qualities of interest in respective coding schemes such as rate, distance, and algorithmic qualities of the code. Most of our results rely on synchronization strings, simple yet powerful pseudorandom objects that have proven to be very effective solutions for coping with synchronization errors in various settings.

Through indexing with strings that satisfy certain pseudo-random properties, we provide synchronization codes that achieve a near-optimal rate-distance trade-off. We further attempt to provide constructions that enable fast encoding/decoding procedures. We study the same problem under the list-decoding regime, where the decoder is expected to provide a short list of codewords that is guaranteed to contain the sent message. We will also try to better understand fundamental limits of list-decoding for synchronization errors, such as the list-decoding capacity or the maximal error resilience of list-decodable synchronization codes. This thesis furthermore studies synchronization strings and other related pseudo-random string properties as combinatorial objects that are of independent interest. Such combinatorial objects will be used to extend some of our techniques to alternative communication problems such as coding for block transposition errors or coding for interactive communication.


Contents

1 Introduction
  1.1 Scope of the Thesis
  1.2 Preliminaries
  1.3 Inspiring Questions: Results and Future Directions
    1.3.1 Rate vs. Distance
    1.3.2 Efficiency and Decoding Complexity
    1.3.3 List Decoding
    1.3.4 Error Resilience
    1.3.5 Alternative Settings and Other Synchronization Errors

2 Coding via Indexing with Pseudo-Random Strings
  2.1 Approaching the Singleton Bound: Core Idea
    2.1.1 Pseudo-random Property
    2.1.2 Indexing Scheme
    2.1.3 Repositioning (Decoding)
    2.1.4 Codes Approaching the Singleton Bound
  2.2 List Decoding
    2.2.1 High-Rate List Decodable InsDel Codes over Large Alphabets
    2.2.2 Bounds on List-Decoding Capacity

3 Pseudo-Random String Properties and Coding Consequences
  3.1 Online Decoding: Channel Simulations and Interactive Communication
    3.1.1 Channel Simulations
    3.1.2 Interactive Communication
  3.2 Local Decoding: Coding for Block Errors
  3.3 Existence and Construction
    3.3.1 Extremal Properties
    3.3.2 Infinite Synchronization Strings

4 Future Directions
  4.1 (Near) Linear Time Repositioning
  4.2 Rate-Distance Trade-off and Error Resilience


1 Introduction

Following the inspiring works of Shannon and Hamming, a sophisticated and extensive body of research on error correcting codes has led to an advanced theoretical understanding as well as a substantial practical impact on digital systems. Error correcting codes are vital elements of many modern communication, computation, and data storage systems. While being remarkably successful in understanding the theoretical limits and trade-offs of reliable communication under errors and erasures, the coding theory literature lags significantly behind when it comes to overcoming errors that concern the timing of communications. In particular, the study of correcting synchronization errors, i.e., symbol insertions and deletions, while initially introduced by Levenshtein in the 60s, has significantly fallen behind our highly sophisticated knowledge of codes for Hamming-type errors. This thesis will focus on studying and better understanding codes for synchronization errors.

Synchronization Errors. Consider a stream of symbols being transmitted through a noisy channel. There are two basic types of noise that we will consider, Hamming-type errors and synchronization errors. Hamming-type errors consist of erasures, that is, a symbol being replaced with a special "?" symbol indicating the erasure, and substitutions, in which a symbol is replaced with another symbol of the alphabet. We will measure Hamming-type errors in terms of half-errors. The wording half-error comes from the realization that, when it comes to code distances, erasures are half as bad as symbol corruptions. An erasure is thus counted as one half-error while a symbol substitution counts as two half-errors. Synchronization errors consist of deletions, that is, a symbol being removed without replacement, and insertions, where a new symbol is added somewhere within the stream.

Synchronization errors are strictly more general and harsher than half-errors. In particular, any symbol substitution, worth two half-errors, can also be achieved via a deletion followed by an insertion. Any erasure can be interpreted as a deletion together with the extra information of where this deletion has taken place. This shows that any error pattern generated by k half-errors can also be replicated using k synchronization errors, making dealing with synchronization errors at least as hard as half-errors. The real problem that synchronization errors bring with them, however, is that they cause the sending and receiving parties to become "out of sync". This easily changes how received symbols are interpreted and makes designing codes or other systems tolerant to synchronization errors an inherently difficult and significantly less well-understood problem.

1.1 Scope of the Thesis

The study of coding for synchronization errors was initiated by Levenshtein [Lev66] in 1966 when he showed that Varshamov-Tenengolts codes can correct a single insertion, deletion, or substitution error with an optimal redundancy of almost log n bits. Ever since, synchronization errors have been studied in various settings. In this section, we categorize some of the commonly studied settings and specify the one relevant to this thesis.

The first important aspect is the noise model. Several works have studied coding for synchronization errors under the assumption of random errors, most notably to study the capacity of deletion channels, which independently delete each transmitted symbol with some fixed probability (see [Mit09, MBT10, CR19]). This thesis exclusively focuses on worst-case error models in which correction has to be possible from any (adversarial) error pattern bounded only by the total number of insertions and deletions.

Another angle from which to categorize previous work on codes for synchronization errors is the noise regime. In the same spirit as ordinary error correcting codes, the study of families of synchronization codes has included both codes that protect against a fixed number of synchronization errors and codes for which the error count is a fixed fraction of the block length. The inspiring work of Levenshtein [Lev66] falls under the first category and is followed by several works designing synchronization codes correcting k errors for specific values of k [Slo02, Ten84, HF02, GS18] or with k as a general parameter [AGPFC11, BGZ17]. In this thesis, we focus on the second category, i.e., infinite families of synchronization codes with increasing block length that are defined over a fixed alphabet size and can correct a constant fraction of worst-case synchronization errors.

We furthermore mainly focus on codes that can be efficiently constructed and decoded, in contrast to merely existential results. The first such code was constructed by Schulman and Zuckerman [SZ99]. They provided an efficient, asymptotically good synchronization code with constant rate and constant distance. We will further discuss the previous work and how the contributions of this thesis fit into it when describing the contributions in the following chapters.

1.2 Preliminaries

In this section, we present preliminary definitions and set the notation for the rest of this proposal. We start with the definition of the edit distance.

Definition 1.1 (Edit distance). The edit distance between two strings S1, S2 ∈ Σ*, or ED(S1, S2), is the minimum number of insertions and deletions required to transform S1 into S2.

Similar to error correcting codes for Hamming-type errors, an error correcting code for insertions and deletions (or InsDel code for short) is defined as a subset C ⊆ Σ^n for an alphabet set Σ and block length n. The minimum distance of C is defined via the minimum edit distance between codewords (members) of C, i.e.,

δC = min_{x, y ∈ C, x ≠ y} ED(x, y) / (2n).

We remark that the edit distance of two strings of length n can be as large as 2n and hence the normalizing divisor is 2n. Further, the rate of the code C is defined as rC = log |C| / (n log |Σ|). An encoding function EncC : Σ^{nr} → Σ^n for C is a bijective function that maps any string in Σ^{nr} to a member of C, and a decoding function DecC is one that takes any string w ∈ Σ* and returns the (unique) codeword that is within δn edit distance of w, or ⊥ if no such codeword exists.
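To make Definition 1.1 and the distance/rate formulas concrete: since ED counts only insertions and deletions, ED(S1, S2) = |S1| + |S2| − 2·LCS(S1, S2), where LCS is the longest common subsequence. The following sketch computes δC and rC for a toy code (the code and all helper names are ours, purely for illustration):

```python
from itertools import combinations
from math import log2

def lcs_len(a: str, b: str) -> int:
    """Longest common subsequence length, standard quadratic DP."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

def edit_distance(s1: str, s2: str) -> int:
    """Insertion/deletion edit distance: ED = |s1| + |s2| - 2*LCS(s1, s2)."""
    return len(s1) + len(s2) - 2 * lcs_len(s1, s2)

C = ["0011", "0101", "1010"]          # a toy code over Sigma = {0, 1}, n = 4
n, alphabet_size = 4, 2
min_ed = min(edit_distance(x, y) for x, y in combinations(C, 2))
delta = min_ed / (2 * n)              # normalized minimum distance delta_C
rate = log2(len(C)) / (n * log2(alphabet_size))   # r_C = log|C| / (n log|Sigma|)
```

For this toy code the minimum pairwise edit distance is 2, so δC = 2/(2·4) = 0.25, and rC = log2(3)/4.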

In this work we often consider families of codes, which are formally defined as follows.

Definition 1.2 (Family of Codes). A family of codes C with distance δ and rate r is defined as an infinite sequence of codes C1, C2, ··· with increasing block lengths, each with distance δ, and with respective rates r1, r2, ··· such that lim_{i→∞} ri = r.


1.3 Inspiring Questions: Results and Future Directions

In this section, we overview the general questions that this thesis attempts to address. Some of these questions were (partially) answered by work done as part of this thesis, and some inspire future research directions.

1.3.1 Rate vs. Distance

The problem of characterizing the maximal rate of information transmission over a given noisy channel has been a fundamental question in information and coding theory. In the setting of InsDel codes, this question translates into finding the maximal rate achievable by a family of InsDel codes with a given minimum distance.

Question 1.3. For a given q and δ ∈ (0, 1), what is the largest rate r for which there exists a family of InsDel codes with minimum distance δ and rate r?

One can also consider an alphabet-free version of Question 1.3: given the freedom to choose the alphabet size of the family of codes as large as one wishes, what is the largest achievable rate?

Question 1.4. For a given δ ∈ (0, 1), what is the largest rate r for which there exist an integer q and a family of InsDel codes with minimum distance δ, rate r, and alphabet size q?

1.3.2 Efficiency and Decoding Complexity

In addition to codes with a good minimum distance, we furthermore seek efficient algorithms for the encoding and error-correction tasks associated with the code. We say a code is efficient if it has encoding and decoding algorithms running in time polynomial in the block length. While it is often not hard to show that random codes exhibit a good rate and distance, designing codes which can be decoded efficiently is much harder.

We remark that most codes which can efficiently correct symbol substitutions are also efficient for half-errors. For InsDel codes the situation is slightly different. While it remains true that any code that can be uniquely decoded from any δ fraction of deletions can also be decoded from the same fraction of insertions and deletions [Lev65], doing so efficiently is often much easier in the deletion-only setting than in the fully general InsDel setting.

Question 1.5. Find families of codes that attain the (near) optimal rate-distance trade-off specified in Questions 1.3 and 1.4 and are efficiently encodable and decodable from InsDels or deletions. How about linear-time or near-linear-time encoding/decoding procedures?

1.3.3 List Decoding

Thus far, we have considered the minimum distance of a code as its principal error correcting quality, since having a minimum distance of δ guarantees that any codeword of the code can be uniquely recovered after being exposed to a δ fraction of synchronization errors.


We furthermore study the questions outlined above under the model of "list-decoding", i.e., when the decoding algorithm is allowed to report a (short) list of potential codewords that is guaranteed to include the transmitted word if the number of errors is small enough. A list-decodable InsDel code is defined as follows.

Definition 1.6 (List-Decodable InsDel Codes). A family of InsDel codes C is (γ, δ, L(n))-list-decodable if for any code C in C with block length n, there exists a function D : Σ* → 2^C such that |D(w)| ≤ L(n) for every w ∈ Σ*, and for every codeword x ∈ C and every word w obtained from x by δ·n deletions of characters in x followed by γ·n insertions, it is the case that x ∈ D(w).

We often mention list-decodability without specifying the list size function L(n), in which case we consider a polynomial function L. In other words, a code C is called (γ, δ)-list-decodable if it is (γ, δ, L(n))-list-decodable for some polynomial function L(·).
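Definition 1.6 can be checked by brute force for small codes: w is obtainable from x by at most δn deletions followed by at most γn insertions exactly when some common subsequence z of x and w satisfies |x| − |z| ≤ δn and |w| − |z| ≤ γn, which the longest common subsequence decides. A sketch (function names are ours, for illustration):

```python
def lcs_len(a: str, b: str) -> int:
    """Longest common subsequence length, standard quadratic DP."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

def list_decode(code, w, delta, gamma):
    """Brute-force D(w): all codewords x that can yield w via at most
    delta*n deletions followed by gamma*n insertions, where n = len(x)."""
    out = []
    for x in code:
        l = lcs_len(x, w)
        if len(x) - l <= delta * len(x) and len(w) - l <= gamma * len(x):
            out.append(x)
    return out
```

For example, list_decode(["0000", "1111"], "0001", 0.25, 0.25) returns ["0000"]: one deletion of a 0 and one insertion of a 1 suffice.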

The questions presented before can also be asked with regard to list-decodable InsDel codes. Before stating the questions, we remark that the symmetry between insertion and deletion errors that holds in the unique-decoding model does not hold for list-decoding. As mentioned before, an InsDel code that is uniquely decodable from a fraction δ of deletions is also uniquely decodable from a δ fraction of a mixture of insertions and deletions, since both decoding qualities are equivalent to a minimum distance of δ. This does not necessarily hold for list-decodable InsDel codes. Therefore, in characterizing the list-decoding radius, we specify two parameters γ and δ as the fractions of insertions and deletions, respectively.

Question 1.7. For a given q, δ ∈ (0, 1), and γ > 0, what is the largest rate r for which there exists a family of InsDel codes with rate r that is (γ, δ)-list-decodable?

Once again, this question can be asked in an alphabet-independent fashion.

Question 1.8. For a given δ ∈ (0, 1) and γ > 0, what is the largest rate r for which there exist an integer q and a family of InsDel codes with rate r over an alphabet of size q that is (γ, δ)-list-decodable?

Similar computational efficiency questions can be asked for list-decodable codes.

Question 1.9. Find families of codes that attain the (near) optimal trade-off between rate and list-decoding radius specified in Questions 1.7 and 1.8 and are efficiently encodable and decodable from InsDels or deletions. How about linear-time or near-linear-time encoding/decoding procedures?

1.3.4 Error Resilience

A special case of the rate-distance trade-off question discussed in Section 1.3.1 is to determine the maximal fraction of synchronization errors that a family of q-ary codes with non-vanishing information rate can correct.

Question 1.10. For a given alphabet size q, what is the largest δ0 ∈ (0, 1) such that there exists a family of InsDel codes with minimum distance δ0 and positive rate?


We investigate the same question for list-decodable InsDel codes. However, as mentioned before, list-decodable codes need to be specified with two parameters δ and γ due to the lack of symmetry between insertions and deletions under the list-decoding model. Therefore, the error resilience in this case is characterized by a two-dimensional region, as follows.

Question 1.11. For a given alphabet size q, identify the set of all pairs (γ, δ) for which there exists a family of InsDel codes that is (γ, δ)-list-decodable and has positive rate.

1.3.5 Alternative Settings and Other Synchronization Errors

This thesis further considers alternative models in which correcting synchronization errors is of interest.

Block Transpositions and Duplications. We study codes that correct alternative types of synchronization errors, namely, block transpositions and block duplications. Block transposition errors allow arbitrarily long substrings of the message to be moved to another position in the message string. Similarly, block duplication errors are ones that pick a substring of the message and copy it between two symbols of the communication. The first asymptotically good efficient InsDel codes, proposed by Schulman and Zuckerman [SZ99], were able to correct a combination of InsDels and block transpositions.

Question 1.12. Find efficient/near-linear-time/linear-time synchronization codes that can correct block errors or combinations of block errors and InsDels. What is the optimal rate-distance trade-off?

Coding for Interactive Communication. Interactive communication between two parties is one in which the parties take turns transmitting message symbols to one another. The parties are assumed to hold private inputs denoted by X and Y, and the goal is for both parties to compute some function f(X, Y). Any strategy for computing f(X, Y) is called a protocol.

A coding scheme for interactive communication is one that takes any protocol that computes some function f over a noiseless channel and converts it into a protocol that computes f over a noisy channel. The rate of an interactive coding scheme is defined as the maximum, over all functions f, of the ratio of the length of the protocol in the absence of noise to the length of the protocol in the presence of noise.

In an interactive setting, a single deletion can halt the communication by leading to a state in which both parties wait to hear from the other side. To resolve this, we consider the InsDel interactive channel model as defined by Braverman et al. [BGMO16], where adversarial errors can only occur in the form of a deletion followed by an insertion, either in the same direction or the opposite one.

Question 1.13. Find (efficient/fast) coding schemes for interactive communication with (1)near-optimal rate or (2) optimal error resilience.


2 Coding via Indexing with Pseudo-Random Strings

In this section, we present the idea of constructing InsDel codes via indexing appropriately chosen Hamming-type error correcting codes with strings that satisfy a certain pseudo-random property. We define the operations of string indexing and code indexing as follows.

Definition 2.1 (String Indexing). For strings S = S1, S2, ···, Sn ∈ ΣS^n and I = I1, I2, ···, In ∈ ΣI^n, we define S indexed by I as S × I = (S1, I1), (S2, I2), ···, (Sn, In). Note that S × I ∈ (ΣS × ΣI)^n.

Definition 2.2 (Code Indexing). For a string I ∈ ΣI^n and a code C ⊆ ΣC^n, we define C indexed by I, or C × I ⊆ (ΣC × ΣI)^n, as the code obtained by indexing each codeword of C with I.

We will show that, through indexing, one can essentially compartmentalize the task of correcting synchronization errors into (1) repositioning, i.e., reordering the symbols of the received stream of data so that most symbols end up at their original positions in the sent message, and (2) correcting the Hamming-type errors induced by the (unavoidable) imperfections of the repositioning procedure.

2.1 Approaching the Singleton Bound: Core Idea

It is easy to verify that the Singleton bound holds for InsDel codes.

Theorem 2.3 (Singleton Bound [Sin64]). For any family of InsDel codes C with rate r and distance δ, r ≤ 1 − δ.

A series of works by Guruswami et al. [GL16, GW17] provides codes that achieve rates of Ω((1 − δ)^5) and 1 − O(√δ) while being able to efficiently recover from a δ fraction of insertions and deletions, in the high-noise and high-rate regimes respectively. We derive efficient families of codes over an adequately large alphabet that can approach the Singleton bound over the entire spectrum of distances.

Theorem 2.4. For any ε > 0 and δ ∈ (0, 1) there exists an encoding map E : Σ^k → Σ^n and a decoding map D : Σ* → Σ^k such that, if EditDistance(E(m), x) ≤ δn, then D(x) = m. Further, the rate is k/n > 1 − δ − ε, |Σ| = exp(1/ε), and E and D are explicit and can be computed in linear and quadratic time in n, respectively.

This will be the only technical portion of this document, presenting the core idea behind most of the techniques used to address the questions outlined in Section 1.3. We start by defining the pseudo-random string property ε-self-matching.

Definition 2.5. A string S ∈ Σ^n is ε-self-matching if it contains no two identical non-aligned subsequences of length nε or more, i.e., there exist no two increasing sequences a1, a2, ···, a⌊nε⌋ and b1, b2, ···, b⌊nε⌋ such that ai ≠ bi and S[ai] = S[bi] for all i.
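The longest self-matching of S is the LCS of S with itself under the constraint that no position is matched to itself, so the property of Definition 2.5 can be checked in quadratic time by a small variant of the standard LCS dynamic program. A sketch (function names are ours, for illustration):

```python
def longest_self_matching(s: str) -> int:
    """Longest matching of s against itself with no position matched to
    itself: the LCS DP over (s, s) with the diagonal forbidden."""
    n = len(s)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if s[i - 1] == s[j - 1] and i != j:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][n]

def is_self_matching(s: str, eps: float) -> bool:
    """True iff s contains no self-matching of length n*eps or more."""
    return longest_self_matching(s) < len(s) * eps
```

For example, longest_self_matching("aaaa") is 3 (positions 1, 2, 3 matched to 2, 3, 4), so "aaaa" is not 0.5-self-matching, while a string of distinct symbols has no self-matching at all.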


2.1.1 Pseudo-random Property

We first point out that random strings over an alphabet of size Ω(ε⁻²) satisfy the ε-self-matching property with high probability. Note that the probability that two given non-aligned subsequences of length nε in a random string over alphabet Σ are identical is |Σ|^(−nε). Also, there are no more than (n choose nε)² pairs of such subsequences. Therefore, by the union bound, the probability that such a random string fails the ε-self-matching property is at most

(n choose nε)² · |Σ|^(−nε) ≤ (ne/(nε))^(2nε) · |Σ|^(−nε) = (e²/(|Σ|ε²))^(nε),

and thus, if |Σ| = Ω(ε⁻²), a random string satisfies the ε-self-matching property with high probability.

2.1.2 Indexing Scheme

Consider a communication channel where a stream of n message symbols is communicated from the sender to the receiver and suffers from up to nδ adversarial insertions or deletions for some 0 ≤ δ < 1. Let m1, m2, ···, mn represent the message symbols that the sender wants to get to the receiver and s1, s2, ···, sn be some ε-self-matching string that the sender and the receiver have agreed upon. To communicate its message to the receiver, we have the sender send the sequence m × s = (m1, s1), (m2, s2), ···, (mn, sn) through the channel.

Note that in this setting a portion of the channel alphabet is designated to the ε-self-matching string and thus does not contain information. This portion will be used to reposition the message symbols on the receiving end of the communication, as we will describe in the next section.

2.1.3 Repositioning (Decoding)

We now show that, with the indexing scheme described above, the receiver can correctly identify the positions of most of the symbols it receives. Let us denote the sequence of symbols arriving at the receiving end by (m′1, s′1), (m′2, s′2), ···, (m′n′, s′n′). We show the following.

Lemma 2.6. There exists an algorithm for the receiving party that, given (m′1, s′1), ···, (m′n′, s′n′) and s1, ···, sn, guesses the position of each received symbol in the sent string such that the positions of all but O(n√ε) of the symbols that were not deleted in the channel are guessed correctly. This algorithm runs in Oε(n²) time.

Note that if no error occurs, the receiver expects the index portion of the received symbols to be identical to the ε-self-matching string s. With this observation, we present the decoding algorithm in Algorithm 1. The decoding algorithm calculates the longest common subsequence between the synchronization string, s, and the index portion of the received string, s′, and assigns each symbol of the received string that appears in the common subsequence to the position of the symbol of s that corresponds to it under the common subsequence. The algorithm repeats this procedure 1/√ε times and after each round eliminates received symbols whose positions have been guessed.

Proof of Lemma 2.6. Clearly, Algorithm 1 takes quadratic time, as it mainly runs Oε(1) instances of LCS computation over strings of length O(n).


Algorithm 1 Insertion Deletion Decoder

Input: s, (m′1, s′1), ···, (m′n′, s′n′)

1:  s′ ← [s′1, s′2, ···, s′n′]
2:  for i = 1 to n′ do
3:      Position[i] ← Undetermined
4:  end for
5:  for i = 1 to 1/√ε do
6:      Compute LCS(s, s′)
7:      for all corresponding s[i] and s′[j] in LCS(s, s′) do
8:          Position[j] ← i
9:      end for
10:     Remove all elements of LCS(s, s′) from s′
11: end for

Output: Position

To prove the correctness guarantee, we remark that there are two types of incorrect guesses for symbols that are not deleted by the adversary, and we bound the number of incorrect guesses of each type.

I) The position of the received symbol remains Undetermined by the end of the algorithm: Note that if, by the end of the algorithm, there are k original symbols, i.e., symbols that were originally sent by the sender and not inserted by the adversary, whose positions remain undetermined, then the remainder of s′ after 1/√ε rounds has a common subsequence of size k with s. This implies that, in each round of the for loop, |LCS(s, s′)| ≥ k. Note that the total size of these LCSs cannot exceed the initial size of s′, which is n′. Therefore, k · (1/√ε) ≤ n′ ≤ 2n, which gives k ≤ 2√ε·n.

II) The position of the received symbol is incorrectly guessed in one iteration of the for loop: We claim that the number of such wrong assignments in each round of the for loop is no more than nε. Let s[i] and s′[j] be corresponding elements under LCS(s, s′) in Line 7 while the received symbol that s′[j] indexes is actually the i′-th symbol sent by the sender, for some i′ ≠ i. This implies that s[i] = s′[j] = s[i′]. If there were more than nε such incorrect guesses in one LCS computation, we would have nε such pairs of identical symbols in s, which constitute a self-matching of size nε in s and violate the assumption that s is an ε-self-matching string. Therefore, overall there are no more than (1/√ε) · nε = n√ε incorrect determinations of the original positions of received symbols.
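The repositioning procedure of Algorithm 1 can be sketched in Python as follows; lcs_pairs recovers the matched index pairs of one LCS computation, and all names are ours, for illustration:

```python
import math

def lcs_pairs(a, b):
    """One LCS of a and b, returned as matched index pairs (i, j)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    pairs, i, j = [], len(a), len(b)
    while i and j:                      # backtrack through the DP table
        if a[i - 1] == b[j - 1] and dp[i][j] == dp[i - 1][j - 1] + 1:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def reposition(s, s_received, eps):
    """Guess the original position of each received index symbol by
    repeating the LCS step ceil(1/sqrt(eps)) times, as in Algorithm 1."""
    position = [None] * len(s_received)       # None plays "Undetermined"
    remaining = list(enumerate(s_received))   # (index in s', symbol)
    for _ in range(math.ceil(1 / math.sqrt(eps))):
        matched = lcs_pairs(s, [sym for _, sym in remaining])
        for i, j in matched:
            position[remaining[j][0]] = i
        matched_js = {j for _, j in matched}
        remaining = [e for j, e in enumerate(remaining) if j not in matched_js]
    return position
```

For instance, with index string s = "0123" and received index portion "013" (one deletion), reposition returns [0, 1, 3], correctly placing every surviving symbol.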

2.1.4 Codes Approaching the Singleton Bound

We now use the discussion of ε-self-matching strings and Lemma 2.6 to construct efficient synchronization codes that can approach the Singleton bound. Note that the indexing scheme from Section 2.1.2 and Lemma 2.6 essentially give a way to reduce insertions and deletions to symbol substitutions and erasures, at the cost of designating a portion of the message symbols to an ε-self-matching string. More precisely, with the indexing scheme from Section 2.1.2 in place, a receiver can use Algorithm 1 to guess the positions of the symbols it receives in the sent message and rearrange them to recover the message sent by the sender.

Let m̄ denote the recovered message and Position denote the output of Algorithm 1. More precisely, for any 1 ≤ i ≤ n, the decoder sets m̄[i] = m′j if j is the unique value with Position[j] = i. If there are zero or multiple received symbols that are guessed to be at position i, the decoder simply sets m̄[i] = ?.

We claim that m̄ differs from m by no more than n(δ + 12√ε) half-errors. Note that if the adversary applies no errors and Algorithm 1 guesses the positions perfectly, then m̄ = m. In the following steps we add these imperfections and observe their effect on the Hamming distance of m̄ and m.

• Each deleted symbol turns a determined symbol of m̄ into a ? and therefore adds one half-error to the Hamming distance of m̄ and m.

• Each inserted symbol can either turn a determined symbol of m̄ into a ? or turn a ? into an incorrect value. Therefore, each insertion also adds one half-error to the Hamming distance of m̄ and m.

• Each incorrectly guessed symbol can change up to two symbols in m̃ and therefore increase the Hamming distance between m̃ and m by up to four half-errors.

This implies that m̃ and m are at distance no more than n(δ + 12√ε). Having this reduction, we derive the codes promised in Theorem 2.4 by taking the following near-MDS codes from [GI05] and indexing their codewords with a self-matching string.

Theorem 2.7 (Guruswami and Indyk [GI05, Theorem 3]). For every r, 0 < r < 1, and all sufficiently small ε > 0, there exists an explicitly specified family of GF(2)-linear (also called additive) codes of rate r and relative distance at least (1 − r − ε) over an alphabet of size 2^{O(ε^{−4} r^{−1} log(1/ε))} such that codes from the family can be encoded in linear time and can also be (uniquely) decoded in linear time from a fraction e of errors and s of erasures provided 2e + s ≤ (1 − r − ε).

Note that the alphabet size of the codes from Theorem 2.4 is exponentially large in terms of ε^{−1}. This is in sharp contrast to the Hamming error setting, where codes are known that get ε-close to the unique-decoding capacity with alphabets of size polynomial in 1/ε. While large alphabets might seem like an intrinsic weakness of indexing-based code constructions, it turns out that an exponentially large alphabet is in fact necessary. We present the following theorem from [HSS18], which shows that any such code requires an alphabet of size exponential in ε^{−1}.

Theorem 2.8. There exists a function f : (0, 1) → (0, 1) such that, for every δ, ε > 0, every family of insdel codes of rate 1 − δ − ε that can be uniquely decoded from a δ-fraction of synchronization errors must have alphabet size q ≥ exp(f(δ)/ε).

2.2 List Decoding

List-decodable codes for insertions and deletions have been studied in the literature. Guruswami and Wang [GW17] have provided positive-rate binary deletion codes that can be list-decoded from close to a 1/2 fraction of deletions. Recent works of Wachter-Zeh [WZ17] and Hayashi and Yasunaga [HY18] have studied list-decoding by providing Johnson-type bounds for synchronization codes that relate the minimum edit-distance of a code to its list-decoding properties. The bounds presented in [HY18] show that the binary codes of Bukh, Guruswami, and Hastad [BGH17] can be list-decoded from a fraction ≈ 0.707 of insertions. Via a concatenation scheme used in [GW17] and [GL16], Hayashi and Yasunaga furthermore made these codes efficient. A recent work of Liu, Tjuawinata, and Xing [LTX19] also derives bounds on the list-decoding radius, provides efficiently list-decodable insertion-deletion codes over small alphabets, and gives a Zyablov-type bound for synchronization codes.

2.2.1 High-Rate List Decodable InsDel Codes over Large Alphabets

Using a similar indexing-based construction as presented in Section 2.1, we derive families of list-decodable codes that approach the optimal rate in the large-alphabet regime. More precisely, for every 0 ≤ δ < 1, every 0 ≤ γ < ∞, and every ε > 0, there exist codes of rate 1 − δ − ε with constant alphabet size (so q = O_{δ,γ,ε}(1)) and sub-logarithmic list sizes. Furthermore, our codes are accompanied by efficient (polynomial-time) decoding algorithms. We stress that the fraction of insertions can be arbitrarily large (more than 100%), and the rate is independent of this parameter.

Theorem 2.9. For every 0 < δ, ε < 1 and γ > 0, there exists a family of list-decodable InsDel codes that can protect against a δ-fraction of deletions and a γ-fraction of insertions and achieves a rate of at least 1 − δ − ε over an alphabet of size ((γ+1)/ε²)^{O((γ+1)/ε³)} = O_{γ,ε}(1). These codes are list-decodable with lists of size L_{ε,γ}(n) = exp(exp(exp(log* n))), and have polynomial-time encoding and decoding complexities.

The codes from Theorem 2.9 are obtained by indexing an appropriately chosen code with self-matching strings. We will not describe the details of the construction, but the main idea is to maintain a list of candidate symbols for each position, instead of declaring a ? when the repositioning algorithm finds multiple candidates for some position, and to replace the near-MDS error-correcting code from Theorem 2.7 with a high-rate list-recoverable code like the ones from [HRZW17, GX17, KRRZ+19].
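The "lists instead of ?" idea can be sketched as a small variant of the repositioning post-processing. This is a toy illustration under our own naming, not the actual construction:

```python
def candidate_lists(received, position, n):
    """Collect, for each of the n positions, every received symbol whose
    guessed original index is that position.  Ambiguity produces a longer
    list instead of an erasure; the lists S_1, ..., S_n are then fed to an
    outer list-recoverable code in the sense of Definition 2.10."""
    lists = [set() for _ in range(n)]
    for sym, i in zip(received, position):
        if i is not None and 0 <= i < n:  # skip unplaceable guesses
            lists[i].add(sym)
    return lists
```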

Definition 2.10. A code C given by the encoding function E : Σ^{nr} → Σ^n is said to be (α, l, L)-list recoverable if for any collection of n sets S_1, S_2, ..., S_n ⊂ Σ of size l or less, there are at most L codewords for which at least αn elements appear in the list that corresponds to their position, i.e.,

|{x ∈ C : |{i ∈ [n] : x_i ∈ S_i}| ≥ αn}| ≤ L.
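Definition 2.10 can be checked by brute force for toy parameters. The sketch below (our own code, exponential in n, and only meant to make the quantifiers concrete) enumerates all collections of input lists:

```python
from itertools import combinations, product

def is_list_recoverable(code, alpha, l, L):
    """Return True iff for every choice of lists S_1..S_n of size <= l, at
    most L codewords agree with the lists in at least alpha*n positions."""
    n = len(next(iter(code)))
    alphabet = sorted({x for c in code for x in c})
    # all nonempty subsets of the alphabet of size at most l
    subsets = [set(s) for k in range(1, l + 1)
               for s in combinations(alphabet, k)]
    for lists in product(subsets, repeat=n):
        agree = lambda c: sum(1 for x, S in zip(c, lists) if x in S)
        if sum(1 for c in code if agree(c) >= alpha * n) > L:
            return False
    return True
```

For example, the full binary code of length 2 is (1, 1, 1)-list recoverable, since any pair of singleton lists pins down a single codeword, but it is not (1, 2, 1)-list recoverable.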

2.2.2 Bounds on List-Decoding Capacity

As a part of this thesis, we have provided bounds on the zero-error list-decoding capacity of insertion-only and deletion-only channels in a joint work with Bernhard Haeupler and Madhu Sudan [HSS18]. The lower bounds are obtained by an analysis of the list-decoding radius of random codes, and the upper bounds are obtained by proposing strategies for the adversary that guarantee a small ensemble of received words on the receiver side of the communication. These bounds have since been improved [LTX19], but there is still a gap between the bounds, which leaves room for further investigation.


3 Pseudo-Random String Properties and Coding Consequences

The pseudo-random properties utilized in the construction of codes in the previous sections greatly impact the quality of the resulting codes. In this section, we introduce a line of research pursued in this thesis that explores:

1. Different pseudo-random properties and coding guarantees that they entail.

2. Constructions of strings with such pseudo-random properties.

3.1 Online Decoding: Channel Simulations and Interactive Communication

In [HS17], synchronization strings are introduced as follows.

Definition 3.1 (ε-synchronization strings). A string S ∈ Σ^n is an ε-synchronization string if for every 1 ≤ i < j < k ≤ n + 1 we have that ED(S[i, j), S[j, k)) > (1 − ε)(k − i).

In simpler terms, the ε-synchronization property is a pseudo-random property that requires all pairs of neighboring substrings of the string to be far apart under the edit distance metric. It is shown in [HS17] that ε-synchronization is not only a strictly stronger property than the self-matching property but also a hereditary extension of it, i.e., it carries over to substrings.
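For short strings, Definition 3.1 can be verified by brute force. The sketch below (our own helper names) uses the insertion/deletion edit distance ED(a, b) = |a| + |b| − 2·LCS(a, b), which is the metric used for synchronization strings:

```python
def lcs(a, b):
    # standard dynamic program for the longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def ed(a, b):
    # insertion/deletion edit distance
    return len(a) + len(b) - 2 * lcs(a, b)

def is_synchronization_string(s, eps):
    # check ED(S[i, j), S[j, k)) > (1 - eps)(k - i) for all 0 <= i < j < k <= n
    n = len(s)
    return all(ed(s[i:j], s[j:k]) > (1 - eps) * (k - i)
               for i in range(n)
               for j in range(i + 1, n)
               for k in range(j + 1, n + 1))
```

For instance, "abab" fails for every ε < 1 because the adjacent substrings "ab" and "ab" have edit distance zero.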

Indexed onto the symbols of a communication over an insertion-deletion channel, synchronization strings can be used to guess the original positions of the symbols. However, for synchronization strings, the repositioning can be done in an online fashion, i.e., the position of each symbol is guessed upon its arrival and without waiting for the rest of the communication to take place. This comes at the cost of translating nδ InsDel errors into O(nδ) half-errors (rather than the n(δ + ε) that was the case in Algorithm 1). This enables a delay-free simulation of a channel with Hamming-type errors over a given InsDel channel with an adequately large alphabet size.

3.1.1 Channel Simulations

Given a channel afflicted by synchronization errors, one can place two simulation agents at the two ends of the channel that simulate a channel with Hamming-type errors over the given channel. In other words, the sender/receiver sends/receives symbols to/from their corresponding agent, and the simulation guarantees that the channel appears to the parties as a channel with Hamming-type errors.

Note that the indexing scheme from Section 2.1 almost achieves this goal by reducing synchronization errors to half-errors through indexing. However, this procedure requires all symbols to be communicated before the repositioning algorithm of Section 2.1 can start running and, therefore, introduces a delay. A true channel simulation would not add such a delay. More precisely, a round of error-free communication in a simulated channel is one that communicates the ith symbol sent by the sender as the ith symbol to the receiver once it arrives at the other side and prior to the (i + 1)st symbol being sent by the sender.


This subtle requirement can be satisfied by using synchronization strings as the indexing sequence and utilizing their online repositioning algorithm. Before presenting the channel simulations, we remark on an interesting negative result of [HSV18] stating that, as opposed to codes, when it comes to channel simulations, no channel simulator can reduce a δ fraction of synchronization errors to δ + ε half-errors for arbitrarily small ε.

Theorem 3.2. Assume that n uses of a synchronization channel over an arbitrarily large alphabet Σ with a δ fraction of insertions and deletions are given. There is no deterministic simulation of a half-error channel over any alphabet Σ_sim where the simulated channel guarantees more than n(1 − 4δ/3) uncorrupted transmitted symbols. If the simulation is randomized, the expected number of uncorrupted transmitted symbols is at most n(1 − 7δ/6).

Here is a description of the simulations.

Theorem 3.3 (Channel Simulations).

(a) Suppose that n rounds of a one-way/interactive insdel channel over an alphabet Σ with a δ fraction of insertions and deletions are given. Using a long-distance ε-synchronization string over an alphabet Σ_syn, it is possible to simulate n(1 − O_ε(δ)) rounds of a one-way/interactive corruption channel over Σ_sim with at most O_ε(nδ) symbols corrupted, so long as |Σ_sim| × |Σ_syn| ≤ |Σ|.

(b) Suppose that n rounds of a binary one-way/interactive insertion-deletion channel with a δ fraction of insertions and deletions are given. It is possible to simulate n(1 − Θ(√(δ log(1/δ)))) rounds of a binary one-way/interactive corruption channel with a Θ(√(δ log(1/δ))) fraction of corruption errors between two parties over the given channel.

All of the simulations mentioned above are efficient.

3.1.2 Interactive Communication

Very little is known regarding coding for interactive communication in the presence of synchronization errors. A 2016 coding scheme by Braverman et al. [BGMO16], which can be seen as the equivalent of Schulman's seminal result for Hamming-type interactive coding schemes [Sch96], achieves a small constant communication rate while being robust against a 1/18 − ε fraction of errors. The coding scheme relies on edit-distance tree codes, which are a careful adaptation of Schulman's original tree codes [Sch93] for edit distance, so the decoding operations are not efficient and require exponential-time computations. A recent work by Sherstov and Wu [SW19] closed the gap on the maximum tolerable error fraction by introducing a coding scheme that is robust against a 1/6 − ε fraction of errors, which is the highest possible fraction of insertions and deletions under which any coding scheme for interactive communication can work. The schemes of both Braverman et al. [BGMO16] and Sherstov and Wu [SW19] have constant communication rate, work over large enough constant alphabets, and are inefficient. In this thesis we have addressed the questions of finding interactive coding schemes that are computationally efficient or achieve super-constant communication efficiency.

We use our large-alphabet interactive channel simulation along with the constant-rate efficient coding scheme of Ghaffari and Haeupler [GH14] for interactive communication over substitution channels to obtain a coding scheme for insertion-deletion channels that is efficient, has a constant communication rate, and tolerates up to a 1/44 − ε fraction of errors. Note that despite the fact that this coding scheme fails to protect against the optimal 1/6 − ε fraction of synchronization errors, as the recent work by Sherstov and Wu [SW19] does, it improves over all previous work in terms of computational efficiency, as it is the first efficient coding scheme for interactive communication over insertion-deletion channels.

Theorem 3.4. For any constant ε > 0 and any n-round alternating protocol Π, there is an efficient randomized coding scheme simulating Π in the presence of a δ = 1/44 − ε fraction of edit-corruptions with constant rate (i.e., in O(n) rounds) and in O(n^5) time that works with probability 1 − 2^{−Θ(n)}. This scheme requires the alphabet size to be a large enough constant Ω_ε(1).

Next, we use our small-alphabet channel simulation and the corruption-channel interactive coding scheme of Haeupler [Hae14] to introduce an interactive coding scheme for insertion-deletion channels. This scheme is not only computationally efficient, but also the first with super-constant communication rate. In other words, this is the first coding scheme for interactive communication over insertion-deletion channels whose rate approaches one as the error fraction drops to zero. Our computationally efficient interactive coding scheme achieves a near-optimal communication rate of 1 − O(√(δ log(1/δ))) and tolerates a δ fraction of errors. Besides computational efficiency and near-optimal communication rate, this coding scheme improves over all previous work in terms of alphabet size. As opposed to the coding schemes provided by previous work [BGMO16, SW19], our scheme does not require a large enough constant alphabet and works even for binary alphabets.

Theorem 3.5. For sufficiently small δ, there is an efficient interactive coding scheme for fully adversarial binary insertion-deletion channels that is robust against a δ fraction of edit-corruptions, achieves a communication rate of 1 − Θ(√(δ log(1/δ))), and works with probability 1 − 2^{−Θ(nδ)}.

3.2 Local Decoding: Coding for Block Errors

A local repositioning algorithm is one that guesses the position of a received symbol using only the knowledge of a small, O(log n)-sized neighborhood of surrounding received symbols, as opposed to all received symbols (which is what Algorithm 1 does). In [HS18], we proposed a pseudo-random string property called long-distance synchronization strings, defined as follows.

Definition 3.6 (c-long-distance ε-synchronization string). A string S ∈ Σ^n is a c-long-distance ε-synchronization string if for every pair of substrings S[i, j) and S[i′, j′) that are either adjacent or of total length c log n or more, ED(S[i, j), S[i′, j′)) > (1 − ε)l, where l = j + j′ − i − i′.

Indexed with a long-distance synchronization string, a stream of symbols can be repositioned in a local fashion.


Theorem 3.7. For a communication over a synchronization channel that is indexed by a long-distance synchronization string, there exists an online and local repositioning algorithm that guesses the position of each received symbol using only the symbol itself and the O_ε(log n) symbols preceding it, in O_ε(log³ n) time. Also, among all symbols that are not deleted by the adversary, the positions of no more than nδ/(1 − ε) will be incorrectly guessed.

Local repositioning algorithms allow for synchronization codes that can correct block transpositions and block duplications. Block transposition errors allow arbitrarily long substrings of the message to be moved to another position in the message string. Similarly, block duplication errors pick a substring of the message and copy it between two symbols of the communication.

We will present codes that achieve a rate of 1 − δ − ε and correct some O(δ) fraction of synchronization errors, an O(δ/log n) fraction of block errors, or a combination of them. A similar result for insertions, deletions, and block transpositions was shown by Schulman and Zuckerman [SZ99], who provided the first such code with constant distance and constant rate. They also showed that the O(δ/log n) resilience against block errors is optimal up to a constant factor.

Theorem 3.8. For any 0 < r < 1 and sufficiently small ε, there exists a code with rate r that corrects nδ_insdel synchronization errors and nδ_block block transpositions or duplications as long as 6δ_insdel + (c log n)δ_block < 1 − r − ε for some c = O(1). The code is over an alphabet of size O_ε(1) and has O(n) encoding and O(N log³ n) decoding complexities, where N is the length of the received message.

Note that the local quality of the repositioning algorithm implies that any symbol at the receiver that does not have any synchronization errors or block-error borders in its O(log n) neighborhood is correctly repositioned by the local repositioning algorithm. Therefore, with nδ_block block errors, no more than nδ_block log n repositioning guesses would be incorrect. This implies an O(n/log n) block-error resilience.
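The counting behind this resilience can be written out explicitly; the constants here are illustrative and follow the statement of Theorem 3.8:

```latex
% Each of the $n\delta_{\mathrm{block}}$ block errors creates $O(1)$ boundaries,
% and a received symbol can be misplaced only if a boundary falls within its
% $O(\log n)$ neighborhood, so
\#\{\text{incorrect repositionings}\} \;\le\; n\delta_{\mathrm{block}} \cdot O(\log n).
% Keeping this below the distance budget of the outer code requires
(c \log n)\,\delta_{\mathrm{block}} \;<\; 1 - r - \varepsilon
\quad\Longrightarrow\quad
\delta_{\mathrm{block}} \;=\; O\!\left(\frac{1}{\log n}\right).
```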

3.3 Existence and Construction

As part of this thesis, we have studied synchronization strings and the other related pseudo-random strings presented so far in great detail. We provide constructions of such strings with almost perfect properties, as summed up in the following theorem from [HS18].

Theorem 3.9. There is a deterministic algorithm that, for any constant 0 < ε < 1 and n ∈ N, computes a c = ε^{−O(1)}-long-distance ε-synchronization string S ∈ Σ^n where |Σ| = ε^{−O(1)}. This construction runs in linear time and, moreover, any substring S[i, i + log n] can be computed in O_ε(log n) time.

3.3.1 Extremal Properties

We have also studied extremal questions raised by the definition of the synchronization string property. One interesting question is how small, as a function of ε, the alphabet size can be while ε-synchronization strings still exist. We provide the following positive result in [CHL+19].


Theorem 3.10. For any ε ∈ (0, 1), there exists an alphabet Σ of size O(ε^{−2}) such that for any n ≥ 1, there exists an ε-synchronization string of length n over Σ.

We further show in [CHL+19] that any such alphabet has to be of size Ω(ε^{−3/2}). This leaves us with the open question of where the minimal alphabet size lies between Ω(ε^{−3/2}) and O(ε^{−2}).

A similar question can be asked for non-specific values of ε, i.e., what is the smallest alphabet size over which arbitrarily long ε-synchronization strings exist for some ε < 1? It is easy to observe that any binary string of length 4 or more contains two identical neighboring substrings. Also, it has been shown that arbitrarily long (11/12)-synchronization strings exist over an alphabet of size four [CHL+19]. This leaves open the question of whether arbitrarily long synchronization strings exist over a ternary alphabet.

Question 3.11. For a given ε > 0, what is the smallest alphabet over which arbitrarily long ε-synchronization strings exist?

Question 3.12. Is there an ε0 < 1 such that arbitrarily long ε0-synchronization strings exist over a ternary alphabet?

3.3.2 Infinite Synchronization Strings

An infinite ε-synchronization string is naturally defined as an infinite string in which any two neighboring intervals [i, j) and [j, k) have an edit distance of at least (1 − ε)(k − i). As part of this thesis, we have studied the existence and construction of infinite synchronization strings. The following theorem sums up our findings.

Theorem 3.13. For all 0 < ε < 1, there exists an infinite ε-synchronization string S over a poly(ε^{−1})-sized alphabet such that any prefix of it can be computed in linear time. Further, for any i, S[i, i + log i] can be computed in O(log i) time.


4 Future Directions

Besides open problems discussed throughout the proposal, we present a couple of other topicsto be explored.

4.1 (Near) Linear Time Repositioning

We presented near-linear-time synchronization codes in Theorem 3.8 using the local repositioning algorithm from Theorem 3.7. However, the rate-distance trade-off of these codes is off from the Singleton bound by a constant factor. The question of finding codes with a rate-distance trade-off approaching the Singleton bound that are decodable in linear or near-linear time remains open.

4.2 Rate-Distance Trade-off and Error Resilience

As mentioned above, it is known that there exist positive-rate binary deletion codes that are list-decodable from any fraction of deletions smaller than 1/2. Also, there are codes that can list-decode from a fraction ≈ 0.707 of insertions. In a recent unpublished work with Bernhard Haeupler and Venkatesan Guruswami, we attempted to identify the error resilience region for list-decodable codes. We also hope to derive bounds on the capacity of list-decodable InsDel codes by identifying the error resilience region. Note that if obtaining positive-rate codes is not possible under an α0 = (γ0, δ0) fraction of errors, then with an α = ρα0 fraction of errors (for some ρ < 1), an adversary can kill the information content of the first ρ fraction of symbols of the communication and thereby imply an upper bound of 1 − ρ on the capacity of the channel.
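The scaling argument in the last two sentences can be recorded as a short derivation (writing α = (γ, δ) for an insertion/deletion error budget, as in the text):

```latex
% Suppose no positive-rate code tolerates an $\alpha_0 = (\gamma_0, \delta_0)$
% fraction of errors.  Under a budget $\alpha = \rho\,\alpha_0$ with $\rho < 1$,
% the adversary concentrates all errors on the first $\rho n$ symbols, where the
% local error fraction is
\frac{\rho\,\alpha_0\,n}{\rho\,n} \;=\; \alpha_0,
% enough to destroy all information carried by that prefix.  Only the remaining
% $(1-\rho)n$ symbols can carry information, so the capacity satisfies
C \;\le\; 1 - \rho.
```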

A similar question can be asked for uniquely decodable synchronization codes, i.e., what is the largest fraction of errors δ0 for which there exist positive-rate synchronization codes with minimum edit-distance δ0? For binary alphabets, it is easy to see that δ0 ≤ 1/2. However, the most resilient binary codes with positive rate to date are the ones introduced by Bukh, Guruswami, and Hastad [BGH17], which can correct a √2 − 1 ≈ 0.4142 fraction of errors. Determining the optimal error resilience for uniquely decodable synchronization codes remains an interesting open question. We refer the reader to [CR19] for a more comprehensive review of past work on error resilience for synchronization codes.


References

[AGPFC11] Khaled A. S. Abdel-Ghaffar, Filip Paluncic, Hendrik C. Ferreira, and Willem A. Clarke. On Helberg's generalization of the Levenshtein code for multiple deletion/insertion error correction. IEEE Transactions on Information Theory, 58(3):1804–1808, 2011.

[BGH17] Boris Bukh, Venkatesan Guruswami, and Johan Hastad. An improved bound on the fraction of correctable deletions. IEEE Transactions on Information Theory, 63(1):93–103, 2017.

[BGMO16] Mark Braverman, Ran Gelles, Jieming Mao, and Rafail Ostrovsky. Coding for interactive communication correcting insertions and deletions. In Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP), 2016.

[BGZ17] Joshua Brakensiek, Venkatesan Guruswami, and Samuel Zbarsky. Efficient low-redundancy codes for correcting multiple deletions. IEEE Transactions on Information Theory, 64(5):3403–3410, 2017.

[CHL+19] Kuan Cheng, Bernhard Haeupler, Xin Li, Amirbehshad Shahrasbi, and Ke Wu. Synchronization strings: Highly efficient deterministic constructions over small alphabets. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2019.

[CR19] Mahdi Cheraghchi and Joao Ribeiro. An overview of capacity results for synchronization channels. arXiv preprint arXiv:1910.07199, 2019.

[GH14] Mohsen Ghaffari and Bernhard Haeupler. Optimal error rates for interactive coding II: Efficiency and list decoding. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 394–403. IEEE, 2014.

[GI05] Venkatesan Guruswami and Piotr Indyk. Linear-time encodable/decodable codes with near-optimal rate. IEEE Transactions on Information Theory, 51(10):3393–3400, 2005.

[GL16] Venkatesan Guruswami and Ray Li. Efficiently decodable insertion/deletion codes for high-noise and high-rate regimes. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 620–624. IEEE, 2016.

[GS18] Ryan Gabrys and Frederic Sala. Codes correcting two deletions. IEEE Transactions on Information Theory, 65(2):965–974, 2018.

[GW17] Venkatesan Guruswami and Carol Wang. Deletion codes in the high-noise and high-rate regimes. IEEE Transactions on Information Theory, 63(4):1961–1970, 2017.


[GX17] Venkatesan Guruswami and Chaoping Xing. Optimal rate list decoding over bounded alphabets using algebraic-geometric codes. arXiv preprint arXiv:1708.01070, 2017.

[Hae14] Bernhard Haeupler. Interactive channel capacity revisited. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), pages 226–235, 2014.

[HF02] Albertus S. J. Helberg and Hendrik C. Ferreira. On multiple insertion/deletion correcting codes. IEEE Transactions on Information Theory, 48(1):305–308, 2002.

[HRZW17] Brett Hemenway, Noga Ron-Zewi, and Mary Wootters. Local list recovery of high-rate tensor codes and applications. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), 2017.

[HS17] Bernhard Haeupler and Amirbehshad Shahrasbi. Synchronization strings: Codes for insertions and deletions approaching the Singleton bound. In Proceedings of the Annual Symposium on Theory of Computing (STOC), 2017.

[HS18] Bernhard Haeupler and Amirbehshad Shahrasbi. Synchronization strings: Explicit constructions, local decoding, and applications. In Proceedings of the Annual Symposium on Theory of Computing (STOC), 2018.

[HSS18] Bernhard Haeupler, Amirbehshad Shahrasbi, and Madhu Sudan. Synchronization strings: List decoding for insertions and deletions. In 45th International Colloquium on Automata, Languages, and Programming (ICALP), 2018.

[HSV18] Bernhard Haeupler, Amirbehshad Shahrasbi, and Ellen Vitercik. Synchronization strings: Channel simulations and interactive coding for insertions and deletions. In 45th International Colloquium on Automata, Languages, and Programming (ICALP), pages 75:1–75:14, 2018.

[HY18] Tomohiro Hayashi and Kenji Yasunaga. On the list decodability of insertions and deletions. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 86–90. IEEE, 2018.

[KRRZ+19] Swastik Kopparty, Nicolas Resch, Noga Ron-Zewi, Shubhangi Saraf, and Shashwat Silas. On list recovery of high-rate tensor codes. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.

[Lev65] Vladimir Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163(4):845–848, 1965.

[Lev66] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710, 1966.


[LTX19] Shu Liu, Ivan Tjuawinata, and Chaoping Xing. List decoding of insertion and deletion codes. arXiv preprint arXiv:1906.09705, 2019.

[MBT10] Hugues Mercier, Vijay K. Bhargava, and Vahid Tarokh. A survey of error-correcting codes for channels with symbol synchronization errors. IEEE Communications Surveys & Tutorials, 12(1), 2010.

[Mit09] Michael Mitzenmacher. A survey of results for deletion channels and related synchronization channels. Probability Surveys, 6:1–33, 2009.

[Sch93] Leonard J. Schulman. Deterministic coding for interactive communication. In Proceedings of the Annual Symposium on Theory of Computing (STOC), pages 747–756, 1993.

[Sch96] Leonard J. Schulman. Coding for interactive communication. IEEE Transactions on Information Theory, 42(6):1745–1756, 1996.

[Sin64] Richard Singleton. Maximum distance q-nary codes. IEEE Transactions on Information Theory, 10(2):116–118, 1964.

[Slo02] Neil J. A. Sloane. On single-deletion-correcting codes. Codes and Designs, 10:273–291, 2002.

[SW19] Alexander A. Sherstov and Pei Wu. Optimal interactive coding for insertions, deletions, and substitutions. IEEE Transactions on Information Theory, 2019.

[SZ99] Leonard J. Schulman and David Zuckerman. Asymptotically good codes correcting insertions, deletions, and transpositions. IEEE Transactions on Information Theory, 45(7):2552–2557, 1999.

[Ten84] Grigory Tenengolts. Nonbinary codes, correcting single deletion or insertion (corresp.). IEEE Transactions on Information Theory, 30(5):766–769, 1984.

[WZ17] Antonia Wachter-Zeh. List decoding of insertions and deletions. IEEE Transactions on Information Theory, 2017.
