Vladimir V. Ufimtsev Adviser: Dr. V. Rykov

Post on 21-Dec-2015


Page 1: Vladimir V. Ufimtsev Adviser: Dr. V. Rykov A Mathematical Theory of Communication C.E. Shannon Main result: Entropy function - average value of information

Vladimir V. Ufimtsev

Adviser: Dr. V. Rykov

Page 2:

A Mathematical Theory of Communication C.E. Shannon

Main result: Entropy function - average value of information obtained from a channel.

Error Detecting and Error Correcting Codes R.W. Hamming

Main result: Matrices that can be used to encode messages and provide more reliable transmission across a channel.

A Structure for Deoxyribose Nucleic Acid J. D. Watson, F. H. C. Crick, M. H. F. Wilkins, R. E. Franklin

Main result: Structure found for the building block of life.

There’s Plenty of Room at the Bottom R.P. Feynman

Main result: Anticipated science at the nanoscale ($10^{-9}$ meters).

Page 3:

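Let $A_q = \{0, 1, \ldots, q-1\}$.

Let $F_q^n$ denote the set of all vectors (codewords) $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ of length $n$ built over $A_q$, i.e. $x_i \in A_q$ and $\mathbf{x} \in F_q^n$.

Let $d : F_q^n \times F_q^n \to \{0, 1, 2, \ldots\}$ be such that, for all $\mathbf{x}, \mathbf{y}, \mathbf{z} \in F_q^n$:

1) $d(\mathbf{x}, \mathbf{y}) = 0 \iff \mathbf{x} = \mathbf{y}$

2) $d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x})$

3) $d(\mathbf{x}, \mathbf{y}) \le d(\mathbf{x}, \mathbf{z}) + d(\mathbf{z}, \mathbf{y})$

Let $C_q(n, M, d) \subseteq F_q^n$ be such that $d(\mathbf{x}, \mathbf{y}) \ge d$ for all distinct $\mathbf{x}, \mathbf{y} \in C_q(n, M, d)$, and $|C_q(n, M, d)| = M$.

$C_q(n, M, d)$ is referred to as a code of length $n$, size $M$, and minimum distance $d$.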
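As a minimal sketch of these definitions, the Python below computes the triple $(n, M, d)$ for a small code. The Hamming distance is used only as a familiar example of a function satisfying properties 1) to 3); the names `hamming` and `code_parameters` and the example codewords are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def hamming(x, y):
    """Number of positions in which the equal-length sequences x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def code_parameters(code, dist=hamming):
    """Return (n, M, d) for a list of at least two equal-length codewords."""
    n = len(code[0])                       # length of every codeword
    M = len(code)                          # size of the code
    d = min(dist(x, y) for x, y in combinations(code, 2))  # minimum distance
    return n, M, d

# Hypothetical binary code (q = 2) chosen for illustration.
code = ["00000", "11100", "00111", "11011"]
print(code_parameters(code))
```

Any other function that satisfies the three properties can be passed as `dist` without changing the rest of the sketch.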

Page 4:

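A sphere in $F_q^n$ centered at $\mathbf{x}$ having radius $d$:

$Sph_q(\mathbf{x}, n, d) = \{\mathbf{y} : \mathbf{y} \in F_q^n,\ d(\mathbf{x}, \mathbf{y}) \le d\}$

Volume of the sphere around $\mathbf{x}$, of radius $d$:

$V_q(\mathbf{x}, n, d) = |Sph_q(\mathbf{x}, n, d)| = \sum_{i=0}^{d} |\{\mathbf{y} : \mathbf{y} \in F_q^n,\ d(\mathbf{x}, \mathbf{y}) = i\}|$

A space is HOMOGENEOUS when the volume of a sphere does not depend on where it is centered, i.e.

$(\forall \mathbf{x}, \mathbf{y} \in F_q^n)(\forall d,\ 0 \le d \le n)\ (V_q(\mathbf{x}, n, d) = V_q(\mathbf{y}, n, d))$

A space is NON-HOMOGENEOUS when the volume of a sphere does depend on where it is centered.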
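The distinction can be checked by brute force on a very small space. The sketch below is an illustration under stated assumptions: it compares the Hamming distance (a standard homogeneous example, not defined in these slides) with the insertion-deletion distance $2(n - L(\mathbf{x}, \mathbf{y}))$ that the slides introduce later. All function names are hypothetical.

```python
from itertools import product

def hamming(x, y):
    """Number of positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def lcs_length(x, y):
    """Length of a longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def indel(x, y):
    """Insertion-deletion distance for sequences of equal length n."""
    return 2 * (len(x) - lcs_length(x, y))

def sphere_volume(x, dist, q, radius):
    """Count the vectors y over {0, ..., q-1} with dist(x, y) <= radius."""
    return sum(1 for y in product(range(q), repeat=len(x)) if dist(x, y) <= radius)

def is_homogeneous(dist, q, n, radius):
    """True if every sphere of the given radius has the same volume."""
    volumes = {sphere_volume(x, dist, q, radius) for x in product(range(q), repeat=n)}
    return len(volumes) == 1

print(is_homogeneous(hamming, q=2, n=3, radius=1))
print(sphere_volume((0, 0, 0), indel, q=2, radius=2),
      sphere_volume((0, 1, 0), indel, q=2, radius=2))
```

Enumerating all $q^n$ centers is only feasible for tiny $n$; the point is to show that the insertion-deletion spheres around $(0,0,0)$ and $(0,1,0)$ have different volumes.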

Page 5:

For any code there are 3 conflicting parameters:

Length: n

Size: M

Minimum distance: d

The aim of coding theory is:

Given any 2 parameters, find the optimal value for the 3rd. We need small n for fast transmission, large M so that as much information as possible can be encoded, and large d so that we can detect and correct many errors.

Page 6:

Exact formulas for sphere volumes and code sizes are often extremely difficult to obtain. In most cases only upper and lower bounds can be obtained for these parameters.

We will be working in a NON-HOMOGENEOUS space, which makes exact formulas for sphere volumes and code sizes VERY HARD to obtain.

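Hamming Upper Bound on Code Size in $F_q^n$ with any metric:

$|C_q(n, M, d)| \cdot V_q^{\min}\left(n, \left\lfloor \frac{d-1}{2} \right\rfloor\right) \le q^n$

$|C_q(n, M, d)| \le \frac{q^n}{V_q^{\min}\left(n, \left\lfloor \frac{d-1}{2} \right\rfloor\right)}$

Varshamov-Gilbert Lower Bound on Code Size in $F_q^n$ with any metric:

$q^n \le |C_q(n, M, d)|_{\max} \cdot V_q^{\max}(n, d-1)$

$\frac{q^n}{V_q^{\max}(n, d-1)} \le |C_q(n, M, d)|_{\max}$

Here $V_q^{\min}(n, r)$ and $V_q^{\max}(n, r)$ denote the smallest and largest sphere volume of radius $r$ over all centers in $F_q^n$.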
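As a worked illustration of both bounds, consider the binary Hamming space with $q = 2$, $n = 7$, $d = 3$. This example is an assumption chosen for familiarity and is not taken from the slides. The space is homogeneous, so the minimum and maximum sphere volumes coincide and the center can be dropped from the notation:

```latex
\begin{aligned}
V_2(7, 1) &= \binom{7}{0} + \binom{7}{1} = 8, \\
V_2(7, 2) &= \binom{7}{0} + \binom{7}{1} + \binom{7}{2} = 29, \\
|C_2(7, M, 3)| &\le \frac{2^7}{V_2(7, 1)} = \frac{128}{8} = 16, \\
|C_2(7, M, 3)|_{\max} &\ge \frac{2^7}{V_2(7, 2)} = \frac{128}{29} \approx 4.41 .
\end{aligned}
```

Since a code size is an integer, the lower bound guarantees a code with at least 5 codewords. The binary [7,4] Hamming code has 16 codewords and minimum distance 3, so here the upper bound is attained.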

Page 7:

CLIQUES:

Let $G$ be a simple graph on $q^n$ vertices and $e$ edges. $G$ contains an $M$-clique if:

$e > \left(1 - \frac{1}{M-1}\right) \frac{q^{2n}}{2}$

Page 8:

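Take $G$ to be the graph whose vertices are the vectors of $F_q^n$, with $\mathbf{x}$ and $\mathbf{y}$ joined by an edge whenever $d(\mathbf{x}, \mathbf{y}) \ge d$, so that an $M$-clique in $G$ is a code of size $M$ with minimum distance $d$. Its number of edges is:

$e = \frac{1}{2} \sum_{\mathbf{x} \in F_q^n} \left(q^n - V_q(\mathbf{x}, n, d-1)\right) = \frac{1}{2} \left(q^{2n} - \sum_{\mathbf{x} \in F_q^n} V_q(\mathbf{x}, n, d-1)\right) = \frac{q^n}{2} \left(q^n - V_q^{avg}(n, d-1)\right)$

where $V_q^{avg}(n, d-1) = q^{-n} \sum_{\mathbf{x} \in F_q^n} V_q(\mathbf{x}, n, d-1)$ is the average sphere volume. The clique condition

$\frac{q^n}{2} \left(q^n - V_q^{avg}(n, d-1)\right) > \frac{q^{2n}}{2} \left(1 - \frac{1}{M-1}\right)$

is therefore equivalent to a condition on $M$.

If:

$M - 1 < \frac{q^n}{V_q^{avg}(n, d-1)}$

Then there exists a code of size M:

$M \le |C_q(n, d)|_{\max}$

where $|C_q(n, d)|_{\max}$ is the largest size of a code of length $n$ and minimum distance $d$.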

Page 9:

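Let

$M = \left\lceil \frac{q^n}{V_q^{avg}(n, d-1)} \right\rceil$

Then:

$\frac{q^n}{V_q^{avg}(n, d-1)} \le M < \frac{q^n}{V_q^{avg}(n, d-1)} + 1$

Hence there exists a code of size M and so:

$\frac{q^n}{V_q^{avg}(n, d-1)} \le |C_q(n, d)|_{\max}$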
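The average-volume bound can be evaluated by brute force when the space is tiny. The sketch below assumes the insertion-deletion distance $2(n - L(\mathbf{x}, \mathbf{y}))$ from the later slides as the metric and uses hypothetical function names. It also counts the edges of the graph that joins $\mathbf{x}$ and $\mathbf{y}$ whenever $d(\mathbf{x}, \mathbf{y}) \ge d$, and checks the count against $\frac{q^n}{2}(q^n - V_q^{avg}(n, d-1))$.

```python
from fractions import Fraction
from itertools import combinations, product
from math import ceil

def lcs_length(x, y):
    """Length of a longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def indel(x, y):
    """Insertion-deletion distance for sequences of equal length n."""
    return 2 * (len(x) - lcs_length(x, y))

def average_volume_bound(dist, q, n, d):
    """Return (average sphere volume of radius d-1, edge count, q^n / average volume)."""
    space = list(product(range(q), repeat=n))
    # V_q(x, n, d-1) for every center x.
    volumes = [sum(1 for y in space if dist(x, y) <= d - 1) for x in space]
    v_avg = Fraction(sum(volumes), len(space))
    # Edges of the graph joining x and y whenever dist(x, y) >= d.
    edges = sum(1 for x, y in combinations(space, 2) if dist(x, y) >= d)
    return v_avg, edges, Fraction(len(space)) / v_avg

v_avg, edges, bound = average_volume_bound(indel, q=2, n=3, d=4)
assert edges == Fraction(2**3, 2) * (2**3 - v_avg)  # edge-count identity
print(v_avg, edges, ceil(bound))
```

Exact rational arithmetic via `Fraction` avoids rounding when the bound is close to an integer. For realistic lengths the enumeration is infeasible, which is why the slides turn to bounds on the average volume instead.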

Page 10:

The rules of base pairing (nucleotide pairing):

• A - T: adenine (A) always pairs with thymine (T)

• C - G: cytosine (C) always pairs with guanine (G)

Page 11:

• Each base has a bonding surface

• Bonding surface of A is complementary to that of T (2 bonds)

• Bonding surface of G is complementary to that of C (3 bonds)

• Hybridization is a process that joins two complementary opposite polarity single strands into a double strand through hydrogen bonds.
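The pairing rules and the opposite-polarity requirement can be sketched in a few lines of Python. This is an illustration only: it assumes both strands are written in the same direction (the usual 5' to 3' convention, which the slides do not state), and the function names are hypothetical.

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand):
    """Strand that pairs base-by-base with `strand` when read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

def hydrogen_bonds(strand):
    """Hydrogen bonds in a perfect duplex: 2 per A-T pair, 3 per C-G pair."""
    return sum(2 if base in "AT" else 3 for base in strand)

def hybridizes_perfectly(s, t):
    """True if t is exactly the reverse complement of s."""
    return t == reverse_complement(s)

print(reverse_complement("ATCTGAT"))
print(hydrogen_bonds("ATCTGAT"))
```

Reversing before complementing is what encodes "opposite polarity": complementing alone would pair the strands in the same orientation.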

Page 12:

Orientation of single DNA strands is important for hybridization.

Page 13:

[Figure panels: Direct, Shifted, Folded, Loop]

Page 14:

Interest in DNA computing was sparked in 1994 by Len Adleman.

Adleman showed how DNA molecules can be used to solve a mathematical problem (the Hamiltonian path problem).

DNA computing relies on the fact that DNA strands can be represented as sequences of bases (4-ary sequences) and on the property of hybridization.

Errors can occur in hybridization. Thus, error-correcting codes are required for efficient synthesis of DNA strands to be used in computing.

Page 15:

Sequence $\mathbf{z} = (z_1, z_2, \ldots, z_k)$ is a subsequence of $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ if and only if there exists a strictly increasing sequence of indices $(i_1, i_2, \ldots, i_k)$ such that $z_j = x_{i_j}$ for all $j$.

$LCS(\mathbf{x}, \mathbf{y})$ is defined to be the set of longest common subsequences of $\mathbf{x}$ and $\mathbf{y}$.

$L(\mathbf{x}, \mathbf{y})$ is defined to be the length of the longest common subsequence of $\mathbf{x}$ and $\mathbf{y}$.

Page 16:

• X = ( A T C T G A T )

Z = ( T C G T ) - subsequence of X

• X = ( A T C T G A T )

Y = ( T G C A T A )

( T C A T ) ∈ LCS(X, Y)

L(X, Y) = 4
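The example can be reproduced with the standard dynamic-programming computation of the longest common subsequence. This is a minimal sketch with hypothetical function names; the slides do not say how $L(X, Y)$ was computed.

```python
def is_subsequence(z, x):
    """True if z can be obtained from x by deleting elements."""
    remaining = iter(x)
    return all(symbol in remaining for symbol in z)

def lcs_table(x, y):
    """table[i][j] = L(x[:i], y[:j]), filled by dynamic programming."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            if a == b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table

def one_lcs(x, y):
    """Backtrack through the table to recover one element of LCS(x, y)."""
    table = lcs_table(x, y)
    i, j, out = len(x), len(y), []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

X, Y = "ATCTGAT", "TGCATA"
print(is_subsequence("TCGT", X))   # Z from the first example
print(lcs_table(X, Y)[-1][-1])     # L(X, Y)
print(one_lcs(X, Y))               # one member of LCS(X, Y)
```

LCS(X, Y) is a set, so the backtracking step returns only one of possibly several longest common subsequences; both ( T C A T ) and ( T G A T ) have length 4 here.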

Page 17:

Original Insertion-Deletion metric (Levenshtein 1966):

For $\mathbf{x}, \mathbf{y} \in F_q^n$:

$d(\mathbf{x}, \mathbf{y}) = (n - L(\mathbf{x}, \mathbf{y})) + (n - L(\mathbf{x}, \mathbf{y})) = 2(n - L(\mathbf{x}, \mathbf{y}))$

This metric results from the number of deletions and insertions that need to be made to obtain ‘ y ’ from ‘ x ’.

For vectors that have the same length $n$, the number of deletions that will be made is $n - L(\mathbf{x}, \mathbf{y})$; likewise, the number of insertions that will be made is $n - L(\mathbf{x}, \mathbf{y})$.
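A short sketch of this distance, together with a brute-force check of the three metric properties on a tiny space, is given below. The function names are hypothetical. The version for sequences of unequal length (deletions plus insertions counted separately) is the standard generalization and is an assumption beyond the equal-length case stated on the slide.

```python
from itertools import product

def lcs_length(x, y):
    """Length of a longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def indel_distance(x, y):
    """Deletions plus insertions needed to turn x into y."""
    common = lcs_length(x, y)
    deletions = len(x) - common    # symbols of x outside the common subsequence
    insertions = len(y) - common   # symbols of y outside the common subsequence
    return deletions + insertions

def is_metric(dist, space):
    """Exhaustively check identity, symmetry and the triangle inequality."""
    for x, y in product(space, repeat=2):
        if (dist(x, y) == 0) != (x == y) or dist(x, y) != dist(y, x):
            return False
    return all(dist(x, y) <= dist(x, z) + dist(z, y)
               for x, y, z in product(space, repeat=3))

space = ["".join(w) for w in product("AT", repeat=4)]
print(indel_distance("ATCTGAT", "ATCAGAT"))
print(is_metric(indel_distance, space))
```

For equal lengths the two counts coincide, which gives the factor 2 in the formula.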

Page 18:

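A common subsequence $\mathbf{z} = (z_1, z_2, \ldots, z_m)$, $2 \le m \le n$, is called a common stacked pair subsequence of length $|\mathbf{z}| = m$ between $\mathbf{x}$ and $\mathbf{y}$ if any two elements $z_i, z_{i+1}$, $i = 1, 2, \ldots, m-1$, are consecutive in $\mathbf{x}$ and consecutive in $\mathbf{y}$ or, if they are non-consecutive in $\mathbf{x}$ or non-consecutive in $\mathbf{y}$, then $z_{i-1}, z_i$ and $z_{i+1}, z_{i+2}$ are consecutive in $\mathbf{x}$ and $\mathbf{y}$.

Let $S(\mathbf{x}, \mathbf{y})$, $0 \le S(\mathbf{x}, \mathbf{y}) \le n$, denote the length $|\mathbf{z}|$ of the longest sequence occurring as a common stacked pair subsequence $\mathbf{z}$ between sequences $\mathbf{x}$ and $\mathbf{y}$. The number $S(\mathbf{x}, \mathbf{y})$ is called a similarity of blocks between $\mathbf{x}$ and $\mathbf{y}$. The metric is defined to be

$d(\mathbf{x}, \mathbf{y}) = n - S(\mathbf{x}, \mathbf{y})$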

Page 19:

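The upper bound for the average sphere volume in this metric will be:

[Formula not recoverable from the transcript: an upper bound on $V_q^{avg}(n, d-1)$ expressed as a double sum over indices $j$ and $k$.]

The Varshamov-Gilbert bound becomes:

[Formula not recoverable from the transcript: $q^n$ divided by the same double sum, as a lower bound on $|C_q(n, d)|_{\max}$.]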

Page 20:

Thermodynamic weight of virtual stacked pairs:

       A     C     G     T
A    1.00  1.44  1.28  0.88
C    1.45  1.84  2.17  1.28
G    1.30  2.24  1.84  1.44
T    0.58  1.30  1.45  1.00

• Can use statistical estimation of sphere volume.
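One way such a statistical estimate could look is a Monte Carlo sketch based on the identity $V_q^{avg}(n, r) = q^n \cdot \Pr[d(\mathbf{x}, \mathbf{y}) \le r]$ for independent, uniformly random $\mathbf{x}, \mathbf{y} \in F_q^n$. The slides do not describe the estimator that was used, so this is an assumption. For simplicity the sketch uses the unweighted insertion-deletion distance as a stand-in for the thermodynamically weighted stacked-pair metric, and all names are hypothetical.

```python
import random

def lcs_length(x, y):
    """Length of a longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def indel_distance(x, y):
    """Insertion-deletion distance (stand-in metric for this sketch)."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

def estimate_average_volume(dist, alphabet, n, radius, samples, seed=0):
    """Estimate the average sphere volume as q^n times the sampled hit rate."""
    rng = random.Random(seed)              # fixed seed for reproducibility
    hits = 0
    for _ in range(samples):
        x = rng.choices(alphabet, k=n)     # uniformly random center
        y = rng.choices(alphabet, k=n)     # uniformly random candidate point
        hits += dist(x, y) <= radius
    return len(alphabet) ** n * hits / samples

print(estimate_average_volume(indel_distance, "ACGT", n=8, radius=4, samples=20000))
```

The estimate replaces an enumeration of $q^{2n}$ pairs by a fixed number of samples, at the cost of sampling error that grows in relative terms when spheres are very small compared with the space.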

Page 21:

• There are many possibilities for metrics on the space of DNA sequences.

• All discussed metrics are non-homogeneous i.e. the sizes of the spheres in the metric spaces depend on the location of their centers.

• A universal method that will allow us to calculate lower bounds for optimal code sizes was given.

Page 22:

Length (n) Min. size

15 8

16 15

17 28

18 53

19 107

20 223

21 479

22 1055

23 2386

24 5524

25 13068

26 31545

27 77600

28 1943016

29 494758

30 1279652

Minimum distance (d) = 6

Page 23:

Length (n) Min. size

15 2

16 3

17 5

18 8

19 13

20 24

21 46

22 90

23 183

24 381

25 815

26 1783

27 3988

28 9102

29 21174

30 50155

Minimum distance (d) = 7

Page 24:

Length (n) Min. size

20 4

21 7

22 12

23 21

24 39

25 75

26 149

27 304

28 635

29 1354

30 2946

Minimum distance (d) = 8

Page 25:

Length (n) Min. size

20 1

21 2

22 2

23 4

24 6

25 10

26 18

27 33

28 62

29 121

30 243

Minimum distance (d) = 9

Page 26:

Length (n) Min. size

25 2

26 3

27 5

28 8

29 15

30 27

Minimum distance (d) = 10

Page 27:

Length  LCS  Min dist.  Size  V-G bound
10      8    2          -     4365
14      12   2          -     580715

12      8    4          482   25
14      10   4          2683  151
16      12   4          -     1042
18      14   4          -     7989
20      16   4          -     66413
22      18   4          -     588872
24      20   4          -     5504930

14      8    6          66    1
16      10   6          204   3
18      12   6          767   13
20      14   6          2843  65
22      16   6          -     364
24      18   6          -     2279

16      8    8          28    1
18      10   8          50    1
20      12   8          122   1
22      14   8          345   2
24      16   8          1084  7

22      12   10         45    1
24      14   10         86    1