3ftfbsdi sujdmf ).&$ )fvsjtujdmhpsjuingps*oejwjevbm...

11
Hindawi Publishing Corporation ISRN Bioinformatics Volume 2013, Article ID 291741, 10 pages http://dx.doi.org/10.1155/2013/291741 Research Article HMEC: A Heuristic Algorithm for Individual Haplotyping with Minimum Error Correction Md. Shamsuzzoha Bayzid, 1 Md. Maksudul Alam, 2 Abdullah Mueen, 3 and Md. Saidur Rahman 4 1 Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA 2 Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, USA 3 Department of Computer Science and Engineering, Univerity of California, Riverside, CA 92521, USA 4 Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh Correspondence should be addressed to Md. Shamsuzzoha Bayzid; [email protected] Received 26 November 2012; Accepted 12 December 2012 Academic Editors: A. Bolshoy and A. Torkamani Copyright © 2013 Md. Shamsuzzoha Bayzid et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Haplotype is a pattern of single nucleotide polymorphisms (SNPs) on a single chromosome. Constructing a pair of haplotypes from aligned and overlapping but intermixed and erroneous fragments of the chromosomal sequences is a nontrivial problem. Minimum error correction approach aims to minimize the number of errors to be corrected so that the pair of haplotypes can be constructed through consensus of the fragments. We give a heuristic algorithm (HMEC) that searches through alternative solutions using a gain measure and stops whenever no better solution can be achieved. Time complexity of each iteration is 3 for an SNP matrix where and are the number of fragments (number of rows) and number of SNP sites (number of columns), respectively, in an SNP matrix. Alternative gain measure is also given to reduce running time. We have compared our algorithm with other methods in terms of accuracy and running time on both simulated and real data, and our extensive experimental results indicate the superiority of our algorithm over others. 1. Introduction A single DNA molecule is a long chain of nucleotides (base pairs). ere are four such nucleotides which are represented by the set of symbols { . It is generally accepted that genomes of two humans are almost 99% identical at DNA level. However, at certain speci�c sites, variation is observed across the human population which is commonly known as “single nucleotide polymorphism” and abbreviated as “SNP” [1]. e nucleotide involved in a SNP site is called allele. If a SNP site can have only two nucleotides, it is called biallelic. If it can have more than two alleles it is called a multiallelic SNP [2]. From now on, we will consider the simplest case where only bi-allelic SNPs occur in a speci�c pair of DNA. e single nucleotide polymorphism (SNP) is believed to be the most widespread form of genetic variation [3]. e sequence of all SNPs in a given chromosome is called hap- lotype. Haplotyping an individual deals with determining a pair of haplotypes, one for each copy of a given chromosome. A chromosome is a complicated structure of a DNA molecule bound by proteins. is pair of haplotypes completely de�ne the SNP �ngerprints of an individual for a speci�c pair of chromosomes. Given the two sequences of bases, haplotyping is straight forward and just needs to iterate through both the sequences and remove all the common alleles from them. But haplotyping becomes difficult when we want to con- struct haplotypes from sequencing data for higher reliability. Sequencing data for a genome does not contain the complete sequences of bases for a speci�c chromosome, rather it provides a set of fragments of arbitrary length for the whole genome. erefore, the actual problem of haplotyping is to �nd two haplotypes from the set of overlapping fragments

Upload: others

Post on 25-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 3FTFBSDI SUJDMF ).&$ )FVSJTUJDMHPSJUINGPS*OEJWJEVBM ...downloads.hindawi.com/archive/2013/291741.pdf · *43/ #jpjogpsnbujdt ! Á '3&&@-0$,4 $-&"3@-0( sfqfbubmxbzt xijmf uifsf jt bo

Hindawi Publishing CorporationISRN BioinformaticsVolume 2013 Article ID 291741 10 pageshttpdxdoiorg1011552013291741

Research ArticleHMEC A Heuristic Algorithm for Individual Haplotyping withMinimum Error Correction

Md Shamsuzzoha Bayzid1 Md Maksudul Alam2

AbdullahMueen3 andMd Saidur Rahman4

1 Department of Computer Science University of Texas at Austin Austin TX 78712 USA2Department of Computer Science Virginia Tech Blacksburg VA 24060 USA3Department of Computer Science and Engineering Univerity of California Riverside CA 92521 USA4Department of Computer Science and Engineering Bangladesh University of Engineering and TechnologyDhaka 1000 Bangladesh

Correspondence should be addressed to Md Shamsuzzoha Bayzid shamsbayzidgmailcom

Received 26 November 2012 Accepted 12 December 2012

Academic Editors A Bolshoy and A Torkamani

Copyright copy 2013 Md Shamsuzzoha Bayzid et al is is an open access article distributed under the Creative CommonsAttribution License which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

Haplotype is a pattern of single nucleotide polymorphisms (SNPs) on a single chromosome Constructing a pair of haplotypes fromaligned and overlapping but intermixed and erroneous fragments of the chromosomal sequences is a nontrivial problemMinimumerror correction approach aims to minimize the number of errors to be corrected so that the pair of haplotypes can be constructedthrough consensus of the fragments We give a heuristic algorithm (HMEC) that searches through alternative solutions using again measure and stops whenever no better solution can be achieved Time complexity of each iteration is1198741198741198741198741198743119896119896119896 for an119874119874119898119896119896 SNPmatrix where119874119874 and 119896119896 are the number of fragments (number of rows) and number of SNP sites (number of columns) respectivelyin an SNP matrix Alternative gain measure is also given to reduce running time We have compared our algorithm with othermethods in terms of accuracy and running time on both simulated and real data and our extensive experimental results indicatethe superiority of our algorithm over others

1 Introduction

A single DNA molecule is a long chain of nucleotides (basepairs) ere are four such nucleotides which are representedby the set of symbols 119860119860119860 119860119860119860 119860119860119860 119860119860119860 It is generally accepted thatgenomes of two humans are almost 99 identical at DNAlevel However at certain specic sites variation is observedacross the human population which is commonly known asldquosingle nucleotide polymorphismrdquo and abbreviated as ldquoSNPrdquo[1] e nucleotide involved in a SNP site is called allele If aSNP site can have only two nucleotides it is called biallelic Ifit can have more than two alleles it is called amultiallelic SNP[2] From now on we will consider the simplest case whereonly bi-allelic SNPs occur in a specic pair of DNA

e single nucleotide polymorphism (SNP) is believed tobe the most widespread form of genetic variation [3] e

sequence of all SNPs in a given chromosome is called hap-lotype Haplotyping an individual deals with determining apair of haplotypes one for each copy of a given chromosomeA chromosome is a complicated structure of a DNAmoleculebound by proteins is pair of haplotypes completely denethe SNP ngerprints of an individual for a specic pair ofchromosomes Given the two sequences of bases haplotypingis straight forward and just needs to iterate through both thesequences and remove all the common alleles from themBut haplotyping becomes difficult when we want to con-struct haplotypes from sequencing data for higher reliabilitySequencing data for a genome does not contain the completesequences of bases for a specic chromosome rather itprovides a set of fragments of arbitrary length for the wholegenome erefore the actual problem of haplotyping is tond two haplotypes from the set of overlapping fragments

2 ISRN Bioinformatics

of both the chromosomes where fragments might containerrors and it is not known which copy of the chromosomea particular fragment belongs to

e problem of haplotyping has been studied extensivelye general minimum error correction (MEC) problem wasproved to be NP-hard [4] It was also proved to be NP-hardeven if the SNP matrix is gapless using a reduction fromthe MAX-CUT problem [1] A method based on geneticalgorithm has been proposed to solve this problem [5]Several heuristic methods have also been proposed to ndhaplotypes efficiently HapCUT [6] and ReFHap [7] are twoof the most accurate algorithms in this regard

In this paper we give a heuristic algorithm for indi-vidual haplotyping based on minimum error correctione complexity of each iteration is 1198741198741198741198741198743119896119896119896 for an SNPmatrix of dimension 119874119874119874119898 119896119896119896 e algorithm is inspired fromthe famous Fiduccia and Mattheyses (FM) algorithm forbipartitioning a hypergraph minimizing the cut size [8]Extensive simulations indicate that HMEC outperforms thegenetic algorithms of Wang et al [5] in terms of bothreconstruction rate and running time and it has better (inmost cases) or comparable accuracy and signicantly smallerrunning time than that of HapCUT [6] which is the mostaccurate heuristic algorithm available We also comparedHMEC with some other algorithms such as SpeedHap [9]FastHare [10] MLF [11] 2 distance MEC [12] and SHR-3 [13] using the HapMap-based instance generator andcomparison framework [14 15]

e rest of the paper is organized as follows In Section 2we present some denitions and preliminary ideas In Section3 we present our algorithm for individual haplotyping Wedescribe the data structure and complexity of our algorithmin Section 4 We report on an extensive performance studyevaluating HMEC with other available techniques in Section5 Finally we conclude in Section 6 by suggesting somefuture research directions An earlier version of this paperwasaccepted for presentation at BMEI 2008 [16]

2 Preliminaries

In this section we give some denitions and preliminaryideas

Let 119878119878 be the set of 119896119896 bi-allelic SNP sites Let 119865119865 be the set of119874119874 fragments produced from two copies of the chromosomeEach fragment contains information of nonzero number ofSNPs in 119878119878 Because the SNPs are bi-allelic let the two possiblealleles for each SNP site be 0 and 1 where they can be anytwo elements of the set 119860119860119898 119860119860119898 119860119860119898 119860119860119860 Since all the nucleotidesare the same at the sites other than SNP sites we can removethese extraneous sites from all the fragments and consider thefragments as the sequences of the SNP sites only us eachfragment 119891119891 119891 119865119865 is a string of symbols 0119898 1119898 minus119860 of length 119896119896where ldquominusrdquo denotes an undetermined SNPwhichwe call aholeAll the fragments can be arranged in an 119874119874 119898 119896119896 matrix 119872119872 119872119872119872119894119894119894119894119860 119894119894 119872 1119898119894 119898119874119874 119894119894 119872 1119898119894119896119896 where row 119894119894 is a fragment from

119865119865 and column 119894119894 is a SNP from 119878119878ismatrix is called the SNPmatrix as follows

10092161009216100923210092321009232100923210092321009232100923210092321009232100923210092321009232

10092401009240

minusminusminusminus1 1 0 1minusminusminusminusminusminusminusminusminusminusminusminusminusminusminusminusminus0 0 0 1 1 1 0 1 0 1minusminusminusminusminus1 1 0 1 0 0 1 0 0 1 1minusminusminusminusminusminusminusminusminusminusminusminus1 0 1 0 0minusminusminus0 1 0minusminusminusminusminusminusminusminusminusminusminusminusminusminusminus1 0 1 1 0 1 0 1 0 1 10 1 0 1 1minusminusminusminusminusminusminusminusminusminus0 1 0 1 1

10092161009216100923210092321009232100923210092321009232100923210092321009232100923210092321009232

10092401009240

(1)

e consecutive sequence of ldquominusrdquos that lies between twononhole symbols is called a gap A gapless SNP matrix is theone that has no gap in any of the fragments In (1) the rstsecond and third rows have no gaps while each of the fourthand sixth rows has one gap

A SNP matrix 119872119872 119872 11987211987211987211198981198721198722119898119894 119898119872119872119874119874⟩ can be viewedas an ordered set of 119874119874 fragments where a fragment 119872119872119894119894 11987211987211987211987211989411989411198981198721198721198941198942119898119894 119898119872119872119894119894119896119896⟩ is an ordered set of 119896119896 alleles A fragment119872119872119894119894 is called to cover the 119894119894th SNP if 119872119872119894119894119894119894 119891 0119898 1119860 and calledto skip the 119894119894th SNP if 119872119872119894119894119894119894 119872 minus Let 119872119872119904119904 and 119872119872119905119905 be twofragmentse distance between two fragments119863119863119874119872119872119904119904119898119872119872119905119905119896is dened as the number of SNPs that are covered by both ofthe fragments and have different alleles Hence

1198631198631007649100764911987211987211990411990411989811987211987211990511990510076651007665 119872119896119896100557610055761198941198941198721119889119889 1007650100765011987211987211990411990411989411989411989811987211987211990511990511989411989410076661007666 119898 (2)

where 119889119889119874119889119889119898 119889119889119896 is dened as

119889119889 10076491007649119889119889119898 11988911988910076651007665 119872 100768610076861119898 if119889119889119909 minus and119889119889119909 minus and1198891198891199091198891198891199090119898 otherwise

(3)

In (1) the distance between the second and the thirdfragment is two as they differ in the seventh and ninth SNPsites (columns)

Two fragments 119872119872119904119904 and 119872119872119905119905 are said to be conicting if119863119863119874119872119872119904119904119898119872119872119905119905119896 gt 0 Let 1198751198751198741198601198601119898 1198601198602119896 be a partition of 119872119872 where1198601198601 and 1198601198602 are two sets of fragments taken from 119872119872 so that1198601198601 ⋃1198601198602 119872 119872119872 and 1198601198601 ⋂1198601198602 119872 120601120601 [5] In Figure 1(b) anarbitrary partition corresponding to the SNPmatrix of Figure1(a) is shown A SNP matrix119872119872 is an error-freematrix if andonly if there exists a partition 1198751198751198741198601198601119898 1198601198602119896 of 119872119872 such that forany two fragments 119889119889119898 119889119889 119891 119860119860119894119894 119894119894 119891 1119898 2119860 119889119889 and 119889119889 are non-conicting that is 119863119863119874119889119889119898 119889119889119896 119872 0 Such a partition is calledan error-free partition e partition in the Figure 1(b) is noterror free since11986311986311987411987211987211198981198721198722119896 gt 0 in1198601198601 and11986311986311987411987211987251198981198721198726119896 gt 0 in1198601198602 For an error-free SNPmatrix a haplotype119867119867119894119894 119894119894 119891 1119898 2119860 isconstructed from its corresponding fragment class 119860119860119894119894 usingthe following formula

119867119867119894119894119894119894 1198721007618100761810076341007634100762610076261007634100763410076421007642

1119898 if at least one fragment in119860119860119894119894 has a 1 in 119894119894th SNP1199090119898 if at least one fragment in119860119860119894119894 has a 0 in 119894119894th SNP119909minus119898 if all the fragments in119860119860119894119894 skips 119894119894th SNP119909

(4)

where119860119860119894119894 is called the dening class of haplotype119867119867119894119894 and119867119867119894119894119894119894where 119894119894 119891 1119898 2119860 and 119894119894 119872 1119898119894119896119896 denotes the 119894119894th element ofthe haplotype119867119867119894119894

ISRN Bioinformatics 3

6

5

4

3

2

1

M

0 1 0 0 0 mdash mdash mdash mdash mdash

mdash mdash 0 0 1 1 1 0 0 0

mdash 1 0 0 0 0 0 1 mdash mdash

mdash mdash mdash mdash mdash 1 1 0 0 1

mdash mdash 0 0 1 1 1 0 0 0

mdash mdash mdash 1 0 0 0 0 0 1

(a)

3

1

2

4

5

6

C1 C2

P

(b)

H1 = 0 1 0 0 0 0 0 0 0 0

H2 = 0 1 0 0 0 0 0 0 0 1

(c)

F 1 SNP matrix and its partition

We now describe the general minimum error correctionproblem If a matrix 119872119872 is not error-free there will be noerror-free partition 119875119875 For such a matrix 119872119872 there will be atleast one conicting pair of fragments in each of the classes forall possible partitions erefore it is impossible to constructa haplotype that is non-conicting with all the fragments inits dening class of fragments If we are given a partition1198751198751198751198751198751 1198751198752) and two haplotypes 1198671198671 and 1198671198672 constructed from119875119875 then the number of errors 119864119864119875119875119875) that needs to be correctedcan be calculated by the following formula

119864119864 119875119875119875) =210055761005576119894119894=1

10055761005576119891119891119891119875119875119894119894

119863119863 1007649100764911989111989111986711986711989411989410076651007665 (5)

e MEC problem asks to nd a partition 119875119875 that mini-mizes the error function 119864119864119875119875119875) over all such partitions of anSNP matrix119872119872

3 A Heuristic Algorithm

In this section we give our heuristic algorithm based onminimum error correction which we call HMEC

Construction of a haplotype from an erroneous class 119875119875requires correction of SNP values that is alleles in the frag-mentsWewant tominimize the number of error correctionserefore for each SNP site the haplotype should take theallele that is present in the majority of the fragments Let1198731198730

119895119895119875119875119875) be the number of fragments of a collection119875119875 that have0 in the 119895119895th SNP Similarly 1198731198731

119895119895119875119875119875) denes the number of 1s[5] erefore to minimize the number of errors 119864119864119875119875119875) fora specic partition 119875119875 the haplotype should be constructedaccording to the following methodology

119867119867119894119894119895119895 =

100761810076181007634100763410076341007634100763410076341007626100762610076341007634100763410076341007634100763410076421007642

1 if1198731198731119895119895 1007649100764911987511987511989411989410076651007665 gt 119873119873

0119895119895 1007649100764911987511987511989411989410076651007665

0 if1198731198730119895119895 1007649100764911987511987511989411989410076651007665 ge 119873119873

1119895119895 1007649100764911987511987511989411989410076651007665 and

1198731198730119895119895 1007649100764911987511987511989411989410076651007665 ne 0

minus if1198731198731119895119895 1007649100764911987511987511989411989410076651007665 = 119873119873

0119895119895 1007649100764911987511987511989411989410076651007665 = 0

(6)

where 119894119894 119891 1198941 2119894 and 119895119895 = 1 2119895 119895119895 In Figure 1(c) twohaplotypes 1198671198671 and 1198671198672 associated with the partition 119875119875 inFigure 1(b) are constructed in this way

We will use a heuristic search to nd the best partitionis algorithm starts with a current partition 119875119875119888119888 = 119875119875119875119872119872 119875119875)and iteratively searches a better partition In each iterationthe algorithm performs a sequence of transfer of fragments

from their present collection to the other one so that thepartition becomes less erroneous e transfer of a fragmentfrom one collection to the other can increase or decreasethe error function 119864119864119875119875119875) Let the partition before transferringa fragment 119891119891 be 119875119875119901119901 and the partition resulted is 119875119875119899119899 Wedene the gain of the transfer as Gain119875119891119891) = 119864119864119875119875119875119901119901) minus 119864119864119875119875119875119899119899)Figure 2 demonstrates an example calculation of the gainmeasure Let 119865119865 = 119865119891119891119894119894⟩ 119894119894 = 1198941 2119895 119894119894119894 be an orderingof all the fragments in a partition 119875119875 in such a way thatfragment 119891119891119894119894 will precede fragment 119891119891119895119895 if all the fragmentsbefore 119891119891119894119894 in 119865119865 have already been transferred to form anintermediate partition 119875119875119894119894 and Gain119875119891119891119894119894) ge Gain119875119891119891119895119895) over 119875119875119894119894Hence 1198751198751 = 119875119875119888119888 at the start of each iteration We also denethe cumulative gain of a fragment ordering 119865119865 up to the 119899119899thfragment as 119875119875Gain119875119865119865 119899119899) = 119865119899119899

119894119894=1 Gain119875119891119891119894119894) Here Gain119875119891119891119894119894) =119864119864119875119875119875119894119894)minus1198641198641198751198751198751198941198941198941)emaximum cumulative gain119872119872119875119875Gain119875119865119865)is dened as

119872119872119875119875Gain 119875119865119865) = max1le119894119894le119894119894

119875119875Gain 119875119865119865 119894119894) (7)

In Section 4 we shall describe these terms with anexample

In each iteration the algorithm nds the current ordering119865119865119888119888 of 119875119875119888119888 and transfers only those fragments of 119865119865119888119888 that canachieve the119872119872119875119875Gain119875119865119865119888119888) and the fragment that is the last tobe transferred is referred as 119891119891max us the algorithm movesfrom one partition to another reducing the error function byan amount of119872119872119875119875Gain119875119865119865119888119888) Please see Algorithm 1 for a basicdescription of HMEC e algorithm continues as long as119872119872119875119875Gain119875119865119865119888119888) gt 0 and stops whenever119872119872119875119875Gain119875119865119865119888119888) le 0

4 Data Structures and Complexity

is section deals with the data structures and the complexityof our algorithm Here we also suggest some approximationto improve the performance of our algorithm

First to nd119865119865119888119888 in each iteration the algorithm repeatedlytransfers the fragment that is not transferred previously inthis iteration and has maximum gain over all such frag-ments To accomplish this we use a locking mechanismAt the beginning of each iteration all the fragments areset free e free fragment with maximum gain is foundout and tentatively transferred to the other collection Aerthe transfer the fragment is locked at the new collectionis tentative transfer creates the rst intermediate partition1198751198751 e algorithm then nds the next free fragment withmaximum gain in 1198751198751 and transfer and lock that fragment to

4 ISRN Bioinformatics

119875119875119888119888 = 119875119875119875119875119875119875 119875119875119875FREE_LOCKS()CLEAR_LOG()repeat always

while there is an unlocked fragment in 119875119875119888119888 dobegin

nd a free fragment 119891119891 so that Gain119875119891119891119875 is maximumtransfer 119891119891 to the other classupdate the haplotypes aer the transferLOCK(119891119891)LOG_RECORD(119891119891 Gain119875119891119891119875)

end doFREE_LOCKS()check the log and nd119875119875119872119872Gain119875119865119865119888119888119875 and 119891119891maxif 119875119875119872119872Gain119875119865119865119888119888119875 gt 0

beginset new 119875119875119888119888 by rolling back the transfers that occurredaer the transfer of 119891119891maxcalculate haplotypes of 119875119875119888119888CLEAR_LOG()continue with the the loop

end ifelse

beginterminate the algorithm and output current haplotypes

end elseend repeat

A 1 e HMEC Algorithm

create the 1198751198752 us free fragments are transferred until allthe fragments are locked and the order of the transfer (119865119865119888119888)is stored in the log table along with the cumulative gains(119872119872Gain) 119875119875119872119872Gain is the maximum 119872119872Gain and 119891119891max is thefragment corresponding to119875119875119872119872Gain in the log table

Aer nishing all such tentative transfers 119875119875119888119888 becomes anundened partition To change 119875119875119888119888 to the desired ldquocurrentrdquopartition of the next iteration the algorithm checks the log tond the119875119875119872119872Gain119875119865119865119888119888119875 and 119891119891max and rollback the transfer ofall the fragments that were transferred aer 119891119891max When therollback completes 119875119875119888119888 becomes ready for the next iteration

While tentatively transferring a free fragment the algo-rithm needs to nd the fragment with maximum gain amongthe free fragments (which are not yet transferred) isrequires calculating gains for each of them To calculatethe Gain119875119891119891119875 = 119891119891119875119875119875119901119901119875 minus 119891119891119875119875119875119899119899119875 for a fragment we needto calculate two error values of two different partitionsthe present intermediate partition and the next partitionwhich will be resulted if 119891119891 is transferred Each of theseerror function requires calculation of two new haplotypesfrom their corresponding collections (see Figure 2) Although119891119891119875119875119875119901119901119875 and the haplotypes of 119875119875119901119901 can be found from theprevious transfer calculation of 119891119891119875119875119875119899119899119875 requires constructionof haplotypes of 119875119875119899119899 Since the difference between 119875119875119901119901 and 119875119875119899119899is only one transfer we can introduce differential calculationof haplotypes 119867119867119899119899

119894119894 119894119894 119894 119894119894119875 2119894 of next partition from thehaplotypes of 119867119867119901119901

119894119894 119894119894 119894 119894119894119875 2119894 of present partition For thispurpose the algorithm stores119873119873119894

119895119895119875119872119872119901119901119894119894 119875 and119873119873

0119895119895119875119872119872

119901119901119894119894 119875 values of

the present partition Aer a transfer these values will eitherremain same or be incremented or decremented by 1 atis why it is now possible to construct 119867119867119899119899

119894119894 119894119894 119894 119894119894119875 2119894 in 119874119874119875119874119874119875time To compute 119891119891119875119875119875119899119899119875 from the haplotypes requires119874119874119875119874119874119874119874119875time us running time to compute the 119891119891119875119875119875119899119899119875 as well as tocompute Gain119875119891119891119875 is119874119874119875119874119874119874119874 119874 119874119874119875

For each intermediate partition119875119875119894119894 119894119894 = 119894119875119894 119875 119899119899 we need tocompute Gain measures for119874119874minus 119894119894 unlocked fragments to ndthe maximum one e transfer of this fragments requiresupdating of119873119873119894

119895119895119875119872119872119894119894119875 and1198731198730119895119895119875119872119872119894119894119875 119894119894 119894 119894119894119875 2119894 and 119895119895 = 119894119875 2119875119894 119875 119874119874

Hence it also needs 119874119874119875119874119874119875 time to run Finally there will be119874119874 such transfer in each iteration and maximum119874119874 rollbacksus each iteration will require119874119874119875119874119874119875119874119874119875119874119874119874119874119874119874119874119875 119874 119874119874119875 119874119874119874119874119874119875 1198741198741198741198751198741198743119874119874119875 running time

We now give an example illustrating a single iteration ofour algorithm Figure 3 demonstrates an example iteration ofHMEC We consider that the current partition 119875119875119888119888 = 119875119875119894 isthe partition given in Figure 1(b) for the SNP matrix 119875119875 ofFigure 1(a) All the intermediate partitions 119875119875119894119894 119894119894 119894 119894119894119875119894 119875 119894119894are shown sequentially and the gains of each fragment overthe intermediate partitions are shown on the right of eachpartition e free fragment with maximum gain is markedin each intermediate partition For example the fragmentwith the maximum gain in 1198751198752 is fragment 6 which has gaintwo Aer each transfer the transferred fragment is shownlocked by a circle Here the ordering 119865119865119888119888 of the fragmentsis ⟨2119875 6119875 5119875 119894119875 4119875 3⟩ which is also the order of locking of thefragmentsis order will be stored along with the119872119872Gains in

ISRN Bioinformatics 5

1

2

3

4

5

6

1

3

4

5

6

2

Pp Pn

C1 C2 C1 C2

E(Pp) = 8 E(Pn) = 6

Gain(2) = E(Pp)minus E(Pn) = 2

H1 = 0 1 0 0 0 0 0 0 0 0

H2 = 0 1 0 0 0 0 0 0 0 1

H1 = 0 1 0 0 0 0 0 1 mdash mdash

H2 = 0 1 0 0 1 0 1 0 0 1

F 2 An example of Gain calculation

1

2

3

4

5

6

Gain(1) = 0

Gain(2) = 2

Gain(3) = 2

Gain(4) = 1

Gain(5) = 1

Gain(6) = 1

1

3

2

4

5

6

1

3

6

2

4

5

Gain(1) = minus1

Gain(3) = minus2

Gain(4) = minus1

Gain(5) = 0

1

6

3

52

Gain(1) = minus1

Gain(3) = minus3

1

2

3

5

4

6

5

6

4

2

3

1

1

6

5

32

4

Gain(1) = minus1

Gain(3) = minus2

Gain(4) = minus2

Gain(5) = 1

Gain(6) = 2

Gain(1) = minus2

Gain(3) = minus3

Gain(4) = minus1

Gain(3) = minus2

Locked fragment

P1

P2

P3

P4

P5

P6

P7

4

F 3 An example iteration of HMECe current partition 119875119875119888119888 = 1198751198751 is the partition given in Figure 1(b) for the SNPmatrix119872119872 of Figure1(a) Locked fragments are indicated by circles

the log table Figure 4 demonstrates the resulting log table ofthe illustrated iteration All the tentative transfers aer 119891119891maxhave to be rolled back so that the 1198751198753 becomes the next 119875119875119888119888

We now give an approximate gain measure to make ouralgorithm faster For large SNPmatrix1198741198741198741198741198743119896119896119896 running timeis critical to the performance of the algorithmWe can use anapproximation in the calculation of theGain119874119891119891119896 by using onlythe fragment 119891119891 and not using the119874119874119898 1 other fragments eapproximate gain should be

AppxGain 1007649100764911989111989110076651007665 = 119863119863 10076501007650119867119867119901119901119894119894 11989111989110076661007666 119898 119863119863 10076501007650119867119867119899119899

119895119895 11989111989110076661007666 (8)

Here 119867119867119901119901119894119894 is the haplotype of 119891119891rsquos present collection 119862119862119901119901

119894119894 ofpartition119875119875119901119901 and119867119867

119899119899119895119895 is the haplotype of119891119891rsquos next collection119862119862

119899119899119895119895

of partition 119875119875119899119899 is function ignores the effect of fragmentsother than 119891119891 on Gain119874119891119891119896 but reduces the run time of gaincalculation to 119874119874119874119896119896119896 erefore the total run time of eachiteration will be1198741198741198741198741198742119896119896119896

5 Performance Comparison

In this section we demonstrate the performance of ouralgorithm using both real biological and simulated datasetsWe compared our algorithm with GMEC [5] and HapCUT[6] We performed the simulation using the data fromangiotensin-converting enzyme (ACE) [17] and public Dalyset [18] to compare with GMEC [5] and used the HuRef data[19] to compare with HapCUT We also compared HMEC

6 ISRN Bioinformatics

MCGain

2

4

4

3

2

0

CGain

2

6

5

4

1

3

Log table

Fc

F 4 e log table corresponding to the iteration illustrated inFigure 3

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 5e comparison of HMEC and GMEC on ACE From topto bottom coverage = 75 coverage = 50 coverage = 25

10 20 30 40 50 60 70 8060

65

70

75

80

85

90

95

100

Coverage

Rec

on

stru

ctio

n r

ate

Heuristic data

Genetic data

HMEC

GMEC

F 6 Reconstruction rate versus coverage value

with some other algorithms (SpeedHap [9] FastHare [10]MLF [11] 2 distance MEC [12] SHR-3 [13]) using ReHapwebsite interface [15] which is an HapMap-based instancegenerator and comparison framework [14]

51 Comparison with GMEC In this section we comparethe performance of our algorithmwith the genetic algorithmwhich we call GMEC described in [5]

We rst sample the original haplotype pair into manyfragments with different coverage and error rates Eachfragment works as a distinct sample of the same specimenHere coverage rate indicates the percentage of the totalcolumns of the SNP matrix that have been sampled out eremaining slots are gaps We then introduce some specicamount of error into these samples e simulation wascontrolled in several ways We varied the error rate whilenumber of fragments and coverage rate were kept constantAlso coverage was varied while number of fragments anderror rate were kept constant

Notice that the way we introduced error and controlledthe coverage rate is not necessarily same as that of [5]erefore the reconstruction rates of the branch and boundalgorithm described in [5] which is an exact algorithmshould not be compared with those of HMEC

511 Experiment on Angiotensin-Converting Enzyme (ACE)Angiotensin-converting enzyme catalyses the conversion ofangiotensin I to the physiologically active peptide angiotensinII which controls uid-electrolyte balance and systematicblood pressure Because it has a key function in the renin-angiotensin system many association studies have been per-formed with DCP1 (encode angiotensin-converting anzyme)[20] Rieder et al completed the genomic sequencing of theDCP1 gene from 11 individuals and reported 78 SNP sites in22 chromosomes [17]

We take six pairs of haplotypes to perform the simulationWe generate 50 fragments from each of these haplotypepairs with varying coverage and error rate We perform thesimulation for three different coverage rates (25 50 and

ISRN Bioinformatics 7

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 7 e comparison of HMEC and GMEC on Daly set Fromtop to bottom coverage = 75 coverage = 50 coverage = 25

75) For each of these coverage rates we perform oursimulation for different error rates such as 5 10 1520 25 30 and 50 In every case we compare ouralgorithm with GMEC Figure 5 illustrates the comparisonthat bears the clear testimony to the superiority of our algo-rithm For most instances the reconstruction rate achievedby our algorithm is 100 or greater than 98 Only for afew cases with very high error rate and low coverage valuethe reconstruction rate falls below 95 We also performthe experiment for different coverage rates while keepingthe error rate constant Figure 6 illustrates the performanceof these two algorithms for various coverage values Herealso our algorithm clearly outperforms the genetic algorithmexcept for very low coverage value (which is unrealistic)

T 1 Running time of HMEC andGMEC on simulated datasets

Length of haplotype HMEC GMEC(sec) (sec)

78 0002 6172156 0016 11437312 0015 22500624 0031 45328780 0031 60766936 0047 72343

512 Simulation on Data from Chromosome 5q31 In thissection we discuss our simulation results on the data frompublic Daly set Daly et al reported a high-resolution analysisof a haplotype structure across 500 kb on chromosome 5q31using 103 SNPs in a European derived population whichconsists of 129 trios [18 20]

We performed the experiment exactly in the same waythat we did for angiotensin-converting enzyme Figure 7demonstrates the results Experimental results suggest thatHMEC is much better than the genetic algorithm Again formost cases the reconstruction rate achieved by our algorithmis 100 or greater than 98 For every instance our algo-rithm exhibits better performance than that of GMEC

513 Experiment on Simulated Data We used simulateddata for further evaluation of HMEC One of the veryimportant advantages of our algorithm is that it takes veryshort time to reconstruct the haplotypes Our algorithm ismuch faster than the GMEC Table 1 shows the running timeof HMEC and GMEC Here we perform the simulation byvarying the length (length denotes the number of SNP sites inthe haplotype pair) of the haplotypes while xed the value ofthe coverage rate and the error rate at 50 Since haplotypeswith such varying lengths are not available we rely on thesimulated data Clearly HMEC is much faster than GMECFor example while HMEC can reconstruct a haplotype with936 sites in a fraction of a second GMEC takes 72 seconds

52 Comparison with HapCUT HapCUT [6] is one of themost accurate heuristic algorithms for individual haplotyp-ing HapCUT uses a random initial haplotype congurationand builds a graph It computes max-cut on the graph tond the position to ip and iterates until no improvementin MEC score is achieved We have performed extensiveexperiments to compare the performance ofHMECandHap-CUT Experimental results suggest that although HapCUT isreliable its running time is too large to be a realistic choicefor whole genome haplotyping Our algorithm computes thehaplotypes signicantly faster than HapCUT without losingaccuracy

We used the ltered HuRef data from Levy et al [19]to evaluate the performance We generated several test datasets varying the coverage and error rate and tested theperformance of HMEC and HapCUT on these data sets eresults are shown in Tables 2 3 4 and 5

8 ISRN Bioinformatics

T 2 Reconstruction rate and running time of HapCUT and HMEC with 5 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0986 40582 0859 011915 0913 38282 0889 012520 0875 46324 0811 012925 0902 47101 0827 013030 0840 39196 0802 012540 0694 44792 0665 0129

T 3 Reconstruction rate and running time of HapCUT and HMEC with 10 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 76081 0966 014115 0954 90206 0948 015420 0953 81169 0957 014425 0952 74924 0923 014230 0951 81828 0893 017740 0934 69748 0814 0169

T 4 Reconstruction rate and running time of HapCUT and HMEC with 25 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 184642 1000 015615 0988 171698 1000 016720 0988 134406 0998 016325 0988 133966 0998 015530 0987 136572 0996 017240 0985 171334 0960 0161

T 5 Reconstruction rate and running time of HapCUT and HMEC with 35 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 181025 1000 015815 0988 132434 1000 012720 0988 231989 0999 010525 0987 210040 0998 010130 0988 217292 0996 010240 0987 215533 0995 0100

T 6 Reconstruction rate and running time of HapCUT and HMEC for ReHap samples

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

02 0870 20343 0985 003503 0865 21109 0970 003004 0840 20106 0945 003505 0790 19549 0865 0050

ISRN Bioinformatics 9

T 7 Comparison of different haplotyping techniques using ReHap Min Len = 3 Max Len = 7 coverage = 8 Highest reconstruction ratefor each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 1 091 1 1 065 1 0903 0935 07 0945 061 0685 0965 08204 0605 078 078 068 055 0535 08105 0535 048 0485 0505 0526 0525 058

T 8 Comparison of different haplotyping techniques using ReHap Min Len = 10 Max Len = 30 coverage = 8 Highest reconstructionrate for each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 098 0975 0985 0985 084 0985 08703 0985 0985 097 097 074 097 086504 0895 079 0885 0885 0675 087 08405 074 047 076 076 051 0705 079

Experimental results indicate that the reconstructionrates of both HapCUT and HMEC are reasonably goodFor very low coverage reconstruction rate of HapCUT isslightly better than HMEC However as the coverage rateincreases HMEC begins to outperform HapCUT Noticethat HapCUT is only better than HMEC for very low (andthus unrealistic) coverage values (5 and 10 coverage)However for higher coverage HMEC consistently performsbetter than HapCUT Furthermore HapCUT is signicantlyslower than HMEC For an instance with 35 coverage and40 error rate HapCUT takes 2155 seconds where HMECtakes only a fraction of a second (see Table 5) ereforealthough generally HapCUT provides reliable reconstructionrate on large dataset it is an unrealistic choice due to its timeconsuming operations On the other hand HMEC providesthe high accuracy with much less amount of time

We also used some sample data from ReHap project [1415]We created the allelematrix from the ReHap errormatrixand fed the matrix to both HMEC and HapCUT algorithmsHMEC consistently produces higher reconstruction rate thanHapCUT (see Table 6) Also the running time of HMEC isclearly much better than HapCUT

53 Comparison Using ReHap We also compared HMECwith some other well-known algorithms such as SpeedHap[9] FastHare [10] MLF [11] 2 distance MEC [12] andSHR-3 [13] using HapMap-based instance generator andcomparison framework [14 15] e results are shown inTable 7 and Table 8 that suggest that no single method clearlyoutperforms the others in all cases However reconstructionrates achieved by HMEC are the highest or very close to thehighest is bears a clear testimony to its suitability as apractical tool for individual haplotyping

6 Conclusion

In this paper we present a heuristic algorithm (HMEC) basedon minimum error correction that computes highly accuratehaplotypes signicantly faster than the known algorithmsfor haplotyping e algorithm is inspired from the famous

Fiduccia and Mattheyses (FM) algorithm for bipartitioninga hyper graph minimizing the cut size [8] We report on anextensive performance study evaluating our approach withother available techniques using both real and simulateddatasets Comprehensive performance study shows that ouralgorithm outperforms (in most cases) or matches the accu-racy of other well-known methods but runs in a fraction ofthe time needed for other techniques High accuracy and veryfast running time make our technique suitable for genome-wide scale data

e accuracy of the algorithm can be improved by incor-porating some prior knowledge For example small groupsof fragments that are declared to be in the same haplotypecan be identied Probabilistic methods like expectationmaximization (EM) also deserve some consideration oversuch optimization problems In the near future we intend toconsider these issues to make further improvements

Acknowledgments

e authors thank Rui-Sheng Wang (one of the authors of[5]) for providing them with the ACE and chromosome 5q31datasets

References

[1] R Cilibrasi L V Iersel S Kelk and J Tromp ldquoOn thecomplexity of several haplotyping problemsrdquo in Proceedings ofthe 5th International Workshop on Algorithms in Bioinformaticsvol 3692 of Lecture Notes in Computer Science pp 128ndash139Springer 2005

[2] P Bonizzoni G Della Vedova R Dondi and J Li ldquoehaplotyping problem an overview of computational modelsand solutionsrdquo Journal of Computer Science and Technology vol18 no 6 pp 675ndash688 2003

[3] A Chakravarti ldquoItrsquos raining SNPs hallelujahrdquoNature Geneticsvol 19 no 3 pp 216ndash217 1998

[4] R Lippert R Schwartz G Lancia and S Istrail ldquoAlgorithmicstrategies for the single nucleotide polymorphism haplotypeassembly problemrdquo Briengs in Bioinformatics vol 3 no 1 pp23ndash31 2002

10 ISRN Bioinformatics

[5] R S Wang L Y Wu Z P Li and X S Zhang ldquoHaplotypereconstruction from SNP fragments by minimum error correc-tionrdquo Bioinformatics vol 21 no 10 pp 2456ndash2462 2005

[6] V Bansal and V Bafna ldquoHapCUT an efficient and accuratealgorithm for the haplotype assembly problemrdquo Bioinformaticsvol 24 no 16 pp i153ndashi159 2008

[7] J Duitama T Huebsch G McEwen E K Suk and MR Hoehe ldquoReFHap a reliable and fast algorithm for singleindividual haplotypingrdquo in Proceedings of the ACM Interna-tional Conference on Bioinformatics and Computational Biology(ACM-BCB rsquo10) pp 160ndash169 August 2010

[8] C M Fiduccia and R M Mattheyses ldquoA linear time heuristicsfor improving network partitionsrdquo in Proceedings of the 19thDesign Automation Conference pp 175ndash181 1982

[9] L M Genovese F Geraci and M Pellegrini ldquoSpeedHap anaccurate heuristic for the single individual SNP haplotypingproblem with many gaps high reading error rate and lowcoveragerdquo IEEEACM Transactions on Computational Biologyand Bioinformatics vol 5 no 4 pp 492ndash502 2008

[10] A Panconesi and M Sozio ldquoFast hare a fast heuristic forsingle individual SNP haplotype reconstructionrdquo Workshop onAlgorithms in Bioinformatics vol 3240 pp 266ndash277 2004

[11] Y Y Zhao L Y Wu J H Zhang R S Wang and X S ZhangldquoHaplotype assembly from aligned weighted SNP fragmentsrdquoComputational Biology and Chemistry vol 29 pp 281ndash2872005

[12] YWang E Feng andRWang ldquoA clustering algorithmbased ontwo distance functions forMECmodelrdquoComputational Biologyand Chemistry vol 31 no 2 pp 148ndash150 2007

[13] Z Chen B Fu R Schweller B Yang Z Zhao and B ZhuldquoLinear time probabilistic algorithms for the singular haplotypereconstruction problem from SNP fragmentsrdquo Journal of Com-putational Biology vol 15 no 5 pp 535ndash546 2008

[14] F Geraci ldquoA comparison of several algorithms for the singleindividual SNP haplotyping reconstruction problemrdquo Bioinfor-matics vol 26 no 18 Article ID btq411 pp 2217ndash2225 2010

[15] ReHap httpbioalgoiitcnritrehap[16] A Al Mueen M S Bayzid M M Alam and M S Rahman ldquoA

heuristic algorithm for individual haplotyping with minimumerror correctionrdquo in Proceedings of the 1st International Confer-ence on BioMedical Engineering and Informatics (BMEI rsquo08) pp792ndash796 IEEE Computer Society Press May 2008

[17] M J Rieder S L Taylor A G Clark and D A NickrsonldquoSequence variation in the human angiotensin convertingenzymerdquo Nature Genetics vol 22 pp 59ndash62 1999

[18] M J Daly J D Rioux S F Schaffner T J Hudson and ES Lander ldquoHigh-resolution haplotype structure in the humangenomerdquo Nature Genetics vol 29 no 2 pp 229ndash232 2001

[19] S Levy G Sutton P C Ng L Feuk and A L Halpernldquoe deploid genome sequence of an individual humanrdquo PLOSBiology vol 5 no 10 article e254 2007

[20] X S Zhang R S Wang L Y Wu and W Zhang ldquoMinimumconict individual haplotyping fromSNP fragments and relatedgenotyperdquo Evolutionary Bioinformatics vol 2 pp 261ndash2702006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 2: 3FTFBSDI SUJDMF ).&$ )FVSJTUJDMHPSJUINGPS*OEJWJEVBM ...downloads.hindawi.com/archive/2013/291741.pdf · *43/ #jpjogpsnbujdt ! Á '3&&@-0$,4 $-&"3@-0( sfqfbubmxbzt xijmf uifsf jt bo

2 ISRN Bioinformatics

of both the chromosomes where fragments might containerrors and it is not known which copy of the chromosomea particular fragment belongs to

e problem of haplotyping has been studied extensivelye general minimum error correction (MEC) problem wasproved to be NP-hard [4] It was also proved to be NP-hardeven if the SNP matrix is gapless using a reduction fromthe MAX-CUT problem [1] A method based on geneticalgorithm has been proposed to solve this problem [5]Several heuristic methods have also been proposed to ndhaplotypes efficiently HapCUT [6] and ReFHap [7] are twoof the most accurate algorithms in this regard

In this paper we give a heuristic algorithm for indi-vidual haplotyping based on minimum error correctione complexity of each iteration is 1198741198741198741198741198743119896119896119896 for an SNPmatrix of dimension 119874119874119874119898 119896119896119896 e algorithm is inspired fromthe famous Fiduccia and Mattheyses (FM) algorithm forbipartitioning a hypergraph minimizing the cut size [8]Extensive simulations indicate that HMEC outperforms thegenetic algorithms of Wang et al [5] in terms of bothreconstruction rate and running time and it has better (inmost cases) or comparable accuracy and signicantly smallerrunning time than that of HapCUT [6] which is the mostaccurate heuristic algorithm available We also comparedHMEC with some other algorithms such as SpeedHap [9]FastHare [10] MLF [11] 2 distance MEC [12] and SHR-3 [13] using the HapMap-based instance generator andcomparison framework [14 15]

e rest of the paper is organized as follows In Section 2we present some denitions and preliminary ideas In Section3 we present our algorithm for individual haplotyping Wedescribe the data structure and complexity of our algorithmin Section 4 We report on an extensive performance studyevaluating HMEC with other available techniques in Section5 Finally we conclude in Section 6 by suggesting somefuture research directions An earlier version of this paperwasaccepted for presentation at BMEI 2008 [16]

2 Preliminaries

In this section we give some denitions and preliminaryideas

Let 119878119878 be the set of 119896119896 bi-allelic SNP sites Let 119865119865 be the set of119874119874 fragments produced from two copies of the chromosomeEach fragment contains information of nonzero number ofSNPs in 119878119878 Because the SNPs are bi-allelic let the two possiblealleles for each SNP site be 0 and 1 where they can be anytwo elements of the set 119860119860119898 119860119860119898 119860119860119898 119860119860119860 Since all the nucleotidesare the same at the sites other than SNP sites we can removethese extraneous sites from all the fragments and consider thefragments as the sequences of the SNP sites only us eachfragment 119891119891 119891 119865119865 is a string of symbols 0119898 1119898 minus119860 of length 119896119896where ldquominusrdquo denotes an undetermined SNPwhichwe call aholeAll the fragments can be arranged in an 119874119874 119898 119896119896 matrix 119872119872 119872119872119872119894119894119894119894119860 119894119894 119872 1119898119894 119898119874119874 119894119894 119872 1119898119894119896119896 where row 119894119894 is a fragment from

119865119865 and column 119894119894 is a SNP from 119878119878ismatrix is called the SNPmatrix as follows

10092161009216100923210092321009232100923210092321009232100923210092321009232100923210092321009232

10092401009240

minusminusminusminus1 1 0 1minusminusminusminusminusminusminusminusminusminusminusminusminusminusminusminusminus0 0 0 1 1 1 0 1 0 1minusminusminusminusminus1 1 0 1 0 0 1 0 0 1 1minusminusminusminusminusminusminusminusminusminusminusminus1 0 1 0 0minusminusminus0 1 0minusminusminusminusminusminusminusminusminusminusminusminusminusminusminus1 0 1 1 0 1 0 1 0 1 10 1 0 1 1minusminusminusminusminusminusminusminusminusminus0 1 0 1 1

10092161009216100923210092321009232100923210092321009232100923210092321009232100923210092321009232

10092401009240

(1)

e consecutive sequence of ldquominusrdquos that lies between twononhole symbols is called a gap A gapless SNP matrix is theone that has no gap in any of the fragments In (1) the rstsecond and third rows have no gaps while each of the fourthand sixth rows has one gap

A SNP matrix 119872119872 119872 11987211987211987211198981198721198722119898119894 119898119872119872119874119874⟩ can be viewedas an ordered set of 119874119874 fragments where a fragment 119872119872119894119894 11987211987211987211987211989411989411198981198721198721198941198942119898119894 119898119872119872119894119894119896119896⟩ is an ordered set of 119896119896 alleles A fragment119872119872119894119894 is called to cover the 119894119894th SNP if 119872119872119894119894119894119894 119891 0119898 1119860 and calledto skip the 119894119894th SNP if 119872119872119894119894119894119894 119872 minus Let 119872119872119904119904 and 119872119872119905119905 be twofragmentse distance between two fragments119863119863119874119872119872119904119904119898119872119872119905119905119896is dened as the number of SNPs that are covered by both ofthe fragments and have different alleles Hence

1198631198631007649100764911987211987211990411990411989811987211987211990511990510076651007665 119872119896119896100557610055761198941198941198721119889119889 1007650100765011987211987211990411990411989411989411989811987211987211990511990511989411989410076661007666 119898 (2)

where 119889119889119874119889119889119898 119889119889119896 is dened as

119889119889 10076491007649119889119889119898 11988911988910076651007665 119872 100768610076861119898 if119889119889119909 minus and119889119889119909 minus and1198891198891199091198891198891199090119898 otherwise

(3)

In (1) the distance between the second and the thirdfragment is two as they differ in the seventh and ninth SNPsites (columns)

Two fragments 119872119872119904119904 and 119872119872119905119905 are said to be conicting if119863119863119874119872119872119904119904119898119872119872119905119905119896 gt 0 Let 1198751198751198741198601198601119898 1198601198602119896 be a partition of 119872119872 where1198601198601 and 1198601198602 are two sets of fragments taken from 119872119872 so that1198601198601 ⋃1198601198602 119872 119872119872 and 1198601198601 ⋂1198601198602 119872 120601120601 [5] In Figure 1(b) anarbitrary partition corresponding to the SNPmatrix of Figure1(a) is shown A SNP matrix119872119872 is an error-freematrix if andonly if there exists a partition 1198751198751198741198601198601119898 1198601198602119896 of 119872119872 such that forany two fragments 119889119889119898 119889119889 119891 119860119860119894119894 119894119894 119891 1119898 2119860 119889119889 and 119889119889 are non-conicting that is 119863119863119874119889119889119898 119889119889119896 119872 0 Such a partition is calledan error-free partition e partition in the Figure 1(b) is noterror free since11986311986311987411987211987211198981198721198722119896 gt 0 in1198601198601 and11986311986311987411987211987251198981198721198726119896 gt 0 in1198601198602 For an error-free SNPmatrix a haplotype119867119867119894119894 119894119894 119891 1119898 2119860 isconstructed from its corresponding fragment class 119860119860119894119894 usingthe following formula

119867119867119894119894119894119894 1198721007618100761810076341007634100762610076261007634100763410076421007642

1119898 if at least one fragment in119860119860119894119894 has a 1 in 119894119894th SNP1199090119898 if at least one fragment in119860119860119894119894 has a 0 in 119894119894th SNP119909minus119898 if all the fragments in119860119860119894119894 skips 119894119894th SNP119909

(4)

where119860119860119894119894 is called the dening class of haplotype119867119867119894119894 and119867119867119894119894119894119894where 119894119894 119891 1119898 2119860 and 119894119894 119872 1119898119894119896119896 denotes the 119894119894th element ofthe haplotype119867119867119894119894

ISRN Bioinformatics 3

6

5

4

3

2

1

M

0 1 0 0 0 mdash mdash mdash mdash mdash

mdash mdash 0 0 1 1 1 0 0 0

mdash 1 0 0 0 0 0 1 mdash mdash

mdash mdash mdash mdash mdash 1 1 0 0 1

mdash mdash 0 0 1 1 1 0 0 0

mdash mdash mdash 1 0 0 0 0 0 1

(a)

3

1

2

4

5

6

C1 C2

P

(b)

H1 = 0 1 0 0 0 0 0 0 0 0

H2 = 0 1 0 0 0 0 0 0 0 1

(c)

F 1 SNP matrix and its partition

We now describe the general minimum error correctionproblem If a matrix 119872119872 is not error-free there will be noerror-free partition 119875119875 For such a matrix 119872119872 there will be atleast one conicting pair of fragments in each of the classes forall possible partitions erefore it is impossible to constructa haplotype that is non-conicting with all the fragments inits dening class of fragments If we are given a partition1198751198751198751198751198751 1198751198752) and two haplotypes 1198671198671 and 1198671198672 constructed from119875119875 then the number of errors 119864119864119875119875119875) that needs to be correctedcan be calculated by the following formula

119864119864 119875119875119875) =210055761005576119894119894=1

10055761005576119891119891119891119875119875119894119894

119863119863 1007649100764911989111989111986711986711989411989410076651007665 (5)

e MEC problem asks to nd a partition 119875119875 that mini-mizes the error function 119864119864119875119875119875) over all such partitions of anSNP matrix119872119872

3 A Heuristic Algorithm

In this section we give our heuristic algorithm based onminimum error correction which we call HMEC

Construction of a haplotype from an erroneous class 119875119875requires correction of SNP values that is alleles in the frag-mentsWewant tominimize the number of error correctionserefore for each SNP site the haplotype should take theallele that is present in the majority of the fragments Let1198731198730

119895119895119875119875119875) be the number of fragments of a collection119875119875 that have0 in the 119895119895th SNP Similarly 1198731198731

119895119895119875119875119875) denes the number of 1s[5] erefore to minimize the number of errors 119864119864119875119875119875) fora specic partition 119875119875 the haplotype should be constructedaccording to the following methodology

119867119867119894119894119895119895 =

100761810076181007634100763410076341007634100763410076341007626100762610076341007634100763410076341007634100763410076421007642

1 if1198731198731119895119895 1007649100764911987511987511989411989410076651007665 gt 119873119873

0119895119895 1007649100764911987511987511989411989410076651007665

0 if1198731198730119895119895 1007649100764911987511987511989411989410076651007665 ge 119873119873

1119895119895 1007649100764911987511987511989411989410076651007665 and

1198731198730119895119895 1007649100764911987511987511989411989410076651007665 ne 0

minus if1198731198731119895119895 1007649100764911987511987511989411989410076651007665 = 119873119873

0119895119895 1007649100764911987511987511989411989410076651007665 = 0

(6)

where 119894119894 119891 1198941 2119894 and 119895119895 = 1 2119895 119895119895 In Figure 1(c) twohaplotypes 1198671198671 and 1198671198672 associated with the partition 119875119875 inFigure 1(b) are constructed in this way

We will use a heuristic search to nd the best partitionis algorithm starts with a current partition 119875119875119888119888 = 119875119875119875119872119872 119875119875)and iteratively searches a better partition In each iterationthe algorithm performs a sequence of transfer of fragments

from their present collection to the other one so that thepartition becomes less erroneous e transfer of a fragmentfrom one collection to the other can increase or decreasethe error function 119864119864119875119875119875) Let the partition before transferringa fragment 119891119891 be 119875119875119901119901 and the partition resulted is 119875119875119899119899 Wedene the gain of the transfer as Gain119875119891119891) = 119864119864119875119875119875119901119901) minus 119864119864119875119875119875119899119899)Figure 2 demonstrates an example calculation of the gainmeasure Let 119865119865 = 119865119891119891119894119894⟩ 119894119894 = 1198941 2119895 119894119894119894 be an orderingof all the fragments in a partition 119875119875 in such a way thatfragment 119891119891119894119894 will precede fragment 119891119891119895119895 if all the fragmentsbefore 119891119891119894119894 in 119865119865 have already been transferred to form anintermediate partition 119875119875119894119894 and Gain119875119891119891119894119894) ge Gain119875119891119891119895119895) over 119875119875119894119894Hence 1198751198751 = 119875119875119888119888 at the start of each iteration We also denethe cumulative gain of a fragment ordering 119865119865 up to the 119899119899thfragment as 119875119875Gain119875119865119865 119899119899) = 119865119899119899

119894119894=1 Gain119875119891119891119894119894) Here Gain119875119891119891119894119894) =119864119864119875119875119875119894119894)minus1198641198641198751198751198751198941198941198941)emaximum cumulative gain119872119872119875119875Gain119875119865119865)is dened as

119872119872119875119875Gain 119875119865119865) = max1le119894119894le119894119894

119875119875Gain 119875119865119865 119894119894) (7)

In Section 4 we shall describe these terms with anexample

In each iteration the algorithm nds the current ordering119865119865119888119888 of 119875119875119888119888 and transfers only those fragments of 119865119865119888119888 that canachieve the119872119872119875119875Gain119875119865119865119888119888) and the fragment that is the last tobe transferred is referred as 119891119891max us the algorithm movesfrom one partition to another reducing the error function byan amount of119872119872119875119875Gain119875119865119865119888119888) Please see Algorithm 1 for a basicdescription of HMEC e algorithm continues as long as119872119872119875119875Gain119875119865119865119888119888) gt 0 and stops whenever119872119872119875119875Gain119875119865119865119888119888) le 0

4 Data Structures and Complexity

is section deals with the data structures and the complexityof our algorithm Here we also suggest some approximationto improve the performance of our algorithm

First to nd119865119865119888119888 in each iteration the algorithm repeatedlytransfers the fragment that is not transferred previously inthis iteration and has maximum gain over all such frag-ments To accomplish this we use a locking mechanismAt the beginning of each iteration all the fragments areset free e free fragment with maximum gain is foundout and tentatively transferred to the other collection Aerthe transfer the fragment is locked at the new collectionis tentative transfer creates the rst intermediate partition1198751198751 e algorithm then nds the next free fragment withmaximum gain in 1198751198751 and transfer and lock that fragment to

4 ISRN Bioinformatics

119875119875119888119888 = 119875119875119875119875119875119875 119875119875119875FREE_LOCKS()CLEAR_LOG()repeat always

while there is an unlocked fragment in 119875119875119888119888 dobegin

nd a free fragment 119891119891 so that Gain119875119891119891119875 is maximumtransfer 119891119891 to the other classupdate the haplotypes aer the transferLOCK(119891119891)LOG_RECORD(119891119891 Gain119875119891119891119875)

end doFREE_LOCKS()check the log and nd119875119875119872119872Gain119875119865119865119888119888119875 and 119891119891maxif 119875119875119872119872Gain119875119865119865119888119888119875 gt 0

beginset new 119875119875119888119888 by rolling back the transfers that occurredaer the transfer of 119891119891maxcalculate haplotypes of 119875119875119888119888CLEAR_LOG()continue with the the loop

end ifelse

beginterminate the algorithm and output current haplotypes

end elseend repeat

A 1 e HMEC Algorithm

create the 1198751198752 us free fragments are transferred until allthe fragments are locked and the order of the transfer (119865119865119888119888)is stored in the log table along with the cumulative gains(119872119872Gain) 119875119875119872119872Gain is the maximum 119872119872Gain and 119891119891max is thefragment corresponding to119875119875119872119872Gain in the log table

Aer nishing all such tentative transfers 119875119875119888119888 becomes anundened partition To change 119875119875119888119888 to the desired ldquocurrentrdquopartition of the next iteration the algorithm checks the log tond the119875119875119872119872Gain119875119865119865119888119888119875 and 119891119891max and rollback the transfer ofall the fragments that were transferred aer 119891119891max When therollback completes 119875119875119888119888 becomes ready for the next iteration

While tentatively transferring a free fragment the algo-rithm needs to nd the fragment with maximum gain amongthe free fragments (which are not yet transferred) isrequires calculating gains for each of them To calculatethe Gain119875119891119891119875 = 119891119891119875119875119875119901119901119875 minus 119891119891119875119875119875119899119899119875 for a fragment we needto calculate two error values of two different partitionsthe present intermediate partition and the next partitionwhich will be resulted if 119891119891 is transferred Each of theseerror function requires calculation of two new haplotypesfrom their corresponding collections (see Figure 2) Although119891119891119875119875119875119901119901119875 and the haplotypes of 119875119875119901119901 can be found from theprevious transfer calculation of 119891119891119875119875119875119899119899119875 requires constructionof haplotypes of 119875119875119899119899 Since the difference between 119875119875119901119901 and 119875119875119899119899is only one transfer we can introduce differential calculationof haplotypes 119867119867119899119899

119894119894 119894119894 119894 119894119894119875 2119894 of next partition from thehaplotypes of 119867119867119901119901

119894119894 119894119894 119894 119894119894119875 2119894 of present partition For thispurpose the algorithm stores119873119873119894

119895119895119875119872119872119901119901119894119894 119875 and119873119873

0119895119895119875119872119872

119901119901119894119894 119875 values of

the present partition Aer a transfer these values will eitherremain same or be incremented or decremented by 1 atis why it is now possible to construct 119867119867119899119899

119894119894 119894119894 119894 119894119894119875 2119894 in 119874119874119875119874119874119875time To compute 119891119891119875119875119875119899119899119875 from the haplotypes requires119874119874119875119874119874119874119874119875time us running time to compute the 119891119891119875119875119875119899119899119875 as well as tocompute Gain119875119891119891119875 is119874119874119875119874119874119874119874 119874 119874119874119875

For each intermediate partition119875119875119894119894 119894119894 = 119894119875119894 119875 119899119899 we need tocompute Gain measures for119874119874minus 119894119894 unlocked fragments to ndthe maximum one e transfer of this fragments requiresupdating of119873119873119894

119895119895119875119872119872119894119894119875 and1198731198730119895119895119875119872119872119894119894119875 119894119894 119894 119894119894119875 2119894 and 119895119895 = 119894119875 2119875119894 119875 119874119874

Hence it also needs 119874119874119875119874119874119875 time to run Finally there will be119874119874 such transfer in each iteration and maximum119874119874 rollbacksus each iteration will require119874119874119875119874119874119875119874119874119875119874119874119874119874119874119874119874119875 119874 119874119874119875 119874119874119874119874119874119875 1198741198741198741198751198741198743119874119874119875 running time

We now give an example illustrating a single iteration ofour algorithm Figure 3 demonstrates an example iteration ofHMEC We consider that the current partition 119875119875119888119888 = 119875119875119894 isthe partition given in Figure 1(b) for the SNP matrix 119875119875 ofFigure 1(a) All the intermediate partitions 119875119875119894119894 119894119894 119894 119894119894119875119894 119875 119894119894are shown sequentially and the gains of each fragment overthe intermediate partitions are shown on the right of eachpartition e free fragment with maximum gain is markedin each intermediate partition For example the fragmentwith the maximum gain in 1198751198752 is fragment 6 which has gaintwo Aer each transfer the transferred fragment is shownlocked by a circle Here the ordering 119865119865119888119888 of the fragmentsis ⟨2119875 6119875 5119875 119894119875 4119875 3⟩ which is also the order of locking of thefragmentsis order will be stored along with the119872119872Gains in

ISRN Bioinformatics 5

1

2

3

4

5

6

1

3

4

5

6

2

Pp Pn

C1 C2 C1 C2

E(Pp) = 8 E(Pn) = 6

Gain(2) = E(Pp)minus E(Pn) = 2

H1 = 0 1 0 0 0 0 0 0 0 0

H2 = 0 1 0 0 0 0 0 0 0 1

H1 = 0 1 0 0 0 0 0 1 mdash mdash

H2 = 0 1 0 0 1 0 1 0 0 1

F 2 An example of Gain calculation

1

2

3

4

5

6

Gain(1) = 0

Gain(2) = 2

Gain(3) = 2

Gain(4) = 1

Gain(5) = 1

Gain(6) = 1

1

3

2

4

5

6

1

3

6

2

4

5

Gain(1) = minus1

Gain(3) = minus2

Gain(4) = minus1

Gain(5) = 0

1

6

3

52

Gain(1) = minus1

Gain(3) = minus3

1

2

3

5

4

6

5

6

4

2

3

1

1

6

5

32

4

Gain(1) = minus1

Gain(3) = minus2

Gain(4) = minus2

Gain(5) = 1

Gain(6) = 2

Gain(1) = minus2

Gain(3) = minus3

Gain(4) = minus1

Gain(3) = minus2

Locked fragment

P1

P2

P3

P4

P5

P6

P7

4

F 3 An example iteration of HMECe current partition 119875119875119888119888 = 1198751198751 is the partition given in Figure 1(b) for the SNPmatrix119872119872 of Figure1(a) Locked fragments are indicated by circles

the log table Figure 4 demonstrates the resulting log table ofthe illustrated iteration All the tentative transfers aer 119891119891maxhave to be rolled back so that the 1198751198753 becomes the next 119875119875119888119888

We now give an approximate gain measure to make ouralgorithm faster For large SNPmatrix1198741198741198741198741198743119896119896119896 running timeis critical to the performance of the algorithmWe can use anapproximation in the calculation of theGain119874119891119891119896 by using onlythe fragment 119891119891 and not using the119874119874119898 1 other fragments eapproximate gain should be

AppxGain 1007649100764911989111989110076651007665 = 119863119863 10076501007650119867119867119901119901119894119894 11989111989110076661007666 119898 119863119863 10076501007650119867119867119899119899

119895119895 11989111989110076661007666 (8)

Here 119867119867119901119901119894119894 is the haplotype of 119891119891rsquos present collection 119862119862119901119901

119894119894 ofpartition119875119875119901119901 and119867119867

119899119899119895119895 is the haplotype of119891119891rsquos next collection119862119862

119899119899119895119895

of partition 119875119875119899119899 is function ignores the effect of fragmentsother than 119891119891 on Gain119874119891119891119896 but reduces the run time of gaincalculation to 119874119874119874119896119896119896 erefore the total run time of eachiteration will be1198741198741198741198741198742119896119896119896

5 Performance Comparison

In this section we demonstrate the performance of ouralgorithm using both real biological and simulated datasetsWe compared our algorithm with GMEC [5] and HapCUT[6] We performed the simulation using the data fromangiotensin-converting enzyme (ACE) [17] and public Dalyset [18] to compare with GMEC [5] and used the HuRef data[19] to compare with HapCUT We also compared HMEC

6 ISRN Bioinformatics

MCGain

2

4

4

3

2

0

CGain

2

6

5

4

1

3

Log table

Fc

F 4 e log table corresponding to the iteration illustrated inFigure 3

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 5e comparison of HMEC and GMEC on ACE From topto bottom coverage = 75 coverage = 50 coverage = 25

10 20 30 40 50 60 70 8060

65

70

75

80

85

90

95

100

Coverage

Rec

on

stru

ctio

n r

ate

Heuristic data

Genetic data

HMEC

GMEC

F 6 Reconstruction rate versus coverage value

with some other algorithms (SpeedHap [9] FastHare [10]MLF [11] 2 distance MEC [12] SHR-3 [13]) using ReHapwebsite interface [15] which is an HapMap-based instancegenerator and comparison framework [14]

51 Comparison with GMEC In this section we comparethe performance of our algorithmwith the genetic algorithmwhich we call GMEC described in [5]

We rst sample the original haplotype pair into manyfragments with different coverage and error rates Eachfragment works as a distinct sample of the same specimenHere coverage rate indicates the percentage of the totalcolumns of the SNP matrix that have been sampled out eremaining slots are gaps We then introduce some specicamount of error into these samples e simulation wascontrolled in several ways We varied the error rate whilenumber of fragments and coverage rate were kept constantAlso coverage was varied while number of fragments anderror rate were kept constant

Notice that the way we introduced error and controlledthe coverage rate is not necessarily same as that of [5]erefore the reconstruction rates of the branch and boundalgorithm described in [5] which is an exact algorithmshould not be compared with those of HMEC

511 Experiment on Angiotensin-Converting Enzyme (ACE)Angiotensin-converting enzyme catalyses the conversion ofangiotensin I to the physiologically active peptide angiotensinII which controls uid-electrolyte balance and systematicblood pressure Because it has a key function in the renin-angiotensin system many association studies have been per-formed with DCP1 (encode angiotensin-converting anzyme)[20] Rieder et al completed the genomic sequencing of theDCP1 gene from 11 individuals and reported 78 SNP sites in22 chromosomes [17]

We take six pairs of haplotypes to perform the simulationWe generate 50 fragments from each of these haplotypepairs with varying coverage and error rate We perform thesimulation for three different coverage rates (25 50 and

ISRN Bioinformatics 7

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 7 e comparison of HMEC and GMEC on Daly set Fromtop to bottom coverage = 75 coverage = 50 coverage = 25

75) For each of these coverage rates we perform oursimulation for different error rates such as 5 10 1520 25 30 and 50 In every case we compare ouralgorithm with GMEC Figure 5 illustrates the comparisonthat bears the clear testimony to the superiority of our algo-rithm For most instances the reconstruction rate achievedby our algorithm is 100 or greater than 98 Only for afew cases with very high error rate and low coverage valuethe reconstruction rate falls below 95 We also performthe experiment for different coverage rates while keepingthe error rate constant Figure 6 illustrates the performanceof these two algorithms for various coverage values Herealso our algorithm clearly outperforms the genetic algorithmexcept for very low coverage value (which is unrealistic)

T 1 Running time of HMEC andGMEC on simulated datasets

Length of haplotype HMEC GMEC(sec) (sec)

78 0002 6172156 0016 11437312 0015 22500624 0031 45328780 0031 60766936 0047 72343

512 Simulation on Data from Chromosome 5q31 In thissection we discuss our simulation results on the data frompublic Daly set Daly et al reported a high-resolution analysisof a haplotype structure across 500 kb on chromosome 5q31using 103 SNPs in a European derived population whichconsists of 129 trios [18 20]

We performed the experiment exactly in the same waythat we did for angiotensin-converting enzyme Figure 7demonstrates the results Experimental results suggest thatHMEC is much better than the genetic algorithm Again formost cases the reconstruction rate achieved by our algorithmis 100 or greater than 98 For every instance our algo-rithm exhibits better performance than that of GMEC

513 Experiment on Simulated Data We used simulateddata for further evaluation of HMEC One of the veryimportant advantages of our algorithm is that it takes veryshort time to reconstruct the haplotypes Our algorithm ismuch faster than the GMEC Table 1 shows the running timeof HMEC and GMEC Here we perform the simulation byvarying the length (length denotes the number of SNP sites inthe haplotype pair) of the haplotypes while xed the value ofthe coverage rate and the error rate at 50 Since haplotypeswith such varying lengths are not available we rely on thesimulated data Clearly HMEC is much faster than GMECFor example while HMEC can reconstruct a haplotype with936 sites in a fraction of a second GMEC takes 72 seconds

52 Comparison with HapCUT HapCUT [6] is one of themost accurate heuristic algorithms for individual haplotyp-ing HapCUT uses a random initial haplotype congurationand builds a graph It computes max-cut on the graph tond the position to ip and iterates until no improvementin MEC score is achieved We have performed extensiveexperiments to compare the performance ofHMECandHap-CUT Experimental results suggest that although HapCUT isreliable its running time is too large to be a realistic choicefor whole genome haplotyping Our algorithm computes thehaplotypes signicantly faster than HapCUT without losingaccuracy

We used the ltered HuRef data from Levy et al [19]to evaluate the performance We generated several test datasets varying the coverage and error rate and tested theperformance of HMEC and HapCUT on these data sets eresults are shown in Tables 2 3 4 and 5

8 ISRN Bioinformatics

T 2 Reconstruction rate and running time of HapCUT and HMEC with 5 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0986 40582 0859 011915 0913 38282 0889 012520 0875 46324 0811 012925 0902 47101 0827 013030 0840 39196 0802 012540 0694 44792 0665 0129

T 3 Reconstruction rate and running time of HapCUT and HMEC with 10 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 76081 0966 014115 0954 90206 0948 015420 0953 81169 0957 014425 0952 74924 0923 014230 0951 81828 0893 017740 0934 69748 0814 0169

T 4 Reconstruction rate and running time of HapCUT and HMEC with 25 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 184642 1000 015615 0988 171698 1000 016720 0988 134406 0998 016325 0988 133966 0998 015530 0987 136572 0996 017240 0985 171334 0960 0161

T 5 Reconstruction rate and running time of HapCUT and HMEC with 35 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 181025 1000 015815 0988 132434 1000 012720 0988 231989 0999 010525 0987 210040 0998 010130 0988 217292 0996 010240 0987 215533 0995 0100

T 6 Reconstruction rate and running time of HapCUT and HMEC for ReHap samples

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

02 0870 20343 0985 003503 0865 21109 0970 003004 0840 20106 0945 003505 0790 19549 0865 0050

ISRN Bioinformatics 9

T 7 Comparison of different haplotyping techniques using ReHap Min Len = 3 Max Len = 7 coverage = 8 Highest reconstruction ratefor each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 1 091 1 1 065 1 0903 0935 07 0945 061 0685 0965 08204 0605 078 078 068 055 0535 08105 0535 048 0485 0505 0526 0525 058

T 8 Comparison of different haplotyping techniques using ReHap Min Len = 10 Max Len = 30 coverage = 8 Highest reconstructionrate for each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 098 0975 0985 0985 084 0985 08703 0985 0985 097 097 074 097 086504 0895 079 0885 0885 0675 087 08405 074 047 076 076 051 0705 079

Experimental results indicate that the reconstructionrates of both HapCUT and HMEC are reasonably goodFor very low coverage reconstruction rate of HapCUT isslightly better than HMEC However as the coverage rateincreases HMEC begins to outperform HapCUT Noticethat HapCUT is only better than HMEC for very low (andthus unrealistic) coverage values (5 and 10 coverage)However for higher coverage HMEC consistently performsbetter than HapCUT Furthermore HapCUT is signicantlyslower than HMEC For an instance with 35 coverage and40 error rate HapCUT takes 2155 seconds where HMECtakes only a fraction of a second (see Table 5) ereforealthough generally HapCUT provides reliable reconstructionrate on large dataset it is an unrealistic choice due to its timeconsuming operations On the other hand HMEC providesthe high accuracy with much less amount of time

We also used some sample data from ReHap project [1415]We created the allelematrix from the ReHap errormatrixand fed the matrix to both HMEC and HapCUT algorithmsHMEC consistently produces higher reconstruction rate thanHapCUT (see Table 6) Also the running time of HMEC isclearly much better than HapCUT

53 Comparison Using ReHap We also compared HMECwith some other well-known algorithms such as SpeedHap[9] FastHare [10] MLF [11] 2 distance MEC [12] andSHR-3 [13] using HapMap-based instance generator andcomparison framework [14 15] e results are shown inTable 7 and Table 8 that suggest that no single method clearlyoutperforms the others in all cases However reconstructionrates achieved by HMEC are the highest or very close to thehighest is bears a clear testimony to its suitability as apractical tool for individual haplotyping

6 Conclusion

In this paper we present a heuristic algorithm (HMEC) basedon minimum error correction that computes highly accuratehaplotypes signicantly faster than the known algorithmsfor haplotyping e algorithm is inspired from the famous

Fiduccia and Mattheyses (FM) algorithm for bipartitioninga hyper graph minimizing the cut size [8] We report on anextensive performance study evaluating our approach withother available techniques using both real and simulateddatasets Comprehensive performance study shows that ouralgorithm outperforms (in most cases) or matches the accu-racy of other well-known methods but runs in a fraction ofthe time needed for other techniques High accuracy and veryfast running time make our technique suitable for genome-wide scale data

e accuracy of the algorithm can be improved by incor-porating some prior knowledge For example small groupsof fragments that are declared to be in the same haplotypecan be identied Probabilistic methods like expectationmaximization (EM) also deserve some consideration oversuch optimization problems In the near future we intend toconsider these issues to make further improvements

Acknowledgments

e authors thank Rui-Sheng Wang (one of the authors of[5]) for providing them with the ACE and chromosome 5q31datasets

References

[1] R Cilibrasi L V Iersel S Kelk and J Tromp ldquoOn thecomplexity of several haplotyping problemsrdquo in Proceedings ofthe 5th International Workshop on Algorithms in Bioinformaticsvol 3692 of Lecture Notes in Computer Science pp 128ndash139Springer 2005

[2] P Bonizzoni G Della Vedova R Dondi and J Li ldquoehaplotyping problem an overview of computational modelsand solutionsrdquo Journal of Computer Science and Technology vol18 no 6 pp 675ndash688 2003

[3] A Chakravarti ldquoItrsquos raining SNPs hallelujahrdquoNature Geneticsvol 19 no 3 pp 216ndash217 1998

[4] R Lippert R Schwartz G Lancia and S Istrail ldquoAlgorithmicstrategies for the single nucleotide polymorphism haplotypeassembly problemrdquo Briengs in Bioinformatics vol 3 no 1 pp23ndash31 2002

10 ISRN Bioinformatics

[5] R S Wang L Y Wu Z P Li and X S Zhang ldquoHaplotypereconstruction from SNP fragments by minimum error correc-tionrdquo Bioinformatics vol 21 no 10 pp 2456ndash2462 2005

[6] V Bansal and V Bafna ldquoHapCUT an efficient and accuratealgorithm for the haplotype assembly problemrdquo Bioinformaticsvol 24 no 16 pp i153ndashi159 2008

[7] J Duitama T Huebsch G McEwen E K Suk and MR Hoehe ldquoReFHap a reliable and fast algorithm for singleindividual haplotypingrdquo in Proceedings of the ACM Interna-tional Conference on Bioinformatics and Computational Biology(ACM-BCB rsquo10) pp 160ndash169 August 2010

[8] C M Fiduccia and R M Mattheyses ldquoA linear time heuristicsfor improving network partitionsrdquo in Proceedings of the 19thDesign Automation Conference pp 175ndash181 1982

[9] L M Genovese F Geraci and M Pellegrini ldquoSpeedHap anaccurate heuristic for the single individual SNP haplotypingproblem with many gaps high reading error rate and lowcoveragerdquo IEEEACM Transactions on Computational Biologyand Bioinformatics vol 5 no 4 pp 492ndash502 2008

[10] A Panconesi and M Sozio ldquoFast hare a fast heuristic forsingle individual SNP haplotype reconstructionrdquo Workshop onAlgorithms in Bioinformatics vol 3240 pp 266ndash277 2004

[11] Y Y Zhao L Y Wu J H Zhang R S Wang and X S ZhangldquoHaplotype assembly from aligned weighted SNP fragmentsrdquoComputational Biology and Chemistry vol 29 pp 281ndash2872005

[12] YWang E Feng andRWang ldquoA clustering algorithmbased ontwo distance functions forMECmodelrdquoComputational Biologyand Chemistry vol 31 no 2 pp 148ndash150 2007

[13] Z Chen B Fu R Schweller B Yang Z Zhao and B ZhuldquoLinear time probabilistic algorithms for the singular haplotypereconstruction problem from SNP fragmentsrdquo Journal of Com-putational Biology vol 15 no 5 pp 535ndash546 2008

[14] F Geraci ldquoA comparison of several algorithms for the singleindividual SNP haplotyping reconstruction problemrdquo Bioinfor-matics vol 26 no 18 Article ID btq411 pp 2217ndash2225 2010

[15] ReHap httpbioalgoiitcnritrehap[16] A Al Mueen M S Bayzid M M Alam and M S Rahman ldquoA

heuristic algorithm for individual haplotyping with minimumerror correctionrdquo in Proceedings of the 1st International Confer-ence on BioMedical Engineering and Informatics (BMEI rsquo08) pp792ndash796 IEEE Computer Society Press May 2008

[17] M J Rieder S L Taylor A G Clark and D A NickrsonldquoSequence variation in the human angiotensin convertingenzymerdquo Nature Genetics vol 22 pp 59ndash62 1999

[18] M J Daly J D Rioux S F Schaffner T J Hudson and ES Lander ldquoHigh-resolution haplotype structure in the humangenomerdquo Nature Genetics vol 29 no 2 pp 229ndash232 2001

[19] S Levy G Sutton P C Ng L Feuk and A L Halpernldquoe deploid genome sequence of an individual humanrdquo PLOSBiology vol 5 no 10 article e254 2007

[20] X S Zhang R S Wang L Y Wu and W Zhang ldquoMinimumconict individual haplotyping fromSNP fragments and relatedgenotyperdquo Evolutionary Bioinformatics vol 2 pp 261ndash2702006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 3: 3FTFBSDI SUJDMF ).&$ )FVSJTUJDMHPSJUINGPS*OEJWJEVBM ...downloads.hindawi.com/archive/2013/291741.pdf · *43/ #jpjogpsnbujdt ! Á '3&&@-0$,4 $-&"3@-0( sfqfbubmxbzt xijmf uifsf jt bo

ISRN Bioinformatics 3

6

5

4

3

2

1

M

0 1 0 0 0 mdash mdash mdash mdash mdash

mdash mdash 0 0 1 1 1 0 0 0

mdash 1 0 0 0 0 0 1 mdash mdash

mdash mdash mdash mdash mdash 1 1 0 0 1

mdash mdash 0 0 1 1 1 0 0 0

mdash mdash mdash 1 0 0 0 0 0 1

(a)

3

1

2

4

5

6

C1 C2

P

(b)

H1 = 0 1 0 0 0 0 0 0 0 0

H2 = 0 1 0 0 0 0 0 0 0 1

(c)

F 1 SNP matrix and its partition

We now describe the general minimum error correctionproblem If a matrix 119872119872 is not error-free there will be noerror-free partition 119875119875 For such a matrix 119872119872 there will be atleast one conicting pair of fragments in each of the classes forall possible partitions erefore it is impossible to constructa haplotype that is non-conicting with all the fragments inits dening class of fragments If we are given a partition1198751198751198751198751198751 1198751198752) and two haplotypes 1198671198671 and 1198671198672 constructed from119875119875 then the number of errors 119864119864119875119875119875) that needs to be correctedcan be calculated by the following formula

119864119864 119875119875119875) =210055761005576119894119894=1

10055761005576119891119891119891119875119875119894119894

119863119863 1007649100764911989111989111986711986711989411989410076651007665 (5)

e MEC problem asks to nd a partition 119875119875 that mini-mizes the error function 119864119864119875119875119875) over all such partitions of anSNP matrix119872119872

3 A Heuristic Algorithm

In this section we give our heuristic algorithm based onminimum error correction which we call HMEC

Construction of a haplotype from an erroneous class 119875119875requires correction of SNP values that is alleles in the frag-mentsWewant tominimize the number of error correctionserefore for each SNP site the haplotype should take theallele that is present in the majority of the fragments Let1198731198730

119895119895119875119875119875) be the number of fragments of a collection119875119875 that have0 in the 119895119895th SNP Similarly 1198731198731

119895119895119875119875119875) denes the number of 1s[5] erefore to minimize the number of errors 119864119864119875119875119875) fora specic partition 119875119875 the haplotype should be constructedaccording to the following methodology

119867119867119894119894119895119895 =

100761810076181007634100763410076341007634100763410076341007626100762610076341007634100763410076341007634100763410076421007642

1 if1198731198731119895119895 1007649100764911987511987511989411989410076651007665 gt 119873119873

0119895119895 1007649100764911987511987511989411989410076651007665

0 if1198731198730119895119895 1007649100764911987511987511989411989410076651007665 ge 119873119873

1119895119895 1007649100764911987511987511989411989410076651007665 and

1198731198730119895119895 1007649100764911987511987511989411989410076651007665 ne 0

minus if1198731198731119895119895 1007649100764911987511987511989411989410076651007665 = 119873119873

0119895119895 1007649100764911987511987511989411989410076651007665 = 0

(6)

where 119894119894 119891 1198941 2119894 and 119895119895 = 1 2119895 119895119895 In Figure 1(c) twohaplotypes 1198671198671 and 1198671198672 associated with the partition 119875119875 inFigure 1(b) are constructed in this way

We will use a heuristic search to nd the best partitionis algorithm starts with a current partition 119875119875119888119888 = 119875119875119875119872119872 119875119875)and iteratively searches a better partition In each iterationthe algorithm performs a sequence of transfer of fragments

from their present collection to the other one so that thepartition becomes less erroneous e transfer of a fragmentfrom one collection to the other can increase or decreasethe error function 119864119864119875119875119875) Let the partition before transferringa fragment 119891119891 be 119875119875119901119901 and the partition resulted is 119875119875119899119899 Wedene the gain of the transfer as Gain119875119891119891) = 119864119864119875119875119875119901119901) minus 119864119864119875119875119875119899119899)Figure 2 demonstrates an example calculation of the gainmeasure Let 119865119865 = 119865119891119891119894119894⟩ 119894119894 = 1198941 2119895 119894119894119894 be an orderingof all the fragments in a partition 119875119875 in such a way thatfragment 119891119891119894119894 will precede fragment 119891119891119895119895 if all the fragmentsbefore 119891119891119894119894 in 119865119865 have already been transferred to form anintermediate partition 119875119875119894119894 and Gain119875119891119891119894119894) ge Gain119875119891119891119895119895) over 119875119875119894119894Hence 1198751198751 = 119875119875119888119888 at the start of each iteration We also denethe cumulative gain of a fragment ordering 119865119865 up to the 119899119899thfragment as 119875119875Gain119875119865119865 119899119899) = 119865119899119899

119894119894=1 Gain119875119891119891119894119894) Here Gain119875119891119891119894119894) =119864119864119875119875119875119894119894)minus1198641198641198751198751198751198941198941198941)emaximum cumulative gain119872119872119875119875Gain119875119865119865)is dened as

119872119872119875119875Gain 119875119865119865) = max1le119894119894le119894119894

119875119875Gain 119875119865119865 119894119894) (7)

In Section 4 we shall describe these terms with anexample

In each iteration the algorithm nds the current ordering119865119865119888119888 of 119875119875119888119888 and transfers only those fragments of 119865119865119888119888 that canachieve the119872119872119875119875Gain119875119865119865119888119888) and the fragment that is the last tobe transferred is referred as 119891119891max us the algorithm movesfrom one partition to another reducing the error function byan amount of119872119872119875119875Gain119875119865119865119888119888) Please see Algorithm 1 for a basicdescription of HMEC e algorithm continues as long as119872119872119875119875Gain119875119865119865119888119888) gt 0 and stops whenever119872119872119875119875Gain119875119865119865119888119888) le 0

4 Data Structures and Complexity

is section deals with the data structures and the complexityof our algorithm Here we also suggest some approximationto improve the performance of our algorithm

First to nd119865119865119888119888 in each iteration the algorithm repeatedlytransfers the fragment that is not transferred previously inthis iteration and has maximum gain over all such frag-ments To accomplish this we use a locking mechanismAt the beginning of each iteration all the fragments areset free e free fragment with maximum gain is foundout and tentatively transferred to the other collection Aerthe transfer the fragment is locked at the new collectionis tentative transfer creates the rst intermediate partition1198751198751 e algorithm then nds the next free fragment withmaximum gain in 1198751198751 and transfer and lock that fragment to

4 ISRN Bioinformatics

119875119875119888119888 = 119875119875119875119875119875119875 119875119875119875FREE_LOCKS()CLEAR_LOG()repeat always

while there is an unlocked fragment in 119875119875119888119888 dobegin

nd a free fragment 119891119891 so that Gain119875119891119891119875 is maximumtransfer 119891119891 to the other classupdate the haplotypes aer the transferLOCK(119891119891)LOG_RECORD(119891119891 Gain119875119891119891119875)

end doFREE_LOCKS()check the log and nd119875119875119872119872Gain119875119865119865119888119888119875 and 119891119891maxif 119875119875119872119872Gain119875119865119865119888119888119875 gt 0

beginset new 119875119875119888119888 by rolling back the transfers that occurredaer the transfer of 119891119891maxcalculate haplotypes of 119875119875119888119888CLEAR_LOG()continue with the the loop

end ifelse

beginterminate the algorithm and output current haplotypes

end elseend repeat

A 1 e HMEC Algorithm

create the 1198751198752 us free fragments are transferred until allthe fragments are locked and the order of the transfer (119865119865119888119888)is stored in the log table along with the cumulative gains(119872119872Gain) 119875119875119872119872Gain is the maximum 119872119872Gain and 119891119891max is thefragment corresponding to119875119875119872119872Gain in the log table

Aer nishing all such tentative transfers 119875119875119888119888 becomes anundened partition To change 119875119875119888119888 to the desired ldquocurrentrdquopartition of the next iteration the algorithm checks the log tond the119875119875119872119872Gain119875119865119865119888119888119875 and 119891119891max and rollback the transfer ofall the fragments that were transferred aer 119891119891max When therollback completes 119875119875119888119888 becomes ready for the next iteration

While tentatively transferring a free fragment the algo-rithm needs to nd the fragment with maximum gain amongthe free fragments (which are not yet transferred) isrequires calculating gains for each of them To calculatethe Gain119875119891119891119875 = 119891119891119875119875119875119901119901119875 minus 119891119891119875119875119875119899119899119875 for a fragment we needto calculate two error values of two different partitionsthe present intermediate partition and the next partitionwhich will be resulted if 119891119891 is transferred Each of theseerror function requires calculation of two new haplotypesfrom their corresponding collections (see Figure 2) Although119891119891119875119875119875119901119901119875 and the haplotypes of 119875119875119901119901 can be found from theprevious transfer calculation of 119891119891119875119875119875119899119899119875 requires constructionof haplotypes of 119875119875119899119899 Since the difference between 119875119875119901119901 and 119875119875119899119899is only one transfer we can introduce differential calculationof haplotypes 119867119867119899119899

119894119894 119894119894 119894 119894119894119875 2119894 of next partition from thehaplotypes of 119867119867119901119901

119894119894 119894119894 119894 119894119894119875 2119894 of present partition For thispurpose the algorithm stores119873119873119894

119895119895119875119872119872119901119901119894119894 119875 and119873119873

0119895119895119875119872119872

119901119901119894119894 119875 values of

the present partition Aer a transfer these values will eitherremain same or be incremented or decremented by 1 atis why it is now possible to construct 119867119867119899119899

119894119894 119894119894 119894 119894119894119875 2119894 in 119874119874119875119874119874119875time To compute 119891119891119875119875119875119899119899119875 from the haplotypes requires119874119874119875119874119874119874119874119875time us running time to compute the 119891119891119875119875119875119899119899119875 as well as tocompute Gain119875119891119891119875 is119874119874119875119874119874119874119874 119874 119874119874119875

For each intermediate partition119875119875119894119894 119894119894 = 119894119875119894 119875 119899119899 we need tocompute Gain measures for119874119874minus 119894119894 unlocked fragments to ndthe maximum one e transfer of this fragments requiresupdating of119873119873119894

119895119895119875119872119872119894119894119875 and1198731198730119895119895119875119872119872119894119894119875 119894119894 119894 119894119894119875 2119894 and 119895119895 = 119894119875 2119875119894 119875 119874119874

Hence it also needs 119874119874119875119874119874119875 time to run Finally there will be119874119874 such transfer in each iteration and maximum119874119874 rollbacksus each iteration will require119874119874119875119874119874119875119874119874119875119874119874119874119874119874119874119874119875 119874 119874119874119875 119874119874119874119874119874119875 1198741198741198741198751198741198743119874119874119875 running time

We now give an example illustrating a single iteration ofour algorithm Figure 3 demonstrates an example iteration ofHMEC We consider that the current partition 119875119875119888119888 = 119875119875119894 isthe partition given in Figure 1(b) for the SNP matrix 119875119875 ofFigure 1(a) All the intermediate partitions 119875119875119894119894 119894119894 119894 119894119894119875119894 119875 119894119894are shown sequentially and the gains of each fragment overthe intermediate partitions are shown on the right of eachpartition e free fragment with maximum gain is markedin each intermediate partition For example the fragmentwith the maximum gain in 1198751198752 is fragment 6 which has gaintwo Aer each transfer the transferred fragment is shownlocked by a circle Here the ordering 119865119865119888119888 of the fragmentsis ⟨2119875 6119875 5119875 119894119875 4119875 3⟩ which is also the order of locking of thefragmentsis order will be stored along with the119872119872Gains in

ISRN Bioinformatics 5

1

2

3

4

5

6

1

3

4

5

6

2

Pp Pn

C1 C2 C1 C2

E(Pp) = 8 E(Pn) = 6

Gain(2) = E(Pp)minus E(Pn) = 2

H1 = 0 1 0 0 0 0 0 0 0 0

H2 = 0 1 0 0 0 0 0 0 0 1

H1 = 0 1 0 0 0 0 0 1 mdash mdash

H2 = 0 1 0 0 1 0 1 0 0 1

F 2 An example of Gain calculation

1

2

3

4

5

6

Gain(1) = 0

Gain(2) = 2

Gain(3) = 2

Gain(4) = 1

Gain(5) = 1

Gain(6) = 1

1

3

2

4

5

6

1

3

6

2

4

5

Gain(1) = minus1

Gain(3) = minus2

Gain(4) = minus1

Gain(5) = 0

1

6

3

52

Gain(1) = minus1

Gain(3) = minus3

1

2

3

5

4

6

5

6

4

2

3

1

1

6

5

32

4

Gain(1) = minus1

Gain(3) = minus2

Gain(4) = minus2

Gain(5) = 1

Gain(6) = 2

Gain(1) = minus2

Gain(3) = minus3

Gain(4) = minus1

Gain(3) = minus2

Locked fragment

P1

P2

P3

P4

P5

P6

P7

4

F 3 An example iteration of HMECe current partition 119875119875119888119888 = 1198751198751 is the partition given in Figure 1(b) for the SNPmatrix119872119872 of Figure1(a) Locked fragments are indicated by circles

the log table Figure 4 demonstrates the resulting log table ofthe illustrated iteration All the tentative transfers aer 119891119891maxhave to be rolled back so that the 1198751198753 becomes the next 119875119875119888119888

We now give an approximate gain measure to make ouralgorithm faster For large SNPmatrix1198741198741198741198741198743119896119896119896 running timeis critical to the performance of the algorithmWe can use anapproximation in the calculation of theGain119874119891119891119896 by using onlythe fragment 119891119891 and not using the119874119874119898 1 other fragments eapproximate gain should be

AppxGain 1007649100764911989111989110076651007665 = 119863119863 10076501007650119867119867119901119901119894119894 11989111989110076661007666 119898 119863119863 10076501007650119867119867119899119899

119895119895 11989111989110076661007666 (8)

Here 119867119867119901119901119894119894 is the haplotype of 119891119891rsquos present collection 119862119862119901119901

119894119894 ofpartition119875119875119901119901 and119867119867

119899119899119895119895 is the haplotype of119891119891rsquos next collection119862119862

119899119899119895119895

of partition 119875119875119899119899 is function ignores the effect of fragmentsother than 119891119891 on Gain119874119891119891119896 but reduces the run time of gaincalculation to 119874119874119874119896119896119896 erefore the total run time of eachiteration will be1198741198741198741198741198742119896119896119896

5 Performance Comparison

In this section we demonstrate the performance of ouralgorithm using both real biological and simulated datasetsWe compared our algorithm with GMEC [5] and HapCUT[6] We performed the simulation using the data fromangiotensin-converting enzyme (ACE) [17] and public Dalyset [18] to compare with GMEC [5] and used the HuRef data[19] to compare with HapCUT We also compared HMEC

6 ISRN Bioinformatics

MCGain

2

4

4

3

2

0

CGain

2

6

5

4

1

3

Log table

Fc

F 4 e log table corresponding to the iteration illustrated inFigure 3

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 5e comparison of HMEC and GMEC on ACE From topto bottom coverage = 75 coverage = 50 coverage = 25

10 20 30 40 50 60 70 8060

65

70

75

80

85

90

95

100

Coverage

Rec

on

stru

ctio

n r

ate

Heuristic data

Genetic data

HMEC

GMEC

F 6 Reconstruction rate versus coverage value

with some other algorithms (SpeedHap [9] FastHare [10]MLF [11] 2 distance MEC [12] SHR-3 [13]) using ReHapwebsite interface [15] which is an HapMap-based instancegenerator and comparison framework [14]

51 Comparison with GMEC In this section we comparethe performance of our algorithmwith the genetic algorithmwhich we call GMEC described in [5]

We rst sample the original haplotype pair into manyfragments with different coverage and error rates Eachfragment works as a distinct sample of the same specimenHere coverage rate indicates the percentage of the totalcolumns of the SNP matrix that have been sampled out eremaining slots are gaps We then introduce some specicamount of error into these samples e simulation wascontrolled in several ways We varied the error rate whilenumber of fragments and coverage rate were kept constantAlso coverage was varied while number of fragments anderror rate were kept constant

Notice that the way we introduced error and controlledthe coverage rate is not necessarily same as that of [5]erefore the reconstruction rates of the branch and boundalgorithm described in [5] which is an exact algorithmshould not be compared with those of HMEC

511 Experiment on Angiotensin-Converting Enzyme (ACE)Angiotensin-converting enzyme catalyses the conversion ofangiotensin I to the physiologically active peptide angiotensinII which controls uid-electrolyte balance and systematicblood pressure Because it has a key function in the renin-angiotensin system many association studies have been per-formed with DCP1 (encode angiotensin-converting anzyme)[20] Rieder et al completed the genomic sequencing of theDCP1 gene from 11 individuals and reported 78 SNP sites in22 chromosomes [17]

We take six pairs of haplotypes to perform the simulationWe generate 50 fragments from each of these haplotypepairs with varying coverage and error rate We perform thesimulation for three different coverage rates (25 50 and

ISRN Bioinformatics 7

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 7 e comparison of HMEC and GMEC on Daly set Fromtop to bottom coverage = 75 coverage = 50 coverage = 25

75) For each of these coverage rates we perform oursimulation for different error rates such as 5 10 1520 25 30 and 50 In every case we compare ouralgorithm with GMEC Figure 5 illustrates the comparisonthat bears the clear testimony to the superiority of our algo-rithm For most instances the reconstruction rate achievedby our algorithm is 100 or greater than 98 Only for afew cases with very high error rate and low coverage valuethe reconstruction rate falls below 95 We also performthe experiment for different coverage rates while keepingthe error rate constant Figure 6 illustrates the performanceof these two algorithms for various coverage values Herealso our algorithm clearly outperforms the genetic algorithmexcept for very low coverage value (which is unrealistic)

T 1 Running time of HMEC andGMEC on simulated datasets

Length of haplotype HMEC GMEC(sec) (sec)

78 0002 6172156 0016 11437312 0015 22500624 0031 45328780 0031 60766936 0047 72343

512 Simulation on Data from Chromosome 5q31 In thissection we discuss our simulation results on the data frompublic Daly set Daly et al reported a high-resolution analysisof a haplotype structure across 500 kb on chromosome 5q31using 103 SNPs in a European derived population whichconsists of 129 trios [18 20]

We performed the experiment exactly in the same waythat we did for angiotensin-converting enzyme Figure 7demonstrates the results Experimental results suggest thatHMEC is much better than the genetic algorithm Again formost cases the reconstruction rate achieved by our algorithmis 100 or greater than 98 For every instance our algo-rithm exhibits better performance than that of GMEC

513 Experiment on Simulated Data We used simulateddata for further evaluation of HMEC One of the veryimportant advantages of our algorithm is that it takes veryshort time to reconstruct the haplotypes Our algorithm ismuch faster than the GMEC Table 1 shows the running timeof HMEC and GMEC Here we perform the simulation byvarying the length (length denotes the number of SNP sites inthe haplotype pair) of the haplotypes while xed the value ofthe coverage rate and the error rate at 50 Since haplotypeswith such varying lengths are not available we rely on thesimulated data Clearly HMEC is much faster than GMECFor example while HMEC can reconstruct a haplotype with936 sites in a fraction of a second GMEC takes 72 seconds

52 Comparison with HapCUT HapCUT [6] is one of themost accurate heuristic algorithms for individual haplotyp-ing HapCUT uses a random initial haplotype congurationand builds a graph It computes max-cut on the graph tond the position to ip and iterates until no improvementin MEC score is achieved We have performed extensiveexperiments to compare the performance ofHMECandHap-CUT Experimental results suggest that although HapCUT isreliable its running time is too large to be a realistic choicefor whole genome haplotyping Our algorithm computes thehaplotypes signicantly faster than HapCUT without losingaccuracy

We used the ltered HuRef data from Levy et al [19]to evaluate the performance We generated several test datasets varying the coverage and error rate and tested theperformance of HMEC and HapCUT on these data sets eresults are shown in Tables 2 3 4 and 5

8 ISRN Bioinformatics

T 2 Reconstruction rate and running time of HapCUT and HMEC with 5 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0986 40582 0859 011915 0913 38282 0889 012520 0875 46324 0811 012925 0902 47101 0827 013030 0840 39196 0802 012540 0694 44792 0665 0129

T 3 Reconstruction rate and running time of HapCUT and HMEC with 10 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 76081 0966 014115 0954 90206 0948 015420 0953 81169 0957 014425 0952 74924 0923 014230 0951 81828 0893 017740 0934 69748 0814 0169

T 4 Reconstruction rate and running time of HapCUT and HMEC with 25 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 184642 1000 015615 0988 171698 1000 016720 0988 134406 0998 016325 0988 133966 0998 015530 0987 136572 0996 017240 0985 171334 0960 0161

T 5 Reconstruction rate and running time of HapCUT and HMEC with 35 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 181025 1000 015815 0988 132434 1000 012720 0988 231989 0999 010525 0987 210040 0998 010130 0988 217292 0996 010240 0987 215533 0995 0100

T 6 Reconstruction rate and running time of HapCUT and HMEC for ReHap samples

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

02 0870 20343 0985 003503 0865 21109 0970 003004 0840 20106 0945 003505 0790 19549 0865 0050

ISRN Bioinformatics 9

T 7 Comparison of different haplotyping techniques using ReHap Min Len = 3 Max Len = 7 coverage = 8 Highest reconstruction ratefor each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 1 091 1 1 065 1 0903 0935 07 0945 061 0685 0965 08204 0605 078 078 068 055 0535 08105 0535 048 0485 0505 0526 0525 058

T 8 Comparison of different haplotyping techniques using ReHap Min Len = 10 Max Len = 30 coverage = 8 Highest reconstructionrate for each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 098 0975 0985 0985 084 0985 08703 0985 0985 097 097 074 097 086504 0895 079 0885 0885 0675 087 08405 074 047 076 076 051 0705 079

Experimental results indicate that the reconstructionrates of both HapCUT and HMEC are reasonably goodFor very low coverage reconstruction rate of HapCUT isslightly better than HMEC However as the coverage rateincreases HMEC begins to outperform HapCUT Noticethat HapCUT is only better than HMEC for very low (andthus unrealistic) coverage values (5 and 10 coverage)However for higher coverage HMEC consistently performsbetter than HapCUT Furthermore HapCUT is signicantlyslower than HMEC For an instance with 35 coverage and40 error rate HapCUT takes 2155 seconds where HMECtakes only a fraction of a second (see Table 5) ereforealthough generally HapCUT provides reliable reconstructionrate on large dataset it is an unrealistic choice due to its timeconsuming operations On the other hand HMEC providesthe high accuracy with much less amount of time

We also used some sample data from ReHap project [1415]We created the allelematrix from the ReHap errormatrixand fed the matrix to both HMEC and HapCUT algorithmsHMEC consistently produces higher reconstruction rate thanHapCUT (see Table 6) Also the running time of HMEC isclearly much better than HapCUT

53 Comparison Using ReHap We also compared HMECwith some other well-known algorithms such as SpeedHap[9] FastHare [10] MLF [11] 2 distance MEC [12] andSHR-3 [13] using HapMap-based instance generator andcomparison framework [14 15] e results are shown inTable 7 and Table 8 that suggest that no single method clearlyoutperforms the others in all cases However reconstructionrates achieved by HMEC are the highest or very close to thehighest is bears a clear testimony to its suitability as apractical tool for individual haplotyping

6 Conclusion

In this paper we present a heuristic algorithm (HMEC) basedon minimum error correction that computes highly accuratehaplotypes signicantly faster than the known algorithmsfor haplotyping e algorithm is inspired from the famous

Fiduccia and Mattheyses (FM) algorithm for bipartitioninga hyper graph minimizing the cut size [8] We report on anextensive performance study evaluating our approach withother available techniques using both real and simulateddatasets Comprehensive performance study shows that ouralgorithm outperforms (in most cases) or matches the accu-racy of other well-known methods but runs in a fraction ofthe time needed for other techniques High accuracy and veryfast running time make our technique suitable for genome-wide scale data

e accuracy of the algorithm can be improved by incor-porating some prior knowledge For example small groupsof fragments that are declared to be in the same haplotypecan be identied Probabilistic methods like expectationmaximization (EM) also deserve some consideration oversuch optimization problems In the near future we intend toconsider these issues to make further improvements

Acknowledgments

e authors thank Rui-Sheng Wang (one of the authors of[5]) for providing them with the ACE and chromosome 5q31datasets

References

[1] R Cilibrasi L V Iersel S Kelk and J Tromp ldquoOn thecomplexity of several haplotyping problemsrdquo in Proceedings ofthe 5th International Workshop on Algorithms in Bioinformaticsvol 3692 of Lecture Notes in Computer Science pp 128ndash139Springer 2005

[2] P Bonizzoni G Della Vedova R Dondi and J Li ldquoehaplotyping problem an overview of computational modelsand solutionsrdquo Journal of Computer Science and Technology vol18 no 6 pp 675ndash688 2003

[3] A Chakravarti ldquoItrsquos raining SNPs hallelujahrdquoNature Geneticsvol 19 no 3 pp 216ndash217 1998

[4] R Lippert R Schwartz G Lancia and S Istrail ldquoAlgorithmicstrategies for the single nucleotide polymorphism haplotypeassembly problemrdquo Briengs in Bioinformatics vol 3 no 1 pp23ndash31 2002

10 ISRN Bioinformatics

[5] R S Wang L Y Wu Z P Li and X S Zhang ldquoHaplotypereconstruction from SNP fragments by minimum error correc-tionrdquo Bioinformatics vol 21 no 10 pp 2456ndash2462 2005

[6] V Bansal and V Bafna ldquoHapCUT an efficient and accuratealgorithm for the haplotype assembly problemrdquo Bioinformaticsvol 24 no 16 pp i153ndashi159 2008

[7] J Duitama T Huebsch G McEwen E K Suk and MR Hoehe ldquoReFHap a reliable and fast algorithm for singleindividual haplotypingrdquo in Proceedings of the ACM Interna-tional Conference on Bioinformatics and Computational Biology(ACM-BCB rsquo10) pp 160ndash169 August 2010

[8] C M Fiduccia and R M Mattheyses ldquoA linear time heuristicsfor improving network partitionsrdquo in Proceedings of the 19thDesign Automation Conference pp 175ndash181 1982

[9] L M Genovese F Geraci and M Pellegrini ldquoSpeedHap anaccurate heuristic for the single individual SNP haplotypingproblem with many gaps high reading error rate and lowcoveragerdquo IEEEACM Transactions on Computational Biologyand Bioinformatics vol 5 no 4 pp 492ndash502 2008

[10] A Panconesi and M Sozio ldquoFast hare a fast heuristic forsingle individual SNP haplotype reconstructionrdquo Workshop onAlgorithms in Bioinformatics vol 3240 pp 266ndash277 2004

[11] Y Y Zhao L Y Wu J H Zhang R S Wang and X S ZhangldquoHaplotype assembly from aligned weighted SNP fragmentsrdquoComputational Biology and Chemistry vol 29 pp 281ndash2872005

[12] YWang E Feng andRWang ldquoA clustering algorithmbased ontwo distance functions forMECmodelrdquoComputational Biologyand Chemistry vol 31 no 2 pp 148ndash150 2007

[13] Z Chen B Fu R Schweller B Yang Z Zhao and B ZhuldquoLinear time probabilistic algorithms for the singular haplotypereconstruction problem from SNP fragmentsrdquo Journal of Com-putational Biology vol 15 no 5 pp 535ndash546 2008

[14] F Geraci ldquoA comparison of several algorithms for the singleindividual SNP haplotyping reconstruction problemrdquo Bioinfor-matics vol 26 no 18 Article ID btq411 pp 2217ndash2225 2010

[15] ReHap httpbioalgoiitcnritrehap[16] A Al Mueen M S Bayzid M M Alam and M S Rahman ldquoA

heuristic algorithm for individual haplotyping with minimumerror correctionrdquo in Proceedings of the 1st International Confer-ence on BioMedical Engineering and Informatics (BMEI rsquo08) pp792ndash796 IEEE Computer Society Press May 2008

[17] M J Rieder S L Taylor A G Clark and D A NickrsonldquoSequence variation in the human angiotensin convertingenzymerdquo Nature Genetics vol 22 pp 59ndash62 1999

[18] M J Daly J D Rioux S F Schaffner T J Hudson and ES Lander ldquoHigh-resolution haplotype structure in the humangenomerdquo Nature Genetics vol 29 no 2 pp 229ndash232 2001

[19] S Levy G Sutton P C Ng L Feuk and A L Halpernldquoe deploid genome sequence of an individual humanrdquo PLOSBiology vol 5 no 10 article e254 2007

[20] X S Zhang R S Wang L Y Wu and W Zhang ldquoMinimumconict individual haplotyping fromSNP fragments and relatedgenotyperdquo Evolutionary Bioinformatics vol 2 pp 261ndash2702006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 4: 3FTFBSDI SUJDMF ).&$ )FVSJTUJDMHPSJUINGPS*OEJWJEVBM ...downloads.hindawi.com/archive/2013/291741.pdf · *43/ #jpjogpsnbujdt ! Á '3&&@-0$,4 $-&"3@-0( sfqfbubmxbzt xijmf uifsf jt bo

4 ISRN Bioinformatics

119875119875119888119888 = 119875119875119875119875119875119875 119875119875119875FREE_LOCKS()CLEAR_LOG()repeat always

while there is an unlocked fragment in 119875119875119888119888 dobegin

nd a free fragment 119891119891 so that Gain119875119891119891119875 is maximumtransfer 119891119891 to the other classupdate the haplotypes aer the transferLOCK(119891119891)LOG_RECORD(119891119891 Gain119875119891119891119875)

end doFREE_LOCKS()check the log and nd119875119875119872119872Gain119875119865119865119888119888119875 and 119891119891maxif 119875119875119872119872Gain119875119865119865119888119888119875 gt 0

beginset new 119875119875119888119888 by rolling back the transfers that occurredaer the transfer of 119891119891maxcalculate haplotypes of 119875119875119888119888CLEAR_LOG()continue with the the loop

end ifelse

beginterminate the algorithm and output current haplotypes

end elseend repeat

A 1 e HMEC Algorithm

create the 1198751198752 us free fragments are transferred until allthe fragments are locked and the order of the transfer (119865119865119888119888)is stored in the log table along with the cumulative gains(119872119872Gain) 119875119875119872119872Gain is the maximum 119872119872Gain and 119891119891max is thefragment corresponding to119875119875119872119872Gain in the log table

Aer nishing all such tentative transfers 119875119875119888119888 becomes anundened partition To change 119875119875119888119888 to the desired ldquocurrentrdquopartition of the next iteration the algorithm checks the log tond the119875119875119872119872Gain119875119865119865119888119888119875 and 119891119891max and rollback the transfer ofall the fragments that were transferred aer 119891119891max When therollback completes 119875119875119888119888 becomes ready for the next iteration

While tentatively transferring a free fragment the algo-rithm needs to nd the fragment with maximum gain amongthe free fragments (which are not yet transferred) isrequires calculating gains for each of them To calculatethe Gain119875119891119891119875 = 119891119891119875119875119875119901119901119875 minus 119891119891119875119875119875119899119899119875 for a fragment we needto calculate two error values of two different partitionsthe present intermediate partition and the next partitionwhich will be resulted if 119891119891 is transferred Each of theseerror function requires calculation of two new haplotypesfrom their corresponding collections (see Figure 2) Although119891119891119875119875119875119901119901119875 and the haplotypes of 119875119875119901119901 can be found from theprevious transfer calculation of 119891119891119875119875119875119899119899119875 requires constructionof haplotypes of 119875119875119899119899 Since the difference between 119875119875119901119901 and 119875119875119899119899is only one transfer we can introduce differential calculationof haplotypes 119867119867119899119899

119894119894 119894119894 119894 119894119894119875 2119894 of next partition from thehaplotypes of 119867119867119901119901

119894119894 119894119894 119894 119894119894119875 2119894 of present partition For thispurpose the algorithm stores119873119873119894

119895119895119875119872119872119901119901119894119894 119875 and119873119873

0119895119895119875119872119872

119901119901119894119894 119875 values of

the present partition Aer a transfer these values will eitherremain same or be incremented or decremented by 1 atis why it is now possible to construct 119867119867119899119899

119894119894 119894119894 119894 119894119894119875 2119894 in 119874119874119875119874119874119875time To compute 119891119891119875119875119875119899119899119875 from the haplotypes requires119874119874119875119874119874119874119874119875time us running time to compute the 119891119891119875119875119875119899119899119875 as well as tocompute Gain119875119891119891119875 is119874119874119875119874119874119874119874 119874 119874119874119875

For each intermediate partition119875119875119894119894 119894119894 = 119894119875119894 119875 119899119899 we need tocompute Gain measures for119874119874minus 119894119894 unlocked fragments to ndthe maximum one e transfer of this fragments requiresupdating of119873119873119894

119895119895119875119872119872119894119894119875 and1198731198730119895119895119875119872119872119894119894119875 119894119894 119894 119894119894119875 2119894 and 119895119895 = 119894119875 2119875119894 119875 119874119874

Hence it also needs 119874119874119875119874119874119875 time to run Finally there will be119874119874 such transfer in each iteration and maximum119874119874 rollbacksus each iteration will require119874119874119875119874119874119875119874119874119875119874119874119874119874119874119874119874119875 119874 119874119874119875 119874119874119874119874119874119875 1198741198741198741198751198741198743119874119874119875 running time

We now give an example illustrating a single iteration ofour algorithm Figure 3 demonstrates an example iteration ofHMEC We consider that the current partition 119875119875119888119888 = 119875119875119894 isthe partition given in Figure 1(b) for the SNP matrix 119875119875 ofFigure 1(a) All the intermediate partitions 119875119875119894119894 119894119894 119894 119894119894119875119894 119875 119894119894are shown sequentially and the gains of each fragment overthe intermediate partitions are shown on the right of eachpartition e free fragment with maximum gain is markedin each intermediate partition For example the fragmentwith the maximum gain in 1198751198752 is fragment 6 which has gaintwo Aer each transfer the transferred fragment is shownlocked by a circle Here the ordering 119865119865119888119888 of the fragmentsis ⟨2119875 6119875 5119875 119894119875 4119875 3⟩ which is also the order of locking of thefragmentsis order will be stored along with the119872119872Gains in

ISRN Bioinformatics 5

1

2

3

4

5

6

1

3

4

5

6

2

Pp Pn

C1 C2 C1 C2

E(Pp) = 8 E(Pn) = 6

Gain(2) = E(Pp)minus E(Pn) = 2

H1 = 0 1 0 0 0 0 0 0 0 0

H2 = 0 1 0 0 0 0 0 0 0 1

H1 = 0 1 0 0 0 0 0 1 mdash mdash

H2 = 0 1 0 0 1 0 1 0 0 1

F 2 An example of Gain calculation

1

2

3

4

5

6

Gain(1) = 0

Gain(2) = 2

Gain(3) = 2

Gain(4) = 1

Gain(5) = 1

Gain(6) = 1

1

3

2

4

5

6

1

3

6

2

4

5

Gain(1) = minus1

Gain(3) = minus2

Gain(4) = minus1

Gain(5) = 0

1

6

3

52

Gain(1) = minus1

Gain(3) = minus3

1

2

3

5

4

6

5

6

4

2

3

1

1

6

5

32

4

Gain(1) = minus1

Gain(3) = minus2

Gain(4) = minus2

Gain(5) = 1

Gain(6) = 2

Gain(1) = minus2

Gain(3) = minus3

Gain(4) = minus1

Gain(3) = minus2

Locked fragment

P1

P2

P3

P4

P5

P6

P7

4

F 3 An example iteration of HMECe current partition 119875119875119888119888 = 1198751198751 is the partition given in Figure 1(b) for the SNPmatrix119872119872 of Figure1(a) Locked fragments are indicated by circles

the log table Figure 4 demonstrates the resulting log table ofthe illustrated iteration All the tentative transfers aer 119891119891maxhave to be rolled back so that the 1198751198753 becomes the next 119875119875119888119888

We now give an approximate gain measure to make ouralgorithm faster For large SNPmatrix1198741198741198741198741198743119896119896119896 running timeis critical to the performance of the algorithmWe can use anapproximation in the calculation of theGain119874119891119891119896 by using onlythe fragment 119891119891 and not using the119874119874119898 1 other fragments eapproximate gain should be

AppxGain 1007649100764911989111989110076651007665 = 119863119863 10076501007650119867119867119901119901119894119894 11989111989110076661007666 119898 119863119863 10076501007650119867119867119899119899

119895119895 11989111989110076661007666 (8)

Here 119867119867119901119901119894119894 is the haplotype of 119891119891rsquos present collection 119862119862119901119901

119894119894 ofpartition119875119875119901119901 and119867119867

119899119899119895119895 is the haplotype of119891119891rsquos next collection119862119862

119899119899119895119895

of partition 119875119875119899119899 is function ignores the effect of fragmentsother than 119891119891 on Gain119874119891119891119896 but reduces the run time of gaincalculation to 119874119874119874119896119896119896 erefore the total run time of eachiteration will be1198741198741198741198741198742119896119896119896

5 Performance Comparison

In this section we demonstrate the performance of ouralgorithm using both real biological and simulated datasetsWe compared our algorithm with GMEC [5] and HapCUT[6] We performed the simulation using the data fromangiotensin-converting enzyme (ACE) [17] and public Dalyset [18] to compare with GMEC [5] and used the HuRef data[19] to compare with HapCUT We also compared HMEC

6 ISRN Bioinformatics

MCGain

2

4

4

3

2

0

CGain

2

6

5

4

1

3

Log table

Fc

F 4 e log table corresponding to the iteration illustrated inFigure 3

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 5e comparison of HMEC and GMEC on ACE From topto bottom coverage = 75 coverage = 50 coverage = 25

10 20 30 40 50 60 70 8060

65

70

75

80

85

90

95

100

Coverage

Rec

on

stru

ctio

n r

ate

Heuristic data

Genetic data

HMEC

GMEC

F 6 Reconstruction rate versus coverage value

with some other algorithms (SpeedHap [9] FastHare [10]MLF [11] 2 distance MEC [12] SHR-3 [13]) using ReHapwebsite interface [15] which is an HapMap-based instancegenerator and comparison framework [14]

51 Comparison with GMEC In this section we comparethe performance of our algorithmwith the genetic algorithmwhich we call GMEC described in [5]

We rst sample the original haplotype pair into manyfragments with different coverage and error rates Eachfragment works as a distinct sample of the same specimenHere coverage rate indicates the percentage of the totalcolumns of the SNP matrix that have been sampled out eremaining slots are gaps We then introduce some specicamount of error into these samples e simulation wascontrolled in several ways We varied the error rate whilenumber of fragments and coverage rate were kept constantAlso coverage was varied while number of fragments anderror rate were kept constant

Notice that the way we introduced error and controlledthe coverage rate is not necessarily same as that of [5]erefore the reconstruction rates of the branch and boundalgorithm described in [5] which is an exact algorithmshould not be compared with those of HMEC

511 Experiment on Angiotensin-Converting Enzyme (ACE)Angiotensin-converting enzyme catalyses the conversion ofangiotensin I to the physiologically active peptide angiotensinII which controls uid-electrolyte balance and systematicblood pressure Because it has a key function in the renin-angiotensin system many association studies have been per-formed with DCP1 (encode angiotensin-converting anzyme)[20] Rieder et al completed the genomic sequencing of theDCP1 gene from 11 individuals and reported 78 SNP sites in22 chromosomes [17]

We take six pairs of haplotypes to perform the simulationWe generate 50 fragments from each of these haplotypepairs with varying coverage and error rate We perform thesimulation for three different coverage rates (25 50 and

ISRN Bioinformatics 7

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 7 e comparison of HMEC and GMEC on Daly set Fromtop to bottom coverage = 75 coverage = 50 coverage = 25

75) For each of these coverage rates we perform oursimulation for different error rates such as 5 10 1520 25 30 and 50 In every case we compare ouralgorithm with GMEC Figure 5 illustrates the comparisonthat bears the clear testimony to the superiority of our algo-rithm For most instances the reconstruction rate achievedby our algorithm is 100 or greater than 98 Only for afew cases with very high error rate and low coverage valuethe reconstruction rate falls below 95 We also performthe experiment for different coverage rates while keepingthe error rate constant Figure 6 illustrates the performanceof these two algorithms for various coverage values Herealso our algorithm clearly outperforms the genetic algorithmexcept for very low coverage value (which is unrealistic)

T 1 Running time of HMEC andGMEC on simulated datasets

Length of haplotype HMEC GMEC(sec) (sec)

78 0002 6172156 0016 11437312 0015 22500624 0031 45328780 0031 60766936 0047 72343

512 Simulation on Data from Chromosome 5q31 In thissection we discuss our simulation results on the data frompublic Daly set Daly et al reported a high-resolution analysisof a haplotype structure across 500 kb on chromosome 5q31using 103 SNPs in a European derived population whichconsists of 129 trios [18 20]

We performed the experiment exactly in the same waythat we did for angiotensin-converting enzyme Figure 7demonstrates the results Experimental results suggest thatHMEC is much better than the genetic algorithm Again formost cases the reconstruction rate achieved by our algorithmis 100 or greater than 98 For every instance our algo-rithm exhibits better performance than that of GMEC

513 Experiment on Simulated Data We used simulateddata for further evaluation of HMEC One of the veryimportant advantages of our algorithm is that it takes veryshort time to reconstruct the haplotypes Our algorithm ismuch faster than the GMEC Table 1 shows the running timeof HMEC and GMEC Here we perform the simulation byvarying the length (length denotes the number of SNP sites inthe haplotype pair) of the haplotypes while xed the value ofthe coverage rate and the error rate at 50 Since haplotypeswith such varying lengths are not available we rely on thesimulated data Clearly HMEC is much faster than GMECFor example while HMEC can reconstruct a haplotype with936 sites in a fraction of a second GMEC takes 72 seconds

52 Comparison with HapCUT HapCUT [6] is one of themost accurate heuristic algorithms for individual haplotyp-ing HapCUT uses a random initial haplotype congurationand builds a graph It computes max-cut on the graph tond the position to ip and iterates until no improvementin MEC score is achieved We have performed extensiveexperiments to compare the performance ofHMECandHap-CUT Experimental results suggest that although HapCUT isreliable its running time is too large to be a realistic choicefor whole genome haplotyping Our algorithm computes thehaplotypes signicantly faster than HapCUT without losingaccuracy

We used the ltered HuRef data from Levy et al [19]to evaluate the performance We generated several test datasets varying the coverage and error rate and tested theperformance of HMEC and HapCUT on these data sets eresults are shown in Tables 2 3 4 and 5

8 ISRN Bioinformatics

T 2 Reconstruction rate and running time of HapCUT and HMEC with 5 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0986 40582 0859 011915 0913 38282 0889 012520 0875 46324 0811 012925 0902 47101 0827 013030 0840 39196 0802 012540 0694 44792 0665 0129

T 3 Reconstruction rate and running time of HapCUT and HMEC with 10 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 76081 0966 014115 0954 90206 0948 015420 0953 81169 0957 014425 0952 74924 0923 014230 0951 81828 0893 017740 0934 69748 0814 0169

T 4 Reconstruction rate and running time of HapCUT and HMEC with 25 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 184642 1000 015615 0988 171698 1000 016720 0988 134406 0998 016325 0988 133966 0998 015530 0987 136572 0996 017240 0985 171334 0960 0161

T 5 Reconstruction rate and running time of HapCUT and HMEC with 35 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 181025 1000 015815 0988 132434 1000 012720 0988 231989 0999 010525 0987 210040 0998 010130 0988 217292 0996 010240 0987 215533 0995 0100

T 6 Reconstruction rate and running time of HapCUT and HMEC for ReHap samples

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

02 0870 20343 0985 003503 0865 21109 0970 003004 0840 20106 0945 003505 0790 19549 0865 0050

ISRN Bioinformatics 9

T 7 Comparison of different haplotyping techniques using ReHap Min Len = 3 Max Len = 7 coverage = 8 Highest reconstruction ratefor each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 1 091 1 1 065 1 0903 0935 07 0945 061 0685 0965 08204 0605 078 078 068 055 0535 08105 0535 048 0485 0505 0526 0525 058

T 8 Comparison of different haplotyping techniques using ReHap Min Len = 10 Max Len = 30 coverage = 8 Highest reconstructionrate for each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 098 0975 0985 0985 084 0985 08703 0985 0985 097 097 074 097 086504 0895 079 0885 0885 0675 087 08405 074 047 076 076 051 0705 079

Experimental results indicate that the reconstructionrates of both HapCUT and HMEC are reasonably goodFor very low coverage reconstruction rate of HapCUT isslightly better than HMEC However as the coverage rateincreases HMEC begins to outperform HapCUT Noticethat HapCUT is only better than HMEC for very low (andthus unrealistic) coverage values (5 and 10 coverage)However for higher coverage HMEC consistently performsbetter than HapCUT Furthermore HapCUT is signicantlyslower than HMEC For an instance with 35 coverage and40 error rate HapCUT takes 2155 seconds where HMECtakes only a fraction of a second (see Table 5) ereforealthough generally HapCUT provides reliable reconstructionrate on large dataset it is an unrealistic choice due to its timeconsuming operations On the other hand HMEC providesthe high accuracy with much less amount of time

We also used some sample data from ReHap project [1415]We created the allelematrix from the ReHap errormatrixand fed the matrix to both HMEC and HapCUT algorithmsHMEC consistently produces higher reconstruction rate thanHapCUT (see Table 6) Also the running time of HMEC isclearly much better than HapCUT

53 Comparison Using ReHap We also compared HMECwith some other well-known algorithms such as SpeedHap[9] FastHare [10] MLF [11] 2 distance MEC [12] andSHR-3 [13] using HapMap-based instance generator andcomparison framework [14 15] e results are shown inTable 7 and Table 8 that suggest that no single method clearlyoutperforms the others in all cases However reconstructionrates achieved by HMEC are the highest or very close to thehighest is bears a clear testimony to its suitability as apractical tool for individual haplotyping

6 Conclusion

In this paper we present a heuristic algorithm (HMEC) basedon minimum error correction that computes highly accuratehaplotypes signicantly faster than the known algorithmsfor haplotyping e algorithm is inspired from the famous

Fiduccia and Mattheyses (FM) algorithm for bipartitioninga hyper graph minimizing the cut size [8] We report on anextensive performance study evaluating our approach withother available techniques using both real and simulateddatasets Comprehensive performance study shows that ouralgorithm outperforms (in most cases) or matches the accu-racy of other well-known methods but runs in a fraction ofthe time needed for other techniques High accuracy and veryfast running time make our technique suitable for genome-wide scale data

e accuracy of the algorithm can be improved by incor-porating some prior knowledge For example small groupsof fragments that are declared to be in the same haplotypecan be identied Probabilistic methods like expectationmaximization (EM) also deserve some consideration oversuch optimization problems In the near future we intend toconsider these issues to make further improvements

Acknowledgments

e authors thank Rui-Sheng Wang (one of the authors of[5]) for providing them with the ACE and chromosome 5q31datasets

References

[1] R Cilibrasi L V Iersel S Kelk and J Tromp ldquoOn thecomplexity of several haplotyping problemsrdquo in Proceedings ofthe 5th International Workshop on Algorithms in Bioinformaticsvol 3692 of Lecture Notes in Computer Science pp 128ndash139Springer 2005

[2] P Bonizzoni G Della Vedova R Dondi and J Li ldquoehaplotyping problem an overview of computational modelsand solutionsrdquo Journal of Computer Science and Technology vol18 no 6 pp 675ndash688 2003

[3] A Chakravarti ldquoItrsquos raining SNPs hallelujahrdquoNature Geneticsvol 19 no 3 pp 216ndash217 1998

[4] R Lippert R Schwartz G Lancia and S Istrail ldquoAlgorithmicstrategies for the single nucleotide polymorphism haplotypeassembly problemrdquo Briengs in Bioinformatics vol 3 no 1 pp23ndash31 2002

10 ISRN Bioinformatics

[5] R S Wang L Y Wu Z P Li and X S Zhang ldquoHaplotypereconstruction from SNP fragments by minimum error correc-tionrdquo Bioinformatics vol 21 no 10 pp 2456ndash2462 2005

[6] V Bansal and V Bafna ldquoHapCUT an efficient and accuratealgorithm for the haplotype assembly problemrdquo Bioinformaticsvol 24 no 16 pp i153ndashi159 2008

[7] J Duitama T Huebsch G McEwen E K Suk and MR Hoehe ldquoReFHap a reliable and fast algorithm for singleindividual haplotypingrdquo in Proceedings of the ACM Interna-tional Conference on Bioinformatics and Computational Biology(ACM-BCB rsquo10) pp 160ndash169 August 2010

[8] C M Fiduccia and R M Mattheyses ldquoA linear time heuristicsfor improving network partitionsrdquo in Proceedings of the 19thDesign Automation Conference pp 175ndash181 1982

[9] L M Genovese F Geraci and M Pellegrini ldquoSpeedHap anaccurate heuristic for the single individual SNP haplotypingproblem with many gaps high reading error rate and lowcoveragerdquo IEEEACM Transactions on Computational Biologyand Bioinformatics vol 5 no 4 pp 492ndash502 2008

[10] A Panconesi and M Sozio ldquoFast hare a fast heuristic forsingle individual SNP haplotype reconstructionrdquo Workshop onAlgorithms in Bioinformatics vol 3240 pp 266ndash277 2004

[11] Y Y Zhao L Y Wu J H Zhang R S Wang and X S ZhangldquoHaplotype assembly from aligned weighted SNP fragmentsrdquoComputational Biology and Chemistry vol 29 pp 281ndash2872005

[12] YWang E Feng andRWang ldquoA clustering algorithmbased ontwo distance functions forMECmodelrdquoComputational Biologyand Chemistry vol 31 no 2 pp 148ndash150 2007

[13] Z Chen B Fu R Schweller B Yang Z Zhao and B ZhuldquoLinear time probabilistic algorithms for the singular haplotypereconstruction problem from SNP fragmentsrdquo Journal of Com-putational Biology vol 15 no 5 pp 535ndash546 2008

[14] F Geraci ldquoA comparison of several algorithms for the singleindividual SNP haplotyping reconstruction problemrdquo Bioinfor-matics vol 26 no 18 Article ID btq411 pp 2217ndash2225 2010

[15] ReHap httpbioalgoiitcnritrehap[16] A Al Mueen M S Bayzid M M Alam and M S Rahman ldquoA

heuristic algorithm for individual haplotyping with minimumerror correctionrdquo in Proceedings of the 1st International Confer-ence on BioMedical Engineering and Informatics (BMEI rsquo08) pp792ndash796 IEEE Computer Society Press May 2008

[17] M J Rieder S L Taylor A G Clark and D A NickrsonldquoSequence variation in the human angiotensin convertingenzymerdquo Nature Genetics vol 22 pp 59ndash62 1999

[18] M J Daly J D Rioux S F Schaffner T J Hudson and ES Lander ldquoHigh-resolution haplotype structure in the humangenomerdquo Nature Genetics vol 29 no 2 pp 229ndash232 2001

[19] S Levy G Sutton P C Ng L Feuk and A L Halpernldquoe deploid genome sequence of an individual humanrdquo PLOSBiology vol 5 no 10 article e254 2007

[20] X S Zhang R S Wang L Y Wu and W Zhang ldquoMinimumconict individual haplotyping fromSNP fragments and relatedgenotyperdquo Evolutionary Bioinformatics vol 2 pp 261ndash2702006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 5: 3FTFBSDI SUJDMF ).&$ )FVSJTUJDMHPSJUINGPS*OEJWJEVBM ...downloads.hindawi.com/archive/2013/291741.pdf · *43/ #jpjogpsnbujdt ! Á '3&&@-0$,4 $-&"3@-0( sfqfbubmxbzt xijmf uifsf jt bo

ISRN Bioinformatics 5

1

2

3

4

5

6

1

3

4

5

6

2

Pp Pn

C1 C2 C1 C2

E(Pp) = 8 E(Pn) = 6

Gain(2) = E(Pp)minus E(Pn) = 2

H1 = 0 1 0 0 0 0 0 0 0 0

H2 = 0 1 0 0 0 0 0 0 0 1

H1 = 0 1 0 0 0 0 0 1 mdash mdash

H2 = 0 1 0 0 1 0 1 0 0 1

F 2 An example of Gain calculation

1

2

3

4

5

6

Gain(1) = 0

Gain(2) = 2

Gain(3) = 2

Gain(4) = 1

Gain(5) = 1

Gain(6) = 1

1

3

2

4

5

6

1

3

6

2

4

5

Gain(1) = minus1

Gain(3) = minus2

Gain(4) = minus1

Gain(5) = 0

1

6

3

52

Gain(1) = minus1

Gain(3) = minus3

1

2

3

5

4

6

5

6

4

2

3

1

1

6

5

32

4

Gain(1) = minus1

Gain(3) = minus2

Gain(4) = minus2

Gain(5) = 1

Gain(6) = 2

Gain(1) = minus2

Gain(3) = minus3

Gain(4) = minus1

Gain(3) = minus2

Locked fragment

P1

P2

P3

P4

P5

P6

P7

4

F 3 An example iteration of HMECe current partition 119875119875119888119888 = 1198751198751 is the partition given in Figure 1(b) for the SNPmatrix119872119872 of Figure1(a) Locked fragments are indicated by circles

the log table Figure 4 demonstrates the resulting log table ofthe illustrated iteration All the tentative transfers aer 119891119891maxhave to be rolled back so that the 1198751198753 becomes the next 119875119875119888119888

We now give an approximate gain measure to make ouralgorithm faster For large SNPmatrix1198741198741198741198741198743119896119896119896 running timeis critical to the performance of the algorithmWe can use anapproximation in the calculation of theGain119874119891119891119896 by using onlythe fragment 119891119891 and not using the119874119874119898 1 other fragments eapproximate gain should be

AppxGain 1007649100764911989111989110076651007665 = 119863119863 10076501007650119867119867119901119901119894119894 11989111989110076661007666 119898 119863119863 10076501007650119867119867119899119899

119895119895 11989111989110076661007666 (8)

Here 119867119867119901119901119894119894 is the haplotype of 119891119891rsquos present collection 119862119862119901119901

119894119894 ofpartition119875119875119901119901 and119867119867

119899119899119895119895 is the haplotype of119891119891rsquos next collection119862119862

119899119899119895119895

of partition 119875119875119899119899 is function ignores the effect of fragmentsother than 119891119891 on Gain119874119891119891119896 but reduces the run time of gaincalculation to 119874119874119874119896119896119896 erefore the total run time of eachiteration will be1198741198741198741198741198742119896119896119896

5 Performance Comparison

In this section we demonstrate the performance of ouralgorithm using both real biological and simulated datasetsWe compared our algorithm with GMEC [5] and HapCUT[6] We performed the simulation using the data fromangiotensin-converting enzyme (ACE) [17] and public Dalyset [18] to compare with GMEC [5] and used the HuRef data[19] to compare with HapCUT We also compared HMEC

6 ISRN Bioinformatics

MCGain

2

4

4

3

2

0

CGain

2

6

5

4

1

3

Log table

Fc

F 4 e log table corresponding to the iteration illustrated inFigure 3

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 5e comparison of HMEC and GMEC on ACE From topto bottom coverage = 75 coverage = 50 coverage = 25

10 20 30 40 50 60 70 8060

65

70

75

80

85

90

95

100

Coverage

Rec

on

stru

ctio

n r

ate

Heuristic data

Genetic data

HMEC

GMEC

F 6 Reconstruction rate versus coverage value

with some other algorithms (SpeedHap [9] FastHare [10]MLF [11] 2 distance MEC [12] SHR-3 [13]) using ReHapwebsite interface [15] which is an HapMap-based instancegenerator and comparison framework [14]

51 Comparison with GMEC In this section we comparethe performance of our algorithmwith the genetic algorithmwhich we call GMEC described in [5]

We rst sample the original haplotype pair into manyfragments with different coverage and error rates Eachfragment works as a distinct sample of the same specimenHere coverage rate indicates the percentage of the totalcolumns of the SNP matrix that have been sampled out eremaining slots are gaps We then introduce some specicamount of error into these samples e simulation wascontrolled in several ways We varied the error rate whilenumber of fragments and coverage rate were kept constantAlso coverage was varied while number of fragments anderror rate were kept constant

Notice that the way we introduced error and controlledthe coverage rate is not necessarily same as that of [5]erefore the reconstruction rates of the branch and boundalgorithm described in [5] which is an exact algorithmshould not be compared with those of HMEC

511 Experiment on Angiotensin-Converting Enzyme (ACE)Angiotensin-converting enzyme catalyses the conversion ofangiotensin I to the physiologically active peptide angiotensinII which controls uid-electrolyte balance and systematicblood pressure Because it has a key function in the renin-angiotensin system many association studies have been per-formed with DCP1 (encode angiotensin-converting anzyme)[20] Rieder et al completed the genomic sequencing of theDCP1 gene from 11 individuals and reported 78 SNP sites in22 chromosomes [17]

We take six pairs of haplotypes to perform the simulationWe generate 50 fragments from each of these haplotypepairs with varying coverage and error rate We perform thesimulation for three different coverage rates (25 50 and

ISRN Bioinformatics 7

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 7 e comparison of HMEC and GMEC on Daly set Fromtop to bottom coverage = 75 coverage = 50 coverage = 25

75) For each of these coverage rates we perform oursimulation for different error rates such as 5 10 1520 25 30 and 50 In every case we compare ouralgorithm with GMEC Figure 5 illustrates the comparisonthat bears the clear testimony to the superiority of our algo-rithm For most instances the reconstruction rate achievedby our algorithm is 100 or greater than 98 Only for afew cases with very high error rate and low coverage valuethe reconstruction rate falls below 95 We also performthe experiment for different coverage rates while keepingthe error rate constant Figure 6 illustrates the performanceof these two algorithms for various coverage values Herealso our algorithm clearly outperforms the genetic algorithmexcept for very low coverage value (which is unrealistic)

T 1 Running time of HMEC andGMEC on simulated datasets

Length of haplotype HMEC GMEC(sec) (sec)

78 0002 6172156 0016 11437312 0015 22500624 0031 45328780 0031 60766936 0047 72343

512 Simulation on Data from Chromosome 5q31 In thissection we discuss our simulation results on the data frompublic Daly set Daly et al reported a high-resolution analysisof a haplotype structure across 500 kb on chromosome 5q31using 103 SNPs in a European derived population whichconsists of 129 trios [18 20]

We performed the experiment exactly in the same waythat we did for angiotensin-converting enzyme Figure 7demonstrates the results Experimental results suggest thatHMEC is much better than the genetic algorithm Again formost cases the reconstruction rate achieved by our algorithmis 100 or greater than 98 For every instance our algo-rithm exhibits better performance than that of GMEC

513 Experiment on Simulated Data We used simulateddata for further evaluation of HMEC One of the veryimportant advantages of our algorithm is that it takes veryshort time to reconstruct the haplotypes Our algorithm ismuch faster than the GMEC Table 1 shows the running timeof HMEC and GMEC Here we perform the simulation byvarying the length (length denotes the number of SNP sites inthe haplotype pair) of the haplotypes while xed the value ofthe coverage rate and the error rate at 50 Since haplotypeswith such varying lengths are not available we rely on thesimulated data Clearly HMEC is much faster than GMECFor example while HMEC can reconstruct a haplotype with936 sites in a fraction of a second GMEC takes 72 seconds

52 Comparison with HapCUT HapCUT [6] is one of themost accurate heuristic algorithms for individual haplotyp-ing HapCUT uses a random initial haplotype congurationand builds a graph It computes max-cut on the graph tond the position to ip and iterates until no improvementin MEC score is achieved We have performed extensiveexperiments to compare the performance ofHMECandHap-CUT Experimental results suggest that although HapCUT isreliable its running time is too large to be a realistic choicefor whole genome haplotyping Our algorithm computes thehaplotypes signicantly faster than HapCUT without losingaccuracy

We used the ltered HuRef data from Levy et al [19]to evaluate the performance We generated several test datasets varying the coverage and error rate and tested theperformance of HMEC and HapCUT on these data sets eresults are shown in Tables 2 3 4 and 5

8 ISRN Bioinformatics

T 2 Reconstruction rate and running time of HapCUT and HMEC with 5 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0986 40582 0859 011915 0913 38282 0889 012520 0875 46324 0811 012925 0902 47101 0827 013030 0840 39196 0802 012540 0694 44792 0665 0129

T 3 Reconstruction rate and running time of HapCUT and HMEC with 10 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 76081 0966 014115 0954 90206 0948 015420 0953 81169 0957 014425 0952 74924 0923 014230 0951 81828 0893 017740 0934 69748 0814 0169

T 4 Reconstruction rate and running time of HapCUT and HMEC with 25 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 184642 1000 015615 0988 171698 1000 016720 0988 134406 0998 016325 0988 133966 0998 015530 0987 136572 0996 017240 0985 171334 0960 0161

T 5 Reconstruction rate and running time of HapCUT and HMEC with 35 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 181025 1000 015815 0988 132434 1000 012720 0988 231989 0999 010525 0987 210040 0998 010130 0988 217292 0996 010240 0987 215533 0995 0100

T 6 Reconstruction rate and running time of HapCUT and HMEC for ReHap samples

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

02 0870 20343 0985 003503 0865 21109 0970 003004 0840 20106 0945 003505 0790 19549 0865 0050

ISRN Bioinformatics 9

T 7 Comparison of different haplotyping techniques using ReHap Min Len = 3 Max Len = 7 coverage = 8 Highest reconstruction ratefor each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 1 091 1 1 065 1 0903 0935 07 0945 061 0685 0965 08204 0605 078 078 068 055 0535 08105 0535 048 0485 0505 0526 0525 058

T 8 Comparison of different haplotyping techniques using ReHap Min Len = 10 Max Len = 30 coverage = 8 Highest reconstructionrate for each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 098 0975 0985 0985 084 0985 08703 0985 0985 097 097 074 097 086504 0895 079 0885 0885 0675 087 08405 074 047 076 076 051 0705 079

Experimental results indicate that the reconstructionrates of both HapCUT and HMEC are reasonably goodFor very low coverage reconstruction rate of HapCUT isslightly better than HMEC However as the coverage rateincreases HMEC begins to outperform HapCUT Noticethat HapCUT is only better than HMEC for very low (andthus unrealistic) coverage values (5 and 10 coverage)However for higher coverage HMEC consistently performsbetter than HapCUT Furthermore HapCUT is signicantlyslower than HMEC For an instance with 35 coverage and40 error rate HapCUT takes 2155 seconds where HMECtakes only a fraction of a second (see Table 5) ereforealthough generally HapCUT provides reliable reconstructionrate on large dataset it is an unrealistic choice due to its timeconsuming operations On the other hand HMEC providesthe high accuracy with much less amount of time

We also used some sample data from ReHap project [1415]We created the allelematrix from the ReHap errormatrixand fed the matrix to both HMEC and HapCUT algorithmsHMEC consistently produces higher reconstruction rate thanHapCUT (see Table 6) Also the running time of HMEC isclearly much better than HapCUT

53 Comparison Using ReHap We also compared HMECwith some other well-known algorithms such as SpeedHap[9] FastHare [10] MLF [11] 2 distance MEC [12] andSHR-3 [13] using HapMap-based instance generator andcomparison framework [14 15] e results are shown inTable 7 and Table 8 that suggest that no single method clearlyoutperforms the others in all cases However reconstructionrates achieved by HMEC are the highest or very close to thehighest is bears a clear testimony to its suitability as apractical tool for individual haplotyping

6 Conclusion

In this paper we present a heuristic algorithm (HMEC) basedon minimum error correction that computes highly accuratehaplotypes signicantly faster than the known algorithmsfor haplotyping e algorithm is inspired from the famous

Fiduccia and Mattheyses (FM) algorithm for bipartitioninga hyper graph minimizing the cut size [8] We report on anextensive performance study evaluating our approach withother available techniques using both real and simulateddatasets Comprehensive performance study shows that ouralgorithm outperforms (in most cases) or matches the accu-racy of other well-known methods but runs in a fraction ofthe time needed for other techniques High accuracy and veryfast running time make our technique suitable for genome-wide scale data

e accuracy of the algorithm can be improved by incor-porating some prior knowledge For example small groupsof fragments that are declared to be in the same haplotypecan be identied Probabilistic methods like expectationmaximization (EM) also deserve some consideration oversuch optimization problems In the near future we intend toconsider these issues to make further improvements

Acknowledgments

e authors thank Rui-Sheng Wang (one of the authors of[5]) for providing them with the ACE and chromosome 5q31datasets

References

[1] R Cilibrasi L V Iersel S Kelk and J Tromp ldquoOn thecomplexity of several haplotyping problemsrdquo in Proceedings ofthe 5th International Workshop on Algorithms in Bioinformaticsvol 3692 of Lecture Notes in Computer Science pp 128ndash139Springer 2005

[2] P Bonizzoni G Della Vedova R Dondi and J Li ldquoehaplotyping problem an overview of computational modelsand solutionsrdquo Journal of Computer Science and Technology vol18 no 6 pp 675ndash688 2003

[3] A Chakravarti ldquoItrsquos raining SNPs hallelujahrdquoNature Geneticsvol 19 no 3 pp 216ndash217 1998

[4] R Lippert R Schwartz G Lancia and S Istrail ldquoAlgorithmicstrategies for the single nucleotide polymorphism haplotypeassembly problemrdquo Briengs in Bioinformatics vol 3 no 1 pp23ndash31 2002

10 ISRN Bioinformatics

[5] R S Wang L Y Wu Z P Li and X S Zhang ldquoHaplotypereconstruction from SNP fragments by minimum error correc-tionrdquo Bioinformatics vol 21 no 10 pp 2456ndash2462 2005

[6] V Bansal and V Bafna ldquoHapCUT an efficient and accuratealgorithm for the haplotype assembly problemrdquo Bioinformaticsvol 24 no 16 pp i153ndashi159 2008

[7] J Duitama T Huebsch G McEwen E K Suk and MR Hoehe ldquoReFHap a reliable and fast algorithm for singleindividual haplotypingrdquo in Proceedings of the ACM Interna-tional Conference on Bioinformatics and Computational Biology(ACM-BCB rsquo10) pp 160ndash169 August 2010

[8] C M Fiduccia and R M Mattheyses ldquoA linear time heuristicsfor improving network partitionsrdquo in Proceedings of the 19thDesign Automation Conference pp 175ndash181 1982

[9] L M Genovese F Geraci and M Pellegrini ldquoSpeedHap anaccurate heuristic for the single individual SNP haplotypingproblem with many gaps high reading error rate and lowcoveragerdquo IEEEACM Transactions on Computational Biologyand Bioinformatics vol 5 no 4 pp 492ndash502 2008

[10] A Panconesi and M Sozio ldquoFast hare a fast heuristic forsingle individual SNP haplotype reconstructionrdquo Workshop onAlgorithms in Bioinformatics vol 3240 pp 266ndash277 2004

[11] Y Y Zhao L Y Wu J H Zhang R S Wang and X S ZhangldquoHaplotype assembly from aligned weighted SNP fragmentsrdquoComputational Biology and Chemistry vol 29 pp 281ndash2872005

[12] YWang E Feng andRWang ldquoA clustering algorithmbased ontwo distance functions forMECmodelrdquoComputational Biologyand Chemistry vol 31 no 2 pp 148ndash150 2007

[13] Z Chen B Fu R Schweller B Yang Z Zhao and B ZhuldquoLinear time probabilistic algorithms for the singular haplotypereconstruction problem from SNP fragmentsrdquo Journal of Com-putational Biology vol 15 no 5 pp 535ndash546 2008

[14] F Geraci ldquoA comparison of several algorithms for the singleindividual SNP haplotyping reconstruction problemrdquo Bioinfor-matics vol 26 no 18 Article ID btq411 pp 2217ndash2225 2010

[15] ReHap httpbioalgoiitcnritrehap[16] A Al Mueen M S Bayzid M M Alam and M S Rahman ldquoA

heuristic algorithm for individual haplotyping with minimumerror correctionrdquo in Proceedings of the 1st International Confer-ence on BioMedical Engineering and Informatics (BMEI rsquo08) pp792ndash796 IEEE Computer Society Press May 2008

[17] M J Rieder S L Taylor A G Clark and D A NickrsonldquoSequence variation in the human angiotensin convertingenzymerdquo Nature Genetics vol 22 pp 59ndash62 1999

[18] M J Daly J D Rioux S F Schaffner T J Hudson and ES Lander ldquoHigh-resolution haplotype structure in the humangenomerdquo Nature Genetics vol 29 no 2 pp 229ndash232 2001

[19] S Levy G Sutton P C Ng L Feuk and A L Halpernldquoe deploid genome sequence of an individual humanrdquo PLOSBiology vol 5 no 10 article e254 2007

[20] X S Zhang R S Wang L Y Wu and W Zhang ldquoMinimumconict individual haplotyping fromSNP fragments and relatedgenotyperdquo Evolutionary Bioinformatics vol 2 pp 261ndash2702006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 6: 3FTFBSDI SUJDMF ).&$ )FVSJTUJDMHPSJUINGPS*OEJWJEVBM ...downloads.hindawi.com/archive/2013/291741.pdf · *43/ #jpjogpsnbujdt ! Á '3&&@-0$,4 $-&"3@-0( sfqfbubmxbzt xijmf uifsf jt bo

6 ISRN Bioinformatics

MCGain

2

4

4

3

2

0

CGain

2

6

5

4

1

3

Log table

Fc

F 4 e log table corresponding to the iteration illustrated inFigure 3

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 5e comparison of HMEC and GMEC on ACE From topto bottom coverage = 75 coverage = 50 coverage = 25

10 20 30 40 50 60 70 8060

65

70

75

80

85

90

95

100

Coverage

Rec

on

stru

ctio

n r

ate

Heuristic data

Genetic data

HMEC

GMEC

F 6 Reconstruction rate versus coverage value

with some other algorithms (SpeedHap [9] FastHare [10]MLF [11] 2 distance MEC [12] SHR-3 [13]) using ReHapwebsite interface [15] which is an HapMap-based instancegenerator and comparison framework [14]

51 Comparison with GMEC In this section we comparethe performance of our algorithmwith the genetic algorithmwhich we call GMEC described in [5]

We rst sample the original haplotype pair into manyfragments with different coverage and error rates Eachfragment works as a distinct sample of the same specimenHere coverage rate indicates the percentage of the totalcolumns of the SNP matrix that have been sampled out eremaining slots are gaps We then introduce some specicamount of error into these samples e simulation wascontrolled in several ways We varied the error rate whilenumber of fragments and coverage rate were kept constantAlso coverage was varied while number of fragments anderror rate were kept constant

Notice that the way we introduced error and controlledthe coverage rate is not necessarily same as that of [5]erefore the reconstruction rates of the branch and boundalgorithm described in [5] which is an exact algorithmshould not be compared with those of HMEC

511 Experiment on Angiotensin-Converting Enzyme (ACE)Angiotensin-converting enzyme catalyses the conversion ofangiotensin I to the physiologically active peptide angiotensinII which controls uid-electrolyte balance and systematicblood pressure Because it has a key function in the renin-angiotensin system many association studies have been per-formed with DCP1 (encode angiotensin-converting anzyme)[20] Rieder et al completed the genomic sequencing of theDCP1 gene from 11 individuals and reported 78 SNP sites in22 chromosomes [17]

We take six pairs of haplotypes to perform the simulationWe generate 50 fragments from each of these haplotypepairs with varying coverage and error rate We perform thesimulation for three different coverage rates (25 50 and

ISRN Bioinformatics 7

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 7 e comparison of HMEC and GMEC on Daly set Fromtop to bottom coverage = 75 coverage = 50 coverage = 25

75) For each of these coverage rates we perform oursimulation for different error rates such as 5 10 1520 25 30 and 50 In every case we compare ouralgorithm with GMEC Figure 5 illustrates the comparisonthat bears the clear testimony to the superiority of our algo-rithm For most instances the reconstruction rate achievedby our algorithm is 100 or greater than 98 Only for afew cases with very high error rate and low coverage valuethe reconstruction rate falls below 95 We also performthe experiment for different coverage rates while keepingthe error rate constant Figure 6 illustrates the performanceof these two algorithms for various coverage values Herealso our algorithm clearly outperforms the genetic algorithmexcept for very low coverage value (which is unrealistic)

T 1 Running time of HMEC andGMEC on simulated datasets

Length of haplotype HMEC GMEC(sec) (sec)

78 0002 6172156 0016 11437312 0015 22500624 0031 45328780 0031 60766936 0047 72343

512 Simulation on Data from Chromosome 5q31 In thissection we discuss our simulation results on the data frompublic Daly set Daly et al reported a high-resolution analysisof a haplotype structure across 500 kb on chromosome 5q31using 103 SNPs in a European derived population whichconsists of 129 trios [18 20]

We performed the experiment exactly in the same waythat we did for angiotensin-converting enzyme Figure 7demonstrates the results Experimental results suggest thatHMEC is much better than the genetic algorithm Again formost cases the reconstruction rate achieved by our algorithmis 100 or greater than 98 For every instance our algo-rithm exhibits better performance than that of GMEC

513 Experiment on Simulated Data We used simulateddata for further evaluation of HMEC One of the veryimportant advantages of our algorithm is that it takes veryshort time to reconstruct the haplotypes Our algorithm ismuch faster than the GMEC Table 1 shows the running timeof HMEC and GMEC Here we perform the simulation byvarying the length (length denotes the number of SNP sites inthe haplotype pair) of the haplotypes while xed the value ofthe coverage rate and the error rate at 50 Since haplotypeswith such varying lengths are not available we rely on thesimulated data Clearly HMEC is much faster than GMECFor example while HMEC can reconstruct a haplotype with936 sites in a fraction of a second GMEC takes 72 seconds

52 Comparison with HapCUT HapCUT [6] is one of themost accurate heuristic algorithms for individual haplotyp-ing HapCUT uses a random initial haplotype congurationand builds a graph It computes max-cut on the graph tond the position to ip and iterates until no improvementin MEC score is achieved We have performed extensiveexperiments to compare the performance ofHMECandHap-CUT Experimental results suggest that although HapCUT isreliable its running time is too large to be a realistic choicefor whole genome haplotyping Our algorithm computes thehaplotypes signicantly faster than HapCUT without losingaccuracy

We used the ltered HuRef data from Levy et al [19]to evaluate the performance We generated several test datasets varying the coverage and error rate and tested theperformance of HMEC and HapCUT on these data sets eresults are shown in Tables 2 3 4 and 5

8 ISRN Bioinformatics

T 2 Reconstruction rate and running time of HapCUT and HMEC with 5 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0986 40582 0859 011915 0913 38282 0889 012520 0875 46324 0811 012925 0902 47101 0827 013030 0840 39196 0802 012540 0694 44792 0665 0129

T 3 Reconstruction rate and running time of HapCUT and HMEC with 10 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 76081 0966 014115 0954 90206 0948 015420 0953 81169 0957 014425 0952 74924 0923 014230 0951 81828 0893 017740 0934 69748 0814 0169

T 4 Reconstruction rate and running time of HapCUT and HMEC with 25 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 184642 1000 015615 0988 171698 1000 016720 0988 134406 0998 016325 0988 133966 0998 015530 0987 136572 0996 017240 0985 171334 0960 0161

T 5 Reconstruction rate and running time of HapCUT and HMEC with 35 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 181025 1000 015815 0988 132434 1000 012720 0988 231989 0999 010525 0987 210040 0998 010130 0988 217292 0996 010240 0987 215533 0995 0100

T 6 Reconstruction rate and running time of HapCUT and HMEC for ReHap samples

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

02 0870 20343 0985 003503 0865 21109 0970 003004 0840 20106 0945 003505 0790 19549 0865 0050

ISRN Bioinformatics 9

T 7 Comparison of different haplotyping techniques using ReHap Min Len = 3 Max Len = 7 coverage = 8 Highest reconstruction ratefor each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 1 091 1 1 065 1 0903 0935 07 0945 061 0685 0965 08204 0605 078 078 068 055 0535 08105 0535 048 0485 0505 0526 0525 058

T 8 Comparison of different haplotyping techniques using ReHap Min Len = 10 Max Len = 30 coverage = 8 Highest reconstructionrate for each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 098 0975 0985 0985 084 0985 08703 0985 0985 097 097 074 097 086504 0895 079 0885 0885 0675 087 08405 074 047 076 076 051 0705 079

Experimental results indicate that the reconstructionrates of both HapCUT and HMEC are reasonably goodFor very low coverage reconstruction rate of HapCUT isslightly better than HMEC However as the coverage rateincreases HMEC begins to outperform HapCUT Noticethat HapCUT is only better than HMEC for very low (andthus unrealistic) coverage values (5 and 10 coverage)However for higher coverage HMEC consistently performsbetter than HapCUT Furthermore HapCUT is signicantlyslower than HMEC For an instance with 35 coverage and40 error rate HapCUT takes 2155 seconds where HMECtakes only a fraction of a second (see Table 5) ereforealthough generally HapCUT provides reliable reconstructionrate on large dataset it is an unrealistic choice due to its timeconsuming operations On the other hand HMEC providesthe high accuracy with much less amount of time

We also used some sample data from ReHap project [1415]We created the allelematrix from the ReHap errormatrixand fed the matrix to both HMEC and HapCUT algorithmsHMEC consistently produces higher reconstruction rate thanHapCUT (see Table 6) Also the running time of HMEC isclearly much better than HapCUT

53 Comparison Using ReHap We also compared HMECwith some other well-known algorithms such as SpeedHap[9] FastHare [10] MLF [11] 2 distance MEC [12] andSHR-3 [13] using HapMap-based instance generator andcomparison framework [14 15] e results are shown inTable 7 and Table 8 that suggest that no single method clearlyoutperforms the others in all cases However reconstructionrates achieved by HMEC are the highest or very close to thehighest is bears a clear testimony to its suitability as apractical tool for individual haplotyping

6 Conclusion

In this paper we present a heuristic algorithm (HMEC) basedon minimum error correction that computes highly accuratehaplotypes signicantly faster than the known algorithmsfor haplotyping e algorithm is inspired from the famous

Fiduccia and Mattheyses (FM) algorithm for bipartitioninga hyper graph minimizing the cut size [8] We report on anextensive performance study evaluating our approach withother available techniques using both real and simulateddatasets Comprehensive performance study shows that ouralgorithm outperforms (in most cases) or matches the accu-racy of other well-known methods but runs in a fraction ofthe time needed for other techniques High accuracy and veryfast running time make our technique suitable for genome-wide scale data

e accuracy of the algorithm can be improved by incor-porating some prior knowledge For example small groupsof fragments that are declared to be in the same haplotypecan be identied Probabilistic methods like expectationmaximization (EM) also deserve some consideration oversuch optimization problems In the near future we intend toconsider these issues to make further improvements

Acknowledgments

e authors thank Rui-Sheng Wang (one of the authors of[5]) for providing them with the ACE and chromosome 5q31datasets

References

[1] R Cilibrasi L V Iersel S Kelk and J Tromp ldquoOn thecomplexity of several haplotyping problemsrdquo in Proceedings ofthe 5th International Workshop on Algorithms in Bioinformaticsvol 3692 of Lecture Notes in Computer Science pp 128ndash139Springer 2005

[2] P Bonizzoni G Della Vedova R Dondi and J Li ldquoehaplotyping problem an overview of computational modelsand solutionsrdquo Journal of Computer Science and Technology vol18 no 6 pp 675ndash688 2003

[3] A Chakravarti ldquoItrsquos raining SNPs hallelujahrdquoNature Geneticsvol 19 no 3 pp 216ndash217 1998

[4] R Lippert R Schwartz G Lancia and S Istrail ldquoAlgorithmicstrategies for the single nucleotide polymorphism haplotypeassembly problemrdquo Briengs in Bioinformatics vol 3 no 1 pp23ndash31 2002

10 ISRN Bioinformatics

[5] R S Wang L Y Wu Z P Li and X S Zhang ldquoHaplotypereconstruction from SNP fragments by minimum error correc-tionrdquo Bioinformatics vol 21 no 10 pp 2456ndash2462 2005

[6] V Bansal and V Bafna ldquoHapCUT an efficient and accuratealgorithm for the haplotype assembly problemrdquo Bioinformaticsvol 24 no 16 pp i153ndashi159 2008

[7] J Duitama T Huebsch G McEwen E K Suk and MR Hoehe ldquoReFHap a reliable and fast algorithm for singleindividual haplotypingrdquo in Proceedings of the ACM Interna-tional Conference on Bioinformatics and Computational Biology(ACM-BCB rsquo10) pp 160ndash169 August 2010

[8] C M Fiduccia and R M Mattheyses ldquoA linear time heuristicsfor improving network partitionsrdquo in Proceedings of the 19thDesign Automation Conference pp 175ndash181 1982

[9] L M Genovese F Geraci and M Pellegrini ldquoSpeedHap anaccurate heuristic for the single individual SNP haplotypingproblem with many gaps high reading error rate and lowcoveragerdquo IEEEACM Transactions on Computational Biologyand Bioinformatics vol 5 no 4 pp 492ndash502 2008

[10] A Panconesi and M Sozio ldquoFast hare a fast heuristic forsingle individual SNP haplotype reconstructionrdquo Workshop onAlgorithms in Bioinformatics vol 3240 pp 266ndash277 2004

[11] Y Y Zhao L Y Wu J H Zhang R S Wang and X S ZhangldquoHaplotype assembly from aligned weighted SNP fragmentsrdquoComputational Biology and Chemistry vol 29 pp 281ndash2872005

[12] YWang E Feng andRWang ldquoA clustering algorithmbased ontwo distance functions forMECmodelrdquoComputational Biologyand Chemistry vol 31 no 2 pp 148ndash150 2007

[13] Z Chen B Fu R Schweller B Yang Z Zhao and B ZhuldquoLinear time probabilistic algorithms for the singular haplotypereconstruction problem from SNP fragmentsrdquo Journal of Com-putational Biology vol 15 no 5 pp 535ndash546 2008

[14] F Geraci ldquoA comparison of several algorithms for the singleindividual SNP haplotyping reconstruction problemrdquo Bioinfor-matics vol 26 no 18 Article ID btq411 pp 2217ndash2225 2010

[15] ReHap httpbioalgoiitcnritrehap[16] A Al Mueen M S Bayzid M M Alam and M S Rahman ldquoA

heuristic algorithm for individual haplotyping with minimumerror correctionrdquo in Proceedings of the 1st International Confer-ence on BioMedical Engineering and Informatics (BMEI rsquo08) pp792ndash796 IEEE Computer Society Press May 2008

[17] M J Rieder S L Taylor A G Clark and D A NickrsonldquoSequence variation in the human angiotensin convertingenzymerdquo Nature Genetics vol 22 pp 59ndash62 1999

[18] M J Daly J D Rioux S F Schaffner T J Hudson and ES Lander ldquoHigh-resolution haplotype structure in the humangenomerdquo Nature Genetics vol 29 no 2 pp 229ndash232 2001

[19] S Levy G Sutton P C Ng L Feuk and A L Halpernldquoe deploid genome sequence of an individual humanrdquo PLOSBiology vol 5 no 10 article e254 2007

[20] X S Zhang R S Wang L Y Wu and W Zhang ldquoMinimumconict individual haplotyping fromSNP fragments and relatedgenotyperdquo Evolutionary Bioinformatics vol 2 pp 261ndash2702006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 7: 3FTFBSDI SUJDMF ).&$ )FVSJTUJDMHPSJUINGPS*OEJWJEVBM ...downloads.hindawi.com/archive/2013/291741.pdf · *43/ #jpjogpsnbujdt ! Á '3&&@-0$,4 $-&"3@-0( sfqfbubmxbzt xijmf uifsf jt bo

ISRN Bioinformatics 7

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

5 10 15 20 25 30 35 40 45 5070

75

80

85

90

95

100

Error

Reco

nst

ru

cti

on

rate

HMEC data

GMEC data

HMEC

GMEC

F 7 e comparison of HMEC and GMEC on Daly set Fromtop to bottom coverage = 75 coverage = 50 coverage = 25

75) For each of these coverage rates we perform oursimulation for different error rates such as 5 10 1520 25 30 and 50 In every case we compare ouralgorithm with GMEC Figure 5 illustrates the comparisonthat bears the clear testimony to the superiority of our algo-rithm For most instances the reconstruction rate achievedby our algorithm is 100 or greater than 98 Only for afew cases with very high error rate and low coverage valuethe reconstruction rate falls below 95 We also performthe experiment for different coverage rates while keepingthe error rate constant Figure 6 illustrates the performanceof these two algorithms for various coverage values Herealso our algorithm clearly outperforms the genetic algorithmexcept for very low coverage value (which is unrealistic)

T 1 Running time of HMEC andGMEC on simulated datasets

Length of haplotype HMEC GMEC(sec) (sec)

78 0002 6172156 0016 11437312 0015 22500624 0031 45328780 0031 60766936 0047 72343

512 Simulation on Data from Chromosome 5q31 In thissection we discuss our simulation results on the data frompublic Daly set Daly et al reported a high-resolution analysisof a haplotype structure across 500 kb on chromosome 5q31using 103 SNPs in a European derived population whichconsists of 129 trios [18 20]

We performed the experiment exactly in the same waythat we did for angiotensin-converting enzyme Figure 7demonstrates the results Experimental results suggest thatHMEC is much better than the genetic algorithm Again formost cases the reconstruction rate achieved by our algorithmis 100 or greater than 98 For every instance our algo-rithm exhibits better performance than that of GMEC

513 Experiment on Simulated Data We used simulateddata for further evaluation of HMEC One of the veryimportant advantages of our algorithm is that it takes veryshort time to reconstruct the haplotypes Our algorithm ismuch faster than the GMEC Table 1 shows the running timeof HMEC and GMEC Here we perform the simulation byvarying the length (length denotes the number of SNP sites inthe haplotype pair) of the haplotypes while xed the value ofthe coverage rate and the error rate at 50 Since haplotypeswith such varying lengths are not available we rely on thesimulated data Clearly HMEC is much faster than GMECFor example while HMEC can reconstruct a haplotype with936 sites in a fraction of a second GMEC takes 72 seconds

52 Comparison with HapCUT HapCUT [6] is one of themost accurate heuristic algorithms for individual haplotyp-ing HapCUT uses a random initial haplotype congurationand builds a graph It computes max-cut on the graph tond the position to ip and iterates until no improvementin MEC score is achieved We have performed extensiveexperiments to compare the performance ofHMECandHap-CUT Experimental results suggest that although HapCUT isreliable its running time is too large to be a realistic choicefor whole genome haplotyping Our algorithm computes thehaplotypes signicantly faster than HapCUT without losingaccuracy

We used the ltered HuRef data from Levy et al [19]to evaluate the performance We generated several test datasets varying the coverage and error rate and tested theperformance of HMEC and HapCUT on these data sets eresults are shown in Tables 2 3 4 and 5

8 ISRN Bioinformatics

T 2 Reconstruction rate and running time of HapCUT and HMEC with 5 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0986 40582 0859 011915 0913 38282 0889 012520 0875 46324 0811 012925 0902 47101 0827 013030 0840 39196 0802 012540 0694 44792 0665 0129

T 3 Reconstruction rate and running time of HapCUT and HMEC with 10 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 76081 0966 014115 0954 90206 0948 015420 0953 81169 0957 014425 0952 74924 0923 014230 0951 81828 0893 017740 0934 69748 0814 0169

T 4 Reconstruction rate and running time of HapCUT and HMEC with 25 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 184642 1000 015615 0988 171698 1000 016720 0988 134406 0998 016325 0988 133966 0998 015530 0987 136572 0996 017240 0985 171334 0960 0161

T 5 Reconstruction rate and running time of HapCUT and HMEC with 35 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 181025 1000 015815 0988 132434 1000 012720 0988 231989 0999 010525 0987 210040 0998 010130 0988 217292 0996 010240 0987 215533 0995 0100

T 6 Reconstruction rate and running time of HapCUT and HMEC for ReHap samples

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

02 0870 20343 0985 003503 0865 21109 0970 003004 0840 20106 0945 003505 0790 19549 0865 0050

ISRN Bioinformatics 9

T 7 Comparison of different haplotyping techniques using ReHap Min Len = 3 Max Len = 7 coverage = 8 Highest reconstruction ratefor each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 1 091 1 1 065 1 0903 0935 07 0945 061 0685 0965 08204 0605 078 078 068 055 0535 08105 0535 048 0485 0505 0526 0525 058

T 8 Comparison of different haplotyping techniques using ReHap Min Len = 10 Max Len = 30 coverage = 8 Highest reconstructionrate for each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 098 0975 0985 0985 084 0985 08703 0985 0985 097 097 074 097 086504 0895 079 0885 0885 0675 087 08405 074 047 076 076 051 0705 079

Experimental results indicate that the reconstructionrates of both HapCUT and HMEC are reasonably goodFor very low coverage reconstruction rate of HapCUT isslightly better than HMEC However as the coverage rateincreases HMEC begins to outperform HapCUT Noticethat HapCUT is only better than HMEC for very low (andthus unrealistic) coverage values (5 and 10 coverage)However for higher coverage HMEC consistently performsbetter than HapCUT Furthermore HapCUT is signicantlyslower than HMEC For an instance with 35 coverage and40 error rate HapCUT takes 2155 seconds where HMECtakes only a fraction of a second (see Table 5) ereforealthough generally HapCUT provides reliable reconstructionrate on large dataset it is an unrealistic choice due to its timeconsuming operations On the other hand HMEC providesthe high accuracy with much less amount of time

We also used some sample data from ReHap project [1415]We created the allelematrix from the ReHap errormatrixand fed the matrix to both HMEC and HapCUT algorithmsHMEC consistently produces higher reconstruction rate thanHapCUT (see Table 6) Also the running time of HMEC isclearly much better than HapCUT

53 Comparison Using ReHap We also compared HMECwith some other well-known algorithms such as SpeedHap[9] FastHare [10] MLF [11] 2 distance MEC [12] andSHR-3 [13] using HapMap-based instance generator andcomparison framework [14 15] e results are shown inTable 7 and Table 8 that suggest that no single method clearlyoutperforms the others in all cases However reconstructionrates achieved by HMEC are the highest or very close to thehighest is bears a clear testimony to its suitability as apractical tool for individual haplotyping

6 Conclusion

In this paper we present a heuristic algorithm (HMEC) basedon minimum error correction that computes highly accuratehaplotypes signicantly faster than the known algorithmsfor haplotyping e algorithm is inspired from the famous

Fiduccia and Mattheyses (FM) algorithm for bipartitioninga hyper graph minimizing the cut size [8] We report on anextensive performance study evaluating our approach withother available techniques using both real and simulateddatasets Comprehensive performance study shows that ouralgorithm outperforms (in most cases) or matches the accu-racy of other well-known methods but runs in a fraction ofthe time needed for other techniques High accuracy and veryfast running time make our technique suitable for genome-wide scale data

e accuracy of the algorithm can be improved by incor-porating some prior knowledge For example small groupsof fragments that are declared to be in the same haplotypecan be identied Probabilistic methods like expectationmaximization (EM) also deserve some consideration oversuch optimization problems In the near future we intend toconsider these issues to make further improvements

Acknowledgments

e authors thank Rui-Sheng Wang (one of the authors of[5]) for providing them with the ACE and chromosome 5q31datasets

References

[1] R Cilibrasi L V Iersel S Kelk and J Tromp ldquoOn thecomplexity of several haplotyping problemsrdquo in Proceedings ofthe 5th International Workshop on Algorithms in Bioinformaticsvol 3692 of Lecture Notes in Computer Science pp 128ndash139Springer 2005

[2] P Bonizzoni G Della Vedova R Dondi and J Li ldquoehaplotyping problem an overview of computational modelsand solutionsrdquo Journal of Computer Science and Technology vol18 no 6 pp 675ndash688 2003

[3] A Chakravarti ldquoItrsquos raining SNPs hallelujahrdquoNature Geneticsvol 19 no 3 pp 216ndash217 1998

[4] R Lippert R Schwartz G Lancia and S Istrail ldquoAlgorithmicstrategies for the single nucleotide polymorphism haplotypeassembly problemrdquo Briengs in Bioinformatics vol 3 no 1 pp23ndash31 2002

10 ISRN Bioinformatics

[5] R S Wang L Y Wu Z P Li and X S Zhang ldquoHaplotypereconstruction from SNP fragments by minimum error correc-tionrdquo Bioinformatics vol 21 no 10 pp 2456ndash2462 2005

[6] V Bansal and V Bafna ldquoHapCUT an efficient and accuratealgorithm for the haplotype assembly problemrdquo Bioinformaticsvol 24 no 16 pp i153ndashi159 2008

[7] J Duitama T Huebsch G McEwen E K Suk and MR Hoehe ldquoReFHap a reliable and fast algorithm for singleindividual haplotypingrdquo in Proceedings of the ACM Interna-tional Conference on Bioinformatics and Computational Biology(ACM-BCB rsquo10) pp 160ndash169 August 2010

[8] C M Fiduccia and R M Mattheyses ldquoA linear time heuristicsfor improving network partitionsrdquo in Proceedings of the 19thDesign Automation Conference pp 175ndash181 1982

[9] L M Genovese F Geraci and M Pellegrini ldquoSpeedHap anaccurate heuristic for the single individual SNP haplotypingproblem with many gaps high reading error rate and lowcoveragerdquo IEEEACM Transactions on Computational Biologyand Bioinformatics vol 5 no 4 pp 492ndash502 2008

[10] A Panconesi and M Sozio ldquoFast hare a fast heuristic forsingle individual SNP haplotype reconstructionrdquo Workshop onAlgorithms in Bioinformatics vol 3240 pp 266ndash277 2004

[11] Y Y Zhao L Y Wu J H Zhang R S Wang and X S ZhangldquoHaplotype assembly from aligned weighted SNP fragmentsrdquoComputational Biology and Chemistry vol 29 pp 281ndash2872005

[12] YWang E Feng andRWang ldquoA clustering algorithmbased ontwo distance functions forMECmodelrdquoComputational Biologyand Chemistry vol 31 no 2 pp 148ndash150 2007

[13] Z Chen B Fu R Schweller B Yang Z Zhao and B ZhuldquoLinear time probabilistic algorithms for the singular haplotypereconstruction problem from SNP fragmentsrdquo Journal of Com-putational Biology vol 15 no 5 pp 535ndash546 2008

[14] F Geraci ldquoA comparison of several algorithms for the singleindividual SNP haplotyping reconstruction problemrdquo Bioinfor-matics vol 26 no 18 Article ID btq411 pp 2217ndash2225 2010

[15] ReHap httpbioalgoiitcnritrehap[16] A Al Mueen M S Bayzid M M Alam and M S Rahman ldquoA

heuristic algorithm for individual haplotyping with minimumerror correctionrdquo in Proceedings of the 1st International Confer-ence on BioMedical Engineering and Informatics (BMEI rsquo08) pp792ndash796 IEEE Computer Society Press May 2008

[17] M J Rieder S L Taylor A G Clark and D A NickrsonldquoSequence variation in the human angiotensin convertingenzymerdquo Nature Genetics vol 22 pp 59ndash62 1999

[18] M J Daly J D Rioux S F Schaffner T J Hudson and ES Lander ldquoHigh-resolution haplotype structure in the humangenomerdquo Nature Genetics vol 29 no 2 pp 229ndash232 2001

[19] S Levy G Sutton P C Ng L Feuk and A L Halpernldquoe deploid genome sequence of an individual humanrdquo PLOSBiology vol 5 no 10 article e254 2007

[20] X S Zhang R S Wang L Y Wu and W Zhang ldquoMinimumconict individual haplotyping fromSNP fragments and relatedgenotyperdquo Evolutionary Bioinformatics vol 2 pp 261ndash2702006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 8: 3FTFBSDI SUJDMF ).&$ )FVSJTUJDMHPSJUINGPS*OEJWJEVBM ...downloads.hindawi.com/archive/2013/291741.pdf · *43/ #jpjogpsnbujdt ! Á '3&&@-0$,4 $-&"3@-0( sfqfbubmxbzt xijmf uifsf jt bo

8 ISRN Bioinformatics

T 2 Reconstruction rate and running time of HapCUT and HMEC with 5 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0986 40582 0859 011915 0913 38282 0889 012520 0875 46324 0811 012925 0902 47101 0827 013030 0840 39196 0802 012540 0694 44792 0665 0129

T 3 Reconstruction rate and running time of HapCUT and HMEC with 10 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 76081 0966 014115 0954 90206 0948 015420 0953 81169 0957 014425 0952 74924 0923 014230 0951 81828 0893 017740 0934 69748 0814 0169

T 4 Reconstruction rate and running time of HapCUT and HMEC with 25 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 184642 1000 015615 0988 171698 1000 016720 0988 134406 0998 016325 0988 133966 0998 015530 0987 136572 0996 017240 0985 171334 0960 0161

T 5 Reconstruction rate and running time of HapCUT and HMEC with 35 coverage on HuRef data

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

10 0988 181025 1000 015815 0988 132434 1000 012720 0988 231989 0999 010525 0987 210040 0998 010130 0988 217292 0996 010240 0987 215533 0995 0100

T 6 Reconstruction rate and running time of HapCUT and HMEC for ReHap samples

Error rate HapCUT HMECReconstruction rate Time (sec) Reconstruction rate Time (sec)

02 0870 20343 0985 003503 0865 21109 0970 003004 0840 20106 0945 003505 0790 19549 0865 0050

ISRN Bioinformatics 9

T 7 Comparison of different haplotyping techniques using ReHap Min Len = 3 Max Len = 7 coverage = 8 Highest reconstruction ratefor each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 1 091 1 1 065 1 0903 0935 07 0945 061 0685 0965 08204 0605 078 078 068 055 0535 08105 0535 048 0485 0505 0526 0525 058

T 8 Comparison of different haplotyping techniques using ReHap Min Len = 10 Max Len = 30 coverage = 8 Highest reconstructionrate for each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 098 0975 0985 0985 084 0985 08703 0985 0985 097 097 074 097 086504 0895 079 0885 0885 0675 087 08405 074 047 076 076 051 0705 079

Experimental results indicate that the reconstructionrates of both HapCUT and HMEC are reasonably goodFor very low coverage reconstruction rate of HapCUT isslightly better than HMEC However as the coverage rateincreases HMEC begins to outperform HapCUT Noticethat HapCUT is only better than HMEC for very low (andthus unrealistic) coverage values (5 and 10 coverage)However for higher coverage HMEC consistently performsbetter than HapCUT Furthermore HapCUT is signicantlyslower than HMEC For an instance with 35 coverage and40 error rate HapCUT takes 2155 seconds where HMECtakes only a fraction of a second (see Table 5) ereforealthough generally HapCUT provides reliable reconstructionrate on large dataset it is an unrealistic choice due to its timeconsuming operations On the other hand HMEC providesthe high accuracy with much less amount of time

We also used some sample data from ReHap project [1415]We created the allelematrix from the ReHap errormatrixand fed the matrix to both HMEC and HapCUT algorithmsHMEC consistently produces higher reconstruction rate thanHapCUT (see Table 6) Also the running time of HMEC isclearly much better than HapCUT

53 Comparison Using ReHap We also compared HMECwith some other well-known algorithms such as SpeedHap[9] FastHare [10] MLF [11] 2 distance MEC [12] andSHR-3 [13] using HapMap-based instance generator andcomparison framework [14 15] e results are shown inTable 7 and Table 8 that suggest that no single method clearlyoutperforms the others in all cases However reconstructionrates achieved by HMEC are the highest or very close to thehighest is bears a clear testimony to its suitability as apractical tool for individual haplotyping

6 Conclusion

In this paper we present a heuristic algorithm (HMEC) basedon minimum error correction that computes highly accuratehaplotypes signicantly faster than the known algorithmsfor haplotyping e algorithm is inspired from the famous

Fiduccia and Mattheyses (FM) algorithm for bipartitioninga hyper graph minimizing the cut size [8] We report on anextensive performance study evaluating our approach withother available techniques using both real and simulateddatasets Comprehensive performance study shows that ouralgorithm outperforms (in most cases) or matches the accu-racy of other well-known methods but runs in a fraction ofthe time needed for other techniques High accuracy and veryfast running time make our technique suitable for genome-wide scale data

e accuracy of the algorithm can be improved by incor-porating some prior knowledge For example small groupsof fragments that are declared to be in the same haplotypecan be identied Probabilistic methods like expectationmaximization (EM) also deserve some consideration oversuch optimization problems In the near future we intend toconsider these issues to make further improvements

Acknowledgments

e authors thank Rui-Sheng Wang (one of the authors of[5]) for providing them with the ACE and chromosome 5q31datasets

References

[1] R Cilibrasi L V Iersel S Kelk and J Tromp ldquoOn thecomplexity of several haplotyping problemsrdquo in Proceedings ofthe 5th International Workshop on Algorithms in Bioinformaticsvol 3692 of Lecture Notes in Computer Science pp 128ndash139Springer 2005

[2] P Bonizzoni G Della Vedova R Dondi and J Li ldquoehaplotyping problem an overview of computational modelsand solutionsrdquo Journal of Computer Science and Technology vol18 no 6 pp 675ndash688 2003

[3] A Chakravarti ldquoItrsquos raining SNPs hallelujahrdquoNature Geneticsvol 19 no 3 pp 216ndash217 1998

[4] R Lippert R Schwartz G Lancia and S Istrail ldquoAlgorithmicstrategies for the single nucleotide polymorphism haplotypeassembly problemrdquo Briengs in Bioinformatics vol 3 no 1 pp23ndash31 2002

10 ISRN Bioinformatics

[5] R S Wang L Y Wu Z P Li and X S Zhang ldquoHaplotypereconstruction from SNP fragments by minimum error correc-tionrdquo Bioinformatics vol 21 no 10 pp 2456ndash2462 2005

[6] V Bansal and V Bafna ldquoHapCUT an efficient and accuratealgorithm for the haplotype assembly problemrdquo Bioinformaticsvol 24 no 16 pp i153ndashi159 2008

[7] J Duitama T Huebsch G McEwen E K Suk and MR Hoehe ldquoReFHap a reliable and fast algorithm for singleindividual haplotypingrdquo in Proceedings of the ACM Interna-tional Conference on Bioinformatics and Computational Biology(ACM-BCB rsquo10) pp 160ndash169 August 2010

[8] C M Fiduccia and R M Mattheyses ldquoA linear time heuristicsfor improving network partitionsrdquo in Proceedings of the 19thDesign Automation Conference pp 175ndash181 1982

[9] L M Genovese F Geraci and M Pellegrini ldquoSpeedHap anaccurate heuristic for the single individual SNP haplotypingproblem with many gaps high reading error rate and lowcoveragerdquo IEEEACM Transactions on Computational Biologyand Bioinformatics vol 5 no 4 pp 492ndash502 2008

[10] A Panconesi and M Sozio ldquoFast hare a fast heuristic forsingle individual SNP haplotype reconstructionrdquo Workshop onAlgorithms in Bioinformatics vol 3240 pp 266ndash277 2004

[11] Y Y Zhao L Y Wu J H Zhang R S Wang and X S ZhangldquoHaplotype assembly from aligned weighted SNP fragmentsrdquoComputational Biology and Chemistry vol 29 pp 281ndash2872005

[12] YWang E Feng andRWang ldquoA clustering algorithmbased ontwo distance functions forMECmodelrdquoComputational Biologyand Chemistry vol 31 no 2 pp 148ndash150 2007

[13] Z Chen B Fu R Schweller B Yang Z Zhao and B ZhuldquoLinear time probabilistic algorithms for the singular haplotypereconstruction problem from SNP fragmentsrdquo Journal of Com-putational Biology vol 15 no 5 pp 535ndash546 2008

[14] F Geraci ldquoA comparison of several algorithms for the singleindividual SNP haplotyping reconstruction problemrdquo Bioinfor-matics vol 26 no 18 Article ID btq411 pp 2217ndash2225 2010

[15] ReHap httpbioalgoiitcnritrehap[16] A Al Mueen M S Bayzid M M Alam and M S Rahman ldquoA

heuristic algorithm for individual haplotyping with minimumerror correctionrdquo in Proceedings of the 1st International Confer-ence on BioMedical Engineering and Informatics (BMEI rsquo08) pp792ndash796 IEEE Computer Society Press May 2008

[17] M J Rieder S L Taylor A G Clark and D A NickrsonldquoSequence variation in the human angiotensin convertingenzymerdquo Nature Genetics vol 22 pp 59ndash62 1999

[18] M J Daly J D Rioux S F Schaffner T J Hudson and ES Lander ldquoHigh-resolution haplotype structure in the humangenomerdquo Nature Genetics vol 29 no 2 pp 229ndash232 2001

[19] S Levy G Sutton P C Ng L Feuk and A L Halpernldquoe deploid genome sequence of an individual humanrdquo PLOSBiology vol 5 no 10 article e254 2007

[20] X S Zhang R S Wang L Y Wu and W Zhang ldquoMinimumconict individual haplotyping fromSNP fragments and relatedgenotyperdquo Evolutionary Bioinformatics vol 2 pp 261ndash2702006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 9: 3FTFBSDI SUJDMF ).&$ )FVSJTUJDMHPSJUINGPS*OEJWJEVBM ...downloads.hindawi.com/archive/2013/291741.pdf · *43/ #jpjogpsnbujdt ! Á '3&&@-0$,4 $-&"3@-0( sfqfbubmxbzt xijmf uifsf jt bo

ISRN Bioinformatics 9

T 7 Comparison of different haplotyping techniques using ReHap Min Len = 3 Max Len = 7 coverage = 8 Highest reconstruction ratefor each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 1 091 1 1 065 1 0903 0935 07 0945 061 0685 0965 08204 0605 078 078 068 055 0535 08105 0535 048 0485 0505 0526 0525 058

T 8 Comparison of different haplotyping techniques using ReHap Min Len = 10 Max Len = 30 coverage = 8 Highest reconstructionrate for each error rate is shown in bold

Error rate SpeedHap FastHare MLF 2-Distance MEC SHR HMEC HapCUT02 098 0975 0985 0985 084 0985 08703 0985 0985 097 097 074 097 086504 0895 079 0885 0885 0675 087 08405 074 047 076 076 051 0705 079

Experimental results indicate that the reconstructionrates of both HapCUT and HMEC are reasonably goodFor very low coverage reconstruction rate of HapCUT isslightly better than HMEC However as the coverage rateincreases HMEC begins to outperform HapCUT Noticethat HapCUT is only better than HMEC for very low (andthus unrealistic) coverage values (5 and 10 coverage)However for higher coverage HMEC consistently performsbetter than HapCUT Furthermore HapCUT is signicantlyslower than HMEC For an instance with 35 coverage and40 error rate HapCUT takes 2155 seconds where HMECtakes only a fraction of a second (see Table 5) ereforealthough generally HapCUT provides reliable reconstructionrate on large dataset it is an unrealistic choice due to its timeconsuming operations On the other hand HMEC providesthe high accuracy with much less amount of time

We also used some sample data from ReHap project [1415]We created the allelematrix from the ReHap errormatrixand fed the matrix to both HMEC and HapCUT algorithmsHMEC consistently produces higher reconstruction rate thanHapCUT (see Table 6) Also the running time of HMEC isclearly much better than HapCUT

53 Comparison Using ReHap We also compared HMECwith some other well-known algorithms such as SpeedHap[9] FastHare [10] MLF [11] 2 distance MEC [12] andSHR-3 [13] using HapMap-based instance generator andcomparison framework [14 15] e results are shown inTable 7 and Table 8 that suggest that no single method clearlyoutperforms the others in all cases However reconstructionrates achieved by HMEC are the highest or very close to thehighest is bears a clear testimony to its suitability as apractical tool for individual haplotyping

6 Conclusion

In this paper we present a heuristic algorithm (HMEC) basedon minimum error correction that computes highly accuratehaplotypes signicantly faster than the known algorithmsfor haplotyping e algorithm is inspired from the famous

Fiduccia and Mattheyses (FM) algorithm for bipartitioninga hyper graph minimizing the cut size [8] We report on anextensive performance study evaluating our approach withother available techniques using both real and simulateddatasets Comprehensive performance study shows that ouralgorithm outperforms (in most cases) or matches the accu-racy of other well-known methods but runs in a fraction ofthe time needed for other techniques High accuracy and veryfast running time make our technique suitable for genome-wide scale data

e accuracy of the algorithm can be improved by incor-porating some prior knowledge For example small groupsof fragments that are declared to be in the same haplotypecan be identied Probabilistic methods like expectationmaximization (EM) also deserve some consideration oversuch optimization problems In the near future we intend toconsider these issues to make further improvements

Acknowledgments

e authors thank Rui-Sheng Wang (one of the authors of[5]) for providing them with the ACE and chromosome 5q31datasets

References

[1] R Cilibrasi L V Iersel S Kelk and J Tromp ldquoOn thecomplexity of several haplotyping problemsrdquo in Proceedings ofthe 5th International Workshop on Algorithms in Bioinformaticsvol 3692 of Lecture Notes in Computer Science pp 128ndash139Springer 2005

[2] P Bonizzoni G Della Vedova R Dondi and J Li ldquoehaplotyping problem an overview of computational modelsand solutionsrdquo Journal of Computer Science and Technology vol18 no 6 pp 675ndash688 2003

[3] A Chakravarti ldquoItrsquos raining SNPs hallelujahrdquoNature Geneticsvol 19 no 3 pp 216ndash217 1998

[4] R Lippert R Schwartz G Lancia and S Istrail ldquoAlgorithmicstrategies for the single nucleotide polymorphism haplotypeassembly problemrdquo Briengs in Bioinformatics vol 3 no 1 pp23ndash31 2002

10 ISRN Bioinformatics

[5] R S Wang L Y Wu Z P Li and X S Zhang ldquoHaplotypereconstruction from SNP fragments by minimum error correc-tionrdquo Bioinformatics vol 21 no 10 pp 2456ndash2462 2005

[6] V Bansal and V Bafna ldquoHapCUT an efficient and accuratealgorithm for the haplotype assembly problemrdquo Bioinformaticsvol 24 no 16 pp i153ndashi159 2008

[7] J Duitama T Huebsch G McEwen E K Suk and MR Hoehe ldquoReFHap a reliable and fast algorithm for singleindividual haplotypingrdquo in Proceedings of the ACM Interna-tional Conference on Bioinformatics and Computational Biology(ACM-BCB rsquo10) pp 160ndash169 August 2010

[8] C M Fiduccia and R M Mattheyses ldquoA linear time heuristicsfor improving network partitionsrdquo in Proceedings of the 19thDesign Automation Conference pp 175ndash181 1982

[9] L M Genovese F Geraci and M Pellegrini ldquoSpeedHap anaccurate heuristic for the single individual SNP haplotypingproblem with many gaps high reading error rate and lowcoveragerdquo IEEEACM Transactions on Computational Biologyand Bioinformatics vol 5 no 4 pp 492ndash502 2008

[10] A Panconesi and M Sozio ldquoFast hare a fast heuristic forsingle individual SNP haplotype reconstructionrdquo Workshop onAlgorithms in Bioinformatics vol 3240 pp 266ndash277 2004

[11] Y Y Zhao L Y Wu J H Zhang R S Wang and X S ZhangldquoHaplotype assembly from aligned weighted SNP fragmentsrdquoComputational Biology and Chemistry vol 29 pp 281ndash2872005

[12] YWang E Feng andRWang ldquoA clustering algorithmbased ontwo distance functions forMECmodelrdquoComputational Biologyand Chemistry vol 31 no 2 pp 148ndash150 2007

[13] Z Chen B Fu R Schweller B Yang Z Zhao and B ZhuldquoLinear time probabilistic algorithms for the singular haplotypereconstruction problem from SNP fragmentsrdquo Journal of Com-putational Biology vol 15 no 5 pp 535ndash546 2008

[14] F Geraci ldquoA comparison of several algorithms for the singleindividual SNP haplotyping reconstruction problemrdquo Bioinfor-matics vol 26 no 18 Article ID btq411 pp 2217ndash2225 2010

[15] ReHap httpbioalgoiitcnritrehap[16] A Al Mueen M S Bayzid M M Alam and M S Rahman ldquoA

heuristic algorithm for individual haplotyping with minimumerror correctionrdquo in Proceedings of the 1st International Confer-ence on BioMedical Engineering and Informatics (BMEI rsquo08) pp792ndash796 IEEE Computer Society Press May 2008

[17] M J Rieder S L Taylor A G Clark and D A NickrsonldquoSequence variation in the human angiotensin convertingenzymerdquo Nature Genetics vol 22 pp 59ndash62 1999

[18] M J Daly J D Rioux S F Schaffner T J Hudson and ES Lander ldquoHigh-resolution haplotype structure in the humangenomerdquo Nature Genetics vol 29 no 2 pp 229ndash232 2001

[19] S Levy G Sutton P C Ng L Feuk and A L Halpernldquoe deploid genome sequence of an individual humanrdquo PLOSBiology vol 5 no 10 article e254 2007

[20] X S Zhang R S Wang L Y Wu and W Zhang ldquoMinimumconict individual haplotyping fromSNP fragments and relatedgenotyperdquo Evolutionary Bioinformatics vol 2 pp 261ndash2702006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 10: 3FTFBSDI SUJDMF ).&$ )FVSJTUJDMHPSJUINGPS*OEJWJEVBM ...downloads.hindawi.com/archive/2013/291741.pdf · *43/ #jpjogpsnbujdt ! Á '3&&@-0$,4 $-&"3@-0( sfqfbubmxbzt xijmf uifsf jt bo

10 ISRN Bioinformatics

[5] R S Wang L Y Wu Z P Li and X S Zhang ldquoHaplotypereconstruction from SNP fragments by minimum error correc-tionrdquo Bioinformatics vol 21 no 10 pp 2456ndash2462 2005

[6] V Bansal and V Bafna ldquoHapCUT an efficient and accuratealgorithm for the haplotype assembly problemrdquo Bioinformaticsvol 24 no 16 pp i153ndashi159 2008

[7] J Duitama T Huebsch G McEwen E K Suk and MR Hoehe ldquoReFHap a reliable and fast algorithm for singleindividual haplotypingrdquo in Proceedings of the ACM Interna-tional Conference on Bioinformatics and Computational Biology(ACM-BCB rsquo10) pp 160ndash169 August 2010

[8] C M Fiduccia and R M Mattheyses ldquoA linear time heuristicsfor improving network partitionsrdquo in Proceedings of the 19thDesign Automation Conference pp 175ndash181 1982

[9] L M Genovese F Geraci and M Pellegrini ldquoSpeedHap anaccurate heuristic for the single individual SNP haplotypingproblem with many gaps high reading error rate and lowcoveragerdquo IEEEACM Transactions on Computational Biologyand Bioinformatics vol 5 no 4 pp 492ndash502 2008

[10] A Panconesi and M Sozio ldquoFast hare a fast heuristic forsingle individual SNP haplotype reconstructionrdquo Workshop onAlgorithms in Bioinformatics vol 3240 pp 266ndash277 2004

[11] Y Y Zhao L Y Wu J H Zhang R S Wang and X S ZhangldquoHaplotype assembly from aligned weighted SNP fragmentsrdquoComputational Biology and Chemistry vol 29 pp 281ndash2872005

[12] YWang E Feng andRWang ldquoA clustering algorithmbased ontwo distance functions forMECmodelrdquoComputational Biologyand Chemistry vol 31 no 2 pp 148ndash150 2007

[13] Z Chen B Fu R Schweller B Yang Z Zhao and B ZhuldquoLinear time probabilistic algorithms for the singular haplotypereconstruction problem from SNP fragmentsrdquo Journal of Com-putational Biology vol 15 no 5 pp 535ndash546 2008

[14] F Geraci ldquoA comparison of several algorithms for the singleindividual SNP haplotyping reconstruction problemrdquo Bioinfor-matics vol 26 no 18 Article ID btq411 pp 2217ndash2225 2010

[15] ReHap httpbioalgoiitcnritrehap[16] A Al Mueen M S Bayzid M M Alam and M S Rahman ldquoA

heuristic algorithm for individual haplotyping with minimumerror correctionrdquo in Proceedings of the 1st International Confer-ence on BioMedical Engineering and Informatics (BMEI rsquo08) pp792ndash796 IEEE Computer Society Press May 2008

[17] M J Rieder S L Taylor A G Clark and D A NickrsonldquoSequence variation in the human angiotensin convertingenzymerdquo Nature Genetics vol 22 pp 59ndash62 1999

[18] M J Daly J D Rioux S F Schaffner T J Hudson and ES Lander ldquoHigh-resolution haplotype structure in the humangenomerdquo Nature Genetics vol 29 no 2 pp 229ndash232 2001

[19] S Levy G Sutton P C Ng L Feuk and A L Halpernldquoe deploid genome sequence of an individual humanrdquo PLOSBiology vol 5 no 10 article e254 2007

[20] X S Zhang R S Wang L Y Wu and W Zhang ldquoMinimumconict individual haplotyping fromSNP fragments and relatedgenotyperdquo Evolutionary Bioinformatics vol 2 pp 261ndash2702006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 11: 3FTFBSDI SUJDMF ).&$ )FVSJTUJDMHPSJUINGPS*OEJWJEVBM ...downloads.hindawi.com/archive/2013/291741.pdf · *43/ #jpjogpsnbujdt ! Á '3&&@-0$,4 $-&"3@-0( sfqfbubmxbzt xijmf uifsf jt bo

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology