8/3/2019 Barcodes Genomes
This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formatted PDF and full text (HTML) versions will be made available soon.
Barcodes for genomes and applications
BMC Bioinformatics 2008, 9:546 doi:10.1186/1471-2105-9-546
Fengfeng Zhou ([email protected])
Victor Olman ([email protected])
Ying Xu ([email protected])
ISSN 1471-2105
Article type Research article
Submission date 16 June 2008
Acceptance date 17 December 2008
Publication date 17 December 2008
Article URL http://www.biomedcentral.com/1471-2105/9/546
Like all articles in BMC journals, this peer-reviewed article was published immediately upon acceptance. It can be downloaded, printed and distributed freely for any purposes (see copyright
notice below).
Articles in BMC journals are listed in PubMed and archived at PubMed Central.
For information about publishing your research in BMC journals or any BioMed Central journal, go to
http://www.biomedcentral.com/info/authors/
BMC Bioinformatics
2008 Zhou et al., licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Barcodes for genomes and applications
Fengfeng Zhou, Victor Olman and Ying Xu*
Department of Biochemistry and Molecular Biology and Institute of Bioinformatics,
and BioEnergy Science Center (BESC), University of Georgia, Athens, GA 30602,
USA.
These authors contributed equally to this work
*Corresponding author
Abstract
Background
Each genome has a stable distribution of the combined frequency for each k-mer and
its reverse complement measured in sequence fragments as short as 1000 bps across
the whole genome, for 1
Background
The challenges being faced in sorting out short genomic fragments generated by
metagenome sequencing projects [1] pose a fundamental question: does each
genome have a unique signature imprinted on its short sequence fragments so that
fragments from the same genomes in a metagenome can be identified accurately? A
positive answer to this question could have significant implications for many important
genome and metagenome analysis problems, such as identification of genetic material
transferred from other organisms [2] or through virus invasions [3, 4], separation of
short sequence fragments generated by metagenome sequencing into individual
genomes [5] and phylogenetic analyses of genomes [6].
Understanding the intrinsic properties of genome sequences, either general to all or
specific to some classes of genomes, has been the focus of many studies in the past
two decades. Earlier work includes the discovery of the periodicity property of DNA
sequences across both prokaryotic and eukaryotic genomes [7] and the realization that
coding sequences follow Markov chain properties [8-10]. Karlin and colleagues have
studied various genome properties based on analyses of k-mer frequency distributions,
and have observed that the di-nucleotide relative abundance, a normalized di-mer
frequency with respect to the mono-mer frequencies, is generally stable across a
genome measured on 50K base-pair (bp) fragments [11-13]. They even suggested that
such normalized di-mer frequency distributions can possibly serve as signatures of
genomes.
In this paper, we present a barcoding scheme for all sequenced genomes, and illustrate
a number of interesting and useful properties of the barcodes, which we can take
advantage of to solve challenging genome analysis problems. We highlight the power of
this barcoding scheme by addressing two application problems: the metagenome
binning problem and the identification of horizontally transferred genes.
Results
Barcodes and their properties
We have calculated the barcode for each sequenced prokaryotic genome, using the
following procedure. For each genome, we partition its sequence into a series of
non-overlapping and equal-sized fragments of M bps; then for each k-mer (1
giving rise to the vertical bands with consistent grey levels across each barcode; (b)
the small fraction of the fragments whose barcodes are clearly different from those of
the rest of the genome (the abnormal barcodes, seen as horizontal stripes) typically
represents 2-3 special classes of genes (see discussion later); (c) multiple chromosomes
of the same organisms generally have highly similar barcodes (Figure 2(a)) but they each
have their unique patterns of abnormal fragments; and (d) the barcode similarities tend to
be generally proportional to the genomes' phylogenetic closeness (Figure 2(b)).
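
As an illustration of the partition-and-count procedure described in this section, the following minimal Python sketch builds the fragment-by-k-mer frequency matrix that underlies a barcode. It is a sketch under stated assumptions, not the authors' implementation: it assumes that each k-mer is merged with its reverse complement (as stated in the Abstract), that frequencies are normalized by the number of counted windows in a fragment, and that a trailing partial fragment is dropped. The function names `canonical_kmers` and `barcode_matrix` are hypothetical, and the grey-level mapping is left out.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical_kmers(k):
    """Sorted list of k-mers, each merged with its reverse complement."""
    seen = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        seen.add(min(kmer, reverse_complement(kmer)))
    return sorted(seen)


def barcode_matrix(genome, fragment_size=1000, k=4):
    """Return (columns, rows): rows[f][j] is the frequency of k-mer j in fragment f."""
    columns = canonical_kmers(k)
    index = {kmer: j for j, kmer in enumerate(columns)}
    rows = []
    # Non-overlapping, equal-sized fragments; a trailing partial fragment is dropped
    # (an assumption of this sketch).
    for start in range(0, len(genome) - fragment_size + 1, fragment_size):
        fragment = genome[start:start + fragment_size]
        counts = [0] * len(columns)
        windows = 0
        for i in range(len(fragment) - k + 1):
            kmer = fragment[i:i + k]
            canon = min(kmer, reverse_complement(kmer))
            if canon in index:  # skips windows containing N or other symbols
                counts[index[canon]] += 1
                windows += 1
        rows.append([c / windows if windows else 0.0 for c in counts])
    return columns, rows
```

For k = 4 the merged k-mer list has 136 entries, which agrees with the 136-dimensional vectors of 4-mer frequencies mentioned in the caption of Figure 4.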
To understand why a genomic sequence has the barcode property, we have examined
random nucleotide sequences generated using different models, including Markov
chain models of order from 0 to 6. We observed that barcodes for random nucleotide
sequences generated using a third-order Markov chain model are the closest to the
barcodes of genomic sequences in terms of their appearances (Additional file 1), and
higher order Markov chain models do not seem to add much to this property. Hence
we believe that the barcode property of prokaryotic genomes is mainly due to the
third-order Markov chain property of the coding sequences in the genomes, which
account for 80-90% of a typical prokaryotic genome. It is worth noting that barcodes for
coding and non-coding sequences of the same genome are generally different though
they share a weakly similar backbone structure, while each of these two classes of
(composite) regions generally has highly similar barcodes (Figure 3).
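
The random sequences used in this comparison can be imitated with a small generator. The sketch below is only an assumption about how such a model might be set up: it estimates the transition counts of an order-n Markov chain from a training sequence and samples a new sequence from them. The names `train_markov` and `generate_markov`, and the uniform fallback for unseen contexts, are hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def train_markov(sequence, order):
    """Count next-base occurrences for every context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(sequence) - order):
        context = sequence[i:i + order]
        counts[context][sequence[i + order]] += 1
    return counts


def generate_markov(counts, order, length, seed_text, rng):
    """Sample a sequence of `length` bases; seed_text must supply at least `order` bases."""
    out = list(seed_text[:order])
    while len(out) < length:
        context = "".join(out[len(out) - order:])
        successors = counts.get(context)
        if not successors:
            # Unseen context: fall back to a uniform draw (assumption of this sketch).
            out.append(rng.choice("ACGT"))
            continue
        bases = list(successors)
        weights = [successors[b] for b in bases]
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)
```

With `order=0` the model reduces to sampling from the mono-nucleotide frequencies, and with `order=3` it corresponds to the third-order model discussed above.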
Extension to other genomes
In addition to prokaryotes, we have also calculated the barcodes for the other classes
of sequenced genomes, namely eukaryotic, mitochondrial, plastid and plasmid
genomes. For eukaryotes, we studied the barcodes for four key components in
eukaryotic genomes, namely the (composite) regions of repetitive sequences,
promoter sequences (the 1000-bp upstream region from each translation start), coding
regions and introns, respectively (Figure 3(b)-(e)). We observed that (i) different
regions in a high-level eukaryotic genome (e.g., human) have similar backbone
structures in their barcodes, and (ii) the barcodes for the four types of regions have
increasingly higher complexity, going from repetitive sequences to coding regions to
introns and promoter sequences. This is consistent with the belief that introns and
promoter sequences are probably the most information rich among the four because of
the possibly large numbers of regulatory elements they encode.
The barcodes of the mitochondrial genomes are generally unique compared to the
barcodes of the other genomes as they have a distinct overall appearance (e.g., Figure
3(h)-(i)). Their similar appearance may be the result of all mitochondria originating
from Proteobacteria. The barcodes of all plasmid genomes also tend to have similar
characteristics among themselves, possibly due to being under similar selection
pressure caused by their frequent transferring among cell cultures. The barcodes of all
the plastid genomes are also generally unique compared to the barcodes of the others
(e.g., Figure 3(j)-(k)). For example, a majority of them each have two dark
horizontal bands toward one end of their barcodes along the genome axis, whose
corresponding genomic regions consist of RNA genes such as ribosomal RNAs and
tRNAs, plus ribosomal proteins. The fuzzier appearance of the plastid barcodes
indicates that their k-mer frequencies along the genome axis are not as stable as in the
other genomes. The overall similar appearances of the plastid barcodes may be due to
all of them originating from the Cyanobacteria.
One interesting question is: do different classes of genomes have their unique
characteristics in their barcodes? Our answer is yes, based on their highly separable
distributions in the feature space defined by two particular features, as shown in
Figure 4, one of which measures the overall frequency variation for all 4-mers across
the genome's barcode, and the other measures the overall similarity level among all
the M-bp fragments of the genome, each considered as a vector of 4-mer frequencies.
While Figure 2(b) indicates that barcodes generally preserve sequence-level
similarities, Figure 4 suggests that barcodes also capture a higher-level similarity
beyond individual genome sequence similarities through the textures of their images,
which are the common and unique characteristics of different classes of genomes.
This property indicates that barcodes are not just a simple visualization tool; instead
they have captured some fairly basic information about genomes! From an application
point of view, we believe that this feature will prove to be useful to metagenome
analyses as fragments from different classes of genomes, such as eukaryotes,
prokaryotes or different organelle genomes, have different characteristics in their
barcode images.
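
The second feature of Figure 4 is described in its caption as the average edge weight of a minimum spanning tree [27] built over the 4-mer frequency vectors of a genome's fragments. A minimal sketch of that quantity is given below, using Prim's algorithm on the complete graph. The Euclidean distance between vectors is an assumption of this example (the caption only refers to "their distances"), and the function name `mst_average_edge_weight` is hypothetical.

```python
import math


def mst_average_edge_weight(vectors):
    """Average edge weight of a minimum spanning tree (Prim's algorithm, O(n^2))."""
    n = len(vectors)
    if n < 2:
        return 0.0

    def dist(a, b):
        # Euclidean distance; the metric is an assumption of this sketch.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge connecting each vertex to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = dist(vectors[u], vectors[v])
                if d < best[v]:
                    best[v] = d
    return total / (n - 1)
```

A small value means that the fragments of a genome lie close together in the frequency space, that is, the genome's fragments are highly similar to each other.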
Identification of abnormal sequence fragments
Our procedure for identifying sequence fragments with abnormal barcodes in a
genome employs a clustering strategy to divide all the sequence fragments in a
genome into two groups: (a) a large group of fragments with their barcodes all similar
to each other and (b) the rest (see METHODS section).
Using this procedure, we have identified 30,582 abnormal fragments, covering
30,889 genes across all the complete prokaryotic genomes. Specifically 28,460 such
fragments are identified in the 542 bacterial genomes, covering 28,562 genes, and
2,122 such fragments are identified in the 46 archaeal genomes, covering 2,327 genes.
We found that the percentage of fragments with abnormal barcodes ranges from
9.40% to 32.32% across all the bacterial genomes, with the average being 12.85%.
Among the 46 sequenced archaeal genomes, the percentage of fragments with
abnormal barcodes ranges from 9.86% to 23.14%, with the average being 13.58%.
Further information can be found in Additional file 1. The detailed frequency
information for abnormal fragments across different genomes is in Additional file 2
[15].
While we found that it is generally more challenging to study the abnormal
fragments in eukaryotes, we did apply the same procedure to different human
chromosomes, and found that the percentage of abnormal fragments ranges from
10.08% to 31.32%, with the average being 12.10%.
We have analyzed the abnormal fragments across the prokaryotic genomes, and
found the following: ~30% of the abnormal fragments can be explained in terms of (a)
horizontal gene transfers, (b) phage invasions and (c) highly expressed genes, based
on PHX-PA [16, 17] and Prophinder [18], respectively. Among the genes that fall
into this 30%, 6.99% are horizontally transferred genes, 4.97% bacteriophage genes
and 18.90% highly expressed genes, based on the above two prediction programs
(note that these numbers do not add up to exactly 30% since there are overlaps among
them). The genes falling into different categories are given in Additional file 3 [15].
We have carried out an enrichment analysis of such elements in regions with
abnormal versus normal barcodes. We found that the highly expressed genes are
enriched in the abnormal fragments, with the enrichment ratio >1 across all the
genomes and the average enrichment ratio being 1.90. Similar results hold for the
horizontally transferred genes and bacteriophage genes. All the detailed data can be
found in Additional file 2.
We noted that our estimate of the percentages of foreign fragments in
bacterial genomes (after deducting the highly expressed genes) is in general
agreement with the previous estimates though different information and techniques
are used to derive the estimates [19].
We do not yet have an explanation for the remaining ~70% of abnormal
fragments in prokaryotic genomes, although we suspect that they mostly fall into the
same three categories; one reason that we could not explain them now is possibly
the limited coverage of the current databases for horizontally transferred genes,
bacteriophage genomes and highly expressed genes. We believe that by using more
sophisticated computational procedures, one may be able to derive the level of
abnormality of a fragment's barcode in a genome, and possibly link such information
to when such fragments were horizontally transferred [20].
Binning metagenome sequence
The ability to sequence a microbial community has led to the sequencing of at least
7.04 Giga bps of metagenome sequences, already 2.22 times the total complete
genome sequences accumulated in the past two decades [21]. These metagenome
sequences have opened many doors to new research possibilities, and have posed
some challenging problems. One such problem is determining which fragments are
from the same organisms in a large pool of metagenomic fragments [22], typically
~1000 bps in length after the initial assembly using the Sanger sequencing
techniques.
We have applied a clustering algorithm (see METHODS section) for binning
sequence fragments together based on their barcode similarities, and tested the
clustering strategy on three sets of simulated metagenome data created by cutting
actual bacterial genomes into fragments and mixing them together. The three test sets
consist of all sequence fragments from three sets of genomes, respectively, extracted
from GenBank. The first set consists of 11 genomes randomly selected from the
same genus but from 11 different species (the genus has only 11 sequenced species)
while the last two sets each consist of 30 and 100 genomes randomly selected from 30
and 100 different bacterial genera, respectively. The genome names are given in
Additional file 4 [15].
To assess the binning ability of our algorithm as a function of the fragment size,
we have considered fragment size M = 1000, 2000, 5000 and 10000. To test the limit
of our binning algorithm, we have also considered M = 500. For each set of genomes,
we partitioned each genome into fragments of size M, and then mixed the fragments
of the same length into one pool. We then calculated the barcode for each fragment,
and did a clustering analysis, assuming that the number of genomes in each pool is
known (this information is derivable from the 16S rRNAs). We have carried out two
binning predictions, one directly on the generated fragments and one on a reduced set
of generated fragments, in which we remove 10% of the fragments from each genome
whose barcodes are most different from the average barcode of the genome. The
consideration is that each bacterial genome has ~13% of fragments with abnormal
barcodes on average, which are not expected to be binned correctly with the rest of
their host genome. This way we can more accurately assess the binning ability of our
algorithm. Table 1 gives the binning results on the three sets of synthetic metagenome
data, both the original set and the reduced set.
From the table, we can see that the binning accuracy (into the correct genomes)
is high for fragment size M = 1000 and above, at both the species and the genus level.
From the table, we see that there is a drop in the binning accuracy when the number of
the underlying genomes is increased from 30 to 100. This indicates the increased
complexity of the problem as a function of the number of underlying genomes.
We have compared our binning performance with the published results by the
best available algorithm PhyloPythia [5]. Our comparison indicates that our algorithm
gives consistently more accurate and more specific binning results across different
fragment sizes. For example, at the species level, our algorithm has better than 50%
accuracy on our test set when the fragment size is at least 2000 bps while no binning
results at the species level are given in McHardy et al. [5]. At the genus level, the
accuracy (the average of binning specificity and sensitivity, extracted from Figure
1(a) and (b) in McHardy et al. [5]) by PhyloPythia is 45.5% for fragment size 1000,
56% for 2000, 74% for 5000 and 82.5% for 10000 (no data is provided for 500), all
measured in terms of putting fragments into the correct genera, while ours is to the
correct genomes and with more accurate binning results. It should be noted that the
test set used by PhyloPythia is different from ours, which may affect the performance
statistics somewhat, though we suspect that the effect will be insignificant, considering
the sizes of the test sets. Another key difference between the two algorithms is that while
PhyloPythia is a supervised learning algorithm, which requires a training set, our
algorithm does not require a training set, and hence it is more general.
One thing worth noting is that a prokaryotic genome, on average, has ~13-14%
of abnormal sequence fragments, when the fragment size is M = 1,000, suggesting
that the theoretical limit for binning accuracy should be no better than 86-87%.
Similarly we expect that the theoretical limits of binning accuracy for 2,000, 5,000
and 10,000 fragments-based binning should, in general, be no better than 87.36%,
87.58% and 88.4%, respectively.
Discussion and Conclusion
A natural question is: do all nucleotide sequences have the barcode property like
genome sequences have? The answer is no, based on the large number of randomly
generated sequences that we have examined. Figure 1(e) shows a typical barcode of a
random sequence generated using a zeroth-order Markov chain model. We found that
none of the so generated nucleotide sequences has the vertical band structures as in
genomes' barcodes. More generally, barcodes for genomes and the randomly
generated nucleotide sequences have different characteristics as shown in Figure 5.
The barcode analyses in this paper are mainly based on data from prokaryotes.
Though we have applied the same barcode model to eukaryotes and made interesting
observations, we suspect that the current barcoding scheme is not rich enough to capture
all the complexity of eukaryotes. Further studies along this direction are clearly
needed.
We believe that for many genome analysis problems, particularly for
prokaryotic genomes, the barcodes provide a natural, intuitive, information-rich and
unified framework for studying them. Further applications of this capability to
numerous genome analysis problems can be envisioned, such as phylogeny studies,
particularly for genomes without obvious marker genes such as viruses, more
thorough examination of different types of genomic regions in eukaryotes, their
structures and organization, further studies of horizontal gene transfers, assisting in
genome assembly of higher-order organisms (e.g., Populus, which we are currently
working on) and possibly many more. We believe that we have only begun exploring
the true power of this new capability for genome studies.
Methods
Mapping frequencies to grey levels
The frequency of each k-mer is mapped to a grey level as follows. We first count the
frequency of each k-mer across all prokaryotic genomes, and sort the frequency list
S[1:N(k)] in the increasing order of the frequencies with N(k) being the number of
k-mers. We then find an integer L, L > 0, and partition S[1:N(k)] into L sub-lists so that
the following function is minimized:
$$\sum_{i=1}^{L} \left| S_i - \bar{S} \right|,$$
where $S_i$ is the sum of all frequencies in the i-th sub-list, $\bar{S}$ is the average
of the $S_i$, and L is a parameter to be determined by the minimization result. For
M = 1000 and k = 4, we found L = 14 gives the best value for the above objective
function. The computed partition of S gives a mapping of frequencies to the grey
levels. Note that this mapping is genome-independent so each grey level in the
barcodes has the same meaning in different genomes.
Barcode similarity calculation
We define the distance (or dissimilarity) between two barcodes based on their
simplified representations, each of which is a matrix having the same number of
columns as the barcode and the number of grey levels, L, used in barcode images as
the number of rows; each element in the matrix represents the frequency of the
corresponding grey level across each column in the barcode. For two such matrices
$M_1$ and $M_2$ with K columns and L rows, we define their barcode distance as
$$\sum_{i=1}^{L} \sum_{j=1}^{K} \left( M_1(i,j) - M_2(i,j) \right)^2.$$
Clearly this is a generalization of the Euclidean distance between two vectors of the
averaged k-mer frequencies across each genome, widely used for genome
comparisons as in the work of Karlin and colleagues [13, 16, 23, 24] and many others.
This is equivalent to the special case of our barcode distance when L = 1. Figure S4 in
Additional file 1 provides a comparison between the two distances.
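
The simplified representation and the distance defined in this section can be sketched as follows. This is a minimal illustration under assumptions that are not stated in the text: grey levels are coded as integers 0 to L-1, and the "frequency" of a grey level in a column is taken to be the fraction of fragments showing that level. The function names `grey_level_histograms` and `barcode_distance` are hypothetical.

```python
def grey_level_histograms(grey_barcode, n_levels):
    """Build the L x K matrix of grey-level frequencies per barcode column.

    grey_barcode[f][j] is the grey level (0..n_levels-1) of k-mer j in fragment f.
    Entry (i, j) of the result is the fraction of fragments with level i in column j.
    """
    n_fragments = len(grey_barcode)
    n_columns = len(grey_barcode[0])
    hist = [[0.0] * n_columns for _ in range(n_levels)]
    for row in grey_barcode:
        for j, level in enumerate(row):
            hist[level][j] += 1.0 / n_fragments
    return hist


def barcode_distance(m1, m2):
    """Sum of squared element-wise differences between two L x K matrices."""
    return sum((a - b) ** 2 for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))
```

Because the matrices have the same shape regardless of genome length, two barcodes with different numbers of fragments can be compared directly.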
Identification of abnormal fragments in a genome
We have used the following procedure to identify fragments in a genome with
substantially different barcodes than the average barcode of the genome. The
procedure consists of two key steps. First, for each k-mer, we select the fragments in
the genome that have the highest or the lowest X% of this k-mer's frequency among
all fragments, with X being a parameter. Then we sort all the fragments in the
increasing order of the number of times they are selected in the first step, termed
function F(p), with p being the index of a fragment. Let $p_0$ be the fragment index
having the highest second-order derivative of F(p). We consider all fragments p with
F(p) > F($p_0$) to be the non-native fragments of the genome as they have the largest
number of k-mers with frequencies that are substantially different from the typical
k-mer frequencies throughout the genome. We found that the abnormal fragment
prediction is not very sensitive to the detailed value of X within the range from 5 to
20. So we have chosen X = 10 as the default value of our program.
The rationale for this procedure is that fragments with higher F(p) values
represent fragments that have more abnormal k-mer frequencies compared to the
average k-mer frequencies in the genome, and hence are more probable to be
non-native fragments. By examining the curve of the F(p) function, we found that it is
convex with one sharp transition point $p_0$, indicating a transition point from the
typical fragments to the abnormal fragments in the genome (see Additional file 1).
Hence we have used this point as the separation point between the normal (or native)
fragments and the abnormal fragments.
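
The two steps of this procedure can be sketched in a few lines of Python. The sketch makes several assumptions that the text leaves open: X% of the fragments are taken at each of the two tails, the second-order derivative is approximated by the discrete second difference of the sorted F values, and the first maximum is used when several positions tie. The function name `abnormal_fragments` is hypothetical.

```python
def abnormal_fragments(freqs, x_percent=10.0):
    """Return indices of fragments flagged as abnormal.

    freqs[p][j] is the frequency of k-mer j in fragment p.
    """
    n = len(freqs)
    if n < 3:
        return []
    n_kmers = len(freqs[0])
    n_pick = max(1, int(n * x_percent / 100.0))

    # Step 1: for each k-mer, select fragments in the lowest or highest X%.
    hits = [0] * n  # number of times each fragment is selected
    for j in range(n_kmers):
        order = sorted(range(n), key=lambda p: freqs[p][j])
        for p in set(order[:n_pick]) | set(order[-n_pick:]):
            hits[p] += 1

    # Step 2: sort the selection counts (the curve F) and locate the sharpest bend
    # with a discrete second difference (a stand-in for the second derivative).
    f = sorted(hits)
    best_i, best_val = None, None
    for i in range(1, n - 1):
        d2 = f[i + 1] - 2 * f[i] + f[i - 1]
        if best_val is None or d2 > best_val:
            best_i, best_val = i, d2
    threshold = f[best_i]
    return sorted(p for p in range(n) if hits[p] > threshold)
```

On real genomes the F curve is smoother than in a toy input, so a practical version would probably smooth F before differencing; that refinement is omitted here.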
Metagenome binning algorithm
Our binning procedure starts with an application of the CLUMP program [25] to a
given pool of fragments (not necessarily of the same lengths) to be clustered based on
their barcode similarities. A unique feature of CLUMP is that it is quite accurate in
identifying the core elements of each cluster as we have previously demonstrated [25],
though a weakness of the algorithm could be that it does not always handle the
boundary elements accurately. Hence we have combined CLUMP with a K-means
based clustering approach that we implemented. After identifying the initial clusters
formed by CLUMP based on barcode similarities, assuming that we know the number
of clusters to be identified, we pick a seed from each predicted cluster randomly
according to the density distribution of the cluster. Then we run the K-means
algorithm, using the selected seeds. For each pool of fragments, we run this two-step
clustering algorithm multiple times, using a different set of seeds for each run. In
deciding the number of runs, our rule of thumb based on our experience working on
the metagenome data is to use 500*(the number of clusters). For each given set of
seeds, we run the K-means algorithm for 10000 iterations. Among all the computed
clustering results for each pool, we choose the clustering result $C_1, C_2, \ldots, C_K$ that
minimizes the following function as the final binning result:
$$\sum_{i=1}^{K} \sum_{X \in C_i} \left( X - \bar{X}_i \right)^2,$$
where $C_1, C_2, \ldots, C_K$ is a partition of a given pool of metagenomic fragments with each
$C_i$ being a subset of the pool and $\bar{X}_i$ being the average of the barcodes of all fragments
in $C_i$, $i = 1, \ldots, K$.
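
The restart-and-select part of this procedure can be sketched as below. The CLUMP program is not reproduced here: as a plainly labeled substitution, the seeds of each run are drawn uniformly at random from the pool instead of from CLUMP clusters, and each fragment is assumed to be represented by a numeric vector. All function names (`kmeans`, `objective`, `best_of_restarts`) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(points, seeds, max_iter=100):
    """Plain K-means started from the given seeds; returns (labels, centers)."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean(members)
    return labels, centers


def objective(points, labels, centers):
    """Within-cluster sum of squared distances to the cluster averages."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


def best_of_restarts(points, k, n_runs, rng):
    """Run K-means n_runs times with random seeds (in place of CLUMP seeds)."""
    best = None
    for _ in range(n_runs):
        seeds = rng.sample(points, k)
        labels, centers = kmeans(points, seeds)
        score = objective(points, labels, centers)
        if best is None or score < best[0]:
            best = (score, labels)
    return best
```

Following the rule of thumb stated above, `n_runs` would be set to 500 times the number of clusters; the toy default of 100 iterations per run is only for illustration.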
Authors' contributions
Y.X. conceived the project. F.Z. and V.O. analyzed the data and performed the
experiments. Y.X. supervised this project as P.I. and wrote the manuscript.
Acknowledgements
This work was supported by National Science Foundation (DBI-0354771, ITR-IIS-
0407204, DBI-0542119, CCF0621700), a U.S. Department of Energy BioEnergy
Research Center grant from the Office of Biological and Environmental Research in
the DOE Office of Science, and a Distinguished Scholar grant from the Georgia
Cancer Coalition. We would like to thank the two anonymous reviewers for their
helpful comments on our work. We would also like to thank Ms Joan Yantko for
preparing the manuscript.
References
1. Backhed F, Ley RE, Sonnenburg JL, Peterson DA, Gordon JI: Host-bacterial mutualism in the human intestine. Science 2005, 307(5717):1915-1920.
2. Jain R, Rivera MC, Lake JA: Horizontal gene transfer among genomes: the complexity hypothesis. Proceedings of the National Academy of Sciences of the United States of America 1999, 96(7):3801-3806.
3. Frey TK: Neurological aspects of rubella virus infection. Intervirology 1997, 40(2-3):167-175.
4. Rybchin VN, Svarchevsky AN: The plasmid prophage N15: a linear DNA with covalently closed ends. Mol Microbiol 1999, 33(5):895-903.
5. McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 2007, 4(1):63-72.
6. Yang E, Bin W, Peng J, Zhang X, Wang J, Yang J, Dong J, Chu Y, Zhang J, Jin Q: Comparative genomics and phylogenetic analysis of S. dysenteriae subgroup. Sci China C Life Sci 2005, 48(4):406-413.
7. Trifonov EN, Sussman JL: The pitch of chromatin DNA is reflected in its nucleotide sequence. Proceedings of the National Academy of Sciences of the United States of America 1980, 77(7):3816-3820.
8. Borodovsky M, Sprizhitskii Y, Golovanov E, Aleksandrov A: Statistical patterns in primary structures of functional regions in the E. coli genome. I. Oligonucleotide frequencies analysis. Molecular Biology 1986, 20:826-833.
9. Borodovsky M, Sprizhitskii Y, Golovanov E, Aleksandrov A: Statistical patterns in primary structures of functional regions in the E. coli genome. II. Non-homogeneous Markov models. Molecular Biology 1986, 20:833-840.
10. Borodovsky M, Sprizhitskii Y, Golovanov E, Aleksandrov A: Statistical patterns in primary structures of functional regions in the E. coli genome. III. Computer recognition of coding regions. Molecular Biology 1986, 20:1145-1150.
11. Karlin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 1995, 11(7):283-290.
12. Karlin S, Zhu ZY, Karlin KD: The extended environment of mononuclear metal centers in protein structures. Proceedings of the
National Academy of Sciences of the United States of America 1997, 94(26):14225-14230.
13. Karlin S, Brocchieri L, Mrazek J, Campbell AM, Spormann AM: A chimeric prokaryotic ancestry of mitochondria and primitive eukaryotes. Proceedings of the National Academy of Sciences of the United States of America 1999, 96(16):9190-9195.
14. Computed_barcodes: [http://csbl.bmb.uga.edu/~ffzhou/BoDB/].
15. Supplementary_material: [http://csbl.bmb.uga.edu/~ffzhou/BoDB/supp/].
16. Mrazek J, Bhaya D, Grossman AR, Karlin S: Highly expressed and alien genes of the Synechocystis genome. Nucleic Acids Res 2001, 29(7):1590-1601.
17. Karlin S, Mrazek J: Predicted highly expressed genes of diverse prokaryotic genomes. J Bacteriol 2000, 182(18):5238-5250.
18. Lima-Mendez G, Helden JV, Toussaint A, Leplae R: Prophinder: a computational tool for prophage prediction in prokaryotic genomes. Bioinformatics 2008.
19. Ochman H, Lawrence JG, Groisman EA: Lateral gene transfer and the nature of bacterial innovation. Nature 2000, 405(6784):299-304.
20. Lawrence JG, Ochman H: Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 1997, 44(4):383-397.
21. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC: The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 2008, 36(Database issue):D475-479.
22. McHardy AC, Rigoutsos I: What's in the mix: phylogenetic classification of metagenome sequence samples. Current opinion in microbiology 2007, 10(5):499-503.
23. Karlin S, Mrazek J, Ma J, Brocchieri L: Predicted highly expressed genes in archaeal genomes. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(20):7303-7308.
24. Mrazek J, Karlin S: Detecting alien genes in bacterial genomes. Ann NY Acad Sci 1999, 870:314-329.
25. Olman V, Mao F, Wu H, Xu Y: Parallel Clustering Algorithm for Large Data Sets with applications in Bioinformatics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2007, Accepted.
26. DeSantis TZ, Jr., Hugenholtz P, Keller K, Brodie EL, Larsen N, Piceno YM, Phan R, Andersen GL: NAST: a multiple sequence alignment server
for comparative analysis of 16S rRNA genes. Nucleic Acids Res 2006, 34(Web Server issue):W394-399.
27. Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms, Second Edition. Cambridge, MA: The MIT Press; 2001.
Figures
Figure 1. Barcodes for five prokaryotic genomes.
(a) E. coli K-12; (b) E. coli O157; (c) chromosome 1 of B. pseudomallei K96243; (d)
archaean P. furiosus DSM 3638; and (e) a random nucleotide sequence generated
using a zeroth-order Markov chain model. The x-axis for each barcode is the list of all
4-mers arranged in the alphabetical order, and the y-axis is the genome axis with each
pixel representing a fragment of M bp long.
Figure 2. Basic features of barcodes.
(a) Barcode distance distribution among chromosomes from the same organisms,
across all prokaryotic and eukaryotic chromosomal genomes. The x-axis is the
barcode distance and the y-axis is the frequency of chromosome pairs of the same
organism having a particular barcode distance. (b) Genome barcode distances versus
sequence similarities among the corresponding 16S rRNAs (based on the multiple
sequence alignment given in DeSantis TZ et al. [26]). The y-axis represents the
barcode distance, and the x-axis is the sequence identity axis between two 16S rRNAs
grouped into nine bins, where the sequence identity is calculated as the average
sequence identity over all 16S pairs in each bin.
Figure 3. Barcodes of some organisms.
Barcodes of (a) Human chromosome 1 (226.21 Mbps); major components of human
chromosome in a composite form: (b) repetitive sequence, (c) promoter sequence, (d)
coding regions and (e) introns; and (f) coding and (g) non-coding regions of E. coli
K-12. Only a 639-Kbp region of each sequence in (b)-(g) is displayed so each pixel
represents the same sequence length. 639 Kbps is used since this is the length of the
shortest region among them all, i.e. the total non-coding region of E. coli K-12.
Mitochondrial genome barcodes of (h) C. elegans (13794 bps) and (i) Drosophila
melanogaster (19517 bps). Plastid genome barcodes of (j) aquatic plant
Ceratophyllum demersum (156252 bps) and (k) land plant Populus trichocarpa
(157033 bps).
Figure 4. Barcodes in feature space.
The x-axis is the average of variations of the 4-mer frequencies across a whole
genome across all 4-mers, and the y-axis measures the similarity level among all
1000-bp partitioned fragments of the genome, each represented as a 136-dimensional
vector of 4-mer frequencies. Specifically, for each genome, we build a minimum
spanning tree [27] based on the 4-mer frequency vectors for its sequence fragments
and their distances. The y-axis is the averaged weight (distance) of all edges in the
minimum spanning tree. The green dots represent prokaryotes (586 genomes), the
blue ones eukaryotes (83 chromosomes), the red ones plastids (101 genomes
with lengths > 20,000 bps), the brown ones plasmids of prokaryotic genomes (237
plasmids > 20,000 bps) and the black ones mitochondria (120 genomes with lengths >
20,000 bps).
Figure 5. Distribution of ratios between barcode variations of all prokaryotic genomes and their corresponding randomly generated nucleotide sequences.
For each genome, a corresponding random nucleotide sequence is defined as a
random sequence of the same length and with the same mono-nucleotide frequencies
as those of the genome, generated using a zeroth-order Markov chain model. The
variation of a barcode is defined as the standard deviation of the list of the averaged
frequencies of all the k-mers along the genome. The x-axis is the ratio of the barcode
variations between a genome and a corresponding random sequence, and the y-axis
represents the frequency of cases with a particular variation ratio.
Tables
Table 1. Binning accuracies of our barcode-based clustering algorithm.
                  11 genomes              30 genomes              100 genomes
                  Original    Filtered    Original    Filtered    Original    Filtered
                  genomes     genomes     genomes     genomes     genomes     genomes
FS = 500 bps      71.10%      77.30%      51.60%      55.70%      40.50%      41.10%
FS = 1000 bps     79.90%      85.90%      65.30%      70.30%      51.10%      52.60%
FS = 2000 bps     86.30%      91.70%      74.80%      80.60%      61.00%      68.53%
FS = 5000 bps     91.10%      98.10%      86.60%      93.20%      79.40%      81.90%
FS = 10000 bps    95.80%      99.30%      91.90%      97.50%      86.60%      89.18%
The binning accuracy is defined as (prediction specificity + prediction sensitivity)/2,
and FS is for fragment size, where both the specificity and sensitivity are measured in
terms of putting the fragments into the correct bin corresponding to each genome,
defined by the majority of the fragments in the bin. The column "Original genomes"
lists the binning accuracy of our algorithm on all the non-overlapping fragments in
each group of genomes, and the column "Filtered genomes" gives the accuracy after
removing the 10% fragments with the most abnormal barcodes from each genome.
Additional files
Additional file 1
File format: DOC
Title: Supplementary material.
Description: Supplementary material 1-3.
Additional file 2
File format: XLS
Title: Supplementary Table 1.
Description: HX is highly expressed gene, HT is horizontally transferred gene, and
PH is the phage gene. Unknown Gene consists of genes within fragments of abnormal
barcodes that do not belong to the above three categories.
Additional file 3
File format: ZIP compressed TXT
Title: Supplementary Table 2.
Description: Gene classifications of all the prokaryotic genomes.
Additional file 4
File format: XLS
Title: Supplementary Table 3.
Description: Genomes used in the binning section.
Additional files provided with this submission:
Additional file 1: barcode-suppl.doc, 737K
http://www.biomedcentral.com/imedia/7848742892070350/supp1.doc
Additional file 2: stable1.xls, 185K
http://www.biomedcentral.com/imedia/5579318302405972/supp2.xls
Additional file 3: stable2.zip, 4115K
http://www.biomedcentral.com/imedia/2665914822070350/supp3.zip
Additional file 4: stable3.xls, 27K
http://www.biomedcentral.com/imedia/7405639442070350/supp4.xls