PROCEEDINGS Open Access

Phylogenetic reconstruction in the Order Nymphaeales: ITS2 secondary structure analysis and in silico testing of maturase k (matK) as a potential marker for DNA bar coding

Devendra Kumar Biswal, Manish Debnath, Shakti Kumar, Pramod Tandon*

From Asia Pacific Bioinformatics Network (APBioNet) Eleventh International Conference on Bioinformatics (InCoB2012), Bangkok, Thailand. 3-5 October 2012

Abstract

Background: The Nymphaeales (water lily and relatives) lineage diverged as the second branch of basal angiosperms and comprises two families: Cabombaceae and Nymphaeaceae. The classification of Nymphaeales and its phylogeny within the flowering plants are intriguing, as several systems (the Thorne, Dahlgren, Cronquist, Takhtajan and Angiosperm Phylogeny Group III (APG III) systems) have attempted to redefine the taxonomy of Nymphaeales. There are also fossil records, consisting especially of seeds, pollen, stems, leaves and flowers, from as early as the lower Cretaceous. Here we present an in silico study of the order Nymphaeales taking maturase K (matK) and internal transcribed spacer 2 (ITS2) as biomarkers for phylogeny reconstruction (using character-based methods and a Bayesian approach) and for identification of motifs for DNA barcoding.

Results: The Maximum Likelihood (ML) and Bayesian approaches yielded congruent, fully resolved and well-supported trees using a concatenated (ITS2 + matK) supermatrix aligned dataset. The taxon sampling corroborates the monophyly of Cabombaceae. Nuphar emerges as a monophyletic clade in the family Nymphaeaceae, while there are slight discrepancies in the monophyly of the genus Nymphaea owing to Victoria-Euryale and Ondinea grouping in the same node of Nymphaeaceae. The ITS2 secondary structure alignment corroborates the primary sequence analysis. Hydatellaceae emerged as a sister clade to Nymphaeaceae and formed a basal lineage among the water lily clades. Species of Cycas and Ginkgo were taken as outgroups and were used to root the overall tree topology obtained from the various methods.

Conclusions: MatK genes are fast-evolving, highly variable regions of plant chloroplast DNA that can serve as potential biomarkers for DNA barcoding and, through identification of unique motif regions, for generating primers for angiosperms. We have reported unique genus-specific motif regions in the Order Nymphaeales from the matK dataset, which can be further validated for barcoding and for the design of PCR primers. Our analysis, using a novel approach of sequence-structure alignment and phylogenetic reconstruction based on molecular morphometrics, is congruent with the current placement of Hydatellaceae within the early-divergent angiosperm order Nymphaeales. The results underscore the fact that more diverse genera, if not fully resolved as monophyletic, should be represented by all major lineages.

* Correspondence: [email protected]
Bioinformatics Centre, North Eastern Hill University, Shillong 793022, Meghalaya, India

Biswal et al. BMC Bioinformatics 2012, 13(Suppl 17):S26
http://www.biomedcentral.com/1471-2105/13/S17/S26

© 2012 Biswal et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Background

The basal angiosperm Order Nymphaeales is a group of water-living flowering plants. Though the group is taxonomically small, it has great significance in understanding the early evolutionary pattern of angiosperms. Classification of this Order varies from recognition of two to four families. A lot of progress has been made in recent years in understanding both the taxonomic position of Nymphaeales in the angiosperm tree and the relationships within the water lily clade [1-3].

Usually, two families, Cabombaceae and Nymphaeaceae, are recognised. The Cabombaceae comprises the genera Cabomba and Brasenia, and the Nymphaeaceae comprises six genera: Euryale, Ondinea, Victoria, Barclaya, Nuphar and Nymphaea, the last being the largest and most cosmopolitan. In previous systems Hydatellaceae was placed among the monocots, with Poales, but a recent study with a multi-marker plastid dataset found that the family belongs to Nymphaeales. It includes two genera (Hydatella and Trithuria), which are restricted to Australasia and India [4].

The Order Nymphaeales was considered to include the genera Nelumbo and Ceratophyllum in earlier taxonomic treatments based on morphology [5-8]. However, with the recent use of modern molecular biomarkers, Nelumbo and Ceratophyllum have been excluded, thereby substantiating the monophyly of Nymphaeales [8-10]. This provided an impetus for re-evaluation of morphological characters, which revealed features such as tricolpate pollen and epicuticular wax tubules in Nelumbo, further substantiating its exclusion from Nymphaeales [5,11].

As Hydatellaceae represents the single exception in an otherwise relatively harmonious congruence between the traditional and molecular circumscriptions of the monocot clade, the structural diversity of this remarkable family is of considerable interest. They are small and inconspicuous plants that received little attention from botanists prior to their taxonomic reassignment to the basal angiosperms. It is therefore worthwhile to review current knowledge of this species-poor family, which has only recently been discovered in India [4].

Morphological and molecular data generally indicate a close association of Cabomba and Brasenia, thereby affirming the monophyly of the family Cabombaceae [12,13], whereas the monophyly of the family Nymphaeaceae has yet to gain much support from the taxonomic community.

DNA barcoding has become an indispensable tool for identifying biological specimens using a short standardized region of genomic or extra-chromosomal DNA, much as universal product codes identify consumer goods. Researchers interested in DNA barcodes want to place their query sequences within the taxonomic hierarchy. This is achieved with conventional sequence similarity search methods such as the Basic Local Alignment Search Tool (BLAST) and FASTA, which are often tweaked to accommodate biological mutations or sampling bias. This, in turn, raises tricky issues such as reliably tracking the minuscule sequence variations observed among closely related species. Going a step further, character-based similarity relying on common ancestry is also employed, in the form of phylogenetic trees or implicit hierarchical taxonomic descriptors [14]. These methods depend heavily on multiple sequence alignments (MSA), which is itself a challenge because the barcoding requirements run counter to the objective of MSA: a barcode region must be hypervariable enough to delineate closely related species, yet conserved enough to allow the design of universal PCR primers. With this in mind, selecting a core barcode that abides by the three important barcoding principles (standardization, minimalism and scalability) remains a challenge for plant DNA barcoding, unlike animal DNA barcoding. The standard animal barcode, cytochrome c oxidase subunit I (COI), is haploid, uniparentally inherited and a single locus with high discriminatory power, and so fits the above criteria well [15].

COI is a protein-coding marker present in high copy number per cell and devoid of microinversions, frequent mononucleotide repeats and drastic length variation. Well-developed primer sets aid routine recovery of high-quality sequence from animal clades, including from poorly preserved samples [15]. Finding a standard plant barcode analogous to COI in animals has proved difficult: COI from plant mitochondrial DNA (mtDNA) generally exhibits low nucleotide substitution rates, making it unsuitable for universal plant barcoding initiatives. Core research groups have worked both in silico and in vitro and suggested multiple plastid markers, but could not arrive at a consensus [15]. Maturase K (matK) therefore still holds good as a suitable plant barcode that can be considered the analogue of the animal barcode COI [16].

matK is one of the most rapidly evolving coding regions in the plastid genome, but it is difficult to amplify by PCR with existing universal primer sets, especially in non-angiosperms. In contrast, another barcode region, the ribulose-bisphosphate carboxylase (rbcL) gene, is easy to amplify, sequence and align, despite having only modest discriminatory power [17]. Hence, a two-marker plastid barcode (rbcL + matK) is suggested as the core barcode until further work on matK universal primer development succeeds. Given these two joint challenges (matK primers in need of improvement and uncertainty about the discriminatory power of the two-marker rbcL + matK system), sequencing continues and non-coding markers such as trnH-psbA and the internal transcribed spacers (ITS1 and ITS2) are being explored, with a view to formalizing the routine incorporation of other potential non-coding markers into plant barcoding design systems [17].

Officially, the rbcL + matK combination has been approved by the Consortium for the Barcode of Life (CBOL) as a global DNA barcode for land plants, while trnH-psbA is still under scrutiny as a backup barcoding locus. Studies on ferns with the matK + rbcL and trnH-psbA loci found that the former provided high discriminatory power, supporting its use as the official DNA barcode [17]. Another study validated the use of ITS2 as a novel DNA barcode for medicinal plant identification, as ITS2 sequences are considered potential phylogenetic markers at the genus and species levels. Six parameters were determined for several plastid and ribosomal intergenic marker regions: (i) the average interspecific distance (K2P) between all species in each genus; (ii) the average theta prime (θ'), where θ' is the mean pairwise distance within each genus with more than one species; (iii) the smallest interspecific distance, i.e., the minimum interspecific genetic distance within each genus with at least two species; (iv) the average intraspecific divergence (K2P difference); (v) theta (θ), where θ is the mean pairwise distance within each species with at least two representatives; and (vi) the average coalescent depth, i.e., the maximum intraspecific distance within each species with at least two representatives. ITS2 exhibited the highest level of variation on all the parameters, making it a suitable marker with authentication ability [18].

Prompted by these questions about phylogenetic relationships in Nymphaeales, we designed an in silico study using matK and ITS2 sequences available in the public domain and covering all genera of Nymphaeales. To date there are no reports of a plant DNA barcoding approach in which matK and ITS2 are taken together and used for phylogenetic studies. In the case of water lilies, molecular identification and barcodes have been reported only for the genus Nymphaea, using sequences from the rpoC1 gene and trnH-psbA spacer regions (which are still under assessment as backup loci by CBOL) and inter-simple sequence repeats (ISSR) for species identification and differentiation of Nymphaea cultivars and natural populations [19]. The present study aims to use both matK and ITS2 as markers for elucidating the plant species of the order Nymphaeales. It uses a combined matrix of both markers, captures the phylogenetic signal of the ITS2 region through molecular morphometrics, and looks for novel motifs that can be tested as PCR primers in the design of potential genus-level barcodes for rapid and accurate plant identification across the three families Nymphaeaceae, Cabombaceae and Hydatellaceae without morphological characters.

ITS2 has common core secondary structures across eukaryotes that serve as a double-edged tool. The ITS2 region of the nuclear rDNA cistrons is widely used for phylogenetic analyses at the genus and species levels, and also at higher taxonomic ranks, using comparisons of primary sequence. Although potential transcript secondary structure homology is often used to aid alignment in comparisons of ribosomal gene sequences, it has rarely been applied to ITS, primarily because secondary structures for its transcript were not available. Applying ITS2 RNA transcript secondary structure information improves alignments, which in turn allows comparisons at even deeper taxonomic levels. The evolutionarily conserved subportions of ITS2 are apparently necessary for positioning the multimolecular transcript processing machinery in eukaryotes, and this makes ITS2 a valuable tool for both primary sequence analysis and molecular morphometrics [20].

Although individual approaches exist for different barcodes and address several issues in plant barcode design, there is a general need to integrate a range of analytical routines into a common workflow that provides comparable informatics support for the molecular data already in the public domain.

For the present in silico study three principal objectives were envisaged:

1. Phylogenetic reconstruction of Nymphaeales based on matK genes and ITS2 using a combined approach of gene trees and species trees (using a supermatrix of concatenated loci), and testing the monophyly of the genus Nymphaea.
2. Evaluation of the phylogenetic utility of matK as a potential marker for motif hunting and DNA barcoding.
3. ITS2 secondary structure prediction in the order Nymphaeales and alignment of the secondary structures to produce a consensus Nymphaeales phylogeny.

Results

Sequence analysis and phylogeny reconstructions

The sequences of ITS2 and matK were aligned separately with the ClustalW program [21] and manually edited, and the resulting alignments were concatenated using FASconCAT version 1.0 [22] (Additional File 1). For the insertion of gaps, attention was given to both the potentially inserted sequence and its neighbouring sequences. A gap was inserted only when it prevented the inclusion of more than two substitutions among closely adjacent nucleotides in the alignment. For the placement of gaps, the recognition of sequence motifs was given priority, following the alignment rules for length-variable DNA sequences [23].


Giving priority to a motif can result in insertions that are correctly aligned as non-homologous (i.e., with different positional extensions), although sequence similarity alone would suggest placing them, inaccurately, in the same column [24]. Individual positions in homonucleotide strings of different lengths (poly-As or poly-Ts) are considered to be of uncertain homology [25] and were therefore excluded. Slipped strand mispairing [26] is likely to have led to numerous length mutational events involving one to several nucleotides. As only nucleotides of the same kind are involved, accurate motif recognition is not possible in such strings. Entire indels with the same positional extension and complete sequence similarity were readily assessed as primary homologues and consequently placed in the same column(s) of the alignment. During primary homology assessment, no inference had to be made as to whether the length mutational event occurred in a common ancestor of all taxa sharing it or in parallel in different lineages. This is analogous to the fact that the synapomorphic status of a substitution at a particular position is not inferred in the alignment process. Recognition of a repeat motif was regarded as further evidence for correctly recognizing a length mutational event. The final concatenated supermatrix included 51 taxa with 1875 characters.
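The concatenation of the two aligned loci into one supermatrix can be illustrated with a minimal sketch. This is not FASconCAT itself: the function name, the "?" missing-data symbol and the toy sequences are assumptions made for illustration, and each locus is assumed to be already aligned and held as a mapping from taxon name to aligned sequence.

```python
from typing import Dict, List


def concatenate_alignments(partitions: List[Dict[str, str]],
                           missing: str = "?") -> Dict[str, str]:
    """Join per-locus alignments (taxon -> aligned sequence) into one supermatrix."""
    lengths = []
    for part in partitions:
        # Every partition must be a proper alignment: all rows the same length.
        row_lengths = {len(seq) for seq in part.values()}
        if len(row_lengths) != 1:
            raise ValueError("all sequences in a partition must have equal length")
        lengths.append(row_lengths.pop())

    taxa = sorted({taxon for part in partitions for taxon in part})
    supermatrix = {}
    for taxon in taxa:
        # A taxon absent from a locus is padded with the missing-data symbol.
        blocks = [part.get(taxon, missing * n)
                  for part, n in zip(partitions, lengths)]
        supermatrix[taxon] = "".join(blocks)
    return supermatrix


# Toy data only; these are not sequences from the study.
its2 = {"Nuphar_sp": "ACG-T", "Cabomba_sp": "ACGGT"}
matk = {"Nuphar_sp": "TTGA", "Cabomba_sp": "TTCA", "Trithuria_sp": "T-CA"}
matrix = concatenate_alignments([its2, matk])
for name, row in matrix.items():
    print(name, row)
```

Padding absent loci keeps every row the same length, which is what downstream tree-building programs generally expect from a supermatrix.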

Maximum Likelihood (ML) analyses

Phylogenetic tree analysis was carried out using PhyML 3.0 [27] with the approximate likelihood ratio test (aLRT), which is much faster than the bootstrap and gives values close to Bayesian posteriors. We then applied a Shimodaira-Hasegawa-like procedure [28], which is non-parametric and agrees well with bootstrap outcomes. The default substitution model HKY85, with a gamma shape parameter of 2.716 and a transition/transversion ratio of 3.064, was used for computing the ML tree (Figure 1). The tree showed several groupings within the family Nymphaeaceae and supported the monophyly of the genera Euryale, Barclaya, Nuphar, Nymphaea and Victoria, with the exception of Ondinea, which grouped with Nymphaea. There were slight variations in the placement of some species of Nymphaeaceae, especially from the genus Nymphaea, which clustered with other groups, reflecting the genetic variability of the genus. Ginkgo and Cycas representatives were taken as the outgroup and rooted the overall tree topology with strong bootstrap values. Three main clades were resolved: Cabombaceae (with the genera Cabomba and Brasenia), Nymphaeaceae and Hydatellaceae (with the genus Trithuria). The grouping of Trithuria spp. and the placement of the family in the basal grade close to the outgroups indicated that Hydatellaceae and Nymphaeaceae are sister groups and that Hydatellaceae belongs to a more primitive basal angiosperm lineage.
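The transition/transversion ratio quoted above is a model parameter estimated by ML, not a raw count. As a rough illustration of what it summarizes, the following sketch classifies the observed differences between two aligned sequences as transitions (purine to purine, or pyrimidine to pyrimidine) or transversions. The function name and the toy sequences are hypothetical, and the sketch does not reproduce the PhyML estimate.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq_a: str, seq_b: str):
    """Count transitions and transversions between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip identical sites, gaps and ambiguity codes.
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# Toy sequences only: A/G and C/T are transitions, T/A and A/C transversions.
ts, tv = count_ts_tv("AAGCTTGA", "GAGTTAGC")
print(ts, tv)
```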

Bayesian analyses and split networks

The supermatrix dataset of ITS2 and matK was exported in NEXUS format for MrBayes [29] using the Mesquite program version 2.75 [30]. Bayesian analysis retained the same topology and supported the branches in a 50% majority-rule consensus (Figure 2), though the lineage basal to the Nymphaeales group was represented by both the Hydatellaceae and the Cycas and Ginkgo outgroup. Our analyses showed that exclusion of randomised sections improved the resolution between the different genera of the family Nymphaeaceae. The monophyly of the order Nymphaeales has been favoured by earlier studies [4-6]. We therefore conclude that more genes are necessary to robustly resolve the Nymphaeales clade as well as the relationships between Nymphaeaceae and Hydatellaceae. The above observations on the genetic variability in the family Nymphaeaceae prompted a median-joining and network analysis, which was performed using SplitsTree4 [31] with the variable positions in the aligned concatenated ITS2 and matK data. The median network tree exhibited three main groups, accounting for the monophyly of Nymphaeaceae, Cabombaceae and Hydatellaceae, with Cycas and Ginkgo as outgroups (Figure 3). Though the network analysis strongly corroborated the results of the MrBayes and ML analyses (Figures 1 and 2), Neighbor Network graphs give an indication of noise, signal-like patterns and conflicts within a supermatrix aligned dataset.
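The idea behind a 50% majority-rule consensus can be shown with a small sketch. It assumes each sampled tree has already been reduced to a set of clades, each clade being a frozenset of taxon names; that representation, the function name and the toy clades are hypothetical, and this is not how MrBayes is implemented. The 10% burn-in default mirrors the setting reported in the Figure 2 caption.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set

Clade = FrozenSet[str]


def majority_rule_clades(sampled_trees: List[Set[Clade]],
                         burnin_fraction: float = 0.10,
                         threshold: float = 0.5) -> Dict[Clade, float]:
    """Return clades whose frequency among post-burn-in samples exceeds threshold."""
    n_burnin = int(len(sampled_trees) * burnin_fraction)
    kept = sampled_trees[n_burnin:]  # discard the earliest samples as burn-in
    if not kept:
        raise ValueError("no trees left after burn-in")
    counts = Counter(clade for tree in kept for clade in tree)
    return {clade: n / len(kept)
            for clade, n in counts.items()
            if n / len(kept) > threshold}


# Toy samples only: three candidate clades over made-up tree samples.
cab_bra = frozenset({"Cabomba", "Brasenia"})
nym_ond = frozenset({"Nymphaea", "Ondinea"})
nym_vic = frozenset({"Nymphaea", "Victoria"})
samples = [{nym_vic}] + [{cab_bra, nym_ond}] * 6 + [{cab_bra, nym_vic}] * 3
support = majority_rule_clades(samples)
for clade, freq in support.items():
    print(sorted(clade), round(freq, 3))
```

The clade frequencies kept here play the role of the posterior support values shown on a consensus tree; clades below the threshold would be collapsed.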

Figure 1 ML topology of Nymphaeales from the aligned concatenated supermatrix dataset using PhyML 3.0. Phylogeny reconstruction of the Order Nymphaeales based on a concatenated dataset of two different loci (ITS2 + matK) using a taxon set of 51 taxa (including three outgroup taxa, Cycas revoluta, Cycas siamensis and Ginkgo biloba) with aLRT values (best ML tree, majority rule, aLRT values similar to 100 bootstrap replicates).

Molecular clock rates, dS/dN analysis

The molecular clock, based on the molecular clock hypothesis (MCH), is a technique in molecular evolution that uses fossil constraints and rates of molecular change to deduce the time in geologic history when two species or other taxa diverged, and so estimates the timing of speciation or radiation events. The likelihood ratio test of the molecular clock compares the ML value for a given tree obtained under the assumption of rate uniformity among lineages with the ML value obtained without that assumption. The test often rejects the null hypothesis when applied to datasets containing many sequences or long sequences, as strict equality of evolutionary rates among lineages is frequently violated. Even so, the estimates of branch lengths, and thus of interior node depths, in a tree obtained under the assumption of a molecular clock can be useful for generating a rough idea of the relative timing of sequence divergence events [24]. The molecular clock test was performed by comparing the ML values under the Jukes-Cantor model [32], and a Maximum Parsimony (MP) tree (Figure 4) was generated for the matK dataset. The molecular clock test output is outlined in Table 1.

The codon-based Z-test was carried out with the model set to Syn-Nonsynonymous and the Nei-Gojobori method. The resulting matrix displays (dS-dN) values above the diagonal and p-values below the diagonal (Additional Files 2 & 3). The test was carried out for both the hypothesis of positive selection and that of purifying selection. A positive (dS-dN) value indicates purifying selection, and a p-value below 0.05 supports significant purifying selection. In the test for positive selection, the comparisons with a p-value of 1.0 had positive (dS-dN) values, so the hypothesis of positive selection was rejected: dS > dN, i.e., silent (synonymous) mutations outnumbered non-synonymous mutations. Hence, we conclude that the matK genes have evolved under strong purifying selection, suggesting their role in the evolution of Nymphaeales.
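The likelihood ratio test of the molecular clock can be worked through numerically. The sketch below assumes that the fused entries in Table 1 read as log-likelihoods of -4525.281 (with clock) and -4304.518 (without clock), and that the test has n - 2 degrees of freedom for the n = 64 sequences stated in the Figure 4 caption; both readings are assumptions, not values confirmed elsewhere in the text. The helper name is hypothetical, and the closed-form chi-square tail used here is valid only for an even number of degrees of freedom.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i      # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Assumed reading of Table 1 (see the note above).
lnl_with_clock = -4525.281
lnl_without_clock = -4304.518
n_sequences = 64

lrt_statistic = 2.0 * (lnl_without_clock - lnl_with_clock)
degrees_of_freedom = n_sequences - 2
p_value = chi2_sf_even_df(lrt_statistic, degrees_of_freedom)
print(lrt_statistic, degrees_of_freedom, p_value)
```

Under these assumptions the statistic is about 441.5 on 62 degrees of freedom and the tail probability is of the order of 1e-58, in line with the rejection of the clock reported in the Figure 4 caption.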

Figure 2 Bayesian phylogram (majority-rule consensus tree) inferred from the aligned supermatrix dataset (ITS2 + matK). Nymphaeales phylogeny reconstruction using the GTR sequence evolution model with 10 million generations, a sample frequency of 1000 and a burn-in of 10% discarded, in MrBayes 3.2. The third family, Hydatellaceae, represented by the genus Trithuria, formed a sister basal lineage to Cabombaceae and Nymphaeaceae and clustered with the outgroup (Cycas and Ginkgo).

Motif identification and matching

A total of 27 unique matK motifs were identified by the MEME software [33] and subsequently validated by the MAST tool [34]. We report three motifs each for the genera Brasenia, Cabomba, Barclaya, Euryale, Nuphar, Nymphaea, Ondinea, Victoria and Trithuria, along with their E-values, p-values [35] and the similarity among them, as outlined in Additional File 4. In the proposed motif analysis, which can be further tested for designing barcodes, the same sets of sequences were used both to generate databases and as query sequences for both BLAST [36] and MAST. BLAST queries were run without filtering. Before the database was generated with MAST, the sequences were run through a PERL script that added a reverse complement for each sequence, to ensure that query sequences would match the database in either the forward or the reverse orientation.
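The reverse-complement step was done by the authors with a PERL script that is not reproduced in the text. The following Python sketch shows one way such a step might look; the function names and the "_rc" naming suffix are assumptions made for illustration.

```python
from typing import Dict

# Complement table for DNA; N is kept as N, and any other symbol
# (for example a gap) is left unchanged by str.translate.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def add_reverse_complements(records: Dict[str, str]) -> Dict[str, str]:
    """Return the records plus one reverse-complemented copy of each sequence."""
    augmented = {}
    for name, seq in records.items():
        augmented[name] = seq
        augmented[name + "_rc"] = reverse_complement(seq)  # hypothetical naming scheme
    return augmented


# Toy record only; not a sequence from the study.
database = add_reverse_complements({"toy_matK": "ATGCC"})
for name, seq in database.items():
    print(name, seq)
```

Storing both orientations means a query can be matched without knowing which strand it was read from, which is the purpose the text gives for the step.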

Figure 3 Median-joining network graphs with uncorrected p distances inferred with SplitsTree version 4.10 from the supermatrix (ITS2 + matK).


Figure 4 Maximum Parsimony tree of Nymphaeales using the molecular clock test of the matK sequences. The molecular clock test was performed by comparing the ML value for the given topology with and without the molecular clock constraints under the Jukes-Cantor (1969) model (+G). Differences in evolutionary rates among sites were modeled using a discrete Gamma (G) distribution. The null hypothesis of equal evolutionary rate throughout the tree was rejected at a 5% significance level (P < 1.20575200719741E-58). The analysis involved 64 nucleotide sequences. Evolutionary analyses were conducted in MEGA 5.


ITS2 secondary structure and analysis: a double-edged tool

In the present study, representative ITS2 sequences (Additional Files 5 & 9) were analyzed with the RNAz [37] secondary structure alignment web server using default parameters, as part of an overall secondary structure analysis carried out with several computational approaches. The ITS2 dataset was first aligned in ClustalW [21] and then subjected to RNA structure folding genus by genus in the three families (Nymphaeaceae, Cabombaceae and Hydatellaceae). In the figures, an arrow pointing to the right indicates the forward reading direction relative to the uploaded alignment (Figure 5, Additional File 6). A functional RNA is predicted for alignments with P > 0.5; the higher this value, the more confident the prediction. In standard analysis mode the results are reported for several windows, with probability values for both forward and reverse reading directions. Here we have taken the results of those windows that have a high probability value among all the predicted window outputs. The location, length, number of sequences in the alignment, reading direction, consensus minimum free energy (MFE) structure values, mean z-score etc. are given in tabular format for each group along with their consensus alignments and structures (Figure 5, Additional File 6). The consensus MFE is the average folding energy from the standard energy model. The second term of the consensus MFE, the covariance contribution, indicates "bonus" or "penalty" energies for compensatory/consistent and inconsistent mutations, respectively. 'Combinations/Pair' is a value that helps to quantify compensatory/consistent mutations. It is the number of different base pair combinations in the consensus structure divided by the overall number of pairs in the consensus structure. The z-score was calculated by RNAz as z = (m - μ)/s, where m is the MFE of the given sequence and μ and s are the mean and standard deviation, respectively, of the MFEs of comparable random samples. Negative z-scores indicate that a sequence is more stable than expected by chance. All the representative structures spanning the families Nymphaeaceae and Cabombaceae show negative values, indicating stable secondary structures (Figure 5, Additional File 6). To further validate the conservation of the ITS2 regions in the Order Nymphaeales, we subjected the ITS2 dataset to the LocARNA [38] prediction tool, which takes raw sequences rather than an aligned file. LocARNA itself computed the global consensus regions and produced an alignment file along with the common core secondary structures across the different genera in the order Nymphaeales (Additional File 7). Compatible base pairs are colored: the hue shows the number of different types (C-G, G-C, A-U, U-A, G-U or U-G) of compatible base pairs in the corresponding columns, which reflects the sequence conservation of the base pair. The saturation decreases with the number of incompatible base pairs and hence indicates the structural conservation of the base pair. All the consensus structures clearly exhibit monophyly at the genus level in both families of Nymphaeales.
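The z-score formula above can be illustrated with a few lines of code. This is a minimal sketch of the arithmetic only: the function name and the energy values are hypothetical, the background energies are simply supplied by the caller, and the sketch does not attempt to reproduce how RNAz itself obtains μ and s.

```python
import statistics
from typing import Sequence


def mfe_z_score(native_mfe: float, background_mfes: Sequence[float]) -> float:
    """z = (m - mu) / s for a native MFE m against a sample of background MFEs."""
    if len(background_mfes) < 2:
        raise ValueError("need at least two background energies")
    mu = statistics.mean(background_mfes)
    s = statistics.stdev(background_mfes)  # sample standard deviation
    if s == 0:
        raise ValueError("background energies have zero spread")
    return (native_mfe - mu) / s


# Made-up energies in kcal/mol; not values from the study.
z = mfe_z_score(-60.0, [-40.0, -42.0, -38.0, -41.0, -39.0])
print(round(z, 3))
```

A native structure that folds with much lower free energy than the background sample gives a strongly negative z, which is the sense in which negative z-scores indicate stability beyond chance.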

Primary sequence-secondary structure alignment
To further extend our analysis and compare the multigene supermatrix species tree with the ITS2 secondary structure analysis of the species in the order Nymphaeales, we carried out sequence-structure alignment using 4SALE 1.7 [39], (Profile-) Distance based phylogeny on sequence-structure alignments (ProfDistS) [40] and NJplot [41]. The tree-reconstruction algorithm operated on a 12-letter alphabet comprising the four nucleotides in three structural states (unpaired, paired left, paired right, e.g. ‘A.’, ‘A(’, ‘A)’, ‘U.’, etc.) and combined a general time reversible (GTR) model [42] on the sequence level with a substitution model on morphological features of the structures. Based on the GTR RNA sequence-structure specific substitution model [39], evolutionary distances between sequence-structure pairs were estimated by maximum likelihood and were also extended to the profile level. The secondary structure alignment tree (Figure 6) was then obtained on the RNA sequence-structure level with the help of a pipeline consisting of the ITS2 database, the sequence-structure alignment editor 4SALE [39] and the phylogenetic reconstruction tool ProfDistS [40]. The secondary structure alignment tree resolved the monophyletic nature of the three families Nymphaeaceae, Hydatellaceae and Cabombaceae within the order Nymphaeales with supportive bootstrap values (Figure 6). Cycas and Ginkgo were used as outgroups to root the tree. The members of the Hydatellaceae clustered together with the members of Cabombaceae, which indicates that Hydatellaceae is part of a larger ancient lineage with more evolved and diverse modifications for an aquatic habitat than previously recognized. The overall tree topology was congruent with the earlier results of ML, network analysis and Bayesian phylogeny. Further, mountain graphs of the ITS2 RNA secondary structures were computed in the MATLAB R2012a environment. Each base is represented by a dot in a two-dimensional plot, where the base position is on the abscissa (x-axis) and the number of base pairs enclosing a given base is on the ordinate (y-axis). The mountain peaks with blue dots (paired) and red dots (unpaired) are

Table 1 Results from a test of molecular clocks using the Maximum Likelihood method for the Nymphaeales matK sequences.

lnL (+G) (+I)

With Clock -4525.281640.931 n/a n/a

Without Clock -4304.5181262.42 n/a n/a

Biswal et al. BMC Bioinformatics 2012, 13(Suppl 17):S26 http://www.biomedcentral.com/1471-2105/13/S17/S26


plotted across the Nymphaeales taking 3 representative sequences from each genus (Additional File 8), and the results were in agreement with the LocARNA results.
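The mountain representation described above can be illustrated with a short sketch. It assumes the structure is available in dot-bracket notation; the function name `mountain_profile` and the toy structure are hypothetical and are not taken from the MATLAB workflow used in this study.

```python
def mountain_profile(dot_bracket):
    """Return (position, height, is_paired) for every base.

    The height of a base is the number of base pairs that enclose it,
    which is the value plotted on the y-axis of a mountain graph.
    """
    profile = []
    depth = 0  # number of base pairs currently open
    for position, symbol in enumerate(dot_bracket, start=1):
        if symbol == "(":
            # The opening base is enclosed by the pairs opened before it.
            profile.append((position, depth, True))
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure: extra ')'")
            profile.append((position, depth, True))
        elif symbol == ".":
            profile.append((position, depth, False))
        else:
            raise ValueError(f"unexpected symbol {symbol!r}")
    if depth != 0:
        raise ValueError("unbalanced structure: unclosed '('")
    return profile


# Toy hairpin (hypothetical): two stacked pairs around a two-base loop.
for position, height, paired in mountain_profile("((..))"):
    print(position, height, "paired" if paired else "unpaired")
```

Plotting height against position, with paired and unpaired bases in different colours, gives the kind of graph described in the text.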

Discussion
Where species dispersal is poor, populations are relatively isolated from one another, so individual neutral mutational variants spread slowly throughout a species range. A poorly dispersed species will therefore attain monophyly at a particular locus more slowly than a species whose populations are connected by regular gene flow. Hence, species-specific barcodes are difficult to obtain for poorly dispersed species. Since plastid markers in water lilies are paternally inherited, and travel

Figure 5 ITS2 consensus secondary structures of Nymphaeales with color legend using RNAz and LocARNA. Validation of conserved ITS2 secondary structures across the three Nymphaeales families (Cabombaceae, Nymphaeaceae and Hydatellaceae). The three families are represented by the genera A. Brasenia, B. Cabomba, C. Euryale, D. Nuphar, E. Nymphaea, F. Victoria and G. Trithuria. Standard nucleotide ambiguity codes are used.


in pollen, they potentially cover larger distances and have better resolving power for species-level delineation, exhibiting consistently greater congruence with morphological species boundaries than maternally inherited mitochondrial markers [15]. There are also instances where multiple species are reported to share plastid DNA haplotypes yet remain distinct for nuclear markers like nrITS, which is again explained by their dispersal ability, i.e., plastid DNA is poorly dispersed compared to nrITS. Thus, combining markers with varied dispersal ability, by augmenting plant barcodes with nuclear markers, provides an optimal choice [18].

The markers used in our study are from both plastid and nrDNA: the matK and ITS2 sequences were subjected to multiple sequence alignment and refined with Mesquite. Postulated indels were treated as missing data, and the prealigned marker datasets were concatenated to produce a fusion matrix and a supertree was generated. The combined ITS2+matK approach captured the idiosyncratic behaviour of both markers, which potentially contributed to species grouping across different clades of the order Nymphaeales.
This study represents the exclusive molecular dataset for matK genes as potential markers for motif discovery

Figure 6 Profile Neighbour Joining (PNJ) tree from primary sequence-secondary structure alignment of Nymphaeales ITS2 data using 4SALE and ProfDistS. The simple Jukes and Cantor correction formula (Jukes and Cantor, 1969) operated on sequence-structure alignments. Based on the GTR RNA sequence-structure specific substitution model, evolutionary distances between sequence-structure pairs are estimated by maximum likelihood and are also extended to the profile level. The group Hydatellaceae clustered with Cabombaceae and emerged as a sister clade to Nymphaeaceae. Consensus bootstrap values with 100 replicates are shown next to the branches; the ProfDistS output tree file was viewed in NJplot. Profiles are marked by “Pi” (profile generated by identity threshold), “Pb” (profile generated by bootstrap threshold) and “Po” (old profile generated in a previous iteration).


till date. Owing to a relatively high percentage of variable and informative characters, our dataset not only comprises a high number of informative characters for Nymphaeales but is also characterised by low degrees of homoplasy and a strong phylogenetic signal. The ML method as well as the Bayesian approach yielded the same results, with an exactly matching topology and well-supported nodes. The results confirm several earlier hypotheses on phylogenetic relationships of the order Nymphaeales and corroborate the monophyly of Nymphaeaceae and Cabombaceae, which has been convincingly proposed before based on integrated morphological, anatomical and molecular characters [43]. Barclaya serves as an outgroup to the monophyletic grouping of Nymphaea, Ondinea, Victoria and Euryale. It also supports the Victoria-Euryale grouping that was long predicted based on seed morphology and the presence of spines [43]. Though Nymphaeales is a monophyletic group within the basal angiosperms, the monophyly of Nymphaeaceae is not fully convincing owing to the Victoria-Euryale and Ondinea grouping. The classification of Nymphaea in India has been reported to be confusing; a molecular taxonomic revision of four Indian representatives of the genus, namely N. nouchali, N. pubescens, N. rubra and N. tetragona, based on ITS, the trnK intron and the matK gene, was carried out by us earlier. Molecular evidence was in disagreement about the taxonomic identity of one specimen of N. nouchali and indicated a probable misidentification of N. tetragona. Interestingly, sequence analysis revealed no or low sequence divergence between N. pubescens and N. rubra [44-46]. Further, in the present study we tried to trace evolutionary relationships among the genera of the order Nymphaeales by comparing the nucleotide sequences of the plant genomic and chloroplast DNA. For the first time, we have drawn upon a large dataset of publicly available matK and ITS2 markers to discuss Nymphaeales phylogeny with a molecular morphometrics approach. Several authors [48,49] considered assigning Barclaya to a separate family, Barclayaceae, arguing that the genus is quite distinct in its palynological features, the structure of the ovule and the karyotype. In the present study, our secondary structure alignment data (Figure 6) indicate that the region analysed in these studies is too short to enable verification of a phylogenetic hypothesis. Nevertheless, we obtained results favourable to placing Barclaya in a different family, and with a more diverse dataset we can target appropriate phylogenetic signals for considering Barclaya as a separate family.

The fact that Nymphaea and Victoria are sister genera in our study is quite expected, as both are highly evolved representatives of Nymphaeales. The recent study on Hydatellaceae [4] that identified it as a new branch near the base of the angiosperm phylogeny was also reflected in our molecular morphometric analysis. Earlier ideas on the relationship of Hydatellaceae with the monocot family Centrolepidaceae and its current placement within the early-divergent angiosperm order Nymphaeales have been of considerable interest to taxonomists. In general, the view of monocots as a well-defined monophyletic unit derived from within the paraphyletic group of basal dicots [50] is one of the morphology-based theories most readily supported by molecular data. Extensive molecular phylogenetic studies have allowed only one refinement to the classical circumscription of monocots, with a total complement of 65000 species and 3000 genera [51]. Specifically, the family Hydatellaceae (twelve species in a single genus) [52] has been transferred from monocots to the early-divergent angiosperms. In the present study our data supported this finding with the aid of the ITS2 secondary structure alignment, the Bayesian network, supermatrix trees from the concatenated loci (ITS2+matK) (Figures 1, 2, 3, 6), analysis of the matK dataset and molecular clock MP trees (Figure 4). For the first time we have used an extensive molecular morphometrics phylogeny to support Hydatellaceae as a sister group to Nymphaeales.

Besides, the genus Nuphar emerges as a monophyletic group, with all the Nuphar species forming a single cluster with well-supported bootstrap values. Nuphar takes the middle position between these two genera. However, according to the molecular data Nuphar (possessing many specialised synapomorphic features) is basal in the clade, thus making Victoria and Nymphaea closer to each other [53].

However, despite the high number of characters sampled, the monophyly of the Nymphaeaceae is not convincingly supported. More strikingly, the present dataset does not give support for the monophyly of the genus Nymphaea. Nymphaea alba emerges as an outgroup to the order Nymphaeales based on the molecular data as well as the secondary structure data (Figures 1, 2, 3, 5 and 6) and shows parallel evolution with other representatives from the genus Nuphar. In contrast to all previous phylogenetic studies and classifications, it is inferred to be paraphyletic with respect to the Victoria-Euryale clade and to Ondinea. A reason for the scarcity of informative characters at the base of Nymphaeales could be a rapid, early diversification into the three major lineages. Our results support the opinion that a high rate of evolution within this taxon can be explained by the rapid specialisation of these plants for stepwise adaptation to the aquatic environment.

The other objective of our study was to generate motifs for barcode design. The matK genes yielded unique motif regions and thus may provide more variation than other regions in the plant chloroplast genomes. The nr Plant


database from the European Molecular Biology Laboratory (EMBL) was used to test for unique species-specific barcodes that could be used for species-level identification. For this, the sequence belonging to each species was retrieved from the database and used as a query sequence. If the query sequence returned an exact match only to itself, this was scored as a positive identification at the species level. If the query sequence returned an exact match to itself and to other members of the same genus, this was scored as a negative identification at the species level but a positive identification at the genus level. For BLAST, an additional constraint was added to positively score the identification at the genus level, i.e., the best match as well as the next most similar sequence had to match the genus of the query sequence. If any other genus was included in the top two hits, the result was not considered genus specific. The results are exemplary in the current scenario of plant barcoding. We have reported unique genus-specific motif regions in the order Nymphaeales from the matK dataset, which can be further validated for barcoding and for designing PCR primers.
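The scoring rule described above can be written down as a minimal sketch. The function name, its arguments and the example species are hypothetical illustrations; how the exact matches and the top BLAST hits are obtained from the database is not shown and is assumed to happen elsewhere.

```python
def score_identification(query_species, exact_match_species, top_two_hit_genera=None):
    """Score one barcode query at the species and genus level.

    query_species       -- binomial name of the query, e.g. "Nuphar lutea"
    exact_match_species -- species names of all exact matches (query included)
    top_two_hit_genera  -- optional genera of the two best BLAST hits
    """
    query_genus = query_species.split()[0]
    # Exact matches to anything other than the query species itself.
    others = {name for name in exact_match_species if name != query_species}

    species_level = not others
    genus_level = all(name.split()[0] == query_genus for name in others)

    # Additional BLAST constraint: both top hits must share the query genus.
    if top_two_hit_genera is not None:
        genus_level = genus_level and all(
            genus == query_genus for genus in top_two_hit_genera[:2]
        )
    return {"species": species_level, "genus": genus_level}


print(score_identification("Nuphar lutea", ["Nuphar lutea"]))
print(score_identification("Nuphar lutea", ["Nuphar lutea", "Nuphar advena"]))
print(score_identification("Nuphar lutea", ["Nuphar lutea"], ["Nuphar", "Nymphaea"]))
```

Returning the two levels as separate flags keeps the species-level and genus-level outcomes independent, as in the description, where the BLAST constraint applies to the genus level only.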

Conclusions
The increased application of molecular data in plant systematics has led to an avalanche of sequence profiles flooding the public domain. With judicious use of these data as phylogenetic signals, the goal of finding universal primer pairs for studying plant genomes should become less troublesome. The unique motifs reported may further be validated for designing barcodes. With Nymphaeales as a case study, it is quite surprising to observe how stepwise adaptation to an aquatic lifestyle has had an impact on water lily evolution, with the generation of morphological complexity. For the first time we have reported an ITS2 secondary structure alignment and a phylogeny based on molecular morphometrics that is strongly congruent with the current placement of the family Hydatellaceae within the early-divergent angiosperm order Nymphaeales. Though we are far from completely understanding the selective forces behind these transformations, the phylogenetic signals contained in the comparatively small marker datasets are a source of inspiration to broaden our views on water lily origin and evolution in time.

Methods
Taxon sampling and sequence analysis
The dataset used in the present study comprises 64 matK and 67 ITS2 sequences from species representing the three families Cabombaceae, Nymphaeaceae and Hydatellaceae of the order Nymphaeales, retrieved from GenBank [54] via Ebot (http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi), an open source interactive tool that generates a Perl script implementing an E-utility pipeline for retrieving large datasets from the National Center for Biotechnology Information (NCBI) with keywords and Boolean operators. Information on all the species along with GenBank accessions, sequence length and AT and GC content of both the markers is summarized in Additional files 5 and 9. The sequences were aligned with ClustalW [21], edited manually and concatenated to generate a supermatrix using FASconCAT [22]. Subsequently, the concatenated files were processed in Mesquite for the file format conversions required by the ML and Bayesian programs.
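The concatenation step can be pictured with the following sketch, which is not the FASconCAT implementation. It assumes each marker alignment is held as a dictionary from taxon name to aligned sequence, and that a taxon absent from one marker is padded with a missing-data symbol; the symbol `?`, the function name and the toy sequences are all assumptions made for illustration.

```python
def concatenate_alignments(alignments, missing="?"):
    """Join per-marker alignments into one supermatrix.

    alignments -- list of dicts mapping taxon name to aligned sequence;
                  within a dict all sequences must have the same length.
    """
    # Every taxon that occurs in at least one marker gets a row.
    taxa = sorted(set().union(*(aln.keys() for aln in alignments)))
    supermatrix = {taxon: "" for taxon in taxa}

    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Pad taxa that lack this marker with missing-data symbols.
            supermatrix[taxon] += aln.get(taxon, missing * width)
    return supermatrix


# Toy data (hypothetical): taxon_B has no matK sequence.
its2 = {"taxon_A": "AC-G", "taxon_B": "ACTG"}
matk = {"taxon_A": "TTT"}
print(concatenate_alignments([its2, matk]))
```

Because 64 matK and 67 ITS2 sequences were used, some taxa necessarily lack one of the two markers, which is why the padding step matters in a sketch of this kind.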

Phylogenetic reconstruction
The supermatrix dataset (matK + ITS2) covering the order Nymphaeales was first analysed separately through ML and Bayesian inference [55]. ML analyses were conducted with PhyML 3.0 [27]. Node support was assessed through aLRT and bootstrapping.
For Bayesian inference [55], the best models of molecular evolution were determined with the aid of MrModeltest version 2.2 [56]. A Bayesian analysis using MrBayes [23] was then carried out for tree construction using a general time reversible (GTR) substitution model with substitution rates estimated by MrBayes. Metropolis-Coupled Markov Chain Monte Carlo (MCMCMC) sampling was performed with two incrementally heated chains that were run in combination for 20,000 generations. The convergence of the MCMCMC was then monitored by examining the value of the marginal likelihood through the generations. Coalescence of the substitution rate and rate model parameters was also examined. The average standard deviation of split frequencies was checked, and generations were added until the standard deviation was below 0.01. For the analysis we ran 10,000,000 generations with a sample frequency of 1000. The values differed slightly because of stochastic effects. The samples of substitution model parameters and the samples of trees and branch lengths were summarized by the “sump burnin” and “sumt burnin” commands, respectively. The burn-in values in these commands were set to 25% of our samples. A cladogram with the posterior probabilities for each split and a phylogram with mean branch lengths were generated and subsequently read by Mesquite [30]. An alternative method using network analysis was performed using SplitsTree4 [31] with the variable positions in the aligned supermatrix dataset. The alignment file was converted to NEXUS format, readable by SplitsTree4 [31], with READSEQ [57] at the European Bioinformatics Institute (EBI) server.
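As a worked check of the sampling arithmetic above, the following sketch computes how many samples a run produces and how many of them a 25% burn-in discards. The function names are hypothetical, and the count deliberately ignores the initial state at generation 0, which is an assumption of this sketch.

```python
def burnin_samples(generations, sample_frequency, burnin_fraction=0.25):
    """Return (number of samples, number of samples discarded as burn-in)."""
    n_samples = generations // sample_frequency
    return n_samples, int(n_samples * burnin_fraction)


def has_converged(avg_std_dev_split_freq, threshold=0.01):
    """Stopping rule from the text: average standard deviation below 0.01."""
    return avg_std_dev_split_freq < threshold


n_samples, n_burnin = burnin_samples(10_000_000, 1000)
print(n_samples, n_burnin)   # 10000 samples, of which 2500 are burn-in
print(has_converged(0.008))  # True
```

With 10,000,000 generations sampled every 1000 generations, the burn-in value passed to the summary commands would therefore be about 2500 under these assumptions.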


Estimation of molecular clock rates
The molecular clock test was performed by comparing the ML value for the given topology with and without the molecular clock constraint under the Jukes-Cantor model [32]. The null hypothesis of an equal evolutionary rate throughout the tree was rejected at the 5% significance level (P < 1.20575200719741E-58). The analysis involved 64 matK sequences and was computed using MEGA5 [58].
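The comparison described above is a likelihood ratio test, sketched below with the standard library only. Two assumptions are made: the log-likelihoods are read from Table 1 as -4525.281 (with clock) and -4304.518 (without clock), and the test has n - 2 = 62 degrees of freedom for n = 64 sequences. The function names are hypothetical and this is not the MEGA5 code.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For even df the tail has the closed form
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0   # the i = 0 term
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def clock_lrt(lnl_clock, lnl_free, n_sequences):
    """Likelihood ratio test of the clock (null) against the free-rate model."""
    statistic = 2.0 * (lnl_free - lnl_clock)
    df = n_sequences - 2  # assumed degrees of freedom
    return statistic, df, chi2_sf_even_df(statistic, df)


statistic, df, p_value = clock_lrt(-4525.281, -4304.518, 64)
print(statistic, df, p_value)
```

Under these assumptions the statistic is about 441.5 and the p-value is of the order of 1E-58, in line with the rejection of the clock hypothesis reported above.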

Analysis of synonymous and non-synonymous substitution rates
Non-synonymous mutations in a DNA sequence cause a change in the translated amino acid sequence, whereas synonymous mutations do not. The comparison between the number of non-synonymous mutations (dn or Ka) and the number of synonymous mutations (ds or Ks) can suggest whether, at the molecular level, natural selection is acting to promote the fixation of advantageous mutations (positive selection) or to remove deleterious mutations (purifying selection). In general, when positive selection dominates, the Ka/Ks ratio is greater than 1; in this case, diversity at the amino acid level is favoured, likely due to the fitness advantage provided by the mutations. Conversely, when negative selection dominates, the Ka/Ks ratio is less than 1; in this case, most amino acid changes are deleterious and, therefore, are selected against. When the positive and negative selection forces balance each other, the Ka/Ks ratio is close to 1. The dS/dN ratio was computed on the matK sequences only, in MEGA5 [58], to test the positive and purifying selection hypotheses.
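The interpretation of the Ka/Ks ratio given above can be summarised in a few lines. This is only an illustrative sketch: the function name, the example rates and in particular the tolerance used to decide that a ratio is "close to 1" are assumptions, whereas the analysis in this study relied on the codon-based tests in MEGA5.

```python
def selection_regime(ka, ks, tolerance=0.05):
    """Classify a Ka/Ks ratio as positive, purifying or neutral."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if ratio > 1 + tolerance:
        return "positive"
    if ratio < 1 - tolerance:
        return "purifying"
    return "neutral"


print(selection_regime(0.02, 0.10))  # ratio 0.2
print(selection_regime(0.30, 0.10))  # ratio 3.0
print(selection_regime(0.10, 0.10))  # ratio 1.0
```

A formal test additionally asks whether the departure from 1 is statistically significant, which a fixed tolerance does not capture.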

Motif identification and testing
The matK sequence motifs were identified from the aligned sequences using the PRATT software [59]. In addition, the dataset in FASTA format was fed to MEME [33] to determine highly significant motifs without any gaps; patterns with variable-length gaps, if any, were split by MEME into one or more separate motifs. The motif sites were listed in order of increasing statistical significance (p-value) [35]. The p-value of a site is computed from the match score of the site with the Position Specific Scoring Matrix (PSSM) for the motif. Further, individual datasets for Nymphaea and Nuphar were subjected to MEME to find the best motifs. The MEME output was subsequently analyzed by MAST [34] to depict the best-scoring matches and similarity to other motifs. A match score is computed if the match fits completely within the sequence and is reported in terms of the P-value of the match. MAST takes into account four types of events for calculating the P-value, namely the position P-value, sequence P-value, combined P-value and the E-value [35].
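To make the notion of a PSSM match score concrete, the following sketch builds a log-odds matrix from a few aligned sites and scores a candidate site against it. It is not the MEME or MAST implementation: the pseudocount, the uniform background frequency, the function names and the toy sites are assumptions, and the conversion of a score into a p-value is not shown.

```python
import math


def build_pssm(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (one dict per column) from aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(alphabet)
    pssm = []
    for col in range(width):
        column = [site[col] for site in sites]
        pssm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in alphabet
        })
    return pssm


def match_score(pssm, site):
    """Sum the column scores of the bases observed in the candidate site."""
    if len(site) != len(pssm):
        raise ValueError("site length must equal motif width")
    return sum(column[base] for column, base in zip(pssm, site))


# Toy motif sites (hypothetical).
pssm = build_pssm(["ACGT", "ACGA", "ACGT"])
print(match_score(pssm, "ACGT"), match_score(pssm, "TTTT"))
```

A site resembling the motif receives a higher score than an unrelated one; a p-value then expresses how likely a random site is to score at least as well.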

ITS2 secondary structure prediction and analysis
RNA secondary structure prediction for the ITS2 sequences was carried out with the MATLAB R2012a rnafold [60] and rnaplot [61] functions, which use the nearest-neighbor model and minimize the total free energy associated with an RNA structure. The minimum free energy was estimated by summing individual energy contributions from base pair stacking, hairpins, bulges, internal loops and multi-branch loops. The energy contributions of these elements are sequence- and length-dependent and have been experimentally determined. More specifically, the algorithm implemented in rnafold uses dynamic programming to compute the energy contributions of all possible elementary substructures and then predicts the secondary structure by considering the combination of elementary substructures whose total free energy is minimum. In this computation, the contribution of coaxially stacked helices is not accounted for, and the formation of pseudoknots (non-nested structural elements) is forbidden. Rnaplot (RNA2ndStruct) was used for drawing the RNA secondary structures of ITS2 with the format value ‘Mountain’, so that the secondary structures were rendered as mountain graphs in the MATLAB R2012a environment.
Besides, consensus structures of the ITS2 regions were predicted using the RNAz server and LocARNA from the Freiburg RNA Tools server, which outputs a multiple alignment together with a consensus structure. For the folding, a very realistic energy model for RNAs was used that features RIBOSUM-like similarity scoring and realistic gap costs. The high performance of LocARNA [38] was mainly achieved by employing base pair probabilities during the alignment procedure. Results of the various species were compared to unravel the folding pattern common to them all, establishing the conserved structural models across several genera of Nymphaeales using 4SALE [39]; these were subsequently incorporated in ProfDistS [40] to generate the molecular morphometrics phylogeny. The ProfDistS [40] output was read by NJplot [41].
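The sequence-structure phylogeny rests on a 12-letter alphabet in which each nucleotide is combined with one of three structural states (unpaired, paired left, paired right). A minimal sketch of that encoding is given below; it assumes the structure is supplied in dot-bracket notation, and the function name and toy input are hypothetical rather than part of 4SALE or ProfDistS.

```python
STATES = ".()"  # unpaired, paired left, paired right
ALPHABET = [base + state for base in "ACGU" for state in STATES]  # 12 letters


def encode_sequence_structure(sequence, structure):
    """Translate a sequence and its dot-bracket structure into 12-letter symbols."""
    rna = sequence.upper().replace("T", "U")
    if len(rna) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    symbols = [base + state for base, state in zip(rna, structure)]
    for symbol in symbols:
        if symbol not in ALPHABET:
            raise ValueError(f"symbol {symbol!r} is outside the 12-letter alphabet")
    return symbols


print(len(ALPHABET))
print(encode_sequence_structure("GGAUCC", "((..))"))
```

A substitution model defined over these twelve symbols can then score changes in sequence and in pairing state together, which is the idea behind the molecular morphometrics tree.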

Additional material

Additional file 1: Concatenated aligned supermatrix dataset of ITS2 and matK generated using FASconCAT version 1.0. The concatenated aligned supermatrix file is in nexus format and can be viewed in Mesquite.

Additional file 2: Codon-based Test of Positive Selection (dS/dN) analysis for matK sequences.

Additional file 3: Codon-based Test of Purifying Selection (dS/dN) analysis for matK sequences.

Additional file 4: Top scoring unique motif sequence matches shown for each of the matK sequences in the order Nymphaeales.


Additional file 5: Nucleotide composition and GC content of ITS2 sequences of Nymphaeales.

Additional file 6: Consensus alignment of ITS2 sequences showing conserved regions for secondary structure prediction across Nymphaeales. The three families are represented by the genera Brasenia, Cabomba, Euryale, Nuphar, Nymphaea, Victoria and Trithuria. Standard nucleotide ambiguity codes are used.

Additional file 7: Overall summary of secondary structures for ITS2 multiple alignment of Nymphaeales (Brasenia, Cabomba, Euryale, Nuphar, Nymphaea and Victoria) showing detailed information (z-score, structure conservation index, RNAz P-value, etc.) along with a Dot Plot graph.

Additional file 8: Matlab generated mountain graph plots of Nymphaeales (ITS2 sequences).

Additional file 9: Nucleotide composition and GC content of matK sequences of Nymphaeales.

Acknowledgements
The work is supported by the Department of Biotechnology, Government of India sponsored Bioinformatics Centre at North-Eastern Hill University, Shillong, Meghalaya, India.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.

Authors’ contributions
PT and DKB conceived of the study and participated in its design, coordination and manuscript writing. DKB performed the computational analysis, participated in the design of the study and manuscript preparation. MD participated in the computational analysis and literature screening. SK carried out the secondary structure analysis and developed Perl scripts for the present study. All authors have read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Published: 13 December 2012

References
1. Borsch T, Hilu KW, Wiersema JH, Löhne C, Barthlott W, Wilde V: Phylogeny of Nymphaea (Nymphaeaceae): evidence from substitutions and microstructural changes in the chloroplast trnT-trnF region. Int J Pl Sci 2007, 168:639-671.

2. Qiu YL, Dombrovska O, Lee J, Li L, Whitlock BA, Bernasconi-Quadroni F, Rest JS, Davis CC, Borsch T, Hilu KW, Renner SS, Soltis DE, Soltis PE, Zanis MJ, Cannone JJ, Powell M, Savolainen V, Chatrou LW, Chase MW: Phylogenetic analyses of basal angiosperms based on nine plastid, mitochondrial, and nuclear genes. Int J Pl Sci 2005, 166(5):815-842.

3. Qiu YL, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zanis M, Zimmer EA, Chen Z, Savolainen V, Chase MW: The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nat 1999, 402(6760):404-407.

4. Saarela JM, Rai HS, Doyle JA, Endress PK, Mathews S, Marchant AD, Briggs BG, Graham SW: Hydatellaceae identified as a new branch near the base of the angiosperm phylogenetic tree. Nat 2007, 446(7133):312-325.

5. Löhne C, Borsch T, Wiersema JH: Phylogenetic analysis of Nymphaeales using fast-evolving and noncoding chloroplast markers. Bot J Linn Soc 2007, 154(2):141-163.

6. Les DH, Schneider EL, Padgett DJ, Soltis PS, Soltis DE, Zanis M: Phylogeny, Classification and Floral Evolution of Water Lilies (Nymphaeaceae; Nymphaeales): A Synthesis of Non-molecular, rbcL, matK, and 18S rDNA Data. Syst Bot 1999, 24(1):28-46.

7. Les DH, Garvin DK, Wimpee CF: Molecular evolutionary history of ancient aquatic angiosperms. Natl Acad Sci USA 1991, 88:10119-10123.

8. Ito M: Phylogenetic systematics of the Nymphaeales. Botanical Magazine (Tokyo) 1987, 100:17-35.

9. Cronquist A: The evolution and classification of flowering plants. Bronx, NY: The New York Botanical Garden; 1988.

10. Chase MW, Soltis D, Olmstead RG, Morgan D, Les DH, Mishler BD, Duvall MR, Price R, Hills HG, Qiu Y-L, Kron KA, Rettig JH, Conti E, Palmer JD, Manhart JR, Sytsma KJ, Michaels HJ, Kress JW, Karol KG, Clark WD, Hédren M, Gaut BS, Jansen RK, Kim K, Wimpee CF, Smith JF, Furnier GR, Strauss SH, Xiang Q-Y, Plunkett GM, Soltis PS, Swensen SM, Williams SE, Gadek PA, Quinn CJ, Eguiarte LE, Golenberg EM, Learn GH Jr, Graham SW, Barrett SCH, Dayanandan S, Albert VA: Phylogenetics of seed plants: an analysis of nucleotide sequences from the plastid gene rbcL. Annals of the Missouri Botanical Garden 1993, 80:528-580.

11. Savolainen V, Chase MW, Morton CM, Soltis DE, Bayer C, Fay MF, De Bruijn A, Sullivan S, Qiu Y-L: Phylogenetics of flowering plants based upon a combined analysis of plastid atpB and rbcL gene sequences. Syst Biol 2000, 49:306-362.

12. Nandi OI, Chase MW, Endress PK: A combined cladistics analysis of angiosperms using rbcL and non-molecular data sets. Annals of the Missouri Botanical Garden 1998, 85:137-212.

13. Williamson PS, Schneider EL: Cabombaceae. In The families and genera of vascular plants II. Edited by Kubitzki K, Rohwer JG, Bittrich V. Berlin: Springer; 1993:157-161.

14. Little DP: DNA barcode sequence identification incorporating taxonomic hierarchy and within taxon variability. PLoS One 2011, 6(8):e20552.

15. Hollingsworth PM, Graham SW, Little DP: Choosing and using a plant DNA barcode. PLoS One 2011, 6(5):e19254.

16. Lahaye R, van der Bank M, Bogarin D, Warner J, Pupulin F: DNA barcoding the floras of biodiversity hotspots. Proc Natl Acad Sci 2008, 105:2923-2928.

17. Li FW, Kuo LY, Rothfels CJ, Ebihara A, Chiou WL, Windham MD, Pryer KM: rbcL and matK earn two thumbs up as the core DNA barcode for ferns. 2011, 6(10):e26597.

18. Chen S, Yao H, Han J, Liu C, Song J, Shi L, Zhu Y, Ma X, Gao T, Pang X, Luo K, Li Y, Li X, Jia X, Lin Y, Leon C: Validation of the ITS2 region as a novel DNA barcode for identifying medicinal plant species. PLoS One 2010, 5(1):e8613.

19. Chaveerach A, Tanee T, Sudmoon R: Molecular identification and barcodes for the genus Nymphaea. Acta Biol Hung 2011, 62:328-340.

20. Coleman AW: ITS2 is a double-edged tool for eukaryote evolutionary comparisons. Trends Genet 2003, 7:370-375.

21. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22:4673-80.

22. Kück P, Meusemann K: FASconCAT: Convenient handling of data matrices. Mol Phylogenet Evol 2010, 56(3):1115-1158.

23. Gatesy J, De Salle R, Wheeler W: Alignment-ambiguous nucleotide sites and the exclusion of systematic data. Mol Phylogenetics Evol 1993, 2:152-157.

24. Takezaki N, Rzhetsky A, Nei M: Phylogenetic test of the molecular clock and linearized trees. Mol Biol and Evol 2004, 12:823-833.

25. Hoot SB, Douglas AW: Phylogeny of the Proteaceae based on atpB and atpB-rbcL intergenic spacer region sequences. Aust J Syst Bot 1998, 11:301-320.

26. Levinson G, Gutman G: Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol 1987, 4:203-221.

27. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate Maximum-Likelihood phylogenies: assessing the performance of PhyML 3.0. Sys Biol 2010, 59(3):307-321.

28. Shimodaira H, Hasegawa M: Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol 1999, 16(8):1114.

29. Ronquist F, Huelsenbeck JP: MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 2003, 19:1572-1574.

30. Maddison WP, Maddison DR: Mesquite: a modular system for evolutionary analysis. Version 2.75 [http://mesquiteproject.org].

31. Huson DH, Bryant D: Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 2006, 23(2):254-267.

32. Jukes TH, Cantor CR: Evolution of protein molecules. In Mammalian Protein Metabolism. Edited by Munro HN. New York: Academic Press; 1969:21-132.

33. Bailey TL, Williams N, Misleh C, Li WW: MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res 2006, 34:W369-W373.


34. Bailey TL, Bodén M, Buske FA, Frith M, Grant EC, Clementi L, Ren J, Li WW, Noble WS: MEME SUITE: tools for motif discovery and searching. Nuc Acids Res 2009, 37:W202-W208.

35. Bailey TL, Gribskov M: Combining evidence using p-values: application to sequence homology searches. Bioinformatics 1998, 14:48-54.

36. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 3:403-10.

37. Gruber AR, Neuböck R, Hofacker IL, Washietl S: The RNAz web server: prediction of thermodynamically stable and evolutionarily conserved RNA structures. Nucleic Acids Res 2007, 35:335-8.

38. Smith C, Heyne S, Richter SA, Will S, Backofen R: Freiburg RNA Tools: a web server integrating IntaRNA, ExpaRNA and LocARNA. Nucleic Acids Res 2010, 38:373-7.

39. Seibel PN, Müller T, Dandekar T, Wolf M: Synchronous visual analysis and editing of RNA sequence and secondary structure alignments using 4SALE. BMC Res Notes 2008, 1:91.

40. Wolf M, Ruderisch B, Dandekar T, Müller T: ProfdistS: (Profile-) Distance based phylogeny on sequence-structure alignments. Bioinformatics 2008, 24:2401-2402.

41. Perrière G, Gouy M: WWW-Query: An on-line retrieval system for biological sequence banks. Biochimie 1996, 78:364-369.

42. Waddell PJ, Steel MA: General time-reversible distances with unequal rates across sites: mixing gamma and inverse Gaussian distributions with invariant sites. Mol Phylogenet Evol 1997, 8(3):398-414.

43. Thorne R: Classification and geography of the flowering plants. Bot Rev 1992, 58:225-248.

44. Dkhar J, Kumaria S, Rao SR, Tandon P: Molecular phylogenetics and taxonomic reassessment of four Indian representatives of the genus Nymphaea. Aquatic Botany 2010, 93:135-139.

45. Dkhar J, Kumaria S, Tandon P: Nymphaea alba var. rubra is a hybrid of N. alba and N. odorata as evidenced by molecular analysis. Ann Bot Fennici 2011, 48:317-324.

46. Dkhar J, Kumaria S, Tandon P: Molecular adaptation of the chloroplast matK gene in Nymphaea tetragona, a critically rare and endangered plant of India. Plant Genetic Resources 2011, 9:193-196.

47. Takhtajan AL: The System of Magnolyophyta. Nauka, Leningrad 1987.

48. Cronquist A: An integrated system of classification of flowering plants, Columbia University Press. New York 1981, 1262.

49. Les DH, Schneider EL, Padgett DJ, Soltis PS, Soltis DE, Zanis M: Phylogeny, classification and floral evolution of water lilies (Nymphaeaceae; Nymphaeales): a synthesis of non-molecular, rbcL, matK, and rDNA Data. Syst Bot 1999, 24:2428-2446.

50. Takhtajan A: Systema Magnoliophytorum. Nauka, Leningrad; 1987.

51. Takhtajan A: Flowering Plants. Springer, New York; 2009.

52. Sokoloff DD, Remizowa MV, Macfarlane TD, Rudall PJ: Classification of the early-divergent angiosperm family Hydatellaceae: one genus instead of two, four new species and sexual dimorphism in dioecious taxa. Taxon 2008, 57:179-200.

53. Les DH, Schneider EL, Padgett DJ, Soltis PS, Soltis DE, Zanis M: Phylogeny, classification and floral evolution of water lilies (Nymphaeaceae; Nymphaeales): a synthesis of non-molecular, rbcL, matK, and 18S rDNA data. Syst Bot 1999, 24:28-46.

54. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res 2011, 39:D32-7.

55. Posada D, Buckley TR: Model selection and model averaging in phylogenetics: advantages of the AIC and Bayesian approaches over likelihood ratio tests. Syst Biol 2004, 53:793-808.

56. Nylander JA: MrModeltest v2. Program distributed by the author. Uppsala University: Evolutionary Biology Centre; 2004.

57. Gilbert D: Sequence file format conversion with command-line readseq. Curr Protoc Bioinformatics 2003, Appendix 1:Appendix 1E.

58. Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol 2007, 24:1596-1599.

59. Jonassen I: Efficient discovery of conserved patterns using a pattern graph. Comput Appl Biosci 1997, 13(5):509-522.

60. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer S, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte f Chemie 1994, 125:167-188.

61. Lorenz R, Bernhart SH, Hoener zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL: ViennaRNA Package 2.0. Algorithms Mol Biol 2011, 6:26.

doi:10.1186/1471-2105-13-S17-S26
Cite this article as: Biswal et al.: Phylogenetic reconstruction in the Order Nymphaeales: ITS2 secondary structure analysis and in silico testing of maturase k (matK) as a potential marker for DNA bar coding. BMC Bioinformatics 2012 13(Suppl 17):S26.
