Online Supplementary Material

On the origin of MADS-domain transcription factors

Lydia Gramzow, Markus S. Ritz and Günter Theißen*

Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743 Jena,

Germany

*Corresponding author: Theißen, G. ([email protected])

Methods

Datasets

A list of sampled eukaryotic species, together with information about classification, numbers of

retrieved MADS domains, type and source of data is provided in Table S1. For remote homology

detection, the non-redundant microbial and plant databases available at the National Center for

Biotechnology Information (NCBI) [1] were downloaded. These databases were clustered at the

50% identity level using cd-hit [2] for usability with HHsearch. Default values were used for all

parameters, except that word size was reduced to three. For all clusters, alignments were created

using Clustal W [3], and Hidden Markov Models (HMMs) were constructed using the HMMer

package [4].
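The idea behind the clustering step can be illustrated with a toy greedy scheme. The sketch below is not cd-hit: it uses Python's difflib as a crude stand-in for sequence identity, and the function names approx_identity and greedy_cluster as well as the example sequences are hypothetical. It only shows how one representative is kept per group of sequences that match at the 50% level.

```python
from difflib import SequenceMatcher


def approx_identity(a, b):
    """Crude identity proxy: matched characters / length of the shorter sequence."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs, threshold=0.5):
    """Assign each sequence to the first representative it matches at >= threshold."""
    clusters = {}
    # Longer sequences are processed first and become cluster representatives.
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            if approx_identity(seqs[name], seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical toy sequences, not taken from the study.
toy = {
    "s1": "MGRGKIEIKRIENKTNRQVTF",
    "s2": "MGRGKIEIKRIENKTSRQVTF",
    "s3": "WWWWPPPPCCCCHHHH",
}
print(greedy_cluster(toy))
```

In the actual analysis, each resulting cluster would then be aligned and converted into an HMM, as described above.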

To study the distribution of the MADS domain in eukaryotes, queries on the Entrez protein database

of NCBI [1] and the corresponding annotation databases were carried out for 40 whole genomes

and five EST data sets. All sequences were also translated in the six possible reading frames.

Representative SERUM RESPONSE FACTOR (SRF) - like and MYOCYTE ENHANCER

FACTOR 2 (MEF2) - like MADS domains from plants, animals and fungi were chosen such that

both types of MADS domains and sequences from all major groups of eukaryotes for which MADS

domains have been found are included. These sequences were then aligned manually (Figure 1b).

The alignment was used to create an HMM with the HMMer package [4] which was used as a

search pattern. For sequences yielding an HMMer E-value lower than 1, the occurrence of a MADS

domain was confirmed by scanning against the NCBI conserved domains database [1]. For

sequences that were not present in NCBI or the corresponding genome databases, GlimmerHMM

[5] was used to predict genes in the regions where the MADS domain was identified with all

available training sets.
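Translation in the six possible reading frames can be sketched as follows. This is a minimal illustration that assumes the standard genetic code; the function names translate and six_frame_translation are hypothetical, and the tool actually used for this step is not stated in the text.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]


print(six_frame_translation("ATGGCC"))
```

Each of the six protein strings can then be searched with the MADS-domain HMM, so that domains in unannotated genomic or EST sequence are not missed.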

Remote homology detection

The HMMer package and HHSearch (version 1.5.1) [6] were used to find putatively homologous

sequences to the MADS domain in the non-redundant microbial database. For both methods we

used default parameters, except that the E-value threshold was raised to 80 for the HMMer search and

that HHSearch was set to identify global matches. The results were cross-checked by

reverse searches of the plant non-redundant database, using as queries an HMM of the six topoisomerase IIA,

subunit A (TOPOIIA-A) fragments identified with HMMer and the TOPOIIA-A cluster identified with

HHSearch.

Character state evolution

The type of MADS domain (SRF-like or MEF2-like) was determined for all MADS domains

identified by scanning against the NCBI conserved domains database [1]. The SRF-like and the

MEF2-like MADS domain were then examined separately. Each of them was scored absent only if

none of the searches of a complete eukaryotic genome gave positive results. Character

transformations were reconstructed via likelihood ancestral states and an asymmetric 2-parameter

Markov model with a forward rate of 0.1, a backward rate of 0.9 and equal root state frequencies on

trees corresponding to both rooting hypotheses in Baldauf [7], by using Mesquite, version 2.5 [8].
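For reference, a two-state model of this kind can be written out explicitly. The block below is the textbook form of an asymmetric two-state continuous-time Markov model, not a description of Mesquite's internal implementation; the symbols α (forward rate, gain), β (backward rate, loss) and t (branch length) are notation introduced here, with states ordered as (absent, present) and rows giving the starting state.

```latex
\[
Q = \begin{pmatrix} -\alpha & \alpha \\ \beta & -\beta \end{pmatrix},
\qquad \alpha = 0.1,\quad \beta = 0.9,
\]
\[
P(t) = e^{Qt} = \frac{1}{\alpha+\beta}
\begin{pmatrix}
\beta + \alpha e^{-(\alpha+\beta)t} & \alpha - \alpha e^{-(\alpha+\beta)t} \\
\beta - \beta e^{-(\alpha+\beta)t} & \alpha + \beta e^{-(\alpha+\beta)t}
\end{pmatrix}
\]
```

For long branches the rows of P(t) converge to (β, α)/(α+β) = (0.9, 0.1), so the model by itself favors absence; support for presence at an ancestral node has to come from the observed tip states.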

Alignments and phylogenetic analysis

A dataset of 75 MADS-domain sequences, including the 57 sequences identified before and a

representative sequence of each of the major clades of MADS-domain proteins in Arabidopsis

thaliana, was aligned using Muscle, version 3.6 [9] with default settings (Figure S1). Phylogenetic

analyses were carried out by the maximum likelihood method using the RAxML program [10], with

the WAG [11] model of amino acid substitutions and 1000 bootstrap replicates. The best-fitting

model was determined using ProtTest, version 1.4 [12].
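The bootstrap procedure resamples alignment columns with replacement and repeats the tree search on each pseudo-alignment. The sketch below shows only the resampling step; it is not RAxML's implementation, and the function name bootstrap_alignment and the toy alignment are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping rows (sequences) intact."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


# Hypothetical toy alignment, not taken from Figure S1.
toy_alignment = {"seq1": "MGRGK", "seq2": "MGRKK", "seq3": "MARGR"}
replicate = bootstrap_alignment(toy_alignment, random.Random(1))
print(replicate)
```

A bootstrap value for a branch is then the fraction of replicate trees (here 1000) in which that branch is recovered.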

TOPOIIA-A sequences from the non-redundant microbial database were aligned using Clustal W

[3] with default settings. The position of partial sequences which were identified to be putatively

homologous to the MADS domain was assigned and pairwise Clustal W scores of this partial

alignment were determined.
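A pairwise score of this kind can be approximated as percent identity over the aligned region. The sketch below assumes that the score is the number of identical residues divided by the number of positions compared, ignoring gapped positions; Clustal W's exact scoring may differ in detail, and the function name percent_identity is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical residues as a percentage of positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


# Hypothetical aligned fragments.
print(round(percent_identity("MGRGK-IE", "MGKGKLIE"), 1))
```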

Structural Analysis

The Phyre protein structure prediction server [13] was used to model the structure of DNA

topoisomerase IV, subunit A of A. variabilis. The server identified the solved structure of gyrase A,

C-terminal domain from B. burgdorferi (PDB identifier: 1SUU) to be a good template to model the

structure (E-value 1.7e-33). The modeled structure was compared to the solved structure of SRF

core from H. sapiens (PDB identifier: 1SRS).

Acquisitions and losses of SRF-like and MEF2-like MADS domains

We used a 2-parameter Markov model with a forward rate (gain of the MADS domain) of 0.1

and a backward rate (loss) of 0.9 (meaning that it is 9 times more likely to lose a MADS domain

than to gain one). Under this model, the likelihood that the SRF-like and the MEF2-like MADS

domain, respectively, were present in the most recent common ancestor (MRCA) of extant

eukaryotes, is 0.60 and 0.70 for rooting hypothesis I, and 0.84 and 0.92 for rooting hypothesis II

(Figure 2a). The MADS domain is assumed to be of monophyletic origin [14], and the convergent

evolution of a defined DNA-binding domain two or more times independently also appears extremely

unlikely to us. A forward rate of 0.1 was chosen to account for possible events of horizontal

gene transfer (HGT) and is still comparatively high. Nevertheless, the probabilities that SRF-like and

MEF2-like MADS domains were present in the MRCA of extant eukaryotes are well above 50%

and they only decrease below 50% in rooting hypothesis I when the forward rate is set higher than

0.18 in the case of SRF-like MADS domains and 0.24 for MEF2-like MADS domains. Note that

two independent HGT events would be required to explain the origin of the MADS domain after the

diversification of discicristates, namely the HGT of the SRF-like and the HGT of the MEF2-like

MADS domain to the lineage that led to N. gruberi. Assuming that at least one of the two trees used

here is largely correct, one can conclude that the MADS domain originated early during eukaryote

evolution, either already in the lineage that led to the MRCA of extant eukaryotes, or after excavates

had branched off, at the latest.
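How such a model turns tip observations into a probability for the ancestor can be sketched on a toy example. The code below assumes a star tree in which all tips descend directly from the root, equal root state frequencies, and hypothetical branch lengths; it does not use the trees of Baldauf [7] and does not reproduce the likelihoods reported above. The function names p_matrix and root_posterior are hypothetical.

```python
import math


def p_matrix(gain, loss, t):
    """Transition probabilities of a two-state Markov model (0 = absent, 1 = present)."""
    total = gain + loss
    decay = math.exp(-total * t)
    p01 = gain / total * (1.0 - decay)  # absent -> present
    p10 = loss / total * (1.0 - decay)  # present -> absent
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


def root_posterior(tip_states, branch_lengths, gain=0.1, loss=0.9):
    """Posterior of [absent, present] at the root of a star tree, equal root frequencies."""
    likelihood = [0.5, 0.5]
    for state, t in zip(tip_states, branch_lengths):
        p = p_matrix(gain, loss, t)
        likelihood = [likelihood[s] * p[s][state] for s in (0, 1)]
    norm = sum(likelihood)
    return [x / norm for x in likelihood]


# One tip with the domain, one without, both on short (hypothetical) branches.
print(root_posterior([1, 0], [0.1, 0.1]))
```

In this toy setting, a root with one tip carrying the domain and one lacking it is more likely to have had the domain, because a loss along one branch is more probable than a gain along the other.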

Phylogenetic tree reconstruction corroborates the presence of two types

of MADS domains in the MRCA of extant eukaryotes

To critically test our conclusions concerning the early duplication of MADS-box genes during the

evolution of eukaryotes, we reconstructed a maximum likelihood tree with representative MADS

domain sequences of A. thaliana and the identified MADS domain sequences in the analyzed

genomes (Figures S1 and S2). The overall resolution of the tree is low, but it shows two branches

containing the vast majority of SRF-like and MEF2-like MADS domains, respectively. Some

sequences annotated as SRF-like or MEF2-like MADS domains appear on the other branch, but the

classification of those is usually not well supported by E-values (Table S2, sequences in bold), so

that their classification is questionable. All in all, our tree supports the idea that both types of

MADS domains were present in the MRCA of extant eukaryotes and thus must have been

generated by a gene duplication that occurred in the lineage that led to the MRCA of extant

eukaryotes.

Details on the similarity between TOPOIIA-A and MADS-domain

sequences

At positions two and four of the alignment of TOPOIIA-A and MADS-domain sequences (Figure

1b and c), positively charged residues are found that could be important for contact with the

negatively charged backbone of the DNA. The sequence of three positively charged amino acids in

a row at positions 23 to 25 is conserved in all MADS-domain proteins but is interrupted by a

hydrophobic residue in TOPOIIA-A fragments. These residues have frequently been identified to be

part of a nuclear localization signal in the MADS domain [15-17], which would have no function in

prokaryotes. Hydrophobic residues are found in all sequences at positions 11, 21, 35, 46 and 48 of

the alignment. Hydrophobic residues are generally important for the structural stability of proteins

[18].
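Checks of this kind (for example, "positively charged at positions two and four" or "hydrophobic at position 11 in all sequences") can be automated once an alignment is available. The sketch below assumes one common residue classification (K and R as positively charged; A, V, L, I, M, F and W as hydrophobic), which may differ from the one used for Figure 1; the function name property_conserved and the toy alignment are hypothetical.

```python
# Assumed residue classes; other classifications exist.
POSITIVE = set("KR")
HYDROPHOBIC = set("AVLIMFW")


def property_conserved(alignment, position, residue_class):
    """True if every sequence has a residue of the given class at a 1-based position."""
    return all(seq[position - 1] in residue_class for seq in alignment)


# Hypothetical toy alignment, not taken from Figure 1.
toy = ["MKRVK", "LRKIR", "IKHLK"]
print(property_conserved(toy, 2, POSITIVE))
print(property_conserved(toy, 3, POSITIVE))
```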

Higher order structures of TOPOIIA-A and MADS-domain sequences

Generally, a similar tertiary protein structure is seen as an argument for homology even if proteins

have a low level of sequence identity [19, 20]. As there are no solved structures available for any of

the TOPOIIA-A proteins identified as being homologous to the MADS domain in our study, we

modeled the structure of the identified TOPOIIA-A protein with the lowest E-value, DNA

topoisomerase IV, subunit A of Anabaena variabilis. The solved structure of gyrase A, C-terminal

domain of Borrelia burgdorferi was used as a template (as suggested by the Phyre protein structure

prediction server [13]). The modeled structure folds into a β-pinwheel, similar to the structure

formed by the B. burgdorferi protein. The region that is putatively homologous to the MADS

domain includes five β-strands and one α-helix (Figure S3a). The MADS domain of SRF adopts a

structure with an N-terminal extension, a long α-helix and two β-strands (Figure S3b) [21]. At first

glance, these two structures appear quite dissimilar. However, the C-terminal two β-strands in the

region of topoisomerase IV that is putatively homologous to the MADS domain overlap with the

two β-strands in the MADS domain (Figure S4). If the predicted structure of DNA topoisomerase

IV is correct, there has been an elongation of the α-helix in the part putatively homologous to the

MADS domain in the evolution of this predicted structure from the template structure of B.

burgdorferi. Thus, a change of structure towards a long α-helix during the evolution of the MADS

domain from an ancestral TOPOIIA-A protein seems feasible. In this context, note also that

changes in protein secondary structure have been shown to occur with few or even no

changes in amino acid sequence, as in the case of so-called chameleon sequences [22], prions or the

Arc repressor [23-26].

The three residues identical in the 15 TOPOIIA-A fragments identified by homology searches and

the seed alignment are located in loop regions on the surface of DNA topoisomerase IV and in a

loop region and in the α-helix of the MADS domain. These residues contact DNA in the solved

structure of SRF [21]. The fact that these residues are located in loop regions in the predicted

structure of topoisomerase IV indicates that these residues are under few structural constraints to maintain

α-helical or β-strand properties. We hence assume that functional constraints, possibly due to

DNA binding, have acted on these residues during evolution.

Supplementary figures

Figure S1 – Alignment for the phylogenetic tree shown in Figure S2. Species abbreviations are

as in Figure 1; in addition: Mb – Monosiga brevicollis, Ng – Naegleria gruberi, Um – Ustilago maydis, Nc –

Neurospora crassa, Ps – Phytophthora sojae, Ptri – Phaeodactylum tricornutum. In XP001461257

of Paramecium tetraurelia the sequence NVNLLFQLLILLFLEPLYNLNYYLILC was omitted

between alignment positions 16 and 17 to simplify the presentation. The alignment is colored

according to the Clustal X color scheme. Sequences are named as in Figure 1: either the accession number

is given or the common name is used.

Figure S2 – Phylogenetic tree of 75 MADS domains constructed using the Maximum Likelihood

method as implemented in the program RAxML [10]. The two clusters of MEF2-like and SRF-like

MADS domains are indicated. Sequence names in red indicate domains that were classified as a

different type than the majority of the domains in the branch they belong to. For readability, sequence

names are not shown for all annotated domains, whereas all domains of questionable

classification are named (see also Table S2). The branches are color coded such that the

respective sequences are from the following groups of organisms: green – plants, red –

ophisthokonts, black – alveolates, blue – amoebozoans, yellow – discicristates, cyan – chromistans.

Species abbreviations and sequence names are the same as in Figure 1 and Figure S1.

Figure S3 – Presence or absence of SRF-like and MEF2-like MADS domains in the evolution of

eukaryotes under two alternative rooting hypotheses and reconstructed using the parsimony

principle. Black denotes presence of the MADS domain while white indicates absence of the

MADS domain.

Figure S4 – Structure comparison. Comparison of the predicted structure of DNA topoisomerase

IV, subunit A of Anabaena variabilis, a cyanobacterium (a), and the partial structure of DNA-bound

SERUM RESPONSE FACTOR of Homo sapiens (b; SRF, PDB: 1SRS). For clarity, only

the sequence stretch representing amino acids 225 to 282 of the structure of DNA topoisomerase IV,

subunit A is shown. The MADS domain of SRF and the region putatively homologous to the

MADS domain in TOPOIIA-A are colored in dark blue. Residues identical between 15 TOPOIIA-A

fragments identified by homology searches and the MADS domain are shown in spacefill

representation and colored green.

Figure S5 – Secondary structure alignment of a part of SERUM RESPONSE FACTOR of Homo

sapiens (SRF, PDB: 1SRS) and the predicted structure of the corresponding part of DNA

topoisomerase IV, subunit A of Anabaena variabilis, a cyanobacterium. α-helices are shown as red

boxes and β-strands are shown as green boxes with an arrowhead. An amino acid alignment of the

corresponding amino acid sequences is indicated.

Supplementary tables

Table S1 – MADS domains in six domains of life. The number of MADS domains found in the

corresponding genome is listed in the column “#MADS”. A star in the column “#MADS” indicates

that some of the recovered MADS domains are not annotated in the corresponding databases. The

number in brackets states how many MADS domains are not annotated. Alternating shading was

used to facilitate distinction between major groups of eukaryotes.

Taxonomy          Species                      #MADS   Data set            Data source
Ophisthokonta
Fungi
Ascomycota        Neurospora crassa            2       Complete genome     Broad Institute
                  Yarrowia lipolytica          2       Complete genome     Center for Bioinformatics, Bordeaux
                  Saccharomyces cerevisiae     4       Complete genome     Saccharomyces Genome DB
                  Schizosaccharomyces pombe    4*(1)   Complete genome     Sanger
Basidiomycota     Ustilago maydis              2       Complete genome     Broad Institute
                  Cryptococcus neoformans      3*(1)   Complete genome     Broad Institute
Microsporidia     Encephalitozoon cuniculi     1       Complete genome     IMG
                  Antonospora locustae         2*      Complete genome     Antonospora GDB
Metazoa           Drosophila melanogaster      2       Complete genome     Berkeley DGP
Choanoflagellata  Monosiga brevicollis         3       Complete genome     JGI
                  Proterospongia               0       ESTs                TbestDB
Amoebozoa
Myxogastrida      Physarum polycephalum        2*      Complete genome     Washington University GSC
Dictyostelida     Dictyostelium discoideum     4       Complete genome     IMG
Acanthamoebidae   Acanthamoeba castellanii     3*      Complete genome     BCM
Hartmannellidae   Hartmannella vermiformis     2       ESTs                TbestDB
Pelobionta        Entamoeba histolytica        3*(1)   Complete genome     IMG
Plantae
Chlorophyta
Embryophyta       Arabidopsis thaliana         122     Annotated Proteins  TAIR
Chlorophyta       Chlamydomonas reinhardtii    2       Complete genome     JGI
Rhodophyta
Cyanidiales       Cyanidioschyzon merolae      1*      Complete genome     C.m. Genome Project
Glaucocystophyta  Glaucocystis nostochinearum  0       ESTs                TbestDB
Rhizaria
Cercozoa          Bigelowiella natans          0       Nucleomorph         NCBI
Alveolata
Apicomplexa       Toxoplasma gondii            1*      Complete genome     ToxoDB
                  Theileria annulata           0       Complete genome     Sanger
                  Theileria parva              0       Complete genome     IMG
                  Plasmodium falciparum        0       Complete genome     IMG
                  Plasmodium yoelii yoelii     0       Complete genome     IMG
                  Cryptosporidium parvum       0       Complete genome     IMG
                  Cryptosporidium hominis      0       Complete genome     IMG
Dinoflagellata
Dinophycea        Heterocapsa triquetra        0       ESTs                TbestDB
Ciliophora        Tetrahymena thermophila      1       Complete genome     TIGR
                  Paramecium tetraurelia       8       Complete genome     Paramecium Genome Browser
Chromista
Heterokonta
Oomycota          Phytophthora sojae           1       Complete genome     JGI
                  Phytophthora ramorum         1       Complete genome     JGI
Bacillariophyta   Thalassiosira pseudonana     0       Complete genome     JGI
                  Phaeodactylum tricornutum    1*      Complete genome     JGI
Cryptophyta       Guillardia theta             0       Nucleomorph         IMG
                  Hemiselmis andersenii        0       Nucleomorph         NCBI
Discicristates
Euglenozoa
Kinetoplastida    Leishmania major             0       Complete genome     IMG
                  Leishmania infantum          0       Complete genome     Sanger
                  Trypanosoma cruzi            0       Complete genome     TIGR
                  Trypanosoma brucei           0       Complete genome     IMG
Heterolobosae
Schizopyrenida    Naegleria gruberi            2       Complete genome     JGI
Excavata
Diplomonadina     Giardia lamblia              0       Complete genome     IMG
Parabasalia       Trichomonas vaginalis        0       Complete genome     TIGR
Oxymonadina       Streblomastix strix          0       ESTs                TbestDB

Table S2 – E-values for classification of MADS domains as SRF-like or MEF2-like according

to the NCBI conserved domains database. The lower one of the two E-values, used for

classification in Figures S1 and S2, is shaded. Species abbreviations and sequence names are the

same as in Figure 1 and Figure S1. Bold writing indicates questionable classifications. n.a., not

available.

Sequence SRF-like MEF2-like

JGI_EGW1.7.155.1_Ng 3.00E-14 1.00E-18

JGI_EEGNGPG.C520029_Ng 3.00E-19 3.00E-16

JGI_EEG1PG.C1440013_Ps 5.00E-16 3.00E-24

SCAFFOLD37000080_Pr 4.00E-16 1.00E-24

chr11_37909-37967_Ptri n.a. 5.3E-02

XP001013498_Tt 2.00E-12 1.00E-09

scaffold129_50735-50800_Pt 2.00E-04 6.00E-05

scaffold2_81330-81385_Pt 7.00E-04 6.00E-04

XP001438374_Pt 5.00E-10 9.00E-09

XP001429517_Pt 2.00E-09 1.00E-08

XP001434362_Pt 7.00E-06 2.00E-07

scaffold157_37788-37869_Pt 1.00E-09 2.00E-09

XP001461257_Pt 1.00E-06 1.00E-07

scaffold91_69496-69580_Pt 5.00E-11 3.00E-11

TGG995082_Tg 8.00E-17 1.00E-26

AP006483_98690-98746_Cm 9.00E-12 8.00E-16

SCAFFOLD_3000406_Cr 1.00E-03 6.00E-04

SCAFFOLD66000005_Cr 2.00E-06 1.00E-07

IMG640321549_Eh 4.00E-17 1.00E-19

IMG640313519_Eh 2.00E-12 2.00E-14

NW665827_18595-18632_Eh 1.00E-07 3.00E-07

contig9912_5278-5374_Ac 6.00E-18 2.00E-16

contig4434_456-542_Ac 2.00E-15 3.00E-14

contig14208_1287-1381_Ac 3.00E-14 3.00E-18

IMG639614547_Dd 1.00E-17 3.00E-15

IMG639615340_Dd 1.00E-18 6.00E-16

IMG639614226_Dd 1.00E-12 3.00E-15

IMG639621525_Dd 1.00E-15 4.00E-23

HVL00004593_Hv 4.00E-08 6.00E-10

HVL00000978_Hv 6.00E-17 2.00E-13

Contig6957_233-283_Pp 2.00E-18 1.00E-16

Contig5814_226-273_Pp 9.00E-16 2.00E-24

IMG638215947_Ec 8.00E-15 8.00E-13

contig39_1287-1340_Al 4.00E-03 n.a.

contig175_1617-1674_Al 2.00E-14 9.00E-14

XP772022_Cn 6.00E-18 2.00E-14

XP777518_Cn 3.00E-14 3.00E-21

AACO02000072_75987_76036_Cn 8.00E-08 2.00E-13

XP757371_Um 3.00E-10 8.00E-10

XP761470_Um 9.00E-16 3.00E-20

XP501533_Yl 3.00E-10 4.00E-10

XP505594_Yl 1.00E-16 6.00E-24

fgenesh1pg.C150048_Mb 2.00E-13 1.00E-15

gw1.4.553.1_Mb 1.00E-05 7.00E-04

scaffold_15000047_Mb 9.00E-12 2.00E-12

XP964617_Nc 1.00E-09 6.00E-10

XP965689_Nc 1.00E-15 4.00E-21

NP013756_Sc 3.00E-17 5.00E-16

NP013757_Sc 7.00E-08 5.00E-11

NP015236_Sc 3.00E-18 2.00E-24

NP009741_Sc 6.00E-17 3.00E-21

NP596507_Sp 2.00E-16 2.00E-15

chr2_786512-786567_Sp 1.00E-11 2.00E-11

NP594931_Sp 8.00E-13 3.00E-11

NP595972_Sp 6.00E-11 6.00E-11

NP726438_Dm 2.00E-18 8.00E-14

NP995789_Dm 1.00E-16 1.00E-25
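The classification rule behind Table S2 (the lower of the two E-values decides the type) can be written as a short function. This is a sketch: the function name classify_mads is hypothetical, "n.a." is represented as None, and the returned margin (the absolute difference of the log10 E-values) is an illustrative measure of support introduced here, not the criterion the authors used to mark classifications as questionable.

```python
import math


def classify_mads(srf_evalue, mef2_evalue):
    """Return (label, margin); margin is the absolute log10 difference of the E-values."""
    if srf_evalue is None and mef2_evalue is None:
        raise ValueError("at least one E-value is required")
    if srf_evalue is None:
        return "MEF2-like", None
    if mef2_evalue is None:
        return "SRF-like", None
    margin = abs(math.log10(srf_evalue) - math.log10(mef2_evalue))
    if srf_evalue == mef2_evalue:
        return "undetermined", margin
    label = "SRF-like" if srf_evalue < mef2_evalue else "MEF2-like"
    return label, margin


# Values taken from the first row of Table S2 (JGI_EGW1.7.155.1_Ng).
print(classify_mads(3.00e-14, 1.00e-18))
```

A small margin, as for sequences whose two E-values are nearly equal, signals that the assignment to one type should be treated with caution.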

Table S3 – Results of searching the MADS domain in the non-redundant microbial database

using HMMer. Raw “Score” and empirical “E-value” as calculated by HMMer are shown.

Sequence ID Description Score E-value

gi|75909066| DNA gyrase, subunit A -1.4 6.2

gi|17227937| DNA topoisomerase chain A -2.8 8.7

gi|113953878| DNA gyrase subunit A -6.8 23

gi|33862278| DNA gyrase/topoisomerase IV, subunit A -9.2 41

gi|33239457| Type IIA topoisomerase, A subunit, ParC -11.6 73

gi|124021719| DNA gyrase/topoisomerase IV, subunit A -11.8 78

Table S4 – Microbial clusters identified with HHsearch using the MADS-domain HMM as

query. Clusters are numbered according to the cd-hit clustering procedure. Columns “Query” and

“Template” indicate which amino acid positions of the MADS domain and the identified cluster,

respectively, show sequence similarity. Shading indicates clusters containing TOPOIIA-A

sequences.

Hit Description E-value Query Template

cluster21422 DNA gyrase/topoisomerase, subunit A 0.035 1-58 630-685

cluster276140 Proteasome subunit alpha 1 38-58 1-21

cluster466954 Predicted nucleic acid-binding protein 1 33-58 1-26

cluster72291 Hypothetical proteins 2.5 28-58 1-30

cluster224995 Proteins of unknown function DUF147 2.8 21-58 1-37

cluster170941 PSP1 proteins 4.2 39-58 1-20

cluster7568 Endo-beta-N-acetylglucosaminidase 4.6 22-58 1-35

cluster218999 3-oxoacyl synthases III 7.3 1-58 35-89

cluster493787 Transcriptional regulators 13 22-58 1-37

cluster7640 Chromosome segregation proteins SMC 14 1-58 856-911

cluster372191 Hypothetical proteins 15 1-25 175-199

cluster270568 Membrane proteins/Hypothetical proteins 19 28-58 1-31

cluster175139 Hypothetical proteins 24 28-58 1-34

cluster593245 Hypothetical proteins 33 1-19 37-55

cluster122934 Hypothetical proteins 33 1-20 408-427

cluster155000 Fibronectin-attachment family proteins 33 14-58 1-44

cluster191774 Fe3+ ABC transporters 38 28-58 1-36

cluster517725 Hypothetical proteins 41 28-58 1-31

cluster133636 Hydroxymethylglutaryl-coenzyme A synthases 43 1-21 393-412

cluster455816 Hypothetical proteins 44 35-58 1-23

cluster160652 Glycosyl transferases 45 31-58 1-27

cluster19843 DNA gyrase/topoisomerase, subunit A 49 1-58 663-721
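The "Query" column can be used to separate global from partial matches. The sketch below assumes a query HMM length of 58 positions, inferred from the largest query coordinate in Table S4 rather than stated in the text; the function name query_coverage is hypothetical.

```python
def query_coverage(query_range, query_length=58):
    """Fraction of the query HMM covered by a hit, given a range such as '1-58'."""
    start, end = (int(x) for x in query_range.split("-"))
    return (end - start + 1) / query_length


# Query ranges taken from Table S4.
print(query_coverage("1-58"))   # cluster21422, DNA gyrase/topoisomerase, subunit A
print(query_coverage("38-58"))  # cluster276140, Proteasome subunit alpha
```

In Table S4, four clusters cover the full query range 1-58 (cluster21422, cluster218999, cluster7640 and cluster19843), and two of these contain DNA gyrase/topoisomerase subunit A sequences.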

Table S5 – Results of HMMer searches of the non-redundant plant database using an HMM of

the six previously identified TOPOIIA-A sequences. Red shading indicates clusters containing

TOPOIIA-A sequences while blue shading indicates clusters containing MADS-domain

sequences. Raw “Score” and empirical “E-value” as calculated by HMMer are shown.

Sequence Description Score E-value

gi|145355547| predicted protein [Ostreococcus lucimarinus] 4.9 0.015

gi|115441497| Os01g0886200 [Oryza sativa (japonica cultivar-group)] -4.6 0.22

gi|30694601| AGL16 (AGAMOUS-LIKE 16); transcription factor -7.0 0.43

gi|145332879| AGL16 (AGAMOUS-LIKE 16); transcription factor -7.0 0.43

gi|15238067| FLC; transcription factor [Arabidopsis thaliana] -7.6 0.52

gi|145334363| FLC (FLOWERING LOCUS C) [Arabidopsis thaliana] -7.6 0.52

gi|42568779| MAF4 (MADS AFFECTING FLOWERING 4) -8.8 0.73

gi|115487796| Os12g0207000 [Oryza sativa (japonica cultivar-group)] -9.7 0.93

gi|15230284| AGL18 (AGAMOUS-LIKE 18); transcription factor -10.2 1.1

gi|115483150| Os10g0536100 [Oryza sativa (japonica cultivar-group)] -10.5 1.1

gi|42566942| AG (AGAMOUS); transcription factor -10.9 1.3

gi|115467168| Os06g0223300 [Oryza sativa (japonica cultivar-group)] -11.4 1.5

gi|115456153| Os03g0812000 [Oryza sativa (japonica cultivar-group)] -12.3 1.9

gi|115446901| Os02g0579600 [Oryza sativa (japonica cultivar-group)] -13.2 2.5

gi|115466584| Os06g0162800 [Oryza sativa (japonica cultivar-group)] -13.5 2.7

gi|115448477| Os02g0731200 [Oryza sativa (japonica cultivar-group)] -14.0 3.1

gi|115451551| Os03g0215400 [Oryza sativa (japonica cultivar-group)] -14.1 3.2

gi|30681440| DNA gyrase subunit A family protein [Arabidopsis thaliana] -14.3 3.3

gi|115457632| Os04g0304400 [Oryza sativa (japonica cultivar-group)] -14.5 3.6

gi|115458790| Os04g0461300 [Oryza sativa (japonica cultivar-group)] -14.7 3.8

gi|15220084| MADS-box protein (AGL100) [Arabidopsis thaliana] -14.7 3.8

gi|115439679| Os01g0726400 [Oryza sativa (japonica cultivar-group)] -15.2 4.3

gi|115451205| Os03g0186600 [Oryza sativa (japonica cultivar-group)] -15.5 4.8

gi|15218456| MADS-box protein (AGL60) [Arabidopsis thaliana] -15.8 5.2

gi|42562154| AGL65; DNA binding / transcription factor [A. thaliana] -16.7 6.7

gi|115468584| Os06g0565900 [Oryza sativa (japonica cultivar-group)] -17.0 7.2

gi|115455401| Os03g0753100 [Oryza sativa (japonica cultivar-group)] -17.3 7.8

gi|15233857| AGL24 (AGAMOUS-LIKE 24); transcription factor -17.4 8.5

gi|30698092| AGL31; transcription factor [Arabidopsis thaliana] -17.5 8.3

gi|145334905| AGL31 [Arabidopsis thaliana] -17.5 8.3

gi|115469428| Os06g0667200 [Oryza sativa (japonica cultivar-group)] -17.5 8.4

gi|42568781| AGL68/MAF5 (MADS AFFECTING FLOWERING 5) -17.6 8.5

gi|145334907| AGL68/MAF5 (MADS AFFECTING FLOWERING 5) -17.6 8.5

gi|145350260| predicted protein [Ostreococcus lucimarinus CCE9901] -17.9 9.1

gi|15234874| STK (SEEDSTICK); transcription factor [A. thaliana] -17.9 9.3

gi|145332997| STK (SEEDSTICK) [Arabidopsis thaliana] -17.9 9.3

gi|30681253| STK (SEEDSTICK); transcription factor [A. thaliana] -17.9 9.3

gi|115476540| Os08g0431900 [Oryza sativa (japonica cultivar-group)] -18.0 9.6

gi|79376490| AGL94; DNA binding / transcription factor [A. thaliana] -18.1 9.7

Table S6 – Results of reverse HHsearch using an HMM of the previously identified TOPOIIA-

A cluster as a query to search the non-redundant plant database. Clusters are numbered

according to the cd-hit clustering procedure. Blue shading indicates clusters containing

MADS-domain sequences while red shading indicates clusters containing TOPOIIA-A

sequences. Columns “Query” and “Template” indicate which amino acid positions of the

MADS domain and the identified cluster show sequence similarity.

Hit           Description                                 E-value  Query  Template
cluster28707  MADS AFFECTING FLOWERING proteins           0.29     1-56   2-59
cluster25045  AGAMOUS-LIKE 18 proteins                    1.9      1-56   2-59
cluster16375  AGAMOUS-LIKE 65 proteins                    2.9      1-56   2-59
cluster2431   DNA gyrase subunit A family proteins        3.8      1-56   715-770
cluster25305  AG/SHP/STK proteins                         5.7      1-56   2-59
cluster14854  zinc finger family proteins                 6.7      1-19   410-429
cluster22457  2-dehydro-3-deoxyphosphooctonate aldolases  7        1-56   87-140
cluster16544  AGAMOUS-LIKE 30 proteins                    7.8      1-56   2-59
cluster20126  unknown proteins                            9.9      1-27   299-326
cluster25478  SVP proteins                                10       1-56   20-78
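Because the description field of Table S6 can contain spaces, the rows are easiest to read programmatically by splitting the last three fields off from the right. The following is a minimal sketch, not part of the original analysis; the function name `parse_hit` and the returned dictionary layout are assumptions made for illustration.

```python
def parse_hit(line):
    """Split one whitespace-separated Table S6 row into its five fields.

    Hypothetical helper: field names follow the table header
    (Hit, Description, E-value, Query, Template).
    """
    # The description may contain spaces, so peel the last three
    # whitespace-separated fields off from the right first.
    head, evalue, query, template = line.rsplit(None, 3)
    # What remains is the cluster identifier followed by the description.
    hit, description = head.split(None, 1)
    q_start, q_end = (int(x) for x in query.split("-"))
    t_start, t_end = (int(x) for x in template.split("-"))
    return {
        "hit": hit,
        "description": description,
        "evalue": float(evalue),
        "query": (q_start, q_end),
        "template": (t_start, t_end),
    }


row = parse_hit("cluster2431 DNA gyrase subunit A family proteins 3.8 1-56 715-770")
print(row["hit"], row["evalue"], row["template"])
```

Parsed this way, the rows can be filtered by E-value or by the aligned template region.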

Table S7 – Clustal W similarity scores between identified and non-identified sequences of TOPOIIA, subunit A. Summarized are Clustal W scores of the partial TOPOIIA, subunit A sequences identified by HMM and/or HHsearch queries to other TOPOIIA, subunit A sequences in the non-redundant database. Maximum scores are shown, except for the identified cyanobacterial sequences, where minimum scores are shown. The column titled “max” shows the overall maximum score and the column titled “avg” shows average scores of sequences not identified by HMM and/or HHsearch queries to the identified sequences.

Acidobacteria Actinobacteria Aquificae Bacteroidetes Chlamydiae Chlorobi Deinococcus Euryarchaeota Firmicutes

gi|17227937| 32 37 30 35 32 33 33 32 33

gi|75909066| 32 37 32 33 28 32 32 32 33

gi|86604741| 30 33 21 35 33 39 39 35 35

gi|86607511| 32 33 34 37 33 41 41 30 35

gi|37521619| 33 33 20 32 30 30 37 33 42

gi|78211563| 28 28 23 21 23 19 23 26 30

gi|33864544| 26 40 27 25 25 21 23 25 28

gi|148238341| 26 29 18 21 25 23 23 19 25

gi|113953878| 23 29 25 23 25 19 26 25 30

gi|124021719| 28 32 30 25 21 28 28 28 28

gi|33862278| 26 30 29 23 19 25 26 28 28

gi|148241104| 17 29 14 23 14 17 26 23 26

gi|72383169| 17 28 10 21 16 14 16 25 28

gi|124024717| 17 28 10 21 16 14 16 25 28

gi|33239457| 19 33 18 21 19 19 23 26 28

Fusobacteria Planctomycetes Proteobacteria Spirochaetes Tenericutes Thermotogae Cyano. (min) max avg (nonid.)

gi|17227937| 30 32 33 28 32 26 32 53 23.28846154

gi|75909066| 28 30 33 28 30 26 33 51 23.453125

gi|86604741| 20 33 37 35 31 24 23 55 19.12980769

gi|86607511| 20 33 39 32 33 34 26 57 20.44591346

gi|37521619| 39 28 37 32 37 28 21 44 23.58173077

gi|78211563| 10 23 32 28 26 19 35 67 18.15264423

gi|33864544| 10 25 33 25 28 25 37 75 19.32572115

gi|148238341| 10 19 30 25 23 30 26 64 16.75

gi|113953878| 12 23 35 28 23 17 33 62 17.89783654

gi|124021719| 12 28 37 25 33 16 30 64 19.10096154

gi|33862278| 14 25 35 25 30 16 26 62 18.23317308

gi|148241104| 10 21 32 23 25 21 33 60 18.26201923

gi|72383169| 14 16 25 19 30 25 21 44 15.04927885

gi|124024717| 14 16 25 19 30 25 21 42 14.86298077

gi|33239457| 12 23 33 23 26 26 23 51 17.02163462
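The summary columns of Table S7 can be illustrated with a short sketch. It assumes, as one reading of the caption, that the pairwise Clustal W scores of one identified sequence are grouped by phylum, that the cyanobacterial group holds the identified sequences, and that all other groups hold non-identified sequences. The function name `summarize_scores`, the input layout, and the example values are hypothetical and are not taken from the original analysis.

```python
from statistics import mean


def summarize_scores(scores_by_phylum, identified_group="Cyanobacteria"):
    """Summarize pairwise scores of one identified sequence (Table S7 style).

    scores_by_phylum maps a phylum name to the list of pairwise scores
    against sequences of that phylum (assumed input layout).
    """
    summary = {}
    for phylum, scores in scores_by_phylum.items():
        # Minimum score within the identified group, maximum elsewhere.
        if phylum == identified_group:
            summary[phylum] = min(scores)
        else:
            summary[phylum] = max(scores)

    # "max": overall maximum across every compared sequence.
    summary["max"] = max(s for scores in scores_by_phylum.values() for s in scores)

    # "avg": mean score over sequences outside the identified group.
    non_identified = [
        s
        for phylum, scores in scores_by_phylum.items()
        if phylum != identified_group
        for s in scores
    ]
    summary["avg"] = mean(non_identified)
    return summary


# Made-up scores, for illustration only.
example = {
    "Firmicutes": [20, 33],
    "Chlorobi": [10, 30],
    "Cyanobacteria": [32, 53],
}
print(summarize_scores(example))
```

Reporting the minimum for the identified group and the maximum for all other groups makes the comparison conservative: the weakest similarity among identified sequences is set against the strongest similarity to any other group.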

References

1. Sayers, E.W., et al. (2009) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 37, D5-15

2. Li, W.Z., and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659

3. Larkin, M.A., et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947-2948

4. Eddy, S.R. (1996) Hidden Markov models. Curr Opin Struct Biol 6, 361-365

5. Majoros, W.H., et al. (2004) TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878-2879

6. Söding, J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951-960

7. Baldauf, S.L. (2003) The deep roots of eukaryotes. Science 300, 1703-1706

8. Maddison, W.P., and Maddison, D.R. (2009) Mesquite: a modular system for evolutionary analysis. Version 2.5 http://mesquiteproject.org.

9. Edgar, R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 1-19

10. Stamatakis, A. (2006) RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688-2690

11. Whelan, S., and Goldman, N. (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18, 691-699

12. Abascal, F., et al. (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21, 2104-2105

13. Kelley, L.A., and Sternberg, M.J.E. (2009) Protein structure prediction on the Web: a case study using the Phyre server. Nat Protoc 4, 363-371

14. Alvarez-Buylla, E.R., et al. (2000) An ancestral MADS-box gene duplication occurred before the divergence of plants and animals. Proc Natl Acad Sci USA 97, 5328-5333

15. Gauthier-Rouvière, C., et al. (1995) The serum response factor nuclear-localization signal - general implications for cyclic-AMP-dependent protein-kinase activity in control of nuclear translocation. Mol Cell Biol 15, 433-444

16. McGonigle, B., et al. (1996) Nuclear localization of the Arabidopsis APETALA3 and PISTILLATA homeotic gene products depends on their simultaneous expression (vol 10, pg 1812, 1996). Genes Dev 10, 2235-2235

17. Immink, R.G.H., et al. (2002) Analysis of MADS box protein-protein interactions in living plant cells. Proc Natl Acad Sci USA 99, 2416-2421

18. Kellis, J.T., et al. (1988) Contribution of hydrophobic interactions to protein stability. Nature 333, 784-786

19. Chi, S.W., et al. (1999) Solution structure of a conserved C-terminal domain of p73 with structural homology to the SAM domain. EMBO J 18, 4438-4445

20. Hellman, M., et al. (2004) Solution structure of coactosin reveals structural homology to ADF/cofilin family proteins. FEBS Lett 576, 91-96

21. Pellegrini, L., et al. (1995) Structure of serum response factor core bound to DNA. Nature 376, 490-498

22. Tan, S., and Richmond, T.J. (1998) Crystal structure of the yeast MAT alpha 2/MCM1/DNA ternary complex. Nature 391, 660-666

23. Guo, M.X., et al. (2008) PrPC interacts with tetraspanin-7 through bovine PrP154-182 containing alpha-helix 1. Biochem Biophys Res Commun 365, 154-157

24. Harrison, P.M., et al. (1997) The prion folding problem. Curr Opin Struct Biol 7, 53-59

25. Cordes, M.H.J., et al. (1999) Evolution of a protein fold in vitro. Science 284, 325-327

26. Anderson, T.A., et al. (2005) Sequence determinants of a conformational switch in a protein structure. Proc Natl Acad Sci USA 102, 18344-18349
