

From cDNA to integrative protein annotation and beyond:

application to Alvinella pompejana cDNA collection

Gagnière, N.1, Bigot, Y.2, Gaill, F.3, Higuet, D.4, Jollivet, D.5, Leize, E.6, Perrodou, E.1, Rees, J.F.7, Weissenbach, J.8, Zal, F.9, Poch, O.1 , Lecompte, O.1

[Figure: Alvinella pompejana, labelled: gills; pygidium; dorsal face with epibiotic bacteria. Photo: Phare 2002, IFREMER ©]

Full-length enriched cDNA libraries were generated at the Genoscope (http://www.genoscope.cns.fr/) for:

• whole animal (Cloneminer method)
• gills (Oligo-capping method)
• ventral tissue (Oligo-capping method)
• pygidium (Cloneminer method, sequencing in progress)

Whole animals as well as dissected tissues were collected during the oceanographic Biospeedo cruise on the Pacific Ridge in 2004. Sequencing of the 5’ ends is ongoing at Genoscope on an ABI 3730 sequencer using dye-terminator fluorescent DNA sequencing technology. A total of 200,000 reads will be produced. We will then use the sequence data to select about 10,000 full-length cDNAs, and the entire sequence of the selected clones will be determined.

Cleaning and assembling process

Flowchart (steps as labelled on the poster):
• chromatograms
• PHRED: sequence and quality extraction
• Cross-match: vector masking
• ad hoc script: polyA masking
• PHRED: low-quality region trimming
• File synchronization
• ad hoc scripts: sequence trimming and parsing
• eliminated sequences (<100 bp, chimera)

For the 70,000 available reads, base-calling and low-quality (Q≤13) region trimming were performed using the Phred program. Vector sequences and other contaminants were masked using Cross-match. Poly(A/T) regions as well as repetitive sequences were masked using ad hoc scripts. After sequence trimming and masking, sequences with fewer than 100 unmasked bases were excluded from further processing. Cleaned sequences of each library were assembled separately using Cap3, leading to a total of 13,000 contigs and singlets. Mean contig length is > 900 bp and the library redundancy ranges from 53 to 79%.
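The trimming and length filter described above can be illustrated with a minimal Python sketch. It is not the ad hoc scripts used in the project: the function names are hypothetical, the end-trimming rule is a simplification of what Phred does, and we assume that masked positions are written as 'X' or 'N'.

```python
# Illustrative sketch only; names and the masking alphabet are assumptions.

MASK_CHARS = set("XxNn")  # assumed symbols for masked bases


def trim_low_quality(seq, quals, threshold=13):
    """Trim bases with quality <= threshold from both ends of a read."""
    start, end = 0, len(seq)
    while start < end and quals[start] <= threshold:
        start += 1
    while end > start and quals[end - 1] <= threshold:
        end -= 1
    return seq[start:end]


def count_unmasked(seq):
    """Count the bases that were not masked."""
    return sum(1 for base in seq if base not in MASK_CHARS)


def keep_read(seq, min_unmasked=100):
    """Keep a read only if it has at least min_unmasked unmasked bases."""
    return count_unmasked(seq) >= min_unmasked


if __name__ == "__main__":
    read = trim_low_quality("ACGTACGT", [5, 10, 20, 30, 30, 20, 13, 2])
    print(read)
    print(keep_read("A" * 99 + "X" * 50), keep_read("A" * 100 + "X" * 50))
```

The 100-base cutoff and the Q≤13 threshold are the values quoted in the text; everything else is a placeholder for the real pipeline.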

Ongoing developments

To facilitate and speed up oligo design for future protein expression tests, we have developed a new program called OliDA (Oligo Design Automatization) that automatically determines optimized cDNA and protein boundaries by analysing MACSIMS results. Boundary determination combines PFAM-A domain or PDB structure boundaries with phylogenetic distribution and conservation patterns. The program is integrated into the GScope platform upstream of oligo ordering for PCR and will be available as a web application.

Beta version of the OliDA Web 2.0 results page. The red lines indicate the proposed boundaries. Users can correct cloning boundaries by clicking on the alignment.

Figure legend: proposed boundary; propagated strand; propagated helix.

1 CNRS-INSERM-ULP, UMR7104/U596 – LBGI Laboratoire de Biologie et Génomique Intégratives
2 CNRS-UFR: FRE 2535 – Laboratoire d’Etude des Parasites Génétiques
3 CNRS-UPMC-MNHN-IRD, UMR 7138 – Systématique, Adaptation, Evolution
4 CNRS-UPMC-MNHN-IRD, UMR 7138 – Génétique et Evolution
5 CNRS-UPMC, UMR 7144 – Evolution et Génétique des Populations Marines
6 CNRS-ULP, UMR 7512 – Laboratoire de Spectrométrie de masse BioOrganique
7 ISV-UCL, Laboratoire de Biologie cellulaire (Belgium)
8 GENOSCOPE
9 CNRS-UPMC Equipe Ecophysiologie : Adaptation et Evolution Moléculaires

OliDA decision tree (figure):
• Protein complete? (yes: select full sequence; no: select complete domains)
• Correct the region by comparing to aligned PDBs
• Insert long enough to include C-terminal end? (yes / no)
• Boxes on the branches: use run-off oligo; generate 5’ & 3’ oligos; generate 5’ oligo; generate 3’ oligo
• Order oligos

Contigs and singlets are annotated with GScope, a software platform developed at the LBGI (R. Ripp, manuscript in preparation). GScope manages, integrates, validates, analyses and visualizes high-throughput information (genomic and protein sequences, transcriptomics…). Classical tools for similarity search, gene prediction and codon usage determination are implemented, as well as in-house programs for specialised analyses (start codon validation, frameshift detection, oligonucleotide design, target analysis, phylogenetic distribution…).

Protein sequence prediction

We developed an original BlastX-based approach to detect and translate Alvinella CDS segments, complementing the hidden Markov model (HMM) CDS prediction program ESTscan2 (Lottaz et al.). Because too few Alvinella coding and non-coding cDNA sequences were available, a robust Alvinella-specific HMM could not be built; we therefore used the human model bundled with ESTscan2, which proved efficient. This result is linked to the close relationship between A. pompejana and vertebrates (Alvinella consortium, manuscript in preparation).

MACS creation

All the programs of the annotation process rely on high-quality clustered multiple alignments generated by the PipeAlign (http://bips.u-strasbg.fr/PipeAlign/) protein analysis toolkit. This allows the reliable characterization of a target protein sequence in its evolutionary context.

Annotation

We used MACSIMS (http://bips.u-strasbg.fr/MACSIMS) to propagate structural and functional information mined from the public databases to the Alvinella sequences. In addition, the GOAnno program (http://bips.u-strasbg.fr/GOAnno/) annotates proteins according to the Gene Ontology, and a data mining program generates a consensus functional definition and a consensus EC number from close homologs.

Throughout the whole analysis protocol, fine-grained information about cDNAs (tissue of origin, cloning errors, sequence quality, …) is maintained in a relational database to facilitate comparison of tissue libraries, comparison of variants and efficient exploitation of A. pompejana cDNAs.

Available cDNA libraries

Semi automated cDNA sequence analysis protocol

Abstract

Protein creation and integrative annotation with MACSIMS

Conclusion and perspectives

Annotation results summary

Protocol summary (figure):
70,000 cDNAs
→ cleaning protocol → 50,000 cleaned cDNAs
→ per-library assembly → 4,000 contigs and 9,000 singlets
→ BlastX-based protein creation → 6,600 proteins

• About 30% of the initial cDNA sequences were discarded from the assembly by the cleaning process. Although some short sequences of good quality were removed, the vast majority of the discarded sequences were empty vectors and chimeric inserts.
• Of the 13,000 assembled sequences, only half have significant BlastX homologs for protein creation and annotation. ESTscan2 prediction with the human model on the sequences without homologs revealed many long open reading frames with biased composition.
• Almost all the proteins have been annotated with PFAM-A domains, Gene Ontology terms, a functional definition or an EC number. Annotation verification is in progress; in addition, we will implement a scoring function to help check the consistency of the annotation of each sequence semi-automatically.
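As a quick consistency check, the proportions quoted in these conclusions can be recomputed from the rounded counts shown in the protocol summary. The short Python sketch below only does this arithmetic; the variable names are ours, and we assume the 6,600 proteins correspond to the assembled sequences with significant BlastX homologs.

```python
# Rounded counts taken from the poster's protocol summary.
reads_total = 70_000
reads_cleaned = 50_000
assembled = 4_000 + 9_000  # contigs + singlets
proteins = 6_600           # assumed: sequences with significant BlastX homologs

discarded_fraction = 1 - reads_cleaned / reads_total
protein_fraction = proteins / assembled

print(f"discarded by cleaning: {discarded_fraction:.1%}")
print(f"assembled sequences giving a protein: {protein_fraction:.1%}")
```

With these rounded inputs the discarded fraction comes out just under 30% and the protein fraction just over 50%, in line with "about 30%" and "only half".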

Annotation protocol

References

• Chalmel F, Lardenois A, Thompson JD, Muller J, Sahel JA, Leveillard T, Poch O. GOAnno: GO annotation based on multiple alignment. Bioinformatics. 2005
• Clamp M, Cuff J, Searle SM, Barton GJ. The Jalview Java Alignment Editor. Bioinformatics. 2004
• Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. Genome Res. 1998
• Huang X, Madan A. CAP3: A DNA sequence assembly program. Genome Res. 1999
• Lecompte O, Thompson JD, Plewniak F, Thierry J, Poch O. Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene. 2001
• Lottaz C, Iseli C, Jongeneel CV, Bucher P. Modeling sequencing errors by combining Hidden Markov models. Bioinformatics. 2003
• Plewniak F, Bianchetti L, Brelivet Y, Carles A, Chalmel F, Lecompte O, Mochel T, Moulinier L, Muller A, Muller J, Prigent V, Ripp R, Thierry JC, Thompson JD, Wicker N, Poch O. PipeAlign: A new toolkit for protein family analysis. Nucleic Acids Res. 2003
• Thompson JD, Muller A, Waterhouse A, Procter J, Barton GJ, Plewniak F, Poch O. MACSIMS: multiple alignment of complete sequences information management system. BMC Bioinformatics. 2006

Alvinella pompejana, the “Pompeii worm”, is a polychaete annelid discovered in 1980. This tubicolous worm colonizes hydrothermal vents, where it faces extreme and variable physico-chemical conditions, including very high temperatures (from 20 to over 80°C), anoxic conditions, low pH, and high concentrations of heavy metals and sulfide. This environment makes A. pompejana an ideal model for studies aimed at deciphering adaptation in general, as well as a unique source of thermostable proteins of eukaryotic origin for structural studies. For these reasons, the Alvinella consortium initiated a massive cDNA sequencing project.

To exploit the first 70,000 reads, we have designed a semi-automated protocol that leads from the Alvinella cDNA collection to annotated proteins. This protocol includes chromatogram base calling, cleaning and assembly of the raw sequences, and original strategies for protein creation and annotation.

Overview of the OliDA decision tree. Sequenced 3’ cDNA extremities are often unusable. Therefore, when the C-terminal end of the protein is expected to lie within the insert (1,200 base pairs on average), the program uses vector-specific, hand-designed oligos called ‘run-off oligos’. These oligos match the vector downstream of the insert, so that the endogenous stop codon of the protein is used.
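To make the decision logic easier to follow, here is a minimal Python sketch of how such a tree could be encoded. It is not the OliDA implementation: the function name and arguments are hypothetical, and the assignment of actions to the yes/no branches is our reading of the figure and its caption, so it should be treated as an assumption.

```python
from typing import List


def plan_oligo_design(protein_complete: bool,
                      c_terminus_in_insert: bool) -> List[str]:
    """Return the ordered design steps for one target (illustrative only)."""
    steps = []
    # Assumed mapping: complete protein -> full sequence, else domains only.
    if protein_complete:
        steps.append("select full sequence")
    else:
        steps.append("select complete domains")
    steps.append("correct the region by comparing to aligned PDBs")
    # Assumed mapping: C-terminal end inside the insert -> run-off oligo,
    # which sits in the vector and relies on the endogenous stop codon.
    if c_terminus_in_insert:
        steps.append("generate 5' oligo")
        steps.append("use run-off oligo")
    else:
        steps.append("generate 5' oligo")
        steps.append("generate 3' oligo")
    steps.append("order oligos")
    return steps


if __name__ == "__main__":
    for step in plan_oligo_design(protein_complete=True,
                                  c_terminus_in_insert=True):
        print(step)
```

Encoding the tree as a plain function keeps each branch explicit, which makes it straightforward to adjust if the real branch conditions differ from this reading.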

BlastX-based protein sequence prediction. The significant BlastX HSPs of each assembled sequence are mapped onto the corresponding cDNA segments, which are translated in the correct reading frame. Unmatched cDNA segments and overlapping HSP segments are padded with ‘X’ characters. Finally, the protein is extended in both directions until a stop codon or a cDNA extremity is reached.
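The principle of this caption can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the program used in the project: the function names are hypothetical, HSPs are assumed to be given as sorted, non-overlapping, 0-based half-open nucleotide intervals on the forward strand, each interval is assumed to start on a codon boundary of its own frame, and gaps between HSPs are padded with one 'X' per full codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def extend_upstream(cdna, start):
    """Move start upstream codon by codon until a stop codon or the cDNA end."""
    while start - 3 >= 0 and CODON_TABLE.get(cdna[start - 3:start], "X") != "*":
        start -= 3
    return start


def extend_downstream(cdna, end):
    """Move end downstream codon by codon until a stop codon or the cDNA end."""
    while end + 3 <= len(cdna) and CODON_TABLE.get(cdna[end:end + 3], "X") != "*":
        end += 3
    return end


def protein_from_hsps(cdna, hsps):
    """Build a protein from HSP intervals, padding gaps between them with 'X'."""
    cdna = cdna.upper()
    hsps = sorted(hsps)
    first_start, last_end = hsps[0][0], hsps[-1][1]
    parts = [translate(cdna[extend_upstream(cdna, first_start):first_start])]
    prev_end = None
    for start, end in hsps:
        if prev_end is not None:
            parts.append("X" * max(0, (start - prev_end) // 3))
        parts.append(translate(cdna[start:end]))
        prev_end = end
    parts.append(translate(cdna[last_end:extend_downstream(cdna, last_end)]))
    return "".join(parts)


if __name__ == "__main__":
    # Two HSPs separated by a 4-nt gap (i.e. a frameshift between them).
    cdna = "TAGCCCATGGCTACGTGGTAAATTTTGACC"
    print(protein_from_hsps(cdna, [(6, 12), (16, 22)]))  # PMAXGKF
```

In the example, the first HSP is extended upstream by one codon (stopping at the TAG), the gap between the two HSPs becomes a single 'X', and the second HSP is extended downstream until the TGA stop codon.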

Multiple Alignment of Complete Sequences (MACS) creation using PipeAlign.

Propagation of functional and structural information using MACSIMS

PFAM-A annotation display using JalView (www.jalview.org/). Propagated features appear in a lighter color than database mined features.

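Annotation results (pie charts):
• Functional definition: with definition 87%; no definition 13%
• Gene Ontology: with Gene Ontology 86%; no Gene Ontology 14%
• Pfam-A domains: 1 Pfam-A 60%; 2 and more 8%; no Pfam-A 32%
• EC number: with EC number 74%; no EC number 26%
• Overall: annotated 95%; no annotation 5%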