The Ensembl Database (docente.unife.it/.../dispense-corsi/genomica-1/prokopenko_lezione7.…)


Post on 08-Sep-2020



The Ensembl Database

Dott.ssa Inga Prokopenko

Corso di Genomica


Lecture 7.1 2

www.ensembl.org


What is Ensembl?

• Public annotation of mammalian and other genomes

• Open source software
• Relational database system
• The future of genomic bioinformatics?


The Ensembl Project

“Ensembl is a joint project between EMBL European Bioinformatics Institute and the Sanger Institute to develop a software system which produces and maintains automatic annotation on eukaryotic genomes. Ensembl is primarily funded by the Wellcome Trust”


The Ensembl Project

“The main aim of this campaign is to encourage scientists across the world - in academia, pharmaceutical companies, and the biotechnology and computer industries - to use this free information.”

- Dr. Mike Dexter, Director of the Wellcome Trust


Goal: An Accessible, Annotated Genome

[Figure: ContigView display, presented as “what we want in the end”]


Ensembl Software System

• Makes extensive use of BioPerl (www.bioperl.org)
• Uses the free MySQL database
• Entire Ensembl code base is freely available under an Apache open source license
• Mainly written in Perl, with extensions in C. Some viewers have been written in Java (e.g. Apollo).


Ensembl Genome Annotation

• Utilizes raw DNA sequence data from public sources
• Creates a tracking database (the “Ensembl database”)
• Joins the sequences, based on a sequence scaffold or “Golden Path”
• Automatically finds genes and other features of the sequence
• Associates sequence and features with data from other sources
• Provides a publicly accessible web-based interface to the database


The Genome Problem

• The problem with the genome (particularly human) is that it is “large, complicated, and opaque to analysis” (Ewan Birney, Ensembl)

• Genome features to identify include:
– Genes: protein coding, RNA, pseudogenes
– Regulatory elements
– SNPs, repeats, etc.


DNA sequence in Ensembl

• Sequences are determined in fragments (contigs)
• Features cross boundaries between fragments
• The entire sequence is too large, and changes too often (it is constantly updated and reassembled), to be stored as one long database entry


DNA sequence in Ensembl

• Core design feature is the “virtual contig” object

• Allows genome sequence to be accessed as a single large contiguous sequence even though it is stored as a collection of fragments

• VC object handles reading and writing features to the DNA sequence
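The virtual contig idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Ensembl's actual (Perl) implementation: the `Fragment` and `VirtualContig` classes, their fields, and the toy sequences are all hypothetical, and gaps between fragments are simply filled with `N`.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One stored contig placed on the assembly (hypothetical layout)."""
    start: int      # 1-based start position on the assembled sequence
    sequence: str   # DNA of this fragment


class VirtualContig:
    """Presents a collection of fragments as one contiguous sequence."""

    def __init__(self, fragments):
        # Keep fragments ordered by their position on the assembly.
        self.fragments = sorted(fragments, key=lambda f: f.start)

    def subseq(self, start, end):
        """Return bases start..end (1-based, inclusive); gaps become 'N'."""
        out = ["N"] * (end - start + 1)
        for frag in self.fragments:
            frag_end = frag.start + len(frag.sequence) - 1
            lo, hi = max(start, frag.start), min(end, frag_end)
            if lo > hi:
                continue  # fragment does not overlap the requested region
            piece = frag.sequence[lo - frag.start: hi - frag.start + 1]
            out[lo - start: hi - start + 1] = piece
        return "".join(out)


# Two toy fragments with a two-base gap between them.
vc = VirtualContig([Fragment(1, "ACGTAC"), Fragment(9, "GGCCTT")])
region = vc.subseq(4, 11)  # spans the end of one fragment and the start of the next
print(region)
```

The caller asks for a coordinate range and never needs to know which stored fragment each base came from, which is the point of the virtual contig.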


Ensembl Gene Build System

• Three-part gene build system
– “Best in genome” matches for known genes
– Alignment of homologous genes
– Ab initio predictions supported by homology evidence
• Genes predicted on repeat-masked DNA
• All genes predicted based on experimental (available sequence) evidence
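To make “repeat-masked DNA” concrete, here is a minimal sketch of masking: bases inside known repeat intervals are replaced with `N` so that gene predictors ignore them. The function name, the 0-based end-exclusive interval convention, and the toy input are assumptions for illustration; in practice the repeat intervals would come from a dedicated repeat-finding tool.

```python
def mask_repeats(sequence, repeats, mask_char="N"):
    """Replace each repeat interval (0-based, end-exclusive) with mask_char."""
    bases = list(sequence)
    for start, end in repeats:
        # Clamp intervals so out-of-range coordinates cannot raise errors.
        start, end = max(0, start), min(len(bases), end)
        for i in range(start, end):
            bases[i] = mask_char
    return "".join(bases)


# Toy example: two repeat intervals on a 12-base sequence.
masked = mask_repeats("ACGTACGTACGT", [(2, 5), (8, 10)])
print(masked)
```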


“Best in genome” predictions

• Find known proteins from SPTREMBL on genome using pmatch

• Incorporate cDNAs using exonerate and EST_genome
– Align with gaps placed preferentially at splice consensus sites
– Allows prediction of 5’ and 3’ UTRs

• Refine predictions using genewise
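As a rough illustration of the “best in genome” idea, the sketch below keeps only the highest-scoring genomic placement for each known protein. It does not call pmatch, exonerate, or genewise; the `Hit` record, its `score` field, and the toy proteins and coordinates are all hypothetical stand-ins for whatever those tools report.

```python
from collections import namedtuple

# Hypothetical record for one protein-to-genome match.
Hit = namedtuple("Hit", "protein chrom start end score")


def best_in_genome(hits):
    """Return a dict mapping each protein to its single best-scoring hit."""
    best = {}
    for hit in hits:
        current = best.get(hit.protein)
        if current is None or hit.score > current.score:
            best[hit.protein] = hit
    return best


# Made-up matches: protA hits two loci, protB hits one.
hits = [
    Hit("protA", "chr1", 1000, 2500, 87.0),
    Hit("protA", "chr7", 400, 1800, 99.5),
    Hit("protB", "chr2", 50, 900, 75.0),
]
best = best_in_genome(hits)
print(best["protA"].chrom)
```

Weaker placements of the same protein (for example on a paralog or pseudogene) are discarded here; a real pipeline would pass the surviving locus on for refinement.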


“Best in genome” predictions

[Figure: ContigView of a “best in genome” gene with associated evidence. Track labels: known gene (p53), proteins aligned, cDNAs aligned, UTRs predicted, UniGene clusters aligned]

• Alignments shown in ContigView


Homology predictions

• Align homologous proteins using BLAST, genewise
– Paralogs (from same organism)
– Orthologs (from closely related organisms)

• Assemble novel genes


Ab initio gene predictions

• Use Genscan to identify novel exons
• Confirm exons by BLAST to known proteins, mRNAs, and UniGene clusters
• Genes are based on ab initio predictions but require homology evidence

[Figure: ContigView of homology gene with associated evidence. Track labels: novel gene, Genscan predictions, proteins aligned, UniGene clusters aligned]
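The requirement that ab initio exons be backed by homology evidence can be pictured as an interval-overlap filter. This is a simplified sketch under stated assumptions: exons and evidence alignments are reduced to closed `(start, end)` intervals on the same sequence, any shared base counts as support, and the coordinates are invented. It does not run Genscan or BLAST.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def confirm_exons(predicted_exons, evidence):
    """Keep only predicted exons overlapped by at least one evidence alignment."""
    return [exon for exon in predicted_exons
            if any(overlaps(exon, hit) for hit in evidence)]


# Toy data: three predicted exons, two evidence alignments.
predicted = [(100, 200), (500, 650), (900, 1000)]
evidence = [(150, 210), (880, 905)]
confirmed = confirm_exons(predicted, evidence)
print(confirmed)
```

The middle exon has no supporting alignment and is dropped, mirroring the rule that a prediction alone is not enough.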


Pseudogenes

• Many pseudogenes also predicted


Ensembl Gene Build System

• Resulting “Ensembl genes” are highly accurate with low false positive rates

• Ensembl human gene identifiers are 95% stable between builds

[Placeholder: snapshot or statistics on genes]
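One way to read “95% stable” is as the fraction of gene identifiers from the previous build that are still present in the new one. The slide does not define the measure, so this definition, the function name, and the toy identifiers below are assumptions made for illustration.

```python
def id_stability(old_ids, new_ids):
    """Fraction of identifiers in the old build that persist in the new build."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    if not old_ids:
        raise ValueError("old build has no identifiers")
    return len(old_ids & new_ids) / len(old_ids)


# Toy builds: 20 old identifiers, 19 of which survive, plus one new gene.
old_build = [f"GENE{i:03d}" for i in range(20)]
new_build = old_build[:19] + ["GENE999"]
stability = id_stability(old_build, new_build)
print(f"{stability:.0%}")
```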


Ensembl EST genes

• ESTs are not accurate enough to produce Ensembl genes, but they are important, especially for identifying alternative transcripts
• Create an independent set of “EST genes”

[Figure: track labels: known gene, UniGene clusters aligned, EST genes]


Expressed sequence tag or EST

• A short sub-sequence of a transcribed cDNA sequence.
• May be used to identify gene transcripts.
• Instrumental in gene discovery and gene sequence determination.
• The identification of ESTs has proceeded rapidly, with approximately 52 million ESTs available in public databases (e.g. GenBank 5/2008, all species).


ESTs

• An EST is produced by one-shot sequencing of a cloned mRNA (i.e. sequencing several hundred base pairs from an end of a cDNA clone taken from a cDNA library).

• The resulting sequence is a relatively low quality fragment whose length is limited by current technology to approximately 500 to 800 nucleotides.
– Because these clones consist of DNA that is complementary to mRNA, the ESTs represent portions of expressed genes. They may be present in the database as either cDNA/mRNA sequence or as the reverse complement of the mRNA, the template strand.
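Because an EST may be stored in either orientation, it is often necessary to compute its reverse complement before comparing it with a transcript. A minimal sketch follows; the function name and the toy sequence are illustrative, and only the plain bases A, C, G, T and N are handled.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


est = "ATGGCCTTA"
rc = reverse_complement(est)
print(rc)
```

Applying the function twice returns the original sequence, which is a convenient sanity check.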


EST mapping

• ESTs can be mapped to specific chromosome locations using physical mapping techniques, such as radiation hybrid mapping or FISH.

• Alternatively, if the genome of the organism that originated the EST has been sequenced one can align the EST sequence to that genome.


ESTs and genes

• The current understanding of the human set of genes includes the existence of thousands of genes based solely on EST evidence.

• ESTs become a tool to refine the predicted transcripts for those genes, which leads to prediction of their protein products, and eventually of their function.

• The situation in which those ESTs are obtained (tissue, organ, disease state - e.g. cancer) gives information on the conditions in which the corresponding gene is acting.

• ESTs contain enough information to permit the design of precise probes for DNA microarrays that then can be used to determine the gene expression.


Ensembl EST genes

• Map ESTs to genome using Exonerate, BLAST, and EST2Genome

• Define transcripts by merging redundant ends, setting splice sites to common ends
– Finds splice sites and defines UTRs
– Alternative transcript predicted if at least one alternatively spliced EST exists

• Process transcripts with Genomewise to find longest ORF for each
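The last step, finding the longest ORF in each transcript, can be illustrated with a simple scan. This is a stand-in for the idea only, not Genomewise: it assumes a spliced transcript sequence, looks only at the three forward frames, and defines an ORF as running from an ATG to the first in-frame stop codon. The toy transcript is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest ATG..stop ORF across the three forward frames."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the current open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


transcript = "CCATGAAATGAGGATGCCCGGGTTTTAACC"
orf = longest_orf(transcript)
print(orf, len(orf))
```

Here a short ORF in one frame loses out to a longer one in another frame; bases before and after the chosen ORF would correspond to the UTRs.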


Ensembl EST genes

• Evidence for genes shown (ExonView)