Browsing Genes and Variants with Ensembl www.ensembl.org Coursebook v86 training.ensembl.org Heraklion, November 2016 Helen Sparrow 1


Browsing Genes and Variants

with Ensembl

www.ensembl.org

Coursebook v86

training.ensembl.org

Heraklion, November 2016

Helen Sparrow


Introduction to Ensembl

Getting started with Ensembl

Ensembl (http://www.ensembl.org) is a genomic interpretation system providing the most up-to-date annotations, querying tools and access methods for chordates and key model organisms. It is a joint project between the EBI (European Bioinformatics Institute) and the Wellcome Trust Sanger Institute.

Genome assemblies are imported from public repositories, in line with the genome browser agreement (with NCBI and UCSC). Ensembl provides genes and other annotation such as regulatory regions, conserved base pairs across species, and sequence variations. The Ensembl gene set is based on protein and mRNA evidence in the UniProtKB and NCBI RefSeq databases, along with manual annotation from the VEGA/Havana group. Gene sets from model organisms, e.g. yeast, fruit fly and worm, are also imported for comparative genomics.

Most annotation is updated every two months, leading to increasing Ensembl versions (such as version 85); however, the gene sets are determined less frequently. A sister browser at www.ensemblgenomes.org provides access to non-chordates, namely bacteria, plants, fungi, metazoa and protists.

All the data are freely available and can be accessed via the web browser at www.ensembl.org. Perl programmers can access the Ensembl databases directly through an Application Programming Interface (the Perl API). Gene sequences can be downloaded from the Ensembl browser itself or through the BioMart web interface, which extracts information from the Ensembl databases without requiring any programming knowledge.


Synopsis – What can I do with Ensembl?

● View genes with other annotation along the chromosome.
● View alternative transcripts (i.e. splice variants) for a given gene.
● Explore homologues and phylogenetic trees across more than 60 species for any gene.
● Compare whole genome alignments and conserved regions across species.
● View microarray sequences that match to Ensembl genes.
● View ESTs, clones, mRNA and proteins for any chromosomal region.
● Examine single nucleotide polymorphisms (SNPs) for a gene or chromosomal region.
● View SNPs across strains (rat, mouse), populations (human), or breeds (dog).
● View positions and sequence of mRNAs and proteins that align with Ensembl genes.
● Upload your own data.
● Use BLAST or BLAT against any Ensembl genome.
● Export sequence or create a table of gene information with BioMart.
● Determine how your variants affect genes and transcripts using the Variant Effect Predictor.
● Share Ensembl views with your colleagues and collaborators.


Need more help?

● Check Ensembl documentation

● Watch video tutorials on YouTube

● View the FAQs

● Try some exercises

● Read some publications

● Go to our online course

Stay in touch!

● Email the team with comments or questions at [email protected]

● Follow the Ensembl blog

● Sign up to a mailing list

Further reading

Yates, A. et al.
Ensembl 2016
Nucleic Acids Research (Database Issue)
https://nar.oxfordjournals.org/content/44/D1/D710

Kersey, P.J. et al.
Ensembl Genomes 2016: more genomes, more complexity
Nucleic Acids Research (Database Issue)
http://nar.oxfordjournals.org/content/44/D1/D574

For a complete list of publications, visit:
http://www.ensembl.org/info/about/publications.html
http://ensemblgenomes.org/info/publications


Exploring the Ensembl genes and transcripts

Walkthrough: Homepage and assemblies

The front page of Ensembl is found at ensembl.org. It contains lots of information and links to help you navigate Ensembl:

At the top right you can see the current release number and what has come out in this release. To access old releases, scroll to the bottom of the page and click on View in archive site.


Click on the links to go to the archives. Alternatively, you can jump quickly to the correct release by putting it into the URL, for example e78.ensembl.org jumps to release 78. Click on Human GRCh38.p7.
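The release-in-URL shortcut described above can be sketched in a few lines of Python. This is only an illustration of the naming pattern given in the text (e78.ensembl.org for release 78); the function name is hypothetical, and the code does not check whether an archive for a given release is actually still online.

```python
def archive_host(release: int) -> str:
    """Build the archive hostname for an Ensembl release number.

    Follows the pattern quoted in the text, e.g. release 78 -> e78.ensembl.org.
    No network request is made, so a returned hostname is not guaranteed to exist.
    """
    if release < 1:
        raise ValueError("release numbers start at 1")
    return f"e{release}.ensembl.org"


if __name__ == "__main__":
    print(archive_host(78))
```

Calling `archive_host(78)` gives the hostname used as the example in the text.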

To find out more about the genome assembly and genebuild, click on More information and statistics.


The current genome assembly for human is GRCh38. If you want to see the previous assembly, GRCh37, visit our dedicated site grch37.ensembl.org.


Walkthrough: Region in detail

From the GRCh38 home page, go to the search box and type human 7:117465784-117715971.
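The location part of the search string has the form chromosome:start-end. If you handle such strings in your own scripts, a minimal sketch like the following can split them into their parts. The class and function names are hypothetical, and the length calculation assumes that both coordinates are inclusive.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumes both start and end are included in the region.
        return self.end - self.start + 1


_REGION_PATTERN = re.compile(r"^(\w+):(\d+)-(\d+)$")


def parse_region(text: str) -> Region:
    """Split a 'chromosome:start-end' string into its three parts."""
    match = _REGION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a chromosome:start-end string: {text!r}")
    chromosome, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not be greater than end")
    return Region(chromosome, start, end)


if __name__ == "__main__":
    region = parse_region("7:117465784-117715971")
    print(region.chromosome, region.start, region.end, region.length)
```

Note that the species name typed into the search box (human) is not part of the location string and is not handled here.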

The Region in detail page is made up of three images. Let’s look at each one in detail. The first image shows the chromosome:

You can jump to a different region by dragging out a box in this image. Drag out a box on the chromosome and a pop-up menu will appear.


If you wanted to move to the region, you could click on Jump to region (### bp). If you wanted to highlight it, you could click on Mark region (### bp). For now, we’ll close the pop-up by clicking on the X in the corner.

The second image shows a 1 Mb region around our selected region. This view allows you to scroll back and forth along the chromosome.

You can also drag out and jump to or mark a region.

Click on the X to close the pop-up menu.

Click on the Drag/Select button to change the action of your mouse click. Now you can scroll along the chromosome by clicking and dragging within the image. As you do this you’ll see the image below grey out and two blue buttons appear. Clicking on Update this image would jump the lower image to the region central to the scrollable image. We want to go back to where we started, so we’ll click on Reset scrollable image.

The third image is a detailed, configurable view of the region.

Click on the Drag/Select option at the top or bottom right to switch mouse action. On Drag, you can click and drag left or right to move along the genome; the page will reload when you release the mouse button. On Select, you can drag out a box to highlight or zoom in on a region of interest.


With the tool set to Select, drag out a box around an exon and choose Mark region.

The highlight will remain in place if you zoom in and out or move around the region. This allows you to keep track of regions or features of interest. We can edit what we see on this page by clicking on the blue Configure this page menu at the left.

This will open a menu that allows you to change the image. You can turn tracks on in different styles; more details are in this FAQ: http://www.ensembl.org/Help/Faq?id=335.


Let’s add some tracks to this image:

● Refseq human import - (In genes and transcripts)

● Proteins (mammal) from Uniprot - (In mRNA and Protein alignments)

Now click on the tick in the top left-hand corner to save and close the menu. Alternatively, click anywhere outside of the menu. We can now see the tracks in the image. We can also change the way the tracks appear by hovering over the track name and then clicking the cog wheel to open a menu. We can move tracks around by clicking and dragging on the bar to the left of the track name.


Now that you’ve got the view how you want it, you might like to show something you’ve found to a colleague or collaborator. Click on the Share this page button to generate a link. Email the link to someone else, so that they can see the same view as you, including all the tracks you’ve added. These links contain the Ensembl release number, so if a new release or even assembly comes out, your link will just take you to the archive site for the release it was made on.

To return this to the default view, go to Configure this page and select Reset configuration at the bottom of the menu.


Walkthrough: Gene and transcripts

We’re going to look at the human CFTR gene. This gene encodes the cystic fibrosis transmembrane conductance regulator. The CFTR protein functions as a channel for the movement of chloride ions in and out of cells, which is important for the salt and water balance on epithelial surfaces, such as in the lungs or pancreas. Mutations in this gene are associated with cystic fibrosis.

From ensembl.org, type CFTR into the search bar and click the Go button. You will get a list of hits with the human gene at the top.

Click on CFTR (Human Gene) or on the Ensembl ID. The Gene tab should open:


Let’s walk through some of the links in the left-hand navigation column.

How can we view the genomic sequence? Click Sequence at the left of the page.


The sequence is shown in FASTA format. Take a look at the FASTA header:

You can download this sequence by clicking on the Download sequence button above the sequence:


This will open a dialogue box that allows you to pick between plain FASTA sequence, or sequence in RTF, which includes all the coloured annotations and can be opened in a word processor. This button is available for all sequence views.

To find all of the phenotypes which have been found to be associated with the gene, click on Phenotypes in the left-hand menu.


This contains links to the source databases which have identified the link between the gene and the phenotype, as well as the study IDs which found the association.

Can our gene be found in other databases? Go up the left-hand menu to External references:

This contains links to the gene in other projects, such as OMIM, and papers where this sequence is published.


To find out in which tissues the gene is expressed, click on Gene expression in the left-hand menu.

This page pulls in expression data from the EBI resource Expression Atlas (http://www.ebi.ac.uk/gxa/home), which determines baseline and differential gene expression for genes.


Walkthrough: The transcript tab

Let’s now explore one splice isoform. Click on Show transcript table at the top.

Have a look at the largest one, CFTR-001.

If we were to only choose one transcript to analyse, we would choose this one because it has:

● Matching annotation between automatic and manual methods (Gold in Biotype column).
● Matching annotation with RefSeq, giving it a CCDS.
● High transcript support (TSL1).
● A complete structure, making it a member of GENCODE Basic.
● Judged to be the principal isoform, making it an APPRIS P1.
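The criteria above can be sketched as a simple scoring rule for picking one representative transcript. This is an illustration only, not how Ensembl itself ranks transcripts: the record fields, the equal weighting of the criteria and the second transcript ("example-002") are all hypothetical, and only the flags for CFTR-001 come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Transcript:
    name: str
    merged_annotation: bool   # automatic and manual annotation agree ("Gold")
    has_ccds: bool            # matching annotation with RefSeq
    tsl: Optional[int]        # transcript support level, 1 is the best
    gencode_basic: bool       # complete structure
    appris_principal: bool    # judged to be the principal isoform


def criteria_met(t: Transcript) -> int:
    """Count how many of the five criteria a transcript satisfies."""
    return sum([
        t.merged_annotation,
        t.has_ccds,
        t.tsl == 1,
        t.gencode_basic,
        t.appris_principal,
    ])


def pick_representative(transcripts: List[Transcript]) -> Transcript:
    """Return the transcript meeting the most criteria (first one wins ties)."""
    if not transcripts:
        raise ValueError("no transcripts given")
    return max(transcripts, key=criteria_met)


transcripts = [
    # Hypothetical partial transcript, for contrast only.
    Transcript("example-002", False, False, 3, False, False),
    # Flags taken from the bullet list above.
    Transcript("CFTR-001", True, True, 1, True, True),
]

if __name__ == "__main__":
    print(pick_representative(transcripts).name)
```

Equal weighting is the simplest choice; in practice you may care more about some criteria than others.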

Click on the ID, ENST00000003084.10. You are now in the Transcript tab for CFTR-001. The left-hand navigation column provides several options for the transcript CFTR-001.

For detailed information on the biological evidence used to annotate this transcript, click on Supporting evidence.
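An ID such as ENST00000003084.10 combines a stable identifier with a version number after the dot. The sketch below splits the two and reads the feature type from the letter after ENS. It is a simplified assumption that only covers human-style IDs (ENS, one type letter, eleven digits); IDs for other species carry an extra species code and are not handled. The function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Feature-type letters used in Ensembl stable IDs.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}

# Human-style IDs only: ENS + type letter + 11 digits + optional ".version".
_STABLE_ID = re.compile(r"^(ENS([GTPE])\d{11})(?:\.(\d+))?$")


def parse_stable_id(text: str) -> Tuple[str, str, Optional[int]]:
    """Return (feature type, unversioned ID, version or None)."""
    match = _STABLE_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a human-style Ensembl stable ID: {text!r}")
    stable_id, type_letter, version = match.groups()
    return (
        FEATURE_TYPES[type_letter],
        stable_id,
        int(version) if version is not None else None,
    )


if __name__ == "__main__":
    print(parse_stable_id("ENST00000003084.10"))
```

Keeping the version separate is useful because the unversioned ID stays the same across releases, while the version number increases when the underlying sequence changes.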


Click on the yellow evidence track (protein) to get a pop-up. This links out to the original records of these data in, for example, UniProt. Clicking on the UniProt ID P13569 will let us assess this evidence in UniProt:

UniProt has annotated this entry with a score of 5 out of 5, which gives us a level of confidence in the data used for transcript annotation.

Next, follow the General identifiers link at the left. This page shows information from other databases, such as RefSeq, UniProtKB and CCDS, that match the Ensembl transcript and protein.


Variation

Walkthrough: Exploring CFTR variants in Ensembl

In any of the sequence views shown in the Gene and Transcript tabs, you can view variants on the sequence. You can do this by clicking on Configure this page from any of these views.

Let’s take a look at the Gene sequence view for CFTR. Search for CFTR again and go to the Sequence view. If you can’t see variants marked on this view, click on Configure this page and select Show variants: Yes and show links.

Find out more about a variant by clicking on it.


You can add variants to all other sequence views in the same way. To view all the sequence variations in table form, click the Variant table link at the left of the gene tab.

You can filter the table to only show the variants you’re interested in. For example, click on Consequences: All, then select the variant consequences you’re interested in. For the ΔF508 variant, we know it to be an inframe deletion. Click Turn all off and then select Inframe deletion.


You can also filter by SIFT, PolyPhen and MAF, or click on Filter other columns to filter by other columns such as Evidence or Class. There are still 586 variants in our table. To narrow down further to AA position 508, click Filter other columns, then AA coord, and use the sliding bar to narrow the search area to 500-520.

We can now see 6 inframe deletions in this region. The two of interest to us are rs199826652 and rs113993960, as they have IF/I in the AA column, indicating the loss of phenylalanine at position 508. Hover over the Evidence and Clinical Significance columns to see what has been attributed to each variant.

The table contains lots of information about the variants. You can also click on the IDs here to go to the Variant tab. Click on the link for rs113993960 to open the Variant tab.
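If you export the variant table and want to repeat the same two filters in a script, the logic can be sketched as below. The row structure and function name are hypothetical, the AA coordinates are illustrative placeholders rather than values read from Ensembl, and the two extra rows are invented purely to show rows being filtered out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VariantRow:
    variant_id: str
    consequence: str
    aa_coord: int     # illustrative placeholder values, see lead-in
    aa_change: str


def filter_rows(rows: List[VariantRow], consequence: str,
                aa_min: int, aa_max: int) -> List[VariantRow]:
    """Keep rows with the given consequence inside an inclusive AA range."""
    return [
        row for row in rows
        if row.consequence == consequence and aa_min <= row.aa_coord <= aa_max
    ]


rows = [
    VariantRow("rs113993960", "inframe deletion", 508, "IF/I"),
    VariantRow("rs199826652", "inframe deletion", 508, "IF/I"),
    VariantRow("example_missense", "missense variant", 508, "F/L"),   # hypothetical
    VariantRow("example_far_away", "inframe deletion", 1200, "EK/E"),  # hypothetical
]

if __name__ == "__main__":
    for row in filter_rows(rows, "inframe deletion", 500, 520):
        print(row.variant_id, row.aa_change)
```

The same two conditions correspond to the Consequences filter and the AA coord slider used in the walkthrough.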


In the Variant tab you can view things like Alleles, most severe consequence and, interestingly for this example, Variant with equivalent alleles. While rs199826652 and rs113993960 are deletions of different 3 bp stretches, the resultant alleles are the same. We can see this by clicking on Genes and Regulation in the left panel, which shows that the codon change for rs113993960 is ATCTTT/ATT.

To see the codon change for rs199826652, click back to the Gene tab and select it from the variant table, or navigate directly by typing its ID into the search box. Click on the link for rs199826652 to open the Variant tab, then click on Genes and Regulation in the left panel. The codon change for rs199826652 is also ATCTTT/ATT, the same as for the other variant.

Now let’s look at the population genetics of this variant. Click on Population genetics in the left-hand menu.
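To see why two deletions of different bases can give the same allele, it helps to try it on the six reference bases quoted above (ATCTTT). The sketch below removes two different, overlapping 3 bp windows and compares the results. The offsets are chosen for illustration and are not taken from the variant records, so no claim is made here about which window belongs to which rs ID.

```python
def delete_bases(sequence: str, start: int, length: int) -> str:
    """Remove `length` bases from `sequence`, beginning at 0-based index `start`."""
    if start < 0 or length < 0 or start + length > len(sequence):
        raise ValueError("deletion falls outside the sequence")
    return sequence[:start] + sequence[start + length:]


reference = "ATCTTT"

# Two different 3 bp windows within the same six bases.
first = delete_bases(reference, 1, 3)    # removes TCT
second = delete_bases(reference, 2, 3)   # removes CTT

if __name__ == "__main__":
    print(first, second, first == second)
```

Both deletions leave ATT, which matches the ATCTTT/ATT codon change reported for the two variants.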


Here you can see that the frequency of this allele across all 1000 Genomes populations is 0.004. No individuals were homozygous for this variant, as this study only recruited healthy volunteers.

You can also see the phenotypes associated with the variant. Click on Phenotype data in the left-hand menu.
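As a rough cross-check of the allele frequency of 0.004 reported above, you can estimate how many homozygous individuals would be expected by chance alone. The sketch assumes Hardy-Weinberg proportions (homozygote frequency equals the allele frequency squared) and a hypothetical sample of 2,500 individuals; neither assumption comes from the coursebook, and the function name is illustrative.

```python
def expected_homozygotes(allele_frequency: float, individuals: int) -> float:
    """Expected number of homozygotes, assuming Hardy-Weinberg proportions."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if individuals < 0:
        raise ValueError("number of individuals must not be negative")
    return allele_frequency ** 2 * individuals


if __name__ == "__main__":
    # 0.004 is the frequency quoted in the text; 2500 is a hypothetical sample size.
    print(expected_homozygotes(0.004, 2500))
```

Under these assumptions far fewer than one homozygous individual would be expected, so a rare allele like this is unlikely to be seen in homozygous form in a sample of this size, in addition to the recruitment effect described above.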

Let’s have a look at how we can view variants in a broader context, in the Location tab. Click on the Location tab in the top bar.

Click on the blue Configure this page menu at the left.

This will open a menu that allows you to change the image.


Let’s add some tracks to this image. Open Variation from the left-hand menu and add:

● dbSNP variants – Normal

● 1000 Genomes - All - Short variants

● 1000 Genomes 3 - All (Structural variants)

● DECIPHER structural variants - Expanded

Now click on the tick in the top left-hand corner to save and close the menu. Alternatively, click anywhere outside of the menu. We can now see the tracks in the image. We can also change the way the tracks appear by hovering over the track name and then clicking the cog wheel to open a menu. We can move tracks around by clicking and dragging on the bar to the left of the track name.


To return this to the default view, go to Configure this page and select Reset configuration at the bottom of the menu.


Quick Guide to Databases and Projects

Here is a list of databases and projects you will come across in these exercises. Google any of these to learn more. Projects include many species, unless otherwise noted.

Other help:
The Ensembl Glossary: http://www.ensembl.org/Help/Glossary
Ensembl FAQs: http://www.ensembl.org/Help/Faq

SEQUENCES

EMBL-Bank, NCBI GenBank, DDBJ – Contain nucleic acid sequences deposited by submitters such as wet-lab biologists and gene sequencing projects. These three databases are synchronised with each other every day, so the same sequences should be found in each.

CCDS – Coding sequences that are agreed upon by Ensembl, VEGA-Havana, UCSC and NCBI (human and mouse).

NCBI Entrez Gene – NCBI’s gene collection.

NCBI RefSeq – NCBI’s collection of ‘reference sequences’, which includes genomic DNA, transcripts and proteins. NM stands for ‘Known mRNA’ (e.g. NM_005476) and NP stands for ‘Known protein’ (e.g. NP_005467).

UniProtKB – The “Protein knowledgebase”, a comprehensive set of protein sequences. Divided into two parts: Swiss-Prot and TrEMBL.

UniProt Swiss-Prot – The manually annotated, reviewed protein sequences in the UniProtKB. High quality.

UniProt TrEMBL – The automatically annotated, unreviewed set of proteins (EMBL-Bank translated). Varying quality.

VEGA – Vertebrate Genome Annotation, a selection of manually curated genes, transcripts and proteins (human, mouse, zebrafish, gorilla, wallaby, pig and dog).

VEGA-HAVANA – The main contributor to the VEGA project, located at the Wellcome Trust Sanger Institute, Hinxton, UK.

GENE NAMES

HGNC – HUGO Gene Nomenclature Committee, a project assigning a unique and meaningful name and symbol to every human gene (human).

ZFIN – The Zebrafish Model Organism Database. Gene names are only one part of this project (zebrafish).

PROTEIN SIGNATURES

InterPro – A collection of domains, motifs and other protein signatures. Protein signature records are extensive, and combine information from individual projects such as UniProt, along with other databases such as SMART, PFAM and PROSITE (explained below).

PFAM – A collection of protein families.

PROSITE – A collection of protein domains, families and functional sites.

SMART – A collection of evolutionarily conserved protein domains.

OTHER PROJECTS

NCBI dbSNP – A collection of sequence polymorphisms; mainly single nucleotide polymorphisms, along with insertion-deletions.

NCBI OMIM – Online Mendelian Inheritance in Man, a resource showing phenotypes and diseases related to genes (human).
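The RefSeq accession prefixes described under SEQUENCES (NM for mRNA, NP for protein) can be recognised in a script with a small lookup. This is a minimal sketch that only covers the two prefixes mentioned in this guide; RefSeq uses further prefixes that are reported as "unknown" here, and the function name is hypothetical.

```python
# Only the two prefixes described in this guide are covered.
REFSEQ_PREFIXES = {"NM": "mRNA", "NP": "protein"}


def refseq_molecule_type(accession: str) -> str:
    """Return 'mRNA', 'protein' or 'unknown' for a RefSeq-style accession."""
    prefix, separator, rest = accession.strip().partition("_")
    if not separator or not rest:
        raise ValueError(f"not a PREFIX_number accession: {accession!r}")
    return REFSEQ_PREFIXES.get(prefix.upper(), "unknown")


if __name__ == "__main__":
    for accession in ("NM_005476", "NP_005467"):
        print(accession, refseq_molecule_type(accession))
```

The two accessions in the usage lines are the examples quoted in the RefSeq entry of this guide.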
