introduction to bioinformatics monday, november 19, 2012 jonathan pevsner [email protected]...

82
Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner [email protected] Bioinformatics M.E:800.707

Upload: franklin-king

Post on 16-Dec-2015

224 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Introduction to Bioinformatics

Monday, November 19, 2012Jonathan Pevsner

[email protected]

M.E:800.707

Page 2: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

• People with very diverse backgrounds in biology

• Some people with backgrounds in computer

science and biostatistics

• Most people (will) have a favorite gene, protein, or disease,

or a high throughput dataset

Who is taking this course?

Page 3: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Different user needs, different approaches

web-based or graphical user interface (GUI)

command line

NCBIEBI,

centralresources

UCSC,Ensemblgenomebrowsers

Galaxy: webimplementationof browser data,

NGS tools

Perl, R:manipulate data files

Linux:next-

generation sequencing, other tools

Software for data analysis,

large databases

PartekMEGA5RStudio

Page 4: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

• To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI), UCSC, and EBI

• To focus on the analysis of DNA, RNA and proteins

• To introduce you to the analysis of genomes

• To combine theory and practice to help you solve research problems

What are the goals of the course?

Page 5: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

• Suppose we are interested in learning about beta globin (HBB), a subunit of hemoglobin

• NCBI offers information at the levels of DNA (e.g. variation in disease), RNA (e.g. gene expression data from microarrays), protein (e.g. 3D structure), pathways

• EBI offers comparable information

• Other portals such as ExPASy offer hundreds of analysis tools

Workflow #1: How do we analyze a protein?

Page 6: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

• Choose an individual (e.g. a patient), obtain informed consent, get blood, purify genomic DNA

• Obtain raw DNA reads by next-generation sequencing

• Align the reads to a reference human genome

• Call variants (single nucleotide variants, indels)

• Determine the functional significance of variants (deleterious or neutral)

Workflow #2: How do we analyze a human genome?

Page 7: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Textbook

The course textbook has no required textbook. I wrote Bioinformatics and Functional Genomics (Wiley-Blackwell, 2nd edition 2009). The lectures in this course correspond closely to chapters.

I will make pdfs of the chapters available to everyone.

You can also purchase a copy at the bookstore, at amazon.com, or at Wiley with a 20% discount through the book’s website www.bioinfbook.org.

Page 8: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Web sites

The course website is reached via moodle: http://pevsnerlab.kennedykrieger.org/moodle(or Google “moodle bioinformatics”)--This site contains the powerpoints for each lecture, including black & white versions for printing--The weekly quizzes are here--You can ask questions via the forum--Audiovisual files of each lecture will be posted here

The textbook website is:http://www.bioinfbook.orgThis has powerpoints, URLs, etc. organized by

chapter. This is most useful to find “web documents” corresponding to each chapter.

Page 9: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Themes throughout the course: the beta globin gene/protein family

We will use beta globin as a model gene/protein throughout the course. Globins including hemoglobin and myoglobin carry oxygen. We will study globins in a variety of contexts including

--sequence alignment--gene expression--protein structure--phylogeny--homologs in various species

Page 10: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Computer labs

There is no in-class computer lab, but the seven weekly quizzes function as a take-home computer lab.

To solve the questions, you will need to go to websites, use databases, and use software.

Most quizzes are due in 7 days. Because of Thanksgiving, the first quiz will be due in 9 days (November 28th at noon).

Page 11: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Grading

60% moodle quizzes (your top 6 out of 7 quizzes). Quizzes are taken at the moodle website, andare due one week after the relevant lecture.

Special extended due date for quizzes due immediately after Thanksgiving and the New Year.

40% final exam Thursday, January 17 (2pm, in class).Closed book, cumulative, no computer,short answer / multiple choice. Past exams will be made available ahead of time.

Page 12: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Google “moodle bioinformatics” to get here;Click “Bioinformatics” to sign in;The enrollment key you need is…

Page 13: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707
Page 14: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Outline for the course (all on Mondays)

1. Accessing information about DNA and proteins Nov. 19

2. Pairwise alignment and BLAST Nov. 26

3. Advanced BLAST to next-generation sequencing Dec. 3

4. Multiple sequence alignment Dec. 10

5. Molecular phylogeny and evolution Dec. 17

6. Microarrays Jan. 7

7. Next-generation sequencing Jan. 14

Final exam (Thursday 2:00 pm) Jan. 17

Page 15: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Outline for today

Definition of bioinformatics

Overview of the NCBI website

Accession numbers, RefSeq, and Entrez Gene

Two genome browsers: UCSC and Ensembl

From UCSC Table Browser to Galaxy

Page 16: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Learning objectives for today

[1] You should be able to explain what accession numbers are and why the RefSeq project is significant

[2] You should be able to find accession numbers for any gene (or protein) from any organism via Entrez Gene

[3] You should be able to locate any human gene using the UCSC Genome Browser

[4] You should be able to locate information about genes (and proteins) using the UCSC Table Browser and the related resource Galaxy

Page 17: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

• Interface of biology and computers

• Analysis of proteins, genes and genomes using computer algorithms and computer databases

• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.

What is bioinformatics?

Page 18: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Three perspectives on bioinformatics

The cell

The organism

The tree of life

Page 4

Page 19: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

First perspective: the cell

Page 20: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

DNA RNA phenotypeprotein

Page 5

Page 21: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

DNA RNA protein

Central dogma of molecular biology

genome transcriptome proteome

Central dogma of bioinformatics and genomics

Page 22: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

DNA RNA

cDNAESTsUniGene

phenotype

genomicDNAdatabases

protein sequence databases

protein

Fig. 2.2Page 18

Page 23: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Growth of GenBank

Year

Bas

e p

airs

of

DN

A (

mil

lio

ns)

Seq

uen

ces

(mil

lio

ns)

1982 1986 1990 1994 1998 2002 Fig. 2.1Page 17

Page 24: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Sequence Read Archive (SRA) at NCBI: over 900 terabases of sequence (~1 petabase)

A project to sequence 10,000 human genomes requires about 3 petabytes of storage (3,000 terabytes). To buy a server with 100 Tb now costs $10,000.

Year201320122011201020092008

Siz

e,

tera

ba

ses

Page 25: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Arrival of next-generation sequencing: We have sequenced hundreds of terabases

(November 2012)

6 years ago GenBank celebrated reaching 100 billion base pairs of DNA.

Now when my lab sequences the genome of one patient, we obtain ~150 billion base pairs of sequence and ~1 terabyte of data for $3,000.

This course will reflect the major impact of increased DNA sequencing on all areas of biology.

Page 26: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

GenBankEMBL DDBJ

There are three major public DNA databases

The underlying raw DNA sequences are identical

Page 14

Page 27: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

GenBankEMBL DDBJ

Housedat EBI

EuropeanBioinformatics

Institute

There are three major public DNA databases

Housed at NCBINational

Center forBiotechnology

Information

Housed in Japan

Page 14

Page 28: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

There storage of next-generation sequence data is an emerging challenge

NCBI offers a sequence read archive (SRA), but the best storage strategies are uncertain. http://www.ncbi.nlm.nih.gov/Traces/sra/

The European Bioinformatics Institute (EBI) offers the European Nucleotide Archive (ENA), http://www.ebi.ac.uk/ena/

For individual labs, tens of terabytes of storage are routinely needed.

Page 29: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Time ofdevelopment

Body region, physiology, pharmacology, pathology

Page 5

Second perspective: the organism

Page 30: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

After Pace NR (1997) Science 276:734 Page 6

Third perspective: the tree of life

Page 31: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Taxonomy at NCBI:>250,000 species are represented in GenBank

Page 16http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi11/12

Page 32: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

The most sequenced organisms in GenBank

Homo sapiens 16.3 billion basesMus musculus 10.0bRattus norvegicus 6.5bBos taurus 5.4bZea mays 5.1bSus scrofa 4.9bDanio rerio 3.1bStrongylocentrotus purpurata 1.4bMacaca mulatta 1.3bOryza sativa (japonica) 1.3b

Updated Nov. 2012GenBank release 192.0Excluding WGS, organelles, metagenomics

Table 2-2Page 17

Page 33: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Outline for today

Definition of bioinformatics

Overview of the NCBI website

Accession numbers, RefSeq, and Entrez Gene

Two genome browsers: UCSC and Ensembl

From UCSC Table Browser to Galaxy

Page 34: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

National Center for Biotechnology Information (NCBI)www.ncbi.nlm.nih.gov

Fig. 2.4Page 24

Page 35: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

• National Library of Medicine's search service• 22 million citations in MEDLINE (as of 2012)• links to participating online journals• PubMed tutorial on the site

http://www.ncbi.nlm.nih.gov/pubmed

or visit NLM:

http://www.nlm.nih.gov/bsd/disted/pubmed.html

Page 23

NCBI key features: PubMed

Be sure to access PubMed via Welch Library!

http://welch.jhmi.edu/

Also get to know your informationists—

Peggy Gross, Rob Wright et al.

Page 36: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Entrez integrates…

• the scientific literature; • DNA and protein sequence databases; • 3D protein structure data; • population study data sets; • assemblies of complete genomes

Page 24

NCBI key features: Entrez search and retrieval system

Page 37: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Outline for today

Definition of bioinformatics

Overview of the NCBI website

Accession numbers, RefSeq, and Entrez Gene

Two genome browsers: UCSC and Ensembl

From UCSC Table Browser to Galaxy

Page 38: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Accession numbers are labels for sequences

NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or theraw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequenceor other record relevant to molecular data.

Page 26

Page 39: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

What is an accession number?

An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for beta globin, HBB):

X02775 GenBank genomic DNA sequenceNG_000007.3 RefSeqGeners192792910 dbSNP (single nucleotide polymorphism)

AA970968.1 An expressed sequence tag (1 of 2,345)NM_000518.4 RefSeq DNA sequence (from a transcript)

NP_000509.1 RefSeq proteinCAA00182.1 GenBank proteinQ14473 SwissProt protein1YE0|B Protein Data Bank structure record

protein

DNA

RNA

Page 27

Page 40: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

NCBI’s important RefSeq project: best representative sequences

RefSeq (accessible via the main page of NCBI)provides an expertly curated accession number thatcorresponds to the most stable, agreed-upon “reference”version of a sequence.

RefSeq identifiers include the following formats:

Complete genome NC_######Complete chromosome NC_######Genomic contig NT_######mRNA (DNA format) NM_###### e.g. NM_000518Protein NP_###### e.g. NP_000509

Page 27

Page 41: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Accession Molecule Method NoteAC_123456 Genomic Mixed Alternate complete genomicAP_123456 Protein Mixed Protein products; alternateNC_123456 Genomic Mixed Complete genomic moleculesNG_123456 Genomic Mixed Incomplete genomic regionsNM_123456 mRNA Mixed Transcript products; mRNA NM_123456789 mRNA Mixed Transcript products; 9-digit NP_123456 Protein Mixed Protein products; NP_123456789 Protein Curation Protein products; 9-digit NR_123456 RNA Mixed Non-coding transcripts NT_123456 Genomic Automated Genomic assembliesNW_123456 Genomic Automated Genomic assemblies NZ_ABCD12345678 Genomic Automated Whole genome shotgun dataXM_123456 mRNA Automated Transcript productsXP_123456 Protein Automated Protein productsXR_123456 RNA Automated Transcript productsYP_123456 Protein Auto. & Curated Protein productsZP_12345678 Protein Automated Protein products

NCBI’s RefSeq project: many accession number formats for genomic, mRNA, protein sequences

Page 42: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Access to sequences: Entrez Gene at NCBI

Entrez Gene is a great starting point: it collectskey information on each gene/protein from major databases. It covers all major organisms.

RefSeq provides a curated, optimal accession number for each DNA (NM_000518.4 for beta globin DNA corresponding to mRNA) or protein (NP_000509.1)

Page 29

Page 43: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

From the NCBI homepage, type “beta globin”and hit “Search”

Fig. 2.5Page 28

Page 44: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Fig. 2.5Page 28

Follow the link to “Gene”

Page 45: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Entrez Gene is in the headerNote the “Official Symbol” HBB for beta globinNote the “limits” option

Page 46: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Page 30

Entrez Gene (top of page):Note a useful summary, and links to other databases

Page 47: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

“Gene” page at NCBI offers a wealth of information

• Genomic context• Bibliography• Phenotypes• Gene Ontology (organizing principles of biological

process, molecular function, cellular component)• Reference sequences• Additional (non-RefSeq sequences)• Many, many links to NCBI resources (e.g. HomoloGene)• Many, many links to external resources

Page 29

Page 48: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Entrez Gene (bottom of page): non-RefSeq accessions(it’s unclear what these are, highlighting usefulness of RefSeq)

Page 49: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Fig. 2.8Page 31

Entrez Protein:accession, organism,literature…

Page 50: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Fig. 2.8Page 31

Entrez Protein:…features of a protein, and its sequence in the one-letter amino acid code

Page 51: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

You should learn the one-letter amino acid code!

Name 3-Letter 1-LetterAlanine Ala AArginine Arg R

Asparagine Asn NAspartic acid Asp D

Cysteine Cys CGlutamic Acid Glu E

Glutamine Gln QGlycine Gly G

Histidine His HIsoleucine Ile I

Name 3-Letter 1-LetterLeucine Leu LLysine Lys K

Methionine Met MPhenylalanine Phe F

Proline Pro PSerine Ser S

Threonine Thr TTryptophan Trp W

Tyrosine Tyr YValine Val V

Page 52: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Page 31

Entrez Protein:You can change the display (as shown)…

Page 53: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

FASTA format:versatile, compact with one header line

followed by a string of nucleotides or amino acids in the single letter code

Fig. 2.9Page 32

Page 54: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

While FASTA is one file format, there are many others

FASTA Sequences in one letter DNA or protein codeFASTQ DNA sequences with quality scores for each baseBAM compressed binary version of SAMSAM Sequence Alignment/Map file (tab-delimited)VCF variant call format (genomic variants; indels)

(See genome.ucsc.edu/FAQ/FAQformat.html for the following:)

BED a table including chromosome, start, endWIG wiggle format (displays dense, continuous data)GFF General Feature Format (tab separated)

Also, besides Excel (.xls, .xlsx) spreadsheets can also be:.txt tab-delimited text file (or space delimited).csv comma separated text file

Page 55: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

FASTQ format specification

The FASTQ format has four lines per read (and typically has millions of reads)

http://maq.sourceforge.net/fastq.shtml

Sequence read (like FASTA)

Quality scores (per base)

Sequencing run information

Page 56: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Outline for today

Definition of bioinformatics

Overview of the NCBI website

Accession numbers, RefSeq, and Entrez Gene

Two genome browsers: UCSC and Ensembl

From UCSC Table Browser to Galaxy

Page 57: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Genome Browsers:increasingly important resources

Genomic DNA is organized in chromosomes. Genome browsers display ideograms (pictures) of chromosomes, with user-selected “annotation tracks” that display many kinds of information.

The two most essential human genome browsers are at Ensembl and UCSC. We will focus on UCSC (but the two are equally important). The browser at NCBI is less commonly used.

Page 58: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Ensembl genome browser (www.ensembl.org)

clickhuman

note BioMart

Page 59: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

enterbeta globin

Page 60: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Ensembl output for beta globin includes views of chromosome 11 (top), the region (middle), and a detailed view (bottom).

There are various horizontal annotation tracks.

Page 61: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

A quiz question about Ensembl/BioMart

For this week’s quiz/computer lab, one question asks you to use BioMart at Ensembl to find all the microRNAs on chromosome 11, as well as their GC content. There are step-by-step instructions and it should take no more than a couple minutes.

Page 62: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

The UCSC Genome Browser:an increasingly important resource

• This browser’s focus is on humans and other eukaryotes

• you can select which tracks to display (and how much information for each track)

• tracks are based on data generated by the UCSC team and by the broad research community

• you can create “custom tracks” of your own data! Just format a spreadsheet properly and upload it

• The Table Browser is equally important as the more visual Genome Browser, and you can move between the two

Page 63: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

[1] Visit http://genome.ucsc.edu/, click Genome Browser

[2] Choose organisms, enter query (beta globin), hit submit

Page 36

Page 64: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Note that there are choices of assemblies such as hg19

An assembly (or “build”) is a fixed version of a genome. Builds are released every several years. In practice, you should always be aware whether you are using hg18 or hg19 for the human genome. They are annotated with different types of information such as experimental data sets.

To learn more visit http://www.ncbi.nlm.nih.gov/assembly/basics/ http://www.ncbi.nlm.nih.gov/assembly/model/

Page 65: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

[3] Choose the RefSeq beta globin gene (HBB)

Page 66: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

[4] On the UCSC Genome Browser:--choose which tracks to display--add custom tracks--the Table Browser is complementary

Page 67: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Exploring the UCSC Genome Browser

• The human genome can be viewed with different “assemblies” (hg18, hg19). These contain different data sets.

• You can get information about a track by clicking its header (e.g. “RefSeq Genes”).

• You can choose the density to display each track (e.g. hide, dense, squish, pack).

Page 68: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Outline for today

Definition of bioinformatics

Overview of the NCBI website

Accession numbers, RefSeq, and Entrez Gene

Two genome browsers: UCSC and Ensembl

From UCSC Table Browser to Galaxy

Page 69: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

You can reach the Table Browser from the Genome Browser

Page 70: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Use the UCSC Table Browser to get tabular data related to any of the Genome Browser tracks. It is quantitative, not visual.

Get output Get a summary of the output

assemblyspecies

position (where you were on the Genome Browser)

check box to send a BED file to Galaxy

Page 71: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Step 2: click “Send query to Galaxy”

Page 72: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Toolspanel

Displaypanel

Historypanel

Step 3: In history panel, click eye to view main panel; click edit attributes to change name to snps

Page 73: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Step 3 (continued): Thus, there are ~890,000 SNPs on human chromosome 11

Step 4: To get coding exons, click “Get Data” (under Tools) then “UCSC Main table browser”

Page 74: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Step 4 (continued): To get coding exons, set the group to Genes and the track to RefSeq Genes, then “get output”

Page 75: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Step 4 (continued): Set the output to “Coding Exons” then send the query to Galaxy

Page 76: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Step 5: In Galaxy, a new data set appears. Rename it “coding exons” by clicking the pencil. Under Tools click “Operate on Genomic Intervals.”

Page 77: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Step 6: Click “Intersect the intervals of two datasets.” In the central panel, choose snps and coding exons. Execute.

Page 78: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Step 7. Here’s the answer. There are 11,652 SNPs in coding exons. Finally, click “display at UCSC main.”

Page 79: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Step 7. This returns us to the UCSC Genome Browser, where our Galaxy results are displayed as a custom track (entitled “User Supplied Track”).

Page 80: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Galaxy’s URL is http://usegalaxy.org. Its features include:

• Integration with UCSC and many other major genomics resources such as BioMart

• It is intuitive and does not require knowledge of computer programming, but it offers access to sophisticated software

• There are excellent videocasts and tutorials

• It fosters reproducibility. Your history and workflows are saved. See the “User” tab and sign up for a free account!

Summary: Galaxy

Page 81: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

[1] NCBI is a central resource for bioinformatics.

[2] We described accession numbers and the RefSeq project, which offers a trusted version of a sequence. Use Entrez Gene as a key source of gene information.

[3] We used the UCSC Genome Browser to visualize information along chromosomes.

[4] We then used the UCSC Table Browser to obtain tabular outputs of queries. The underlying data are the same in the Genome and Table Browsers.

[5] Next we used Galaxy to solve a problem, and sent the final result back to the UCSC Genome Browser.

Summary: today’s lecture

Page 82: Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

Reminder: Please enroll! Google “moodle bioinformatics” to get here; click “Bioinformatics” to sign in;The enrollment key is…