introduction to bioinformatics monday, november 19, 2012 jonathan pevsner [email protected]...

Introduction to Bioinformatics

Monday, November 19, 2012

Jonathan Pevsner

[email protected]

M.E:800.707

• People with very diverse backgrounds in biology

• Some people with backgrounds in computer science and biostatistics

• Most people (will) have a favorite gene, protein, or disease, or a high throughput dataset

Who is taking this course?

Different user needs, different approaches

web-based or graphical user interface (GUI)

command line

NCBI, EBI, central resources

UCSC, Ensembl genome browsers

Galaxy: web implementation of browser data, NGS tools

Perl, R: manipulate data files

Linux: next-generation sequencing, other tools

Software for data analysis, large databases

Partek, MEGA5, RStudio

• To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI), UCSC, and EBI

• To focus on the analysis of DNA, RNA and proteins

• To introduce you to the analysis of genomes

• To combine theory and practice to help you solve research problems

What are the goals of the course?

• Suppose we are interested in learning about beta globin (HBB), a subunit of hemoglobin

• NCBI offers information at the levels of DNA (e.g. variation in disease), RNA (e.g. gene expression data from microarrays), protein (e.g. 3D structure), pathways

• EBI offers comparable information

• Other portals such as ExPASy offer hundreds of analysis tools

Workflow #1: How do we analyze a protein?

• Choose an individual (e.g. a patient), obtain informed consent, get blood, purify genomic DNA

• Obtain raw DNA reads by next-generation sequencing

• Align the reads to a reference human genome

• Call variants (single nucleotide variants, indels)

• Determine the functional significance of variants (deleterious or neutral)

Workflow #2: How do we analyze a human genome?

Textbook

The course has no required textbook. I wrote Bioinformatics and Functional Genomics (Wiley-Blackwell, 2nd edition 2009). The lectures in this course correspond closely to chapters.

I will make pdfs of the chapters available to everyone.

You can also purchase a copy at the bookstore, at amazon.com, or at Wiley with a 20% discount through the book’s website www.bioinfbook.org.

Web sites

The course website is reached via moodle: http://pevsnerlab.kennedykrieger.org/moodle (or Google “moodle bioinformatics”)
--This site contains the powerpoints for each lecture, including black & white versions for printing
--The weekly quizzes are here
--You can ask questions via the forum
--Audiovisual files of each lecture will be posted here

The textbook website is: http://www.bioinfbook.org

This has powerpoints, URLs, etc. organized by chapter. This is most useful to find “web documents” corresponding to each chapter.

Themes throughout the course: the beta globin gene/protein family

We will use beta globin as a model gene/protein throughout the course. Globins including hemoglobin and myoglobin carry oxygen. We will study globins in a variety of contexts including

--sequence alignment
--gene expression
--protein structure
--phylogeny
--homologs in various species

Computer labs

There is no in-class computer lab, but the seven weekly quizzes function as a take-home computer lab.

To solve the questions, you will need to go to websites, use databases, and use software.

Most quizzes are due in 7 days. Because of Thanksgiving, the first quiz will be due in 9 days (November 28th at noon).

Grading

60% moodle quizzes (your top 6 out of 7 quizzes). Quizzes are taken at the moodle website, and are due one week after the relevant lecture.

Special extended due date for quizzes due immediately after Thanksgiving and the New Year.

40% final exam Thursday, January 17 (2pm, in class). Closed book, cumulative, no computer, short answer / multiple choice. Past exams will be made available ahead of time.

Google “moodle bioinformatics” to get here; click “Bioinformatics” to sign in; the enrollment key you need is…

Outline for the course (all on Mondays)

1. Accessing information about DNA and proteins Nov. 19

2. Pairwise alignment and BLAST Nov. 26

3. Advanced BLAST to next-generation sequencing Dec. 3

4. Multiple sequence alignment Dec. 10

5. Molecular phylogeny and evolution Dec. 17

6. Microarrays Jan. 7

7. Next-generation sequencing Jan. 14

Final exam (Thursday 2:00 pm) Jan. 17

Outline for today

Definition of bioinformatics

Overview of the NCBI website

Accession numbers, RefSeq, and Entrez Gene

Two genome browsers: UCSC and Ensembl

From UCSC Table Browser to Galaxy

Learning objectives for today

[1] You should be able to explain what accession numbers are and why the RefSeq project is significant

[2] You should be able to find accession numbers for any gene (or protein) from any organism via Entrez Gene

[3] You should be able to locate any human gene using the UCSC Genome Browser

[4] You should be able to locate information about genes (and proteins) using the UCSC Table Browser and the related resource Galaxy

• Interface of biology and computers

• Analysis of proteins, genes and genomes using computer algorithms and computer databases

• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.

What is bioinformatics?

Three perspectives on bioinformatics

The cell

The organism

The tree of life

Page 4

First perspective: the cell

DNA RNA protein phenotype

Page 5

DNA RNA protein

Central dogma of molecular biology

genome transcriptome proteome

Central dogma of bioinformatics and genomics

DNA: genomic DNA databases

RNA: cDNA, ESTs, UniGene

protein: protein sequence databases

phenotype

Fig. 2.2, Page 18

Growth of GenBank

[Figure: base pairs of DNA (millions) and sequences (millions) plotted by year, 1982 to 2002]

Fig. 2.1, Page 17

Sequence Read Archive (SRA) at NCBI: over 900 terabases of sequence (~1 petabase)

A project to sequence 10,000 human genomes requires about 3 petabytes of storage (3,000 terabytes). Buying a server with 100 TB of storage now costs $10,000.
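The storage figures above can be checked with a little arithmetic. The sketch below is illustrative only: it takes the slide's numbers at face value and assumes decimal units (1 petabyte = 1,000 terabytes, 1 terabyte = 1,000 gigabytes). The variable names are made up for this example.

```python
# Figures taken from the slide; decimal units are an assumption.
GENOMES = 10_000
TOTAL_STORAGE_TB = 3_000      # 3 petabytes
SERVER_SIZE_TB = 100
SERVER_COST_USD = 10_000

# Storage needed per genome, in gigabytes (1 TB = 1,000 GB).
per_genome_gb = TOTAL_STORAGE_TB * 1_000 / GENOMES

# Number of 100 TB servers needed, and what they would cost in total.
servers_needed = TOTAL_STORAGE_TB // SERVER_SIZE_TB
total_cost_usd = servers_needed * SERVER_COST_USD

print(f"{per_genome_gb:.0f} GB per genome")
print(f"{servers_needed} servers, ${total_cost_usd:,} in total")
```

Under these assumptions each genome needs roughly 300 GB, and the hardware alone would cost about $300,000, before backups or compute.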

[Figure: size in terabases plotted by year, 2008 to 2013]

Arrival of next-generation sequencing: We have sequenced hundreds of terabases

(November 2012)

6 years ago GenBank celebrated reaching 100 billion base pairs of DNA.

Now when my lab sequences the genome of one patient, we obtain ~150 billion base pairs of sequence and ~1 terabyte of data for $3,000.
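To put ~150 billion base pairs in context, it helps to express it as depth of coverage. This is a rough sketch: the sequenced total comes from the slide, while the haploid human genome size of about 3.1 billion base pairs is an assumption added here, and `mean_coverage` is a hypothetical helper name.

```python
# ~150 billion base pairs per patient comes from the slide.
# The haploid human genome size (~3.1 billion bp) is an assumption added here.
SEQUENCED_BP = 150e9
GENOME_SIZE_BP = 3.1e9


def mean_coverage(sequenced_bp: float, genome_size_bp: float) -> float:
    """Average number of times each base of the genome is read."""
    return sequenced_bp / genome_size_bp


coverage = mean_coverage(SEQUENCED_BP, GENOME_SIZE_BP)
print(f"Approximate mean coverage: {coverage:.0f}x")
```

With these inputs the mean coverage comes out a little under 50x, meaning each position is read dozens of times on average.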

This course will reflect the major impact of increased DNA sequencing on all areas of biology.

GenBank, EMBL, DDBJ

There are three major public DNA databases

The underlying raw DNA sequences are identical

Page 14

There are three major public DNA databases

GenBank: housed at NCBI (National Center for Biotechnology Information)

EMBL: housed at EBI (European Bioinformatics Institute)

DDBJ: housed in Japan

Page 14

The storage of next-generation sequence data is an emerging challenge

NCBI offers a sequence read archive (SRA), but the best storage strategies are uncertain. http://www.ncbi.nlm.nih.gov/Traces/sra/

The European Bioinformatics Institute (EBI) offers the European Nucleotide Archive (ENA), http://www.ebi.ac.uk/ena/

For individual labs, tens of terabytes of storage are routinely needed.

Time of development

Body region, physiology, pharmacology, pathology

Page 5

Second perspective: the organism

After Pace NR (1997) Science 276:734 Page 6

Third perspective: the tree of life

Taxonomy at NCBI: >250,000 species are represented in GenBank

Page 16

http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi (11/12)

The most sequenced organisms in GenBank

Organism                          Bases (billions)
Homo sapiens                      16.3
Mus musculus                      10.0
Rattus norvegicus                 6.5
Bos taurus                        5.4
Zea mays                          5.1
Sus scrofa                        4.9
Danio rerio                       3.1
Strongylocentrotus purpuratus     1.4
Macaca mulatta                    1.3
Oryza sativa (japonica)           1.3

Updated Nov. 2012; GenBank release 192.0; excluding WGS, organelles, metagenomics

Table 2-2, Page 17

National Center for Biotechnology Information (NCBI)

www.ncbi.nlm.nih.gov

Fig. 2.4, Page 24

• National Library of Medicine's search service

• 22 million citations in MEDLINE (as of 2012)

• links to participating online journals

• PubMed tutorial on the site

http://www.ncbi.nlm.nih.gov/pubmed

or visit NLM:

http://www.nlm.nih.gov/bsd/disted/pubmed.html

Page 23

NCBI key features: PubMed

Be sure to access PubMed via Welch Library!

http://welch.jhmi.edu/

Also get to know your informationists—

Peggy Gross, Rob Wright et al.

Entrez integrates…

• the scientific literature;

• DNA and protein sequence databases;

• 3D protein structure data;

• population study data sets;

• assemblies of complete genomes

Page 24

NCBI key features: Entrez search and retrieval system

Accession numbers are labels for sequences

NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.

Page 26

What is an accession number?

An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for beta globin, HBB):

DNA
X02775          GenBank genomic DNA sequence
NG_000007.3     RefSeqGene
rs192792910     dbSNP (single nucleotide polymorphism)

RNA
AA970968.1      An expressed sequence tag (1 of 2,345)
NM_000518.4     RefSeq DNA sequence (from a transcript)

protein
NP_000509.1     RefSeq protein
CAA00182.1      GenBank protein
Q14473          SwissProt protein
1YE0|B          Protein Data Bank structure record

Page 27

NCBI’s important RefSeq project: best representative sequences

RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

RefSeq identifiers include the following formats:

Complete genome       NC_######
Complete chromosome   NC_######
Genomic contig        NT_######
mRNA (DNA format)     NM_######   e.g. NM_000518
Protein               NP_######   e.g. NP_000509

Page 27

Accession         Molecule   Method            Note
AC_123456         Genomic    Mixed             Alternate complete genomic
AP_123456         Protein    Mixed             Protein products; alternate
NC_123456         Genomic    Mixed             Complete genomic molecules
NG_123456         Genomic    Mixed             Incomplete genomic regions
NM_123456         mRNA       Mixed             Transcript products; mRNA
NM_123456789      mRNA       Mixed             Transcript products; 9-digit
NP_123456         Protein    Mixed             Protein products;
NP_123456789      Protein    Curation          Protein products; 9-digit
NR_123456         RNA        Mixed             Non-coding transcripts
NT_123456         Genomic    Automated         Genomic assemblies
NW_123456         Genomic    Automated         Genomic assemblies
NZ_ABCD12345678   Genomic    Automated         Whole genome shotgun data
XM_123456         mRNA       Automated         Transcript products
XP_123456         Protein    Automated         Protein products
XR_123456         RNA        Automated         Transcript products
YP_123456         Protein    Auto. & Curated   Protein products
ZP_12345678       Protein    Automated         Protein products

NCBI’s RefSeq project: many accession number formats for genomic, mRNA, protein sequences
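Because the two-letter prefix encodes the molecule type, a RefSeq accession can be classified mechanically. The following is a minimal sketch, not an NCBI tool: the prefix-to-molecule mapping is copied from the table on this slide, while the function name `classify_refseq` and the regular expression are assumptions made for illustration.

```python
import re
from typing import Optional

# Prefix -> molecule type, copied from the RefSeq table on this slide.
REFSEQ_MOLECULE = {
    "AC": "Genomic", "NC": "Genomic", "NG": "Genomic",
    "NT": "Genomic", "NW": "Genomic", "NZ": "Genomic",
    "NM": "mRNA", "XM": "mRNA",
    "NR": "RNA", "XR": "RNA",
    "AP": "Protein", "NP": "Protein", "XP": "Protein",
    "YP": "Protein", "ZP": "Protein",
}

# Two-letter prefix, underscore, optional letters, digits, optional ".version".
ACCESSION_PATTERN = re.compile(r"^([A-Z]{2})_([A-Z]*\d+)(?:\.(\d+))?$")


def classify_refseq(accession: str) -> Optional[str]:
    """Return the molecule type for a RefSeq-style accession, else None."""
    match = ACCESSION_PATTERN.match(accession.strip())
    if match is None:
        return None  # not in RefSeq format (e.g. a GenBank accession)
    return REFSEQ_MOLECULE.get(match.group(1))


for acc in ["NM_000518.4", "NP_000509.1", "NG_000007.3", "X02775"]:
    print(acc, "->", classify_refseq(acc))
```

Note that the GenBank accession X02775 is reported as `None`: it has no RefSeq prefix, which is one practical way to tell curated RefSeq records from other entries.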

Access to sequences: Entrez Gene at NCBI

Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.

RefSeq provides a curated, optimal accession number for each DNA (NM_000518.4 for beta globin DNA corresponding to mRNA) or protein (NP_000509.1)

Page 29

From the NCBI homepage, type “beta globin” and hit “Search”

Fig. 2.5, Page 28

Follow the link to “Gene”

Entrez Gene is in the header
Note the “Official Symbol” HBB for beta globin
Note the “limits” option

Entrez Gene (top of page): Note a useful summary, and links to other databases

“Gene” page at NCBI offers a wealth of information

• Genomic context

• Bibliography

• Phenotypes

• Gene Ontology (organizing principles of biological process, molecular function, cellular component)

• Reference sequences

• Additional (non-RefSeq) sequences

• Many, many links to NCBI resources (e.g. HomoloGene)

• Many, many links to external resources

Page 29

Entrez Gene (bottom of page): non-RefSeq accessions (it’s unclear what these are, highlighting usefulness of RefSeq)

Fig. 2.8, Page 31

Entrez Protein: accession, organism, literature…

Fig. 2.8, Page 31

Entrez Protein: …features of a protein, and its sequence in the one-letter amino acid code

You should learn the one-letter amino acid code!

Name            3-Letter   1-Letter
Alanine         Ala        A
Arginine        Arg        R
Asparagine      Asn        N
Aspartic acid   Asp        D
Cysteine        Cys        C
Glutamic acid   Glu        E
Glutamine       Gln        Q
Glycine         Gly        G
Histidine       His        H
Isoleucine      Ile        I
Leucine         Leu        L
Lysine          Lys        K
Methionine      Met        M
Phenylalanine   Phe        F
Proline         Pro        P
Serine          Ser        S
Threonine       Thr        T
Tryptophan      Trp        W
Tyrosine        Tyr        Y
Valine          Val        V
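One way to practice the code is to convert between the two notations programmatically. This is a small illustrative sketch: the dictionary is copied from the table above, and the helper name `to_one_letter` and the example peptide are made up for this example.

```python
# Three-letter -> one-letter codes, copied from the table above.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter codes into a one-letter string."""
    # capitalize() lets "ALA", "ala" and "Ala" all be accepted.
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


# An illustrative peptide written with three-letter codes.
peptide = ["Met", "Val", "His", "Leu", "Thr", "Pro", "Glu", "Glu", "Lys"]
print(to_one_letter(peptide))
```

An unknown code raises a `KeyError`, which is a reasonable behavior for a sketch: silently skipping residues would shift every downstream position.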

Entrez Protein: You can change the display (as shown)…

FASTA format: versatile, compact with one header line followed by a string of nucleotides or amino acids in the single letter code

Fig. 2.9, Page 32
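Since a FASTA record is just a header line starting with ">" followed by sequence lines, it is easy to read with a few lines of code. The sketch below is a minimal reader written for illustration, not a replacement for an established library; the function name `parse_fasta` and the example record are hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence may be wrapped over several lines; collect the pieces.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


# A made-up record; the sequence is wrapped over two lines on purpose.
example = """>example_protein hypothetical record for illustration
MVHLTPEEKSAVTALWGKV
NVDEVGGEALGRLL
"""

records = parse_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence), sequence)
```

The reader joins wrapped sequence lines into one string, which is why the header line is the only reliable record boundary in this format.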

While FASTA is one file format, there are many others

FASTA   Sequences in one letter DNA or protein code
FASTQ   DNA sequences with quality scores for each base
BAM     compressed binary version of SAM
SAM     Sequence Alignment/Map file (tab-delimited)
VCF     variant call format (genomic variants; indels)

(See genome.ucsc.edu/FAQ/FAQformat.html for the following:)

BED     a table including chromosome, start, end
WIG     wiggle format (displays dense, continuous data)
GFF     General Feature Format (tab separated)

Besides Excel (.xls, .xlsx), spreadsheets can also be:
.txt    tab-delimited text file (or space delimited)
.csv    comma separated text file

FASTQ format specification

The FASTQ format has four lines per read (and typically has millions of reads)

http://maq.sourceforge.net/fastq.shtml

Sequence read (like FASTA)

Quality scores (per base)

Sequencing run information
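The four-line structure can be made concrete with a short reader. This is a sketch under stated assumptions: it assumes each read occupies exactly four lines (header starting with "@", sequence, a "+" separator, quality string) and that qualities use the Phred+33 (Sanger) encoding; older Illumina files used a different offset. The function name `parse_fastq` and the example read are hypothetical.

```python
def parse_fastq(text, phred_offset=33):
    """Yield (read_id, sequence, quality_scores) for each four-line record.

    phred_offset=33 assumes Sanger-style quality encoding.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input should have four lines per read")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed record at non-blank line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("Sequence and quality strings differ in length")
        # Each quality character encodes one per-base Phred score.
        scores = [ord(ch) - phred_offset for ch in quality]
        yield header[1:], sequence, scores


# One made-up read: run information, bases, separator, qualities.
example = "@read1 hypothetical run information\nGATTACA\n+\nIIIIHH#\n"

for read_id, sequence, scores in parse_fastq(example):
    print(read_id, sequence, scores)
```

In this encoding "I" corresponds to a score of 40 and "#" to a score of 2, so the last base of the example read is far less trustworthy than the first.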

Genome Browsers: increasingly important resources

Genomic DNA is organized in chromosomes. Genome browsers display ideograms (pictures) of chromosomes, with user-selected “annotation tracks” that display many kinds of information.

The two most essential human genome browsers are at Ensembl and UCSC. We will focus on UCSC (but the two are equally important). The browser at NCBI is less commonly used.

Ensembl genome browser (www.ensembl.org)

click human

note BioMart

enter beta globin

Ensembl output for beta globin includes views of chromosome 11 (top), the region (middle), and a detailed view (bottom).

There are various horizontal annotation tracks.

A quiz question about Ensembl/BioMart

For this week’s quiz/computer lab, one question asks you to use BioMart at Ensembl to find all the microRNAs on chromosome 11, as well as their GC content. There are step-by-step instructions and it should take no more than a couple of minutes.
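BioMart reports GC content for you, but it helps to know what the number means: the percentage of bases in a sequence that are G or C. The sketch below shows the calculation on a made-up sequence; it is not how BioMart computes its value internally, and `gc_content` is a hypothetical helper name.

```python
def gc_content(sequence):
    """Return the percentage of G and C bases in a DNA or RNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("Empty sequence")
    gc = sequence.count("G") + sequence.count("C")
    return 100.0 * gc / len(sequence)


# A short made-up sequence, not a real microRNA.
print(f"{gc_content('GATTACAGGC'):.1f}% GC")
```

The example sequence has five G or C bases out of ten, so its GC content is 50%.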

The UCSC Genome Browser: an increasingly important resource

• This browser’s focus is on humans and other eukaryotes

• you can select which tracks to display (and how much information for each track)

• tracks are based on data generated by the UCSC team and by the broad research community

• you can create “custom tracks” of your own data! Just format a spreadsheet properly and upload it

• The Table Browser is as important as the more visual Genome Browser, and you can move between the two
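As an illustration of what a properly formatted custom track can look like, the sketch below writes a few intervals in BED style (chromosome, start, end, name, tab-separated) preceded by a "track" line. The interval coordinates, track name, and the helper `make_bed_custom_track` are all hypothetical; check the UCSC format documentation for the full set of track-line options before uploading real data.

```python
import csv
import io


def make_bed_custom_track(name, description, intervals):
    """Return custom-track text: one 'track' line followed by BED rows."""
    buffer = io.StringIO()
    buffer.write(f'track name="{name}" description="{description}"\n')
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for chrom, start, end, label in intervals:
        if start >= end:
            raise ValueError(f"start must be less than end for {label}")
        writer.writerow([chrom, start, end, label])
    return buffer.getvalue()


# Made-up regions on chromosome 11, purely for illustration.
intervals = [
    ("chr11", 5000000, 5000500, "region_A"),
    ("chr11", 5001000, 5002000, "region_B"),
]
print(make_bed_custom_track("my_regions", "Hypothetical regions of interest", intervals))
```

The resulting text can be saved to a file or pasted into the browser's custom track form; the same rows could equally be exported from a spreadsheet as tab-delimited text.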

[1] Visit http://genome.ucsc.edu/, click Genome Browser

[2] Choose organisms, enter query (beta globin), hit submit

Page 36

Note that there are choices of assemblies such as hg19

An assembly (or “build”) is a fixed version of a genome. Builds are released every several years. In practice, you should always be aware whether you are using hg18 or hg19 for the human genome. They are annotated with different types of information such as experimental data sets.

To learn more visit http://www.ncbi.nlm.nih.gov/assembly/basics/ http://www.ncbi.nlm.nih.gov/assembly/model/

[3] Choose the RefSeq beta globin gene (HBB)

[4] On the UCSC Genome Browser:
--choose which tracks to display
--add custom tracks
--the Table Browser is complementary

Exploring the UCSC Genome Browser

• The human genome can be viewed with different “assemblies” (hg18, hg19). These contain different data sets.

• You can get information about a track by clicking its header (e.g. “RefSeq Genes”).

• You can choose the density to display each track (e.g. hide, dense, squish, pack).

You can reach the Table Browser from the Genome Browser

Use the UCSC Table Browser to get tabular data related to any of the Genome Browser tracks. It is quantitative, not visual.

Get output

Get a summary of the output

assembly

species

position (where you were on the Genome Browser)

check box to send a BED file to Galaxy

Step 2: click “Send query to Galaxy”

Tools panel

Display panel

History panel

Step 3: In history panel, click eye to view main panel; click edit attributes to change name to snps

Step 3 (continued): Thus, there are ~890,000 SNPs on human chromosome 11

Step 4: To get coding exons, click “Get Data” (under Tools) then “UCSC Main table browser”

Step 4 (continued): To get coding exons, set the group to Genes and the track to RefSeq Genes, then “get output”

Step 4 (continued): Set the output to “Coding Exons” then send the query to Galaxy

Step 5: In Galaxy, a new data set appears. Rename it “coding exons” by clicking the pencil. Under Tools click “Operate on Genomic Intervals.”

Step 6: Click “Intersect the intervals of two datasets.” In the central panel, choose snps and coding exons. Execute.

Step 7. Here’s the answer. There are 11,652 SNPs in coding exons. Finally, click “display at UCSC main.”

Step 7. This returns us to the UCSC Genome Browser, where our Galaxy results are displayed as a custom track (entitled “User Supplied Track”).
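The core of the Galaxy exercise, asking which SNPs fall inside coding exons, is an interval-overlap question. The sketch below illustrates the idea on a handful of made-up intervals. It is a simple brute-force comparison written for clarity, not Galaxy's implementation, and it only counts overlapping SNPs rather than returning the intersected intervals. Coordinates follow the BED convention (0-based start, end not included), and the function name `count_snps_in_exons` is hypothetical.

```python
def count_snps_in_exons(snps, exons):
    """Count SNP intervals that overlap at least one exon interval.

    Intervals are (chrom, start, end) tuples in BED style:
    0-based start, end not included.
    """
    # Group exons by chromosome so each SNP is only compared to relevant exons.
    exons_by_chrom = {}
    for chrom, start, end in exons:
        exons_by_chrom.setdefault(chrom, []).append((start, end))

    count = 0
    for chrom, snp_start, snp_end in snps:
        for exon_start, exon_end in exons_by_chrom.get(chrom, []):
            # Two half-open intervals overlap if each starts before the other ends.
            if snp_start < exon_end and exon_start < snp_end:
                count += 1
                break  # count each SNP once, even if it hits several exons
    return count


# Made-up data: three SNPs and two exons on chromosome 11.
snps = [("chr11", 100, 101), ("chr11", 250, 251), ("chr11", 900, 901)]
exons = [("chr11", 90, 120), ("chr11", 800, 1000)]
print(count_snps_in_exons(snps, exons))
```

Two of the three example SNPs fall inside an exon. With ~890,000 SNPs this nested loop would be slow, which is one reason dedicated interval tools are used for real data.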

Galaxy’s URL is http://usegalaxy.org. Its features include:

• Integration with UCSC and many other major genomics resources such as BioMart

• It is intuitive and does not require knowledge of computer programming, but it offers access to sophisticated software

• There are excellent videocasts and tutorials

• It fosters reproducibility. Your history and workflows are saved. See the “User” tab and sign up for a free account!

Summary: Galaxy

[1] NCBI is a central resource for bioinformatics.

[2] We described accession numbers and the RefSeq project, which offers a trusted version of a sequence. Use Entrez Gene as a key source of gene information.

[3] We used the UCSC Genome Browser to visualize information along chromosomes.

[4] We then used the UCSC Table Browser to obtain tabular outputs of queries. The underlying data are the same in the Genome and Table Browsers.

[5] Next we used Galaxy to solve a problem, and sent the final result back to the UCSC Genome Browser.

Summary: today’s lecture

Reminder: Please enroll! Google “moodle bioinformatics” to get here; click “Bioinformatics” to sign in; the enrollment key is…
