introduction to bioinformatics monday, november 19, 2012 jonathan pevsner [email protected]...
TRANSCRIPT
Introduction to Bioinformatics
Monday, November 19, 2012Jonathan Pevsner
M.E:800.707
• People with very diverse backgrounds in biology
• Some people with backgrounds in computer
science and biostatistics
• Most people (will) have a favorite gene, protein, or disease,
or a high throughput dataset
Who is taking this course?
Different user needs, different approaches
web-based or graphical user interface (GUI)
command line
NCBIEBI,
centralresources
UCSC,Ensemblgenomebrowsers
Galaxy: webimplementationof browser data,
NGS tools
Perl, R:manipulate data files
Linux:next-
generation sequencing, other tools
Software for data analysis,
large databases
PartekMEGA5RStudio
• To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI), UCSC, and EBI
• To focus on the analysis of DNA, RNA and proteins
• To introduce you to the analysis of genomes
• To combine theory and practice to help you solve research problems
What are the goals of the course?
• Suppose we are interested in learning about beta globin (HBB), a subunit of hemoglobin
• NCBI offers information at the levels of DNA (e.g. variation in disease), RNA (e.g. gene expression data from microarrays), protein (e.g. 3D structure), pathways
• EBI offers comparable information
• Other portals such as ExPASy offer hundreds of analysis tools
Workflow #1: How do we analyze a protein?
• Choose an individual (e.g. a patient), obtain informed consent, get blood, purify genomic DNA
• Obtain raw DNA reads by next-generation sequencing
• Align the reads to a reference human genome
• Call variants (single nucleotide variants, indels)
• Determine the functional significance of variants (deleterious or neutral)
Workflow #2: How do we analyze a human genome?
Textbook
The course textbook has no required textbook. I wrote Bioinformatics and Functional Genomics (Wiley-Blackwell, 2nd edition 2009). The lectures in this course correspond closely to chapters.
I will make pdfs of the chapters available to everyone.
You can also purchase a copy at the bookstore, at amazon.com, or at Wiley with a 20% discount through the book’s website www.bioinfbook.org.
Web sites
The course website is reached via moodle: http://pevsnerlab.kennedykrieger.org/moodle(or Google “moodle bioinformatics”)--This site contains the powerpoints for each lecture, including black & white versions for printing--The weekly quizzes are here--You can ask questions via the forum--Audiovisual files of each lecture will be posted here
The textbook website is:http://www.bioinfbook.orgThis has powerpoints, URLs, etc. organized by
chapter. This is most useful to find “web documents” corresponding to each chapter.
Themes throughout the course: the beta globin gene/protein family
We will use beta globin as a model gene/protein throughout the course. Globins including hemoglobin and myoglobin carry oxygen. We will study globins in a variety of contexts including
--sequence alignment--gene expression--protein structure--phylogeny--homologs in various species
Computer labs
There is no in-class computer lab, but the seven weekly quizzes function as a take-home computer lab.
To solve the questions, you will need to go to websites, use databases, and use software.
Most quizzes are due in 7 days. Because of Thanksgiving, the first quiz will be due in 9 days (November 28th at noon).
Grading
60% moodle quizzes (your top 6 out of 7 quizzes). Quizzes are taken at the moodle website, andare due one week after the relevant lecture.
Special extended due date for quizzes due immediately after Thanksgiving and the New Year.
40% final exam Thursday, January 17 (2pm, in class).Closed book, cumulative, no computer,short answer / multiple choice. Past exams will be made available ahead of time.
Google “moodle bioinformatics” to get here;Click “Bioinformatics” to sign in;The enrollment key you need is…
Outline for the course (all on Mondays)
1. Accessing information about DNA and proteins Nov. 19
2. Pairwise alignment and BLAST Nov. 26
3. Advanced BLAST to next-generation sequencing Dec. 3
4. Multiple sequence alignment Dec. 10
5. Molecular phylogeny and evolution Dec. 17
6. Microarrays Jan. 7
7. Next-generation sequencing Jan. 14
Final exam (Thursday 2:00 pm) Jan. 17
Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accession numbers, RefSeq, and Entrez Gene
Two genome browsers: UCSC and Ensembl
From UCSC Table Browser to Galaxy
Learning objectives for today
[1] You should be able to explain what accession numbers are and why the RefSeq project is significant
[2] You should be able to find accession numbers for any gene (or protein) from any organism via Entrez Gene
[3] You should be able to locate any human gene using the UCSC Genome Browser
[4] You should be able to locate information about genes (and proteins) using the UCSC Table Browser and the related resource Galaxy
• Interface of biology and computers
• Analysis of proteins, genes and genomes using computer algorithms and computer databases
• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
What is bioinformatics?
Three perspectives on bioinformatics
The cell
The organism
The tree of life
Page 4
First perspective: the cell
DNA RNA phenotypeprotein
Page 5
DNA RNA protein
Central dogma of molecular biology
genome transcriptome proteome
Central dogma of bioinformatics and genomics
DNA RNA
cDNAESTsUniGene
phenotype
genomicDNAdatabases
protein sequence databases
protein
Fig. 2.2Page 18
Growth of GenBank
Year
Bas
e p
airs
of
DN
A (
mil
lio
ns)
Seq
uen
ces
(mil
lio
ns)
1982 1986 1990 1994 1998 2002 Fig. 2.1Page 17
Sequence Read Archive (SRA) at NCBI: over 900 terabases of sequence (~1 petabase)
A project to sequence 10,000 human genomes requires about 3 petabytes of storage (3,000 terabytes). To buy a server with 100 Tb now costs $10,000.
Year201320122011201020092008
Siz
e,
tera
ba
ses
Arrival of next-generation sequencing: We have sequenced hundreds of terabases
(November 2012)
6 years ago GenBank celebrated reaching 100 billion base pairs of DNA.
Now when my lab sequences the genome of one patient, we obtain ~150 billion base pairs of sequence and ~1 terabyte of data for $3,000.
This course will reflect the major impact of increased DNA sequencing on all areas of biology.
GenBankEMBL DDBJ
There are three major public DNA databases
The underlying raw DNA sequences are identical
Page 14
GenBankEMBL DDBJ
Housedat EBI
EuropeanBioinformatics
Institute
There are three major public DNA databases
Housed at NCBINational
Center forBiotechnology
Information
Housed in Japan
Page 14
There storage of next-generation sequence data is an emerging challenge
NCBI offers a sequence read archive (SRA), but the best storage strategies are uncertain. http://www.ncbi.nlm.nih.gov/Traces/sra/
The European Bioinformatics Institute (EBI) offers the European Nucleotide Archive (ENA), http://www.ebi.ac.uk/ena/
For individual labs, tens of terabytes of storage are routinely needed.
Time ofdevelopment
Body region, physiology, pharmacology, pathology
Page 5
Second perspective: the organism
After Pace NR (1997) Science 276:734 Page 6
Third perspective: the tree of life
Taxonomy at NCBI:>250,000 species are represented in GenBank
Page 16http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi11/12
The most sequenced organisms in GenBank
Homo sapiens 16.3 billion basesMus musculus 10.0bRattus norvegicus 6.5bBos taurus 5.4bZea mays 5.1bSus scrofa 4.9bDanio rerio 3.1bStrongylocentrotus purpurata 1.4bMacaca mulatta 1.3bOryza sativa (japonica) 1.3b
Updated Nov. 2012GenBank release 192.0Excluding WGS, organelles, metagenomics
Table 2-2Page 17
Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accession numbers, RefSeq, and Entrez Gene
Two genome browsers: UCSC and Ensembl
From UCSC Table Browser to Galaxy
National Center for Biotechnology Information (NCBI)www.ncbi.nlm.nih.gov
Fig. 2.4Page 24
• National Library of Medicine's search service• 22 million citations in MEDLINE (as of 2012)• links to participating online journals• PubMed tutorial on the site
http://www.ncbi.nlm.nih.gov/pubmed
or visit NLM:
http://www.nlm.nih.gov/bsd/disted/pubmed.html
Page 23
NCBI key features: PubMed
Be sure to access PubMed via Welch Library!
http://welch.jhmi.edu/
Also get to know your informationists—
Peggy Gross, Rob Wright et al.
Entrez integrates…
• the scientific literature; • DNA and protein sequence databases; • 3D protein structure data; • population study data sets; • assemblies of complete genomes
Page 24
NCBI key features: Entrez search and retrieval system
Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accession numbers, RefSeq, and Entrez Gene
Two genome browsers: UCSC and Ensembl
From UCSC Table Browser to Galaxy
Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or theraw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequenceor other record relevant to molecular data.
Page 26
What is an accession number?
An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for beta globin, HBB):
X02775 GenBank genomic DNA sequenceNG_000007.3 RefSeqGeners192792910 dbSNP (single nucleotide polymorphism)
AA970968.1 An expressed sequence tag (1 of 2,345)NM_000518.4 RefSeq DNA sequence (from a transcript)
NP_000509.1 RefSeq proteinCAA00182.1 GenBank proteinQ14473 SwissProt protein1YE0|B Protein Data Bank structure record
protein
DNA
RNA
Page 27
NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI)provides an expertly curated accession number thatcorresponds to the most stable, agreed-upon “reference”version of a sequence.
RefSeq identifiers include the following formats:
Complete genome NC_######Complete chromosome NC_######Genomic contig NT_######mRNA (DNA format) NM_###### e.g. NM_000518Protein NP_###### e.g. NP_000509
Page 27
Accession Molecule Method NoteAC_123456 Genomic Mixed Alternate complete genomicAP_123456 Protein Mixed Protein products; alternateNC_123456 Genomic Mixed Complete genomic moleculesNG_123456 Genomic Mixed Incomplete genomic regionsNM_123456 mRNA Mixed Transcript products; mRNA NM_123456789 mRNA Mixed Transcript products; 9-digit NP_123456 Protein Mixed Protein products; NP_123456789 Protein Curation Protein products; 9-digit NR_123456 RNA Mixed Non-coding transcripts NT_123456 Genomic Automated Genomic assembliesNW_123456 Genomic Automated Genomic assemblies NZ_ABCD12345678 Genomic Automated Whole genome shotgun dataXM_123456 mRNA Automated Transcript productsXP_123456 Protein Automated Protein productsXR_123456 RNA Automated Transcript productsYP_123456 Protein Auto. & Curated Protein productsZP_12345678 Protein Automated Protein products
NCBI’s RefSeq project: many accession number formats for genomic, mRNA, protein sequences
Access to sequences: Entrez Gene at NCBI
Entrez Gene is a great starting point: it collectskey information on each gene/protein from major databases. It covers all major organisms.
RefSeq provides a curated, optimal accession number for each DNA (NM_000518.4 for beta globin DNA corresponding to mRNA) or protein (NP_000509.1)
Page 29
From the NCBI homepage, type “beta globin”and hit “Search”
Fig. 2.5Page 28
Fig. 2.5Page 28
Follow the link to “Gene”
Entrez Gene is in the headerNote the “Official Symbol” HBB for beta globinNote the “limits” option
Page 30
Entrez Gene (top of page):Note a useful summary, and links to other databases
“Gene” page at NCBI offers a wealth of information
• Genomic context• Bibliography• Phenotypes• Gene Ontology (organizing principles of biological
process, molecular function, cellular component)• Reference sequences• Additional (non-RefSeq sequences)• Many, many links to NCBI resources (e.g. HomoloGene)• Many, many links to external resources
Page 29
Entrez Gene (bottom of page): non-RefSeq accessions(it’s unclear what these are, highlighting usefulness of RefSeq)
Fig. 2.8Page 31
Entrez Protein:accession, organism,literature…
Fig. 2.8Page 31
Entrez Protein:…features of a protein, and its sequence in the one-letter amino acid code
You should learn the one-letter amino acid code!
Name 3-Letter 1-LetterAlanine Ala AArginine Arg R
Asparagine Asn NAspartic acid Asp D
Cysteine Cys CGlutamic Acid Glu E
Glutamine Gln QGlycine Gly G
Histidine His HIsoleucine Ile I
Name 3-Letter 1-LetterLeucine Leu LLysine Lys K
Methionine Met MPhenylalanine Phe F
Proline Pro PSerine Ser S
Threonine Thr TTryptophan Trp W
Tyrosine Tyr YValine Val V
Page 31
Entrez Protein:You can change the display (as shown)…
FASTA format:versatile, compact with one header line
followed by a string of nucleotides or amino acids in the single letter code
Fig. 2.9Page 32
While FASTA is one file format, there are many others
FASTA Sequences in one letter DNA or protein codeFASTQ DNA sequences with quality scores for each baseBAM compressed binary version of SAMSAM Sequence Alignment/Map file (tab-delimited)VCF variant call format (genomic variants; indels)
(See genome.ucsc.edu/FAQ/FAQformat.html for the following:)
BED a table including chromosome, start, endWIG wiggle format (displays dense, continuous data)GFF General Feature Format (tab separated)
Also, besides Excel (.xls, .xlsx) spreadsheets can also be:.txt tab-delimited text file (or space delimited).csv comma separated text file
FASTQ format specification
The FASTQ format has four lines per read (and typically has millions of reads)
http://maq.sourceforge.net/fastq.shtml
Sequence read (like FASTA)
Quality scores (per base)
Sequencing run information
Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accession numbers, RefSeq, and Entrez Gene
Two genome browsers: UCSC and Ensembl
From UCSC Table Browser to Galaxy
Genome Browsers:increasingly important resources
Genomic DNA is organized in chromosomes. Genome browsers display ideograms (pictures) of chromosomes, with user-selected “annotation tracks” that display many kinds of information.
The two most essential human genome browsers are at Ensembl and UCSC. We will focus on UCSC (but the two are equally important). The browser at NCBI is less commonly used.
Ensembl genome browser (www.ensembl.org)
clickhuman
note BioMart
enterbeta globin
Ensembl output for beta globin includes views of chromosome 11 (top), the region (middle), and a detailed view (bottom).
There are various horizontal annotation tracks.
A quiz question about Ensembl/BioMart
For this week’s quiz/computer lab, one question asks you to use BioMart at Ensembl to find all the microRNAs on chromosome 11, as well as their GC content. There are step-by-step instructions and it should take no more than a couple minutes.
The UCSC Genome Browser:an increasingly important resource
• This browser’s focus is on humans and other eukaryotes
• you can select which tracks to display (and how much information for each track)
• tracks are based on data generated by the UCSC team and by the broad research community
• you can create “custom tracks” of your own data! Just format a spreadsheet properly and upload it
• The Table Browser is equally important as the more visual Genome Browser, and you can move between the two
[1] Visit http://genome.ucsc.edu/, click Genome Browser
[2] Choose organisms, enter query (beta globin), hit submit
Page 36
Note that there are choices of assemblies such as hg19
An assembly (or “build”) is a fixed version of a genome. Builds are released every several years. In practice, you should always be aware whether you are using hg18 or hg19 for the human genome. They are annotated with different types of information such as experimental data sets.
To learn more visit http://www.ncbi.nlm.nih.gov/assembly/basics/ http://www.ncbi.nlm.nih.gov/assembly/model/
[3] Choose the RefSeq beta globin gene (HBB)
[4] On the UCSC Genome Browser:--choose which tracks to display--add custom tracks--the Table Browser is complementary
Exploring the UCSC Genome Browser
• The human genome can be viewed with different “assemblies” (hg18, hg19). These contain different data sets.
• You can get information about a track by clicking its header (e.g. “RefSeq Genes”).
• You can choose the density to display each track (e.g. hide, dense, squish, pack).
Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accession numbers, RefSeq, and Entrez Gene
Two genome browsers: UCSC and Ensembl
From UCSC Table Browser to Galaxy
You can reach the Table Browser from the Genome Browser
Use the UCSC Table Browser to get tabular data related to any of the Genome Browser tracks. It is quantitative, not visual.
Get output Get a summary of the output
assemblyspecies
position (where you were on the Genome Browser)
check box to send a BED file to Galaxy
Step 2: click “Send query to Galaxy”
Toolspanel
Displaypanel
Historypanel
Step 3: In history panel, click eye to view main panel; click edit attributes to change name to snps
Step 3 (continued): Thus, there are ~890,000 SNPs on human chromosome 11
Step 4: To get coding exons, click “Get Data” (under Tools) then “UCSC Main table browser”
Step 4 (continued): To get coding exons, set the group to Genes and the track to RefSeq Genes, then “get output”
Step 4 (continued): Set the output to “Coding Exons” then send the query to Galaxy
Step 5: In Galaxy, a new data set appears. Rename it “coding exons” by clicking the pencil. Under Tools click “Operate on Genomic Intervals.”
Step 6: Click “Intersect the intervals of two datasets.” In the central panel, choose snps and coding exons. Execute.
Step 7. Here’s the answer. There are 11,652 SNPs in coding exons. Finally, click “display at UCSC main.”
Step 7. This returns us to the UCSC Genome Browser, where our Galaxy results are displayed as a custom track (entitled “User Supplied Track”).
Galaxy’s URL is http://usegalaxy.org. Its features include:
• Integration with UCSC and many other major genomics resources such as BioMart
• It is intuitive and does not require knowledge of computer programming, but it offers access to sophisticated software
• There are excellent videocasts and tutorials
• It fosters reproducibility. Your history and workflows are saved. See the “User” tab and sign up for a free account!
Summary: Galaxy
[1] NCBI is a central resource for bioinformatics.
[2] We described accession numbers and the RefSeq project, which offers a trusted version of a sequence. Use Entrez Gene as a key source of gene information.
[3] We used the UCSC Genome Browser to visualize information along chromosomes.
[4] We then used the UCSC Table Browser to obtain tabular outputs of queries. The underlying data are the same in the Genome and Table Browsers.
[5] Next we used Galaxy to solve a problem, and sent the final result back to the UCSC Genome Browser.
Summary: today’s lecture
Reminder: Please enroll! Google “moodle bioinformatics” to get here; click “Bioinformatics” to sign in;The enrollment key is…