genome annotations - schatzlabschatzlab.cshl.edu/teaching/2014/advseqcourse... · • ensembl...

73
Genome Annotations Michael Schatz Nov 15, 2014 Adv Sequencing Course

Upload: others

Post on 23-May-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Genome Annotations Michael Schatz Nov 15, 2014 Adv Sequencing Course

Page 2: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

������������� � ����

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

Page 3: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene!

������������� � ����

Page 4: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Outline

1.  Alignment to other genomes 2.  Prediction aka “Gene Finding” 3.  Experimental & Functional Assays 4.  Online Resources

Page 5: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Outline

1.  Alignment to other genomes 2.  Prediction aka “Gene Finding” 3.  Experimental & Functional Assays 4.  Online Resources

Page 6: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

•  Rapidly compare a sequence Q to a database to find all sequences in the database with an score above some cutoff S. –  Which protein is most similar to a newly sequenced one? –  Where does this sequence of DNA originate?

•  Speed achieved by using a procedure that typically finds �most� matches with scores > S. –  Tradeoff between sensitivity and specificity/speed

•  Sensitivity – ability to find all related sequences •  Specificity – ability to reject unrelated sequences

Basic Local Alignment Search Tool

(Altschul et al. 1990)

Page 7: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Seed and Extend FAKDFLAGGVAAAISKTAVAPIERVKLLLQVQHASKQITADKQYKGIIDCVVRIPKEQGV F D +GG AAA+SKTAVAPIERVKLLLQVQ ASK I DK+YKGI+D ++R+PKEQGV FLIDLASGGTAAAVSKTAVAPIERVKLLLQVQDASKAIAVDKRYKGIMDVLIRVPKEQGV

•  Homologous sequences are likely to contain a short high

scoring word pair, a seed. –  Smaller seed sizes make the sense more sensitive, but also (much)

slower –  Typically do a fast search for prototypes, but then most sensitive for

final result

•  BLAST then tries to extend high scoring word pairs to compute high scoring segment pairs (HSPs). –  Significance of the alignment reported via an e-value

Page 8: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

BLAST E-values

E-value = the number of HSPs having alignment score S (or higher) expected to occur by chance.

! Smaller E-value, more significant in statistics ! Bigger E-value, less significant ! Over 1 means expect this totally by chance

(not significant at all!) The expected number of HSPs with the score at least S is :

E = K*n*m*e-�S K, � are constant depending on model

n, m are the length of query and sequence E-values quickly drop off for better alignment bits scores

Page 9: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Very Similar Sequences

Query: HBA_HUMAN Hemoglobin alpha subunit Sbjct: HBB_HUMAN Hemoglobin beta subunit Score = 114 bits (285), Expect = 1e-26 Identities = 61/145 (42%), Positives = 86/145 (59%), Gaps = 8/145 (5%) Query 2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQV 55 L+P +K+ V A WGKV + E G EAL R+ + +P T+ +F F D G+ +V Sbjct 3 LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 60 Query 56 KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA 115 K HGKKV A ++ +AH+D++ + LS+LH KL VDP NF+LL + L+ LA H Sbjct 61 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 120 Query 116 EFTPAVHASLDKFLASVSTVLTSKY 140 EFTP V A+ K +A V+ L KY Sbjct 121 EFTPPVQAAYQKVVAGVANALAHKY 145

Page 10: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Quite Similar Sequences

Query: HBA_HUMAN Hemoglobin alpha subunit Sbjct: MYG_HUMAN Myoglobin Score = 51.2 bits (121), Expect = 1e-07, Identities = 38/146 (26%), Positives = 58/146 (39%), Gaps = 6/146 (4%) Query 2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQV 55 LS + V WGKV A +G E L R+F P T F F D S + Sbjct 3 LSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDL 62 Query 56 KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA 115 K HG V AL + + L+ HA K ++ + +S C++ L + P Sbjct 63 KKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG 122 Query 116 EFTPAVHASLDKFLASVSTVLTSKYR 141 +F +++K L + S Y+ Sbjct 123 DFGADAQGAMNKALELFRKDMASNYK 148

Page 11: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Not similar sequences

Query: HBA_HUMAN Hemoglobin alpha subunit Sbjct: SPAC869.02c [Schizosaccharomyces pombe] Score = 33.1 bits (74), Expect = 0.24 Identities = 27/95 (28%), Positives = 50/95 (52%), Gaps = 10/95 (10%) Query 30 ERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAH 89 ++M ++P P+F+ +H + + +A AL N ++DD+ +LSA D Sbjct 59 QKMLGNYPEV---LPYFNKAHQISL--SQPRILAFALLNYAKNIDDL-TSLSAFMDQIVV 112 Query 90 K---LRVDPVNFKLLSHCLLVTLAAHLPAEF-TPA 120 K L++ ++ ++ HCLL T+ LP++ TPA Sbjct 113 KHVGLQIKAEHYPIVGHCLLSTMQELLPSDVATPA 147

Page 12: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Blast Versions

Program! Database! Query!

BLASTN" Nucleotide" Nucleotide"BLASTP" Protein" Protein"

BLASTX" Protein" Nucleotide translated into protein"

TBLASTN" Nucleotide translated into protein" Protein"

TBLASTX" Nucleotide translated into protein"

Nucleotide translated into protein"

Page 13: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

NCBI Blast •  Nucleotide Databases

–  nr: All Genbank –  refseq: Reference

organisms –  wgs: All reads

•  Protein Databases –  nr: All non-redundant

sequences –  Refseq: Reference

proteins

Page 14: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Genomic Coordinates

What are coordinates of “TAC” !in GATTACA?!

GATTACA!1234567!

1-based coordinates •  Base 4 through 6: [4,6] “closed” •  Base 4 through 7: [4,7) “half-open” •  3 bases starting at base 4: [4, +3]

0-based coordinates •  Position 3 through 5: [3,5] “closed” •  Position 3 through 6: [3,6) “half-open” •  3 bases starting at position 3: [3, +3]

GATTACA!0123456!

Page 15: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Genomic Conventions

GATTACA!1234567!

1-based coordinates •  BLAST/MUMmer alignments •  Ensembl Genome Browser •  SAM, VCF, GFF and Wiggle

0-based coordinates •  BAM, BCFv2, BED, and PSL •  UCSC Genome Browser •  C/C++, Perl, Python, Java

GATTACA!0123456!

Always double check the manual!!You will get this wrong someday "!

Page 16: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Outline

1.  Alignment to other genomes 2.  Prediction aka “Gene Finding” 3.  Experimental & Functional Assays 4.  Online Resources

Page 17: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Bacterial Gene Finding and Glimmer (also Archaeal and viral gene finding)

Arthur L. Delcher and Steven Salzberg Center for Bioinformatics and Computational Biology

Johns Hopkins University School of Medicine

Page 18: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Step One

•  Find open reading frames (ORFs).

…TAGATGAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…

Stop codon

Stop codon

Start codon

Page 19: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Step One

•  Find open reading frames (ORFs).

•  But ORFs generally overlap …

…TAGATGAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…

Stop codon

Stop codon

…ATCTACTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT…

Shifted Stop

Stop codon

Reverse strand

Page 20: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Campylobacter jejuni RM1221 30.3%GC

All ORFs longer than 100bp on both strands shown - color indicates reading frame

Longest ORFs likely to be protein-coding genes Note the low GC content All genes are ORFs but not all ORFs are genes

Page 21: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Campylobacter jejuni RM1221 30.3%GC

Campylobacter jejuni RM1221 30.3%GC

Page 22: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Mycobacterium smegmatis MC2 67.4%GC

Note what happens in a high-GC genome

Page 23: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Mycobacterium smegmatis MC2 67.4%GC

Mycobacterium smegmatis MC2 67.4%GC

Page 24: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Probabilistic Methods •  Create models that have a probability of

generating any given sequence. –  Evaluate gene/non-genome models against a sequence

•  Train the models using examples of the types of sequences to generate. – Use RNA sequencing, homology, or “obvious” genes

•  The �score� of an orf is the probability of the model generating it. – Most basic technique is to count how kmers occur in

known genes versus intergenic sequences – More sophisticated methods consider variable length

contexts, “wobble” bases, other statistical clues

Page 25: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Overview of Eukaryotic Gene Prediction

CBB 231 / COMPSCI 261

W.H. Majoros

Page 26: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

ATG TGA

coding segment complete mRNA

ATG GT AG GT AG . . . . . . . . . start codon stop codon donor site donor site acceptor

site acceptor

site

exon exon exon intron intron

TGA

Eukaryotic Gene Syntax

Regions of the gene outside of the CDS are called UTR�s (untranslated regions), and are mostly ignored by gene finders, though they are important for regulatory functions.

Page 27: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Representing Gene Syntax with ORF Graphs

After identifying the most promising (i.e., highest-scoring) signals in an input sequence, we can apply the gene syntax rules to connect these into an ORF graph:

An ORF graph represents all possible gene parses (and their scores) for a given set of putative signals. A path through the graph represents a single gene parse.

Page 28: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Conceptual Gene-finding Framework TATTCCGATCGATCGATCTCTCTAGCGTCTACGCTATCATCGCTCTCTATTATCGCGCGATCGTCGATCGCGCGAGAGTATGCTACGTCGATCGAATTG

identify most promising signals, score signals and content regions between them; induce an ORF graph on the signals

find highest-scoring path through ORF graph; interpret path as a gene parse = gene structure

Page 29: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Gene Finding Overview •  Prokaryotic gene finding distinguishes real genes and

random ORFs –  Prokaryotic genes have simple structure and are largely

homogenous, making it relatively easy to recognize their sequence composition

•  Eukaryotic gene finding identifies the genome-wide most probable gene models (set of exons) –  “Probabilistic Graphical Model” to enforce overall gene

structure, separate models to score splicing/transcription signals

–  Accuracy depends to a large extent on the quality of the training data

Page 30: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Gene Models

•  “Generic Feature Format” (GFF) records genomic features –  Coordinates of each exon –  Coordinates of UTRs –  Link together exons into transcripts –  Link together transcripts into gene models

http://www.sequenceontology.org/gff3.shtml

Page 31: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

GFF File format GFF3 files are nine-column, tab-delimited, plain text files 1. seqid: The ID of the sequence 2. source: Algorithm or database that generated this feature 3. type: gene/exon/CDS/etc… 4. start: 1-based coordinate 5. end: 1-based coordinate 6. score: E-values/p-values/index/colors/… 7. strand: “+’ for positive “-” for minus, “.” not stranded 8. phase: For "CDS", where the feature begins with reference

to the reading frame (0,1,2) 9. attributes: A list of tag=value features

Parent: Indicates the parent of the feature (group exons into transcripts, transcripts into genes, …)

Page 32: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

GFF Example

##gff-version 3!##sequence-region ctg123 1 1497228 !ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN!!ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001!!ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1!ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001;Name=EDEN.2!ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001;Name=EDEN.3!!ctg123 . exon 1300 1500 . + . ID=exon00001;Parent=mRNA00003!ctg123 . exon 1050 1500 . + . ID=exon00002;Parent=mRNA00001,mRNA00002!ctg123 . exon 3000 3902 . + . ID=exon00003;Parent=mRNA00001,mRNA00003!ctg123 . exon 5000 5500 . + . ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003!ctg123 . exon 7000 9000 . + . ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003!!ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1!ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1!ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1!ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1!!ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2!ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2!ctg123 . CDS ! 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2!!ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3!ctg123 . CDS ! 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3!ctg123 . CDS ! 7000 7600 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3!!ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4!ctg123 . CDS ! 5000 5500 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4!ctg123 . CDS ! 7000 7600 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4!

Gene “EDEN” with 3 alternatively spliced transcripts, isoform 3 has two alternative translation start sites!

Page 33: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Break

Page 34: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Outline

1.  Alignment to other genomes 2.  Prediction aka “Gene Finding” 3.  Experimental & Functional Assays 4.  Online Resources

Page 35: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Sequencing techniques

Much of the capacity is used to sequence genomes (or exomes) of individuals…

… but biology is much more than just genomes…

Page 36: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

�10Soon et al., Molecular Systems Biology, 2013

Page 37: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

The *Seq List (in chronological order)

1.  Gregory E. Crawford et al., “Genome-wide Mapping of DNase Hypersensitive Sites Using Massively Parallel Signature Sequencing (MPSS),” Genome Research 16, no. 1 (January 1, 2006): 123–131, doi:10.1101/gr.4074106.

2.  David S. Johnson et al., “Genome-Wide Mapping of in Vivo Protein-DNA Interactions,” Science 316, no. 5830 (June 8, 2007): 1497–1502, doi:10.1126/science.1141319.

3.  Tarjei S. Mikkelsen et al., “Genome-wide Maps of Chromatin State in Pluripotent and Lineage-committed Cells,” Nature 448, no. 7153 (August 2, 2007): 553–560, doi:10.1038/nature06008.

4.  Thomas A. Down et al., “A Bayesian Deconvolution Strategy for Immunoprecipitation-based DNA Methylome Analysis,” Nature Biotechnology 26, no. 7 (July 2008): 779–785, doi:10.1038/nbt1414.

5.  Ali Mortazavi et al., “Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq,” Nature Methods 5, no. 7 (July 2008): 621–628, doi:10.1038/nmeth.1226.

6.  Nathan A. Baird et al., “Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD Markers,” PLoS ONE 3, no. 10 (October 13, 2008): e3376, doi:10.1371/journal.pone.0003376.

7.  Leighton J. Core, Joshua J. Waterfall, and John T. Lis, “Nascent RNA Sequencing Reveals Widespread Pausing and Divergent Initiation at Human Promoters,” Science 322, no. 5909 (December 19, 2008): 1845–1848, doi:10.1126/science.1162228.

8.  Chao Xie and Martti T. Tammi, “CNV-seq, a New Method to Detect Copy Number Variation Using High-throughput Sequencing,” BMC Bioinformatics 10, no. 1 (March 6, 2009): 80, doi:10.1186/1471-2105-10-80.

9.  Jay R. Hesselberth et al., “Global Mapping of protein-DNA Interactions in Vivo by Digital Genomic Footprinting,” Nature Methods 6, no. 4 (April 2009): 283–289, doi:10.1038/nmeth.1313.

10. Nicholas T. Ingolia et al., “Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling,” Science 324, no. 5924 (April 10, 2009): 218–223, doi:10.1126/science.1168978.

11. Alayne L. Brunner et al., “Distinct DNA Methylation Patterns Characterize Differentiated Human Embryonic Stem Cells and Developing Human Fetal Liver,” Genome Research 19, no. 6 (June 1, 2009): 1044–1056, doi:10.1101/gr.088773.108.

12. Mayumi Oda et al., “High-resolution Genome-wide Cytosine Methylation Profiling with Simultaneous Copy Number Analysis and Optimization for Limited Cell Numbers,” Nucleic Acids Research 37, no. 12 (July 1, 2009): 3829–3839, doi:10.1093/nar/gkp260.

13. Zachary D. Smith et al., “High-throughput Bisulfite Sequencing in Mammalian Genomes,” Methods 48, no. 3 (July 2009): 226–232, doi:10.1016/j.ymeth.2009.05.003.

14. Andrew M. Smith et al., “Quantitative Phenotyping via Deep Barcode Sequencing,” Genome Research (July 21, 2009), doi:10.1101/gr.093955.109.

15. Raymond K. Auerbach et al., “Mapping Accessible Chromatin Regions Using Sono-Seq,” Proceedings of the National Academy of Sciences 106, no. 35 (September 1, 2009): 14926–14931, doi:10.1073/pnas.0905443106.

16. Andrew L. Goodman et al., “Identifying Genetic Determinants Needed to Establish a Human Gut Symbiont in Its Habitat,” Cell Host & Microbe 6, no. 3 (September 17, 2009): 279–289, doi:10.1016/j.chom.2009.08.003.

17. Tim van Opijnen, Kip L. Bodi, and Andrew Camilli, “Tn-seq; High-throughput Parallel Sequencing for Fitness and Genetic Interaction Studies in Microorganisms,” Nature Methods 6, no. 10 (October 2009): 767–772, doi:10.1038/nmeth.1377.

18. Erez Lieberman-Aiden et al., “Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome,” Science 326, no. 5950 (October 9, 2009): 289–293, doi:10.1126/science.1181369.

19. Gemma C. Langridge et al., “Simultaneous Assay of Every Salmonella Typhi Gene Using One Million Transposon Mutants,” Genome Research (October 13, 2009), doi:10.1101/gr.097097.109.

20. Melissa J. Fullwood et al., “An Oestrogen-receptor-�-bound Human Chromatin Interactome,” Nature 462, no. 7269 (November 5, 2009): 58–64, doi:10.1038/nature08497.

21. Ryan Lister et al., “Human DNA Methylomes at Base Resolution Show Widespread Epigenomic Differences,” Nature 462, no. 7271 (November 19, 2009): 315–322, doi:10.1038/nature08514.

22. Yuanchao Xue et al., “Genome-wide Analysis of PTB-RNA Interactions Reveals a Strategy Used by the General Splicing Repressor to Modulate Exon Inclusion or Skipping,” Molecular Cell 36, no. 6 (December 24, 2009): 996–1006, doi:10.1016/j.molcel.2009.12.003.

23. Andreas Ernst et al., “Coevolution of PDZ Domain–ligand Interactions Analyzed by High-throughput Phage Display and Deep Sequencing,” Molecular BioSystems 6, no. 10 (2010): 1782, doi:10.1039/c0mb00061b.

24. R. Scott Hansen et al., “Sequencing Newly Replicated DNA Reveals Widespread Plasticity in Human Replication Timing,” Proceedings of the National Academy of Sciences 107, no. 1 (January 5, 2010): 139–144, doi:10.1073/pnas.0912402107.

25. Andrew H. Beck et al., “3-End Sequencing for Expression Quantification (3SEQ) from Archival Tumor Samples,” PLoS ONE 5, no. 1 (January 19, 2010): e8768, doi:10.1371/journal.pone.0008768.

26. Lira Mamanova et al., “FRT-seq: Amplification-free, Strand-specific Transcriptome Sequencing,” Nature Methods 7, no. 2 (February 2010): 130–132, doi:10.1038/nmeth.1417.

27. Markus Hafner et al., “Transcriptome-wide Identification of RNA-Binding Protein and MicroRNA Target Sites by PAR-CLIP,” Cell 141, no. 1 (April 2, 2010): 129–141, doi:10.1016/j.cell.2010.03.009.

28.  Iman Hajirasouliha et al., “Detection and Characterization of Novel Sequence Insertions Using Paired-end Next-generation Sequencing,” Bioinformatics 26, no. 10 (May 15, 2010): 1277–1283, doi:10.1093/bioinformatics/btq152.

29. Oh Kyu Yoon and Rachel B. Brem, “Noncanonical Transcript Forms in Yeast and Their Regulation During Environmental Stress,” RNA 16, no. 6 (June 1, 2010): 1256–1267, doi:10.1261/rna.2038810.

30.  Julian König et al., “iCLIP Reveals the Function of hnRNP Particles in Splicing at Individual Nucleotide Resolution,” Nature Structural & Molecular Biology 17, no. 7 (July 2010): 909–915, doi:10.1038/nsmb.1838.

31. Michael Kertesz et al., “Genome-wide Measurement of RNA Secondary Structure in Yeast,” Nature 467, no. 7311 (September 2, 2010): 103–107, doi:10.1038/nature09322.

32. Qi Zheng et al., “Genome-Wide Double-Stranded RNA Sequencing Reveals the Functional Significance of Base-Paired RNAs in Arabidopsis,” PLoS Genet 6, no. 9 (September 30, 2010): e1001141, doi:10.1371/journal.pgen.1001141.

33.  Jason G. Underwood et al., “FragSeq: Transcriptome-wide RNA Structure Probing Using High-throughput Sequencing,” Nature Methods 7, no. 12 (December 2010): 995–1001, doi:10.1038/nmeth.1529.

34.  Jing Zhao et al., “Genome-wide Identification of Polycomb-Associated RNAs by RIP-seq,” Molecular Cell 40, no. 6 (December 22, 2010): 939–953, doi:10.1016/j.molcel.2010.12.011.

35. Calvin H. Jan et al., “Formation, Regulation and Evolution of Caenorhabditis Elegans 3UTRs,” Nature 469, no. 7328 (January 6, 2011): 97–101, doi:10.1038/nature09616.

36. L. Stirling Churchman and Jonathan S. Weissman, “Nascent Transcript Sequencing Visualizes Transcription at Nucleotide Resolution,” Nature 469, no. 7330 (January 20, 2011): 368–373, doi:10.1038/nature09652.

37. Peter J. Shepard et al., “Complex and Dynamic Landscape of RNA Polyadenylation Revealed by PAS-Seq,” RNA 17, no. 4 (April 1, 2011): 761–772, doi:10.1261/rna.2581711.

38.  (a) Sharon Aviran et al., “Modeling and Automation of Sequencing-based Characterization of RNA Structure,” Proceedings of the National Academy of Sciences (June 3, 2011), doi:10.1073/pnas.1106541108.

39.  (b) Julius B. Lucks et al., “Multiplexed RNA Structure Characterization with Selective 2-hydroxyl Acylation Analyzed by Primer Extension Sequencing (SHAPE-Seq),” Proceedings of the National Academy of Sciences 108, no. 27 (July 5, 2011): 11063–11068, doi:10.1073/pnas.1106501108.

40. Anton Valouev et al., “Determinants of Nucleosome Organization in Primary Human Cells,” Nature 474, no. 7352 (June 23, 2011): 516–520, doi:10.1038/nature10002.

41. Razvan Nutiu et al., “Direct Measurement of DNA Affinity Landscapes on a High-throughput Sequencing Instrument,” Nature Biotechnology 29, no. 7 (July 2011): 659–664, doi:10.1038/nbt.1882.

42. Hironori Waki et al., “Global Mapping of Cell Type–Specific Open Chromatin by FAIRE-seq Reveals the Regulatory Role of the NFI Family in Adipocyte Differentiation,” PLoS Genet 7, no. 10 (October 20, 2011): e1002311, doi:10.1371/journal.pgen.1002311.

43. Harlan S. Robins et al., “Comprehensive Assessment of T-cell Receptor �-chain Diversity in �� T Cells,” Blood 114, no. 19 (November 5, 2009): 4099–4107, doi:10.1182/blood-2009-04-217604.

44. Ci Chu et al., “Genomic Maps of Long Noncoding RNA Occupancy Reveal Principles of RNA-Chromatin Interactions,” Molecular Cell 44, no. 4 (November 18, 2011): 667–678, doi:10.1016/j.molcel.2011.08.027.

45. Ho Sung Rhee and B. Franklin Pugh, “Comprehensive Genome-wide Protein-DNA Interactions Detected at Single-Nucleotide Resolution,” Cell 147, no. 6 (December 9, 2011): 1408–1419, doi:10.1016/j.cell.2011.11.013.\

46. Cofactor Binding Evokes Latent Differences in DNA Binding Specificity Between Hox ProteinsMatthew Slattery et al., “,” Cell 147, no. 6 (December 9, 2011): 1270–1282, doi:10.1016/j.cell.2011.10.053.

47. Matthew D. Simon et al., “The Genomic Binding Sites of a Noncoding RNA,” Proceedings of the National Academy of Sciences 108, no. 51 (December 20, 2011): 20497–20502, doi:10.1073/pnas.1113536108.

48. Tim R. Mercer et al., “Targeted RNA Sequencing Reveals the Deep Complexity of the Human Transcriptome,” Nature Biotechnology 30, no. 1 (January 2012): 99–104, doi:10.1038/nbt.2024.

49. Robert Kofler, Andrea J. Betancourt, and Christian Schlötterer, “Sequencing of Pooled DNA Samples (Pool-Seq) Uncovers Complex Dynamics of Transposable Element Insertions in Drosophila Melanogaster,” PLoS Genet 8, no. 1 (January 26, 2012): e1002487, doi:10.1371/journal.pgen.1002487.

50. Hazuki Takahashi et al., “5 End-centered Expression Profiling Using Cap-analysis Gene Expression and Next-generation Sequencing,” Nature Protocols 7, no. 3 (March 2012): 542–561, doi:10.1038/nprot.2012.005.

51. Michael J. Guertin et al., “Accurate Prediction of Inducible Transcription Factor Binding Intensities In Vivo,” PLoS Genet 8, no. 3 (March 29, 2012): e1002610, doi:10.1371/journal.pgen.1002610.

52. Michael J. Booth et al., “Quantitative Sequencing of 5-Methylcytosine and 5-Hydroxymethylcytosine at Single-Base Resolution,” Science 336, no. 6083 (May 18, 2012): 934–937, doi:10.1126/science.1220671.

53. Tim Forshew et al., “Noninvasive Identification and Monitoring of Cancer Mutations by Targeted Deep Sequencing of Plasma DNA,” Science Translational Medicine 4, no. 136 (May 30, 2012): 136ra68, doi:10.1126/scitranslmed.3003726.

54. Aaron L. Statham et al., “Bisulfite Sequencing of Chromatin Immunoprecipitated DNA (BisChIP-seq) Directly Informs Methylation Status of Histone-modified DNA,” Genome Research 22, no. 6 (June 1, 2012): 1120–1127, doi:10.1101/gr.132076.111.

55. Miao Yu et al., “Base-Resolution Analysis of 5-Hydroxymethylcytosine in the Mammalian Genome,” Cell 149, no. 6 (June 8, 2012): 1368–1380, doi:10.1016/j.cell.2012.04.027.

56. Andrzej Drukier et al., “New Dark Matter Detectors Using DNA for Nanometer Tracking,” arXiv:1206.6809 (June 28, 2012), http://arxiv.org/abs/1206.6809.

57.  Joseph Rodriguez, Jerome S. Menet, and Michael Rosbash, “Nascent-Seq Indicates Widespread Cotranscriptional RNA Editing in Drosophila,” Molecular Cell 47, no. 1 (July 13, 2012): 27–37, doi:10.1016/j.molcel.2012.05.002.

58.  Schmitt, Michael W., Scott R. Kennedy, Jesse J. Salk, Edward J. Fox, Joseph B. Hiatt, and Lawrence A. Loeb. “Detection of Ultra-rare Mutations by Next-generation Sequencing.” Proceedings of the National Academy of Sciences 109, no. 36 (September 4, 2012): 14508–14513. doi:10.1073/pnas.1208715109.

59. Tamar Hashimshony et al., “CEL-Seq: Single-Cell RNA-Seq by Multiplexed Linear Amplification,” Cell Reports 2, no. 3 (September 27, 2012): 666–673, doi:10.1016/j.celrep.2012.08.003.

60. Anthony M. Zador et al., “Sequencing the Connectome,” PLoS Biol 10, no. 10 (October 23, 2012): e1001411, doi:10.1371/journal.pbio.1001411.

61. Yue Wan et al., “Genome-wide Measurement of RNA Folding Energies,” Molecular Cell 48, no. 2 (October 26, 2012): 169–181, doi:10.1016/j.molcel.2012.08.008.

62. Lon M. Chubiz et al., “FREQ-Seq: A Rapid, Cost-Effective, Sequencing-Based Method to Determine Allele Frequencies Directly from Mixed Populations,” PLoS ONE 7, no. 10 (October 31, 2012): e47959, doi:10.1371/journal.pone.0047959.

63.  Jamie C. Kwasnieski et al., “Complex Effects of Nucleotide Variants in a Mammalian Cis-regulatory Element,” Proceedings of the National Academy of Sciences 109, no. 47 (November 20, 2012): 19498–19503, doi:10.1073/pnas.1210678109.

64. Theresa K. Kelly et al., “Genome-wide Mapping of Nucleosome Positioning and DNA Methylation Within Individual DNA Molecules,” Genome Research 22, no. 12 (December 1, 2012): 2497–2506, doi:10.1101/gr.143008.112.

65. Hojoong Kwak et al., “Precise Maps of RNA Polymerase Reveal How Promoters Direct Initiation and Pausing,” Science 339, no. 6122 (February 22, 2013): 950–953, doi:10.1126/science.1229386.

66. Liang Zhang et al., “Tet-mediated Covalent Labelling of 5-methylcytosine for Its Genome-wide Detection and Sequencing,” Nature Communications 4 (February 26, 2013): 1517, doi:10.1038/ncomms2527.

67. Cosmas D. Arnold et al., “Genome-Wide Quantitative Enhancer Activity Maps Identified by STARR-seq,” Science 339, no. 6123 (March 1, 2013): 1074–1077, doi:10.1126/science.1232542.

68.  Ivan Liachko et al., “High-resolution Mapping, Characterization, and Optimization of Autonomously Replicating Sequences in Yeast,” Genome Research 23, no. 4 (April 1, 2013): 698–704, doi:10.1101/gr.144659.112.

69. Aleksandra Helwak et al., “Mapping the Human miRNA Interactome by CLASH Reveals Frequent Noncanonical Binding,” Cell 153, no. 3 (April 2013): 654–665, doi:10.1016/j.cell.2013.03.043.

70. Yohei Sasagawa et al., “Quartz-Seq: a Highly Reproducible and Sensitive Single-cell RNA-Seq Reveals Non-genetic Gene Expression Heterogeneity,” Genome Biology 14, no. 4 (April 17, 2013): R31, doi:10.1186/gb-2013-14-4-r31.

71. Chun-Xiao Song et al., “Genome-wide Profiling of 5-Formylcytosine Reveals Its Roles in Epigenetic Priming,” Cell 153, no. 3 (April 25, 2013): 678–691, doi:10.1016/j.cell.2013.04.001.

72. Vicent Pelechano, Wu Wei, and Lars M. Steinmetz, “Extensive Transcriptional Heterogeneity Revealed by Isoform Profiling,” Nature 497, no. 7447 (May 2, 2013): 127–131, doi:10.1038/nature12121.

73. Daniel Arango et al., “Molecular Basis for the Action of a Dietary Flavonoid Revealed by the Comprehensive Identification of Apigenin Human Targets,” Proceedings of the National Academy of Sciences 110, no. 24 (June 11, 2013): E2153–E2162, doi:10.1073/pnas.1303726110.

74. Timothy Sterne-Weiler et al., “Frac-seq Reveals Isoform-specific Recruitment to Polyribosomes,” Genome Research (June 19, 2013), doi:10.1101/gr.148585.112.

75. Xingyu Lu et al., “Chemical Modification-Assisted Bisulfite Sequencing (CAB-Seq) for 5-Carboxylcytosine Detection in DNA,” Journal of the American Chemical Society 135, no. 25 (June 26, 2013): 9315–9317, doi:10.1021/ja4044856.

76. Larry D. Mesner et al., “Bubble-seq Analysis of the Human Genome Reveals Distinct Chromatin-mediated Mechanisms for Regulating Early- and Late-firing Origins,” Genome Research (July 16, 2013), doi:10.1101/gr.155218.113.

77. Vollmers, Christopher, Rene V. Sit, Joshua A. Weinstein, Cornelia L. Dekker, and Stephen R. Quake. “Genetic Measurement of Memory B-cell Recall Using Antibody Repertoire Sequencing” Proceedings of the National Academy of Sciences 110, no. 33 (August 13, 2013): 13463–13468. doi:10.1073/pnas.1312146110.

78.  Jesse M. Engreitz et al., “The Xist lncRNA Exploits Three-Dimensional Genome Architecture to Spread Across the X Chromosome,” Science 341, no. 6147 (August 16, 2013): 1237973, doi:10.1126/science.1237973.

79. Robin P. Smith et al., “Massively Parallel Decoding of Mammalian Regulatory Sequences Supports a Flexible Organizational Model,” Nature Genetics 45, no. 9 (September 2013): 1021–1028, doi:10.1038/ng.2713.

80. Carolin A. Müller et al., “The Dynamics of Genome Replication Using Deep Sequencing,” Nucleic Acids Research (October 1, 2013): gkt878, doi:10.1093/nar/gkt878.

81.  Jason D. Buenrostro et al., “Transposition of Native Chromatin for Fast and Sensitive Epigenomic Profiling of Open Chromatin, DNA-binding Proteins and Nucleosome Position,” Nature Methods advance online publication (October 6, 2013), doi:10.1038/nmeth.2688.

82.  Florian Jupe et al., “Resistance Gene Enrichment Sequencing (RenSeq) Enables Reannotation of the NB-LRR Gene Family from Sequenced Plant Genomes and Rapid Mapping of Resistance Loci in Segregating Populations,” The Plant Journal 76, no. 3 (2013): 530–544, doi:10.1111/tpj.12307.

83. Ulf-Peter Guenther et al., “Hidden Specificity in an Apparently Nonspecific RNA-binding Protein,” Nature 502, no. 7471 (October 17, 2013): 385–388, doi:10.1038/nature12543.

84. Busse, Christian E., Irina Czogiel, Peter Braun, Peter F. Arndt, and Hedda Wardemann. “Single-cell Based High-throughput Sequencing of Full-length Immunoglobulin Heavy and Light Chain Genes.” European Journal of Immunology (2013): n/a–n/a. doi:10.1002/eji.201343917.

85. Donald R. McCarty et al., “Mu-seq: Sequence-Based Mapping and Identification of Transposon Induced Mutations,” PLoS ONE 8, no. 10 (October 23, 2013): e77172, doi:10.1371/journal.pone.0077172.

86. Artur Veloso et al., “Genome-Wide Transcriptional Effects of the Anti-Cancer Agent Camptothecin,” PLoS ONE 8, no. 10 (October 23, 2013): e78190, doi:10.1371/journal.pone.0078190.

87.  Steve Lianoglou et al., “Ubiquitously Transcribed Genes Use Alternative Polyadenylation to Achieve Tissue-specific Expression,” Genes & Development 27, no. 21 (November 1, 2013): 2380–2396, doi:10.1101/gad.229328.113.

88.  Ji Wan and Shu-Bing Qian, “TISdb: a Database for Alternative Translation Initiation in Mammalian Cells,” Nucleic Acids Research (November 6, 2013), doi:10.1093/nar/gkt1085.

89.  Sarah Aldridge et al., “AHT-ChIP-seq: a Completely Automated Robotic Protocol for High-throughput Chromatin Immunoprecipitation,” Genome Biology 14, no. 11 (November 7, 2013): R124, doi:10.1186/gb-2013-14-11-r124.

90. Helena M. B. Seth-Smith et al., “Generating Whole Bacterial Genome Sequences of Low-abundance Species from Complex Samples with IMS-MDA,” Nature Protocols 8, no. 12 (December 2013): 2404–2412, doi:10.1038/nprot.2013.147.

91.  Ikjin Kim et al., “High-throughput Analysis of in Vivo Protein Stability,” Molecular & Cellular Proteomics: MCP 12, no. 11 (November 2013): 3370–3378, doi:10.1074/mcp.O113.031708.

92. Yiliang Ding et al., “In Vivo Genome-wide Profiling of RNA Secondary Structure Reveals Novel Regulatory Features,” Nature advance online publication (November 24, 2013), doi:10.1038/nature12756.

93.  Silvi Rouskin et al., “Genome-wide Probing of RNA Structure Reveals Active Unfolding of mRNA Structures in Vivo,” Nature advance online publication (December 15, 2013), doi:10.1038/nature12894.

94. Lars Anders et al., “Genome-wide Localization of Small Molecules,” Nature Biotechnology 32, no. 1 (January 2014): 92–96, doi:10.1038/nbt.2776.

95.  Simone Picelli et al., “Full-length RNA-seq from Single Cells Using Smart-seq2,” Nature Protocols 9, no. 1 (January 2014): 171–181, doi:10.1038/nprot.2014.006.

96. Alexander O. Subtelny et al., “Poly(A)-tail Profiling Reveals an Embryonic Switch in Translational Control,” Nature advance online publication (January 29, 2014), doi:10.1038/nature13007.

97. Ashley Acevedo, Leonid Brodsky, and Raul Andino, “Mutational and Fitness Landscapes of an RNA Virus Revealed through Population Sequencing,” Nature 505, no. 7485 (January 30, 2014): 686–690, doi:10.1038/nature12861.

98. Ditte Lovatt et al., “Transcriptome in Vivo Analysis (TIVA) of Spatially Defined Single Cells in Live Tissue,” Nature Methods 11, no. 2 (February 2014): 190–196, doi:10.1038/nmeth.2804.

99.  Jim R. Hughes et al., “Analysis of Hundreds of Cis-regulatory Landscapes at High Resolution in a Single, High-throughput Experiment,” Nature Genetics 46, no. 2 (February 2014): 205–212, doi:10.1038/ng.2871.

http://liorpachter.wordpress.com/seq/

Sequencing Assays

Page 38: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

What is a *Seq assay?

�7

Desired measurement

reduce to sequencing Sequence

Solve inverse problem

Chemistry & Physics

Analyze

Creativity

Mathematics &Computer Science

Statistics

ComputationalBiology

MolecularBiology

Biology

Page 39: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

*-seq in 3 short vignettes

RNA-seq

Methyl-seq ChIP-seq

Page 40: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

RNA-seq

Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Sørlie et al (2001) PNAS. 98(19):10869-74.

Page 41: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

RNA-seq Overview

Sequencing

Mapping & Assembly

Quantification

Page 42: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

RNA-seq Overview

Page 43: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

RNA-seq Challenges!

Challenge 1: Eukaryotic genes are spliced!Solution: Use a spliced aligner, and assemble isoforms!!TopHat: discovering spliced junctions with RNA-Seq. !Trapnell et al (2009) Bioinformatics. 25:0 1105-1111!

Challenge 2: Read Count != Transcript abundance!Solution: Infer underlying abundances (e.g. FPKM)!!Transcript assembly and quantification by RNA-seq!Trapnell et al (2010) Nat. Biotech. 25(5): 511-515!

Challenge 3: Transcript abundances are stochastic!Solution: Replicates, replicates, and more replicates!!RNA-seq differential expression studies: more sequence or more replication?!Liu et al (2013) Bioinformatics. doi:10.1093/bioinformatics/btt688!

Page 44: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

RNA-seq to determine the expression dynamics during development!•  Laser microdissection to precisely

extract tissue from developing organs!

•  Use RNA-seq to watch different classes of genes become activated at different stages of development!

•  When those genes are delayed or interupted, tomato mutants take on very different branching patterns.!

Page 45: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Methyl-seq

Finding the fifth base: Genome-wide sequencing of cytosine methylation Lister and Ecker (2009) Genome Research. 19: 959-966

Page 46: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

“The queen honey bee and her worker sisters do not seem to have much in common. Workers are active and intelligent, skillfully navigating the outside world in search of food for the colony. They never reproduce; that task is left entirely to the much larger and longer-lived queen, who is permanently ensconced within the colony and uses a powerful chemical influence to exert control. Remarkably, these two female castes are generated from identical genomes. The key to each female's developmental destiny is her diet as a larva: future queens are raised on royal jelly. This specialized diet is thought to affect a particular chemical modification, methylation, of the bee's DNA, causing the same genome to be deployed differently. “!

Page 47: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Bisulfite Conversion

Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications Krueger and Andrews (2010) Bioinformatics. 27 (11): 1571-1572.

Treating DNA with sodium bisulfite will convert unmethylated C to T!!•  5-MethyC will be protected and not change,

so can look for differences when mapping!

•  Requires great care when analyzing reads, since the complementary strand will also be converted (G to A)!

•  Typically analyzed by mapping to a “reduced alphabet” where we assume all Cs are converted to Ts once on the forward strand and once on the reverse!

Page 48: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Bisulfite Conversion

Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications Krueger and Andrews (2010) Bioinformatics. 27 (11): 1571-1572.

Treating DNA with sodium bisulfite will convert unmethylated C to T!!•  5-MethyC will be protected and not change,

so can look for differences when mapping!

•  Requires great care when analyzing reads, since the complementary strand will also be converted (G to A)!

•  Typically analyzed by mapping to a “reduced alphabet” where we assume all Cs are converted to Ts once on the forward strand and once on the reverse!

Page 49: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

ChIP-seq

Genome-wide mapping of in vivo protein-DNA interactions. Johnson et al (2007) Science. 316(5830):1497-502

Page 50: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

ChIP-seq

Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data Valouev et al (2008) Nature Methods. 5, 829 - 834

Goals:!•  Where are transcription

factors and other proteins binding to the DNA?!

•  How strongly are they binding?!

•  Do the protein binding patterns change over developmental stages or when the cells are stressed?!

Page 51: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Related Assays

ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions Furey (2012) Nature Reviews Genetics. 13, 840-852

Page 52: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

ENCODE Data Sets

1,640 data sets total over 147 different cell types

Page 53: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Summary of ENCODE elements

“Accounting for all these elements, a surprisingly large amount of the human genome, 80.4%, is covered by at least one ENCODE-identified element” •  62% transcribed •  56% enriched for histone marks •  15% open chromatin •  8% TF binding •  19% At least one DHS or TF Chip-seq peak •  4% TF binding site motif •  (Note protein coding genes comprise ~2.94% of the genome)

“Given that the ENCODE project did not assay all cell types, or all transcription factors, and in particular has sampled few specialized or developmentally restricted cell lineages, these proportions must be

underestimates of the total amount of functional bases.”

Page 54: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

ChromHMM: Signal Integration

•  Summarize the individual assays into 7 functional/regulatory states using an HMM across the genome

ChromHMM: automating chromatin-state discovery and characterization Ernst & Kellis (2012) Nature Methods. doi:10.1038/nmeth.1906

Page 55: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Genotyping vs *-seq •  Genotyping: Identify Variations

•  *-seq: Classify & measure significant peaks

GCCCTATCG GCCCTATCG

CCTATCGGA CTATCGGAAA

AAATTTGC AAATTTGC

TTTGCGGT

TCGGAAATT CGGAAATTT CGGAAATTT

GGAAATTTG

…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC… ATAC… …CC

GAAATTTGC

Page 56: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

WIG/bigWIG Format

1 2 3 4 5 6 C

over

age

Contig

Reads

•  Coverage can change at every single position (3B integers)!•  But we often want to summarize to every 100th or every 1000th !•  WIG format to the rescue!!!

Page 57: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

WIG/bigWIG Format Wiggle format is line-oriented, 1st line must be a track definition, followed by declaration lines and data lines!!

fixedStep is for data with regular intervals between new data values!!

fixedStep chrom=chrN start=position step=stepInterval [span=windowSize]!dataValues!!fixedStep chrom=chr3 start=400601 step=100!11!22!33!

!variableStep is for data with irregular intervals!!

variableStep chrom=chrN [span=windowSize]!chromStartA dataValueA!!variableStep chrom=chr2!300701 12.5!300702 12.5!300703 12.5!300704 12.5!300705 12.5!

Page 58: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

WIG Example browser position chr19:49304200-49310700!browser hide all!# !150 base wide bar graph at arbitrarily spaced positions,!# !threshold line drawn at y=11.76!# !autoScale off viewing range set to [0:25]!# !priority = 10 positions this as the first graph!# !Note, one-relative coordinate system in use for this format!track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10!variableStep chrom=chr19 span=150!49304701 10.0!49304901 12.5!49305401 15.0!49305601 17.5!49305901 20.0!49306081 17.5!49306301 15.0!49306691 12.5!49307871 10.0!!# !200 base wide points graph at every 300 bases, 50 pixel high graph!# !autoScale off and viewing range set to [0:1000]!# !priority = 20 positions this as the second graph!# !Note, one-relative coordinate system in use for this format!track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off viewLimits=0:1000 color=0,200,100 maxHeightPixels=100:50:20 graphType=points priority=20!fixedStep chrom=chr19 start=49307401 step=300 span=200!1000! 900! 800! 700! 600! 500! 400! 300! 200! 100!

Page 59: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

WIG Example browser position chr19:49304200-49310700!browser hide all!# !150 base wide bar graph at arbitrarily spaced positions,!# !threshold line drawn at y=11.76!# !autoScale off viewing range set to [0:25]!# !priority = 10 positions this as the first graph!# !Note, one-relative coordinate system in use for this format!track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10!variableStep chrom=chr19 span=150!49304701 10.0!49304901 12.5!49305401 15.0!49305601 17.5!49305901 20.0!49306081 17.5!49306301 15.0!49306691 12.5!49307871 10.0!!# !200 base wide points graph at every 300 bases, 50 pixel high graph!# !autoScale off and viewing range set to [0:1000]!# !priority = 20 positions this as the second graph!# !Note, one-relative coordinate system in use for this format!track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off viewLimits=0:1000 color=0,200,100 maxHeightPixels=100:50:20 graphType=points priority=20!fixedStep chrom=chr19 start=49307401 step=300 span=200!1000! 900! 800! 700! 600! 500! 400! 300! 200! 100!

Page 60: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

WIG Example browser position chr19:49304200-49310700!browser hide all!# !150 base wide bar graph at arbitrarily spaced positions,!# !threshold line drawn at y=11.76!# !autoScale off viewing range set to [0:25]!# !priority = 10 positions this as the first graph!# !Note, one-relative coordinate system in use for this format!track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10!variableStep chrom=chr19 span=150!49304701 10.0!49304901 12.5!49305401 15.0!49305601 17.5!49305901 20.0!49306081 17.5!49306301 15.0!49306691 12.5!49307871 10.0!!# !200 base wide points graph at every 300 bases, 50 pixel high graph!# !autoScale off and viewing range set to [0:1000]!# !priority = 20 positions this as the second graph!# !Note, one-relative coordinate system in use for this format!track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off viewLimits=0:1000 color=0,200,100 maxHeightPixels=100:50:20 graphType=points priority=20!fixedStep chrom=chr19 start=49307401 step=300 span=200!1000! 900! 800! 700! 600! 500! 400! 300! 200! 100!

Page 61: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

BED Format Simple tab-delimited general format for recording “intervals”!!Required fields:!1. chrom: !The name of the sequence!2. chromStart: !The 0-based starting position!3. chromEnd: !The 0-based half-open ending position !!

The first 100 bases of a sequence are defined as chromStart=0, chromEnd=100, !and span the bases numbered 0-99.!

!The 9 additional optional BED fields are:!4. name: !Defines the name of the BED line!5. score: !A score value (typically between 0 and 1000)!6. strand: !Defines the strand - either '+' or '-'.!7. thickStart: !The starting position at which the feature is drawn thickly!8. thickEnd: !The ending position at which the feature is drawn thickly!9. itemRgb: !An RGB value of the form R,G,B (e.g. 255,0,0). !10. blockCount: !The number of blocks (exons) in the BED line.!11. blockSizes: !A comma-separated list of the block sizes.!12. blockStarts: !A comma-separated list of block starts. !

http://genome.ucsc.edu/FAQ/FAQformat.html

Page 62: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

BED Example browser position chr7:127471196-127495720!browser hide all!track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"!chr7!127471196!127472363!Pos1!0 !+ !127471196!127472363!255,0,0!chr7!127472363!127473530!Pos2!0 !+ !127472363!127473530!255,0,0!chr7!127473530!127474697!Pos3!0 !+ !127473530!127474697!255,0,0!chr7!127474697!127475864!Pos4!0 !+ !127474697!127475864!255,0,0!chr7!127475864!127477031!Neg1!0 !- !127475864!127477031!0,0,255!chr7!127477031!127478198!Neg2!0 !- !127477031!127478198!0,0,255!chr7!127478198!127479365!Neg3!0 !- !127478198!127479365!0,0,255!chr7!127479365!127480532!Pos5!0 !+ !127479365!127480532!255,0,0!chr7!127480532!127481699!Neg4!0 !- !127480532!127481699!0,0,255!

http://genome.ucsc.edu/FAQ/FAQformat.html

Page 63: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Outline

1.  Alignment to other genomes 2.  Prediction aka “Gene Finding” 3.  Experimental & Functional Assays 4.  Online Resources

Page 64: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Common Genomics Questions

•  What is the closest gene to this ChIP-seq peak?!

•  Is my latest discovery novel?!

•  Is there strand bias in my data?!

•  How many genes does this mutation affect?!

•  Where did I fail to collect sequence coverage?!

•  Is this feature significantly correlated with some other feature?!

Solution is to integrate (many) online resources with your own data!

Page 65: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

NCBI http://www.ncbi.nlm.nih.gov/

Page 66: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Ensembl http://www.ensembl.org

Page 67: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Biomart http://www.biomart.org

Page 68: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

UCSC Genome Browser http://genome.ucsc.edu/

Page 69: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

UCSC Genome Browser / Table Browser http://genome.ucsc.edu/cgi-bin/hgTables?command=start

Page 70: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

BEDTools to the rescue!

BED+ G[T|F]F VCF BAM BIGWIG (future)

Page 71: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Find SNPs that have the potential to alter gene expression regulation by affecting methylation at CpG islands.

Wednesday @ 1pm

Page 72: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Annotation Summary •  Three major approaches to annotate a genome

1.  Alignment: •  Does this sequence align to any other sequences of known function? •  Great for projecting knowledge from one species to another

2.  Prediction: •  Does this sequence statistically resemble other known sequences? •  Potentially most flexible but dependent on good training data

3.  Experimental: •  Lets test to see if it is transcribed/methylated/bound/etc •  Strongest but expensive and context dependent

•  Many great resources available –  Learn to love the literature and the databases –  Standard formats let you rapidly query and cross reference –  Google is your number one resource #

•  Coming up: –  IGV, QC, Variant Analysis, De novo assembly, Transcriptome, etc…

Page 73: Genome Annotations - Schatzlabschatzlab.cshl.edu/teaching/2014/AdvSeqCourse... · • Ensembl Genome Browser • SAM, VCF, GFF and Wiggle 0-based coordinates • BAM, BCFv2, BED,

Thank you!

http://schatzlab.cshl.edu @mike_schatz