24a - mapping short sequencing reads

Upload: chpaul

Post on 07-Apr-2018

  • 8/6/2019 24A - Mapping Short Sequencing Reads

    Mapping short sequencing reads

    GENOME 373: Genomic Informatics

    Prof. William Stafford Noble

    One-minute responses

    All these sequencing methods are for sequencing genomes, right? Or is this for replicating DNA to perform testing and analysis?

        The methods we discussed are just for sequencing genomes (or metagenomes from microbial communities).

    I'm sorry, but I'm still a little confused by today's sequencing content. The method we use today is Sanger, right? Because we can get several hundred bases every time, instead of the 36bp or 35bp read by Solexa or SOLiD.

        Yes, the method you use today is Sanger sequencing. The reads are a lot longer, but they are more expensive. Next-generation technology works by producing lots of very short reads.

    Today was a bit technical for a brain-dead Friday, but still interesting.

    Today's lecture material was terrific; this material is why I enrolled. Some explanations were a little hand-wavy, but I'm excited to read more. Seeing some applications would be great.

    Today was very interesting! It was also more straightforward.

    My friend works in a nanopore sequencing lab in physics, seems really cool.

        Your friend, or the sequencing?

    One-minute responses

    How can the t statistic be negative, when the denominator is an absolute value?

        Sorry; the formula I gave you is for the two-tailed test. If you want to do a one-tailed test, then you have to remove the absolute value.

    Does computing the t statistic itself tell us something, or is it necessary to compute the p-value?

        The t statistic itself is not of much use without the p-value calculation.

    Why is there a difference in the FDR calculation for peptides versus microarrays?

        See the next three slides.

    Estimating false discovery rate

    [Figure: spectra are searched with SEQUEST against a target peptide database and a decoy peptide database, producing target peptide-spectrum matches (PSMs) and decoy PSMs. The PSMs are sorted by XCorr, and the FDR is shown at three thresholds.]

    FDR = 0/5 = 0%

    FDR = 1/7 = 14%

    FDR = 2/10 = 20%
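The three FDR values on this slide can be reproduced with a short Python sketch. It rests on one assumption that the slide does not state outright: that the FDR at rank k is the number of decoy PSMs in the top k divided by all PSMs in the top k. The ranked list below is made up so that the fractions match the slide; `running_fdr` is a hypothetical helper name.

```python
def running_fdr(labels):
    """Return the estimated FDR after each position in a ranked PSM list.

    labels: "target" or "decoy" strings, best score first.
    Assumption: FDR at rank k = (decoys in top k) / (all PSMs in top k).
    """
    fdrs = []
    decoys = 0
    for rank, label in enumerate(labels, start=1):
        if label == "decoy":
            decoys += 1
        fdrs.append(decoys / rank)
    return fdrs


# Made-up ranking (T = target, D = decoy), chosen to match the slide's fractions.
ranked = ["target" if c == "T" else "decoy" for c in "TTTTTDTDTT"]
fdrs = running_fdr(ranked)
for k in (5, 7, 10):
    print(f"top {k}: FDR = {fdrs[k - 1]:.0%}")
```

Under that assumption the thresholds at 5, 7, and 10 PSMs give 0/5, 1/7, and 2/10.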

    False discovery rate

    The false discovery rate (FDR) is the percentage of genes above a given position in the ranked list that are expected to be false positives.

    In microarray analysis, FDR is the percentage of flagged genes that are not differentially expressed.

    We can estimate the number of errors from the t-test p-values (details omitted).

    Example counts: 5 FP, 13 TP, 33 TN, 5 FN

    FDR = FP / (FP + TP) = 5/18 = 27.8%
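As a quick check of the arithmetic, here is a minimal Python sketch of the formula using the counts from this slide. The function name and the choice to return 0.0 when nothing is flagged are assumptions made for the example.

```python
def false_discovery_rate(fp, tp):
    """Fraction of flagged items (FP + TP) that are false positives."""
    flagged = fp + tp
    if flagged == 0:
        return 0.0  # nothing flagged; this convention is an assumption
    return fp / flagged


fdr = false_discovery_rate(fp=5, tp=13)
print(f"FDR = {fdr:.1%}")  # 5 of the 18 flagged genes are false positives
```

Note that TN and FN do not enter the calculation: the FDR only looks at the items above the threshold.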

    What is the difference?

    For PSMs, we use an explicit null model.

    Color indicates whether the PSM is target or decoy.

    For gene expression, we use an analytic null.

    Color indicates whether the gene is actually differentially expressed or not.

    In either case, the false discovery rate is the estimated percentage of items (genes/PSMs) above the threshold that are incorrect.

    [Figures: PSMs sorted by XCorr, FDR = 2/10 = 20%; gene counts 5 FP, 13 TP, 33 TN, 5 FN.]

    Short read mapping

    Input:

    A reference genome

    A collection of many 25-100bp tags

    User-specified parameters

    Output:

    One or more genomic coordinates for each tag

    In practice, only 70-75% of tags successfully map to the reference genome. Why?

    Multiple mapping

    A single tag may occur more than once in the reference genome.

    The user may choose to ignore tags that appear more than n times.

    As n gets large, you get more data, but also more noise in the data.
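The effect of the cutoff n can be sketched in a few lines of Python. This is only an illustration: it treats the genome as a single string, uses exact matching, and the names `find_exact_hits`, `map_and_filter`, and `max_hits` are invented for the example.

```python
def find_exact_hits(genome, tag):
    """Return every 0-based start position where tag occurs in genome."""
    hits = []
    start = genome.find(tag)
    while start != -1:
        hits.append(start)
        start = genome.find(tag, start + 1)  # step by 1 so overlapping hits count
    return hits


def map_and_filter(genome, tags, max_hits):
    """Map each tag; drop tags with no hit or with more than max_hits hits."""
    mapped = {}
    for tag in tags:
        hits = find_exact_hits(genome, tag)
        if 0 < len(hits) <= max_hits:
            mapped[tag] = hits
    return mapped


genome = "ACGTACGTTTACGA"  # toy reference, made up for the example
tags = ["ACG", "TTT", "GGG"]
unique_only = map_and_filter(genome, tags, max_hits=1)
permissive = map_and_filter(genome, tags, max_hits=5)
print(unique_only)
print(permissive)
```

With `max_hits=1` only the uniquely placed tag survives; raising the cutoff keeps the repeated tag as well, which adds data but also ambiguity about where that tag really came from.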

    Inexact matching

    An observed tag may not exactly match any position in the reference genome.

    Sometimes, the tag almost matches one or more positions.

    Such mismatches may represent a SNP or a bad read-out.

    The user can specify the maximum number of mismatches, or a phred-style quality score threshold.

    As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags.
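The trade-off described above can be seen in a brute-force Python sketch that slides the tag along the genome and counts mismatches in each window. It covers only the "maximum number of mismatches" option (no quality scores, no insertions or deletions), and the function names are invented for the example.

```python
def count_mismatches(a, b, limit):
    """Count positions where equal-length strings differ, stopping once past limit."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                break  # already too many; no need to keep counting
    return mismatches


def map_with_mismatches(genome, tag, max_mismatches):
    """Return (position, mismatches) for every window within the limit."""
    hits = []
    for start in range(len(genome) - len(tag) + 1):
        window = genome[start:start + len(tag)]
        d = count_mismatches(window, tag, max_mismatches)
        if d <= max_mismatches:
            hits.append((start, d))
    return hits


genome = "ACGTACGTTTACGA"  # toy reference, made up for the example
exact = map_with_mismatches(genome, "ACGA", 0)
loose = map_with_mismatches(genome, "ACGA", 1)
print(exact)
print(loose)
```

Allowing one mismatch turns a single placement into three, two of which may be wrong; that is the increase in incorrectly mapped tags the slide warns about.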

    Short-read analysis software

    Spaced seed alignment

    Tags and tag-sized pieces of reference are cut into small seeds.

    Pairs of spaced seeds are stored in an index.

    Look up spaced seeds for each tag.

    For each hit, confirm the remaining positions.

    Report results to the user.
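The steps above can be sketched in Python as follows. This is a toy version written for illustration, not the code of any particular aligner. It assumes each tag is cut into four equal seeds and that up to two mismatches are allowed, so that at least one pair of seeds must still match exactly; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

NUM_SEEDS = 4  # assumed: each tag-sized window is cut into four equal seeds


def seed_pair_keys(window):
    """Yield one key per pair of seeds: (i, j, seed_i + seed_j)."""
    size = len(window) // NUM_SEEDS
    seeds = [window[k * size:(k + 1) * size] for k in range(NUM_SEEDS)]
    for i, j in combinations(range(NUM_SEEDS), 2):
        yield (i, j, seeds[i] + seeds[j])


def build_index(genome, tag_length):
    """Index every tag-sized window of the genome by its seed pairs."""
    index = defaultdict(list)
    for start in range(len(genome) - tag_length + 1):
        for key in seed_pair_keys(genome[start:start + tag_length]):
            index[key].append(start)
    return index


def map_tag(genome, index, tag, max_mismatches=2):
    """Look up the tag's seed pairs, then confirm the remaining positions."""
    candidates = set()
    for key in seed_pair_keys(tag):
        candidates.update(index.get(key, []))
    hits = []
    for start in sorted(candidates):
        window = genome[start:start + len(tag)]
        mismatches = sum(1 for x, y in zip(window, tag) if x != y)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


genome = "TTGACCATGCAAGGCT"  # toy reference, made up for the example
index = build_index(genome, tag_length=8)
print(map_tag(genome, index, "CCATGCAA"))  # exact copy of genome[4:12]
print(map_tag(genome, index, "CGATGTAA"))  # same window with two substitutions
```

The second tag differs from the reference in its first and third seeds, but its second and fourth seeds still match, so the index lookup finds the candidate position and the confirmation step counts the two mismatches.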

    Burrows-Wheeler

    Store entire reference genome.

    Align tag base by base from the end.

    When tag is traversed, all active locations are reported.

    If no match is found, then back up and try a substitution.
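A toy Python sketch of this idea is shown below. It makes several simplifying assumptions: the suffix array is built by naively sorting all suffixes (fine for a short string, not for a real genome), the occurrence table is stored in full, and instead of trying the exact path first and backing up only on failure, the search simply explores every substitution that fits within a mismatch budget. Names such as `build_fm_index` and `search_with_substitutions` are invented for the example.

```python
def build_fm_index(text):
    """Build a toy index: suffix array, first-column offsets, occurrence counts."""
    text += "$"  # sentinel; sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in text that sort before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][k]: number of occurrences of c in bwt[:k].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (1 if ch == c else 0)
    return suffix_array, first, occ


def search_with_substitutions(tag, index, max_mismatches):
    """Match tag from its last base to its first, allowing substitutions."""
    suffix_array, first, occ = index
    hits = set()

    def extend(i, lo, hi, budget):
        if lo >= hi:
            return  # no active locations left on this path; back up
        if i < 0:
            # Whole tag traversed: report all active locations.
            hits.update(suffix_array[k] for k in range(lo, hi))
            return
        for c in first:
            if c == "$":
                continue
            cost = 0 if c == tag[i] else 1  # 1 means "try a substitution"
            if cost <= budget:
                extend(i - 1,
                       first[c] + occ[c][lo],
                       first[c] + occ[c][hi],
                       budget - cost)

    extend(len(tag) - 1, 0, len(suffix_array), max_mismatches)
    return sorted(hits)


index = build_fm_index("ACGTACGTTTACGA")  # toy reference, made up for the example
print(search_with_substitutions("ACGA", index, 0))
print(search_with_substitutions("ACGA", index, 1))
```

The half-open range `[lo, hi)` is the set of "active locations" from the slide: after each base it narrows to the suffixes that are still consistent with the part of the tag read so far.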

    Comparison

    Burrows-Wheeler

    Requires

    Spliced-read mapping

    Used for processed mRNA data.

    Reports reads that span introns.

    Examples: TopHat, ERANGE

    Remaining lectures

    Short read mapping case studies

    Phylogenetics (1-2 lectures)

    UCSC Genome Browser

    Practical computational biology

    Problem #1

    Modify the program find-unique-tags.py to report the location of each tag in the genome.

    Use loops, rather than string methods.

    > python map-tags.py genome.txt tags.txt locations.txt

    Read 18917 bases in 4 chromosomes from genome.txt.

    Read 1196 tags from tags.txt.

    Mapped to 41122 locations.
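One possible shape for the core of such a program is sketched below. The contents of find-unique-tags.py and the formats of genome.txt and tags.txt are not shown in these slides, so the sketch leaves out all file handling and assumes the chromosomes have already been read into a dictionary; the function names are hypothetical.

```python
def find_tag_locations(sequence, tag):
    """Return 0-based start positions of tag in sequence, using explicit loops."""
    locations = []
    for start in range(len(sequence) - len(tag) + 1):
        matched = True
        for offset in range(len(tag)):
            if sequence[start + offset] != tag[offset]:
                matched = False
                break  # first mismatch rules out this start position
        if matched:
            locations.append(start)
    return locations


def map_tags(chromosomes, tags):
    """Return (tag, chromosome name, position) for every occurrence."""
    results = []
    for tag in tags:
        for name, sequence in chromosomes.items():
            for position in find_tag_locations(sequence, tag):
                results.append((tag, name, position))
    return results


# Toy data, made up for the example.
chromosomes = {"chr1": "ACGTACG", "chr2": "TTACGA"}
tags = ["ACG", "TT"]
results = map_tags(chromosomes, tags)
for tag, name, position in results:
    print(tag, name, position)
```

Because the outer loop runs over tags, the whole genome is scanned once per tag, and the results come out grouped by tag.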

    Problem #2

    Assume that you do not have enough memory to store the entire genome.

    Modify the program map-tags.py to first read the tags into memory, and then scan the genome once.

    The output should stay the same, but in a different order.

    > python map-tags2.py genome.txt tags.txt locations.txt

    Read 8372 bases in 1196 sequences from tags.txt.

    Read 4 chromosomes from genome.txt.

    Mapped to 41122 locations.
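A sketch of the "tags in memory, one pass over the genome" idea is given below. It assumes the genome can be supplied one chromosome at a time (so one chromosome, rather than the whole genome, is held in memory), and it uses a set lookup and slicing instead of the explicit loops asked for in Problem #1. File formats are not shown in the slides, so input is passed in as Python objects; the names are hypothetical.

```python
def scan_genome_once(chromosome_stream, tags):
    """Map tags while holding the tags, not the whole genome, in memory.

    chromosome_stream: iterable of (name, sequence) pairs, consumed one at a time.
    Returns (tag, chromosome name, position) tuples in genome order.
    """
    tag_set = set(tags)
    lengths = sorted({len(tag) for tag in tag_set})
    results = []
    for name, sequence in chromosome_stream:
        for start in range(len(sequence)):
            for length in lengths:
                window = sequence[start:start + length]
                # Skip truncated windows at the end of the chromosome.
                if len(window) == length and window in tag_set:
                    results.append((window, name, start))
    return results


def toy_chromosomes():
    """Yield made-up chromosomes one at a time, standing in for a file reader."""
    yield "chr1", "ACGTACG"
    yield "chr2", "TTACGA"


results = scan_genome_once(toy_chromosomes(), ["ACG", "TT"])
for tag, name, position in results:
    print(tag, name, position)
```

The same locations are found as when scanning once per tag, but they come out ordered by genome position rather than grouped by tag, which matches the "same output, different order" note above.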