24a - mapping short sequencing reads

8/6/2019 24A - Mapping Short Sequencing Reads

1/17

Mapping short sequencing reads

GENOME 373: Genomic Informatics

Prof. William Stafford Noble


2/17

One-minute responses

All these sequencing methods are for sequencing genomes, right? Or is this for replicating DNA to perform testing and analysis?
The methods we discussed are just for sequencing genomes (or metagenomes from microbial communities).

I'm sorry, but I'm still a little confused by today's sequencing content. The method we use today is Sanger, right? Because we get reads of several hundred bases each time, instead of the 35 or 36 bp read by Solexa or SOLiD.
Yes, the method you use today is Sanger sequencing. The reads are a lot longer, but they are more expensive. Next-generation technology works by producing lots of very short reads.

Today was a bit technical for a brain-dead Friday, but still interesting.

Today's lecture material was terrific; this material is why I enrolled. Some explanations were a little hand-wavy, but I'm excited to read more. Seeing some applications would be great.

Today was very interesting! It was also more straightforward.

My friend works in a nanopore sequencing lab in physics, seems really cool.
Your friend, or the sequencing?


3/17

One-minute responses

How can the t statistic be negative, when the denominator is an absolute value?
Sorry; the formula I gave you is for the two-tailed test. If you want to do a one-tailed test, then you have to remove the absolute value.

Does computing the t statistic itself tell us something, or is it necessary to compute the p-value?
The t statistic itself is not of much use without the p-value calculation.

Why is there a difference in the FDR calculation for peptides versus microarrays?
See the next three slides.


4/17

Estimating false discovery rate

[Figure: spectra are searched with SEQUEST against a target peptide database and a decoy peptide database, producing target and decoy peptide-spectrum matches (PSMs). The PSMs are sorted by XCorr, with three annotations on the sorted list: FDR = 0/5 = 0%, FDR = 1/7 = 14%, and FDR = 2/10 = 20%.]
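The target-decoy estimate on this slide can be sketched in a few lines. This is a minimal illustration, not the course's code: the function name `decoy_fdr`, the toy scores, and the target/decoy labels are all made up, and the sketch assumes the common convention of dividing the number of decoy PSMs above the threshold by the number of target PSMs above it (the slide does not say whether its denominator counts targets only or all PSMs).

```python
def decoy_fdr(psms, threshold):
    """Estimate the FDR among PSMs scoring at or above `threshold`.

    psms: list of (score, is_decoy) pairs (hypothetical input format).
    Assumption: FDR is estimated as (# decoys) / (# targets) above the threshold.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing to be wrong about
    return decoys / targets


# Toy list, already sorted by score: T = target PSM, D = decoy PSM.
labels = "TTTTTDTTDTTT"
psms = [(12 - i, label == "D") for i, label in enumerate(labels)]

for threshold in (8, 5, 1):
    print(threshold, round(decoy_fdr(psms, threshold), 3))
```

With this toy list, the three thresholds give 0/5, 1/7, and 2/10, matching the shape of the example on the slide.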


5/17

False discovery rate

The false discovery rate (FDR) is the percentage of genes above a given position in the ranked list that are expected to be false positives.

In microarray analysis, FDR is the percentage of flagged genes that are not differentially expressed.

We can estimate the number of errors from the t-test p-values (details omitted).

[Figure labels: 5 FP, 13 TP, 33 TN, 5 FN]

FDR = FP / (FP + TP) = 5/18 = 27.8%


6/17

What is the difference?

For PSMs, we use an explicit null model.

Color indicates whether the PSM is target or decoy.

For gene expression, we use an analytic null.

Color indicates whether the gene is actually differentially expressed or not.

In either case, the false discovery rate is the estimated percentage of items (genes/PSMs) above the threshold that are incorrect.

[Figure labels: PSMs sorted by XCorr, FDR = 2/10 = 20%; genes: 5 FP, 13 TP, 33 TN, 5 FN]


7/17

Short read mapping

Input:

A reference genome

A collection of many 25-100bp tags

User-specified parameters

Output:

One or more genomic coordinates for each tag

In practice, only 70-75% of tags successfully map to the reference genome. Why?


8/17

Multiple mapping

A single tag may occur more than once in the reference genome.

The user may choose to ignore tags that appear more than n times.

As n gets large, you get more data, but also more noise in the data.
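The effect of the cutoff n can be seen in a small sketch. The names `find_all`, `map_with_limit`, and `max_hits`, and the toy genome, are hypothetical; the sketch assumes exact matches on the forward strand only and 0-based positions.

```python
def find_all(sequence, tag):
    """Return every 0-based start position of `tag` in `sequence` (overlaps included)."""
    positions = []
    start = sequence.find(tag)
    while start != -1:
        positions.append(start)
        start = sequence.find(tag, start + 1)
    return positions


def map_with_limit(genome, tags, max_hits):
    """Map each tag exactly; keep only tags with between 1 and `max_hits` locations.

    genome: dict of chromosome name -> sequence string (assumed input format).
    Returns a dict of tag -> list of (chromosome, position).
    """
    kept = {}
    for tag in tags:
        hits = [(chrom, pos)
                for chrom, seq in genome.items()
                for pos in find_all(seq, tag)]
        if 0 < len(hits) <= max_hits:
            kept[tag] = hits
    return kept


genome = {"chr1": "ACGTACGTTT", "chr2": "GGACGTCC"}
tags = ["ACGT", "TTT", "CCCC"]

unique_only = map_with_limit(genome, tags, max_hits=1)  # keeps only TTT
relaxed = map_with_limit(genome, tags, max_hits=3)      # also keeps ACGT (3 hits)
print(unique_only)
print(relaxed)
```

Raising `max_hits` from 1 to 3 admits the repeated tag `ACGT`, which adds data but also adds locations that cannot be told apart.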


9/17

Inexact matching

An observed tag may not exactly match any position in the reference genome.

Sometimes, the tag almost matches one or more positions.

Such mismatches may represent a SNP or a bad read-out.

The user can specify the maximum number of mismatches, or a phred-style quality score threshold.

As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags.
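A brute-force sketch makes the mismatch trade-off concrete. It assumes substitutions only (no insertions or deletions) and ignores quality scores; `hamming`, `map_inexact`, and `max_mismatches` are illustrative names rather than any tool's interface.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def map_inexact(sequence, tag, max_mismatches):
    """Return (position, mismatches) for every window within `max_mismatches` of `tag`."""
    hits = []
    for pos in range(len(sequence) - len(tag) + 1):
        distance = hamming(sequence[pos:pos + len(tag)], tag)
        if distance <= max_mismatches:
            hits.append((pos, distance))
    return hits


reference = "ACGTACGTTT"
tag = "ACGA"  # differs from ACGT at its last base

for allowed in (0, 1, 3):
    print(allowed, map_inexact(reference, tag, allowed))
```

With no mismatches allowed the tag does not map at all; with one it maps to positions 0 and 4; with three it also picks up position 1, which is almost certainly a spurious hit.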



10/17

Short-read analysis software


11/17

Spaced seed alignment

Tags and tag-sized pieces of reference are cut into small seeds.

Pairs of spaced seeds are stored in an index.

Look up spaced seeds for each tag.

For each hit, confirm the remaining positions.

Report results to the user.
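The index, lookup, and confirm steps can be sketched with a deliberately simplified variant. Instead of pairs of spaced seeds, this sketch cuts each tag into `max_mismatches + 1` contiguous seeds: if the tag has at most `max_mismatches` substitutions, at least one seed must match the reference exactly, so looking up every seed finds all candidate positions. All names here are hypothetical, and real spaced-seed aligners differ in how seeds are chosen and stored.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Index every seed_len-mer of the reference by its start position."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def map_with_seeds(reference, index, seed_len, tag, max_mismatches):
    """Seed lookup followed by verification of the remaining positions.

    Simplification (assumption): contiguous seeds, one exact seed hit required.
    """
    n_seeds = max_mismatches + 1
    if seed_len * n_seeds > len(tag):
        raise ValueError("tag is too short for this many seeds")

    # Step 1: look up each seed and convert seed hits to candidate tag starts.
    candidates = set()
    for i in range(n_seeds):
        offset = i * seed_len
        seed = tag[offset:offset + seed_len]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset
            if 0 <= start <= len(reference) - len(tag):
                candidates.add(start)

    # Step 2: confirm the remaining positions for each candidate.
    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(tag)]
        mismatches = sum(1 for x, y in zip(window, tag) if x != y)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "TTACGTACGGATCCA"
index = build_seed_index(reference, seed_len=4)
print(map_with_seeds(reference, index, 4, "ACGTTCGG", max_mismatches=1))
```

Here the second seed (`TCGG`) contains the mismatch and finds nothing, but the first seed (`ACGT`) hits the index, and verification confirms the tag at position 2 with one mismatch.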


12/17

Burrows-Wheeler

Store entire reference genome.

Align tag base by base from the end.

When tag is traversed, all active locations are reported.

If no match is found, then back up and try a substitution.
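The "align from the end" step corresponds to backward search over a Burrows-Wheeler index. The sketch below covers exact matching only; the back-up-and-substitute step is left out, and the index is built naively by sorting suffixes, which is fine for a toy string but not for a genome. It assumes the reference does not contain the sentinel character `$`; all function names are illustrative.

```python
def build_bwt_index(reference):
    """Build a suffix array, first-column offsets, and occurrence counts (naive construction)."""
    text = reference + "$"  # "$" sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # first[c] = number of characters in the text that sort before c
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, first, occ


def backward_search(tag, sa, first, occ):
    """Match the tag base by base from its last character; return sorted start positions."""
    lo, hi = 0, len(sa)  # current range of "active" suffixes
    for ch in reversed(tag):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []  # no suffix extends the match
    return sorted(sa[i] for i in range(lo, hi))


sa, first, occ = build_bwt_index("ACGTACGTTT")
print(backward_search("ACGT", sa, first, occ))
print(backward_search("GGG", sa, first, occ))
```

Each step narrows the range of suffix-array rows whose suffixes start with the part of the tag seen so far; once the whole tag is consumed, the rows still in range are the reported locations.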


13/17

Comparison

Burrows-Wheeler

Requires


14/17

Spliced-read mapping

Used for processed mRNA data.

Reports reads that span introns.

Examples: TopHat, ERANGE


15/17

Remaining lectures

Short read mapping case studies

Phylogenetics (1-2 lectures)

UCSC Genome Browser

Practical computational biology


16/17

Problem #1

Modify the program find-unique-tags.py to report the location of each tag in the genome.

Use loops, rather than string methods.

> python map-tags.py genome.txt tags.txt locations.txt

Read 18917 bases in 4 chromosomes from genome.txt.

Read 1196 tags from tags.txt.

Mapped to 41122 locations.
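One possible shape for the core of this exercise, using loops in place of string methods, is sketched below. It works on in-memory data only, because the formats of genome.txt, tags.txt, and locations.txt are not given here; the function names and the 0-based coordinates are assumptions.

```python
def matches_at(sequence, tag, start):
    """Compare the tag to the sequence one character at a time, beginning at `start`."""
    for offset in range(len(tag)):
        if sequence[start + offset] != tag[offset]:
            return False
    return True


def locate_tag(genome, tag):
    """Return every (chromosome, position) at which the tag occurs exactly.

    genome: dict of chromosome name -> sequence string (assumed representation).
    """
    locations = []
    for chrom, sequence in genome.items():
        for start in range(len(sequence) - len(tag) + 1):
            if matches_at(sequence, tag, start):
                locations.append((chrom, start))
    return locations


genome = {"chr1": "ACGTACGTTT", "chr2": "GGACGTCC"}
print(locate_tag(genome, "ACGT"))
```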


17/17

Problem #2

Assume that you do not have enough memory to store the entire genome.

Modify the program map-tags.py to first read the tags into memory, and then scan the genome once.

The output should stay the same, but in a different order.

> python map-tags2.py genome.txt tags.txt locations.txt

Read 8372 bases in 1196 sequences from tags.txt.

Read 4 chromosomes from genome.txt.

Mapped to 41122 locations.
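The inverted strategy, tags in memory and a single pass over the genome, might look like the sketch below. It is only an outline under stated assumptions: chromosomes arrive one at a time from an iterator (so one chromosome, not the whole genome, is held at once), file parsing is omitted because the file formats are not given here, and duplicate tags collapse into one set entry. The name `scan_genome_once` is hypothetical.

```python
def scan_genome_once(chromosomes, tags):
    """Scan each chromosome once, checking every window against the in-memory tags.

    chromosomes: iterable of (name, sequence) pairs, consumed a single time.
    tags: list of tag strings; duplicates are collapsed (simplifying assumption).
    Returns a list of (tag, chromosome, position) in genome order.
    """
    # Group tags by length so that each window needs one set lookup per length.
    tags_by_length = {}
    for tag in tags:
        tags_by_length.setdefault(len(tag), set()).add(tag)

    locations = []
    for name, sequence in chromosomes:
        for start in range(len(sequence)):
            for length, tag_set in tags_by_length.items():
                window = sequence[start:start + length]
                if len(window) == length and window in tag_set:
                    locations.append((window, name, start))
    return locations


def toy_chromosomes():
    """Stand-in for reading the genome file one chromosome at a time."""
    yield "chr1", "ACGTACGTTT"
    yield "chr2", "GGACGTCC"


result = scan_genome_once(toy_chromosomes(), ["ACGT", "TTT"])
for tag, chrom, pos in result:
    print(tag, chrom, pos)
```

The hits come out ordered by genome position instead of by tag, which is the change in output order the problem statement anticipates.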
