20131001 lab meeting

Error correction for next generation sequencing. Wu Chihua (Gigi), Matsuyama Lab M2, Bioinformatics Group. October 1st, 2013.

Upload: gigi-wu

Post on 10-May-2015


DESCRIPTION

2013 Oct. 01 part of slides for lab meeting

TRANSCRIPT

Page 1: 20131001 lab meeting

Error correction for next generation sequencing

Wu Chihua (Gigi)

Matsuyama Lab M2, Bioinformatics Group

October 1st, 2013

Tuesday, November 5, 2013

Page 2: 20131001 lab meeting

Agenda

• Background
• Existing research
• Toy Experiment
• Future work
• References

Page 3: 20131001 lab meeting

Background

why & what

Page 4: 20131001 lab meeting

DNA Sequencing

Angelina Jolie tested for one gene, what about the other 20,000?

Page 5: 20131001 lab meeting

20,000

1

full genome sequence

Page 6: 20131001 lab meeting

Genome

An organism's complete set of DNA

Page 7: 20131001 lab meeting

Chromosome = DNA + protein

a region of chromosome that controls a hereditary characteristic

Page 8: 20131001 lab meeting

Chromosome = DNA + protein

a region of chromosome that controls a hereditary characteristic

ATCG

base pair (bp)

Page 9: 20131001 lab meeting

Chromosome

Gene: a region of chromosome that controls a hereditary characteristic

20,000+

Page 10: 20131001 lab meeting

Human gene
average: 3,000 bp
largest: 2,400,000 bp

Human genome
3 billion bp

Human DNA
50 ~ 250 Mbp

Page 11: 20131001 lab meeting

Next Generation Sequencing

high throughput & cheaper

output short reads

Page 12: 20131001 lab meeting

Elaine R. Mardis. A decade’s perspective on DNA sequencing technology. Figure 1.

Page 13: 20131001 lab meeting

wikipedia. http://en.wikipedia.org/wiki/DNA_sequencing#cite_note-quail2012-37

Page 14: 20131001 lab meeting


Page 15: 20131001 lab meeting

Error Correction

Highly accurate sequenced reads will likely lead to higher-quality results.

Page 16: 20131001 lab meeting

Existing Research


Page 17: 20131001 lab meeting


Page 18: 20131001 lab meeting

Possible direction

To handle large genomes and larger datasets.

To handle insertion and deletion errors.

To correct hybrid datasets from multiple next generation platforms.

To develop error correction methods for datasets in population studies.


Page 19: 20131001 lab meeting

Toy experiment


Page 20: 20131001 lab meeting

short reads

1. find similar pairs of reads with SlideSort

2. vote on each position using the paired reads

3. decide the new base

4. correct the erroneous bases
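The correction steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the code used in the toy experiment: it assumes all reads have the same length and differ only by substitutions, and the hypothetical helper `similar_pairs` uses a brute-force Hamming-distance search as a stand-in for SlideSort. The names `hamming`, `similar_pairs`, and `correct_reads`, and the example reads, are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(reads, d):
    """Brute-force stand-in for SlideSort: index pairs within distance d."""
    return [(i, j) for i, j in combinations(range(len(reads)), 2)
            if hamming(reads[i], reads[j]) <= d]


def correct_reads(reads, d):
    """Correct each read by a per-position majority vote over its neighbours."""
    neighbours = {i: [i] for i in range(len(reads))}  # each read votes for itself
    for i, j in similar_pairs(reads, d):
        neighbours[i].append(j)
        neighbours[j].append(i)

    corrected = []
    for i, read in enumerate(reads):
        new_bases = []
        for pos, base in enumerate(read):
            votes = Counter(reads[k][pos] for k in neighbours[i])
            winner, count = votes.most_common(1)[0]
            # Keep the original base unless another base strictly wins the vote.
            new_bases.append(winner if count > votes[base] else base)
        corrected.append("".join(new_bases))
    return corrected


reads = ["ATGCATAA", "ATGCATAA", "ATGCATAA", "ATGGATAA"]
print(correct_reads(reads, d=1))  # the lone G at position 3 is voted back to C
```

Ties keep the original base, which is a conservative choice: a read is changed only when its neighbours clearly disagree with it.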

Page 21: 20131001 lab meeting

SlideSort

• All pairs similarity search (APSS) for sequence datasets.
• APSS: find all similar pairs in a dataset.
• Performance of SlideSort
  • 10 minutes for 10 million reads.
  • 2~3 GB for 10 million reads.
• Complexity of SlideSort
  • Time: O(N+α)
  • Equivalence classes are found in O(N).
  • α is the number of neighbor pairs.

Page 22: 20131001 lab meeting

SlideSort

Input
• A set of short reads
• Distance threshold d

Output
• Alignments and distances of all similar pairs.

[Figure: example input reads (ATGCATAATGCTCA, AAGTCGGAAGGTCG, ATTCATTATGCCCA, ATGTATTATGCTTA) passed through SlideSort; the output shows pairwise alignments with edit distances ed = 1 and ed = 2.]
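The "ed" values on this slide are edit distances. As a rough illustration of what that number means, here is a minimal sketch of the standard dynamic-programming (Levenshtein) computation. The function name `edit_distance` is an assumption for this example and is not SlideSort's own code or interface.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ATGCATAATGCTCA", "ATGCATAATGCTTA"))  # one substitution -> 1
print(edit_distance("AAGTCGG", "AAGCGG"))                  # one deletion -> 1
```

A pair of reads counts as "similar" when this value does not exceed the threshold d.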

Page 23: 20131001 lab meeting

ACGC.….

ATGC…….

AAGT…….

Naive approach: O(N^2)

How to reduce computational cost?

*Animation by Prof. Shimizu
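To make the quadratic cost concrete, the naive approach can be sketched as below. This is an illustration only: it assumes equal-length reads and uses Hamming distance (substitutions only) in place of full edit distance, and the function names are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def naive_all_pairs(reads, d):
    """Compare every pair of reads: N * (N - 1) / 2 comparisons for N reads."""
    comparisons = 0
    hits = []
    for i, j in combinations(range(len(reads)), 2):
        comparisons += 1
        if hamming(reads[i], reads[j]) <= d:
            hits.append((i, j))
    return hits, comparisons


reads = ["ACGC", "ATGC", "AAGT"]
hits, comparisons = naive_all_pairs(reads, d=1)
print(hits, comparisons)  # [(0, 1)] 3
```

With N = 10 million reads the same loop would need roughly 5 × 10^13 comparisons, which is why a filtering step is needed.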


Page 25: 20131001 lab meeting

ATGC…….

AAGT…….

Basic strategy:

1. Filtering stage: find subsets sharing common substring(s).

2. Pair-wise comparison stage: compare all pairs within each subset.

*Animation by Prof. Shimizu
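The two-stage strategy can be illustrated with a small filter-then-compare sketch. It is not SlideSort's actual algorithm. It assumes equal-length reads, substitution errors only (Hamming distance), and fixed-size blocks compared at the same offset; the filter is only guaranteed to keep every true pair when the number of blocks per read is larger than d. The names `filter_then_compare` and `block_len` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def filter_then_compare(reads, d, block_len):
    """Stage 1: bucket reads by (block offset, block content).
    Stage 2: compare only the pairs that share at least one bucket."""
    buckets = defaultdict(list)
    for idx, read in enumerate(reads):
        for start in range(0, len(read), block_len):
            buckets[(start, read[start:start + block_len])].append(idx)

    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(members, 2))

    hits = sorted(p for p in candidates
                  if hamming(reads[p[0]], reads[p[1]]) <= d)
    return hits, len(candidates)


reads = ["ACGCTTAA", "ATGCTTAA", "AAGTCCGG", "ACGCTTAG"]
hits, n_candidates = filter_then_compare(reads, d=1, block_len=4)
print(hits, n_candidates)  # [(0, 1), (0, 3)] 2
```

Only 2 candidate pairs are compared here instead of the 6 pairs the naive approach would check.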


Page 29: 20131001 lab meeting

ACGC.….

ATGC…….

AAGT…….


Page 31: 20131001 lab meeting

ACGC.….

AAGT…….

ATGC…….


Page 33: 20131001 lab meeting

ATGC…….ACGC.….

ATGC…….

AAGT…….


Page 35: 20131001 lab meeting

AAGT…….

ACGC.….

ATGC…….

ATGC…….


Page 37: 20131001 lab meeting

SlideSort

S1 & S2 are decomposed into m blocks.

If the edit distance between S1 & S2 is at most d, there exist at least (m-d) common blocks between S1 & S2, at similar positions.
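The block property can be checked on a small example. The sketch below covers only the substitution case, where each error can spoil at most one block, so blocks are compared at the same index; with insertions or deletions the matching blocks may be shifted, which is what "at similar positions" refers to. The helper names and the example strings are assumptions made for this illustration.

```python
def split_blocks(s: str, m: int):
    """Split s into m consecutive blocks of near-equal length."""
    base, extra = divmod(len(s), m)
    blocks, start = [], 0
    for i in range(m):
        end = start + base + (1 if i < extra else 0)
        blocks.append(s[start:end])
        start = end
    return blocks


def common_blocks(s1: str, s2: str, m: int) -> int:
    """Count blocks that are identical at the same block index."""
    return sum(a == b for a, b in zip(split_blocks(s1, m), split_blocks(s2, m)))


s1 = "ATGCATAATGCTCA"
s2 = "ATGCATTATGCTTA"  # two substitutions relative to s1
m, d = 4, 2
shared = common_blocks(s1, s2, m)
print(split_blocks(s1, m))  # ['ATGC', 'ATAA', 'TGC', 'TCA']
print(shared, ">=", m - d)  # 2 >= 2
```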

Page 38: 20131001 lab meeting

SlideSort

• First step:
  • Quickly finds a subset of short reads which share (m-d) common blocks (k-mers).
• Second step:
  • Calculates the edit distance between all pairs included in the subset (equivalence class).
  • Outputs pairs whose edit distance is no more than d, as well as alignments and scores.

[Figure: reads S1 to S6 (e.g. ATGC…….); S1, S2 and S5 share common blocks and form an equivalence class.]

Page 39: 20131001 lab meeting

Toy Experiment

Data: test.fasta

Simulator: Stampy (an open-source tool that can simulate short-read errors).

Number of sequences: 5

Max_seq_length: 51

Min_seq_length: 51
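Summary figures like the ones on this slide can be computed with a few lines of code. The sketch below is a minimal FASTA reader written for illustration. The contents of test.fasta are not shown in the slides, so the two example records are made up, and the function names `read_fasta` and `summarize` are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def summarize(records):
    """Return the number of sequences and the longest and shortest lengths."""
    lengths = [len(seq) for _, seq in records]
    return {"num_seq": len(lengths), "max_len": max(lengths), "min_len": min(lengths)}


# Made-up stand-in for test.fasta (the real file is not part of the slides).
example = """>seq0
ATGCATAATGCT
>seq1
ATGCATTATGCT
""".splitlines()
print(summarize(read_fasta(example)))  # {'num_seq': 2, 'max_len': 12, 'min_len': 12}
```

For a file on disk, `read_fasta` can be given the open file object directly, since it only iterates over lines.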

Page 40: 20131001 lab meeting

Toy Experiment

seq 0 1 2 3 4

◉ 1 1

△ 1 1

✖ 1

Page 41: 20131001 lab meeting

Discussion

• Not sure if test data generated by Stampy is good or not.

• Data set is way too small.


Page 42: 20131001 lab meeting

Future work

• Proper, bigger dataset.

• Select data sets from real experiments from online database instead of simulations.

• Try Bayesian model


Page 43: 20131001 lab meeting

References

• Elaine R. Mardis. A decade’s perspective on DNA sequencing technology.

• Michael L. Metzker. Sequencing technologies — the next generation.

• Xiao Yang, Sriram P. Chockalingam, Srinivas Aluru. A survey of error-correction methods for next-generation sequencing. Briefings in Bioinformatics (2013) 14 (1): 56-66.

• Kana Shimizu, Koji Tsuda. SlideSort: all pairs similarity search for short reads. Bioinformatics (2011) 27 (4): 464-470.

• Next Generation Sequencing (NGS) Market [Platforms (Illumina HiSeq, MiSeq, Life Technologies Ion Proton/PGM, 454 Roche), Bioinformatics (RNA-Seq, ChIP-Seq), (Pyrosequencing, SBS, SMRT), (Diagnostics, Personalized Medicine)] - Global Forecast to 2017.