Bioinformatics at Molecular Epidemiology - new tools for identifying indels in sequencing data Kai Ye [email protected]

Upload: annis

Post on 05-Feb-2016

TRANSCRIPT

Page 1: Bioinformatics at Molecular Epidemiology - new tools for identifying indels in sequencing data

Bioinformatics at Molecular Epidemiology - new tools for identifying indels in sequencing data

Kai Ye [email protected]

Page 2

Data collection for osteoarthritis, cardiovascular disease and longevity

• Serum parameters
• Cellular characteristics (biobank)
• Skin ageing
• Glycosylation
• Metabonomic
• Transcriptomic
• Genetic (GWAS/sequence)
• Epigenetic
• Data Integration

[Figure: chromatogram, trace header "612 #68 6 dec B4", signal "FLU" in mV (y-axis, -50 to 350) against time in min (x-axis, 0.0 to 80.0); 18 numbered peaks with retention times from 36.281 to 76.407 min. Legend: N-Acetylglucosamine, Galactose, Mannose, Sialic acid, Fucose]

Page 3

Genetic & Epigenetic analyses

Biochem analyses

Expression analysis

Metabonomic analysis

Glycosylation

Cell responses

Statistical analysis: Joost Kok, Erik vd Akker, Kai Ye

Page 4

About me

• 1995 – 2003 B.S. and M.S. in biology and pharmaceutical science

• 2004 – 2008 PhD cum laude at Leiden University. Thesis title: Novel algorithms for protein sequence analysis

• 2008 – 2009 Postdoc at the European Bioinformatics Institute, collaborating with scientists at the Sanger Institute

• Currently assistant professor at MolEpi

Page 5

A Pindel approach for identifying indels in Next-Gen sequencing data

• Paired-end reads in Next-gen sequencing
• Indel detection algorithms
• Pindel
• Cancer genome project
• 1000 genomes project

Page 6

Paired-end reads in Next Generation sequencing

~ insert size

Page 7

SNP

Mapping paired-end reads

CNVs: copy number variations; INDELs: insertions and deletions; SVs: Structural variations

Page 8

Gapped alignment for small indels

ATCCGTATCACGGTCA-CAGATCAGTCCAGT

ATCCGTATCACGGTCAGCAGATCAGTCCAGT

indel
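The gapped alignment on this slide can be reproduced with a textbook global aligner. The sketch below is a minimal Needleman-Wunsch implementation with a linear gap penalty; the scoring values (match 1, mismatch -1, gap -2) and the function name `global_align` are illustrative assumptions, not the settings of any particular read mapper.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# The two sequences from the slide, with the gap character removed from the read.
READ = "ATCCGTATCACGGTCACAGATCAGTCCAGT"
REF = "ATCCGTATCACGGTCAGCAGATCAGTCCAGT"
read_aln, ref_aln, aln_score = global_align(READ, REF)
print(read_aln)
print(ref_aln)
```

With these sequences the aligner places a single gap in the read opposite the extra G in the reference, which is the 1-base indel shown on the slide.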

Page 9

Read-depth for CNVs
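The slide gives only the title, so the following is a generic sketch of the read-depth idea rather than the method shown in the talk: count reads per fixed-size window and flag windows whose depth deviates strongly from the median. The function names, the window size and the 0.5/1.5 ratio thresholds are all hypothetical choices for illustration.

```python
from statistics import median


def window_depth(read_starts, genome_length, window):
    """Count mapped read start positions in consecutive fixed-size windows."""
    counts = [0] * ((genome_length + window - 1) // window)
    for pos in read_starts:
        counts[pos // window] += 1
    return counts


def call_cnv_windows(counts, low=0.5, high=1.5):
    """Flag windows whose depth ratio to the median suggests a loss or gain.

    Assumes the median depth is greater than zero.
    """
    base = median(counts)
    calls = []
    for idx, count in enumerate(counts):
        ratio = count / base
        if ratio <= low:
            calls.append((idx, "loss"))
        elif ratio >= high:
            calls.append((idx, "gain"))
    return calls


# Toy data: five 100-base windows holding 10, 10, 4, 10 and 20 reads.
toy_starts = []
for w, n_reads in enumerate([10, 10, 4, 10, 20]):
    toy_starts += [w * 100 + k for k in range(n_reads)]

toy_counts = window_depth(toy_starts, genome_length=500, window=100)
toy_calls = call_cnv_windows(toy_counts)
print(toy_counts, toy_calls)
```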

Page 10

Read-pair approach for SVs

[Figure: three panels (No Indel, Deletion, Insertion), each comparing a read pair in the Sample with its mapping on the Reference]
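The three cases on this slide can be told apart by comparing the distance at which a read pair maps on the reference with the expected insert size. A deletion in the sample makes the pair map further apart than expected, and an insertion makes it map closer together. The sketch below assumes a known mean and standard deviation of the insert size and a cutoff of three standard deviations; the function name and the cutoff are illustrative assumptions.

```python
def classify_pair(mapped_span, mean_insert, sd_insert, n_sd=3.0):
    """Classify one read pair by its mapped span on the reference."""
    if mapped_span > mean_insert + n_sd * sd_insert:
        return "deletion"   # sample lacks sequence that the reference has
    if mapped_span < mean_insert - n_sd * sd_insert:
        return "insertion"  # sample carries sequence absent from the reference
    return "no indel"


# Toy library with a 200-base mean insert size and a 10-base standard deviation.
for span in (200, 500, 120):
    print(span, classify_pair(span, mean_insert=200, sd_insert=10))
```

Because the signal is a deviation from the insert-size distribution, indels smaller than the spread of that distribution are not detectable this way.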

Page 11

Mapping paired-end reads

• read-pairs

• read-depth

SNP or small indel

Page 13

Pindel: Deletions

test

ref

1 base - 1 million bases

Page 14

Pindel: Deletions

ref

Anchor

Page 15

ref

Pindel: Deletions

Anchor

2 x average distance

Page 16

ref

Pindel: Deletions

Anchor

2 x average distance

Expected maximum deletion size + read length (36)

Page 17

reference

Pindel: Deletions

sample
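The deletion slides describe a split-read search: one read of a pair is mapped (the anchor), one part of its unmapped mate is looked for within 2 x the average distance from the anchor, and the remaining part within the expected maximum deletion size plus the read length. The sketch below is a simplified illustration of that idea using plain substring search; it is not Pindel's actual implementation, and `find_deletion`, `min_part` and the toy reference are assumptions made for the example. The toy reference is a de Bruijn sequence, chosen only because every 4-mer in it is unique, so the matches are unambiguous.

```python
def de_bruijn(alphabet, n):
    """Standard recursive construction of a de Bruijn sequence of order n."""
    k = len(alphabet)
    a = [0] * (k * n)
    seq = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)


def find_deletion(ref, read, anchor_end, avg_insert, max_del, min_part=4):
    """Split an unmapped read into two parts that flank a deletion.

    Returns (start, end) such that ref[start:end] is missing from the read,
    or None if no split fits. Both parts must be at least min_part long.
    """
    near_end = min(len(ref), anchor_end + 2 * avg_insert)
    for split in range(len(read) - min_part, min_part - 1, -1):
        left, right = read[:split], read[split:]
        # First part: within 2 x the average distance from the anchor.
        p = ref.find(left, anchor_end, near_end)
        if p == -1:
            continue
        left_end = p + split
        # Second part: within maximum deletion size + read length.
        far_end = min(len(ref), left_end + max_del + len(read))
        q = ref.find(right, left_end + 1, far_end)
        if q != -1:
            return left_end, q
    return None


PINDEL_REF = de_bruijn("ACGT", 4)                    # 256 bases, unique 4-mers
PINDEL_SAMPLE = PINDEL_REF[:100] + PINDEL_REF[130:]  # sample with a 30-base deletion
PINDEL_READ = PINDEL_SAMPLE[88:124]                  # 36-base read across the breakpoint
pindel_call = find_deletion(PINDEL_REF, PINDEL_READ,
                            anchor_end=40, avg_insert=50, max_del=60)
print(pindel_call)
```

When the bases flanking a deletion are identical, the reported breakpoints can shift by a few positions while still describing the same deleted sequence, so the reported coordinates are best checked by reconstructing the sample from them.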

Page 18

African male: NA18507

• Bentley et al., Nature 2008
• 135Gb of sequence
• ~4 billion paired 35-base reads
• After preprocessing: 56,161,333 pairs of one-end mapped reads
• Pindel
– 142,908 1-16bp insertions
– 162,068 1bp-10kb deletions

Page 19

Deletion size distribution

Page 20

Applications

• Cancer genome project
• 1000 genomes project

Page 21

Cancer genome

• COLO-829 cells
• Normal ~30x paired-end 100bp reads
• Tumor ~40x paired-end 100bp reads
• Search for somatic (tumor specific) indels
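The search for somatic indels amounts to keeping calls that appear in the tumor but not in the matched normal. The sketch below shows that filter as a set difference on exact coordinates; the `Indel` record, its fields and the toy calls are hypothetical, and a real comparison would also need to tolerate small breakpoint shifts and check read support in the normal sample.

```python
from typing import NamedTuple


class Indel(NamedTuple):
    chrom: str
    start: int
    kind: str    # "DEL" or "INS"
    length: int


def somatic_indels(tumor_calls, normal_calls):
    """Return tumor-specific calls: present in tumor, absent from normal."""
    normal_set = set(normal_calls)
    return sorted(call for call in set(tumor_calls) if call not in normal_set)


# Toy calls, invented for illustration only.
toy_normal = [Indel("chr1", 1000, "DEL", 5)]
toy_tumor = [
    Indel("chr1", 1000, "DEL", 5),    # also in normal, so germline
    Indel("chr7", 42, "DEL", 120),    # tumor only
    Indel("chr2", 500, "INS", 2),     # tumor only
]
toy_somatic = somatic_indels(toy_tumor, toy_normal)
print(toy_somatic)
```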

Page 22
Page 23

1000 genomes project

• Pilot 1: 180 people from 3 major geographic groups (YRI, CEU, and CHB + JPT) at low coverage (~4x)

• Pilot 2: the genomes of two families (CEU and YRI, both parents and an adult child) with deep coverage (20x per genome)

• Pilot 3: sequencing the coding regions (exons) of 1,000 genes in 1,000 people with deep coverage (20x).