developing bioinformatics tools for genome analysis zemin ning the wellcome trust sanger institute

Developing Bioinformatics Developing Bioinformatics Tools for Genome Tools for Genome

AnalysisAnalysis

Zemin NingZemin Ning

The Wellcome Trust Sanger InstituteThe Wellcome Trust Sanger Institute

My academic background Informatics Projects Sequence alignment Detecting breakpoints of CNVs Genome assembly - algorithm Genome assembly – Tasmanian Devil 3rd generation sequencing – Oxford Nanopore

Outline of the Talk:

Powder Simulation

Hair Dynamics

Genetics and Human Hair Structure Genetics and Human Hair Structure

AFRICANAFRICAN CAUCASIANCAUCASIAN EAST ASIANEAST ASIAN

Sequence AlignmentSsaha/Ssaha2/SMALT - Alignment tool for Illumina, 454, ABI capillary reads

ssahaSNP – SNP/indel detection, mainly for ABI capillary reads

ssahaEST – EST or cDNA alignment

ssaha_pileup – SNP/indel detection from next-gen data Genome Assembly

Phusion/Phusion2: Clustering based genome assembly pipeline

Fuzzypath: De Bruijn graph based assembler

iCAS: - a pipeline for Illumina clone assembly

Production of WGS assemblies:

Mouse, Zebrafish, Human (Venter genome), C. Briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Tasmanian Devil, Bamboo, Malaria and many bacterial genomes

PindelDetecting indels/SVs from short reads

TraceSeachPublic sequence search facility for all the traces

Bionformatics Projects InvolvedBionformatics Projects Involved

SMALT AlgorithmSMALT Algorithm

ATGGCGTGCAGTATGGCGTGCAGT

TGGCGTGCAGTCTGGCGTGCAGTC

GGCGTGCAGTCCGGCGTGCAGTCC

GCGTGCAGTCCAGCGTGCAGTCCA

CGTGCAGTCCATCGTGCAGTCCAT

ATGGCGTGCAGTCCATGTTCGGATCAATGGCGTGCAGTCCATGTTCGGATCA

Overlap hashingOverlap hashing W = N-k+1W = N-k+1

(k = 12)(k = 12)

Non-overlap Hashing v Overlap HashingNon-overlap Hashing v Overlap Hashing

Non-overlap HashingNon-overlap HashingW = N/kW = N/k

ATGGGCAGATGTATGGGCAGATGT

CCATGTTCGGATCCATGTTCGGAT

CATTACGTAAGCCATTACGTAAGC

ATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGCATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGC

Sequence Representation

Sequence S: (s1s2, …, si, …, sm) i =1,2, …, m

K-tuple: (sisi+1...si+k-1)

Using two binary digits for each base, we may have the following representations:

“A” =00; “C” = 01; “G” = 10; “T” = 11

For any of the m/k no-overlapping k-tuples in the sequence, an integer may be used to represent the k-tuple in a unique way

k

i

ii EE

2

1

2kmax

1 1-2 with 2

where i = 0 or 1, depending on the value of the sequence base

and Emax is the maximum value of the possible E values.

E k-tuple Ni Indices and Offsets

0 AA 1 2, 19

1 AC 3 1, 9 2, 5 2, 11

2 AG 2 1, 15 2, 35

3 AT 2 2, 13 3, 3

4 CA 7 2, 3 2, 9 2, 21 2, 27 2, 33 3, 21 3, 23

5 CC 4 1, 21 2, 31 3, 5 3, 7

6 CG 1 1, 5

7 CT 6 1, 23 2, 39 2, 43 3, 13 3, 15 3, 17

8 GA 4 1, 3 1, 17 2, 15 2, 25

9 GC 0

10 GG 5 1, 25 1, 31 2, 17 2, 29 3, 1

11 GT 6 1, 1 1, 27 1, 29 2, 1 2, 37 3, 19

12 TA 1 3, 25

13 TC 6 1, 7 1, 11 1, 19 2, 23 2, 41 3, 11

14 TG 3 1, 13 2, 7 3, 9

15 TT

S1=(GTGACGTCACTCTGAGGATCCCCTGGGTGTGG) S2=(GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)S3=(GGATCCCCTGTCCTCTCTGTCACATA)

Hash Table: A 2-tuple hashing table of S1, S2 and S3

Burrows-Wheeler vs Hashingseed (28 bp)

hi-half lo-half

seed (32 bp, optional)

5'

seed (28 bp)seed (28 bp)

3'

5' 3'

BOWTIE/TOPHAT

BWA

depth-1st by default, breadth-1st slowerno indels

breadth first,upper bound on edit distance, e.g. max 5 mismatches in 100bp read. Can deal with indels.

5' 3'SMALT/SSAHA2

Exact matching k-segment (1 kmer) required.Partial alignments (indels, splice junctions)

• Strengths and weaknesses (trends)– Burrows-Wheeler, e.g. bwa, bowtie

– Fast, esp. (multiple) exact matches

– High sensitivity at repetitive regions

– less robust at high genomic variation

– Hashing (overlapping k-mer words, e.g SMALT/SSAHA2, Stampy, SNAP)

– Slower (more memory hungry)

– Less sensitivity at repetitive regions

– tolerate high genomic variation

– partial alignments (junction reads) easier

– Flexible (multiple sequencing platforms)

Burrows-Wheeler vs Hashing

Performance Assessmenton simulated reads

human genome105 read pairs 2 x 100 bp (insert 500)20% of variations indels (max. 10)

Performance of mappers(genome re-sequencing)

Simulated for human genome:

4x106 x 100 bp single reads

1% variation of which 20% indels

14 bp maximum indel length

indel length

Sensitivity Assessment ~ 2% genomic variation

Reads: M. spretus

whole genome shotgun2 x 108 bp, insert 250 bp

Reference: M. musculus NCBI build 37

• independently mapped reads

• Count discordant pairs as erroneous mappings

19 April 2023 15

Pindel – A Pattern Growth Approach to DetectStructural Variations

ATGCAGC

ATCAAGTATGCTTAGC

Minimum unique substring: ATG

Maximum unique substring: ATGC

19 April 2023 16

String matching by pattern growth

ATGCAGC

ATCAAGTATGCTTAGC

19 April 2023 17


ATGCAGC

ATCAAGTATGCTTAGC

19 April 2023 18


ATGCAGC

ATCAAGTATGCTTAGC

19 April 2023 19


ATGCAGC

ATCAAGTATGCTTAGC



19 April 2023 20

Deletion with Base Level Breakpoint Precision

ATGCAGC

ATCAAGTATGCTTAGC



Indels from LowCov data of 1000g on chr1

Deletion: 88,425Deletion: 88,425 Insertion: Insertion: 69,71069,710

Assembly Method

1 A C C T G A T C

2 C T G A T C A A

3 T G A T C A A T

4 A G C G A T C A

5 C G A T C A A T

6 G A T C A A T G

7 T C A A T G T G

8 C A A T G T G A

1. Overlap graphSequencing reads:

2. de Bruijn graph

3. String graph

Phusion2 Assembly PipelinePhusion2 Assembly Pipeline

SolexaReads

Assembly

Reads Group

Data Process Long Insert Reads

Supercontig

Contigs

PRono

SOAPdenovo

Fermi

ABySS

2x75 or 2x100

BaseCorrection

Phrap

Gap-HashGap-Hash4x34x3

ATGGGCAGATGTATGGGCAGATGT

TGGCCAGTTGTTTGGCCAGTTGTT

GGCGAGTCGTTCGGCGAGTCGTTC

GCGTGTCCTTCGGCGTGTCCTTCG

ATGGATGGCGTCGTGCAGGCAGTCCTCCATGTATGTTCGTCGGATCGATCAA

ATGGCGTGCAGTATGGCGTGCAGT

TGGCGTGCAGTCTGGCGTGCAGTC

GGCGTGCAGTCCGGCGTGCAGTCC

GCGTGCAGTCCAGCGTGCAGTCCA

CGTGCAGTCCATCGTGCAGTCCAT

ATGGCGTGCAGTCCATGTTCGGATCAATGGCGTGCAGTCCATGTTCGGATCA

ContiguousContiguous Base HashBase Hash

K = 12K = 12

Kmer Word HashingKmer Word Hashing

Word use distribution for the mouse sequence data at ~7.5 foldWord use distribution for the mouse sequence data at ~7.5 fold

Useful Region

Poisson Curve

Real Data Curve

Clustering StrategyClustering Strategy

Making kmers and establishing kmer Making kmers and establishing kmer profile (K=31-63) from the readsprofile (K=31-63) from the reads

Read ClusteringRead Clustering

Assembly Contigs

Assembly Scaffolds

Reduce cluster Reduce cluster sizesize

R11 R12 R13 ... R1j … R1n

... … Rij … …

R21 R22 R23 ... R2j … R2n

R31 R32 R33 ... R3j … R3n

Rn1 Rn2 Rn3 ... Rnj … Rnn

R11 R12 R13 ... R1j … R1n

... … Rij … …

R21 R22 R23 ... R2j … R2n

R31 R32 R33 ... R3j … R3n

Rn1 Rn2 Rn3 ... Rnj … Rnn

Relational MatrixRelational Matrix

1 2 3 4 5 6 … j … 500

3

1

4

2

5

R(i,j)

Relation Matrix: R(i,j) – ImplementationRelation Matrix: R(i,j) – Implementation

Read index

Number of shared kmer words (< 63)

N

.

.

.

1 2 3 4 5 6 … j … N

3

1

4

2

6

5

i

N

41 0 0 0 0

R(i,j)

Relation Matrix: R(i,j) – number of kmer Relation Matrix: R(i,j) – number of kmer words shared between read i and read jwords shared between read i and read j

41 37 0 0 0 0 37 0 22 0

0 0 22 0 0

0 0 0 0 27

0 0 0 27 0

Group 1: (1,2,3,5)Group 1: (1,2,3,5)

Group 2: (4,6)Group 2: (4,6)

Genomes SequencedSequencing

PlatformGenome Size

(Gb)Contig N50

(Kb)Scaffold N50

(Kb)C. Briggsae (Published) 1st 0.1 41 1450Mouse (Published) 1st 2.6 2.1 6500Schisto Mansoni (Published) 1st 0.35 16.5 800Schisto Japonicum (Published) 1st 0.35 6.5 400Zebrafish (Published) 1st, 2nd 1.45 24 1400Gorilla (Published) 1st, 2nd 2.8 52 11.5Tasmanian Devil (Published) 2nd 3.0 20 1800Giant Panda (On-going) 2nd 2.38 126 9800Red Panda (On-going) 2nd 2.38 98 24410Grass carp (Femal) (Accepted) 2nd 0.9 40 5600Grass carp (Male) (Accepted) 2nd 1.02 26 2300Gerbillinae (On-going) 2nd 2.5 73 4485Dansheng (Submitted) 2nd 0.7 6 20Populus tomentosa (On-going) 2nd 0.5 16.2 920Populus euphratica (On-going) 2nd 0.5 15 4485Bamboo (Published) 2nd 2 12 328Mischanthus (On-gpong) 2nd 2.4 18.1 370Wheat D (On-going) 2nd 4.6 16 167Wild rice (On-going) 2nd 0.4 30 50Pan-pig 6 samples (On-going) 2nd 2.6 54.1 - 82.6 4614 - 9238

Pan-chicken 10 sample (On-going) 2nd 1.05 80 - 137.8 6094 - 8383Yeast (On-going) 2nd, 3rd 0.012 330 858

Outline of Genome AssembliesOutline of Genome Assemblies

Tasmanian devilTasmanian tiger

Tasmanian

Australian

http://z.about.com/d/geography/1/0/f/K/australia.jpg

Tasmanian devil

Opo

ssum

Wal

laby

Tasm

ania

n

devi

l

http://z.about.com/d/geography/1/0/f/K/australia.jpg

Tasmanian devil facial tumour disease (DFTD)

Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils

Transmitted by biting Commonly metastasises First observed in 1996 Primarily affects adults

>1yr Death in 4 – 6 months

Reedy Marsh 2007

Mangalore 2007

Mt William 2007 or 2008

Coles Bay

Upper Natone 2007

Narawntapu 2007

Strain 1, tetraploid

Strain 2

Strain 3

DFTD samples for sequencing

DFTD originated here c.1996

Area still DFTD free

Unknown strain

“Evolved”

Forestier 2007

Reedy Marsh 2007

Mangalore 2007

Mt William

Coles Bay

Upper Natone 2007

Narawntapu 2007

Devil Genomes Sequenced

Forestier 2007

Tumour 2 (53T)

Tumour 1 (87T)

Salem - A female Tasmanian Devil lived Taronga Zoo in Sydney.

Sequencing T. Devil on Illumina: Strategy

Tumour or normal genomic DNA

Fragments of defined size0.5, 2, 5, 7, 8, 10 kb

Sequencing

2x100bp reads short insert

2x50bp mate pairs

fragment size distribution

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

0 1000 2000 3000 4000 5000 6000

size

fre

qu

en

cy

tumour 2kb

tumour 3kb

tumour 4kb

normal 2kb

normal 3kb

normal 4kb

Sequencing performed at Illumina

Salem (91H) Joey (31H) Cancer 1 (87T) Cancer 2 (53T)

Read Coverage 85x 40x 56x 84x

1 2 3 4 5

6 7 8 X

1

4 2a 3a6

2b 3b 5

Devil – Opossum Homology Map Based on Hybridisation Results of Devil Paints onto Opossum Chromosomes

X

Opossum Devil

Opossum chromosome images were taken from Duke et a. 2007, Chromosome Res 15:361-370

Chr Size (Mb) Scaffolds_assigned Bases_assigned (Mb) Chr1 571 6729 684 Chr2 610 8381 740 Chr3 556 7197 641Chr4 450 4817 487Chr5 341 3188 300 Chr6 277 2844 263ChrX 122 2378 86.6Unassigned 440 1.23

Scaffolds Assigned to Chromosomes Scaffolds Assigned to Chromosomes using Flow-sorting Datausing Flow-sorting Data

Salem (91H) Joey (31H) Cancer 1 (87T) Cancer 2 (53T)

Coverage 35.58 28.80 40.49 33.14

Total SNPs 615,084 646,186 758,023 738,793

Het SNPs 524,040 371,412 465,630 462,722

Hom SNPs 91,044 274,774 292,393 276,071

Total indels 235,632 262,461 320,820 312,287

Het indels 183,978 146,299 186,094 183,747

Hom indels 51,654 81,120 / 116,162

134,726 128,540

Variant calling : catalogue of variants in all 4 genomes

*Data source: Illumina. Variants removed within 500bp of a contig end, Q(indel) < 30 and Q(GT) < 5.

Oxford NanoporeOxford NanoporeThe Data, the Alignment and the AssemblyThe Data, the Alignment and the Assembly

1D and 2D Base Calling1D and 2D Base Calling

The 1D vs 2D barcoding refers to whether the complementary strand is used to improve basecalled data. Basically – it gives two shots at examining the same loci. The advantage being that the complementary strand will have a different kmer profile.

Read Length Distribution – 1D and 2DRead Length Distribution – 1D and 2D

Yeast S288C ONT reads sequenced – 1D & 2DYeast S288C ONT reads sequenced – 1D & 2D

1D 2DNumber of bases: 275Mb 373MbNumber of reads: 47735 66106N50: 8335 6645Average length: 5765 5651Maximum length: 143706 32826

Mapping score: Mapping score: 1919Match identity: Match identity: 75.2%75.2%

Mapping score: Mapping score: 44Match identity: Match identity: 53.4%53.4%

ONT Assisted Scaffolding – Fake Mate Pairs ONT Assisted Scaffolding – Fake Mate Pairs

Insert SizeInsert Size

ONT Read LengthONT Read Length

Segment Segment lengthlength

Step Step lengthlength

ONT Assisted Scaffolding

Mate pair data is used to scaffold contigs. Contigs, and pairs of contigs connected by pairs, define a bi-directional graph:

Using expected insert size, a estimate of the gap size can be given for each contig.

http://sourceforge.net/projects/phusion2/files/smis/

Spinner – walks through a loopThese techniques alone produces useful results.Further stages will be used to resolve repeats pairs that “jump over” repeats, and graph flow concepts.

Scaffold one piece Scaffold one piece

Scaffold 2-3 pieces Scaffold 2-3 pieces

Scaffold N50 858Kb ; Contig N50 330Kb Scaffold N50 858Kb ; Contig N50 330Kb

SV between S288C and W303SV between S288C and W303

Acknowledgements:

Adam Spargo Hannes Ponstingl German Tischler Yong Gu Joe Henson Ben Blackburn

Jim Mulikim Tony Cox, Sanger Tony Cox, Illumina Elizabeth Murchuson Richard Durbin Mike Stratton

Bin Han Feng Qi Henyun Lu Zhao Qiang Ying Lu

Kai Ye Ole Schulz-Trieglaff David Bentley

Professor Keling Liu at ISU in 1945Professor Keling Liu at ISU in 1945

developing bioinformatics tools for genome analysis zemin ning the wellcome trust sanger institute

Documents