developing bioinformatics tools for genome analysis zemin ning the wellcome trust sanger institute
TRANSCRIPT
Developing Bioinformatics Developing Bioinformatics Tools for Genome Tools for Genome
AnalysisAnalysis
Zemin NingZemin Ning
The Wellcome Trust Sanger InstituteThe Wellcome Trust Sanger Institute
My academic background Informatics Projects Sequence alignment Detecting breakpoints of CNVs Genome assembly - algorithm Genome assembly – Tasmanian Devil 3rd generation sequencing – Oxford Nanopore
Outline of the Talk:
Powder Simulation
Hair Dynamics
Genetics and Human Hair Structure Genetics and Human Hair Structure
AFRICANAFRICAN CAUCASIANCAUCASIAN EAST ASIANEAST ASIAN
Sequence AlignmentSsaha/Ssaha2/SMALT - Alignment tool for Illumina, 454, ABI capillary reads
ssahaSNP – SNP/indel detection, mainly for ABI capillary reads
ssahaEST – EST or cDNA alignment
ssaha_pileup – SNP/indel detection from next-gen data Genome Assembly
Phusion/Phusion2: Clustering based genome assembly pipeline
Fuzzypath: De Bruijn graph based assembler
iCAS: - a pipeline for Illumina clone assembly
Production of WGS assemblies:
Mouse, Zebrafish, Human (Venter genome), C. Briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Tasmanian Devil, Bamboo, Malaria and many bacterial genomes
PindelDetecting indels/SVs from short reads
TraceSeachPublic sequence search facility for all the traces
Bionformatics Projects InvolvedBionformatics Projects Involved
SMALT AlgorithmSMALT Algorithm
ATGGCGTGCAGTATGGCGTGCAGT
TGGCGTGCAGTCTGGCGTGCAGTC
GGCGTGCAGTCCGGCGTGCAGTCC
GCGTGCAGTCCAGCGTGCAGTCCA
CGTGCAGTCCATCGTGCAGTCCAT
ATGGCGTGCAGTCCATGTTCGGATCAATGGCGTGCAGTCCATGTTCGGATCA
Overlap hashingOverlap hashing W = N-k+1W = N-k+1
(k = 12)(k = 12)
Non-overlap Hashing v Overlap HashingNon-overlap Hashing v Overlap Hashing
Non-overlap HashingNon-overlap HashingW = N/kW = N/k
ATGGGCAGATGTATGGGCAGATGT
CCATGTTCGGATCCATGTTCGGAT
CATTACGTAAGCCATTACGTAAGC
ATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGCATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGC
Sequence Representation
Sequence S: (s1s2, …, si, …, sm) i =1,2, …, m
K-tuple: (sisi+1...si+k-1)
Using two binary digits for each base, we may have the following representations:
“A” =00; “C” = 01; “G” = 10; “T” = 11
For any of the m/k no-overlapping k-tuples in the sequence, an integer may be used to represent the k-tuple in a unique way
k
i
ii EE
2
1
2kmax
1 1-2 with 2
where i = 0 or 1, depending on the value of the sequence base
and Emax is the maximum value of the possible E values.
E k-tuple Ni Indices and Offsets
0 AA 1 2, 19
1 AC 3 1, 9 2, 5 2, 11
2 AG 2 1, 15 2, 35
3 AT 2 2, 13 3, 3
4 CA 7 2, 3 2, 9 2, 21 2, 27 2, 33 3, 21 3, 23
5 CC 4 1, 21 2, 31 3, 5 3, 7
6 CG 1 1, 5
7 CT 6 1, 23 2, 39 2, 43 3, 13 3, 15 3, 17
8 GA 4 1, 3 1, 17 2, 15 2, 25
9 GC 0
10 GG 5 1, 25 1, 31 2, 17 2, 29 3, 1
11 GT 6 1, 1 1, 27 1, 29 2, 1 2, 37 3, 19
12 TA 1 3, 25
13 TC 6 1, 7 1, 11 1, 19 2, 23 2, 41 3, 11
14 TG 3 1, 13 2, 7 3, 9
15 TT
S1=(GTGACGTCACTCTGAGGATCCCCTGGGTGTGG) S2=(GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)S3=(GGATCCCCTGTCCTCTCTGTCACATA)
Hash Table: A 2-tuple hashing table of S1, S2 and S3
Burrows-Wheeler vs Hashingseed (28 bp)
hi-half lo-half
seed (32 bp, optional)
5'
seed (28 bp)seed (28 bp)
3'
5' 3'
BOWTIE/TOPHAT
BWA
depth-1st by default, breadth-1st slowerno indels
breadth first,upper bound on edit distance, e.g. max 5 mismatches in 100bp read. Can deal with indels.
5' 3'SMALT/SSAHA2
Exact matching k-segment (1 kmer) required.Partial alignments (indels, splice junctions)
• Strengths and weaknesses (trends)– Burrows-Wheeler, e.g. bwa, bowtie
– Fast, esp. (multiple) exact matches
– High sensitivity at repetitive regions
– less robust at high genomic variation
– Hashing (overlapping k-mer words, e.g SMALT/SSAHA2, Stampy, SNAP)
– Slower (more memory hungry)
– Less sensitivity at repetitive regions
– tolerate high genomic variation
– partial alignments (junction reads) easier
– Flexible (multiple sequencing platforms)
Burrows-Wheeler vs Hashing
Performance Assessmenton simulated reads
human genome105 read pairs 2 x 100 bp (insert 500)20% of variations indels (max. 10)
Performance of mappers(genome re-sequencing)
Simulated for human genome:
4x106 x 100 bp single reads
1% variation of which 20% indels
14 bp maximum indel length
indel length
Sensitivity Assessment ~ 2% genomic variation
Reads: M. spretus
whole genome shotgun2 x 108 bp, insert 250 bp
Reference: M. musculus NCBI build 37
• independently mapped reads
• Count discordant pairs as erroneous mappings
19 April 2023 15
Pindel – A Pattern Growth Approach to DetectStructural Variations
ATGCAGC
ATCAAGTATGCTTAGC
Minimum unique substring: ATG
Maximum unique substring: ATGC
19 April 2023 16
String matching by pattern growth
ATGCAGC
ATCAAGTATGCTTAGC
19 April 2023 17
String matching by pattern growth
ATGCAGC
ATCAAGTATGCTTAGC
19 April 2023 18
String matching by pattern growth
ATGCAGC
ATCAAGTATGCTTAGC
19 April 2023 19
String matching by pattern growth
ATGCAGC
ATCAAGTATGCTTAGC
Minimum unique substring: ATG
Maximum unique substring: ATGC
19 April 2023 20
Deletion with Base Level Breakpoint Precision
ATGCAGC
ATCAAGTATGCTTAGC
Minimum unique substring: ATG
Maximum unique substring: ATGC
Indels from LowCov data of 1000g on chr1
Deletion: 88,425Deletion: 88,425 Insertion: Insertion: 69,71069,710
Assembly Method
1 A C C T G A T C
2 C T G A T C A A
3 T G A T C A A T
4 A G C G A T C A
5 C G A T C A A T
6 G A T C A A T G
7 T C A A T G T G
8 C A A T G T G A
1. Overlap graphSequencing reads:
2. de Bruijn graph
3. String graph
Phusion2 Assembly PipelinePhusion2 Assembly Pipeline
SolexaReads
Assembly
Reads Group
Data Process Long Insert Reads
Supercontig
Contigs
PRono
SOAPdenovo
Fermi
ABySS
2x75 or 2x100
BaseCorrection
Phrap
Gap-HashGap-Hash4x34x3
ATGGGCAGATGTATGGGCAGATGT
TGGCCAGTTGTTTGGCCAGTTGTT
GGCGAGTCGTTCGGCGAGTCGTTC
GCGTGTCCTTCGGCGTGTCCTTCG
ATGGATGGCGTCGTGCAGGCAGTCCTCCATGTATGTTCGTCGGATCGATCAA
ATGGCGTGCAGTATGGCGTGCAGT
TGGCGTGCAGTCTGGCGTGCAGTC
GGCGTGCAGTCCGGCGTGCAGTCC
GCGTGCAGTCCAGCGTGCAGTCCA
CGTGCAGTCCATCGTGCAGTCCAT
ATGGCGTGCAGTCCATGTTCGGATCAATGGCGTGCAGTCCATGTTCGGATCA
ContiguousContiguous Base HashBase Hash
K = 12K = 12
Kmer Word HashingKmer Word Hashing
Word use distribution for the mouse sequence data at ~7.5 foldWord use distribution for the mouse sequence data at ~7.5 fold
Useful Region
Poisson Curve
Real Data Curve
Clustering StrategyClustering Strategy
Making kmers and establishing kmer Making kmers and establishing kmer profile (K=31-63) from the readsprofile (K=31-63) from the reads
Read ClusteringRead Clustering
Assembly Contigs
Assembly Scaffolds
Reduce cluster Reduce cluster sizesize
R11 R12 R13 ... R1j … R1n
... … Rij … …
R21 R22 R23 ... R2j … R2n
R31 R32 R33 ... R3j … R3n
Rn1 Rn2 Rn3 ... Rnj … Rnn
R11 R12 R13 ... R1j … R1n
... … Rij … …
R21 R22 R23 ... R2j … R2n
R31 R32 R33 ... R3j … R3n
Rn1 Rn2 Rn3 ... Rnj … Rnn
Relational MatrixRelational Matrix
1 2 3 4 5 6 … j … 500
3
1
4
2
5
R(i,j)
Relation Matrix: R(i,j) – ImplementationRelation Matrix: R(i,j) – Implementation
Read index
Number of shared kmer words (< 63)
N
.
.
.
1 2 3 4 5 6 … j … N
3
1
4
2
6
5
i
N
41 0 0 0 0
R(i,j)
Relation Matrix: R(i,j) – number of kmer Relation Matrix: R(i,j) – number of kmer words shared between read i and read jwords shared between read i and read j
41 37 0 0 0 0 37 0 22 0
0 0 22 0 0
0 0 0 0 27
0 0 0 27 0
Group 1: (1,2,3,5)Group 1: (1,2,3,5)
Group 2: (4,6)Group 2: (4,6)
Genomes SequencedSequencing
PlatformGenome Size
(Gb)Contig N50
(Kb)Scaffold N50
(Kb)C. Briggsae (Published) 1st 0.1 41 1450Mouse (Published) 1st 2.6 2.1 6500Schisto Mansoni (Published) 1st 0.35 16.5 800Schisto Japonicum (Published) 1st 0.35 6.5 400Zebrafish (Published) 1st, 2nd 1.45 24 1400Gorilla (Published) 1st, 2nd 2.8 52 11.5Tasmanian Devil (Published) 2nd 3.0 20 1800Giant Panda (On-going) 2nd 2.38 126 9800Red Panda (On-going) 2nd 2.38 98 24410Grass carp (Femal) (Accepted) 2nd 0.9 40 5600Grass carp (Male) (Accepted) 2nd 1.02 26 2300Gerbillinae (On-going) 2nd 2.5 73 4485Dansheng (Submitted) 2nd 0.7 6 20Populus tomentosa (On-going) 2nd 0.5 16.2 920Populus euphratica (On-going) 2nd 0.5 15 4485Bamboo (Published) 2nd 2 12 328Mischanthus (On-gpong) 2nd 2.4 18.1 370Wheat D (On-going) 2nd 4.6 16 167Wild rice (On-going) 2nd 0.4 30 50Pan-pig 6 samples (On-going) 2nd 2.6 54.1 - 82.6 4614 - 9238
Pan-chicken 10 sample (On-going) 2nd 1.05 80 - 137.8 6094 - 8383Yeast (On-going) 2nd, 3rd 0.012 330 858
Outline of Genome AssembliesOutline of Genome Assemblies
Tasmanian devilTasmanian tiger
Tasmanian
Australian
Tasmanian devil
Opo
ssum
Wal
laby
Tasm
ania
n
devi
l
Tasmanian devil facial tumour disease (DFTD)
Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils
Transmitted by biting Commonly metastasises First observed in 1996 Primarily affects adults
>1yr Death in 4 – 6 months
Reedy Marsh 2007
Mangalore 2007
Mt William 2007 or 2008
Coles Bay
Upper Natone 2007
Narawntapu 2007
Strain 1, tetraploid
Strain 2
Strain 3
DFTD samples for sequencing
DFTD originated here c.1996
Area still DFTD free
Unknown strain
“Evolved”
Forestier 2007
Reedy Marsh 2007
Mangalore 2007
Mt William
Coles Bay
Upper Natone 2007
Narawntapu 2007
Devil Genomes Sequenced
Forestier 2007
Tumour 2 (53T)
Tumour 1 (87T)
Salem - A female Tasmanian Devil lived Taronga Zoo in Sydney.
Sequencing T. Devil on Illumina: Strategy
Tumour or normal genomic DNA
Fragments of defined size0.5, 2, 5, 7, 8, 10 kb
Sequencing
2x100bp reads short insert
2x50bp mate pairs
fragment size distribution
0
20000
40000
60000
80000
100000
120000
140000
160000
180000
0 1000 2000 3000 4000 5000 6000
size
fre
qu
en
cy
tumour 2kb
tumour 3kb
tumour 4kb
normal 2kb
normal 3kb
normal 4kb
Sequencing performed at Illumina
Salem (91H) Joey (31H) Cancer 1 (87T) Cancer 2 (53T)
Read Coverage 85x 40x 56x 84x
1 2 3 4 5
6 7 8 X
1
4 2a 3a6
2b 3b 5
Devil – Opossum Homology Map Based on Hybridisation Results of Devil Paints onto Opossum Chromosomes
X
Opossum Devil
Opossum chromosome images were taken from Duke et a. 2007, Chromosome Res 15:361-370
Chr Size (Mb) Scaffolds_assigned Bases_assigned (Mb) Chr1 571 6729 684 Chr2 610 8381 740 Chr3 556 7197 641Chr4 450 4817 487Chr5 341 3188 300 Chr6 277 2844 263ChrX 122 2378 86.6Unassigned 440 1.23
Scaffolds Assigned to Chromosomes Scaffolds Assigned to Chromosomes using Flow-sorting Datausing Flow-sorting Data
Salem (91H) Joey (31H) Cancer 1 (87T) Cancer 2 (53T)
Coverage 35.58 28.80 40.49 33.14
Total SNPs 615,084 646,186 758,023 738,793
Het SNPs 524,040 371,412 465,630 462,722
Hom SNPs 91,044 274,774 292,393 276,071
Total indels 235,632 262,461 320,820 312,287
Het indels 183,978 146,299 186,094 183,747
Hom indels 51,654 81,120 / 116,162
134,726 128,540
Variant calling : catalogue of variants in all 4 genomes
*Data source: Illumina. Variants removed within 500bp of a contig end, Q(indel) < 30 and Q(GT) < 5.
Oxford NanoporeOxford NanoporeThe Data, the Alignment and the AssemblyThe Data, the Alignment and the Assembly
1D and 2D Base Calling1D and 2D Base Calling
The 1D vs 2D barcoding refers to whether the complementary strand is used to improve basecalled data. Basically – it gives two shots at examining the same loci. The advantage being that the complementary strand will have a different kmer profile.
Read Length Distribution – 1D and 2DRead Length Distribution – 1D and 2D
Yeast S288C ONT reads sequenced – 1D & 2DYeast S288C ONT reads sequenced – 1D & 2D
1D 2DNumber of bases: 275Mb 373MbNumber of reads: 47735 66106N50: 8335 6645Average length: 5765 5651Maximum length: 143706 32826
Mapping score: Mapping score: 1919Match identity: Match identity: 75.2%75.2%
Mapping score: Mapping score: 44Match identity: Match identity: 53.4%53.4%
ONT Assisted Scaffolding – Fake Mate Pairs ONT Assisted Scaffolding – Fake Mate Pairs
Insert SizeInsert Size
ONT Read LengthONT Read Length
Segment Segment lengthlength
Step Step lengthlength
ONT Assisted Scaffolding
Mate pair data is used to scaffold contigs. Contigs, and pairs of contigs connected by pairs, define a bi-directional graph:
Using expected insert size, a estimate of the gap size can be given for each contig.
http://sourceforge.net/projects/phusion2/files/smis/
Spinner – walks through a loopThese techniques alone produces useful results.Further stages will be used to resolve repeats pairs that “jump over” repeats, and graph flow concepts.
Scaffold one piece Scaffold one piece
Scaffold 2-3 pieces Scaffold 2-3 pieces
Scaffold N50 858Kb ; Contig N50 330Kb Scaffold N50 858Kb ; Contig N50 330Kb
SV between S288C and W303SV between S288C and W303
SV between S288C and W303SV between S288C and W303
Acknowledgements:
Adam Spargo Hannes Ponstingl German Tischler Yong Gu Joe Henson Ben Blackburn
Jim Mulikim Tony Cox, Sanger Tony Cox, Illumina Elizabeth Murchuson Richard Durbin Mike Stratton
Bin Han Feng Qi Henyun Lu Zhao Qiang Ying Lu
Kai Ye Ole Schulz-Trieglaff David Bentley
Professor Keling Liu at ISU in 1945Professor Keling Liu at ISU in 1945
Professor Keling Liu at ISU in 1945Professor Keling Liu at ISU in 1945