GenAx: A Genome Sequencing Accelerator
Daichi Fujiki*, Arun Subramaniyan*, Tianjun Zhang*, Yu Zeng, Reetuparna Das, David Blaauw, Satish Narayanasamy
*Equally contributed to the paper
Genomics is set to transform medicine
Population-based treatment → Personalized treatment
Genome sequencing costs have plummeted
“Illumina says it can deliver a $100 genome - soon”
[Figure: sequencing cost per genome by year. Cost axis: $1 million, $1,000, $100; year axis: 2008, 2018, 2028; annotation: 1000x]
Portable sequencers are becoming commonplace
Sequence analysis has several computational steps
Human genome: 3 G bases
ATCGTGCAGTGTGCATCTACCAGTACATCGATCGTGCTAC
Sequenced reads (~billions)
Read alignment: reads (ATCGTGCAGTTTCGTGAAG, GAAGTTTATTCGTA, CGTAAGT) are placed on the reference genome
Variant calling: aligned reads are compared against the reference genome
Diagnosis
Secondary analysis
(350 genomes/week) per machine
(5.6 genomes/week) per server
~300 CPU hours [1]
[1] Li, Heng. “Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.” Platinum Genome dataset. Illumina HiSeq 2000 reads, Run: ERR194147 (~50x coverage)
Read alignment is a major bottleneck in sequence analysis
Read alignment: read vs. reference genome
BWA-MEM [1]: seeding, followed by seed extension
*Quality standard
Seeding – a Filtration Step
[Figure: the seed AATA taken from the read and its occurrences (AATA, AATA, AATA) in the reference genome; reference axis labels: 0, 52, 103, 512]
Seed Extension
[Figure: the seed AATA taken from the read and its occurrences in the reference genome; reference axis labels: 0, 52, 103, 512]
Read: AAATACCTAAAATTTAT
Candidate reference strings and scores:
GAATA-CTA-AATTTAT    15
G--AATA-C---TTTAT    11
AAATACCTAAAATTTAT    17
Seed extension as approximate string matching
Levenshtein (edit) distance: minimum number of edits (insertions, substitutions, deletions) required to perfectly match the read (or query string Q) and reference string R
Reference (R):     C A T C G A - C G T A G A T
Read or Query (Q): C A - C G A A C C T A T A T
Edits: del (T), ins (A), sub (G to C), sub (G to T)
Edit distance = 4
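The edit distance defined on this slide can be checked with a small dynamic-programming routine. This is a minimal sketch for illustration only; the function name `levenshtein` is an assumption and is not part of GenAx or BWA-MEM.

```python
def levenshtein(reference, query):
    """Minimum number of insertions, deletions and substitutions
    needed to turn `reference` into `query`."""
    # prev[j] holds the distance between the processed reference prefix
    # and the first j characters of the query.
    prev = list(range(len(query) + 1))
    for i, rc in enumerate(reference, 1):
        cur = [i]  # deleting i reference characters matches an empty query
        for j, qc in enumerate(query, 1):
            cur.append(min(
                prev[j - 1] + (rc != qc),  # match or substitution
                prev[j] + 1,               # deletion from the reference
                cur[j - 1] + 1,            # insertion into the reference
            ))
        prev = cur
    return prev[-1]


# The slide's example with the gap symbols removed.
print(levenshtein("CATCGACGTAGAT", "CACGAACCTATAT"))
```

Filling the full table takes time proportional to the product of the two string lengths, which is the cost the following slides set out to reduce.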
Genome Sequencing – Alignment Methodology
Smith-Waterman matrix: O(n^2) for two strings of length n (figure © Wikimedia Commons)
Banded Smith-Waterman matrix: O(kn), where k is the acceptable edit distance
Levenshtein automata: ins, sub and del transitions for each character of the string (example: A G C), for acceptable edit distance k; string dependent, O(kn)
String independent: O(k^2)
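The banded idea above can be sketched in software: only cells within k of the main diagonal are evaluated, because any alignment with at most k edits never leaves that band. This is a minimal sketch using unit edit costs rather than Smith-Waterman scores; the name `banded_edit_distance` is an assumption made for illustration.

```python
def banded_edit_distance(reference, query, k):
    """Edit distance between the two strings if it is <= k, else None.
    Only cells with |i - j| <= k are evaluated (about 2k+1 per row)."""
    n, m = len(reference), len(query)
    if abs(n - m) > k:
        return None
    inf = k + 1  # any value above k means "outside the acceptable distance"
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        # Rows are allocated full-width for clarity; a real implementation
        # would store only the band to keep the work at O(kn).
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            cur[0] = i
        for j in range(max(1, lo), hi + 1):
            cost = 0 if reference[i - 1] == query[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         inf)                 # cap values that exceed k
        prev = cur
    return prev[m] if prev[m] <= k else None


print(banded_edit_distance("CATCGACGTAGAT", "CACGAACCTATAT", 4))
```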
Algorithm Contribution
Silla: String Independent Local Levenshtein Automata
New automaton algorithm
States (i,d): (0,0); K=1: (1,0), (0,1); K=2: (2,0), (1,1), (0,2); transitions: match, ins, del
D_{i,d} = R[c-i] XNOR Q[c-d]
SMEM + hash-based seeding algorithm
A G T A A T G C C A T G
A G T A A T G C C A T T
Binary search
Hardware Implementation (Hardware Optimization)
SillaX: Silla Accelerator for Genome Sequencing
SMEM + Hash-based Seeding Accelerator: index table, position table, 512-entry CAMs, segmenting
Silla: String Independent Local Levenshtein Automaton
String Independent: Indel Silla
Reference: A G T A A T G C C A T T
Query:     A G T A A T A C C A T T
States (i,d): (0,0); K=1: (1,0), (0,1); K=2: (2,0), (1,1), (0,2); transitions: match, ins, del
At cycle c, state (i,d) evaluates D_{i,d} = (R[c-i] == Q[c-d])
Exact match: Reference T A A T G C C A T T, Query T A A T G C C A T T; the automaton stays in (0,0) while D_{0,0} holds
Insertion: Reference T A T G C C A T T A, Query T A A T G C C A T T; on the mismatch, (0,0) moves to (1,0)
Deletion: Reference T A A G C C A T T A, Query T A G C C A T T A G; on the mismatch, (0,0) moves to (0,1)
[Figure: D_{1,0}, D_{2,0} compare against the reference shifted by i = 1, 2, 3, 4; D_{0,1}, D_{0,2} compare against the query shifted by d = 1, 2, 3, 4; the edit distance is read off the state reached]
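The comparison rule D_{i,d} = (R[c-i] == Q[c-d]) can be simulated in software to see why the state set does not depend on the strings. This is a minimal sketch of the indel-only automaton as read from these slides, not the SillaX hardware; the function name `indel_silla` and the end-of-string handling are assumptions made for illustration.

```python
def indel_silla(reference, query, k):
    """Return the number of insertions + deletions separating the strings
    if it is <= k, else None. State (i, d) means i insertions and d deletions."""
    n, m = len(reference), len(query)
    active = {(0, 0)}
    best = None
    for c in range(max(n, m) + k + 1):       # c is the cycle counter
        nxt = set()
        for i, d in active:
            r, q = c - i, c - d              # indices depend only on (i, d, c)
            if r == n and q == m:            # both strings consumed: accept
                best = i + d if best is None else min(best, i + d)
            elif r < n and q < m and reference[r] == query[q]:
                nxt.add((i, d))              # D[i,d] is true: stay in the state
            elif i + d < k:                  # mismatch: spend one more edit
                if q < m:
                    nxt.add((i + 1, d))      # insertion: skip a query character
                if r < n:
                    nxt.add((i, d + 1))      # deletion: skip a reference character
        active = nxt
    return best


# The slide's pair differs by one substitution, which costs two edits
# (one insertion plus one deletion) when only indels are available.
print(indel_silla("AGTAATGCCATT", "AGTAATACCATT", 2))
```

At most (k+1)(k+2)/2 states can be active regardless of the string length, which is where the O(k^2) state count comes from.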
Silla: String Independent Local Levenshtein Automaton
String Independent: 3D Silla
Reference: A G T A A G C C A T T A
Query:     A G T A T G C C A T T A
3D Silla adds a substitution index, written D_{i,d|s}: when D_{0,0|0} = false, (0,0|0) → (0,0|1). Complexity: O(k^3)
Collapsing 3D Silla: can we merge these nodes?
Key observation (2): D_{1,1|0} will see the same characters as D_{0,0|2} in the future: D_{0,0|2} (t = c) = D_{1,1|0} (t = c+1)
→ Insert wait node (i,d)_w
[Figure: D_{0,0|0} and D_{1,1|0}, +1 cycle, = D_{0,0|1} = D_{0,0|2}]
Key observation: 3D Silla ≅ 2-layer Silla, O(k^2)
[Figure: two layers of states, (i,d)_0 and (i,d)_1, each driven by D_{i,d} with match, ins and del transitions to (i+1,d) and (i,d+1), and sub transitions to (i+1,d+1) through the wait node (i,d)_w]
Silla: String Independent Local Levenshtein Automaton
Local: Pipelined Datapaths
Reference: A G T A A G C C A T T A
Query:     A G T A T G C C A T T A
Problem statement: some nodes require long wires from comparators or other nodes
D_{0,0} (t = c) = D_{1,1} (t = c+1) = D_{2,2} (t = c+2) = ...
[Figure: the comparator output D_{i,d} at t = c is forwarded to D_{i+1,d+1} at t = c+1 and again at t = c+2, replacing the long wires (marked ×) from the comparators to D_{1,1} with local pipelined connections]
Hardware Implementation
GenAx: A Genome Sequencing Accelerator
Seeding machine: SMEM + Hash-based Seeding Accelerator (index table, position table, 512-entry CAMs, segmenting)
SillaX: Seed extension accelerator built on Silla (inputs: query, reference)
In-place Traceback
Affine gap Scoring
Composability
Hardware Implementation: In-place Traceback
Traceback recap
No external traceback memory
  Traceback information is stored in nodes
  Greedy approach
String matching phase
  Best score is propagated
  Traceback pointer is kept in nodes
  Traceback pointer is updated with better score
Traceback phase
  Pointer trailing from the node with best score
[Figure: traceback machine, shown for k = 1, k = 2, k = 3]
Hardware Implementation: In-place Traceback
Broken pointer trail
  Previous best score is updated and breaks the path
  Rerun the machine till the cycle
  7.6% of reads require rerun
Hardware Implementation: Affine gap Scoring
Score: Match +1, Mismatch -4, Gap Opening -7, Gap Extension -1
Trace 1: M M M I I M I I I M M    gaps: -7-1 and -7-1-1    Edit distance = 5    Score = -11    BAD
Trace 2: M M M M I I I I I M M    gaps: -7-1-1-1-1         Edit distance = 5    Score = -5     GOOD
Node (i,d) receives scores from (i-1,d-1), (i-1,d), (i,d-1) and keeps three values (go = gap opening, ge = gap extension):
  insScore: max(cur – go, ins – ge)
  delScore: max(cur – go, del – ge)
  Current score: MAX over cur + (match or mismatch) and the gap scores
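The two trace scores on this slide can be reproduced with a few lines of code. This is a minimal sketch under the slide's scoring values, where the first base of a gap costs the gap-opening penalty and each further base costs the gap-extension penalty; the function name `affine_score` and the use of `X` for a mismatch are assumptions made for illustration.

```python
def affine_score(trace, match=1, mismatch=-4, gap_open=-7, gap_extend=-1):
    """Score a trace string of M (match), X (mismatch), I (insertion)
    and D (deletion) operations under affine gap penalties."""
    score = 0
    prev = None
    for op in trace:
        if op == "M":
            score += match
        elif op == "X":
            score += mismatch
        elif op in ("I", "D"):
            # Continuing the same kind of gap is cheaper than opening a new one.
            score += gap_extend if op == prev else gap_open
        else:
            raise ValueError(f"unknown trace operation: {op!r}")
        prev = op
    return score


print(affine_score("MMMIIMIIIMM"))  # Trace 1: two separate gaps
print(affine_score("MMMMIIIIIMM"))  # Trace 2: one long gap
```

Both traces contain five inserted bases, so plain edit distance cannot tell them apart; the affine score prefers the single long gap.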
Hardware Implementation: Composability
Composable to large edit distances
[Figure: SillaX tiles, each taking R and Q and supporting edit distance k, are combined into a machine on R and Q supporting edit distance 2k]
Identifying super-maximal exact matches for a read
Read: AGTAATGCCATG (k-mers at read offsets 0, 4, 8)
The index table points each k-mer to its hit list in the position table; the CAM holds the running hit set
k-mer-1 AGTA: hits 10, 100, 390    H1 = { 10, 100, 390 }
k-mer-2 ATGC: hits 30, 104, 394    H2 = { 26, 100, 390 }
k-mer-3 CATG: hits 34, 108         H3 = { 26, 100 }
H1 ∩ H2 ∩ H3: 10 is removed from the CAM after k-mer-2, 390 after k-mer-3
SMEM = { 100 }
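The hit-set arithmetic on these slides can be replayed in a few lines. This is a minimal sketch that only covers the offset normalisation and intersection shown here, not the full SMEM search; the function name `candidate_positions` and the dictionary standing in for the index and position tables are assumptions made for illustration.

```python
def candidate_positions(read, index_table, k):
    """Intersect the hits of consecutive non-overlapping k-mers of `read`.
    Each hit is shifted back by the k-mer's offset in the read so that all
    hits refer to the position where the read would start in the reference."""
    result = None
    for offset in range(0, len(read) - k + 1, k):
        kmer = read[offset:offset + k]
        hits = {pos - offset for pos in index_table.get(kmer, [])}
        result = hits if result is None else result & hits
    return result if result is not None else set()


# Hit lists copied from the slides.
index_table = {
    "AGTA": [10, 100, 390],
    "ATGC": [30, 104, 394],
    "CATG": [34, 108],
}
print(candidate_positions("AGTAATGCCATG", index_table, 4))
```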
Seeding implementation: Key ideas
1 Binary search-based intersection for frequent k-mers
Intersecting m hits of k-mer-1 with n hits of k-mer-2 (n > 500)
Position list for each k-mer is sorted
Each of the m hits is looked up in the n-entry list: O(m log n) steps
[Figure: position table entries shown include 10, 104, 394, 504, 750, 950]
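The O(m log n) intersection can be sketched with the standard library's binary search. This is a minimal sketch assuming both hit lists are sorted in ascending order; the function name `intersect_sorted` and the sample values are illustrative and are not taken from the GenAx implementation.

```python
from bisect import bisect_left


def intersect_sorted(small, large):
    """Return the values of `small` that also occur in the sorted list
    `large`, using one binary search per value (m searches of log n steps)."""
    common = []
    for value in small:
        idx = bisect_left(large, value)
        if idx < len(large) and large[idx] == value:
            common.append(value)
    return common


# Illustrative values only: a short hit list probed against a longer one.
print(intersect_sorted([104, 394], [10, 104, 394, 504, 750, 950]))
```

When n is much larger than m, this does far less work than scanning both lists in full.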
Seeding implementation: Key ideas
2 Probing: Intersect from k-mer with minimum number of hits
Read: AAAAAAAAAAAAAAGTAATGCCATGATGCCGTATGAATGCAAGT
# Hits per k-mer (values shown): 100, 25, 1, 1000000
[Figure: intersections numbered 1 and 2, starting from the k-mer with the fewest hits]
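The probing idea can be illustrated by ordering the intersections so that the smallest hit set comes first. This is a minimal sketch of that ordering only, under the assumption that each k-mer's hits are normalised by its read offset; the function name `probe_intersect` and the sample data are hypothetical.

```python
def probe_intersect(kmer_hits):
    """kmer_hits: list of (read_offset, reference_positions) pairs.
    Intersect the offset-normalised hit sets, fewest hits first, and stop
    as soon as the running intersection becomes empty."""
    normalised = [{pos - offset for pos in hits} for offset, hits in kmer_hits]
    if not normalised:
        return set()
    normalised.sort(key=len)  # start from the k-mer with the fewest hits
    result = normalised[0]
    for hits in normalised[1:]:
        result = result & hits
        if not result:
            break  # no candidate survives; the remaining sets are irrelevant
    return result


# Hypothetical data: one very frequent k-mer and two rare ones.
print(probe_intersect([(0, range(1000)), (4, [30, 104, 394]), (8, [34, 108])]))
```

Starting from the rarest k-mer keeps the running set small, so the frequent k-mer's long list is only ever probed for a handful of candidates.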
Methodology
- Input
  Reference Genome: GRCh38 (Human Genome) from UCSC genome browser
  Input Reads: Illumina Platinum Genomes (50x, 787M reads, 101 bp)
- Baseline
  CPU: BWA-MEM on Intel Xeon E5-2697 (2.6 GHz, 56 threads) + 128 GB DDR4
  GPU: CUSHAW2-GPU on NVIDIA Titan Xp (1.58 GHz, 3840 cores) + 12 GB GDDR5X
- SillaX configuration
  Synthesis: Synopsys Design Compiler, 28 nm process, 2 GHz → 5.64 mm2, 6.6 W
  Bandwidth (K): 40
- Seeding machine configuration
  K-mer size: 12
  Segmenting: 512 segments, 48 MB index table and 18 MB position table
Performance (Throughput/Power)
[Figure: GenAx throughput (K reads/sec): BWA-MEM 128K, CUSHAW2-GPU 56K, GenAx 4,058K; 31.7x]
[Figure: average power (W) for BWA-MEM, CUSHAW2-GPU and GenAx; 12x]
[Figure: SillaX throughput (K hits/s, log scale from 1 to 100000) on Illumina 100 for SillaX, CPU and GPU; baseline CPU: SeqAn Library, GPU: SW#; 63x over CPU, 5000x over GPU]
Conclusion
Contributions
Silla – a novel automaton algorithm for approximate string matching
  o O(k^2) complexity
  o Naturally maps to systolic array / automaton accelerator
SillaX – a seed extension accelerator
  o Affine gap + Traceback + Composable
GenAx – a read alignment accelerator
  o Drop-in replacement of BWA-MEM
Results
  31.7x speedup over BWA-MEM on 56-thread Xeon processor
  12x power reduction
  5.6x area reduction
University of Michigan Precision Health Initiative
“Discover the genetic, lifestyle and environmental factors that influence a population’s health and provides personalized solutions that allow individuals to improve their health and wellness.”
GenAx: A Genome Sequencing Accelerator
Daichi Fujiki, Arun Subramaniyan, Tianjun Zhang, Yu Zeng, Reetuparna Das, David Blaauw, Satish Narayanasamy
Thank you. Any questions?