Bit-Parallel Approximate Pattern Matching on the Xeon Phi Coprocessor
Tuan Tu Tran, Simon Schindel, Yongchao Liu, Bertil Schmidt
Institut für Informatik
Johannes Gutenberg – University of Mainz
Germany
Outline
• Introduction
• Bit-parallel matching on the Xeon Phi Coprocessor
– Vectorization with 512-bit wide vector registers
– Data-Parallelism on the many-core coprocessor
• Performance Evaluation
• Conclusions and Perspectives
SBAC-PAD 2014
Approximate Pattern Matching
“Given a pattern P of length m, a text T of length n over an alphabet Σ and a constant k, find all substrings of T whose edit distances to P are at most k”
The Levenshtein edit distance: substitution, deletion and insertion
Example: T = ACTGCAT, P = CTGA, k = 1
Matched with:
• 1 deletion: ACTGCAT (substring CTG)
• 1 substitution: ACTGCAT (substring CTGC)
• 1 insertion: ACTGCAT (substring CTGCA)
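The example can be checked with a small dynamic-programming sketch. This is not the bit-parallel method of the talk but the classical column-wise recurrence in which the distance for the empty pattern prefix is always 0, so a match may start anywhere in T. The function name approx_match_ends and the bound MAX_M are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_M 64  /* illustrative upper bound on the pattern length */

/* For every end position i (1-based) in T, stores in dist[i-1] the smallest
 * edit distance between P and any substring of T ending at i.
 * Returns the number of end positions whose distance is at most k. */
size_t approx_match_ends(const char *T, const char *P, unsigned k,
                         unsigned *dist)
{
    size_t n = strlen(T), m = strlen(P), hits = 0;
    unsigned col[MAX_M + 1];

    assert(m >= 1 && m <= MAX_M);
    for (size_t j = 0; j <= m; ++j)
        col[j] = (unsigned)j;          /* empty text: j pattern characters deleted */

    for (size_t i = 1; i <= n; ++i) {
        unsigned diag = col[0];        /* value for (i-1, j-1) */
        col[0] = 0;                    /* a match may start at any text position */
        for (size_t j = 1; j <= m; ++j) {
            unsigned up = col[j];      /* value for (i-1, j) */
            unsigned best = diag + (T[i - 1] != P[j - 1]); /* match or substitution */
            if (up + 1 < best)
                best = up + 1;         /* extra character in the text */
            if (col[j - 1] + 1 < best)
                best = col[j - 1] + 1; /* pattern character skipped */
            diag = up;
            col[j] = best;
        }
        dist[i - 1] = col[m];
        if (col[m] <= k)
            ++hits;
    }
    return hits;
}
```

For T = ACTGCAT, P = CTGA and k = 1 the function reports three end positions (4, 5 and 6), which correspond to the substrings CTG, CTGC and CTGCA.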
Bit-parallel Approximate Pattern Matching with the prefix automaton
Bit-parallelism:
• Encodes calculated values into a machine word, seen as a bit array;
• Allows for simultaneous updates of multiple values by a single bit operation;
• Directly simulates all states of the NFA (Wu-Manber algorithm [Wu and Manber, 1992]);
• Limited by the size of machine words.
Xeon Phi Architecture
• A coprocessor running Linux
• Connected to a Host CPU via PCIe
• Runs in either “native” or “offload” mode
• 1 GHz Clock
• 8 GB GDDR5 RAM
• 60 Cores (4 threads per core)
Xeon Phi Architecture (cont.)
• Cores interconnected by a high-speed bidirectional ring;
• 512-KB L2-Cache per core
– High-speed access to all other L2 caches;
– Cache coherent across the entire processor;
• Four hardware threads per core
• 512-bit wide vector registers in addition to 64-bit x86
– 16 x 32-bit Integer or Single Precision Floating Point values
– 8 x 64-bit Integer or Double Precision Floating Point values
• Vectorize-and-scale approach to achieve high performance
[Figure: Intel Xeon Phi architecture (image courtesy of Intel Corporation)]
Motivations and Related work
Motivations:
• Usage of Xeon Phi vector registers → Matching with patterns longer than a machine word;
• Parallelization on the massive number of cores → Approximate pattern matching on large texts.
Related work:
• Usage of CPU vector registers for bit-parallel matching algorithms: [Külekci, 2009], [Faro and Külekci, 2012], [Fredriksson, 2003]
• Implementation of the Wu-Manber algorithm on GPU: [Li et al., 2011], [Tran et al., 2012]
• Implementation of the Myers bit-parallel pattern matching algorithm on GPU: [Chacón et al., 2014]
Notations
• A text T of length n
• A pattern P of length m
• An alphabet Σ
• A maximal edit distance k
• A pattern bitmask B:
– |Σ| rows
– B[a][i + 1] = 1 if and only if p_i = a (a ∈ Σ)
• A bit array R:
– k + 1 rows
– Representation of the matching NFA
– Once R_{i,j} (0 ≤ i ≤ k; 1 < j ≤ m) is active, the prefix p_1 p_2 … p_j is recognized with i errors
The Wu-Manber algorithm
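Initialization:
\[ R_j^{[0]} = 0^{m-j}\,1^{j} \qquad (0 \le j \le k) \]
For each t = T_i (1 ≤ i ≤ n):
\[
\begin{aligned}
R_0^{[i]} &= \bigl((R_0^{[i-1]} \ll 1) \mid 0^{m-1}1\bigr) \,\&\, B[t] \\
R_j^{[i]} &= \underbrace{\bigl((R_j^{[i-1]} \ll 1) \,\&\, B[t]\bigr)}_{\text{match}}
 \mid \underbrace{R_{j-1}^{[i-1]}}_{\text{insertion}}
 \mid \underbrace{(R_{j-1}^{[i-1]} \ll 1)}_{\text{substitution}}
 \mid \underbrace{(R_{j-1}^{[i]} \ll 1)}_{\text{deletion}}
 \qquad (1 \le j \le k)
\end{aligned}
\]
\[ R_k^{[i]} \,\&\, 10^{m-1} \neq 0\,? \qquad \text{(check for a match at position } i\text{)} \]
Computational complexity: O(n.w)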
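A minimal single-word sketch of the recurrence above, assuming the pattern fits into one 64-bit machine word (m ≤ 64). The names wu_manber_count and WM_MAX_K are illustrative. One deliberate deviation is an assumption of this sketch and not taken from the slide: the lowest bit is ORed into every row j ≥ 1, because a one-character prefix is always recognized with one error and a plain left shift does not carry that information in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WM_MAX_K 16   /* illustrative bound on the number of errors */

/* Counts the end positions in T where P occurs with at most k errors.
 * Requires 1 <= m <= 64 so that one row of R fits in a machine word. */
size_t wu_manber_count(const char *T, const char *P, unsigned k)
{
    size_t n = strlen(T), m = strlen(P), hits = 0;
    uint64_t B[256] = {0};             /* pattern bitmask, one row per byte value */
    uint64_t R[WM_MAX_K + 1];
    uint64_t match_bit;

    assert(m >= 1 && m <= 64 && k <= WM_MAX_K && k < m);
    match_bit = (uint64_t)1 << (m - 1);

    for (size_t j = 0; j < m; ++j)     /* bit j of B[a] is set iff P[j] == a */
        B[(unsigned char)P[j]] |= (uint64_t)1 << j;

    for (unsigned j = 0; j <= k; ++j)  /* row j starts with its j lowest bits set */
        R[j] = ((uint64_t)1 << j) - 1;

    for (size_t i = 0; i < n; ++i) {
        uint64_t Bt = B[(unsigned char)T[i]];
        uint64_t old_prev = R[0];      /* row j-1 before this text character */
        R[0] = ((R[0] << 1) | 1) & Bt;
        for (unsigned j = 1; j <= k; ++j) {
            uint64_t old_cur = R[j];
            R[j] = ((old_cur << 1) & Bt)   /* match */
                 | old_prev                /* insertion */
                 | (old_prev << 1)         /* substitution */
                 | (R[j - 1] << 1)         /* deletion (row j-1 already updated) */
                 | 1;                      /* assumption of this sketch, see text */
            old_prev = old_cur;
        }
        if (R[k] & match_bit)
            ++hits;
    }
    return hits;
}
```

Called with T = ACTGCAT, P = CTGA and k = 1 it counts three matching end positions; with k = 0 it degenerates to exact bit-parallel matching.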
Example
ACTGCAT
 CTGA     (deletion)

ACTGCAT
 CTGCA    (insertion)

ACTGCAT
 CTGA     (substitution)
Extended version of the Wu-Manber algorithm
• Vectorizations of the bit-wise operations: – AND – OR – SHIFT_LEFT
• Efficient check for a match
Vectorization with 512-bit wide registers
• Use of union: flexible change between intrinsic data format (__m512i) and an array of 16 integers

    #define REG_NUM 16
    union m512 {
        __m512i m512;
        unsigned int v[REG_NUM] __attribute__((aligned(64)));
    };

• Intrinsic bit-wise functions: the elements within the vector are processed independently
– Bit-wise AND: _mm512_and_epi32
– Bit-wise OR: _mm512_or_epi32
– Bit-wise SHIFT_LEFT (A <<= 1): the left most bit of v_{i+1} becomes the right most bit of v_i; combination of 4 intrinsic functions: _mm512_alignr_epi32, _mm512_srli_epi32, _mm512_slli_epi32, _mm512_mask_or_epi32
[Figure: A <<= 1 on A = (v0 v1 … v14 v15). B holds v1 … v15 moved one element to the left (_mm512_alignr_epi32) and shifted right by 31 bits (_mm512_srli_epi32); A is shifted left by 1 bit (_mm512_slli_epi32); both are combined with a masked OR (_mm512_mask_or_epi32).]
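A portable scalar sketch of the SHIFT_LEFT step, for readers without access to the coprocessor. It uses plain C loops over an array of 16 32-bit words instead of the intrinsics. The function name shift_left_512 and the pairing of one loop with each of the four intrinsics are assumptions inferred from the figure, not code from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_NUM 16   /* 16 x 32 bits = 512 bits; a[0] holds the left most bits */

/* Shifts the 512-bit value stored in a[0..15] left by one bit.
 * Scalar stand-in for the four vector steps named on the slide. */
void shift_left_512(uint32_t a[REG_NUM])
{
    uint32_t b[REG_NUM];
    int i;

    assert(a != NULL);
    for (i = 0; i < REG_NUM - 1; ++i)  /* step 1: move each element one lane left */
        b[i] = a[i + 1];
    b[REG_NUM - 1] = 0;                /* nothing enters on the far right */
    for (i = 0; i < REG_NUM; ++i)      /* step 2: keep only the left most bit */
        b[i] >>= 31;
    for (i = 0; i < REG_NUM; ++i)      /* step 3: shift every lane left by 1 */
        a[i] <<= 1;
    for (i = 0; i < REG_NUM; ++i)      /* step 4: OR the carried bits in */
        a[i] |= b[i];
}
```

The four loops are kept separate only to mirror the four steps; a scalar implementation could fuse them into one pass.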
Auto-vectorization
• Uses an array of 16 uints to simulate a 512-bit machine word;
• Uses for-loops with directives:
– simd assert
– vector aligned
– ivdep

    …
    /* save the left most bit of A[i]: it becomes the right most bit of A[i-1] */
    #pragma ivdep
    #pragma simd assert
    for (i = 1; i < REG_NUM; ++i)
        B[i-1] = (A[i] >> 31);
    /* shift left A by 1 position */
    #pragma vector aligned
    #pragma simd assert
    for (i = 0; i < REG_NUM; ++i)
        A[i] <<= 1;
    …
Data parallelism on the many-core coprocessor
• Given: a pattern P, a collection of texts {T_i}
• The search for P in any two texts T_i and T_j (i ≠ j) can be performed in parallel
• Three multi-threaded versions using OpenMP:
– wmIntr: Xeon Phi, intrinsic data and functions
– wmAutoVec: Xeon Phi, array of 16 uints, automatic vectorization
– wmHost: multicore CPU, array of 16 uints
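The data-parallel scheme can be sketched with a single OpenMP loop over the text collection. This is an illustration under stated assumptions, not the wmIntr, wmAutoVec or wmHost code: count_matches is a hypothetical stand-in that performs exact matching with strstr, search_collection is an illustrative name, and schedule(dynamic) is one possible choice for texts of unequal length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-text matcher: counts exact occurrences of P in T.
 * Any matcher with this shape (e.g. a Wu-Manber routine) could replace it. */
static size_t count_matches(const char *T, const char *P)
{
    size_t hits = 0;
    assert(strlen(P) > 0);
    for (const char *s = strstr(T, P); s != NULL; s = strstr(s + 1, P))
        ++hits;
    return hits;
}

/* Searches P in every text of the collection. The texts are independent,
 * so each loop iteration may run on a different thread. Without OpenMP
 * support the pragma is ignored and the loop runs serially. */
void search_collection(const char *const texts[], size_t num_texts,
                       const char *P, size_t hits[])
{
    long i;  /* signed loop index, accepted by all OpenMP versions */
#pragma omp parallel for schedule(dynamic)
    for (i = 0; i < (long)num_texts; ++i)
        hits[i] = count_matches(texts[i], P);
}
```

Each thread writes only to its own hits[i], so no synchronization is needed beyond the implicit barrier at the end of the loop.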
Testing Environment and Data
• Intel Xeon Phi 5110P
– 60 cores x 1.053 GHz
– 8 GB RAM
• Host:
– Intel Xeon E5-2670: 8 cores x 2.6 GHz
– 64 GB RAM
• Compiler: Intel icc with -O3 option
• Data:
– Human chromosome 21 (chr21)
– Texts: 32x or 128x of chr21 (1.1 GB or 4.3 GB)
– Pattern of length 511, extracted from chr21
• Serial time to evaluate speedups: wmHost with one thread
Scalability with the number of cores
• Both versions scale well with the number of cores of the Xeon Phi
• wmIntr is superior to wmAutoVec
[Figure panels: wmIntr, wmAutoVec] (the numbers of threads are multiples of 59)
Scalability with the Levenshtein distance
The advantage of using the intrinsic SIMD data types and functions of the Xeon Phi
Scalability of wmIntr
Linear increase with:
• The Levenshtein distance
• The size of input texts
Comparison to related work
• Our work: approximate matching with patterns longer than common machine words (32 or 64 bits), using the Wu-Manber algorithm.
• [Külekci, 2009], [Faro and Külekci, 2012]: exact matching.
• [Fredriksson, 2003], [Chacón et al., 2014]: the Myers algorithm
– Independent of the maximal edit distance (k);
– Not easy to extend to matching with wildcards and regular expressions.
• [Li et al., 2011], [Tran et al., 2012]: focus on pattern lengths smaller than common machine words.
The settings are not identical, so our performance cannot be compared directly with the related work mentioned above.
Conclusions and Perspectives
Conclusions
• Simulation of long machine words on the Intel Xeon Phi architecture;
• Extended implementation of the Wu-Manber algorithm;
• Multi-threaded versions of bit-parallel approximate pattern matching:
– Long pattern
– High Levenshtein distance
– Large target texts
• The source code can be downloaded at: http://xbitpar.sourceforge.net/
Conclusions and Perspectives (cont.)
Perspectives
• Matching with wildcards and regular expressions
• Mapping onto CUDA-enabled GPUs (SIMD feature of a “warp”)
• Preprocessing step in bioinformatics sequencing applications
– Fast filtering
– Seeding
• Other bit-parallel matching algorithms, such as the Myers algorithm.
• Other bit-parallel applications, such as finding the longest common subsequence (LCS).
Thank you for your attention!
References
[Wu and Manber, 1992] S. Wu and U. Manber, “Fast Text Searching Allowing Errors,” Communications of the ACM, vol. 35, no. 10, pp. 83–91, 1992.
[Li et al., 2011] H. Li, B. Ni, M. H. Wong, and K.-S. Leung, “A fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching,” in SASP, 2011, pp. 74–77.
[Külekci, 2009] M. O. Külekci, “Filter Based Fast Matching of Long Patterns by Using SIMD Instructions,” in Stringology, 2009, pp. 118–128.
[Faro and Külekci, 2012] S. Faro and M. O. Külekci, “Fast Multiple String Matching Using Streaming SIMD Extensions Technology,” in Proceedings of the 19th International Conference on String Processing and Information Retrieval, ser. SPIRE’12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 217–228.
[Fredriksson, 2003] K. Fredriksson, “Row-wise Tiling for the Myers’ Bit-Parallel Approximate String Matching Algorithm,” in String Processing and Information Retrieval, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2003, vol. 2857, pp. 66–79.
[Tran et al., 2012] T. T. Tran, M. Giraud, and J.-S. Varré, “Bit-Parallel Multiple Pattern Matching,” in Parallel Processing and Applied Mathematics, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2012, vol. 7204, pp. 292–301.
[Chacón et al., 2014] A. Chacón, S. Marco-Sola, A. Espinosa, P. Ribeca, and J. C. Moure, “Thread-cooperative, Bit-parallel Computation of Levenshtein Distance on GPU,” in Proceedings of the 28th ACM International Conference on Supercomputing, ser. ICS ’14. New York, NY, USA: ACM, 2014, pp. 103–112.