Bit-Parallel Approximate Pattern Matching on the Xeon Phi Coprocessor

Tuan Tu Tran, Simon Schindel, Yongchao Liu, Bertil Schmidt

Institut für Informatik

Johannes Gutenberg – University of Mainz

Germany

Outline

• Introduction

• Bit-parallel matching on the Xeon Phi Coprocessor

– Vectorization with 512-bit wide vector registers

– Data-Parallelism on the many-core coprocessor

• Performance Evaluation

• Conclusions and Perspectives

SBAC-PAD´2014

Approximate Pattern Matching

“Given a pattern P of length m, a text T of length n over an alphabet Σ and a constant k, find all substrings of T whose edit distances with P are at most k”

The Levenshtein edit distance: substitution, deletion and insertion

Example: T = ACTGCAT, P = CTGA, k = 1

Matched with:

• 1 deletion: ACTGCAT

• 1 substitution: ACTGCAT

• 1 insertion: ACTGCAT
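The problem statement above can be checked with a small reference routine. The following is a minimal sketch using the textbook dynamic-programming column update (Sellers' algorithm), not the bit-parallel method of this talk; the function name `sellers_search` and the `MAX_PAT` limit are hypothetical choices made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAX_PAT = 64 };  /* illustrative limit on the pattern length */

/* Hypothetical helper: sets ends[i] = 1 when some substring of `text`
 * ending at index i (0-based) is within edit distance k of `pat`.
 * `ends` must have room for strlen(text) entries.
 * Returns the number of such end positions. */
size_t sellers_search(const char *text, const char *pat, int k,
                      unsigned char *ends)
{
    size_t n = strlen(text), m = strlen(pat), count = 0;
    int col[MAX_PAT + 1];

    assert(m >= 1 && m <= MAX_PAT);

    /* Before any text is read, matching a prefix of length j costs j edits. */
    for (size_t j = 0; j <= m; ++j)
        col[j] = (int)j;

    for (size_t i = 0; i < n; ++i) {
        int diag = col[0];  /* previous column, row j-1 */
        col[0] = 0;         /* a match may start at any text position */
        for (size_t j = 1; j <= m; ++j) {
            int up = col[j];  /* previous column, row j */
            int best = diag + (pat[j - 1] != text[i]);  /* match or substitution */
            if (up + 1 < best)
                best = up + 1;                          /* extra text symbol */
            if (col[j - 1] + 1 < best)
                best = col[j - 1] + 1;                  /* skipped pattern symbol */
            diag = up;
            col[j] = best;
        }
        ends[i] = (col[m] <= k);
        count += ends[i];
    }
    return count;
}
```

For T = ACTGCAT, P = CTGA and k = 1 this sketch reports matches ending at text positions 4, 5 and 6 (1-based), which correspond to the substrings CTG, CTGC and CTGCA.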

Bit-parallel Approximate Pattern Matching with the prefix automaton

Bit-parallelism:

• Encodes calculated values into a machine word, seen as a bit array;

• Allows for simultaneous updates of multiple values by a single bit operation;

• Directly simulates all states of the NFA (Wu-Manber algorithm [Wu and Manber, 1992]);

• Limited by the size of machine words.

Xeon Phi Architecture

• A coprocessor running Linux

• Connected to a Host CPU via PCIe

• Runs in either “native” or “offload” mode

• 1 GHz Clock

• 8 GB GDDR5 RAM

• 60 Cores (4 threads per core)

Xeon Phi Architecture (cont.)

• Cores interconnected by a high-speed bidirectional ring;

• 512-KB L2-Cache per core

– High-speed access to all other L2 caches;

– Cache coherent across the entire processor;

• Four hardware threads per core

• 512-bit wide vector registers in addition to 64-bit x86

– 16 x 32-bit Integer or Single Precision Floating Point values

– 8 x 64-bit Integer or Double Precision Floating Point values

• Vectorize-and-scale approach to achieve high performance

Intel Xeon Phi architecture (image courtesy of Intel Corporation)

Motivations and Related work

Motivations:

• Usage of Xeon Phi vector registers → Matching with patterns longer than a machine word;

• Parallelization on the massive number of cores → Approximate pattern matching on large texts.

Related work:

• Usage of CPU vector registers for bit-parallel matching algorithms: [Külekci, 2009], [Faro and Külekci, 2012], [Fredriksson, 2003]

• Implementation of the Wu-Manber algorithm on GPU: [Li et al., 2011], [Tran et al., 2012]

• Implementation of the Myers bit-parallel pattern matching algorithm on GPU: [Chacón et al., 2014]

Notations

• A text $T$ of length $n$

• A pattern $P$ of length $m$

• An alphabet $\Sigma$

• A maximal edit distance $k$

• A pattern bitmask $B$:

– $|\Sigma|$ rows

– $B[a][i + 1] = 1$ if and only if $p_i = a$ ($a \in \Sigma$)

• A bit array $R$:

– $k + 1$ rows

– Representation of the matching NFA

– Once $R_{i,j}$ ($0 \le i \le k$; $1 < j \le m$) is active, the prefix $p_1 p_2 \dots p_j$ is recognized with $i$ errors

The Wu-Manber algorithm

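Initialization:

$$R_j^{[0]} = 0^{m-j-1}\,1^{j+1} \qquad (0 \le j \le k)$$

For each $t_i \in T$ $(1 \le i \le n)$:

$$R_0^{[i]} = \bigl((R_0^{[i-1]} \ll 1) \mid 0^{m-1}1\bigr) \;\&\; B[t_i]$$

$$R_j^{[i]} = \underbrace{\bigl((R_j^{[i-1]} \ll 1) \;\&\; B[t_i]\bigr)}_{\text{match}} \mid \underbrace{R_{j-1}^{[i-1]}}_{\text{insertion}} \mid \underbrace{(R_{j-1}^{[i-1]} \ll 1)}_{\text{substitution}} \mid \underbrace{(R_{j-1}^{[i]} \ll 1)}_{\text{deletion}} \qquad (1 \le j \le k)$$

$$R_k^{[i]} \;\&\; 10^{m-1} \ne 0 \;? \qquad \text{(check for a match at position } i\text{)}$$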

Computational complexity: Ο(n.w)
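The recurrence can be illustrated on a single machine word. The following is a minimal sketch, assuming a 64-bit word in which bit 0 stands for the always-active initial NFA state and bit i for pattern symbol p_i, so patterns of up to 63 symbols fit. The function name `wu_manber_search`, the `MAX_K` limit and this bit layout are assumptions of the sketch, not taken from the authors' source code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MAX_K = 8 };  /* illustrative limit on the number of errors */

/* Hypothetical single-word Wu-Manber search.
 * Sets ends[i] = 1 when a match with at most k errors ends at text[i]
 * and returns the number of such positions. */
size_t wu_manber_search(const char *text, const char *pat, int k,
                        unsigned char *ends)
{
    size_t n = strlen(text), m = strlen(pat), count = 0;
    uint64_t B[256], R[MAX_K + 1];

    assert(m >= 1 && m <= 63 && k >= 0 && k <= MAX_K);

    /* Pattern bitmask: bit 0 (initial state) is set for every symbol,
     * bit i+1 is set when pat[i] equals the symbol. */
    for (int c = 0; c < 256; ++c)
        B[c] = 1u;
    for (size_t i = 0; i < m; ++i)
        B[(unsigned char)pat[i]] |= (uint64_t)1 << (i + 1);

    /* Row j starts with its j+1 lowest bits set. */
    for (int j = 0; j <= k; ++j)
        R[j] = ((uint64_t)1 << (j + 1)) - 1;

    const uint64_t accept = (uint64_t)1 << m;  /* bit of the last pattern symbol */

    for (size_t i = 0; i < n; ++i) {
        uint64_t mask = B[(unsigned char)text[i]];
        uint64_t prev_old = R[0];              /* row j-1 before this symbol */

        R[0] = ((R[0] << 1) | 1u) & mask;
        for (int j = 1; j <= k; ++j) {
            uint64_t old = R[j];
            R[j] = ((old << 1) & mask)   /* match */
                 | prev_old              /* insertion */
                 | (prev_old << 1)       /* substitution */
                 | (R[j - 1] << 1);      /* deletion (row j-1 already updated) */
            prev_old = old;
        }
        ends[i] = (R[k] & accept) != 0;
        count += ends[i];
    }
    return count;
}
```

With T = ACTGCAT, P = CTGA and k = 1 the sketch flags text positions 4, 5 and 6 (1-based) as match ends. Each text symbol costs k + 1 word updates, which is why longer patterns call for wider words.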

Example

ACTGCAT
CTGA (deletion)

ACTGCAT
CTGCA (insertion)

ACTGCAT
CTGA (substitution)

Extended version of the Wu-Manber algorithm

• Vectorization of the bit-wise operations: AND, OR, SHIFT_LEFT

• Efficient check for a match

Vectorization with 512-bit wide registers

• Use of union: flexible change between the intrinsic data format (__m512i) and an array of 16 integers

#define REG_NUM 16
union m512 {
    __m512i m512;
    unsigned int v[REG_NUM] __attribute__((aligned(64)));
};

• Intrinsic bit-wise functions: the elements within the vector are processed independently

– Bit-wise AND: _mm512_and_epi32

– Bit-wise OR: _mm512_or_epi32

– Bit-wise SHIFT_LEFT: the leftmost bit of v_{i+1} becomes the rightmost bit of v_i; implemented as a combination of 4 intrinsic functions

[Figure: A <<= 1 on the 512-bit word A = (v0, v1, …, v14, v15). _mm512_alignr_epi32 builds B = (v1, v2, …, v15); _mm512_srli_epi32 shifts the elements of B right by 31; _mm512_slli_epi32 shifts the elements of A left by 1; _mm512_mask_or_epi32 ORs B into A.]
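The effect of the four-step SHIFT_LEFT can be reproduced without vector hardware. The following is a minimal portable sketch in plain C, assuming that v[0] holds the most significant 32 bits; the type `word512` and the function `shl1_512` are hypothetical names, and the loops only mimic what the intrinsics do lane by lane rather than calling them.

```c
#include <assert.h>
#include <stdint.h>

#define REG_NUM 16

/* Portable stand-in for a 512-bit register: 16 x 32-bit elements,
 * v[0] being the most significant element (assumption of this sketch). */
typedef struct {
    uint32_t v[REG_NUM];
} word512;

/* Shift the whole 512-bit word left by one bit. */
void shl1_512(word512 *a)
{
    uint32_t carry[REG_NUM];

    assert(a != 0);

    /* Step 1: align, so that slot i sees element i+1; the last slot gets 0. */
    for (int i = 0; i + 1 < REG_NUM; ++i)
        carry[i] = a->v[i + 1];
    carry[REG_NUM - 1] = 0;

    /* Step 2: shift right by 31, keeping only the leftmost bit of each element. */
    for (int i = 0; i < REG_NUM; ++i)
        carry[i] >>= 31;

    /* Step 3: shift every element of A left by 1. */
    for (int i = 0; i < REG_NUM; ++i)
        a->v[i] <<= 1;

    /* Step 4: OR the carried bits into the rightmost bit of each element. */
    for (int i = 0; i < REG_NUM; ++i)
        a->v[i] |= carry[i];
}
```

Zeroing the last carry slot plays the role that a write mask would play in the masked OR, since no bit is carried into the least significant element.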

Auto-vectorization

• Uses an array of 16 uints to simulate a 512-bit machine word;

• Uses for-loops with directives:

– simd assert

– vector aligned

– ivdep

…
/* save the leftmost bit of each element; it becomes the rightmost bit of the element before it */
#pragma ivdep
#pragma simd assert
for (i = 1; i < REG_NUM; ++i)
    B[i-1] = (A[i] >> 31);
/* shift A left by 1 position */
#pragma vector aligned
#pragma simd assert
for (i = 0; i < REG_NUM; ++i)
    A[i] <<= 1;
…

Data parallelism on the many-core coprocessor

• Given: a pattern P, a collection of texts {Ti}

• The searches of P against any Ti and Tj (i ≠ j) can be performed in parallel

• Three multi-threaded versions using OpenMP

• wmIntr: Xeon Phi, intrinsic data and functions

• wmAutoVec: Xeon Phi, array of 16 uints, automatic vectorization

• wmHost: multicore CPU, array of 16 uints
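The data-parallel scheme can be sketched with a single OpenMP loop over the texts. In the sketch below, `count_matches` is a stand-in kernel (exact Shift-And matching in one 64-bit word) chosen only to keep the example short, and `search_collection` is a hypothetical driver; neither name comes from the authors' code, and any per-text routine, such as a Wu-Manber search, could take the kernel's place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in per-text kernel (assumption): counts exact occurrences of `pat`
 * in `text` with the Shift-And algorithm on one 64-bit word. */
size_t count_matches(const char *text, const char *pat)
{
    size_t m = strlen(pat), count = 0;
    uint64_t B[256] = {0}, R = 0;

    assert(m >= 1 && m <= 64);

    for (size_t i = 0; i < m; ++i)
        B[(unsigned char)pat[i]] |= (uint64_t)1 << i;

    for (const char *p = text; *p; ++p) {
        R = ((R << 1) | 1u) & B[(unsigned char)*p];
        count += (R >> (m - 1)) & 1u;  /* top pattern bit set: one occurrence */
    }
    return count;
}

/* Hypothetical driver: the texts are independent, so each loop iteration
 * can run on a different thread. Without OpenMP support the pragma is
 * ignored and the loop runs serially with the same result. */
void search_collection(const char *const *texts, size_t ntexts,
                       const char *pat, size_t *counts)
{
    #pragma omp parallel for schedule(dynamic)
    for (int t = 0; t < (int)ntexts; ++t)
        counts[t] = count_matches(texts[t], pat);
}
```

Dynamic scheduling is one reasonable choice when the texts differ in length; with equally sized texts a static schedule would serve as well.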

Testing Environment and Data

• Intel Xeon Phi 5110P

– 60 cores x 1.053 GHz

– 8 GB RAM

• Host:

– Intel Xeon E5-2670: 8 cores x 2.6 GHz

– 64 GB RAM

• Compiler: Intel icc with -O3 option

• Data:

– Human chromosome 21 (chr21)

– Texts: 32x or 128x of chr21 (1.1 GB or 4.3 GB)

– Pattern of length 511, extracted from chr21

• Serial time to evaluate speedups: wmHost with one thread

Scalability with the number of cores

• Both versions scale well with the number of cores of the Xeon Phi

• wmIntr is superior to wmAutoVec

[Figure: scalability of wmIntr and wmAutoVec with the number of threads; the numbers of threads are multiples of 59]

Scalability with the Levenshtein distance

The results show the advantage of using the intrinsic SIMD data types and functions of the Xeon Phi

Scalability of wmIntr

Linear increase with:

• The Levenshtein distance

• The size of input texts

Comparisons to related work

• Our work: approximate matching with patterns longer than common machine words (32 or 64 bits), using the Wu-Manber algorithm.

• [Külekci, 2009], [Faro and Külekci, 2012]: exact matching.

• [Fredriksson, 2003], [Chacón et al., 2014]: the Myers algorithm

– Independent of the maximal edit distance (k);

– Not easy to extend to matching with wildcards and regular expressions.

• [Li et al., 2011], [Tran et al., 2012]: focus on pattern lengths smaller than that of common machine words.

Our performance is therefore not directly comparable with that of the related work mentioned above

Conclusions and Perspectives

Conclusions

• Simulation of long machine words on the Intel Xeon Phi architecture;

• Extended implementation of the Wu-Manber algorithm;

• Multi-threaded versions of bit-parallel approximate pattern matching:

– Long pattern

– High Levenshtein distance

– Large target texts

• The source code can be downloaded at: http://xbitpar.sourceforge.net/

Conclusions and Perspectives (cont.)

Perspectives

• Matching with wildcards and regular expressions

• Mapping onto CUDA-enabled GPUs (SIMD feature of a “warp”)

• Preprocessing step in bioinformatics sequencing applications

– Fast filtering

– Seeding

• Other bit-parallel matching algorithms, such as the Myers algorithm.

• Other bit-parallel applications, such as finding the longest common subsequence (LCS).

Thank you for your attention!

References

[Wu and Manber, 1992] S. Wu and U. Manber, “Fast Text Searching Allowing Errors,” Communications of the ACM, vol. 35, no. 10, pp. 83–91, 1992.

[Li et al., 2011] H. Li, B. Ni, M. H. Wong, and K.-S. Leung, “A fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching,” in SASP, 2011, pp. 74–77.

[Külekci, 2009] M. O. Külekci, “Filter Based Fast Matching of Long Patterns by Using SIMD Instructions,” in Stringology, 2009, pp. 118–128.

[Faro and Külekci, 2012] S. Faro and M. O. Külekci, “Fast Multiple String Matching Using Streaming SIMD Extensions Technology,” in Proceedings of the 19th International Conference on String Processing and Information Retrieval, ser. SPIRE’12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 217–228.

[Fredriksson, 2003] K. Fredriksson, “Row-wise Tiling for the Myers’ Bit-Parallel Approximate String Matching Algorithm,” in String Processing and Information Retrieval, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2003, vol. 2857, pp. 66–79.

[Tran et al., 2012] T. T. Tran, M. Giraud, and J.-S. Varré, “Bit-Parallel Multiple Pattern Matching,” in Parallel Processing and Applied Mathematics, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2012, vol. 7204, pp. 292–301.

[Chacón et al., 2014] A. Chacón, S. Marco-Sola, A. Espinosa, P. Ribeca, and J. C. Moure, “Thread-cooperative, Bit-parallel Computation of Levenshtein Distance on GPU,” in Proceedings of the 28th ACM International Conference on Supercomputing, ser. ICS ’14. New York, NY, USA: ACM, 2014, pp. 103–112.
