x86opti 05 s5yata
Remove Branches in BitVector Select Operations
- marisa 0.2.2 -
Susumu Yata (@s5yata)
Brazil, Inc.
30 March 2013

Who I Am
Job
  Brazil, Inc. (groonga developer)
  We need R&D software engineers.
Personal research & development
  Tries: darts-clone, marisa-trie, etc.
  Corpus: Nihongo Web Corpus 2010 (NWC 2010)

BitVector and Marisa
Relationships between BitVector and Marisa.

BitVector
What's BitVector?
  A sequence of bits
Operations
  BitVector::get(i)
  BitVector::rank(i)
  BitVector::select(i)

BitVector – Get Operations
Interface
  BitVector::get(i)
Description
  The i-th bit ("0" or "1")
Example
  index  0  1  2  …  i-1  i  i+1  …  n-2  n-1
  bit    0  0  1  …   0   1   1   …   0    0
  get(i) returns the bit at index i.

BitVector – Rank Operations
Interface
  BitVector::rank(i)
Description
  The number of "1"s up to the i-th bit
Example
  index  0  1  2  …  i-1  i  i+1  …  n-2  n-1
  bit    0  0  1  …   0   1   1   …   0    0
  rank(i) answers: how many "1"s are there up to index i?

BitVector – Select Operations
Interface
  BitVector::select(i)
Description
  The position of the i-th "1"
Example
  index  0  1  2  …  n-2  n-1
  bit    0  0  1  …   0    0
  select(i) answers: where is the i-th "1"?
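
The three operations can be pinned down with a deliberately slow reference version. This is a minimal sketch, not marisa's BitVector: the type name NaiveBitVector is hypothetical, and two conventions are assumed because the slides do not spell them out, namely that rank(i) counts the "1"s in positions [0, i) and that select(i) counts from i = 0.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference type for illustration; not marisa's BitVector.
struct NaiveBitVector {
  std::vector<bool> bits;

  // The i-th bit.
  bool get(std::size_t i) const { return bits[i]; }

  // Number of 1s in positions [0, i) (assumed convention).
  std::size_t rank(std::size_t i) const {
    std::size_t ones = 0;
    for (std::size_t k = 0; k < i; ++k) ones += bits[k] ? 1 : 0;
    return ones;
  }

  // Position of the i-th 1, counting from i = 0 (assumed convention).
  // Returns bits.size() when the vector holds fewer than i + 1 ones.
  std::size_t select(std::size_t i) const {
    for (std::size_t k = 0; k < bits.size(); ++k) {
      if (!bits[k]) continue;
      if (i == 0) return k;
      --i;
    }
    return bits.size();
  }
};
```

Under these conventions rank(select(i)) == i holds for every valid i, which makes a slow version like this a handy oracle when testing the indexed versions later in the talk.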

Marisa
Who's Marisa?
  An ordinary human magician
What's Marisa?
  A static and space-efficient dictionary
Data structure
  Recursive LOUDS-based Patricia tries
Site
  http://code.google.com/p/marisa-trie

Marisa – Patricia
Patricia is a labeled tree.
Keys = Tree + Labels

  ID  Key
  0   "Argentina"
  1   "Armenia"
  2   "Brazil"
  3   "Canada"
  4   "Cyprus"

  Node  Label
  1     "Ar"
  2     "Brazil"
  3     'C'
  4     "gentina"
  5     "menia"
  6     "anada"
  7     "yprus"

[Figure: the Patricia trie for these keys, with nodes numbered 0 to 7]
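
"Keys = Tree + Labels" can be checked by rebuilding the five keys from the node table. The sketch below is only an illustration: the parent array is inferred from how the labels join into keys, node 0 is assumed to be an unlabeled root, and the names kLabels, kParent and restore_key are hypothetical.

```cpp
#include <string>
#include <vector>

// Node labels from the slide; node 0 is assumed to be an unlabeled root.
const std::vector<std::string> kLabels = {
    "", "Ar", "Brazil", "C", "gentina", "menia", "anada", "yprus"};

// Parent of each node, inferred from the keys (assumption, not from the slide).
const std::vector<int> kParent = {-1, 0, 0, 0, 1, 1, 3, 3};

// Rebuild a key by walking from a node up to the root and
// prepending each label on the way.
std::string restore_key(int node) {
  std::string key;
  for (; node > 0; node = kParent[node]) key = kLabels[node] + key;
  return key;
}
```

With this layout, nodes 4, 5, 2, 6 and 7 give back the keys with IDs 0 to 4.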

Marisa – Recursiveness
Unfortunately, this margin is too small…
  Keys = Tree + Labels
  Labels = Tree + Labels
  Labels = Tree + Labels  <- Reasonable
  Labels = Tree + Labels
  Labels = Tree + Labels
  Labels = Tree + Labels
  Labels = Tree + Labels
  …

Marisa – BitVector Usage
LOUDS
  Level-Order Unary Degree Sequence
Terminal flags
  A node is terminal ("1") or not ("0").
Link flags
  A node has a link to its multi-byte label ("1") or has a built-in single-byte label ("0").

Marisa – BitVector Usage
LOUDS
  BitVector::get(), select()
Terminal flags
  BitVector::get(), rank(), select()
Link flags
  BitVector::get(), rank()
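
To make "Level-Order Unary Degree Sequence" concrete, here is a small sketch that writes each node's child count in unary (that many "1"s, then a "0"), visiting nodes in level order. The leading "10" super-root is the textbook LOUDS convention and is an assumption here; marisa's actual bit layout may differ. The function name louds_bits is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Encode a tree as a LOUDS bit string, given the number of children
// of each node in level order. Starts with "10" for a super-root
// (textbook convention, assumed here).
std::string louds_bits(const std::vector<int>& degrees) {
  std::string bits = "10";
  for (int d : degrees) {
    bits.append(static_cast<std::size_t>(d), '1');  // one "1" per child
    bits.push_back('0');                            // terminator
  }
  return bits;
}
```

For the eight-node example trie (root with 3 children, node 1 with 2, node 3 with 2, the rest leaves), `louds_bits({3, 2, 0, 2, 0, 0, 0, 0})` yields the 17-bit string "10111011001100000", that is, 2n + 1 bits for n = 8 nodes.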

Implementations
How to implement Rank/Select operations.

Rank Dictionary
Index structures
  r_idx[x].abs = rank(512 × x), x = 0, 1, 2, …
  r_idx[x].rel[y] = rank(512 × x + 64 × y) - rank(512 × x), y = 1, 2, 3, …, 7
Calculation
  abs + rel + popcnt()

Rank Operations
Time complexity = O(1)
[Figure: r_idx.abs covers the preceding 512-bit blocks, r_idx.rel covers the preceding 64-bit words inside the current 512-bit block, and popcnt() covers the current 64-bit word]
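
The abs + rel + popcnt() calculation can be sketched as follows. This is an illustration under stated assumptions rather than marisa's code: the type names RankEntry and RankIndex and the field widths are hypothetical, rank(i) is taken to count the "1"s in [0, i), rel[0] is stored as 0 for simplicity, and the GCC/Clang builtin __builtin_popcountll stands in for popcnt().

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct RankEntry {
  uint32_t abs;     // rank(512 * x)
  uint16_t rel[8];  // rel[y] = rank(512 * x + 64 * y) - rank(512 * x)
};

struct RankIndex {
  std::vector<uint64_t> words;  // bit i is bit (i % 64) of words[i / 64]
  std::vector<RankEntry> r_idx;

  explicit RankIndex(std::vector<uint64_t> bits) : words(std::move(bits)) {
    // Pad with zero words so that rank(n) stays inside the arrays.
    words.resize((words.size() / 8 + 1) * 8, 0);
    uint32_t total = 0;
    for (std::size_t x = 0; x < words.size() / 8; ++x) {
      RankEntry e{};
      e.abs = total;
      uint16_t in_block = 0;
      for (std::size_t y = 0; y < 8; ++y) {
        e.rel[y] = in_block;
        in_block += __builtin_popcountll(words[8 * x + y]);
      }
      total += in_block;
      r_idx.push_back(e);
    }
  }

  // Number of 1s in positions [0, i): two lookups and one popcount.
  uint32_t rank(std::size_t i) const {
    const RankEntry& e = r_idx[i / 512];
    uint64_t below = words[i / 64] & ((uint64_t{1} << (i % 64)) - 1);
    return e.abs + e.rel[(i / 64) % 8] + __builtin_popcountll(below);
  }
};
```

No loop depends on the vector length inside rank(), which is where the O(1) bound comes from.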

Select Dictionary
Index structure
  s_idx[x] = select(512 × x), x = 0, 1, 2, …
Calculation
  1. Limit the range by using s_idx.
  2. Limit the range by using r_idx[x].abs.
  3. Limit the range by using r_idx[x].rel[y].
  4. Find the i-th "1" in the range.

Select Operations
[Figure: s_idx narrows the search to a run of 512-bit blocks, r_idx.abs to one 512-bit block, r_idx.rel to one 64-bit word, and the final round finds the bit inside that word]
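
The four narrowing steps can be sketched end to end. Several things below are assumptions for illustration: the type SelectIndex and its layout are hypothetical, select(i) counts from i = 0, both index arrays carry an extra sentinel entry, step 3 recomputes word popcounts instead of reading a stored rel array, and step 4 is a plain bit scan rather than the final round discussed in the talk.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct SelectIndex {
  std::vector<uint64_t> words;  // padded to whole 512-bit blocks
  std::vector<uint32_t> abs;    // abs[x] = rank(512 * x), plus a sentinel
  std::vector<uint32_t> s_idx;  // s_idx[k] = select(512 * k), plus a sentinel

  explicit SelectIndex(std::vector<uint64_t> bits) : words(std::move(bits)) {
    words.resize((words.size() + 7) / 8 * 8, 0);
    uint32_t ones = 0;
    for (std::size_t w = 0; w < words.size(); ++w) {
      if (w % 8 == 0) abs.push_back(ones);
      for (unsigned b = 0; b < 64; ++b) {
        if ((words[w] >> b) & 1) {
          if (ones % 512 == 0) s_idx.push_back(uint32_t(64 * w + b));
          ++ones;
        }
      }
    }
    abs.push_back(ones);                           // total number of 1s
    s_idx.push_back(uint32_t(64 * words.size()));  // one past the end
  }

  // Position of the i-th 1 (0-based); requires i < total number of 1s.
  std::size_t select(uint32_t i) const {
    // 1. s_idx brackets the 512-bit blocks that can hold the answer.
    std::size_t lo = s_idx[i / 512] / 512;
    std::size_t hi = (s_idx[i / 512 + 1] + 511) / 512;
    // 2. Binary search for the last block whose abs is <= i.
    while (lo + 1 < hi) {
      std::size_t mid = (lo + hi) / 2;
      if (abs[mid] <= i) lo = mid; else hi = mid;
    }
    i -= abs[lo];
    // 3. Pick the 64-bit word (popcounts stand in for rel[y]).
    std::size_t w = 8 * lo;
    for (;; ++w) {
      uint32_t c = uint32_t(__builtin_popcountll(words[w]));
      if (i < c) break;
      i -= c;
    }
    // 4. Find the i-th 1 inside the word.
    for (unsigned b = 0; b < 64; ++b) {
      if ((words[w] >> b) & 1) {
        if (i == 0) return 64 * w + b;
        --i;
      }
    }
    return 64 * words.size();  // not reached when the precondition holds
  }
};
```

Steps 1 to 3 are lookups and short searches over small ranges; step 4 is the part the rest of the talk makes branch-free.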

Select Final Round
Binary search & table lookup
  Three-level branches
[Figure: the 64-bit word is split into eight 8-bit bytes; three levels of "if" (1, then 2, then 4 branches) choose one byte, and a table lookup finds the bit inside it]

Improvements
How to remove the branches in the final round.

Original
    // x is the final 64-bit block (uint64_t).
    x = x - ((x >> 1) & MASK_55);
    x = (x & MASK_33) + ((x >> 2) & MASK_33);
    x = (x + (x >> 4)) & MASK_0F;
    x *= MASK_01;  // Tricky popcount
    if (i < ((x >> 24) & 0xFF)) {    // The first-level branch
      if (i < ((x >> 8) & 0xFF)) {   // The second-level branch
        if (i < (x & 0xFF)) {        // The third-level branch
          // The first byte contains the i-th "1".
        } else {
          // The second byte contains the i-th "1".
          …

Tips – Tricky PopCount
    x = x - ((x >> 1) & MASK_55);
    x = (x & MASK_33) + ((x >> 2) & MASK_33);
    x = (x + (x >> 4)) & MASK_0F;
Example (one byte)
  bits          0 1   1 1   0 0   1 0
  2-bit counts   1     2     0     1
  4-bit counts      3           1
  8-bit count             4

Tips – Tricky PopCount
    // MASK_01 = 0x0101010101010101ULL;
    // x = x + (x << 8) + (x << 16) + (x << 24) + …;
    x *= MASK_01;
Example (bytes shown from highest to lowest)
  per-byte counts (before)    4   1   3   5   2   6   3   4
  cumulative counts (after)  28  24  23  20  15  13   7   4
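
The two popcount tips combine into one small function. The sketch below uses the mask names from the slides with the usual constant values (assumed, since the slides only spell out MASK_01); the function name byte_prefix_popcounts is hypothetical.

```cpp
#include <cstdint>

const uint64_t MASK_55 = 0x5555555555555555ULL;
const uint64_t MASK_33 = 0x3333333333333333ULL;
const uint64_t MASK_0F = 0x0F0F0F0F0F0F0F0FULL;
const uint64_t MASK_01 = 0x0101010101010101ULL;

// Returns a word whose byte k holds the number of 1s in bytes 0..k of x.
uint64_t byte_prefix_popcounts(uint64_t x) {
  x = x - ((x >> 1) & MASK_55);              // counts per 2 bits
  x = (x & MASK_33) + ((x >> 2) & MASK_33);  // counts per 4 bits
  x = (x + (x >> 4)) & MASK_0F;              // counts per byte
  return x * MASK_01;                        // running sums per byte
}
```

For example, 0x0F01071F033F070F has per-byte counts 4 1 3 5 2 6 3 4 (highest byte first), and the function returns 0x1C1817140F0D0704, that is 28 24 23 20 15 13 7 4. No byte can overflow because the largest running sum is 64.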

+ SSE2 (After PopCount)
    // y[0 … 7] = i + 1;
    __m128i y = _mm_cvtsi64_si128((i + 1) * MASK_01);
    __m128i z = _mm_cvtsi64_si128(x);
    // Compare the 16 8-bit signed integers in y and z.
    // y[k] = (y[k] > z[k]) ? 0xFF : 0x00;
    y = _mm_cmpgt_epi8(y, z);  // PCMPGTB
    // The j-th byte contains the i-th "1".
    // TABLE is a 128-byte pre-computed table.
    uint8_t j = TABLE[_mm_movemask_epi8(y)];

Tips – PCMPGTB
    y = _mm_cvtsi64_si128((i + 1) * MASK_01);
    z = _mm_cvtsi64_si128(x);
    // y[k] = (y[k] > z[k]) ? 0xFF : 0x00;
    y = _mm_cmpgt_epi8(y, z);
Example (i + 1 = 20, bytes shown from highest to lowest)
  z        28    24    23    20    15    13     7     4
  y        20    20    20    20    20    20    20    20
  result  0x00  0x00  0x00  0x00  0xFF  0xFF  0xFF  0xFF

+ Tricks (After Comparison)
    uint64_t j = _mm_cvtsi128_si64(y);
    // Use one of the following three calculations.
    // Calculation without TABLE
    j = ((j & MASK_01) * MASK_01) >> 56;
    // Calculation with BSR
    j = (63 - __builtin_clzll(j + 1)) / 8;
    // Calculation with popcnt (SSE4.2 or SSE4a)
    j = __builtin_popcountll(j) / 8;
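
The three calculations are interchangeable ways of counting how many low bytes of the comparison result are 0xFF. The sketch below wraps each in a function so they can be compared side by side; the function names are hypothetical, and the builtins are the GCC/Clang ones already used on the slide.

```cpp
#include <cstdint>

const uint64_t MASK_01 = 0x0101010101010101ULL;

// m is a comparison result whose lowest j bytes are 0xFF and the rest 0x00.
// Each function returns j.

// Keep one bit per byte, then sum the bytes into the top byte.
unsigned index_by_multiply(uint64_t m) {
  return unsigned(((m & MASK_01) * MASK_01) >> 56);
}

// m + 1 is a single bit at position 8 * j; find it from the top.
// Requires m != ~0ULL, because the builtin is undefined for 0.
unsigned index_by_bsr(uint64_t m) {
  return unsigned(63 - __builtin_clzll(m + 1)) / 8;
}

// Every 0xFF byte contributes 8 set bits.
unsigned index_by_popcnt(uint64_t m) {
  return unsigned(__builtin_popcountll(m)) / 8;
}
```

For the mask in the PCMPGTB example, 0x00000000FFFFFFFF, all three return 4. An all-ones mask would mean that i is not smaller than the number of "1"s in the word, so it should not arise for a valid query.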

– SSE2 (Simple and Fast)
    // x is the final 64-bit block (uint64_t).
    x = x - ((x >> 1) & MASK_55);
    x = (x & MASK_33) + ((x >> 2) & MASK_33);
    x = (x + (x >> 4)) & MASK_0F;
    x *= MASK_01;  // Tricky popcount
    uint64_t y = (i + 1) * MASK_01;
    uint64_t z = x | MASK_80;
    // Compare the 8 7-bit unsigned integers in y and z.
    z = (z - y) & MASK_80;
    uint8_t j = __builtin_ctzll(z) / 8;

Tips – Comparison
    uint64_t y = (i + 1) * MASK_01;
    uint64_t z = x | MASK_80;
    // Compare the 8 7-bit unsigned integers in y and z.
    z = (z - y) & MASK_80;
Example (i + 1 = 0x14, bytes shown from highest to lowest)
  y                  0x14  0x14  0x14  0x14  0x14  0x14  0x14  0x14
  z = x | MASK_80    0x9C  0x98  0x97  0x94  0x8F  0x8D  0x87  0x84
  (z - y) & MASK_80  0x80  0x80  0x80  0x80  0x00  0x00  0x00  0x00
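
Putting the SSE2-free version together gives a complete select inside one 64-bit word. This is a sketch under assumptions: select64 is a hypothetical name, i counts from 0 and must be smaller than the number of "1"s in the word, MASK_80 is taken to be 0x8080808080808080, and the last step scans the chosen byte with a short fixed-length loop where the talk's implementation uses a table lookup.

```cpp
#include <cstdint>

const uint64_t MASK_55 = 0x5555555555555555ULL;
const uint64_t MASK_33 = 0x3333333333333333ULL;
const uint64_t MASK_0F = 0x0F0F0F0F0F0F0F0FULL;
const uint64_t MASK_01 = 0x0101010101010101ULL;
const uint64_t MASK_80 = 0x8080808080808080ULL;

// Position of the i-th 1 (0-based) in bits; requires i < popcount(bits).
unsigned select64(uint64_t bits, unsigned i) {
  uint64_t x = bits;
  x = x - ((x >> 1) & MASK_55);
  x = (x & MASK_33) + ((x >> 2) & MASK_33);
  x = (x + (x >> 4)) & MASK_0F;
  x *= MASK_01;  // byte k = number of 1s in bytes 0..k

  // Bit 7 of byte k survives iff (count in bytes 0..k) >= i + 1.
  uint64_t y = uint64_t(i + 1) * MASK_01;
  uint64_t z = ((x | MASK_80) - y) & MASK_80;
  unsigned j = unsigned(__builtin_ctzll(z)) / 8;  // first such byte

  // Number of 1s in the bytes below byte j (0 when j == 0).
  unsigned below = unsigned(((x << 8) >> (8 * j)) & 0xFF);
  unsigned rest = i - below;

  // Scan the eight bits of byte j (stand-in for the table lookup).
  for (unsigned b = 8 * j; b < 8 * j + 8; ++b) {
    if ((bits >> b) & 1) {
      if (rest == 0) return b;
      --rest;
    }
  }
  return 64;  // not reached when the precondition holds
}
```

The subtraction never borrows across bytes because every byte of x | MASK_80 is at least 0x80 while every byte of y is at most 64, which is why eight comparisons fit in one 64-bit subtraction.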

+ SSSE3 (For PopCount)
    // Get lower nibbles and upper nibbles of x.
    __m128i lower = _mm_cvtsi64_si128(x & MASK_0F);
    __m128i upper = _mm_cvtsi64_si128(x & MASK_F0);
    upper = _mm_srli_epi32(upper, 4);
    // Use PSHUFB for counting "1"s in each nibble.
    __m128i table = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0);
    lower = _mm_shuffle_epi8(table, lower);
    upper = _mm_shuffle_epi8(table, upper);
    // Merge the counts to get the number of "1"s in each byte.
    x = _mm_cvtsi128_si64(_mm_add_epi8(lower, upper));
    x *= MASK_01;

Tips – PSHUFB
    lower = _mm_cvtsi64_si128(x & MASK_0F);
    table = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, …);
    // Perform a parallel 16-way lookup.
    lower = _mm_shuffle_epi8(table, lower);
Example
  lower   12  8  7  4 15 13  7  4
  table    4  3  3  2  3  2  2  1  3  2  2  1  2  1  1  0
  result   2  1  3  1  4  3  3  1
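
What PSHUFB computes here can be emulated in plain scalar code, one byte at a time. The sketch below is only a model of the lookup, not a replacement for the SSSE3 version; the function name is hypothetical. Note that _mm_set_epi8 lists its arguments from the highest element to the lowest, so the table on the slide reads 0, 1, 1, 2, … when indexed from 0.

```cpp
#include <cstdint>

// Per-byte popcounts of x via a 16-entry nibble table, mimicking what
// the two PSHUFB lookups plus the byte-wise add produce.
uint64_t byte_popcounts_by_nibble_table(uint64_t x) {
  static const uint8_t TABLE[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                    1, 2, 2, 3, 2, 3, 3, 4};
  uint64_t result = 0;
  for (unsigned k = 0; k < 8; ++k) {
    unsigned byte = unsigned((x >> (8 * k)) & 0xFF);
    uint64_t count = TABLE[byte & 0x0F] + TABLE[byte >> 4];
    result |= count << (8 * k);
  }
  return result;
}
```

With the lower nibbles from the example, 0x0C0807040F0D0704, the result is 0x0201030104030301, matching the row 2 1 3 1 4 3 3 1. Multiplying the result by MASK_01 then gives the same cumulative counts as the shift-and-mask popcount.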

Evaluation
How effective the improvements are.

Environment
OS
  Mac OS X 10.8.3 (64-bit)
CPU
  Core i7 3720QM – Ivy Bridge
  2.6GHz – up to 3.6GHz
Compiler
  Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn)

Data
Source
  Japanese Wikipedia page titles
    gzip -cd jawiki-20130328-all-titles-in-ns0.gz | LC_ALL=C sort -R > data
Details
  Number of keys: 1,367,750
  Average length: 21.14 bytes
  Total length: 28,919,893 bytes

Binaries
marisa 0.2.1
    ./configure CXX=clang++ --enable-popcnt
    make
    tools/marisa-benchmark < data
marisa 0.2.2
    ./configure CXX=clang++ --enable-sse4
    make
    tools/marisa-benchmark < data

Results – marisa 0.2.1
Without improvements
  Baseline

  #Tries  Size[KB]  Build[Kqps]  Lookup[Kqps]  Reverse[Kqps]  Prefix[Kqps]  Predict[Kqps]
  1       11,811    724          1,105         1,223          1,038         711
  2        8,639    632            790           877            753         453
  3        8,001    621            750           816            708         406
  4        7,788    591            723           791            687         391
  5        7,701    590            712           781            680         384

Results – marisa 0.2.2
With improvements
  Same size
  Faster operations

  #Tries  Size[KB]  Build[Kqps]  Lookup[Kqps]  Reverse[Kqps]  Prefix[Kqps]  Predict[Kqps]
  1       11,811    757          1,198         1,359          1,115         772
  2        8,639    657            873         1,000            820         503
  3        8,001    621            817           924            770         453
  4        7,788    613            797           900            752         438
  5        7,701    610            787           884            737         427

Results – Improvements
Improvement ratios
  Same size
  Faster operations

  #Tries  Size[%]  Build[%]  Lookup[%]  Reverse[%]  Prefix[%]  Predict[%]
  1       0.00     +4.56      +8.42     +11.12      +7.42       +8.58
  2       0.00     +3.96     +10.52     +14.03      +8.90      +11.04
  3       0.00      0.00      +8.93     +13.24      +8.76      +11.58
  4       0.00     +3.72     +10.24     +13.78      +9.46      +12.02
  5       0.00     +3.39     +10.53     +13.19      +8.38      +11.20

Conclusion
  "Any sufficiently advanced technology is indistinguishable from magic."
  "Any sufficiently advanced technique is indistinguishable from magic."
  "You are a magician."