binary search doing it less wrong

Binary search doing it less wrong Paul Khuong October 30, 2014 Adserver Engineer @ AppNexus

Page 1: Binary search doing it less wrong

Binary search doing it less wrong

Paul Khuong

October 30, 2014

Adserver Engineer @ AppNexus

Page 2: Binary search doing it less wrong

“Binary search is slow”

- “Linear search is faster for small n” (branches)

- “Fancy layouts scale better” (caches)

Page 3: Binary search doing it less wrong

No branch, no misprediction

Data dependent conditional move.

http://pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-star-branch-mispredictions/

Page 4: Binary search doing it less wrong

Don’t... look for early matches

    if (*mid == needle) {
        return mid;
    }

Page 5: Binary search doing it less wrong

Don’t... try to adjust bounds tightly

    if (needle < *mid) {
        len = half;
    } else {
        low = mid + 1;
        len -= half + 1;
    }

G++ STL

Page 6: Binary search doing it less wrong

Don’t... do both

    if (comparison < 0) {
        high = mid;
    } else if (comparison > 0) {
        low = mid + 1;
    } else {
        return mid;
    }

glibc, FreeBSD libc

Page 7: Binary search doing it less wrong

Simple binary search

     midpoint (n = 5)
          |
          v
---------------------
| 0 | 1 | 2 | 3 | 4 |
---------------------
        |___________|
           n’ = 3

    while ((half = n / 2) > 0) {
        mid = low + half;
        low = (*mid < needle) ? mid : low;
        n -= half;
    }

So simple, it’s AVX2-able!
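The loop above can be wrapped into a complete function roughly as follows. This is a minimal sketch, not the author's code: the name `simple_search`, the `int` element type, and the `n > 0` precondition are assumptions made for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical wrapper around the slide's loop.
 * Precondition: base[0..n) is sorted in ascending order and n > 0.
 * Returns a pointer to the last element that is < needle, or to base[0]
 * when no element is smaller than needle.
 */
const int *simple_search(const int *base, size_t n, int needle)
{
    const int *low = base;
    size_t half;

    /* Invariant: the result lies in [low, low + n). */
    while ((half = n / 2) > 0) {
        const int *mid = low + half;
        /* A data-dependent select; compilers can emit a conditional move here. */
        low = (*mid < needle) ? mid : low;
        n -= half;
    }
    return low;
}
```

With this convention the caller finishes the job with one extra comparison: `low + (*low < needle)` is the first position whose element is not smaller than `needle`, which may be one past the end of the array.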

Page 8: Binary search doing it less wrong

Assume a decent compiler

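loop:
    lea    (%rdx,%rcx,4), %rdi   # mid = low + half (4-byte elements)
    cmp    (%rdi), %esi          # compare needle with *mid
    cmovge %rdi, %rdx            # low = mid if needle >= *mid
    sub    %rcx, %rax            # n -= half
    mov    %rax, %rcx            # half = n
    shr    %rcx                  # half = n / 2
    jnz    loop                  # repeat while half != 0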

[Figure: diagram of the loop's instructions, labelled by iteration: shr 0; lea 1, sub 1; cmp/load 1, shr 1; cmov 0; cmov 1; lea 2; sub 2; shr 2]

Page 9: Binary search doing it less wrong

Microbenchmark

Implementations:

- branch (STL)

- both (libc)

- early (only early termination)

- simple (cmov)

Input: 32-bit ints (random, ≈ 5% density)

- 8, 16, 32, ..., 1024

- 10, 50, 100, 200, ..., 1000

Report average of 128 lookups (median, 1st/99th percentile)

Page 10: Binary search doing it less wrong

[Figure: cycles per lookup (y axis, 0 to 80) against size (n × 32-bit ints; x axis ticks at 10, 100, 1000). Panels: first, last, bimodal, random, intersection. Series by implementation: branch, both, early, simple.]

Page 11: Binary search doing it less wrong

Caching: n = 2^k (− 1) ⇒ aliasing issues

Midpoints:

0x200000

0x100000

0x080000

0x040000

0x020000

0x010000

....
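To see why these midpoints collide, one can compute which cache set each one maps to. The sketch below assumes a cache with 64-byte lines and 64 sets (for example a 32 KiB, 8-way L1D) and treats the midpoints as byte offsets from a suitably aligned base; both are assumptions for illustration, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed cache geometry: 64-byte lines, 64 sets. */
enum { LINE_BYTES = 64, NUM_SETS = 64 };

/* Set index selected by a byte offset under the assumed geometry. */
unsigned cache_set(uintptr_t offset)
{
    return (unsigned)((offset / LINE_BYTES) % NUM_SETS);
}

/*
 * Count the distinct sets touched by `probes` successive midpoints,
 * starting at `mid` and halving each time (0x200000, 0x100000, ...).
 */
unsigned distinct_sets(uintptr_t mid, unsigned probes)
{
    uint64_t seen = 0;   /* one bit per set */
    unsigned count = 0;

    for (unsigned i = 0; i < probes; i++, mid /= 2) {
        uint64_t bit = UINT64_C(1) << cache_set(mid);
        if (!(seen & bit)) {
            seen |= bit;
            count++;
        }
    }
    return count;
}
```

Under these assumptions every offset that is a multiple of 4096 lands in set 0, so the six midpoints listed above all compete for the same set: `distinct_sets(0x200000, 6)` is 1.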

- Run proper microbenchmarks

- Offset “mid” point ((n / 2) + (n / 64))

- 3-way (“ternary”) search

In the wild: Bentley & Saxe dynamisation or 2^k ≤ n < 2^(k+1).

http://pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/
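One way the offset midpoint could be folded into the simple loop is sketched below. This is an illustration under assumptions, not necessarily what the linked post does: the name `offset_search` is hypothetical, and because the pivot is off-centre the remaining length has to be selected per side rather than always being `n - half`.

```c
#include <stddef.h>

/*
 * Hypothetical variant of the simple search with the pivot at
 * (n / 2) + (n / 64) instead of n / 2.
 * Precondition: base[0..n) is sorted in ascending order and n > 0.
 * Returns a pointer to the last element that is < needle, or to base[0]
 * when no element is smaller than needle.
 */
const int *offset_search(const int *base, size_t n, int needle)
{
    const int *low = base;

    /* Invariant: the result lies in [low, low + n). */
    while (n > 1) {
        size_t half = n / 2 + n / 64;      /* 1 <= half < n when n >= 2 */
        const int *mid = low + half;
        int go_right = *mid < needle;

        /* Two data-dependent selects; no early exit, no tight bounds. */
        low = go_right ? mid : low;
        n = go_right ? n - half : half;
    }
    return low;
}
```

The trip count now depends on the data, since the two sides differ in size. Whether the offset helps in practice depends on the array size and the cache, so it is worth measuring rather than assuming.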

Page 12: Binary search doing it less wrong

Practical use case: sparse bitmatrix mult + projection

[Figure: Sparse Bit Matrix × Bit Set (Sparse Bit Vector)]

Page 13: Binary search doing it less wrong

a.k.a. (pre)sorted equijoin

Inner loop:

- branch-free (simple)

- branches (STL)

- unrolled (3-ary branch-free)
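The slides do not show the 3-ary inner loop, so the following is only a guess at its shape: a branch-free search that probes two pivots per iteration. The name `ternary_search` and the handling of the last two elements are assumptions.

```c
#include <stddef.h>

/*
 * Hypothetical branch-free 3-ary search.
 * Precondition: base[0..n) is sorted in ascending order and n > 0.
 * Returns a pointer to the last element that is < needle, or to base[0]
 * when no element is smaller than needle.
 */
const int *ternary_search(const int *base, size_t n, int needle)
{
    const int *low = base;
    size_t third;

    /* Invariant: the result lies in [low, low + n). */
    while ((third = n / 3) > 0) {
        /* Both pivots are computed from the old low, so the two loads
         * do not depend on each other. */
        const int *mid1 = low + third;
        const int *mid2 = mid1 + third;

        low = (*mid1 < needle) ? mid1 : low;
        low = (*mid2 < needle) ? mid2 : low;
        n -= 2 * third;
    }
    /* n is now 1 or 2; one more select settles the second element. */
    if (n == 2) {
        low = (low[1] < needle) ? low + 1 : low;
    }
    return low;
}
```

Each iteration keeps `n - 2 * (n / 3)` elements, about a third of the range, at the cost of two comparisons.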

Reuse results

- roving lower bound

- galloping search
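A roving lower bound can be sketched as a sorted intersection in which each search starts where the previous one stopped. The function names, the `int` keys, and the assumption that the keys are strictly increasing are illustrative choices, not taken from the slides.

```c
#include <stddef.h>

/*
 * First index i in a[0..n) with a[i] >= needle, or n if there is none.
 * Built on the branch-free loop from the "Simple binary search" slide.
 */
size_t simple_lower_bound(const int *a, size_t n, int needle)
{
    const int *low = a;
    size_t half;

    if (n == 0) {
        return 0;
    }
    while ((half = n / 2) > 0) {
        const int *mid = low + half;
        low = (*mid < needle) ? mid : low;
        n -= half;
    }
    /* low is the last element < needle, or a[0] if there is none. */
    return (size_t)(low - a) + (*low < needle);
}

/*
 * Hypothetical intersection of two sorted arrays. keys[] is assumed to be
 * strictly increasing. Matches are written to out[], which must have room
 * for nkeys values. Returns the number of matches.
 */
size_t intersect_roving(const int *keys, size_t nkeys,
                        const int *hay, size_t nhay, int *out)
{
    size_t start = 0;   /* roving lower bound: hay[0..start) is too small */
    size_t count = 0;

    for (size_t i = 0; i < nkeys && start < nhay; i++) {
        start += simple_lower_bound(hay + start, nhay - start, keys[i]);
        if (start < nhay && hay[start] == keys[i]) {
            out[count++] = keys[i];
        }
    }
    return count;
}
```

Because the keys are visited in ascending order, everything before `start` can be skipped for all later keys, so each search only sees the remaining suffix of `hay`.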

Page 14: Binary search doing it less wrong

Gather phase around peak time

Page 15: Binary search doing it less wrong

Sorted array search: a decent finger search

Let ∆ = k_i − k_{i−1}

Galloping search: ≈ 2 lg ∆ comparisons

3-ary search: ≈ 2 log_3 ∆ D$ misses

Roving search: ≈ lg n D$ misses
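The 2 lg ∆ figure for galloping search comes from two phases of about lg ∆ comparisons each: a doubling phase that brackets the answer, then a binary search inside the bracket. A minimal sketch, assuming `int` keys and a hypothetical function name; the refinement phase uses a plain branchy loop for readability, and the branch-free loop from the "Simple binary search" slide could be substituted.

```c
#include <stddef.h>

/*
 * Hypothetical galloping lower bound.
 * Precondition: a[0..n) is sorted in ascending order, start <= n, and every
 * element before start is known to be < needle.
 * Returns the first index i >= start with a[i] >= needle, or n.
 */
size_t gallop_lower_bound(const int *a, size_t n, size_t start, int needle)
{
    size_t lo = start;     /* everything before lo is < needle */
    size_t probe = start;
    size_t step = 1;

    /* Doubling phase: probe start, start + 1, start + 3, start + 7, ... */
    while (probe < n && a[probe] < needle) {
        lo = probe + 1;
        probe += step;
        step *= 2;
    }

    /* Refinement phase: the answer lies in [lo, hi]. */
    size_t hi = (probe < n) ? probe : n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < needle) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}
```

When the answer is a distance ∆ from `start`, the doubling phase stops after roughly lg ∆ probes and leaves a bracket no wider than about ∆, which the second phase resolves in roughly lg ∆ more comparisons.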

Page 16: Binary search doing it less wrong

And some spooky action at a distance

Page 17: Binary search doing it less wrong

Sorted arrays work well on contemporary µarch

Don’t be (too) clever:

- Careful with branches

- Avoid cache aliasing/bad benchmarks

- Reuse bounds when repeating searches

Cleverness is useful, but getting simple right goes a long way.