Molecular Shape Searching on GPUs: A Brave New World
DESCRIPTION
Shape is a fundamental three-dimensional molecular property and a powerful descriptor for molecular comparison and similarity assessment; similarity in shape has proven to be a very effective predictor of similarity in biology. As such, shape-based virtual screening has become an integral part of computational drug discovery, due to both its speed and efficacy. OpenEye’s recent port of its shape similarity application, ROCS, to the GPU has resulted in a virtual screening tool of unprecedented power: FastROCS. FastROCS’ speed allows it to perform large-scale calculations of a kind inaccessible in the past, and it has accelerated more routine shape searching to the point that it has become competitive with more traditional, but less effective, two-dimensional methods. Go through the slides to learn more. Try GPUs for free here: www.Nvidia.com/GPUTestDrive
TRANSCRIPT
FastROCS: What does it mean to be “fast”?
OpenEye Scientific Software Brian Cole
March 26, 2013 © 2013 OpenEye Scientific Software
FastROCS and the “Chasm”
ROCS: Rapid Overlay of Chemical Structures
LeadHopper
And then you wait…
What is FastROCS?
[Figure, repeated across three slides: shape overlays per second, CPU versus GPU, on logarithmic (1 to 1,000,000) and linear (0 to 600,000) axes. High is best.]
But I want it now!
[Figure: log elapsed time in seconds (1 to 100,000) versus log number of cores/GPUs (1 to 100) for ROCS and FastROCS. Low is best.]
Riding Moore’s Law
[Figure: shape overlays per second (0 to 2,000,000) by GPU model: C1060, C2050, C2075, C2090, K10, K20. High is best.]
ROCS user base
• Every Pharma R&D
• Many BioTechs
• Many Universities
• National Labs and Research Centers
• Other software companies
Licenses by Year
[Figure: ROCS and FastROCS licenses by year, 2009 to 2012. High is best.]
Licenses by Year (Linear Scale)
[Figure: ROCS and FastROCS licenses by year, 2009 to 2012, linear scale. Annotations: 15%, “Pharmageddon”.]
All ROCS users (linear scale)
[Figure: Academics, ROCS, and FastROCS users by year, 2009 to 2012, linear scale. Annotation: 3%.]
Technology Adoption Lifecycle
[Figure: technology adoption lifecycle curve with segments of 2.5%, 13.5%, 34%, 34%, and 16%; FastROCS marked on the curve.]
What’s in the “chasm”?
• “ROCS is already fast enough”
• “The results aren’t bitwise comparable”
• “There’s nothing else to run on the GPU”
• “GPUs are different”
GTC!
Some other time…
FastROCS Quick Start
• ctrl-alt-F1 (to switch to a non-X-server terminal)
• login as root
• /sbin/init 3 (to turn off the X server)
• ./NVIDIA-Linux-x86_64-285.05.09.run
• reboot
• ./cuda.sh to give /dev/nvidia* correct permissions
• tar -xzf fastrocs-1.3.1-RHEL5-x64-OpenCL-1.1-CUDA-4.1.tar.gz
• openeye/bin/ShapeDatabaseServer.py database.oeb.gz
• openeye/bin/ShapeDatabaseClient.py localhost:8080 query.sdf out.sdf
ROCS Quick Start
• tar -xzf ROCS-3.1.1-RHEL5-x64.tar.gz
• openeye/bin/rocs query.sdf database.oeb.gz
Still a barrier to entry to work around!
This is even worse!
fastrocs-1.3.1-RHEL5-x64-OpenCL-1.1-CUDA-4.1.tar.gz
NVidia OpenCL binaries are tightly locked to a particular driver version
Worthwhile to upgrade
[Figure: conformers per second (0 to 800,000) on a C2050 with the 260 driver versus the 295 driver. Annotation: 11%. High is best.]
Needed for new hardware
[Figure: conformers per second (0 to 1,200,000), C2050 (295 driver) versus M2090 (295 driver). High is best.]
Scalability between drivers (4x C2050)
[Figure: speedup (single-GPU time / multi-GPU time) versus number of GPUs (1 to 4). Series: ideal, 260 driver, 295 driver. High is best.]
Really bad for 8x M2090
[Figure: speedup (single-GPU time / multi-GPU time) versus number of GPUs (1 to 8). High is best.]
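The scaling plots above use speedup as defined on their axis: single-GPU time divided by multi-GPU time. A minimal Python sketch of that calculation follows. Parallel efficiency (speedup divided by GPU count) is added here as a convenience and does not appear on the slides, and the timings are made-up placeholders, not measurements from this talk.

```python
def speedup(single_gpu_time, multi_gpu_time):
    """Speedup as defined on the slide axis: single-GPU time / multi-GPU time."""
    if multi_gpu_time <= 0:
        raise ValueError("multi_gpu_time must be positive")
    return single_gpu_time / multi_gpu_time


def efficiency(single_gpu_time, multi_gpu_time, n_gpus):
    """Fraction of ideal (linear) scaling achieved; 1.0 means ideal."""
    return speedup(single_gpu_time, multi_gpu_time) / n_gpus


# Hypothetical elapsed times in seconds, keyed by GPU count (illustrative only).
timings = {1: 80.0, 2: 42.0, 4: 25.0, 8: 20.0}

for n_gpus, elapsed in sorted(timings.items()):
    s = speedup(timings[1], elapsed)
    e = efficiency(timings[1], elapsed, n_gpus)
    print(f"{n_gpus} GPUs: speedup {s:.2f}x, efficiency {e:.0%}")
```

A curve that tracks the "Ideal" line has efficiency close to 1.0 at every GPU count; a curve that flattens out, as in the 8x M2090 plot, shows efficiency falling as GPUs are added.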
Ways to transfer to device
• CL_MEM_USE_HOST_PTR
  – kernelBuf = clCreateBuffer(CL_MEM_USE_HOST_PTR)
• CL_MEM_ALLOC_HOST_PTR|CL_MEM_COPY_HOST_PTR
  – kernelBuf = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR|CL_MEM_COPY_HOST_PTR)
• CL_MEM_ALLOC_HOST_PTR
  – kernelBuf = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR) - cacheable
  – ptr = clEnqueueMapBuffer(kernelBuf, CL_MAP_WRITE)
  – memcpy(ptr, data)
  – clEnqueueUnmapMemObject(ptr)
• clEnqueueMapBuffer
  – kernelBuf = clCreateBuffer() - cacheable
  – ptr = clEnqueueMapBuffer(kernelBuf, CL_MAP_WRITE)
  – memcpy(ptr, data)
  – clEnqueueUnmapMemObject(ptr)
• clEnqueueWriteBuffer
  – kernelBuf = clCreateBuffer() - cacheable
  – clEnqueueWriteBuffer(kernelBuf, data)
• oclCopyCompute
  – pinnedBuf = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR|CL_MEM_READ_WRITE) - cacheable
  – pinnedPtr = clEnqueueMapBuffer(pinnedBuf, CL_MAP_WRITE) - cacheable
  – memcpy(pinnedPtr, data)
  – kernelBuf = clCreateBuffer() - cacheable
  – clEnqueueWriteBuffer(kernelBuf, pinnedPtr)
Ways to transfer from device
• CL_MEM_ALLOC_HOST_PTR
  – kernelBuf = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR) - cacheable
  – ptr = clEnqueueMapBuffer(kernelBuf, CL_MAP_READ)
  – memcpy(data, ptr)
  – clEnqueueUnmapMemObject(ptr)
• clEnqueueMapBuffer
  – kernelBuf = clCreateBuffer() - cacheable
  – ptr = clEnqueueMapBuffer(kernelBuf, CL_MAP_READ)
  – memcpy(data, ptr)
  – clEnqueueUnmapMemObject(ptr)
• clEnqueueReadBuffer
  – kernelBuf = clCreateBuffer() - cacheable
  – clEnqueueReadBuffer(kernelBuf, data)
• oclCopyCompute
  – pinnedBuf = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR|CL_MEM_READ_WRITE) - cacheable
  – pinnedPtr = clEnqueueMapBuffer(pinnedBuf, CL_MAP_READ) - cacheable
  – kernelBuf = clCreateBuffer() - cacheable
  – clEnqueueReadBuffer(kernelBuf, pinnedPtr)
  – memcpy(data, pinnedPtr)
FastROCS scalability across 8x M2070
[Figure: speedup (time sequential / time parallel, 0 to 9) versus number of GPUs utilized (1 to 8), with five results per GPU count.]
Lessons from the mess
• clEnqueueWriteBuffer > clEnqueueMapBuffer
• clEnqueueMapBuffer >> clEnqueueReadBuffer
• CL_MEM_* constants aren’t worth the effort
CUDA?
• Serious customers will only use NVidia cards
• Pinned memory
• Better support for binaries and compatibility
• CUDA support >> OpenCL support
FastROCS CUDA port
[Figure: conformers per second (0 to 3,000,000) for OpenCL, CUDA, and CUDA-pinned on 2xC2075, 2xC2090, and 2xK20. High is best.]
CUDA Scaling?
[Figure: conformers per second (0 to 8,000,000) versus number of individual K10 GPUs (1 to 8; note that each K10 has 2 physical GPUs on the board). Series: CUDA, OpenCL, ideal. High is best.]
CUDA vs OpenCL: Ding Ding!
• Portability vs Innovation
• NVidia vs Intel and AMD
• Open vs Proprietary
• Customers don’t care…
ROCS Implementations
• We only care a little…
• Fortran code (1995)
• C code (1999)
• C++ wrapper code (2003)
• OpenCL code (2009)
• CUDA code (2012)
• C++ thread-safe code (2013)
OpenEye Software
• Lots of Software
  – 14 products
  – 13 software libraries
• C++ (no SIMD)
  – 2.5 million lines
• Python
  – 416 thousand lines
• Java
  – 63 thousand lines
• C#
  – 38 thousand lines
The People
[Figure: staff breakdown: 20 programmers, 12 hardcore scripters, 10 other stuff.]
• GPGPU = ½ of a developer
  – Only 2.5% of development effort
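The 2.5% figure follows from half a developer out of the 20 programmers on the staff chart. A small Python sketch of that arithmetic, set next to the adoption lifecycle segments quoted on the lifecycle slides. The segment names are the standard labels for the adoption curve and are an assumption here, since the transcript preserves only the percentages.

```python
programmers = 20         # from the staff chart on this slide
gpgpu_developers = 0.5   # "1/2 of a developer"

gpgpu_share = gpgpu_developers / programmers
print(f"GPGPU share of development effort: {gpgpu_share:.1%}")

# Segment sizes as quoted on the slides; the names are assumed standard labels.
lifecycle = {
    "innovators": 0.025,
    "early adopters": 0.135,
    "early majority": 0.34,
    "late majority": 0.34,
    "laggards": 0.16,
}

# The segments should account for the whole population.
total = sum(lifecycle.values())
print(f"Lifecycle segments sum to {total:.0%}")

# OpenEye's GPGPU effort is the same size as the first segment.
matches_first_segment = abs(gpgpu_share - lifecycle["innovators"]) < 1e-12
print(f"Same size as the first segment: {matches_first_segment}")
```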
Technology Adoption Lifecycle
[Figure: technology adoption lifecycle curve with segments of 2.5%, 13.5%, 34%, 34%, and 16%; OpenEye GPGPU development marked on the curve.]
LinkedIn skills
2.2%
Technology Adoption Lifecycle
[Figure: technology adoption lifecycle curve with segments of 2.5%, 13.5%, 34%, 34%, and 16%; GPGPU development marked on the curve.]
I Believe…
• GPGPU computing can become ubiquitous…
• By expressing parallelism everywhere…
• We can make it easy for our customers…
  – Pre-installed in every operating system
  – Integrated seamlessly into every language
  – Then eventually becoming the CPU
Acknowledgements
• Nikolai Sakharnykh (NVidia)
• Dave Mullaly (HP)
• Exxact Computing
Father of “ROCS”
Andrew Grant April 28th 1963 - December 29th 2012
Dude, where’s my color?
[Figure: DUD average AUC (0 to 0.9) for ROCS and FastROCS. Series: shape only, with color.]
ROCS vs FastROCS Histogram
[Figure: histogram of number of targets (0 to 12) by Kendall tau correlation coefficient (0.10 to 1.00 in steps of 0.05).]