
Page 1: The fastest, distributed, in-memory database accelerated by GPUs

The fastest, distributed, in-memory database accelerated by GPUs, for analyzing large and streaming data

Charles Sutton, Managing Director Europe
October 27, 2016

Page 2: The fastest, distributed, in-memory database accelerated by GPUs

Evolution of Data Processing

DATA WAREHOUSE (1990s-2000s): RDBMS and data warehouse technologies enable organizations to store and analyze growing volumes of data on high-performance machines, but at high cost.

DISTRIBUTED STORAGE (2005…): Hadoop and MapReduce enable distributed storage and processing across multiple machines. Storing massive volumes of data becomes more affordable, but performance is slow.

AFFORDABLE IN-MEMORY (2010…): Affordable memory allows for faster data reads and writes. HBase, HANA, and MemSQL provide faster analytics. At scale, processing now becomes the bottleneck.

GPU ACCELERATED COMPUTE (2016…): GPU cores bulk-process tasks in parallel, far more efficiently for many data-intensive tasks than CPUs, which process those tasks linearly. Near-infinite compute on Power hardware ushers in a new generation of possibilities.

Page 3: The fastest, distributed, in-memory database accelerated by GPUs

GPU Acceleration Overcomes Processing Bottlenecks

4,000+ cores per device in many cases, versus 8 to 32 cores per typical CPU-based device.

High-performance computing is trending toward GPUs to solve massive processing challenges.

GPU acceleration brings high-performance compute to Power hardware.

Parallel processing is ideal for scanning entire datasets and brute-force compute.

GPUs are designed around thousands of small, efficient cores that are well suited to performing repeated, similar instructions in parallel. This makes them well suited to the compute-intensive workloads of large data sets.
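As a purely illustrative sketch (not Kinetica code), the brute-force min/max scan below is written both as a linear CPU-style loop and as a chunked, data-parallel reduction, with NumPy and Python worker processes standing in for thousands of GPU cores.

# Illustrative sketch only: the same brute-force column scan expressed serially
# and as a chunked, data-parallel reduction. Worker processes stand in for the
# thousands of GPU cores described above.
import numpy as np
from multiprocessing import Pool

def serial_min_max(values):
    # CPU-style linear scan: one worker walks the whole column
    lo = hi = values[0]
    for v in values:
        lo, hi = min(lo, v), max(hi, v)
    return lo, hi

def chunk_min_max(chunk):
    # Each worker scans only its own shard of the column
    return chunk.min(), chunk.max()

def parallel_min_max(values, workers=8):
    # Split the column into shards, scan the shards in parallel, combine results
    shards = np.array_split(values, workers)
    with Pool(workers) as pool:
        partials = pool.map(chunk_min_max, shards)
    return min(p[0] for p in partials), max(p[1] for p in partials)

if __name__ == "__main__":
    column = np.random.rand(10_000_000)   # stand-in for one column of a large table
    print(serial_min_max(column))
    print(parallel_min_max(column))

The serial version touches one value at a time; the parallel version scans every shard at once and then combines the partial results, which is the pattern a GPU database applies at much larger scale.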

Page 4: The fastest, distributed, in-memory database accelerated by GPUs

Enabling High-Performance Data Analytics

Massive Parallel Processing (MPP): Process and manage massive amounts of data cost-effectively.

In-Memory Computing: Deliver human-acceptable response times to complex analytical queries.

High-Performance Computing (HPC) Hardware: Leverage GPU-accelerated hardware to massively improve performance with better cost- and energy-efficiency.

Page 5: The fastest, distributed, in-memory database accelerated by GPUs

Who is Kinetica?

2009: 'HPC Research Project' incubated by the US military
2011: Patent US8373710 B1 issued to GPUdb
2012: US Army deploys GPUdb
2013: GPUdb commercially available
2014: IDC HPC Innovation Excellence Award (Army); GPUdb goes into production at USPS
2015: Iron Net selects GPUdb for cyber defense; PG&E selects GPUdb for electric grid analysis; IDC HPC Innovation Excellence Award (USPS)
2016: Rebrand to Kinetica

Page 6: The fastest, distributed, in-memory database accelerated by GPUs

Enabling New Enterprise Solutions


Retail: Customer 360, customer sentiment, supply chain optimization. Correlating data from point-of-sale (POS) systems, social media streams, weather forecasts, and even wearable devices. Better able to track inventory in real time, enabling efficient replenishment and avoiding out-of-stock situations.

Powering a high-performance "Analytics as a Service" solution: Delivering customer-focused services by leveraging all available transactional data. Business users previously had no way to run customized analytics themselves; IT had to do it, and query response times took tens of minutes (some over 2 hours), limiting the ability to analyze and use the data.

Financial services: Large-scale risk aggregations and billion+ row joins in sub-second time (5TB+ tables choke RDBMS joins, and Hadoop is too slow).

Manufacturing IoT: Live streaming analytics on component functionality to ensure safety (avoid failures) and validate warranty claims.

Ridesharing: View all passengers and drivers to monitor behavioral analytics. Watch for hard acceleration, sudden braking, too many U-turns, etc., to reduce risk and lawsuits from unsafe driving.

Page 7: The fastest, distributed, in-memory database accelerated by GPUs

PERFORMANCE BENCHMARKS

BoundingBox on 1B records: 561x faster on average than MemSQL & NoSQL

Max/Min on 1B records: 712x faster on average than MemSQL & NoSQL

Histogram on 1B records: 327x faster on average than MemSQL & NoSQL

SQL query on 30 nodes: 104x faster than SAP HANA
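For context, the three operations above map to single calls in the database's Python client; the sketch below is indicative only (host, table, and column names are assumptions, and the signatures should be confirmed against the client reference for your version).

# Hedged sketch of the benchmarked operations using the Kinetica ("gpudb") Python client.
# Table and column names are illustrative.
import gpudb

db = gpudb.GPUdb(host="127.0.0.1", port="9191")

# Bounding box: keep only records whose (x, y) coordinates fall inside a rectangle
db.filter_by_box(table_name="points", view_name="points_in_box",
                 x_column_name="x", min_x=-74.3, max_x=-73.7,
                 y_column_name="y", min_y=40.5, max_y=40.9)

# Max/Min: single-pass statistics over one column of the billion-row table
stats = db.aggregate_statistics(table_name="points", column_name="value",
                                stats="min,max")

# Histogram: bucket the same column into fixed-width intervals
hist = db.aggregate_histogram(table_name="points", column_name="value",
                              start=0, end=100, interval=10)
print(stats, hist)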


Page 8: The fastest, distributed, in-memory database accelerated by GPUs

Performance vs. HANA

IN-MEMORY KNOCKOUT

A 30-node Kinetica cluster was tested against a 100+ node HANA cluster holding 3 years of retail data (150 billion rows) with a range of queries: SELECT, GROUP BY, and JOIN.

[Chart: query time in seconds (0-50 s) for Kinetica vs. HANA across the SELECT, GROUP BY, and JOIN queries.]

Source: Retail customer data for the 10 most complex SQL queries that were being run on HANA, 2016.

Page 9: The fastest, distributed, in-memory database accelerated by GPUs

Scale Up or Out Predictably

Distributed architecture scales predictably for both:

• Data ingestion

• Query execution

[Diagram: clusters built from commodity hardware or Power servers with GPUs. Each node holds columnar in-memory data shards (A1-C4) with disk persistence behind an HTTP head node; nodes are added for on-demand linear scale, and MULTI-HEAD INGEST spreads ingestion across the nodes.]
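Multi-head ingest scales because each record is routed straight to the worker node that owns its shard instead of being funneled through a single head node. The sketch below illustrates only that routing idea; it is not the Kinetica client API, and the node addresses and shard key are invented for illustration.

# Conceptual sketch of shard-based routing behind multi-head ingest (not the
# Kinetica client API): each batch is grouped by the worker that owns its shard,
# so ingest throughput grows with the number of nodes.
import hashlib

WORKERS = ["node1:9192", "node2:9192", "node3:9192", "node4:9192"]  # illustrative

def shard_for(key, num_shards=len(WORKERS)):
    # A stable hash of the shard key decides which worker owns the record
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def route(records):
    # Group records by owning worker; each group would be posted to that node directly
    batches = {worker: [] for worker in WORKERS}
    for rec in records:
        batches[WORKERS[shard_for(rec["id"])]].append(rec)
    return batches

sample = [{"id": "sensor-%d" % i, "reading": i * 0.1} for i in range(1000)]
for worker, batch in route(sample).items():
    print(worker, len(batch))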

Page 10: The fastest, distributed, in-memory database accelerated by GPUs

Efficiency = Savings: Hardware costs and footprint are a fraction of those of standard in-memory databases.

Page 11: The fastest, distributed, in-memory database accelerated by GPUs

Simplify Your Architecture

Plug into existing data architecture, with deep integration with open-source and commercial frameworks

None of the tuning, indexing, or tweaking typically associated with traditional CPU-based solutions

Natural-language-processing-based full-text search

Plug-ins for third-party business intelligence applications such as Tableau, Kibana, and Caravel


Page 12: The fastest, distributed, in-memory database accelerated by GPUs

Advanced Visualization: Kinetica's native visualization is ideal for fast-moving, location-based IoT data.

It improves visual rendering of data, particularly geospatial and temporal data.

• Visualizes billions of data points

• Updates in real time

• Delivers full gamut of geospatial visualization


Page 13: The fastest, distributed, in-memory database accelerated by GPUs

Reference Architecture

• High speed in-memory layer for your most critical enterprise data

• Real-time analytics for streaming data

• Accelerate performance of your existing analytics tools and applications

• Offload expensive relational databases

• Get more value from your Hadoop investment

[Diagram: application layer (OLAP | NoSQL | geospatial | APIs | text search) over the Kinetica in-memory layer, with data ingest from IoT/stream processing, message queues, and batch feeds, alongside the data lake and transactional systems.]

Page 14: The fastest, distributed, in-memory database accelerated by GPUs

Kinetica Architecture

VISUALIZATION via ODBC/JDBC

APIs: Java, JavaScript, REST, C++, Node.js, Python

OPEN SOURCE INTEGRATION: Apache NiFi, Apache Kafka, Apache Spark, Apache Storm

GEOSPATIAL CAPABILITIES: geometric objects, tracks, geospatial endpoints, WMS, WKT

KINETICA CLUSTER: On-Demand Scale

[Diagram: IBM Power servers with GPUs; each node holds columnar in-memory data shards (A1-C4) with disk persistence behind an HTTP head node.]

OTHER INTEGRATION: message queues, ETL tools, streaming tools
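As an example of the streaming and message-queue integrations listed above, a minimal consumer sketch follows. It assumes the kafka-python package; the topic, table, and the final insert call are assumptions (the Kafka connector normally handles this without custom code), so check them against the client documentation.

# Hedged sketch: draining a Kafka topic into an in-memory table in batches.
import json
from kafka import KafkaConsumer
import gpudb

db = gpudb.GPUdb(host="127.0.0.1", port="9191")
consumer = KafkaConsumer(
    "meter_readings",                          # topic name is illustrative
    bootstrap_servers="broker:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 10_000:
        # Push the batch into the in-memory table (signature and record
        # encoding are assumptions; see the client docs for your version)
        db.insert_records("meter_readings", batch)
        batch = []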

• Reliable, Available and Scalable: disk-based persistence; add nodes on demand; data replication for high availability; scale up and/or out
• Performance: GPU accelerated (1000s of cores per GPU); ingest billions of records per minute; ultra-low-latency query performance
• Massive Data Sizes: tens of terabytes in memory; billions of entries
• Connectors: ODBC/JDBC; RESTful endpoints; rich APIs; standard geospatial capabilities (see the ODBC sketch below)
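The connectors above mean standard tooling can query the cluster directly. The sketch below assumes the pyodbc package and a configured "Kinetica" ODBC DSN; the table and column names are invented for illustration.

# Hedged sketch: issuing SQL through the ODBC connector with pyodbc. The DSN,
# table, and columns are illustrative; driver configuration comes from the
# ODBC connector documentation.
import pyodbc

conn = pyodbc.connect("DSN=Kinetica", autocommit=True)   # DSN name is an assumption
cur = conn.cursor()

# The same style of SQL a BI tool such as Tableau would issue through the connector
cur.execute("""
    SELECT region, COUNT(*) AS events, MAX(reading) AS peak
    FROM sensor_readings
    GROUP BY region
""")
for row in cur.fetchall():
    print(row.region, row.events, row.peak)
conn.close()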

Page 15: The fastest, distributed, in-memory database accelerated by GPUs

AT THE CONVERGENCE OF BI & AI

[Diagram: BI and AI converging.]

"To be successful with AI, you need to be able to feed your system relevant, timely data. From a data perspective, AI cannot exist without BI."

Page 16: The fastest, distributed, in-memory database accelerated by GPUs

Thank You

Charles Sutton – [email protected]

Page 17: The fastest, distributed, in-memory database accelerated by GPUs

Enabling New Enterprise Solutions

Logistics, Fleet Optimization: Reallocating resources on the fly based on personnel, environmental, and seasonal data. In 2015 alone, USPS delivered 150 billion pieces of mail while driving 70 million fewer miles and saving 7 million gallons of fuel. Track every truck and piece of mail with Kinetica on only 10 nodes.

Cybersecurity: Fusing multiple data feeds with real-time anomaly detection to protect financial clients against current and emerging cyber threats.

Utility: Smart Grid: Geospatially visualizing the smart grid while ingesting real-time vector data streaming from smart meters. Striving for optimized energy generation and uptime based on fluctuating usage patterns and unpredictable natural disasters.