SnappyData Overview Slidedeck for Big Data Bellevue

SnappyData Getting Spark ready for real-time, operational analytics www.snappydata.io Suds Menon Co-Founder SnappyData Jan 2016

Upload: snappydata

Post on 12-Jan-2017


TRANSCRIPT

Page 1: SnappyData Overview Slidedeck for Big Data Bellevue

SnappyData

Getting Spark ready for real-time, operational analytics

www.snappydata.io

Suds Menon Co-Founder SnappyData

Jan 2016

Page 2: SnappyData Overview Slidedeck for Big Data Bellevue

Last Week Tonight in Big Data


Page 3: SnappyData Overview Slidedeck for Big Data Bellevue

IoT is what makes the big data challenge very real

A 10 Trillion Device World [1]

[1] http://cacm.acm.org/news/191847-get-ready-to-live-in-a-trillion-device-world/fulltext

Page 4: SnappyData Overview Slidedeck for Big Data Bellevue

Because Insights are like people. Useful for a short period of time

The New Arms Race

●  Sift through data to get insights to improve your business
●  What is your time to insights?
●  What is your time to operationalizing insights?

Page 5: SnappyData Overview Slidedeck for Big Data Bellevue

Can we use the past to accurately predict the future?

The Holy Grail of Analytics


Page 6: SnappyData Overview Slidedeck for Big Data Bellevue

The faster you go, the bigger your business advantage

Speeding Up Insights


Page 7: SnappyData Overview Slidedeck for Big Data Bellevue

Exploding data volumes fuel the search for distributed solutions

How We Got Here


Teradata Cognos

GreenPlum Netezza, ParAccel

Hadoop (SQL on Hadoop)

Spark (Spark SQL)

Page 8: SnappyData Overview Slidedeck for Big Data Bellevue

Every enterprise today deals with these 4 kinds of data interactions

The Four Horsemen Of Data

OLTP, OLAP, Streaming, Machine Learning

Page 9: SnappyData Overview Slidedeck for Big Data Bellevue

Who Are We?

●  An EMC-Pivotal spinout focused on real-time operational analytics
●  New Spark-based open source project started by Pivotal GemFire founders and engineers
●  Decades of in-memory data management experience
●  Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database

Page 10: SnappyData Overview Slidedeck for Big Data Bellevue

SnappyData At Cruising Altitude

Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics

Real-time operational analytics: TBs in memory

First commercial project on Approximate Query Processing (AQP)

[Diagram labels: Batch design, high throughput; RDB; Rows; Txn; Columnar; Index; API; Stream processing; ODBC, JDBC, REST; Spark (Scala, Java, Python, R); HDFS; AQP; MPP DB]

Page 11: SnappyData Overview Slidedeck for Big Data Bellevue

SnappyData: A new approach

Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics

Batch design, high throughput

Real-time design center: low latency, HA, concurrent

Vision: Drastically reduce the cost and complexity in modern big data

Page 12: SnappyData Overview Slidedeck for Big Data Bellevue

Huge community adoption, slip streaming into Hadoop momentum, great data integration platform

Why Spark?

●  Most events in life can be analyzed as micro batches
●  Blends streaming, interactive, and batch analytics
●  Appeals to Java, R, Python, Scala programmers
●  Rich set of transformations and libraries
●  RDD and fault tolerance without replication
●  Offers Spark SQL as a key capability
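The micro-batch idea from the list above can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark Streaming code: the function name `micro_batches`, the event tuples, and the one-second interval are all assumptions made for illustration.

```python
from collections import defaultdict

def micro_batches(events, interval_s):
    """Group (timestamp, payload) events into fixed-width time windows."""
    batches = defaultdict(list)
    for ts, payload in events:
        # Integer division assigns each event to the window it falls in.
        window_start = int(ts // interval_s) * interval_s
        batches[window_start].append(payload)
    # Return windows in time order, the order a stream engine would process them.
    return sorted(batches.items())

# Hypothetical events: (seconds since start, payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, interval_s=1)
for window_start, payloads in batches:
    print(window_start, payloads)
```

Each window can then be handed to the same batch-style transformations used for data at rest, which is what lets one engine blend streaming, interactive, and batch work.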

Page 13: SnappyData Overview Slidedeck for Big Data Bellevue

Spark is a compute framework that processes data, not an analytics database

Clearing Up Some Spark Myths

●  It is NOT a distributed in-memory database
   ○  It’s a computational framework with immutable caching
●  It is NOT Highly Available
   ○  Fault tolerance is not the same as HA
●  NOT well suited for real time, operational environments
   ○  Does not handle concurrency well
   ○  Does not share data very well either

Page 14: SnappyData Overview Slidedeck for Big Data Bellevue

SnappyData & Lambda

SnappyData Focus

Page 15: SnappyData Overview Slidedeck for Big Data Bellevue

Perspective on Lambda for real time

[Diagram labels: Application; Streams; Alerts; Transform; Data-in-motion Analytics; In-Memory DB; Interactive queries, updates; MPP DB; Deep Scale, High volume]

Page 16: SnappyData Overview Slidedeck for Big Data Bellevue

RELEVANT USECASES


Page 17: SnappyData Overview Slidedeck for Big Data Bellevue

Market Surveillance

●  INGEST: Partitioned, HA stream ingestion
●  ANALYZE: SQL queries & Stream Analytics on microbatches
●  DETECT: Identify patterns based on query results
●  FLAG: Prevent settlement, investigate further
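The four stages on this slide can be sketched as a chain of small functions. This is an illustrative plain-Python sketch only: the `trader,symbol,qty` line format, the volume limit, and the `HOLD_SETTLEMENT` label are hypothetical and do not come from the deck.

```python
from collections import Counter

def ingest(raw_trades):
    """INGEST: parse raw 'trader,symbol,qty' lines into tuples."""
    parsed = []
    for line in raw_trades:
        trader, symbol, qty = line.split(",")
        parsed.append((trader, symbol, int(qty)))
    return parsed

def analyze(batch):
    """ANALYZE: aggregate volume per (trader, symbol), like a GROUP BY."""
    volume = Counter()
    for trader, symbol, qty in batch:
        volume[(trader, symbol)] += qty
    return volume

def detect(volume, limit):
    """DETECT: keep the keys whose aggregated volume exceeds the limit."""
    return sorted(key for key, total in volume.items() if total > limit)

def flag(suspects):
    """FLAG: mark suspicious (trader, symbol) pairs for investigation."""
    return {key: "HOLD_SETTLEMENT" for key in suspects}

# One hypothetical micro-batch of trades.
raw = ["t1,ACME,400", "t2,ACME,50", "t1,ACME,700", "t3,INIT,900"]
flags = flag(detect(analyze(ingest(raw)), limit=1000))
print(flags)
```

In a real deployment each stage would run per micro-batch against partitioned data; here a single in-memory list stands in for the stream.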

Page 18: SnappyData Overview Slidedeck for Big Data Bellevue

Contextual Marketing

●  INGEST: Transactional request for Ad placement
●  ANALYZE: Join with history, join with user profile, join with location
●  DECIDE: Pick Ad based on variety of reference data parameters
●  RESPOND: Deliver in real time

Page 19: SnappyData Overview Slidedeck for Big Data Bellevue

Location Based Telco Services

Geo Fencing, Mobile Marketing, Network Analytics

●  INGEST, CORRELATE, JOIN WITH HISTORICAL DATA, RESPOND

Page 20: SnappyData Overview Slidedeck for Big Data Bellevue

Spark Architecture

[Diagram labels: Driver; Cluster Manager (YARN, Mesos, Standalone); Worker (×3); Executor]

Page 21: SnappyData Overview Slidedeck for Big Data Bellevue

Snappy Infused Spark Architecture

[Diagram labels: REST API for Job Submission; JDBC Clients; ODBC Clients; Job Server; Lead Node (×2); Cluster Manager (YARN, Mesos, Standalone); Worker (×3); Data Server (×2); Executor (×2)]

Page 22: SnappyData Overview Slidedeck for Big Data Bellevue

Core Components Of SnappyData

Page 23: SnappyData Overview Slidedeck for Big Data Bellevue

Colocated row/column Tables in Spark

[Diagram labels, repeated for each of three executors: Row Table; Column Table; Spark Executor TASK; Spark Block Manager; Stream processing]

●  Spark Executors are long lived and shared across multiple apps
●  Gem Memory Mgr and Spark Block Mgr integrated

Page 24: SnappyData Overview Slidedeck for Big Data Bellevue

Table can be partitioned or replicated

●  Replicated Table: consistent replica on each node
●  Partitioned Table: data partitioned with one or more replicas

[Diagram labels: Replicated Table (×3); Partitioned Table (Buckets A-H); Partitioned Table (Buckets I-P); Partitioned Table (Buckets Q-W); Partition Replica (Buckets A-H); Partition Replica (Buckets I-P)]
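The difference between the two table types can be sketched in plain Python. This is a rough model, not SnappyData's placement algorithm: the node names, the bucket count, the CRC32 hash, and the "next node holds the replica" rule are all assumptions chosen to keep the example small.

```python
import zlib

NODES = ["node-0", "node-1", "node-2"]   # hypothetical cluster members
NUM_BUCKETS = 12                          # hypothetical bucket count

def bucket_of(key):
    """Map a partitioning key to a bucket with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_BUCKETS

def owners_of(bucket, redundancy=1):
    """Return the primary node plus `redundancy` replicas on other nodes."""
    primary = bucket % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(redundancy + 1)]

def place_partitioned(rows, key_index=0, redundancy=1):
    """Store each row only on the nodes that own its bucket."""
    placement = {node: [] for node in NODES}
    for row in rows:
        for node in owners_of(bucket_of(row[key_index]), redundancy):
            placement[node].append(row)
    return placement

def place_replicated(rows):
    """Store a full copy of the table on every node."""
    return {node: list(rows) for node in NODES}

rows = [(i, "customer-%d" % i) for i in range(6)]
partitioned = place_partitioned(rows, redundancy=1)
replicated = place_replicated(rows)
```

With `redundancy=1` every row lives on exactly two nodes, so losing one node loses no data, while the replicated table trades memory for the ability to join locally on any node.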

Page 25: SnappyData Overview Slidedeck for Big Data Bellevue

Linearly scale with shared partitions

[Diagram labels: Kafka queue; Spark Executor (×2); Subscriber A-M; Subscriber N-Z; Ref data]

Linearly scale with partition pruning

Input queue, Stream, IMDB, Output queue all share the same partitioning strategy
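A shared partitioning strategy means that an incoming event and the table row it needs always land in the same partition, so joins stay local and a single-key lookup touches one partition only. The sketch below models this in plain Python; the `subscriber` key, the CRC32 hash, and the two-partition layout are assumptions for illustration.

```python
import zlib

NUM_PARTITIONS = 2  # hypothetical; the slide shows two executors

def partition_of(subscriber_id):
    """One partitioning function shared by queue, stream, and table."""
    return zlib.crc32(subscriber_id.encode("utf-8")) % NUM_PARTITIONS

def partition(records, key):
    """Split records into per-partition lists using the shared function."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for record in records:
        parts[partition_of(record[key])].append(record)
    return parts

def local_join(events_part, table_part):
    """Join within one partition; no rows cross partitions."""
    by_id = {row["subscriber"]: row for row in table_part}
    return [(e, by_id[e["subscriber"]]) for e in events_part
            if e["subscriber"] in by_id]

table = [{"subscriber": s, "plan": p}
         for s, p in [("alice", "gold"), ("bob", "basic"), ("zoe", "gold")]]
events = [{"subscriber": s, "bytes": b}
          for s, b in [("zoe", 10), ("alice", 7), ("bob", 3), ("alice", 5)]]

table_parts = partition(table, "subscriber")
event_parts = partition(events, "subscriber")
joined = [pair for p in range(NUM_PARTITIONS)
          for pair in local_join(event_parts[p], table_parts[p])]

# Partition pruning: a lookup for one subscriber touches a single partition.
alice_rows = [r for r in table_parts[partition_of("alice")]
              if r["subscriber"] == "alice"]
```

Because every event finds its matching row in its own partition, adding partitions adds capacity without adding cross-node traffic, which is the basis of the linear-scaling claim.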

Page 26: SnappyData Overview Slidedeck for Big Data Bellevue

Point access, updates, fast writes

●  Row tables with PKs are distributed HashMaps
   ○  with secondary indexes
●  Support for transactional semantics
   ○  read_committed, repeatable_read
●  Support for scalable high write rates
   ○  streaming data goes through stages
   ○  queue streams, intermediate storage (Delta row buffer), immutable compressed columns
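Two of the ideas above can be sketched on a single node in plain Python: a row table as a primary-key hash map with a secondary index, and a write path where rows collect in a delta buffer before being frozen into compressed column batches. The class names, the JSON-plus-zlib encoding, and the batch size are assumptions for illustration, not SnappyData internals.

```python
import json
import zlib

class RowTable:
    """Row table sketch: primary-key hash map plus one secondary index."""

    def __init__(self, pk, indexed_column):
        self.pk = pk
        self.indexed_column = indexed_column
        self.rows = {}    # primary key -> row dict (point access)
        self.index = {}   # indexed value -> set of primary keys

    def put(self, row):
        """Insert or update a row, keeping the secondary index in step."""
        key = row[self.pk]
        old = self.rows.get(key)
        if old is not None:
            self.index[old[self.indexed_column]].discard(key)
        self.rows[key] = row
        self.index.setdefault(row[self.indexed_column], set()).add(key)

    def get(self, key):
        return self.rows.get(key)

    def find_by_index(self, value):
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]

class ColumnTable:
    """Write path sketch: rows buffer in a delta, then freeze into batches."""

    def __init__(self, columns, batch_size):
        self.columns = columns
        self.batch_size = batch_size
        self.delta = []     # mutable row buffer for recent writes
        self.batches = []   # frozen, compressed column batches

    def append(self, row):
        self.delta.append(row)
        if len(self.delta) >= self.batch_size:
            self._flush()

    def _flush(self):
        # Pivot buffered rows into one array per column, then compress each.
        batch = {}
        for c in self.columns:
            encoded = json.dumps([r[c] for r in self.delta]).encode("utf-8")
            batch[c] = zlib.compress(encoded)
        self.batches.append(batch)
        self.delta = []

    def scan(self, column):
        """Read one column across frozen batches and the live delta."""
        values = []
        for batch in self.batches:
            values.extend(json.loads(zlib.decompress(batch[column]).decode("utf-8")))
        values.extend(r[column] for r in self.delta)
        return values

accounts = RowTable(pk="id", indexed_column="city")
accounts.put({"id": 1, "city": "Bellevue"})
accounts.put({"id": 2, "city": "Seattle"})
accounts.put({"id": 1, "city": "Seattle"})   # update moves id 1 in the index

ticks = ColumnTable(columns=["symbol", "price"], batch_size=2)
for i in range(5):
    ticks.append({"symbol": "ACME", "price": 100 + i})
```

Writes only ever touch the small delta buffer, so they stay cheap, while scans read mostly from compact column batches and merge in the few rows still in the delta.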

Page 27: SnappyData Overview Slidedeck for Big Data Bellevue

Full Spark Compatibility

●  Any table is also visible as a DataFrame
●  Any RDD[T]/DataFrame can be stored in SnappyData tables
●  Tables appear like any JDBC sourced table
   ○  But, in executor memory by default
●  Additional API for updates, inserts, deletes

import org.apache.spark.sql.SaveMode  // needed for SaveMode.Append below

// Save a DataFrame using the Spark context …
context.createExternalTable("T1", "ROW", myDataFrame.schema, props);
// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1");

Page 28: SnappyData Overview Slidedeck for Big Data Bellevue

Extends Spark

CREATE [Temporary] TABLE [IF NOT EXISTS] table_name
  (
    <column definition>
  ) USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',                 // Default none
  PARTITION_BY 'PRIMARY KEY | column name',   // will be a replicated table, by default
  REDUNDANCY '1',                             // Manage HA
  PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
                                              // Empty string will map to default disk store.
  OFFHEAP "true | false"
  EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
  …..
[AS select_statement];

Page 29: SnappyData Overview Slidedeck for Big Data Bellevue

Key feature: Synopses Data

●  Maintain stratified samples
   ○  Intelligent sampling to keep error bounds low
●  Probabilistic data
   ○  TopK for time series (using time aggregation CMS, item aggregation)
   ○  Histograms, HyperLogLog, Bloom Filters, Wavelets
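The CMS mentioned above is a Count-Min Sketch, a fixed-size table of counters that answers "roughly how often did this item occur?" without storing every item. The following is a minimal plain-Python sketch of a basic Count-Min Sketch with a simple TopK ranking on top; it does not model the time-aggregation variant named on the slide, and the width, depth, and SHA-256 salting scheme are assumptions.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency summary; estimates never undercount."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per row, derived by salting the item with the row number.
        for row in range(self.depth):
            data = ("%d:%s" % (row, item)).encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._slots(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][col] for row, col in self._slots(item))

def top_k(sketch, candidates, k):
    """Rank candidate items by their estimated counts, highest first."""
    return sorted(candidates, key=lambda c: (-sketch.estimate(c), c))[:k]

cms = CountMinSketch(width=256, depth=4)
stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 5 + ["d"] * 1
for item in stream:
    cms.add(item)

print(top_k(cms, set(stream), 2))
```

The estimate for an item is always at least its true count and at most the total number of items seen; widening the table makes overestimates rarer, and adding rows makes them less likely to survive the minimum.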

CREATE SAMPLE TABLE sample-table-name USING columnar
OPTIONS (
  BASETABLE 'table_name'   // source column table or stream table
  [ SAMPLINGMETHOD "stratified | uniform" ]
  STRATA name (
    QCS ("comma-separated-column-names")
    [ FRACTION "frac" ]
  ),+   // one or more QCS
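To see why stratifying on the query column set (QCS) keeps error bounds low, consider the plain-Python sketch below. It is an illustration under assumptions, not SnappyData's sampler: the function names, the `min_per_stratum` floor, and the region/sales data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, qcs, fraction, min_per_stratum=1, seed=0):
    """Sample each stratum (distinct value of the QCS column) separately.

    Returns (row, weight) pairs, where weight = stratum size / sample size.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[qcs]].append(row)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Small strata still get at least min_per_stratum rows.
        wanted = max(min_per_stratum, round(len(members) * fraction))
        n = min(len(members), wanted)
        weight = len(members) / n
        sample.extend((row, weight) for row in rng.sample(members, n))
    return sample

def estimate_sum(sample, column, qcs, key):
    """Estimate SUM(column) for one stratum by scaling up sampled rows."""
    return sum(row[column] * w for row, w in sample if row[qcs] == key)

# Hypothetical skewed data: a large "west" group and a tiny "east" group.
rows = ([{"region": "west", "sales": 10} for _ in range(1000)]
        + [{"region": "east", "sales": 20} for _ in range(10)])
sample = stratified_sample(rows, qcs="region", fraction=0.05)

print(len(sample))
print(estimate_sum(sample, "sales", "region", "west"))
print(estimate_sum(sample, "sales", "region", "east"))
```

A uniform 5% sample of this data could easily contain no "east" rows at all, leaving any query grouped by region with nothing to estimate from; sampling per stratum guarantees every group is represented and carries a weight for scaling answers back up.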

Page 30: SnappyData Overview Slidedeck for Big Data Bellevue


AQP Architecture

Page 31: SnappyData Overview Slidedeck for Big Data Bellevue


Spot The Differences

Page 32: SnappyData Overview Slidedeck for Big Data Bellevue

SnappyData is Open Source

●  Beta will be on GitHub in January. We are looking for contributors!
●  Learn more & register for beta: www.snappydata.io
●  Connect:
   ○  twitter: www.twitter.com/snappydata
   ○  facebook: www.facebook.com/snappydata
   ○  linkedin: www.linkedin.com/snappydata
   ○  slack: http://snappydata-slackin.herokuapp.com
   ○  IRC: irc.freenode.net #snappydata

Page 33: SnappyData Overview Slidedeck for Big Data Bellevue

Q&A


Page 34: SnappyData Overview Slidedeck for Big Data Bellevue

THANK YOU
