An Introduction to Time Series with Team Apache

Uploaded by patrick-mcfadin, 13-Apr-2017

TRANSCRIPT

Page 1: An Introduction to time series with Team Apache

@PatrickMcFadin

Patrick McFadin, Chief Evangelist for Apache Cassandra, DataStax

Process, store, and analyze like a boss with Team Apache: Kafka, Spark, and Cassandra


Page 2: An Introduction to time series with Team Apache

Agenda

Section 1

• Lecture: Kafka, Spark, Cassandra

• Hands on: verify Cassandra is up and running; load data into Cassandra

• Break 3:00 - 3:30

Section 2

• Lecture: Cassandra (continued), Spark and Cassandra, PySpark

• Hands On: Spark Shell, Spark SQL

Page 3: An Introduction to time series with Team Apache

About me

• Chief Evangelist for Apache Cassandra
• Senior Solution Architect at DataStax
• Chief Architect, Hobsons
• Web applications and performance since 1996

Page 4: An Introduction to time series with Team Apache

What is time series data?

A sequence of data points, typically consisting of successive measurements made over a time interval.

Source: https://en.wikipedia.org/wiki/Time_series

Page 5: An Introduction to time series with Team Apache


Page 6: An Introduction to time series with Team Apache

Underpants Gnomes / Data Gnomes

Step 1: Collect Data
Step 2: ?
Step 3: Profit!

Page 7: An Introduction to time series with Team Apache

What is time series analysis?

Methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data.

Source: https://en.wikipedia.org/wiki/Time_series

Page 8: An Introduction to time series with Team Apache

V V V

Page 9: An Introduction to time series with Team Apache

Velocity

Volume

Variety

Page 10: An Introduction to time series with Team Apache

Internet of Things

Page 11: An Introduction to time series with Team Apache

June 29, 2007


Page 12: An Introduction to time series with Team Apache

Bring in the team

Page 13: An Introduction to time series with Team Apache

Team Apache

Collect Process Store

Page 14: An Introduction to time series with Team Apache

Kafka, Akka, Spark, Cassandra

Organize - Process - Store

Running on Mesos: multiple Kafka, Spark, Akka, and Cassandra instances

Page 15: An Introduction to time series with Team Apache

2.1 Kafka - Architecture and Deployment

Page 16: An Introduction to time series with Team Apache

The problem

Kitchen

Hamburger please

Meat disk on bread please

Page 17: An Introduction to time series with Team Apache

The problem

Kitchen

Page 18: An Introduction to time series with Team Apache

The problem

Kitchen

Order Queue

Hamburger please

Order

Page 19: An Introduction to time series with Team Apache

The problem

Kitchen

Order Queue

Page 20: An Introduction to time series with Team Apache

The problem

Kitchen

Order Queue

Meat disk on bread please

You mean a Hamburger?

Uh yeah. That.

Order

Page 21: An Introduction to time series with Team Apache

Order from chaos

Producer -> Topic = Food -> Consumer

Order

Pages 22–35: An Introduction to time series with Team Apache

Order from chaos (build-up)

Producer -> Topic = Food -> Consumer

The producer appends Order 1, Order 2, … Order 5 to the topic; the consumer reads each order off the front in the same sequence.

Page 36: An Introduction to time series with Team Apache

Scale

Producer -> Topic = Hamburgers: Order 1…5 -> Consumer
Producer -> Topic = Pizza: Order 1…5 -> Consumer

The single Topic = Food is split into per-food topics to scale.

Page 37: An Introduction to time series with Team Apache

Kafka

Producer (Collection API) -> Broker

Topic = Temperature: Temp 1…5 -> Consumer (Temperature Processor)
Topic = Precipitation: Precip 1…5 -> Consumer (Precipitation Processor)

Pages 38–41: An Introduction to time series with Team Apache

Kafka (continued)

Each topic lives on a Broker as Partition 0; a topic can be split into more partitions (Partition 1) to spread load, with a consumer instance per partition. With Topic Replication Factor = 2, the Temperature and Precipitation topics are duplicated on a second broker, and additional Temperature Processor / Precipitation Processor consumers read in parallel.

Page 42: An Introduction to time series with Team Apache

Guarantees

Order
• Messages are ordered as they are sent by the producer
• Consumers see messages in the order they were inserted by the producer

Durability
• Messages are delivered at least once
• With a Replication Factor of N, up to N-1 server failures can be tolerated without losing committed messages
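The ordering guarantee is easy to picture with a toy log: a topic partition is an append-only list, and each consumer group simply advances an offset through it. A minimal Python sketch (an illustration only, not the real Kafka API; `TopicPartition`, `produce`, and `consume` are invented names):

```python
# Toy model of one Kafka topic partition: an append-only log
# plus per-consumer-group offsets. Names are invented for illustration.

class TopicPartition:
    def __init__(self):
        self.log = []          # messages, in producer send order
        self.offsets = {}      # consumer group id -> next offset to read

    def produce(self, message):
        """Append a message; return its offset in the log."""
        self.log.append(message)
        return len(self.log) - 1

    def consume(self, group_id):
        """Deliver the next unread message for this group, or None."""
        offset = self.offsets.get(group_id, 0)
        if offset >= len(self.log):
            return None
        self.offsets[group_id] = offset + 1
        return self.log[offset]

topic = TopicPartition()
for order in ["Order 1", "Order 2", "Order 3"]:
    topic.produce(order)

# The consumer sees messages in exactly the order they were produced.
seen = [topic.consume("kitchen") for _ in range(3)]
print(seen)  # ['Order 1', 'Order 2', 'Order 3']
```

Because delivery is "at least once", a real consumer that restarts from an old offset can see a message twice, which is why consumers are usually written to be idempotent.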

Page 43: An Introduction to time series with Team Apache

3.1 Spark - Introduction to Spark

Page 44: An Introduction to time series with Team Apache

Map Reduce

Input Data

Map

Reduce

Intermediate Data

Output Data

Disk

Page 45: An Introduction to time series with Team Apache

Data Science at Scale

2009

Page 46: An Introduction to time series with Team Apache

In memory

Input Data

Map

Reduce

Intermediate Data

Output Data

Disk

Page 47: An Introduction to time series with Team Apache

In memory

Input Data

Spark Intermediate Data

Output Data

Disk Memory

Page 48: An Introduction to time series with Team Apache

Resilient Distributed Dataset

Page 49: An Introduction to time series with Team Apache

RDDs

Are
• Immutable
• Partitioned
• Reusable

and Have

Transformations
• Produce a new RDD
• Calls: filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract

Actions
• Start cluster computing operations
• Calls: collect: Array[T], count, fold, reduce…
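The transformation/action split can be sketched in plain Python: on a toy "RDD", transformations only record work, and nothing runs until an action is called. This is a simplification of Spark's lazy evaluation, and `ToyRDD` is an invented name:

```python
# A toy RDD: transformations are recorded lazily, actions force evaluation.
# Illustrates only the lazy/eager split, not Spark's real internals.

class ToyRDD:
    def __init__(self, data, pipeline=None):
        self._data = data
        self._pipeline = pipeline or []   # recorded transformations

    # Transformations: return a new ToyRDD, run nothing yet.
    def map(self, fn):
        return ToyRDD(self._data, self._pipeline + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._pipeline + [("filter", pred)])

    # Actions: walk the recorded pipeline and compute a result.
    def collect(self):
        items = list(self._data)
        for kind, fn in self._pipeline:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())

rdd = ToyRDD([1, 2, 3, 4, 5])
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has executed yet; collect() is the action that triggers work.
print(evens_squared.collect())  # [4, 16]
```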

Page 50: An Introduction to time series with Team Apache

API

map, filter, reduce, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, …

Page 51: An Introduction to time series with Team Apache

Spark Streaming - Near Real-time

Spark SQL - Structured Data

MLlib - Machine Learning

GraphX - Graph Analysis

Page 52: An Introduction to time series with Team Apache

Spark Streaming

Petabytes of data

Gigabytes Per Second

Page 53: An Introduction to time series with Team Apache

3.1.1 Spark - Architecture

Page 54: An Introduction to time series with Team Apache

Directed Acyclic Graph

Resilient Distributed Dataset

Page 55: An Introduction to time series with Team Apache

DAG

RDD

Page 56: An Introduction to time series with Team Apache

DAG

Stage 1

Stage 2

Stage 3

Stage 4

Stage 5

Page 57: An Introduction to time series with Team Apache

RDD

Data

Input Source

• File

• Database

• Stream

• Collection

Page 58: An Introduction to time series with Team Apache

RDD

Data

.count() -> 100

Page 59: An Introduction to time series with Team Apache

Partitions

RDD

Data

Partition 0Partition 1Partition 2Partition 3Partition 4Partition 5Partition 6Partition 7Partition 8Partition 9

Server 1

Server 2

Server 3

Server 4

Server 5

Page 60: An Introduction to time series with Team Apache

Partitions

RDD

Data

Partition 0Partition 1Partition 2Partition 3Partition 4Partition 5Partition 6Partition 7Partition 8Partition 9

Server 2

Server 3

Server 4

Server 5

Page 61: An Introduction to time series with Team Apache

Partitions

RDD

Data

Partition 0Partition 1Partition 2Partition 3Partition 4Partition 5Partition 6Partition 7Partition 8Partition 9

Server 2

Server 3

Server 4

Server 5

Page 62: An Introduction to time series with Team Apache

Workflow

RDD: textFile("words.txt")

countWords()

Action

DAG Scheduler plan

Stage one - Count words

P0

P1

P2

P0

Stage two - Collect counts
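The two stages in the plan above can be imitated in plain Python: stage one counts words independently within each partition, and stage two merges the per-partition counts. This is a sketch, not Spark; the partition contents are hard-coded for illustration:

```python
from collections import Counter

# Stage one: count words independently within each partition (P0, P1, P2).
partitions = [
    ["spark", "kafka", "spark"],   # P0
    ["cassandra", "spark"],        # P1
    ["kafka"],                     # P2
]
per_partition = [Counter(words) for words in partitions]

# Stage two: collect and merge the per-partition counts
# (this merge is what requires a shuffle on a real cluster).
totals = Counter()
for counts in per_partition:
    totals.update(counts)

print(dict(totals))  # {'spark': 3, 'kafka': 2, 'cassandra': 1}
```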

Page 63: An Introduction to time series with Team Apache

Executor

Master

Worker

Executor

Executor

Server

Data Storage

Page 64: An Introduction to time series with Team Apache

Master

Worker

Worker

Worker Worker

Storage

Storage Storage

Storage

Stage one - Count words

P0

P1

P2

DAG Scheduler

Executor

Narrow Transformation

• filter

• map

• sample

• flatMap

Page 65: An Introduction to time series with Team Apache

Master

Worker

Worker

Worker Worker

Storage

Storage Storage

Storage

Wide Transformation

P0

Stage two - Collect counts

Shuffle!
• join
• reduceByKey
• union
• groupByKey

Page 66: An Introduction to time series with Team Apache

3.2 Spark - Spark Streaming

Page 67: An Introduction to time series with Team Apache

The problem domain

Petabytes of data

Gigabytes Per Second

Page 68: An Introduction to time series with Team Apache

Input Sources


Pages 70–72: An Introduction to time series with Team Apache

Receiver Based Approach

Producer (Collection API) -> Broker
Topic = Temperature (Partition 0): Temp 1…5
Topic = Precipitation (Partition 0): Precip 1…5
-> Consumer: Streaming receivers

If a receiver fails, buffered data can be lost; a Write Ahead Log protects against losing it.

Page 73: An Introduction to time series with Team Apache

Receiver Based Approach

val kafkaStream = KafkaUtils.createStream(
  streamingContext,
  [ZK quorum],           // Zookeeper server IP
  [consumer group id],   // consumer group created in Kafka
  [per-topic number of Kafka partitions to consume])  // list of Kafka topics and number of threads per topic

Pages 74–77: An Introduction to time series with Team Apache

Direct Based Approach

Producer (Collection API) -> Broker
Topic = Temperature (Partition 0): Temp 1…5
Topic = Precipitation (Partition 0): Precip 1…5
-> Consumer: one Streaming task per partition, reading the brokers directly (no receiver); adding partitions adds parallel streaming tasks.

Page 78: An Introduction to time series with Team Apache

Direct Based Approach

val directKafkaStream = KafkaUtils.createDirectStream[
  [key class], [value class], [key decoder class], [value decoder class]](
  streamingContext,
  [map of Kafka parameters],   // list of Kafka brokers (and any other params)
  [set of topics to consume])  // Kafka topics

Page 79: An Introduction to time series with Team Apache

3.2.2 Spark - Streaming Windows and Slides

Page 80: An Introduction to time series with Team Apache

Discretized Stream

Page 81: An Introduction to time series with Team Apache

DStream

Kafka


Page 95: An Introduction to time series with Team Apache

DStream

Kafka

Discrete by time

Page 96: An Introduction to time series with Team Apache

DStream

Individual Events

Discrete by time: DStream = RDD

Page 97: An Introduction to time series with Team Apache

DStream

X Seconds

DStream

Transform

.countByValue

.reduceByKey

.join

.map

Page 98: An Introduction to time series with Team Apache

T0 1 2 3 4 5 6 7 8 9 10 11

1 Sec Window

Page 99: An Introduction to time series with Team Apache

T0 1 2 3 4 5 6 7 8 9 10 11

Event DStream

Transform DStream

Transform


Page 102: An Introduction to time series with Team Apache

T0 1 2 3 4 5 6 7 8 9 10 11

Event DStream

Transform DStream

Page 103: An Introduction to time series with Team Apache

T0 1 2 3 4 5 6 7 8 9 10 11

Event DStream

Transform DStream

Slide

Transform


Page 105: An Introduction to time series with Team Apache

T0 1 2 3 4 5 6 7 8 9 10 11

Event DStream

Transform DStream

Transform

Page 106: An Introduction to time series with Team Apache

Window
• Amount of time in seconds to sample data
• Larger sizes create memory pressure

Slide
• Amount of time in seconds to advance the window

DStream
• A window of data as a set
• Same operations as an RDD
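Window and slide can be sketched over a plain list of (timestamp, value) events: a window of W seconds of data is sampled every S seconds. This is an illustration with simple integer timestamps, not Spark's real DStream API; the `windows` helper is an invented name:

```python
# Sketch of windowing: window = seconds of data per sample,
# slide = seconds to advance the window between samples.

events = [(t, f"event-{t}") for t in range(12)]  # one event per second, T0..T11

def windows(events, window, slide):
    """Yield (start, items in [start, start+window)) advancing by slide."""
    last_t = max(t for t, _ in events)
    start = 0
    while start <= last_t:
        yield start, [v for t, v in events if start <= t < start + window]
        start += slide

# A 4-second window sliding every 2 seconds: consecutive windows overlap by 2s.
counts = {start: len(items) for start, items in windows(events, window=4, slide=2)}
print(counts)  # {0: 4, 2: 4, 4: 4, 6: 4, 8: 4, 10: 2}
```

A larger `window` keeps more events in memory per sample, which is the memory pressure the slide warns about.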

Page 107: An Introduction to time series with Team Apache

4.1 Cassandra - Introduction

Page 108: An Introduction to time series with Team Apache

My Background

…ran into this problem

Page 109: An Introduction to time series with Team Apache

How did we get here?

1960s and 70s

Page 110: An Introduction to time series with Team Apache

How did we get here?

1960s and 70s 1980s and 90s

Page 111: An Introduction to time series with Team Apache

How did we get here?

1960s and 70s 1980s and 90s 2000s

Page 112: An Introduction to time series with Team Apache

How did we get here?

1960s and 70s 1980s and 90s 2000s 2010

Page 113: An Introduction to time series with Team Apache

Gave it my best shot

shard 1 shard 2 shard 3 shard 4

router

client

Patrick, all your wildest dreams will come true.

Page 114: An Introduction to time series with Team Apache

Just add complexity!

Page 115: An Introduction to time series with Team Apache

A new plan

Page 116: An Introduction to time series with Team Apache

Dynamo Paper (2007)

How do we build a data store that is:
• Reliable
• Performant
• "Always On"

Nothing new and shiny. Evolutionary. Real computer science.

Also the basis for Riak and Voldemort.

Page 117: An Introduction to time series with Team Apache

BigTable (2006)

• Richer data model
• 1 key, lots of values
• Fast sequential access
• 38 papers cited

Page 118: An Introduction to time series with Team Apache

Cassandra (2008)

• Distributed features of Dynamo
• Data model and storage from BigTable
• February 17, 2010: graduated to a top-level Apache project

Page 119: An Introduction to time series with Team Apache

Cassandra - More than one server

• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
• More capacity? Add a server

Page 120: An Introduction to time series with Team Apache

Throughput (ops/sec): Cassandra vs HBase vs Redis vs MySQL - VLDB benchmark (RWS)

Page 121: An Introduction to time series with Team Apache

Cassandra - Fully Replicated

• Client writes locally
• Data syncs across the WAN
• Replication per data center

Page 122: An Introduction to time series with Team Apache

A Data Ocean, Lake, or Pond

An In-Memory Database

A Key-Value Store

A magical database unicorn that farts rainbows

Page 123: An Introduction to time series with Team Apache

Cassandra for Applications

APACHE

CASSANDRA

Page 124: An Introduction to time series with Team Apache

Hands On!

https://github.com/killrweather/killrweather/wiki/6.-Cassandra-Exercises-on-Killrvideo-Data

KillrWeather Wiki

Page 125: An Introduction to time series with Team Apache

4.1.2 Cassandra - Basic Architecture

Page 126: An Introduction to time series with Team Apache

Row

Partition Key 1 | Column 1 | Column 2 | Column 3 | Column 4

Page 127: An Introduction to time series with Team Apache

Partition

Multiple rows sharing the same partition key:

Partition Key 1 | Column 1 | Column 2 | Column 3 | Column 4
Partition Key 1 | Column 1 | Column 2 | Column 3 | Column 4
Partition Key 1 | Column 1 | Column 2 | Column 3 | Column 4
Partition Key 1 | Column 1 | Column 2 | Column 3 | Column 4

Page 128: An Introduction to time series with Team Apache

Partition with Clustering

Rows within the partition ordered by a clustering column:

Partition Key 1 | Cluster 1 | Column 1 | Column 2 | Column 3
Partition Key 1 | Cluster 2 | Column 1 | Column 2 | Column 3
Partition Key 1 | Cluster 3 | Column 1 | Column 2 | Column 3
Partition Key 1 | Cluster 4 | Column 1 | Column 2 | Column 3

Page 129: An Introduction to time series with Team Apache

Table

A table holds many partitions:

Partition Key 1 | Column 1 | Column 2 | Column 3 | Column 4  (x4 rows)
Partition Key 2 | Column 1 | Column 2 | Column 3 | Column 4  (x4 rows)

Page 130: An Introduction to time series with Team Apache

Keyspace

Keyspace 1 contains Table 1 and Table 2; each table holds partitions for Partition Key 1 and Partition Key 2 (rows of Column 1-4).

Page 131: An Introduction to time series with Team Apache

Node

Server

Page 132: An Introduction to time series with Team Apache

Token

Server

• Each partition is a 128-bit value
• Consistent hash between -2^63 and 2^63
• Each node owns a range of those values
• The token is the beginning of that range, up to the next node's token value
• Virtual Nodes break these ranges down further

Data

Token Range

0 …

Page 133: An Introduction to time series with Team Apache

The cluster Server

Token Range

0 0-100

0-100

Page 134: An Introduction to time series with Team Apache

The cluster Server

Token Range

0 0-50

51 51-100

Server

0-50

51-100

Page 135: An Introduction to time series with Team Apache

The cluster Server

Token Range

0 0-25

26 26-50

51 51-75

76 76-100Server

ServerServer

0-25

76-100

26-50

51-75
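The ring above (tokens 0, 26, 51, 76 over a toy 0-100 range) can be sketched directly: a partition key hashes to a value, and the node whose token range contains that value owns the partition. Real Cassandra hashes over a vastly larger range; the 0-100 space and the `hash100`/`owner` helpers here are simplifications for illustration:

```python
import hashlib

# Toy cluster: each node's token is the start of the range it owns.
tokens = [(0, "server-1"), (26, "server-2"), (51, "server-3"), (76, "server-4")]

def hash100(key):
    """Hash a partition key into the toy 0-100 token space."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % 101

def owner(key):
    """Find the node whose token range contains the key's hash."""
    h = hash100(key)
    owner_node = tokens[-1][1]   # wrap-around default: highest token
    for token, node in tokens:
        if token <= h:
            owner_node = node
    return owner_node

for key in ("weather-station-1", "weather-station-2"):
    print(key, "->", owner(key))
```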

Page 136: An Introduction to time series with Team Apache

4.1.3 Cassandra - Replication, High Availability and Multi-datacenter

Page 137: An Introduction to time series with Team Apache

Replication

10.0.0.1 00-25

DC1

DC1: RF=1

Node Primary

10.0.0.1 00-25

10.0.0.2 26-50

10.0.0.3 51-75

10.0.0.4 76-100

10.0.0.1 00-25

10.0.0.4 76-100

10.0.0.2 26-50

10.0.0.3 51-75

Page 138: An Introduction to time series with Team Apache

Replication

10.0.0.1

00-25

10.0.0.4 76-100

10.0.0.2 26-50

10.0.0.3 51-75

DC1

DC1: RF=2

Node Primary Replica

10.0.0.1 00-25 76-100

10.0.0.2 26-50 00-25

10.0.0.3 51-75 26-50

10.0.0.4 76-100 51-75

76-100

00-25

26-50

51-75

Page 139: An Introduction to time series with Team Apache

Replication

DC1

DC1: RF=3

Node Primary Replica Replica

10.0.0.1 00-25 76-100 51-75

10.0.0.2 26-50 00-25 76-100

10.0.0.3 51-75 26-50 00-25

10.0.0.4 76-100 51-75 26-50

10.0.0.1 00-25

10.0.0.4 76-100

10.0.0.2 26-50

10.0.0.3 51-75

76-100 51-75

00-25 76-100

26-50 00-25

51-75 26-50
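The replica tables above follow one pattern: each node stores its primary range plus the primary ranges of the previous RF-1 nodes on the ring (equivalently, a partition is written to its primary node and the next RF-1 nodes walking the ring). A short sketch of that placement rule:

```python
# Ring order matches the tables above: each node with its primary range.
ring = [
    ("10.0.0.1", "00-25"),
    ("10.0.0.2", "26-50"),
    ("10.0.0.3", "51-75"),
    ("10.0.0.4", "76-100"),
]

def ranges_held(node_index, rf):
    """Ranges a node stores at replication factor rf: its own primary range
    plus the primary ranges of the previous rf-1 nodes on the ring."""
    n = len(ring)
    return [ring[(node_index - i) % n][1] for i in range(rf)]

# With RF=3, node 10.0.0.1 holds 00-25 (primary) plus 76-100 and 51-75,
# matching the DC1: RF=3 table above.
print(ranges_held(0, rf=3))  # ['00-25', '76-100', '51-75']
```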

Page 140: An Introduction to time series with Team Apache

Consistency

DC1

DC1: RF=3

Node Primary Replica Replica

10.0.0.1 00-25 76-100 51-75

10.0.0.2 26-50 00-25 76-100

10.0.0.3 51-75 26-50 00-25

10.0.0.4 76-100 51-75 26-50

10.0.0.1 00-25

10.0.0.4 76-100

10.0.0.2 26-50

10.0.0.3 51-75

76-100 51-75

00-25 76-100

26-50 00-25

51-75 26-50

Client

Write to partition 15

Page 141: An Introduction to time series with Team Apache

Repair

DC1: RF=3

Node Primary Replica Replica

10.0.0.1 00-25 76-100 51-75

10.0.0.2 26-50 00-25 76-100

10.0.0.3 51-75 26-50 00-25

10.0.0.4 76-100 51-75 26-50

10.0.0.1 00-25

10.0.0.4 76-100

10.0.0.2 26-50

10.0.0.3 51-75

76-100 51-75

00-25 76-100

26-50 00-25

51-75 26-50

Client

Repair = "Am I consistent?"

You are missing some data. Here. Have some of mine.

Page 142: An Introduction to time series with Team Apache

Consistency level

Consistency Level | Number of Nodes Acknowledged
One | One - read repair triggered
Local One | One - read repair in the local DC
Quorum | 51% (a majority of replicas)
Local Quorum | 51% in the local DC
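A quorum is simply a majority of the replicas: floor(RF/2) + 1 acknowledgements. A quick sketch:

```python
def quorum(replication_factor):
    """Replica acknowledgements needed for a majority (QUORUM)."""
    return replication_factor // 2 + 1

# With RF=3, QUORUM needs 2 of 3 replicas, so one replica can be
# down and reads/writes at QUORUM still succeed.
for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}")
```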

Pages 143–145: An Introduction to time series with Team Apache

Consistency

DC1: RF=3, same replica layout as above. A client writes to partition 15: with CL = One, a single replica acknowledges the write; with CL = Quorum, a majority (2 of 3) of the replicas must acknowledge.

Pages 146–148: An Introduction to time series with Team Apache

Multi-datacenter

A client writes to partition 15 in DC1; the write is replicated across the WAN to DC2.

DC1: RF=3
Node | Primary | Replica | Replica
10.0.0.1 | 00-25 | 76-100 | 51-75
10.0.0.2 | 26-50 | 00-25 | 76-100
10.0.0.3 | 51-75 | 26-50 | 00-25
10.0.0.4 | 76-100 | 51-75 | 26-50

DC2: RF=3
Node | Primary | Replica | Replica
10.1.0.1 | 00-25 | 76-100 | 51-75
10.1.0.2 | 26-50 | 00-25 | 76-100
10.1.0.3 | 51-75 | 26-50 | 00-25
10.1.0.4 | 76-100 | 51-75 | 26-50

Page 149: An Introduction to time series with Team Apache

4.2.1 Cassandra - Weather Website Example

Page 150: An Introduction to time series with Team Apache

Example: Weather Station

• Weather station collects data
• Cassandra stores it in sequence
• Application reads it in sequence
• Aggregations go in a fast lookup table

Windsor, California - July 1, 2014
High: 73.4  Low: 51.4
Precipitation: 0.0  2014 Total: 8.3"

Weather for Windsor, California as of 9 PM PST, July 7th 2015
Current Temp: 71 F
Daily Precipitation: 0.0"
High: 85 F  Low: 58 F
2015 Total Precipitation: 12.0"

Page 151: An Introduction to time series with Team Apache

Weather Web Site

Cassandra-only DC

Cassandra + Spark DC

Spark Jobs

Spark Streaming

Page 152: An Introduction to time series with Team Apache

Success starts with…

The data model!

Page 153: An Introduction to time series with Team Apache

Relational Data Models
• 5 normal forms
• Foreign Keys
• Joins

Employees
deptId | First | Last
1 | Edgar | Codd
2 | Raymond | Boyce

Department
id | Dept
1 | Engineering
2 | Math

Page 154: An Introduction to time series with Team Apache

Relational Modeling

Data

Models

Application

Page 155: An Introduction to time series with Team Apache

Cassandra Modeling

Data

Models

Application

Page 156: An Introduction to time series with Team Apache

CQL vs SQL
• No joins
• Limited aggregations

Employees
deptId | First | Last
1 | Edgar | Codd
2 | Raymond | Boyce

Department
id | Dept
1 | Engineering
2 | Math

SELECT e.First, e.Last, d.Dept
FROM Department d, Employees e
WHERE 'Codd' = e.Last
AND e.deptId = d.id

Page 157: An Introduction to time series with Team Apache

Denormalization
• Combine table columns into a single view
• No joins

SELECT First, Last, Dept
FROM employees
WHERE id = '1'

Employees
id | First | Last | Dept
1 | Edgar | Codd | Engineering
2 | Raymond | Boyce | Math

Page 158: An Introduction to time series with Team Apache

Queries supported

CREATE TABLE raw_weather_data (
  wsid text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  dewpoint double,
  pressure double,
  wind_direction int,
  wind_speed double,
  sky_condition int,
  sky_condition_text text,
  one_hour_precip double,
  six_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Get weather data given:
• Weather Station ID
• Weather Station ID and Time
• Weather Station ID and Range of Time
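Those three access patterns map directly onto the primary key: the partition key (`wsid`) is always required, and the clustering columns narrow by time. Hedged examples (the station id is the one used later in this deck; the dates are illustrative):

```sql
-- All data for one weather station (partition key only)
SELECT * FROM raw_weather_data
 WHERE wsid = '727930:24233';

-- One point in time (partition key + all clustering columns)
SELECT temperature FROM raw_weather_data
 WHERE wsid = '727930:24233'
   AND year = 2014 AND month = 7 AND day = 1 AND hour = 12;

-- A range of time (clustering columns support range predicates)
SELECT * FROM raw_weather_data
 WHERE wsid = '727930:24233'
   AND year = 2014 AND month = 7
   AND day >= 1 AND day <= 7;
```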

Page 159: An Introduction to time series with Team Apache

Aggregation Queries

CREATE TABLE daily_aggregate_temperature (
  wsid text,
  year int,
  month int,
  day int,
  high double,
  low double,
  mean double,
  variance double,
  stdev double,
  PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

Get temperature stats given:
• Weather Station ID
• Weather Station ID and Time
• Weather Station ID and Range of Time

Windsor California July 1, 2014

High: 73.4

Low : 51.4
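The values stored per day (high, low, mean, variance, stdev) can be computed from a day's raw readings with nothing more than the standard library - a sketch of what a Spark job would precompute into this table (the readings are made-up values):

```python
import statistics

# One day's hourly temperature readings for one station (made-up values).
readings = [51.4, 53.0, 58.2, 64.5, 70.1, 73.4, 71.9, 66.0, 60.3, 55.7]

# The per-day row that would be written into daily_aggregate_temperature.
daily = {
    "high": max(readings),
    "low": min(readings),
    "mean": statistics.fmean(readings),
    "variance": statistics.pvariance(readings),
    "stdev": statistics.pstdev(readings),
}
print(daily["high"], daily["low"])  # 73.4 51.4
```

Precomputing these into a lookup table is what makes the website's "High / Low" panel a single-partition read instead of a scan.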

Page 160: An Introduction to time series with Team Apache

daily_aggregate_precip

CREATE TABLE daily_aggregate_precip ( wsid text, year int, month int, day int, precipitation counter, PRIMARY KEY ((wsid), year, month, day) ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

Get precipitation stats given
• Weather Station ID
• Weather Station ID and Time
• Weather Station ID and Range of Time

Windsor California July 1, 2014

High: 73.4 Low : 51.4 Precipitation: 0.0

Page 161: An Introduction to time series with Team Apache

year_cumulative_precip

CREATE TABLE year_cumulative_precip (
   wsid text,
   year int,
   precipitation counter,
   PRIMARY KEY ((wsid), year)
) WITH CLUSTERING ORDER BY (year DESC);

Get latest yearly precipitation accumulation
• Weather Station ID
• Weather Station ID and Time
• Provide fast lookup

Windsor California July 1, 2014

High: 73.4 Low : 51.4

Precipitation: 0.0 2014 Total: 8.3”
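The counter column behaves like an increment-only accumulator keyed by (wsid, year): clients send increments, never read-modify-write. A minimal sketch of that semantics (station id and amounts are illustrative):

```python
from collections import defaultdict

# Counter semantics: UPDATE ... SET precipitation = precipitation + ?
# is an increment on the server, never a read-modify-write in the client.
year_cumulative_precip = defaultdict(float)

def add_precip(wsid, year, amount):
    year_cumulative_precip[(wsid, year)] += amount

for amount in (0.0, 1.2, 0.4, 6.7):  # one increment per observation
    add_precip("10010:99999", 2014, amount)

assert round(year_cumulative_precip[("10010:99999", 2014)], 1) == 8.3
```

Because every observation is a blind increment, the running total is always one fast lookup away.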

Page 162: An Introduction to time series with Team Apache

4.2.1.1.1 Cassandra - CQL

Page 163: An Introduction to time series with Team Apache

Table

CREATE TABLE weather_station (
   id text,
   name text,
   country_code text,
   state_code text,
   call_sign text,
   lat double,
   long double,
   elevation double,
   PRIMARY KEY (id)
);

Table Name

Column Name
Column CQL Type

Primary Key Designation Partition Key

Page 164: An Introduction to time series with Team Apache

Table

CREATE TABLE daily_aggregate_precip ( wsid text, year int, month int, day int, precipitation counter, PRIMARY KEY ((wsid), year, month, day) ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

Partition Key
Clustering Columns

Order Override

Page 165: An Introduction to time series with Team Apache

Insert

INSERT INTO weather_station (id, call_sign, country_code, elevation, lat, long, name, state_code)
VALUES ('727930:24233', 'KSEA', 'US', 121.9, 47.467, -122.32, 'SEATTLE SEATTLE-TACOMA INTL A', 'WA');

Table Name Fields

Values

Partition Key: Required

Page 166: An Introduction to time series with Team Apache

Lightweight Transactions

INSERT INTO weather_station (id, call_sign, country_code, elevation, lat, long, name, state_code)
VALUES ('727930:24233', 'KSEA', 'US', 121.9, 47.467, -122.32, 'SEATTLE SEATTLE-TACOMA INTL A', 'WA')
IF NOT EXISTS;

Don’t overwrite!
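The IF NOT EXISTS check is compare-and-set: the insert applies only when no row already exists for the key. Python's dict gives the same "don't overwrite" behavior (a sketch of the semantics only, not the Paxos round Cassandra actually runs; the helper name is invented):

```python
weather_station = {}

def insert_if_not_exists(table, station_id, row):
    """Return True if applied, False if a row existed (like CQL's [applied] column)."""
    if station_id in table:
        return False
    table[station_id] = row
    return True

assert insert_if_not_exists(weather_station, "727930:24233", {"call_sign": "KSEA"}) is True
# A second attempt does not overwrite the existing row.
assert insert_if_not_exists(weather_station, "727930:24233", {"call_sign": "XXXX"}) is False
assert weather_station["727930:24233"]["call_sign"] == "KSEA"
```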

Page 167: An Introduction to time series with Team Apache

Lightweight Transactions

CREATE TABLE IF NOT EXISTS weather_station (
   id text,
   name text,
   country_code text,
   state_code text,
   call_sign text,
   lat double,
   long double,
   elevation double,
   PRIMARY KEY (id)
);

No-op. Don’t throw error

Page 168: An Introduction to time series with Team Apache

Select

 id           | call_sign | country_code | elevation | lat    | long    | name                          | state_code
--------------+-----------+--------------+-----------+--------+---------+-------------------------------+------------
 727930:24233 | KSEA      | US           | 121.9     | 47.467 | -122.32 | SEATTLE SEATTLE-TACOMA INTL A | WA

SELECT id, call_sign, country_code, elevation, lat, long, name, state_code
FROM weather_station
WHERE id = '727930:24233';

Fields

Table Name

Primary Key: Partition Key Required

Page 169: An Introduction to time series with Team Apache

Update

UPDATE weather_station
SET name = 'SeaTac International Airport'
WHERE id = '727930:24233';

 id           | call_sign | country_code | elevation | lat    | long    | name                         | state_code
--------------+-----------+--------------+-----------+--------+---------+------------------------------+------------
 727930:24233 | KSEA      | US           | 121.9     | 47.467 | -122.32 | SeaTac International Airport | WA

Table Name Fields to Update: Not in Primary Key

Primary Key

Page 170: An Introduction to time series with Team Apache

Lightweight Transactions

UPDATE weather_station
SET name = 'SeaTac International Airport'
WHERE id = '727930:24233'
IF name = 'SEATTLE SEATTLE-TACOMA INTL A';

Don’t overwrite!

Page 171: An Introduction to time series with Team Apache

Delete

DELETE FROM weather_station
WHERE id = '727930:24233';

Table Name

Primary Key: Required

Page 172: An Introduction to time series with Team Apache

Collections: Set

CREATE TABLE weather_station (
   id text,
   name text,
   country_code text,
   state_code text,
   call_sign text,
   lat double,
   long double,
   elevation double,
   equipment set<text>,
   PRIMARY KEY (id)
);

equipment set<text>

CQL Type: For Ordering

Column Name

Page 173: An Introduction to time series with Team Apache

Collections: Set, List

CREATE TABLE weather_station (
   id text,
   name text,
   country_code text,
   state_code text,
   call_sign text,
   lat double,
   long double,
   elevation double,
   equipment set<text>,
   service_dates list<timestamp>,
   PRIMARY KEY (id)
);

equipment set<text>

service_dates list<timestamp>

Column Name

CQL Type: For Ordering

Column Name

CQL Type

Page 174: An Introduction to time series with Team Apache

Collections: Set, List, Map

CREATE TABLE weather_station (
   id text,
   name text,
   country_code text,
   state_code text,
   call_sign text,
   lat double,
   long double,
   elevation double,
   equipment set<text>,
   service_dates list<timestamp>,
   service_notes map<timestamp,text>,
   PRIMARY KEY (id)
);

equipment set<text>

service_dates list<timestamp>

service_notes map<timestamp,text>

Column Name

Column Name

CQL Key Type CQL Value Type

CQL Type: For Ordering

Column Name

CQL Type

Page 175: An Introduction to time series with Team Apache

User Defined Functions*

*As of Cassandra 2.2

• Built-in: avg, min, max, count(<column name>)
• Runs on server
• Always use with partition key

Page 176: An Introduction to time series with Team Apache

User Defined Functions

CREATE FUNCTION maxI(current int, candidate int)
CALLED ON NULL INPUT
RETURNS int
LANGUAGE java
AS 'if (current == null) return candidate; else return Math.max(current, candidate);';

CREATE AGGREGATE maxAgg(int)
SFUNC maxI
STYPE int
INITCOND null;

CQL Type

Pure Function

SELECT maxAgg(temperature)
FROM raw_weather_data
WHERE wsid = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1;

Aggregate using function over partition
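The user-defined aggregate is a fold: the state function maxI is applied once per row, starting from INITCOND. A sketch of that evaluation model in Python (values are the sample temperatures used elsewhere in the deck):

```python
def max_i(current, candidate):
    # Mirrors the Java body: CALLED ON NULL INPUT, so current may be None.
    if current is None:
        return candidate
    return max(current, candidate)

def max_agg(rows, sfunc=max_i, initcond=None):
    # CREATE AGGREGATE: fold SFUNC over every row, starting from INITCOND.
    state = initcond
    for value in rows:
        state = sfunc(state, value)
    return state

temperatures = [-5.6, -5.1, -4.9, -5.3]
assert max_agg(temperatures) == -4.9
```

Running the state function server-side, over one partition, is what keeps the aggregate cheap.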

Page 177: An Introduction to time series with Team Apache

4.2.1.1.2 Cassandra - Partitions and clustering

Page 178: An Introduction to time series with Team Apache

Primary Key

CREATE TABLE raw_weather_data (
   wsid text,
   year int,
   month int,
   day int,
   hour int,
   temperature double,
   dewpoint double,
   pressure double,
   wind_direction int,
   wind_speed double,
   sky_condition int,
   sky_condition_text text,
   one_hour_precip double,
   six_hour_precip double,
   PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Page 179: An Introduction to time series with Team Apache

Primary key relationship

PRIMARY KEY ((wsid),year,month,day,hour)

Page 180: An Introduction to time series with Team Apache

Primary key relationship

Partition Key

PRIMARY KEY ((wsid),year,month,day,hour)

Page 181: An Introduction to time series with Team Apache

Primary key relationship

PRIMARY KEY ((wsid),year,month,day,hour)

Partition Key Clustering Columns

Page 182: An Introduction to time series with Team Apache

Primary key relationship

Partition Key Clustering Columns

10010:99999

PRIMARY KEY ((wsid),year,month,day,hour)

Page 183: An Introduction to time series with Team Apache

Primary key relationship

PRIMARY KEY ((wsid), year, month, day, hour)

Partition Key: 10010:99999
Clustering Columns (newest first):
2005:12:1:10  -5.6
2005:12:1:9   -5.1
2005:12:1:8   -4.9
2005:12:1:7   -5.3

Page 184: An Introduction to time series with Team Apache

Clustering

raw_weather_data, ORDER BY DESC:

wsid        | year | month | day | hour | temperature
10010:99999 | 2005 | 12    | 1   | 10   | -5.6
10010:99999 | 2005 | 12    | 1   | 9    | -5.1
10010:99999 | 2005 | 12    | 1   | 8    | -4.9
10010:99999 | 2005 | 12    | 1   | 7    | -5.3
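The CLUSTERING ORDER BY (... DESC) clause means rows inside a partition are kept newest-first on disk. A sketch of that ordering with Python tuples standing in for the composite clustering key:

```python
# Rows within the 10010:99999 partition: (year, month, day, hour) -> temperature
rows = {(2005, 12, 1, 7): -5.3,
        (2005, 12, 1, 9): -5.1,
        (2005, 12, 1, 10): -5.6,
        (2005, 12, 1, 8): -4.9}

# CLUSTERING ORDER BY year DESC, month DESC, day DESC, hour DESC:
# sort the composite clustering key descending.
newest_first = sorted(rows, reverse=True)

assert newest_first == [(2005, 12, 1, 10), (2005, 12, 1, 9),
                        (2005, 12, 1, 8), (2005, 12, 1, 7)]
assert rows[newest_first[0]] == -5.6  # the latest reading comes back first
```

For time series this matters: "give me the most recent readings" reads from the front of the partition with no sort at query time.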

Page 185: An Introduction to time series with Team Apache

Partition keys

10010:99999 → Murmur3 Hash → Token = 7224631062609997448

722266:13850 → Murmur3 Hash → Token = -6804302034103043898

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('722266:13850', 2005, 12, 1, 7, -5.6);

Consistent hash. A 64-bit number between -2^63 and 2^63-1
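A toy sketch of the partition-key-to-token step. zlib.crc32 stands in for Murmur3, and the ring is shrunk to 0-99 as on the next slide; the function name is invented:

```python
import zlib

def toy_token(partition_key, ring_size=100):
    # Stand-in for Murmur3: any deterministic hash, mapped onto the ring.
    return zlib.crc32(partition_key.encode()) % ring_size

t1 = toy_token("10010:99999")
t2 = toy_token("722266:13850")

# The same key always hashes to the same token ...
assert toy_token("10010:99999") == t1
# ... and every token falls inside the ring.
assert 0 <= t1 < 100 and 0 <= t2 < 100
```

Determinism is the whole point: any client, anywhere, computes the same token for the same partition key.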

Page 186: An Introduction to time series with Team Apache

Partition keys

10010:99999 Murmur3 Hash Token = 15

722266:13850 Murmur3 Hash Token = 77

For this example, let’s make it a reasonable number

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('722266:13850', 2005, 12, 1, 7, -5.6);

Page 187: An Introduction to time series with Team Apache

Data Locality

DC1: RF=3

Node     | Primary | Replica | Replica
10.0.0.1 | 00-25   | 76-100  | 51-75
10.0.0.2 | 26-50   | 00-25   | 76-100
10.0.0.3 | 51-75   | 26-50   | 00-25
10.0.0.4 | 76-100  | 51-75   | 26-50


Client

Read partition 15

DC2


DC2: RF=3

Client

Read partition 15

Node     | Primary | Replica | Replica
10.1.0.1 | 00-25   | 76-100  | 51-75
10.1.0.2 | 26-50   | 00-25   | 76-100
10.1.0.3 | 51-75   | 26-50   | 00-25
10.1.0.4 | 76-100  | 51-75   | 26-50

Page 188: An Introduction to time series with Team Apache

Data Locality

wsid = '10010:99999' ?

1000 Node Cluster

You are here!
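The point of the slide: given only the partition key, any client can compute which nodes hold the data, even on a 1000-node cluster. A toy sketch of that lookup as a ring walk over the DC1 table above (RF=3, toy 0-100 token ranges; the helper name is invented):

```python
# Primary token ranges from the DC1 table, in ring order.
ring = [("10.0.0.1", range(0, 26)),
        ("10.0.0.2", range(26, 51)),
        ("10.0.0.3", range(51, 76)),
        ("10.0.0.4", range(76, 100))]

def replicas_for(token, rf=3):
    """Owner of the token, plus the next rf-1 nodes clockwise on the ring."""
    start = next(i for i, (_, r) in enumerate(ring) if token in r)
    return [ring[(start + i) % len(ring)][0] for i in range(rf)]

# "Read partition 15": token 15 lives in 00-25, so .1 is primary and
# .2 and .3 hold the replicas -- matching the table above.
assert replicas_for(15) == ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
```

Token-aware drivers do this computation client-side and send the request straight to a replica.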

Page 189: An Introduction to time series with Team Apache

4.2.1.1.3 Cassandra - Read and Write Path

Page 190: An Introduction to time series with Team Apache

Writes

CREATE TABLE raw_weather_data (
   wsid text,
   year int,
   month int,
   day int,
   hour int,
   temperature double,
   dewpoint double,
   pressure double,
   wind_direction int,
   wind_speed double,
   sky_condition int,
   sky_condition_text text,
   one_hour_precip double,
   six_hour_precip double,
   PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Page 191: An Introduction to time series with Team Apache

Writes

CREATE TABLE raw_weather_data (
   wsid text,
   year int,
   month int,
   day int,
   hour int,
   temperature double,
   PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 10, -5.6);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 9, -5.1);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 8, -4.9);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.3);

Page 192: An Introduction to time series with Team Apache

Write Path

Client: INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
        VALUES ('10010:99999', 2005, 12, 1, 7, -5.3);

Node: Commit Log (durability) → Memtable (in memory) → flushed to SSTables on disk
(Compaction merges SSTables in the background)
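The write path above can be sketched in Python: append to the commit log for durability, update the in-memory memtable, and flush sorted runs to immutable SSTables (a simplified model with invented names; no compaction, replication, or recovery):

```python
class ToyNode:
    """Simplified Cassandra write path: commit log -> memtable -> SSTable flush."""
    def __init__(self, flush_threshold=2):
        self.commit_log = []     # durable, append-only
        self.memtable = {}       # in-memory, mutable
        self.sstables = []       # on-disk, immutable sorted runs
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. durability first
        self.memtable[key] = value            # 2. then memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. the memtable is written out as one sorted, immutable SSTable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

node = ToyNode()
node.write(("10010:99999", 2005, 12, 1, 7), -5.3)
node.write(("10010:99999", 2005, 12, 1, 8), -4.9)  # reaches threshold, flushes

assert node.memtable == {} and len(node.sstables) == 1
```

Because every mutation is an append plus an in-memory update, writes are sequential I/O, which is why Cassandra's write path is so fast.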

Page 193: An Introduction to time series with Team Apache

Storage Model - Logical View

wsid        | hour         | temperature
10010:99999 | 2005:12:1:10 | -5.6
10010:99999 | 2005:12:1:9  | -5.1
10010:99999 | 2005:12:1:8  | -4.9
10010:99999 | 2005:12:1:7  | -5.3

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1;

Page 194: An Introduction to time series with Team Apache

Storage Model - Disk Layout

10010:99999 → 2005:12:1:10 -5.6 | 2005:12:1:9 -5.1 | 2005:12:1:8 -4.9 | 2005:12:1:7 -5.3

Merged, Sorted and Stored Sequentially

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1;
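Because each flushed run is already sorted by clustering key, a read can reconstruct the partition with a streaming heap merge over the memtable and SSTables. A sketch with heapq.merge (two invented runs for the same partition):

```python
import heapq

# Two sorted runs for the same partition, e.g. an older SSTable and a newer one.
older = [((2005, 12, 1, 7), -5.3), ((2005, 12, 1, 9), -5.1)]
newer = [((2005, 12, 1, 8), -4.9), ((2005, 12, 1, 10), -5.6)]

# heapq.merge streams the sorted runs into one sorted sequence
# without loading everything into memory.
merged = list(heapq.merge(older, newer))

assert [hour for (_, _, _, hour), _ in merged] == [7, 8, 9, 10]
```

"Merged, sorted, stored sequentially" is what makes this merge cheap: every input is already in clustering-key order.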

Page 195: An Introduction to time series with Team Apache

Storage Model - Disk Layout

10010:99999 → 2005:12:1:11 -4.9 | 2005:12:1:10 -5.6 | 2005:12:1:9 -5.1 | 2005:12:1:8 -4.9 | 2005:12:1:7 -5.3

Merged, Sorted and Stored Sequentially

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1;

Page 196: An Introduction to time series with Team Apache

Storage Model - Disk Layout

10010:99999 → 2005:12:1:12 -5.4 | 2005:12:1:11 -4.9 | 2005:12:1:10 -5.6 | 2005:12:1:9 -5.1 | 2005:12:1:8 -4.9 | 2005:12:1:7 -5.3

Merged, Sorted and Stored Sequentially

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1;

Page 197: An Introduction to time series with Team Apache

Read Path

Client → Node: Memtable + SSTables merged into the result Data

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10;

Page 198: An Introduction to time series with Team Apache

Query patterns
• Range queries
• "Slice" operation on disk

Single seek on disk

Partition key for locality

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10;

10010:99999 → 2005:12:1:10 -5.6 | 2005:12:1:9 -5.1 | 2005:12:1:8 -4.9 | 2005:12:1:7 -5.3
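Because the partition is stored sorted by clustering key, a range predicate like hour >= 7 AND hour <= 10 becomes one contiguous slice, locatable by binary search. A sketch using bisect (hours and temperatures are invented samples):

```python
import bisect

# Clustering keys for one partition/day, stored in sorted order on disk.
hours = [0, 3, 5, 7, 8, 9, 10, 14, 23]
temps = [-6.0, -6.2, -5.9, -5.3, -4.9, -5.1, -5.6, -3.0, -4.4]

def slice_hours(lo, hi):
    """One seek to the slice start, then a sequential scan to the end."""
    start = bisect.bisect_left(hours, lo)
    stop = bisect.bisect_right(hours, hi)
    return list(zip(hours[start:stop], temps[start:stop]))

assert slice_hours(7, 10) == [(7, -5.3), (8, -4.9), (9, -5.1), (10, -5.6)]
```

That "single seek plus sequential read" shape is why slice queries over time ranges are the sweet spot for this data model.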

Page 199: An Introduction to time series with Team Apache

Query patterns
• Range queries
• "Slice" operation on disk

Programmers like this: sorted by event_time

weather_station | hour         | temperature
10010:99999     | 2005:12:1:10 | -5.6
10010:99999     | 2005:12:1:9  | -5.1
10010:99999     | 2005:12:1:8  | -4.9
10010:99999     | 2005:12:1:7  | -5.3

SELECT weatherstation, hour, temperature
FROM temperature
WHERE weatherstation_id = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10;

Page 200: An Introduction to time series with Team Apache

5.1 Spark and Cassandra - Architecture

Page 201: An Introduction to time series with Team Apache

Great combo

Store a ton of data Analyze a ton of data

Page 202: An Introduction to time series with Team Apache

Great combo

Spark Streaming

Near Real-time

SparkSQL

Structured Data

MLLib

Machine Learning

GraphX

Graph Analysis

Page 203: An Introduction to time series with Team Apache

Great combo

Spark Streaming

Near Real-time

SparkSQL

Structured Data

MLLib

Machine Learning

GraphX

Graph Analysis

CREATE TABLE raw_weather_data (
   wsid text,
   year int,
   month int,
   day int,
   hour int,
   temperature double,
   dewpoint double,
   pressure double,
   wind_direction int,
   wind_speed double,
   sky_condition int,
   sky_condition_text text,
   one_hour_precip double,
   six_hour_precip double,
   PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Spark Connector

Page 204: An Introduction to time series with Team Apache

Executor

Master

Worker

Executor

Executor

Server

Page 205: An Introduction to time series with Team Apache

Master

Worker

Worker

Worker Worker

Token Ranges 0-100

0-24

25-49

50-74

75-99

I will only analyze 25% of the data.

Page 206: An Introduction to time series with Team Apache

Master

Worker

Worker

Worker Worker

0-24

25-49

50-74

75-99

75-99

0-24

25-49

50-74

Transactional  Analytics

Page 207: An Introduction to time series with Team Apache

Executor

Master

Worker

Executor

Executor

75-99

SELECT * FROM keyspace.table WHERE token(pk) > 75 AND token(pk) <= 99

Spark RDD

Spark Partition

Spark Partition

Spark Partition

Spark Connector
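The connector's placement step can be sketched as: divide the token ring into contiguous ranges and hand each range to the executor co-located with the replica that owns it. A toy version on the 0-99 ring from the earlier slides (the function name is invented):

```python
def split_token_ring(ring_size, num_workers):
    """Divide [0, ring_size) into contiguous, non-overlapping ranges."""
    step, rem = divmod(ring_size, num_workers)
    ranges, start = [], 0
    for i in range(num_workers):
        end = start + step + (1 if i < rem else 0)
        ranges.append((start, end - 1))  # inclusive bounds, as on the slide
        start = end
    return ranges

ranges = split_token_ring(100, 4)
assert ranges == [(0, 24), (25, 49), (50, 74), (75, 99)]

# Each range becomes one query of the shape
#   SELECT * FROM keyspace.table WHERE token(pk) > lo AND token(pk) <= hi
# issued by the executor local to that replica.
```

Each Spark partition therefore reads only data its own node already stores, which is the data-locality win of the mixed deployment.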

Page 208: An Introduction to time series with Team Apache

Executor

Master

Worker

Executor

Executor

75-99

Spark RDD

Spark Partition

Spark Partition

Spark Partition

Page 209: An Introduction to time series with Team Apache

Spark Connector

                         | Cassandra | Cassandra + Spark
Joins and Unions         | No        | Yes
Transformations          | Limited   | Yes
Outside Data Integration | No        | Yes
Aggregations             | Limited   | Yes

Page 210: An Introduction to time series with Team Apache

Type mapping

CQL Type      | Scala Type
ascii         | String
bigint        | Long
boolean       | Boolean
counter       | Long
decimal       | BigDecimal, java.math.BigDecimal
double        | Double
float         | Float
inet          | java.net.InetAddress
int           | Int
list          | Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map           | Map, TreeMap, java.util.HashMap
set           | Set, TreeSet, java.util.HashSet
text, varchar | String
timestamp     | Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid      | java.util.UUID
uuid          | java.util.UUID
varint        | BigInt, java.math.BigInteger

*nullable values map to Option

Page 211: An Introduction to time series with Team Apache

Execution of jobs

Local
• Connect to localhost master
• Single system dev
• Runs stand alone

Cluster
• Connect to spark master IP
• Production configuration
• Submit using spark-submit

Page 212: An Introduction to time series with Team Apache

Summary

• Cassandra acts as the storage layer for Spark
• Deploy in a mixed cluster configuration
• Spark executors access Cassandra using the DataStax connector
• Deploy your jobs in either local or cluster modes

Page 213: An Introduction to time series with Team Apache

5.2 Spark and Cassandra - Analyzing Cassandra Data

Page 214: An Introduction to time series with Team Apache

Attaching to Spark and Cassandra

// Import Cassandra-specific functions on SparkContext and RDD objects
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._

/** The setMaster("local") lets us run & test the job right in our IDE */
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("local[*]")
  .setAppName(getClass.getName)
  // Optionally
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)

Page 215: An Introduction to time series with Team Apache

Weather station example

CREATE TABLE raw_weather_data (
   wsid text,
   year int,
   month int,
   day int,
   hour int,
   temperature double,
   dewpoint double,
   pressure double,
   wind_direction int,
   wind_speed double,
   sky_condition int,
   sky_condition_text text,
   one_hour_precip double,
   six_hour_precip double,
   PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Page 216: An Introduction to time series with Team Apache

Simple example

/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")

/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()
println(s"Total Rows in Raw Weather Table: $rowCount")

sc.stop()

Page 217: An Introduction to time series with Team Apache

Simple example

/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")

/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()
println(s"Total Rows in Raw Weather Table: $rowCount")

sc.stop()

Executor

SELECT * FROM isd_weather_data.raw_weather_data

Spark RDD

Spark Partition

Spark Connector

Page 218: An Introduction to time series with Team Apache

Using CQL

SELECT temperature
FROM raw_weather_data
WHERE wsid = '724940:23234'
AND year = 2008
AND month = 12
AND day = 1;

val cqlRRD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .select("temperature")
  .where("wsid = ? AND year = ? AND month = ? AND day = ?",
    "724940:23234", "2008", "12", "1")

Page 219: An Introduction to time series with Team Apache

Using SQL!

spark-sql> SELECT wsid, year, month, day, max(temperature) high, min(temperature) low FROM raw_weather_data WHERE month = 6 AND temperature !=0.0 GROUP BY wsid, year, month, day;

724940:23234 2008 6 1 15.6 10.0 724940:23234 2008 6 2 15.6 10.0 724940:23234 2008 6 3 17.2 11.7 724940:23234 2008 6 4 17.2 10.0 724940:23234 2008 6 5 17.8 10.0 724940:23234 2008 6 6 17.2 10.0 724940:23234 2008 6 7 20.6 8.9

Page 220: An Introduction to time series with Team Apache

SQL with a Join

spark-sql> SELECT ws.name, raw.hour, raw.temperature FROM raw_weather_data raw JOIN weather_station ws ON raw.wsid = ws.id WHERE raw.wsid = '724940:23234' AND raw.year = 2008 AND raw.month = 6 AND raw.day = 1;

SAN FRANCISCO INTL AP 23 15.0 SAN FRANCISCO INTL AP 22 15.0 SAN FRANCISCO INTL AP 21 15.6 SAN FRANCISCO INTL AP 20 15.0 SAN FRANCISCO INTL AP 19 15.0 SAN FRANCISCO INTL AP 18 14.4

Page 221: An Introduction to time series with Team Apache

Analyzing large data sets

val spanRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .select("temperature")
  .where("wsid = ? AND year = ? AND month = ? AND day = ?",
    "724940:23234", "2008", "12", "1")
  .spanBy(row => row.getString("wsid"))

• Specify partition grouping
• Use with large partitions
• Perfect for time series

Page 222: An Introduction to time series with Team Apache

Saving back the weather data

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("isd_weather_data")

cc.sql("""
    SELECT wsid, year, month, day, max(temperature) high, min(temperature) low
    FROM raw_weather_data
    WHERE month = 6 AND temperature != 0.0
    GROUP BY wsid, year, month, day
""")
  .map { row => (row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3),
    row.getDouble(4), row.getDouble(5)) }
  .saveToCassandra("isd_weather_data", "daily_aggregate_temperature")

Page 223: An Introduction to time series with Team Apache

Guest speaker!

Jon Haddad, Chief Data Scientist

Page 224: An Introduction to time series with Team Apache

In the beginning… there was RDD

from operator import add
from random import random
import sys

from pyspark import SparkContext

sc = SparkContext(appName="PythonPi")
partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

count = sc.parallelize(range(1, n + 1), partitions).\
    map(f).reduce(add)

print("Pi is roughly %f" % (4.0 * count / n))

sc.stop()

Page 225: An Introduction to time series with Team Apache

Why Not Python + RDDs?

Python RDD → Py4J → JavaGatewayServer → JVM RDD

Page 226: An Introduction to time series with Team Apache

DataFrames
• Abstraction over RDDs
• Modeled after Pandas & R
• Structured data
• Python passes commands only
• Commands are pushed down
• Data never leaves the JVM
• You can still use the RDD if you want: DataFrame.rdd

RDD

DataFrame

Page 227: An Introduction to time series with Team Apache

Let's play with code

Page 228: An Introduction to time series with Team Apache

Sample Dataset - Movielens
• Subset of movies (1-100)
• ~800k ratings

CREATE TABLE movielens.movie (
   movie_id int PRIMARY KEY,
   genres set<text>,
   title text
);

CREATE TABLE movielens.rating (
   movie_id int,
   user_id int,
   rating decimal,
   ts int,
   PRIMARY KEY (movie_id, user_id)
);

Page 229: An Introduction to time series with Team Apache

Reading Cassandra Tables
• DataFrames has a standard interface for reading
• Cache if you want to keep dataset in memory

cl = "org.apache.spark.sql.cassandra"

movies = sql.read.format(cl).\ load(keyspace="movielens", table="movie").cache()

ratings = sql.read.format(cl).\ load(keyspace="movielens", table="rating").cache()

Page 230: An Introduction to time series with Team Apache

Filtering
• Select specific rows matching various patterns
• Fields do not require indexes
• Filtering occurs in memory
• You can use DSE Solr Search Queries
• Filtering returns a DataFrame

movies.filter(movies.movie_id == 1)
movies[movies.movie_id == 1]
movies.filter("movie_id=1")

movies.filter("title like '%Kombat%'")

movie_id | title                | genres
44       | Mortal Kombat (1995) | ['Action', 'Adventure', 'Fantasy']

Page 231: An Introduction to time series with Team Apache

Filtering
• Helper function: explode()
• select() to keep specific columns
• alias() to rename

from pyspark.sql import functions as F

movies.select("title", F.explode("genres").\
    alias("genre")).\
    filter("genre = 'Action'").select("title")

title               | genre
Broken Arrow (1996) | Action
Broken Arrow (1996) | Adventure
Broken Arrow (1996) | Thriller

title
Broken Arrow (1996)
GoldenEye (1995)
Mortal Kombat (1995)
White Squall (1996)
Nick of Time (1995)
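What explode() does can be traced in plain Python: each input row fans out into one row per element of its collection column, and the result is then filterable row by row (the sample rows are invented, mirroring the slide):

```python
movies = [
    {"title": "Broken Arrow (1996)", "genres": ["Action", "Adventure", "Thriller"]},
    {"title": "Persuasion (1995)", "genres": ["Drama", "Romance"]},
]

# explode(): one output row per element of the collection column.
exploded = [(m["title"], genre) for m in movies for genre in m["genres"]]

# filter("genre = 'Action'").select("title")
action_titles = [title for title, genre in exploded if genre == "Action"]

assert ("Broken Arrow (1996)", "Thriller") in exploded
assert action_titles == ["Broken Arrow (1996)"]
```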

Page 232: An Introduction to time series with Team Apache

Aggregation
• Count, sum, avg
• In SQL: GROUP BY
• Useful with Spark Streaming: aggregate raw data, send to dashboards

ratings.groupBy("movie_id").\
    agg(F.avg("rating").alias('avg'))

ratings.groupBy("movie_id").avg("rating")

movie_id | avg
31       | 3.24
32       | 3.8823
33       | 3.021

Page 233: An Introduction to time series with Team Apache

Joins
• Inner join by default
• Can do various outer joins as well
• Returns a new DF with all the columns

ratings.join(movies, "movie_id")

DataFrame[movie_id: int, user_id: int, rating: decimal(10,0), ts: int,
          genres: array<string>, title: string]

Page 234: An Introduction to time series with Team Apache

Chaining Operations
• Similar to SQL, we can build up in complexity
• Combine joins with aggregations, limits & sorting

ratings.groupBy("movie_id").\
    agg(F.avg("rating").\
    alias('avg')).\
    sort("avg", ascending=False).\
    limit(3).\
    join(movies, "movie_id").\
    select("title", "avg")

title                       | avg
Usual Suspects, The (1995)  | 4.32
Seven (a.k.a. Se7en) (1995) | 4.054
Persuasion (1995)           | 4.053
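The chained pipeline (group, average, sort, limit, join, project) can be traced step by step in plain Python on a made-up ratings sample:

```python
from collections import defaultdict

titles = {1: "Usual Suspects, The (1995)", 2: "Seven (a.k.a. Se7en) (1995)",
          3: "Persuasion (1995)"}
ratings = [(1, 5), (1, 4), (2, 4), (2, 4), (3, 4), (3, 3)]  # (movie_id, rating)

# groupBy("movie_id").agg(avg("rating"))
sums = defaultdict(lambda: [0, 0])
for movie_id, rating in ratings:
    sums[movie_id][0] += rating
    sums[movie_id][1] += 1
averages = {mid: total / count for mid, (total, count) in sums.items()}

# sort(ascending=False).limit(2).join(movies).select("title", "avg")
top = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:2]
result = [(titles[mid], avg) for mid, avg in top]

assert result == [("Usual Suspects, The (1995)", 4.5),
                  ("Seven (a.k.a. Se7en) (1995)", 4.0)]
```

In Spark the same chain is lazy: nothing executes until an action forces the whole plan, which lets Catalyst optimize across the steps.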

Page 235: An Introduction to time series with Team Apache

SparkSQL
• Register DataFrame as Table
• Query using HiveSQL syntax

movies.registerTempTable("movie")
ratings.registerTempTable("rating")

sql.sql("""select title, avg(rating) as avg_rating
           from movie join rating
           on movie.movie_id = rating.movie_id
           group by title
           order by avg_rating DESC limit 3""")
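The query has the same join/group/order/limit shape in any SQL engine; as a self-contained illustration, here it is against an in-memory SQLite database (sample rows invented, and SQLite stands in for SparkSQL only to show the SQL itself):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE rating (movie_id INTEGER, user_id INTEGER, rating REAL);
""")
con.executemany("INSERT INTO movie VALUES (?, ?)",
                [(1, "Usual Suspects, The (1995)"), (2, "Persuasion (1995)")])
con.executemany("INSERT INTO rating VALUES (?, ?, ?)",
                [(1, 10, 5.0), (1, 11, 4.0), (2, 10, 4.0)])

# Same shape as the SparkSQL statement above.
rows = con.execute("""select title, avg(rating) as avg_rating
                      from movie join rating
                      on movie.movie_id = rating.movie_id
                      group by title
                      order by avg_rating DESC limit 3""").fetchall()

assert rows[0] == ("Usual Suspects, The (1995)", 4.5)
```

In the Spark shell, sql.sql(...) returns a DataFrame over the registered temp tables instead of a row list.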

Page 236: An Introduction to time series with Team Apache

Database Migrations
• DataFrame reader supports JDBC
• JOIN operations can be cross DB
• Read dataframe from JDBC, write to Cassandra

Page 237: An Introduction to time series with Team Apache

Inter-DB Migration

from pyspark.sql import SQLContext sql = SQLContext(sc)

m_con = "jdbc:mysql://127.0.0.1:3307/movielens?user=root"

movies = sql.read.jdbc(m_con, "movielens.movies")

movies.write.format("org.apache.spark.sql.cassandra").\ options(table="movie", keyspace="lens").\ save(mode="append")

http://rustyrazorblade.com/2015/08/migrating-from-mysql-to-cassandra-using-spark/

Page 238: An Introduction to time series with Team Apache

Visualization

• dataframe.toPandas()
• Matplotlib
• Seaborn (looks nicer)
• Crunch big data in Spark

Page 239: An Introduction to time series with Team Apache

Jupyter Notebooks
• Iterate quickly
• Test ideas
• Graph results

Page 240: An Introduction to time series with Team Apache

Hands On!

https://github.com/killrweather/killrweather/wiki/7.-Spark-and-Cassandra-Exercises-for-KillrWeather-data

KillrWeather Wiki