Lambda at Weather Scale by Robbie Strickland
TRANSCRIPT
Who Am I?
• Contributor to C* community since 2010
• DataStax MVP 2014/15
• Author, Cassandra High Availability
• Founder, ATL Cassandra User Group
About TWC
• ~30 billion API requests per day
• ~120 million active mobile users
• #3 most active mobile user base
• ~360 PB of traffic daily
• Most weather data comes from us
Use Case
• Billions of events per day (~1.3M per sec)
• Web/mobile beacons, logs, weather conditions + forecasts, etc.
• Keep data forever
• Efficient batch + streaming analysis
• Self-serve data science
• BI / visualization tool support
Architecture

Attempt[0] Architecture
[Architecture diagram: streaming sources (events, 3rd party) feed a RESTful enqueue service and Kafka into a custom ingestion pipeline for stream processing; batch sources (3rd party, other DBs, S3) arrive via ETL; a storage and processing layer serves consumers (operational analytics, business analytics, executive dashboards, data discovery, data science, 3rd-party system integration) through a SQL data access layer]
Attempt[0] Data Model

CREATE TABLE events (
  timebucket bigint,
  timestamp bigint,
  eventtype varchar,
  eventid varchar,
  platform varchar,
  userid varchar,
  version int,
  appid varchar,
  useragent varchar,
  eventdata varchar,
  tags set<varchar>,
  devicedata map<varchar, varchar>,
  PRIMARY KEY ((timebucket, eventtype), timestamp, eventid)
) WITH CACHING = 'none'
  AND COMPACTION = { 'class' : 'DateTieredCompactionStrategy' };

• Event payload == schema-less JSON
• Partitioned by time bucket + type
• Time-series data good fit for DTCS
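A minimal sketch of how the (timebucket, eventtype) partition key might be derived at write time. The talk does not state the bucket width TWC used; an hourly bucket is assumed here for illustration.

```python
BUCKET_MS = 60 * 60 * 1000  # 1-hour buckets (assumption, not TWC's actual setting)

def partition_key(timestamp_ms: int, event_type: str) -> tuple:
    """Truncate the event timestamp to its bucket, so all events of one
    type within the same hour land in the same partition."""
    timebucket = (timestamp_ms // BUCKET_MS) * BUCKET_MS
    return (timebucket, event_type)

# Two events ten minutes apart share a partition:
assert partition_key(1_438_387_200_000, "pageview") == \
       partition_key(1_438_387_800_000, "pageview")
```

Bucketing by time plus type bounds partition size and lets a Spark job read a day of one event type as a small, known set of partitions.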
Attempt[0] tl;dr
• C* everywhere
• Streaming data via custom ingest process
• Kafka backed by RESTful service
• Batch data via Informatica
• Spark SQL through ODBC
• Schema-less event payload
• Date-tiered compaction
Attempt[0] Lessons
• Batch loading large data sets into C* is silly
• … and expensive
• … and using Informatica to do it is SLOW
• Kafka + REST services == unnecessary
• No viable open source C* Hive driver
• DTCS is broken (see CASSANDRA-9666)
• Schema-less == bad:
  • Must parse JSON to extract key data
  • Expensive to analyze by event type
  • Cannot tune by event type
Attempt[1] Architecture
[Architecture diagram: streaming sources (events, 3rd party) feed Amazon SQS into a custom ingestion pipeline for stream processing; batch sources (3rd party, other DBs, S3) arrive via ETL into an S3 data lake for long-term raw storage; a short-term storage and big data processing layer serves consumers (operational analytics, business analytics, executive dashboards, data discovery, data science, 3rd-party system integration) through a SQL data access layer]
Attempt[1] Data Model
• Each event type gets its own table
• Tables individually tuned based on workload
• Schema applied at ingestion:
  • We're reading everything anyway
  • Makes subsequent analysis much easier
  • Allows us to filter junk early
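A hedged sketch of "schema applied at ingestion": parse each schema-less JSON payload once, coerce the fields that the event type's table expects, and drop junk early. The field names and types below are invented for illustration, not TWC's actual schemas.

```python
import json

# Illustrative per-event-type schema (assumption): column name -> type coercion.
PAGEVIEW_SCHEMA = {"userid": str, "url": str, "duration_ms": int}

def apply_schema(raw, schema):
    """Return a typed row for the per-event-type table, or None if the
    payload is junk (unparseable, missing, or uncoercible fields)."""
    try:
        payload = json.loads(raw)
        return {field: cast(payload[field]) for field, cast in schema.items()}
    except (ValueError, KeyError, TypeError):
        return None  # filtered out before it ever reaches storage

row = apply_schema('{"userid": "u1", "url": "/radar", "duration_ms": "450"}',
                   PAGEVIEW_SCHEMA)
assert row == {"userid": "u1", "url": "/radar", "duration_ms": 450}
assert apply_schema('{"userid": "u1"}', PAGEVIEW_SCHEMA) is None
```

Since the pipeline has to deserialize every event anyway, validating and typing at this point costs little and spares every downstream Spark job from re-parsing JSON.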
Attempt[1] tl;dr
• Use C* for streaming data
  • Rolling time window (TTL depends on type)
  • Real-time access to events
  • Data locality makes Spark jobs faster
Attempt[1] tl;dr
• Everything else in S3
  • Batch data loads (mostly logs)
  • Daily C* backups
  • Stored as Parquet
  • Cheap, scalable long-term storage
  • Easy access from Spark
  • Easy to share internally & externally
  • Open source Hive support
Attempt[1] tl;dr
• Kafka replaced by SQS:
  • Scalable & reliable
  • Already fronted by a RESTful interface
  • Nearly free to operate (nothing to manage)
  • Robust security model
  • One queue per event type/platform
  • Built-in monitoring
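A sketch of the "one queue per event type/platform" convention: derive the SQS queue name from the event itself, so producers and the ingestion pipeline agree without shared configuration. The naming pattern is invented here for illustration; TWC's actual convention isn't stated in the talk.

```python
def queue_name(event_type, platform):
    """Map an event to its dedicated SQS queue, e.g.
    ('pageview', 'iOS') -> 'events-pageview-ios' (hypothetical pattern)."""
    return "events-{}-{}".format(event_type.lower(), platform.lower())

assert queue_name("pageview", "iOS") == "events-pageview-ios"
```

With boto3 a producer would then send to that queue (roughly `sqs.send_message(QueueUrl=..., MessageBody=...)`); per-type queues also give per-type CloudWatch metrics for free, which is the "built-in monitoring" point above.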
Attempt[1] tl;dr
• DTCS replaced by Time-Window Compaction
  • Developed by Jeff Jirsa at CrowdStrike
  • Groups similar timestamps/expirations together
  • Simply deletes expired sstables
  • Improved stability & throughput
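A toy illustration of the Time-Window Compaction idea (not the real TWCS implementation): sstables are bucketed by the time window of the data they hold, so once every cell in a window has passed its TTL the whole sstable can be dropped outright instead of being tombstone-compacted. The one-day window is an arbitrary example; TWCS makes it configurable.

```python
from collections import defaultdict

WINDOW_S = 24 * 60 * 60  # 1-day windows (assumption)

def group_by_window(cells):
    """Bucket writes by time window, the way TWCS buckets sstables."""
    buckets = defaultdict(list)
    for ts, value in cells:
        buckets[ts // WINDOW_S].append(value)
    return dict(buckets)

def droppable(sstables, now, ttl):
    """A whole sstable can be deleted once its newest cell has expired."""
    return [s["name"] for s in sstables if s["max_ts"] + ttl < now]

tables = [{"name": "a", "max_ts": 0}, {"name": "b", "max_ts": 500_000}]
assert droppable(tables, now=700_000, ttl=600_000) == ["a"]
```

Dropping expired sstables wholesale avoids the rewrite churn that made DTCS unstable on this workload.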
Fine Print
• Use C* >= 2.1.8
  • CASSANDRA-9637 - fixes Spark input split computation
  • CASSANDRA-9549 - fixes memory leak
  • CASSANDRA-9436 - exposes rpc/broadcast addresses for Spark/cloud environments
• Version incompatibilities abound (check sbt file for Spark-Cassandra connector)
Fine Print
• Two main Spark clusters:
  • Co-located with C* for heavy analysis
    • Predictable load
    • Efficient C* access
  • Self-serve in same DC but not co-located
    • Unpredictable load
    • Favors mining S3 data
    • Isolated from production jobs
Data Modeling
Partitioning
• Opposite strategy from "normal" C* modeling
• Model for good parallelism
• … not for single-partition queries
• Avoid shuffling for most cases
• Shuffles occur when NOT grouping by partition key
• Partition for your most common grouping
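A toy illustration of why grouping by the partition key avoids a shuffle: rows arrive from C* already clustered partition-by-partition, so a streaming group-by (like the Spark-Cassandra connector's `spanBy`) needs no network exchange, whereas grouping by any other column forces a repartition. The data below is invented.

```python
from itertools import groupby

# Rows as read from C*: (partition_key, clustering_ts, value),
# delivered partition-by-partition in read order.
rows = [
    (("bucket1", "pageview"), 1, "a"),
    (("bucket1", "pageview"), 2, "b"),
    (("bucket1", "click"), 1, "c"),
]

# groupby only needs adjacency, which the read order already guarantees,
# so counting per partition never moves data between workers.
counts = {key: len(list(grp)) for key, grp in groupby(rows, key=lambda r: r[0])}
assert counts[("bucket1", "pageview")] == 2
```

This is why the slide says to partition for your most common grouping: the grouping you run most often becomes shuffle-free.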
Secondary Indexes
• Useful for C*-level filtering
• Reduces Spark workload and RAM footprint
• Low cardinality is still the rule
Secondary Indexes (Client Access)
Secondary Indexes (with Spark)
Full-text Indexes
• Enabled via Stratio-Lucene custom index (https://github.com/Stratio/cassandra-lucene-index)
• Great for C*-side filters
• Same access pattern as secondary indexes
Full-text Indexes

CREATE CUSTOM INDEX email_index ON emails(lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
  'refresh_seconds' : '1',
  'schema' : '{
    fields : {
      id      : {type : "integer"},
      user    : {type : "string"},
      subject : {type : "text", analyzer : "english"},
      body    : {type : "text", analyzer : "english"},
      time    : {type : "date", pattern : "yyyy-MM-dd hh:mm:ss"}
    }
  }'
};

SELECT * FROM emails WHERE lucene = '{
  filter : {type : "range", field : "time", lower : "2015-05-26 20:29:59"},
  query  : {type : "phrase", field : "subject", values : ["test"]}
}';

SELECT * FROM emails WHERE lucene = '{
  filter : {type : "range", field : "time", lower : "2015-05-26 18:29:59"},
  query  : {type : "fuzzy", field : "subject", value : "thingy", max_edits : 1}
}';
WIDE ROWS

Caution: Wide Rows
• It only takes one to ruin your day
• Monitor cfstats for max partition bytes
• Use toppartitions to find hot keys
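A hedged sketch of the monitoring advice: scan `nodetool cfstats` output for tables whose largest compacted partition exceeds a threshold. The sample text mimics the cfstats format, and the 100 MB limit is an arbitrary example, not a recommendation from the talk.

```python
SAMPLE = """\
Table: events_pageview
Compacted partition maximum bytes: 268650950
Table: events_click
Compacted partition maximum bytes: 52066354
"""

def wide_partitions(cfstats_text, limit_bytes=100 * 1024 * 1024):
    """Return table names whose max partition size exceeds limit_bytes."""
    flagged, table = [], None
    for line in cfstats_text.splitlines():
        line = line.strip()
        if line.startswith("Table:"):
            table = line.split(":", 1)[1].strip()
        elif line.startswith("Compacted partition maximum bytes:"):
            if int(line.split(":")[1]) > limit_bytes:
                flagged.append(table)
    return flagged

assert wide_partitions(SAMPLE) == ["events_pageview"]
```

Once a table is flagged, `nodetool toppartitions` on that table points at the specific hot keys.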
Avoid Nulls
• Nulls are deletes
• Deletes create tombstones
• Don't write nulls!
• Beware of nulls in prepared statements
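A sketch of the "don't write nulls" rule: strip unset columns from the row before building the INSERT, so no null (and hence no tombstone) is ever written. Binding None into a prepared statement would delete the cell, which is the trap the last bullet warns about. The statement builder is illustrative, not a real driver API.

```python
def insert_statement(table, row):
    """Build an INSERT that names only the columns actually present,
    instead of binding None (a tombstone) for missing ones."""
    cols = {k: v for k, v in row.items() if v is not None}
    placeholders = ", ".join("?" for _ in cols)
    cql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(cols), placeholders)
    return cql, list(cols.values())

cql, params = insert_statement(
    "events", {"eventid": "e1", "userid": None, "platform": "ios"})
assert cql == "INSERT INTO events (eventid, platform) VALUES (?, ?)"
```

(Later drivers and protocol versions also let you mark a bound value as UNSET rather than null, which achieves the same thing with a single prepared statement.)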
Data Exploration

Data Warehouse Paradigm - Old
Ingest → Model → Transform → Design → Visualize

Data Warehouse Paradigm - New
Ingest → Explore → Analyze → Deploy → Visualize
Visualization
• Critical to understanding your data
• Reduced time to visualization
• … from >1 month to minutes (!!)
• Waterfall to agile
Zeppelin
• Open source Spark notebook
• Interpreters for Scala, Python, Spark SQL, CQL, Hive, Shell, & more
• Data visualizations
• Scheduled jobs
Future Work
FiloDB
• Low latency time-series aggregations using Spark + Cassandra/in-memory storage
• Space efficient – similar to Parquet
• SQL queries using ODBC/JDBC
Direct to Parquet
• Stream to Parquet directly
• Eliminate interim storage
• Currently in R&D
We're Hiring!

Robbie Strickland
[email protected]