Lambda at Weather Scale by Robbie Strickland
TRANSCRIPT
Who Am I?
• Contributor to C* community since 2010
• DataStax MVP 2014/15
• Author, Cassandra High Availability
• Founder, ATL Cassandra User Group
About TWC
• ~30 billion API requests per day
• ~120 million active mobile users
• #3 most active mobile user base
• ~360 PB of traffic daily
• Most weather data comes from us
Use Case
• Billions of events per day (~1.3M per sec)
• Web/mobile beacons, logs, weather conditions + forecasts, etc.
• Keep data forever
• Efficient batch + streaming analysis
• Self-serve data science
• BI / visualization tool support
Architecture

Attempt[0] Architecture
[Architecture diagram: streaming sources (events, 3rd party) feed a RESTful enqueue service and Kafka into a custom ingestion pipeline for stream processing; batch sources (3rd party, other DBs, S3) arrive via ETL; a storage and processing layer serves consumers (operational analytics, business analytics, executive dashboards, data discovery, data science, 3rd-party system integration) through a SQL data access layer]
Attempt[0] Data Model

CREATE TABLE events (
  timebucket bigint,
  timestamp bigint,
  eventtype varchar,
  eventid varchar,
  platform varchar,
  userid varchar,
  version int,
  appid varchar,
  useragent varchar,
  eventdata varchar,
  tags set<varchar>,
  devicedata map<varchar, varchar>,
  PRIMARY KEY ((timebucket, eventtype), timestamp, eventid)
) WITH CACHING = 'none'
  AND COMPACTION = { 'class' : 'DateTieredCompactionStrategy' };

• Event payload == schema-less JSON
• Partitioned by time bucket + type
• Time-series data good fit for DTCS
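A minimal sketch of how the (timebucket, eventtype) partition key might be derived at write time. The talk does not state the bucket width TWC used; an hourly bucket is assumed here for illustration.

```python
BUCKET_MS = 60 * 60 * 1000  # 1-hour buckets (assumption, not TWC's actual setting)

def partition_key(timestamp_ms: int, event_type: str) -> tuple:
    """Truncate the event timestamp to its bucket, so all events of one
    type within the same hour land in the same partition."""
    timebucket = (timestamp_ms // BUCKET_MS) * BUCKET_MS
    return (timebucket, event_type)

# Two events ten minutes apart share a partition:
assert partition_key(1_438_387_200_000, "pageview") == \
       partition_key(1_438_387_800_000, "pageview")
```

Bucketing by time plus type bounds partition size and lets a Spark job read a day of one event type as a small, known set of partitions.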
Attempt[0] tl;dr
• C* everywhere
• Streaming data via custom ingest process
• Kafka backed by RESTful service
• Batch data via Informatica
• Spark SQL through ODBC
• Schema-less event payload
• Date-tiered compaction
Attempt[0] Lessons
• Batch loading large data sets into C* is silly
• … and expensive
• … and using Informatica to do it is SLOW
• Kafka + REST services == unnecessary
• No viable open source C* Hive driver
• DTCS is broken (see CASSANDRA-9666)
• Schema-less == bad:
  • Must parse JSON to extract key data
  • Expensive to analyze by event type
  • Cannot tune by event type
Attempt[1] Architecture
[Architecture diagram: streaming sources (events, 3rd party) feed Amazon SQS into a custom ingestion pipeline for stream processing; batch sources (3rd party, other DBs, S3) arrive via ETL into an S3 data lake for long-term raw storage; a short-term storage and big data processing layer serves consumers (operational analytics, business analytics, executive dashboards, data discovery, data science, 3rd-party system integration) through a SQL data access layer]
Attempt[1] Data Model
• Each event type gets its own table
• Tables individually tuned based on workload
• Schema applied at ingestion:
  • We're reading everything anyway
  • Makes subsequent analysis much easier
  • Allows us to filter junk early
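A hedged sketch of "schema applied at ingestion": parse each schema-less JSON payload once, coerce the fields that the event type's table expects, and drop junk early. The field names and types below are invented for illustration, not TWC's actual schemas.

```python
import json

# Illustrative per-event-type schema (assumption): column name -> type coercion.
PAGEVIEW_SCHEMA = {"userid": str, "url": str, "duration_ms": int}

def apply_schema(raw, schema):
    """Return a typed row for the per-event-type table, or None if the
    payload is junk (unparseable, missing, or uncoercible fields)."""
    try:
        payload = json.loads(raw)
        return {field: cast(payload[field]) for field, cast in schema.items()}
    except (ValueError, KeyError, TypeError):
        return None  # filtered out before it ever reaches storage

row = apply_schema('{"userid": "u1", "url": "/radar", "duration_ms": "450"}',
                   PAGEVIEW_SCHEMA)
assert row == {"userid": "u1", "url": "/radar", "duration_ms": 450}
assert apply_schema('{"userid": "u1"}', PAGEVIEW_SCHEMA) is None
```

Since the pipeline has to deserialize every event anyway, validating and typing at this point costs little and spares every downstream Spark job from re-parsing JSON.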
Attempt[1] tl;dr
• Use C* for streaming data
  • Rolling time window (TTL depends on type)
  • Real-time access to events
  • Data locality makes Spark jobs faster
Attempt[1] tl;dr
• Everything else in S3
  • Batch data loads (mostly logs)
  • Daily C* backups
  • Stored as Parquet
  • Cheap, scalable long-term storage
  • Easy access from Spark
  • Easy to share internally & externally
  • Open source Hive support
Attempt[1] tl;dr
• Kafka replaced by SQS:
  • Scalable & reliable
  • Already fronted by a RESTful interface
  • Nearly free to operate (nothing to manage)
  • Robust security model
  • One queue per event type/platform
  • Built-in monitoring
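A sketch of the "one queue per event type/platform" convention: derive the SQS queue name from the event itself, so producers and the ingestion pipeline agree without shared configuration. The naming pattern is invented here for illustration; TWC's actual convention isn't stated in the talk.

```python
def queue_name(event_type, platform):
    """Map an event to its dedicated SQS queue, e.g.
    ('pageview', 'iOS') -> 'events-pageview-ios' (hypothetical pattern)."""
    return "events-{}-{}".format(event_type.lower(), platform.lower())

assert queue_name("pageview", "iOS") == "events-pageview-ios"
```

With boto3 a producer would then send to that queue (roughly `sqs.send_message(QueueUrl=..., MessageBody=...)`); per-type queues also give per-type CloudWatch metrics for free, which is the "built-in monitoring" point above.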
Attempt[1] tl;dr
• DTCS replaced by Time-Window Compaction
  • Developed by Jeff Jirsa at CrowdStrike
  • Groups similar timestamps/expirations together
  • Simply deletes expired sstables
  • Improved stability & throughput
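A toy illustration of the Time-Window Compaction idea (not the real TWCS implementation): sstables are bucketed by the time window of the data they hold, so once every cell in a window has passed its TTL the whole sstable can be dropped outright instead of being tombstone-compacted. The one-day window is an arbitrary example; TWCS makes it configurable.

```python
from collections import defaultdict

WINDOW_S = 24 * 60 * 60  # 1-day windows (assumption)

def group_by_window(cells):
    """Bucket writes by time window, the way TWCS buckets sstables."""
    buckets = defaultdict(list)
    for ts, value in cells:
        buckets[ts // WINDOW_S].append(value)
    return dict(buckets)

def droppable(sstables, now, ttl):
    """A whole sstable can be deleted once its newest cell has expired."""
    return [s["name"] for s in sstables if s["max_ts"] + ttl < now]

tables = [{"name": "a", "max_ts": 0}, {"name": "b", "max_ts": 500_000}]
assert droppable(tables, now=700_000, ttl=600_000) == ["a"]
```

Dropping expired sstables wholesale avoids the rewrite churn that made DTCS unstable on this workload.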
Fine Print
• Use C* >= 2.1.8
  • CASSANDRA-9637 - fixes Spark input split computation
  • CASSANDRA-9549 - fixes memory leak
  • CASSANDRA-9436 - exposes rpc/broadcast addresses for Spark/cloud environments
• Version incompatibilities abound (check sbt file for Spark-Cassandra connector)
Fine Print
• Two main Spark clusters:
  • Co-located with C* for heavy analysis
    • Predictable load
    • Efficient C* access
  • Self-serve in same DC but not co-located
    • Unpredictable load
    • Favors mining S3 data
    • Isolated from production jobs
Data Modeling
Partitioning
• Opposite strategy from "normal" C* modeling
• Model for good parallelism
• … not for single-partition queries
• Avoid shuffling for most cases
• Shuffles occur when NOT grouping by partition key
• Partition for your most common grouping
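A toy illustration of why grouping by the partition key avoids a shuffle: rows arrive from C* already clustered partition-by-partition, so a streaming group-by (like the Spark-Cassandra connector's `spanBy`) needs no network exchange, whereas grouping by any other column forces a repartition. The data below is invented.

```python
from itertools import groupby

# Rows as read from C*: (partition_key, clustering_ts, value),
# delivered partition-by-partition in read order.
rows = [
    (("bucket1", "pageview"), 1, "a"),
    (("bucket1", "pageview"), 2, "b"),
    (("bucket1", "click"), 1, "c"),
]

# groupby only needs adjacency, which the read order already guarantees,
# so counting per partition never moves data between workers.
counts = {key: len(list(grp)) for key, grp in groupby(rows, key=lambda r: r[0])}
assert counts[("bucket1", "pageview")] == 2
```

This is why the slide says to partition for your most common grouping: the grouping you run most often becomes shuffle-free.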
Secondary Indexes
• Useful for C*-level filtering
• Reduces Spark workload and RAM footprint
• Low cardinality is still the rule
Secondary Indexes (Client Access)
Secondary Indexes (with Spark)
Full-text Indexes
• Enabled via Stratio-Lucene custom index (https://github.com/Stratio/cassandra-lucene-index)
• Great for C*-side filters
• Same access pattern as secondary indexes
Full-text Indexes

CREATE CUSTOM INDEX email_index ON emails(lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
  'refresh_seconds' : '1',
  'schema' : '{
    fields : {
      id      : {type : "integer"},
      user    : {type : "string"},
      subject : {type : "text", analyzer : "english"},
      body    : {type : "text", analyzer : "english"},
      time    : {type : "date", pattern : "yyyy-MM-dd hh:mm:ss"}
    }
  }'
};

SELECT * FROM emails WHERE lucene = '{
  filter : {type : "range", field : "time", lower : "2015-05-26 20:29:59"},
  query  : {type : "phrase", field : "subject", values : ["test"]}
}';

SELECT * FROM emails WHERE lucene = '{
  filter : {type : "range", field : "time", lower : "2015-05-26 18:29:59"},
  query  : {type : "fuzzy", field : "subject", value : "thingy", max_edits : 1}
}';
WIDE ROWS

Caution: Wide Rows
• It only takes one to ruin your day
• Monitor cfstats for max partition bytes
• Use toppartitions to find hot keys
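A hedged sketch of the monitoring advice: scan `nodetool cfstats` output for tables whose largest compacted partition exceeds a threshold. The sample text mimics the cfstats format, and the 100 MB limit is an arbitrary example, not a recommendation from the talk.

```python
SAMPLE = """\
Table: events_pageview
Compacted partition maximum bytes: 268650950
Table: events_click
Compacted partition maximum bytes: 52066354
"""

def wide_partitions(cfstats_text, limit_bytes=100 * 1024 * 1024):
    """Return table names whose max partition size exceeds limit_bytes."""
    flagged, table = [], None
    for line in cfstats_text.splitlines():
        line = line.strip()
        if line.startswith("Table:"):
            table = line.split(":", 1)[1].strip()
        elif line.startswith("Compacted partition maximum bytes:"):
            if int(line.split(":")[1]) > limit_bytes:
                flagged.append(table)
    return flagged

assert wide_partitions(SAMPLE) == ["events_pageview"]
```

Once a table is flagged, `nodetool toppartitions` on that table points at the specific hot keys.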
Avoid Nulls
• Nulls are deletes
• Deletes create tombstones
• Don't write nulls!
• Beware of nulls in prepared statements
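A sketch of the "don't write nulls" rule: strip unset columns from the row before building the INSERT, so no null (and hence no tombstone) is ever written. Binding None into a prepared statement would delete the cell, which is the trap the last bullet warns about. The statement builder is illustrative, not a real driver API.

```python
def insert_statement(table, row):
    """Build an INSERT that names only the columns actually present,
    instead of binding None (a tombstone) for missing ones."""
    cols = {k: v for k, v in row.items() if v is not None}
    placeholders = ", ".join("?" for _ in cols)
    cql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(cols), placeholders)
    return cql, list(cols.values())

cql, params = insert_statement(
    "events", {"eventid": "e1", "userid": None, "platform": "ios"})
assert cql == "INSERT INTO events (eventid, platform) VALUES (?, ?)"
```

(Later drivers and protocol versions also let you mark a bound value as UNSET rather than null, which achieves the same thing with a single prepared statement.)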
Data Exploration

Data Warehouse Paradigm - Old
Ingest → Model → Transform → Design → Visualize

Data Warehouse Paradigm - New
Ingest → Explore → Analyze → Deploy → Visualize
Visualization
• Critical to understanding your data
• Reduced time to visualization
• … from >1 month to minutes (!!)
• Waterfall to agile
Zeppelin
• Open source Spark notebook
• Interpreters for Scala, Python, Spark SQL, CQL, Hive, Shell, & more
• Data visualizations
• Scheduled jobs
Future Work
FiloDB
• Low latency time-series aggregations using Spark + Cassandra/in-memory storage
• Space efficient – similar to Parquet
• SQL queries using ODBC/JDBC
Direct to Parquet
• Stream to Parquet directly
• Eliminate interim storage
• Currently in R&D
We're Hiring!

Robbie Strickland
[email protected]