Going Real-Time: Creating Frequently-Updating Datasets for Personalization (Spark Summit East talk by Shriya Arora)
TRANSCRIPT
Shriya Arora
Senior Data Engineer, Personalization Analytics
Streaming datasets for Personalization
What is Netflix’s Mission?
Entertaining you by allowing you to stream content anywhere, anytime
What is Netflix’s Mission?
Entertaining you by allowing you to stream personalized content anywhere, anytime
How much data do we process to have a personalized Netflix for everyone?
● 93M+ active members
● 125M hours/day
● 190 countries with unique catalogs
● 450B unique events/day
● 600+ Kafka topics
Image credit: http://www.bigwisdom.net/
Data Infrastructure
● Raw data (S3/HDFS)
● Stream processing (Spark, Flink, …)
● Processed data (Tables/Indexers)
● Batch processing (Spark/Pig/Hive/MR)
● Application instances
● Keystone Ingestion Pipeline
Rohan’s talk tomorrow @12:20
User watches a video on Netflix
Data flows through Netflix Servers
Our problem: using user plays for feature generation, discovery, clustering, …
Why have data later when you can have it now?
● Business Wins
○ Algorithms can be trained with the latest + greatest data
○ Enhances research
○ Creates opportunity for new types of algorithms
● Technical Wins
○ Save on storage costs
○ Avoid long-running jobs
Source of Discovery / Source of Play
Source-of-Discovery pipeline
● Playback sessions
● Spark Streaming app
● Discovery service
● REST API
● Video Metadata
● Backup
Spark Streaming
● Needs a StreamingContext and a batch duration
● Data received in DStreams, which are easily converted to RDDs
● Supports all fundamental RDD transformations and operations
● Time-based windowing
● Checkpointing support for resilience to failures
● Deployment
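The DStream model above can be sketched as a toy in plain Python (this is not the Spark API; `micro_batches` and `windowed` are hypothetical helpers for illustration): timestamped events are grouped into fixed batch intervals, each micro-batch is processed like a small RDD, and a time-based window is just the union of the last few batches.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Toy DStream: group (timestamp, value) events into fixed-width
    micro-batches, keyed by batch start time. Not the Spark API."""
    batches = defaultdict(list)
    for ts, value in events:
        batch_start = (ts // batch_interval) * batch_interval
        batches[batch_start].append(value)
    return dict(batches)

def windowed(batches, window_length, now):
    """Time-based window: union of the micro-batches that start inside
    the last `window_length` seconds, analogous to DStream.window()."""
    window = []
    for batch_start in sorted(batches):
        if now - window_length <= batch_start < now:
            window.extend(batches[batch_start])
    return window

# Plays arriving over 30 seconds, batched every 10 seconds.
events = [(1, "A"), (4, "B"), (12, "C"), (18, "D"), (25, "E")]
batches = micro_batches(events, batch_interval=10)
# Each micro-batch is processed like a small RDD, e.g. a per-batch count:
counts = {start: len(vals) for start, vals in batches.items()}
# A 20-second window at t=30 covers the batches starting at 10 and 20.
last_20s = windowed(batches, window_length=20, now=30)
```

The point of the toy is the shape of the model: computation happens per batch, so nothing can be observed at a latency finer than the batch interval.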
Performance tuning your Spark streaming application
● Choice of micro-batch interval
○ The most important parameter
● Cluster memory
○ Large batch intervals need more memory
● Parallelism
○ DStreams are naturally partitioned to match the Kafka partitions
○ Repartitioning can increase parallelism at the cost of a shuffle
● # of CPUs
○ <= number of tasks
○ Depends on how computationally intensive your processing is
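The relationship between partitions and CPUs can be seen with back-of-the-envelope arithmetic (a deliberately simplified model assuming uniform task durations; real Spark adds scheduling and shuffle overhead):

```python
import math

def batch_processing_time(num_partitions, num_cores, task_seconds):
    """Rough model: with uniform tasks, a micro-batch runs in
    ceil(partitions / cores) sequential 'waves' of parallel tasks.
    Illustrative only -- not a Spark API."""
    waves = math.ceil(num_partitions / num_cores)
    return waves * task_seconds

# A Kafka topic with 16 partitions yields 16 tasks per batch.
t8 = batch_processing_time(16, num_cores=8, task_seconds=3)    # 2 waves
t16 = batch_processing_time(16, num_cores=16, task_seconds=3)  # 1 wave
t32 = batch_processing_time(16, num_cores=32, task_seconds=3)  # still 1 wave
```

Doubling cores from 8 to 16 halves the batch time here, but going past 16 buys nothing because cores beyond the task count sit idle, which is the "# of CPUs <= number of tasks" guideline above; repartitioning to more partitions is what lets the extra cores help, at the cost of a shuffle.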
Challenges with Spark
● Not a ‘pure’ event-streaming system
○ Minimum latency of one batch interval
○ Unintuitive to design for a stream-only world
● Choice of batch interval is a little too critical
○ Everything can go wrong if you choose it wrong
○ Build-up of scheduling delay can lead to data loss
● Only time-based windowing*
○ Cannot be used to solve session-stitching use cases or trigger-based event aggregations

* I used Spark 1.6.1
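The scheduling-delay failure mode called out above shows up in a small simulation (plain Python, not Spark; a simplified model where batch i becomes ready at i × interval but can only start once the previous batch finishes): when processing fits within the batch interval the delay stays at zero, but as soon as processing time exceeds the interval the delay grows every batch, and the backlog of buffered data is what eventually turns into data loss.

```python
def scheduling_delays(batch_interval, processing_time, num_batches):
    """Simulate micro-batch scheduling: batch i is ready at
    i * batch_interval but starts only after the previous batch
    finishes. Returns the delay (start - ready) for each batch."""
    delays = []
    prev_finish = 0
    for i in range(num_batches):
        ready = i * batch_interval
        start = max(ready, prev_finish)
        delays.append(start - ready)
        prev_finish = start + processing_time
    return delays

# Processing (8s) fits in the interval (10s): delay stays at zero.
healthy = scheduling_delays(batch_interval=10, processing_time=8, num_batches=5)
# Processing (12s) exceeds the interval (10s): delay grows by 2s per
# batch with no bound -- the build-up described above.
falling_behind = scheduling_delays(batch_interval=10, processing_time=12, num_batches=5)
```

This is why the batch interval is "a little too critical": it must be chosen with headroom above the worst-case processing time, not the average.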
Challenges with Streaming
● Pioneer Tax
○ Training ML models on streaming data is new ground
● Increased criticality of outages
○ Batch failures have to be addressed urgently; streaming failures have to be addressed immediately
● Infrastructure investment
○ Monitoring, alerts
○ Non-trivial deployments
There are two kinds of pain...
Questions?
Stay in touch!
@NetflixData