Streaming Data Mining
04/13/2023 Streaming Data Mining 1
Once upon a time.
• Life was easy
– E.g., organizations had only transaction data, and analysts were happy analyzing it.
– There was less competition.
– Customers had fewer options for reviewing products.
• Wait! Web 2.0
– Customers who used to consume data started generating it: tweets, blogs, Facebook comments, reviews, ...
– Another burst came with the mobile era.
• Apps recording customers' locations
• Actions on apps
• Patterns of app use
[Diagram: one server connected to six databases (DB)]
It's All About the Numbers!
58M/day, 500 TB/day, 2.1M GB/hr, 4B views/day
So, it's GOOD to have data, right?
Digging Into the Data
• Analyze to understand the customer.
• Identify patterns
• Machine Learning
• Statistical Model Building
• Natural Language Processing
• …….
Usual Pipeline in Data Mining.
Data of Entire Population
Sample Population
Cleaning and Preprocessing
Training and testing Models
Production Server
Why?
Huge Training Data Set - Volume
• Organizations these days have huge datasets that can be used to train their models.
• But main memory restrictions apply.
– Machine learning algorithms.
– Batch processing.
• Why not sampling?
Streams - Velocity
• Ubiquitous Computing, Mobile Devices, Social Media.
• Potentially of Infinite length
• Usual Strategy – Batch Mode.
Contextual Trends.
• Trending topics on social media
• Weather
• Location
• Demographics
• Market Dynamics
• Jargon alert: Concept Drift
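To make the jargon concrete: concept drift means that what a model learned stops holding as the context changes, which tends to show up as a rising error rate. A minimal sketch, assuming a hypothetical detector that simply compares the error rate of the most recent window with the window before it. The class name, window size, and threshold are illustrative only and are not a named algorithm from these slides.

```python
from collections import deque


class WindowDriftDetector:
    """Hypothetical detector: flags drift when the recent error rate
    exceeds the previous window's error rate by more than `threshold`."""

    def __init__(self, window=50, threshold=0.2):
        self.window = window
        self.threshold = threshold
        # Keeps only the last 2 * window outcomes (1 = wrong, 0 = correct).
        self.errors = deque(maxlen=2 * window)

    def update(self, was_wrong):
        """Record one prediction outcome; return True if drift is suspected."""
        self.errors.append(1 if was_wrong else 0)
        if len(self.errors) < 2 * self.window:
            return False  # not enough history yet
        history = list(self.errors)
        old_rate = sum(history[: self.window]) / self.window
        new_rate = sum(history[self.window:]) / self.window
        return new_rate - old_rate > self.threshold


def first_drift_index(outcomes, window=50, threshold=0.2):
    """Return the index of the first outcome at which drift is flagged, or None."""
    detector = WindowDriftDetector(window, threshold)
    for i, was_wrong in enumerate(outcomes):
        if detector.update(was_wrong):
            return i
    return None
```

Memory use is fixed by the window size, not by the length of the stream.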
What do we want today?
Consume real-time data and extract insights.
Wait! Can I say analyze streams?
Streaming Data Mining!
Philosophy
• Continuous Data Record aka Data Streams
• Bounded Storage
• Single Pass
• Real Time
• Concept Drift
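The bounded-storage and single-pass constraints can be illustrated with a running mean and variance. A minimal sketch using Welford's online update. The class name `RunningStats` is illustrative and not from the slides.

```python
class RunningStats:
    """Single-pass mean and variance in constant memory (Welford's update)."""

    def __init__(self):
        self.n = 0       # number of elements seen so far
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Consume one stream element; the element itself is never stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self.m2 / self.n if self.n else 0.0


def summarize(stream):
    """Read the stream exactly once and return the final statistics."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

Only three numbers are kept no matter how long the stream runs, and an up-to-date answer is available after every element.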
So What… We have Hadoop…
• The big elephant doesn't fit in here.
• Hadoop – Batch Processing
• We need Storm
– Storm is fast: a benchmark clocked it at over a million tuples processed per second per node.
Algorithms.
• Conventional machine learning algorithms were designed for batch processing.
– The algorithm needs to load the entire dataset into memory.
– It then computes the necessary statistics, for example entropy/information gain in decision trees.
• With streams?
– Streams are of infinite length.
– Storing everything, even if you could, would strain the system's memory and cost $$$$.
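Entropy and information gain depend only on class counts, so they can be kept up to date one element at a time instead of being computed over a fully loaded dataset. A minimal sketch for a single categorical attribute. The class `GainCounter` and its method names are hypothetical, chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(class_counts):
    """Shannon entropy (in bits) of a mapping from class label to count."""
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    result = 0.0
    for count in class_counts.values():
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result


class GainCounter:
    """Keeps per-(attribute value, class) counts for one attribute."""

    def __init__(self):
        self.class_counts = Counter()
        self.value_class_counts = defaultdict(Counter)

    def update(self, value, label):
        """Consume one (attribute value, class label) pair from the stream."""
        self.class_counts[label] += 1
        self.value_class_counts[value][label] += 1

    def information_gain(self):
        """Entropy before the split minus weighted entropy after it."""
        total = sum(self.class_counts.values())
        if total == 0:
            return 0.0
        remainder = sum(
            sum(counts.values()) / total * entropy(counts)
            for counts in self.value_class_counts.values()
        )
        return entropy(self.class_counts) - remainder
```

Memory grows with the number of distinct attribute values and classes, not with the number of elements seen.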
Streaming Machine Learning
• When?
– High data volume.
– The rate at which data arrives is high.
– Unbounded: data will keep arriving, and we won't be able to fit it in memory.
• Requirements to adhere to:
– Each input element is processed at most once.
– Space
– Time
– Start predicting from t0
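These requirements are often met with a test-then-train loop: predict on each element first, then learn from it, so every element is touched once and predictions are available from the very first one. A minimal sketch, assuming a deliberately trivial majority-class learner as a stand-in for a real streaming model. All names here are illustrative.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy online learner: predicts the most frequent label seen so far."""

    def __init__(self, default=None):
        self.counts = Counter()
        self.default = default  # answer given at t0, before any training

    def predict(self, x):
        if not self.counts:
            return self.default
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        # Constant work per element; memory is bounded by the number of classes.
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test-then-train: each (x, y) pair is used once to test, then once to train."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:
            correct += 1
        learner.learn(x, y)
        total += 1
    return correct / total if total else 0.0
```

Any model exposing the same `predict` and `learn` steps could be dropped into the loop in place of the toy learner.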
General Flow of Streaming Algorithms
Spam Detection
• Models trained in the past with a traditional data mining strategy become obsolete as spammers find ways around them.
• Solution: VFDT - Hoeffding Tree Stream Classification
• Train the model in a streaming setup.
• When new spam patterns appear, people mark them as spam.
• Use them to retrain the model in real time.
Concept Drift! Win!
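A Hoeffding tree decides when it has seen enough examples to split a node by using the Hoeffding bound: after n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta. A minimal sketch of that split test, assuming information gain as the split measure (so R = log2 of the number of classes). The function names are illustrative, and the tie-breaking rule used in practice is left out.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon such that the true mean is within epsilon of the observed
    mean with probability 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_classes, delta, n):
    """Split once the best attribute beats the runner-up by more than epsilon."""
    value_range = math.log2(n_classes)  # range of information gain
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```

The bound shrinks as more examples arrive, so a node with a clear winner splits early while a close contest waits for more data.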
Answering Today's Big Data Needs
• Streaming Data Mining
– Storm
– MOA
– SAMOA
– Kafka
– ……
Thank You!
Ankit Solanki
Neil Shah