Learning with Big Data: What is it all about…
IEEE CIS Summer School 2019
Data Analytics and Stream Processing:
Tools, Techniques and Applications
IIIT Allahabad
Learning with Big Data
• ML algorithms were designed for smaller datasets, with the assumption that
• the entire dataset can fit in memory.
• the entire dataset is available for processing at the time of training.
• Big Data break these assumptions, rendering traditional algorithms unusable or greatly impeding their performance.
L'Heureux, Alexandra, et al. "Machine Learning with Big Data: Challenges and Approaches." IEEE Access (2017).
Learning with Big Data
• Big Data are described by their dimensions:
• volume, velocity, variety and veracity
• value is often added as a 5th V
Volume
• The amount, size, and scale of the data
• Size
• vertically by the number of records or samples in a dataset
• horizontally by the number of features or attributes it contains
Gandomi, Amir, and Murtaza Haider. "Beyond the hype: Big data concepts, methods, and analytics." International Journal of Information Management 35.2 (2015): 137-144.
Processing Performance
• As the scale becomes large, even trivial operations can become costly.
• SVM: training time complexity of O(m^3) and a space complexity of O(m^2)
• PCA: O(mn^2 + n^3)
• Logistic Regression: O(mn^2 + n^3)
• The time needed to perform the computations grows steeply (polynomially, for the algorithms above) with increasing data size.
• Performance becomes dependent upon the data structure used to store and move data.
Chu, Cheng-Tao, et al. "Map-Reduce for Machine Learning on Multicore." NIPS (2006).
Curse of Modularity
• Algorithms rely on the assumption that the data being processed can be held entirely in memory or in a single file on a disk.
• When data size leads to the failure of this premise, entire families of algorithms are affected.
• Solution: MapReduce (may not be useful for all ML algorithms); a toy sketch of the pattern follows the citation below.
Kumar, K. Ashwin, et al. "Hone: Scaling down Hadoop on shared-memory systems." Proceedings of the VLDB Endowment 6.12 (2013): 1354-1357.
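To make the pattern concrete, here is a toy map/reduce sketch in Python (not Hadoop's actual API; the chunking and function names are our own). A global mean is assembled from per-chunk partial results, so no single chunk ever needs to exceed memory:

```python
# Minimal sketch of the map/reduce pattern: compute a mean over chunked data.
# Each chunk fits in memory even when the full dataset does not.
from functools import reduce

def map_chunk(chunk):
    """Map: emit a partial (sum, count) for one in-memory chunk."""
    return (sum(chunk), len(chunk))

def reduce_partials(a, b):
    """Reduce: combine two partial results."""
    return (a[0] + b[0], a[1] + b[1])

chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]   # stand-in for distributed blocks
total, count = reduce(reduce_partials, map(map_chunk, chunks))
print(total / count)  # 3.5
```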
Class Imbalance
• As datasets grow larger, the assumption that the data are uniformly distributed across all classes is often broken.
• Performance of an ML algorithm can be negatively affected when datasets contain data from classes with various probabilities of occurrence.
• Especially prominent when some classes are represented by a large number of samples and some by very few.
Ghanavati, Mojgan, et al. "An effective integrated method for learning big imbalanced data." Big Data (BigData Congress), 2014 IEEE International Congress on. IEEE, 2014.
Curse of Big Dimensionality
• "Big Dimensionality": the explosion and variety of features.
• Dimensionality affects processing performance
• time & space complexity of ML algorithms is closely related to data dimensionality
• Feature Engineering
• As the dataset grows, both vertically and horizontally, it becomes more difficult to create new, highly relevant features.
• Feature Selection becomes difficult.
Domingos, Pedro. "A few useful things to know about machine learning." Communications of the ACM 55.10 (2012): 78-87.
Myriad of Features
Bonferroni's Principle
• If one looks for a specific type of event within a large enough amount of data, the likelihood of finding this event is high.
• More often than not, these occurrences are bogus
• Preventing false positives becomes important.
Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.
Variance & Bias
• Generalization error has two components: variance and bias
• Variance describes a learner's tendency to learn random things irrespective of the real signal
• Bias describes a learner's tendency to consistently learn the same wrong thing
• With Big Data, a learner may fit the training set too closely and be unable to generalize adequately to new data.
• Solution:
• Regularization to improve generalization and reduce overfitting
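As a minimal illustration, the sketch below fits ridge (L2) regression in closed form, w = (X^T X + λI)^(-1) X^T y: the penalty λ shrinks the weights, trading a little bias for lower variance. The function name and synthetic data are our own, not from the slides:

```python
# Minimal sketch: L2 (ridge) regularization to reduce overfitting.
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge solution; lam controls the amount of shrinkage."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=100)
print(ridge_fit(X, y, lam=0.1))   # weights shrunk slightly toward zero
```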
Data Heterogeneity & Noisy Data
• Heterogeneity
• Syntactic & semantic
• Noise
Swan, Melanie. "The quantified self: Fundamental disruption in big data science and biological discovery." Big Data 1.2 (2013): 85-99.
Data Availability
• ML approach depends on data availability
• before learning begins, the entire dataset is assumed to be present.
• typically learns from the training set and then performs the learned task.
• To adapt to new information, algorithms must support incremental learning
Gu, Bin, et al. "Incremental learning for ν-support vector regression." Neural Networks 67 (2015): 140-150.
Real-time Processing/Streaming
• need for real-time or near-real-time processing of fast-arriving data.
Neumeyer, Leonardo, et al. "S4: Distributed stream computing platform." Data Mining Workshops (ICDMW), 2010 IEEE International Conference on. IEEE, 2010.
Concept Drift
• CD: changes in the conditional distribution of the target output given the input, while the distribution of the input itself may remain unchanged
• Big Data are non-stationary
• new data are arriving continuously.
• It cannot be determined whether the current data follow the same distribution asfuture data.
Gama, João, et al. "A survey on concept drift adaptation." ACM Computing Surveys (CSUR) 46.4 (2014): 44.
i.i.d. Random Variables
• i.i.d. requires data to be in random order while many datasets have a pre-existing non-random order.
• Solution:
• randomize the data before applying the algorithms
• Problems
• Big Data are fast and continuous.
• Not realistic to randomize a dataset that is still incomplete
• Not possible to wait for all the data to arrive.
Ghahramani, Zoubin. "Probabilistic machine learning and artificial intelligence." Nature 521.7553 (2015): 452-459.
Veracity and Data Uncertainty
• Veracity
• reliability of the data forming a dataset
• inherent unreliability of data sources
• Data are being gathered about various aspects in different ways
• Methods used to gather data can introduce uncertainty and impact the veracity of a dataset.
Gandomi, Amir, and Murtaza Haider. "Beyond the hype: Big data concepts, methods, and analytics." International Journal of Information Management 35.2 (2015): 137-144.
Cao, Nan, et al. "SocialHelix: visual analysis of sentiment divergence in social media." Journal of Visualization 18.2 (2015): 221-235.
Algorithm Modifications for Big Data
Big data vary in shape. These call for different approaches.
Wide Data
• Thousands / millions of variables, hundreds of samples
• Methods: screening and FDR, lasso, SVM, stepwise selection
• We have too many variables; prone to overfitting. Need to remove variables, or regularize, or both.
Tall Data
• Tens / hundreds of variables, thousands / millions of samples
• Methods: GLM, random forests, boosting, deep learning
• Sometimes simple (linear) models don't suffice. We have enough samples to fit nonlinear models with many interactions, and not too many variables. Good automatic methods for doing this.
Big data vary in shape. These call for different approaches.
Tall and Wide Data
• Thousands / millions of variables; millions to billions of samples
Tricks of the Trade
• Exploit sparsity
• Random projections / hashing
• Variable screening
• Subsample rows
• Divide and recombine
• Case/control sampling
• MapReduce
• ADMM (divide and conquer)
Examples of Big Data Learning Problems
• Click-through rate: based on the search term, knowledge of this user (IP address), and the webpage about to be served, what is the probability that each of the 30 candidate ads in an ad campaign would be clicked if placed in the right-hand panel?
• Logistic regression with billions of training observations. Each ad exchange does this, then bids on their top candidates, and if they win, serves the ad, all within 10 ms!
Big Data Learning Problems
• Recommender systems: Amazon online store, online DVD rentals, Kindle books…
• Based on my past experiences, and those of others like me, what else would I choose?
ML approaches & the challenges they address
Infinite Data
Filtering data streams
Queries on streams
Data Streams
• In many data mining situations, we do not know the entire data set in advance
• Stream Management is important when the input rate is controlled externally:
• Google queries
• Twitter or Facebook status updates
• We can think of the data as infinite and non-stationary
• the distribution changes over time (Concept Drift)
The Stream Model
• Input elements enter at a rapid rate, at one or more input ports (i.e., streams)
• We call the elements of the stream tuples
• The system cannot store the entire stream accessibly
What is a stream?
• Unbounded data
• Conceptually infinite, ever growing set of data items / events
• Practically continuous stream of data, which needs to be processed / analyzed
• Push model
• Data production and processing is controlled by the source
• Concept of time
• Often need to reason about when data is produced and when processed data should be output
• Time agnostic, processing time, ingestion time, event time
based on Tyler Akidau's great blog on streaming - https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
The Stream Model
• The question is:
How do you make critical calculations about the stream using a limited amount of (secondary) memory?
Example: SGD is a Streaming Algorithm
• Machine Learning
• Allows for modeling problems where we have a continuous stream of data
• We want an algorithm to learn from it and slowly adapt to the changes in data
• Idea: Do slow updates to the model
• SGD (SVM, Perceptron) makes small updates
• So: First train the classifier on training data.
• Then: For every example from the stream, we slightly update the model (using a small learning rate)
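A minimal sketch of this pattern in Python, assuming a logistic model; the offline pretraining step is elided and the update shown is the standard single-example gradient step (all names are ours):

```python
# Minimal sketch of streaming SGD: small per-example updates to a linear
# (logistic) model, so it slowly adapts to changes in the data.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_update(w, x, y, lr=0.01):
    """One small gradient step for logistic loss on a single example."""
    return w + lr * (y - sigmoid(w @ x)) * x

w = np.zeros(3)   # in practice, weights pretrained on a batch of training data
stream = [(np.array([1.0, 0.2, -0.5]), 1),
          (np.array([1.0, -0.7, 0.3]), 0)]   # (features, label) arriving over time
for x, y in stream:
    w = sgd_update(w, x, y, lr=0.01)   # small learning rate => slow adaptation
```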
General Stream Processing Model
[Diagram: several streams of elements/tuples (e.g., … 1, 5, 2, 7, 0, 9, 3; … a, r, v, t, y, h, b; … 0, 0, 1, 0, 1, 1, 0) enter a processor with limited working storage as time passes. The processor serves standing queries and ad-hoc queries, produces output, and may spill to archival storage.]
Problems on Data Streams
• Types of queries one wants to answer on a data stream:
• Sampling data from a stream: construct a random sample
• Queries over sliding windows: number of items of type x in the last k elements of the stream
Problems on Data Streams
• Types of queries one wants to answer on a data stream:
• Filtering a data stream: select elements with property x from the stream
• Counting distinct elements: number of distinct elements in the last k elements of the stream
• Estimating moments: estimate avg./std. dev. of last k elements
• Performing dimensionality reduction (streaming PCA)
Applications
• Mining query streams
• Google wants to know what queries are more frequent today than yesterday
• Mining click streams
• Bing wants to know which of its pages are getting an unusual number of hits in the past hour
• Mining social network news feeds
• E.g., look for trending topics on Twitter, Facebook
Applications
• Sensor Networks
• Many sensors feeding into a central controller
• Telephone call records
• Data feeds into customer bills as well as settlements between telephone companies
• IP packets monitored at a switch
• Gather information for optimal routing
• Detect denial-of-service attacks
Sampling from a Data Stream: Sampling a fixed-size sample
As the stream grows, the sample is of fixed size
Maintaining a fixed-size sample
• Problem: Fixed-size sample
• Suppose we need to maintain a random sample S of size exactly s tuples
• E.g., main memory size constraint
• Why? Don’t know length of stream in advance
• Suppose at time n we have seen n items
• Each item is in the sample S with equal prob. s/n
Solution: Fixed Size Sample
Algorithm (a.k.a. Reservoir Sampling)
• Store all the first s elements of the stream to S
• Suppose we have seen n-1 elements, and now the nth element arrives (n > s)
• With probability s/n, keep the nth element, else discard it
• If we picked the nth element, then it replaces one of the s elements in the sample S, picked uniformly at random
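A minimal sketch of the algorithm above (`reservoir_sample` is our own name); the invariant is that after n items, every item seen so far is in S with probability s/n:

```python
# Minimal sketch of reservoir sampling over a stream of unknown length.
import random

def reservoir_sample(stream, s):
    S = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            S.append(item)                   # keep the first s elements
        elif random.random() < s / n:        # keep the nth item w.p. s/n
            S[random.randrange(s)] = item    # evict a uniform random victim
    return S

print(reservoir_sample(range(10_000), s=5))  # 5 items, each kept w.p. 5/10000
```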
Queries over a (long) Sliding Window
Sliding Windows
• A useful model of stream processing is that queries are about a window of length N: the N most recent elements received
• Interesting case: N is so large that the data cannot be stored in memory, or even on disk
• Or, there are so many streams that windows for all cannot be stored
• Amazon example:
• For every product X we keep a 0/1 stream of whether that product was sold in the nth transaction
• We want to answer queries: how many times have we sold X in the last k sales?
Sliding Window: 1 Stream
• Sliding window on a single stream:
[Diagram: a window of size N = 6 sliding along the stream q w e r t y u i o p a s d f g h j k l z x c v b n m; past to the left, future to the right.]
Counting Bits
Problem:
• Given a stream of 0s and 1s
• Be prepared to answer queries of the form
• How many 1s are in the last k bits? where k ≤ N
Obvious solution:
• Store the most recent N bits
• When new bit comes in, discard the N+1st bit
[Example: the bit stream 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0, with the window covering the N = 6 most recent bits; past to the left, future to the right.]
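A minimal sketch of the obvious solution above, assuming O(N) memory per stream is affordable (the class name is ours):

```python
# Minimal sketch of the exact solution: store the most recent N bits.
from collections import deque

class ExactBitWindow:
    """O(N) memory; appending past N discards the oldest bit automatically."""
    def __init__(self, N):
        self.window = deque(maxlen=N)

    def add(self, bit):
        self.window.append(bit)

    def count_ones(self, k):
        """How many 1s among the last k bits (k <= N)? O(k) scan."""
        return sum(list(self.window)[-k:])

w = ExactBitWindow(N=6)
for b in [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]:
    w.add(b)
print(w.count_ones(6))  # -> 5
```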
Counting Bits
• You cannot get an exact answer without storing the entire window
Real Problem:
• What if we cannot afford to store N bits?
• E.g., we’re processing 1 billion streams and N = 1 billion
• But we are happy with an approximate answer
An attempt: Simple solution
• Q: How many 1s are in the last N bits?
• A simple solution that does not really solve our problem: Uniformity assumption
• Maintain 2 counters:
• S: number of 1s from the beginning of the stream
• Z: number of 0s from the beginning of the stream
• How many 1s are in the last N bits? Estimate: N · S / (S + Z)
• But, what if stream is non-uniform?
• What if distribution changes over time?
DGIM Method
• DGIM is a solution that does not assume uniformity
• We store O(log^2 N) bits per stream
• Solution gives approximate answer, never off by more than 50%
• The error factor can be reduced to any fraction > 0, with a more complicated algorithm and proportionally more stored bits
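The following is a simplified, illustrative sketch of the DGIM bucket idea: timestamped buckets whose sizes are powers of two, at most two buckets of each size, with the oldest two merged whenever a third appears. A real implementation would store timestamps modulo N and bit-pack the buckets:

```python
# Simplified DGIM sketch: O(log^2 N) bits, answer off by at most ~50%.
class DGIM:
    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = []            # (end_timestamp, size), newest first

    def add(self, bit):
        self.t += 1
        # drop buckets that have slid out of the window
        self.buckets = [(ts, sz) for ts, sz in self.buckets
                        if ts > self.t - self.N]
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            i = 0                    # merge oldest two of any three equal sizes
            while i + 2 < len(self.buckets):
                sizes = [self.buckets[j][1] for j in (i, i + 1, i + 2)]
                if sizes[0] == sizes[1] == sizes[2]:
                    ts, sz = self.buckets[i + 1]
                    self.buckets[i + 1:i + 3] = [(ts, 2 * sz)]
                else:
                    i += 1

    def estimate_ones(self):
        """Approximate count of 1s in the last N bits."""
        if not self.buckets:
            return 0
        *newer, oldest = self.buckets
        return sum(sz for _, sz in newer) + oldest[1] / 2

d = DGIM(N=100)
for i in range(1_000):
    d.add(i % 2)                     # alternating 0/1 stream
print(d.estimate_ones())             # exact answer would be 50
```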
Min-wise Sampling
• For each item, pick a random fraction between 0 and 1
• Store item(s) with the smallest random tag
Example tags: 0.391 0.908 0.291 0.555 0.619 0.273
• Each item has the same chance of receiving the least tag, so the sample is uniform
• Can run on multiple streams separately, then merge
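A minimal sketch, including the merge of samples from two streams (`minwise_one` is our own name):

```python
# Minimal sketch of min-wise sampling: keep the item with the smallest
# uniform random tag; tagged samples from separate streams merge by min.
import random

def minwise_one(stream):
    best_tag, best_item = float("inf"), None
    for item in stream:
        tag = random.random()            # uniform tag in [0, 1)
        if tag < best_tag:
            best_tag, best_item = tag, item
    return best_tag, best_item

# merging two streams: keep whichever sample has the smaller tag
s1, s2 = minwise_one("abcdef"), minwise_one("uvwxyz")
print(min(s1, s2))                       # tuple comparison picks the lower tag
```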
Sketches
• Not every problem can be solved with sampling
• Example: counting how many distinct items in the stream
• If a large fraction of items isn't sampled, we don't know whether the unsampled items are all the same or all different
• These techniques take advantage of the fact that the algorithm can "see" all the data even if it can't "remember" it all
• “Sketch”: essentially, a linear transform of the input
• Model the stream as defining a vector; the sketch is the result of multiplying the stream vector by an (implicit) matrix, i.e., a linear projection
Count-Min Sketch
• A simple sketch idea that can be used as the basis of many different stream-mining tasks
• Join aggregates, range queries, moments, …
1. Model the input stream as a vector A of dimension N
2. Create a small summary as an array CM[i, j] of size w × d
3. Use d hash functions to map vector entries to [1..w]
4. Works on arrivals-only streams as well as arrivals & departures streams
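A minimal, illustrative Count-Min sketch in Python; salting Python's built-in `hash` stands in for the pairwise-independent hash functions a real implementation would use:

```python
# Minimal Count-Min sketch: d rows of w counters; point queries return the
# min over rows, which for arrivals-only streams never underestimates.
import random

class CountMin:
    def __init__(self, w, d, seed=42):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.CM = [[0] * w for _ in range(d)]
        self.salts = [rng.random() for _ in range(d)]   # one "hash" per row

    def _hash(self, item, row):
        return hash((self.salts[row], item)) % self.w

    def add(self, item, count=1):
        for j in range(self.d):
            self.CM[j][self._hash(item, j)] += count

    def query(self, item):
        """Estimated count of item (an overestimate, bounded w.h.p.)."""
        return min(self.CM[j][self._hash(item, j)] for j in range(self.d))

cm = CountMin(w=100, d=5)
for word in "to be or not to be".split():
    cm.add(word)
print(cm.query("to"))   # -> 2 (exact here; collisions inflate the estimate)
```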
(More) Algorithms for Streams
1. Filtering a data stream: Bloom filters
• Select elements with property x from the stream
2. Counting distinct elements: Flajolet-Martin
• Number of distinct elements in the last k elements of the stream
3. Estimating moments: AMS method
• Estimate std. dev. of last k elements
Filtering Data Streams
• Each element of data stream is a tuple
Problem
• Given a list of keys S
• Determine which tuples of the stream are in S
Obvious solution: Hash table
• But suppose we do not have enough memory to store all of S in a hash table
• E.g., we might be processing millions of filters on the same stream
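A Bloom filter addresses exactly this: set membership with no false negatives and a tunable false-positive rate, in far less memory than a hash table of S. A minimal, illustrative sketch, where the salted built-in `hash` again stands in for independent hash functions:

```python
# Minimal Bloom filter sketch: m bits, k hash functions.
import random

class BloomFilter:
    def __init__(self, m, k, seed=7):
        rng = random.Random(seed)
        self.m = m
        self.bits = [False] * m
        self.salts = [rng.random() for _ in range(k)]

    def add(self, key):
        for s in self.salts:
            self.bits[hash((s, key)) % self.m] = True

    def __contains__(self, key):
        # all k bits set => "probably in S"; any bit clear => definitely not
        return all(self.bits[hash((s, key)) % self.m] for s in self.salts)

good = BloomFilter(m=10_000, k=4)
good.add("alice@example.com")                  # hypothetical "good" address
print("alice@example.com" in good)             # True (no false negatives)
print("spammer@example.com" in good)           # almost surely False
```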
Applications
Example:
• Email spam filtering
• We know 1 billion “good” email addresses
• If an email comes from one of these, it is NOT spam
• Publish-subscribe systems
• You are collecting lots of messages (news articles)
• People express interest in certain sets of keywords
• Determine whether each message matches user’s interest
Counting Distinct Elements
Problem:
• Data stream consists of a universe of elements chosen from a set of size N
• Maintain a count of the number of distinct elements seen so far
Obvious approach:
• Maintain the set of elements seen so far
• That is, keep a hash table of all the distinct elements seen so far
Applications
• How many different words are found among the Web pages being crawled at a site?
• Unusually low or high numbers could indicate artificial pages (spam?)
• How many different Web pages does each customer request in a week?
• How many distinct products have we sold in the last week?
Using Small Storage
• Real problem: What if we do not have space to maintain the set of elements seen so far?
• Estimate the count in an unbiased way
• Accept that the count may have a little error, but limit the probability that the error is large
Flajolet-Martin Approach
• Pick a hash function h that maps each of the N elements to at least log2 N bits
• For each stream element a, let r(a) be the number of trailing 0s in h(a)
• r(a) = position of the first 1 counting from the right. E.g., if h(a) = 12, then 12 is 1100 in binary, so r(a) = 2
• Record R = the maximum r(a) seen
• R = max_a r(a), over all the items a seen so far
• Estimated number of distinct elements = 2^R
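A minimal sketch of the estimator; MD5 here is just a convenient stand-in for a well-mixing hash, and a single estimator is only accurate up to a power-of-two factor (practical versions combine many hash functions):

```python
# Minimal Flajolet-Martin sketch: track max trailing zeros R; estimate 2^R.
import hashlib

def trailing_zeros(x):
    if x == 0:
        return 32                    # cap for an (unlikely) all-zero hash
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream):
    R = 0
    for a in stream:
        h = int(hashlib.md5(str(a).encode()).hexdigest(), 16) & 0xFFFFFFFF
        R = max(R, trailing_zeros(h))
    return 2 ** R

# 100k items but only 500 distinct values; estimate ~500 up to a power of 2
print(fm_estimate(i % 500 for i in range(100_000)))
```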
Why It Works: Intuition
• h(a) hashes a with equal prob. to any of N values
• Then h(a) is a sequence of log2 N bits, where a 2^-r fraction of all a's have a tail of r zeros
• About 50% of a's hash to ***0
• About 25% of a's hash to **00
• So, if the longest tail seen is r = 2 (i.e., some item hash ends in *100), we have probably seen about 4 distinct items so far
• It takes about 2^r items hashed before we see one with a zero-suffix of length r
Computing Moments
Generalization: Moments
• Suppose a stream has elements chosen from a set A of N values
• Let m_i be the number of times value i occurs in the stream
• The k-th moment is Σ_{i∈A} (m_i)^k
• E.g., the 2nd moment Σ_i (m_i)^2 (the "surprise number") measures how uneven the distribution of values is
AMS Method – 2nd moment
• AMS method works for all moments
• Gives an unbiased estimate
• We pick and keep track of many variables X:
• For each variable X we store X.el and X.val
• X.el corresponds to the item i
• X.val corresponds to the count of item i
• Note this requires a count in main memory, so the number of Xs is limited
• Our goal is to compute S = Σ_i (m_i)^2
[Alon, Matias, and Szegedy]
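A minimal sketch of the AMS estimator for S, under the simplifying assumption that the stream length n is known in advance so sample positions can be drawn up front; variable names follow the X.el / X.val convention above:

```python
# Minimal AMS sketch for the 2nd moment: each variable X picks a uniform
# random position, X.el is the item there, X.val counts it from then on,
# and n * (2 * X.val - 1) is an unbiased estimate of S = sum_i m_i^2.
import random
from collections import Counter

def ams_second_moment(stream, num_vars=100):
    n = len(stream)
    positions = sorted(random.randrange(n) for _ in range(num_vars))
    variables = []                          # [X.el, X.val] pairs
    next_pos = 0
    for t, item in enumerate(stream):
        while next_pos < len(positions) and positions[next_pos] == t:
            variables.append([item, 0])     # X.el = item at a random position
            next_pos += 1
        for v in variables:
            if v[0] == item:
                v[1] += 1                   # X.val: count of X.el so far
    estimates = [n * (2 * val - 1) for _, val in variables]
    return sum(estimates) / len(estimates)  # average to reduce variance

stream = [random.choice("abcde") for _ in range(10_000)]
exact = sum(c * c for c in Counter(stream).values())   # S = sum_i m_i^2
print(ams_second_moment(stream), exact)
```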
Counting Itemsets
• Given a stream, which items appear more than s times in the window?
• Possible solution: Think of the stream of baskets as one binary stream per item
• 1 = item present; 0 = not present
• Use DGIM to estimate counts of 1s for all items
Extensions
• In principle, you could count frequent pairs or even larger sets the same way
• One stream per itemset
• Drawbacks:
• Only approximate
• Number of itemsets is way too big
What we have
Categorization of data storage systems
What we have
Categorization of data-processing systems
Requirements: Machine Learning for Big Data
• Distributed ML systems on clusters are critical to applying advanced ML algorithms at industrial scale for data analysis.
• ML models that can reach billions of parameters, trained on massive amounts of data.
• A common formalism for data/model parallelism as guidance for future system development.
• Convergence of Big Data and HPC
• Most importantly:
• Big-Data-specific algorithms
• energy-efficient algorithms
Other (more important!!) Requirements
• Novel Data Model
• with increasing scale and heterogeneous data
• Novel Processing Model
• different classes of applications with different resource requirements.
Thank you