
Page 1

Learning with Big Data: What is it all about…

IEEE CIS Summer School 2019

Data Analytics and Stream Processing:

Tools, Techniques and Applications

IIIT Allahabad

Page 2

Learning with Big Data

• ML algorithms were designed for smaller datasets, with the assumption that

• the entire dataset can fit in memory.

• the entire dataset is available for processing at the time of training.

• Big Data break these assumptions, rendering traditional algorithms unusable or greatly impeding their performance.

L'Heureux, Alexandra, et al. "Machine learning with big data: Challenges and approaches." IEEE Access (2017).

Page 3

Learning with Big Data

• Big Data are described by their dimensions:
• volume, velocity, variety and veracity
• value is often added as a 5th V

Page 4

Volume

• The amount, size, and scale of the data

• Size

• vertically by the number of records or samples in a dataset

• horizontally by the number of features or attributes it contains

Gandomi, Amir, and Murtaza Haider. "Beyond the hype: Big data concepts, methods, and analytics." International Journal of Information Management 35.2 (2015): 137-144.

Page 5

Processing Performance

• As the scale becomes large, even trivial operations can become costly.

• SVM: training time complexity of O(m^3) and space complexity of O(m^2)

• PCA: O(mn^2 + n^3)

• Logistic Regression: O(mn^2 + n^3)

• (m = number of samples, n = number of features)

• The time needed to perform the computations grows steeply (e.g., cubically for SVM training) with increasing data size.

• Performance becomes dependent upon the data structure used to store and move data.

Ng, Andrew Y., et al. "MapReduce for machine learning on multicore." NIPS (2006).

Page 6

Curse of Modularity

• Algorithms rely on the assumption that the data being processed can be held entirely in memory or in a single file on a disk.

• When data size leads to the failure of this premise, entire families of algorithms are affected.

• Solution: MapReduce (may not be useful for all ML algorithms)

Kumar, K. Ashwin, et al. "Hone: Scaling down Hadoop on shared-memory systems." Proceedings of the VLDB Endowment 6.12 (2013): 1354-1357.

Page 7

Class Imbalance

• As datasets grow larger, the assumption that the data are uniformly distributed across all classes is often broken.

• The performance of an ML algorithm can be negatively affected when datasets contain data from classes with very different probabilities of occurrence.

• Especially prominent when some classes are represented by a large number of samples and some by very few

Ghanavati, Mojgan, et al. "An effective integrated method for learning big imbalanced data." Big Data (BigData Congress), 2014 IEEE International Congress on. IEEE, 2014.

Page 8

Curse of Big Dimensionality

• "Big Dimensionality": the explosion and variety of features.

• Dimensionality affects processing performance

• time & space complexity of ML algorithms is closely related to data dimensionality

• Feature Engineering

• As the dataset grows, both vertically and horizontally, it becomes more difficult to create new, highly relevant features.

• Feature Selection becomes difficult.

Domingos, Pedro. "A few useful things to know about machine learning." Communications of the ACM 55.10 (2012): 78-87.

Page 9

Myriad of Features

Page 10

Bonferroni's Principle

• If one is looking for a specific type of event within a large enough amount of data, the likelihood of finding this event is high, even in purely random data.

• More often than not, these occurrences are bogus

• Preventing false positives becomes important.

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.

Page 11

Variance & Bias

• Generalization error has two components: variance and bias

• Variance describes a learner's tendency to learn random things irrespective of the real signal

• Bias describes a learner's tendency to consistently learn the same wrong thing

• With Big Data, a learner may fit the training set too closely and be unable to generalize adequately to new data.

• Solution:

• Regularization to improve generalization and reduce overfitting

Page 12

Data Heterogeneity & Noisy Data

• Heterogeneity

• Syntactic & semantic

• Noise:

Swan, Melanie. "The quantified self: Fundamental disruption in big data science and biological discovery." Big Data 1.2 (2013): 85-99.

Page 13

Data Availability

• ML approach depends on data availability

• before learning begins, the entire dataset is assumed to be present.

• typically learns from the training set and then performs the learned task.

• To adapt to new information, algorithms must support incremental learning

Gu, Bin, et al. "Incremental learning for ν-support vector regression." Neural Networks 67 (2015): 140-150.

Page 14

Real-time Processing/Streaming

• need for real-time or near-real-time processing of fast-arriving data.

Neumeyer, Leonardo, et al. "S4: Distributed stream computing platform." Data Mining Workshops (ICDMW), 2010 IEEE International Conference on. IEEE, 2010.

Page 15

Concept Drift

• CD: changes in the conditional distribution of the target output given the input, while the distribution of the input itself may remain unchanged

• Big Data are non-stationary

• new data are arriving continuously.

• It cannot be determined whether the current data follow the same distribution as future data.

Gama, João, et al. "A survey on concept drift adaptation." ACM Computing Surveys (CSUR) 46.4 (2014): 44.

Page 16

i.i.d. Random Variables

• The i.i.d. assumption requires data to arrive in random order, while many datasets have a pre-existing non-random order.

• Solution:

• randomize the data before applying the algorithms

• Problems

• Big Data are fast and continuous.

• Not realistic to randomize a dataset that is still incomplete

• Not possible to wait for all the data to arrive.

Ghahramani, Zoubin. "Probabilistic machine learning and artificial intelligence." Nature 521.7553 (2015): 452-459.

Page 17

Veracity and Data Uncertainty

• Veracity

• reliability of the data forming a dataset

• inherent unreliability of data sources

• Data are being gathered about various aspects in different ways

• Methods used to gather data can introduce uncertainty and impact the veracity of a dataset.

Gandomi, Amir, and Murtaza Haider. "Beyond the hype: Big data concepts, methods, and analytics." International Journal of Information Management 35.2 (2015): 137-144.
Cao, Nan, et al. "Socialhelix: visual analysis of sentiment divergence in social media." Journal of Visualization 18.2 (2015): 221-235.

Page 18

Algorithm Modifications for Big Data

Page 19

Big data vary in shape. These call for different approaches.

Wide Data
• Thousands / millions of variables, hundreds of samples
• We have too many variables; prone to overfitting
• Need to remove variables, or regularize, or both
• Methods: screening and FDR, lasso, SVM, stepwise selection

Tall Data
• Tens / hundreds of variables, thousands / millions of samples
• Sometimes simple (linear) models don't suffice
• We have enough samples to fit nonlinear models with many interactions, and not too many variables
• Good automatic methods for doing this
• Methods: GLM, random forests, boosting, deep learning

Page 20

Big data vary in shape. These call for different approaches.

Tall and Wide Data

Thousands / Millions of Variables, Millions to Billions of Samples

Tricks of the Trade

• Exploit sparsity
• Random projections / hashing
• Variable screening
• Subsample rows
• Divide and recombine
• Case / control sampling
• MapReduce
• ADMM (divide and conquer)

Page 21

Examples of Big Data Learning Problems

• Click-through rate: based on the search term, knowledge of this user (IP address), and the webpage about to be served, what is the probability that each of the 30 candidate ads in an ad campaign would be clicked if placed in the right-hand panel?

• Logistic regression with billions of training observations. Each ad exchange does this, then bids on its top candidates and, if it wins, serves the ad, all within 10 ms!

Page 22

Big Data Learning Problems

• Recommender systems: Amazon online store, online DVD rentals, Kindle books…

• Based on my past experiences, and those of others like me, what else would I choose?

Page 23

ML approaches & the challenges they address

Page 24

Infinite Data

Filtering data streams

Queries on streams

Page 25

Data Streams

• In many data mining situations, we do not know the entire data set in advance

• Stream Management is important when the input rate is controlled externally:

• Google queries

• Twitter or Facebook status updates

• We can think of the data as infinite and non-stationary

• the distribution changes over time (Concept Drift)

Page 26

The Stream Model

• Input elements enter at a rapid rate, at one or more input ports (i.e., streams)

• We call the elements of the stream tuples

• The system cannot store the entire stream accessibly

Page 27

What is a stream?

• Unbounded data

• Conceptually infinite, ever growing set of data items / events

• Practically continuous stream of data, which needs to be processed / analyzed

• Push model

• Data production and processing is controlled by the source

• Concept of time

• Often need to reason about when data is produced and when processed data should be output

• Time agnostic, processing time, ingestion time, event time

based on Tyler Akidau's great blog on streaming: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

Page 28

The Stream Model

• The question is:

How do you make critical calculations about the stream using a limited amount of (secondary) memory?

Page 29

Example: SGD is a Streaming Algorithm

• Machine Learning

• Allows for modeling problems where we have a continuous stream of data

• We want an algorithm to learn from it and slowly adapt to the changes in data

• Idea: Do slow updates to the model

• SGD (SVM, Perceptron) makes small updates

• So: First train the classifier on the training data.

• Then: For every example from the stream, we slightly update the model (using a small learning rate), as in the sketch below
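Below is a minimal Python sketch of this idea, not taken from the slides: a logistic-regression learner updated one stream element at a time with a small learning rate. The toy stream generator and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_stream(n=1000, d=10):
    """Stand-in for a real data stream: yields (features, label) pairs."""
    true_w = rng.normal(size=d)
    for _ in range(n):
        x = rng.normal(size=d)
        yield x, int(x @ true_w > 0)

def sgd_update(w, x, y, lr=1e-2):
    """One online SGD step on the log-loss of logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))   # predicted P(y = 1)
    return w - lr * (p - y) * x          # small gradient step

w = np.zeros(10)             # in practice: pretrain on a batch first
for x, y in toy_stream():
    w = sgd_update(w, x, y)  # slowly adapt to the arriving stream
```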

Page 30

General Stream Processing Model

(Figure: a stream processor with limited working storage, plus larger archival storage. Several streams enter over time, each composed of elements/tuples, e.g.

. . . 1, 5, 2, 7, 0, 9, 3
. . . a, r, v, t, y, h, b
. . . 0, 0, 1, 0, 1, 1, 0

The processor answers both standing queries and ad-hoc queries, and produces output.)

Page 31

Problems on Data Streams

• Types of queries one wants to answer on a data stream:

• Sampling data from a stream: construct a random sample

• Queries over sliding windows: number of items of type x in the last k elements of the stream

Page 32

Problems on Data Streams

• Types of queries one wants to answer on a data stream:

• Filtering a data stream: select elements with property x from the stream

• Counting distinct elements: number of distinct elements in the last k elements of the stream

• Estimating moments: estimate the avg./std. dev. of the last k elements

• Dimensionality reduction (streaming PCA)

Page 33

Applications

• Mining query streams

• Google wants to know what queries are more frequent today than yesterday

• Mining click streams

• Bing wants to know which of its pages are getting an unusual number of hits in the past hour

• Mining social network news feeds

• E.g., look for trending topics on Twitter, Facebook

Page 34

Applications

• Sensor Networks

• Many sensors feeding into a central controller

• Telephone call records

• Data feeds into customer bills as well as settlements between telephone companies

• IP packets monitored at a switch

• Gather information for optimal routing

• Detect denial-of-service attacks

Page 35

Sampling from a Data Stream: Sampling a Fixed-Size Sample

As the stream grows, the sample is of fixed size

Page 36

Maintaining a fixed-size sample

• Problem: Fixed-size sample

• Suppose we need to maintain a random sample S of size exactly s tuples

• E.g., main memory size constraint

• Why? Don’t know length of stream in advance

• Suppose at time n we have seen n items

• Each item is in the sample S with equal prob. s/n

Page 37

Solution: Fixed Size Sample

Algorithm (a.k.a. Reservoir Sampling)

• Store all the first s elements of the stream to S

• Suppose we have seen n-1 elements, and now the nth element arrives (n > s)

• With probability s/n, keep the nth element, else discard it

• If we picked the nth element, it replaces one of the s elements in the sample S, picked uniformly at random. (A Python sketch follows.)
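A minimal Python sketch of the algorithm above (function and variable names are ours, not from the slides):

```python
import random

def reservoir_sample(stream, s, seed=None):
    """Uniform random sample of exactly s items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)               # store the first s elements
        elif rng.random() < s / n:            # keep the nth element with prob. s/n
            sample[rng.randrange(s)] = item   # it replaces a uniform victim
    return sample

print(reservoir_sample(range(10_000), s=5, seed=42))
```

After processing n items, each item is in the sample with probability exactly s/n, which is what the previous slide requires.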

Page 38

Reservoir Sampling

Page 39

Queries over a (long) Sliding Window

Page 40

Sliding Windows

• A useful model of stream processing is that queries are about a window of length N: the N most recent elements received

• Interesting case: N is so large that the data cannot be stored in memory, or even on disk

• Or, there are so many streams that windows for all cannot be stored

• Amazon example:

• For every product X we keep a 0/1 stream recording whether that product was sold in the nth transaction

• We want to answer queries such as: how many times have we sold X in the last k sales?

Page 41

Sliding Window: 1 Stream

• Sliding window on a single stream:

q w e r t y u i o p a s d f g h j k l z x c v b n m

(Figure: a window of length N = 6 sliding one element at a time over the stream; past on the left, future on the right.)

Page 42


Counting Bits

Problem:

• Given a stream of 0s and 1s

• Be prepared to answer queries of the form

• How many 1s are in the last k bits? where k ≤ N

Obvious solution:

• Store the most recent N bits

• When a new bit comes in, discard the (N+1)st bit (a toy sketch follows the figure)

0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0   (past → future; suppose N = 6)
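For concreteness, a toy version of the obvious solution, assuming N bits fit in memory (names are illustrative):

```python
from collections import deque

class LastNBits:
    """Exact counting by storing the most recent N bits themselves."""

    def __init__(self, N):
        self.window = deque(maxlen=N)   # the oldest bit is discarded automatically

    def add(self, bit):
        self.window.append(bit)

    def count(self, k):
        """Number of 1s among the last k bits (k <= N)."""
        return sum(list(self.window)[-k:])

w = LastNBits(6)
for b in [0, 1, 0, 0, 1, 1, 0, 1, 1]:
    w.add(b)
print(w.count(6))  # 1s in the last 6 bits -> 4
```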

Page 43

Counting Bits

• You cannot get an exact answer without storing the entire window

Real Problem:

• What if we cannot afford to store N bits?

• E.g., we’re processing 1 billion streams and N = 1 billion

• But we are happy with an approximate answer

0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0   (past → future)

Page 44

An attempt: Simple solution

• Q: How many 1s are in the last N bits?

• A simple solution that does not really solve our problem: Uniformity assumption

• Maintain 2 counters:

• S: number of 1s from the beginning of the stream

• Z: number of 0s from the beginning of the stream

• How many 1s are in the last N bits? Estimate: N · S / (S + Z)

• But, what if stream is non-uniform?

• What if distribution changes over time?

0 1 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 0 1 0   (past → future; window of length N)

Page 45

DGIM Method

• DGIM (Datar, Gionis, Indyk, Motwani): a solution that does not assume uniformity

• We store O(log^2 N) bits per stream

• The solution gives an approximate answer, never off by more than 50%

• The error factor can be reduced to any fraction > 0, with a more complicated algorithm and proportionally more stored bits (a Python sketch follows)
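A compact Python sketch of the DGIM bucket structure. This is a readability-first toy: it stores absolute timestamps rather than timestamps mod N, and it maintains the textbook invariant of at most two buckets of each power-of-two size.

```python
class DGIM:
    def __init__(self, N):
        self.N = N          # window length
        self.t = 0          # timestamp of the most recent bit
        self.buckets = []   # (end_timestamp, size), newest first; sizes are powers of 2

    def add(self, bit):
        self.t += 1
        # expire buckets that fell out of the window
        while self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        # restore the invariant: at most two buckets of any one size
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i+1][1] == self.buckets[i+2][1]:
                newer, older = self.buckets[i+1], self.buckets[i+2]
                self.buckets[i+1:i+3] = [(newer[0], newer[1] * 2)]  # merge the two oldest
            else:
                i += 1

    def count(self, k):
        """Approximate number of 1s among the last k bits (k <= N)."""
        sizes = [s for end, s in self.buckets if end > self.t - k]
        return sum(sizes) - sizes[-1] // 2 if sizes else 0  # half the oldest bucket

d = DGIM(N=16)
for b in [0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1]:
    d.add(b)
print(d.count(8))  # approx. number of 1s in the last 8 bits
```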

Page 46

Min-wise Sampling

• For each item, pick a random fraction between 0 and 1

• Store item(s) with the smallest random tag

0.391 0.908 0.291 0.555 0.619 0.273

Each item has the same chance of holding the least tag, so the sample is uniform

Can run on multiple streams separately, then merge
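A small Python sketch of min-wise sampling, keeping the k items with the smallest tags via a max-heap (names are ours, not from the slides):

```python
import heapq
import random

def minwise_sample(stream, k, seed=None):
    """Keep the k items whose random tags are smallest; uniform by symmetry."""
    rng = random.Random(seed)
    heap = []                                  # max-heap via negated tags
    for item in stream:
        tag = rng.random()
        if len(heap) < k:
            heapq.heappush(heap, (-tag, item))
        elif tag < -heap[0][0]:                # beats the largest kept tag
            heapq.heapreplace(heap, (-tag, item))
    return [(-neg_tag, item) for neg_tag, item in heap]

# keeping the tags (not just the items) is what makes the merge across
# multiple streams work: take the k smallest tags over all streams
print(minwise_sample(range(1000), k=3, seed=7))
```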

Page 47


Sketches

• Not every problem can be solved with sampling

• Example: counting how many distinct items in the stream

• If a large fraction of items aren't sampled, we don't know if they are all the same or all different

• Sketching techniques take advantage of the fact that the algorithm can "see" all the data even if it can't "remember" it all

• "Sketch": essentially, a linear transform of the input

• Model the stream as defining a vector; the sketch is the result of multiplying the stream vector by an (implicit) matrix (a linear projection)

Page 48

Count-Min Sketch

• A simple sketch idea that can serve as the basis of many different stream mining tasks

• Join aggregates, range queries, moments, …

1. Model the input stream as a vector A of dimension N
2. Create a small summary as an array of size w × d
3. Use d hash functions to map vector entries to [1..w]
4. Works on arrivals-only streams as well as arrivals & departures streams

(Figure: a d × w array CM[i, j], one row per hash function. A Python sketch follows.)
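A minimal arrivals-only count-min sketch in Python, assuming non-negative integer items; the hash family ((a·x + b) mod p) mod w is a standard choice, not something specified on the slide:

```python
import random

class CountMinSketch:
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.p = 2**31 - 1                     # prime larger than the item universe
        self.ab = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(d)]
        self.CM = [[0] * w for _ in range(d)]  # the d x w array CM[i][j]

    def _hash(self, i, x):
        a, b = self.ab[i]
        return ((a * x + b) % self.p) % self.w

    def add(self, x, count=1):
        for i in range(self.d):
            self.CM[i][self._hash(i, x)] += count

    def query(self, x):
        """Estimated count of x: an overestimate, never an underestimate."""
        return min(self.CM[i][self._hash(i, x)] for i in range(self.d))

cms = CountMinSketch(w=1000, d=5)
for item in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]:
    cms.add(item)
print(cms.query(5))  # close to the true count 3
```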

Page 49

(More) Algorithms for Streams

1. Filtering a data stream: Bloom filters
• Select elements with property x from the stream

2. Counting distinct elements: Flajolet-Martin
• Number of distinct elements in the last k elements of the stream

3. Estimating moments: AMS method
• Estimate the std. dev. of the last k elements

Page 50

Filtering Data Streams

Page 51

Filtering Data Streams

• Each element of the data stream is a tuple

Problem

• Given a list of keys S

• Determine which tuples of stream are in S

Obvious solution: Hash table

• But suppose we do not have enough memory to store all of S in a hash table

• E.g., we might be processing millions of filters on the same stream
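The Bloom filter listed on the algorithms overview addresses exactly this memory constraint. A minimal sketch, using Python's built-in hash seeded k ways as a stand-in for proper independent hash functions (stable only within a single process):

```python
class BloomFilter:
    """Membership testing with false positives but no false negatives."""

    def __init__(self, n_bits, k, seed=0):
        self.n = n_bits
        self.bits = bytearray(n_bits)          # one byte per bit, for clarity
        self.seeds = [seed + i for i in range(k)]

    def _positions(self, item):
        for s in self.seeds:
            yield hash((s, item)) % self.n

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

good = BloomFilter(n_bits=10_000, k=4)
good.add("alice@example.com")
print("alice@example.com" in good)     # True
print("mallory@spam.example" in good)  # False with high probability
```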

Page 52

Applications

Example:

• Email spam filtering

• We know 1 billion “good” email addresses

• If an email comes from one of these, it is NOT spam

• Publish-subscribe systems

• You are collecting lots of messages (news articles)

• People express interest in certain sets of keywords

• Determine whether each message matches user’s interest


Page 53

Counting Distinct Elements

Page 54

Counting Distinct Elements

Problem:

• Data stream consists of a universe of elements chosen from a set of size N

• Maintain a count of the number of distinct elements seen so far

Obvious approach:

• Maintain the set of elements seen so far

• That is, keep a hash table of all the distinct elements seen so far

Page 55

Applications

• How many different words are found among the Web pages being crawled at a site?

• Unusually low or high numbers could indicate artificial pages (spam?)

• How many different Web pages does each customer request in a week?

• How many distinct products have we sold in the last week?

Page 56

Using Small Storage

• Real problem: What if we do not have space to maintain the set of elements seen so far?

• Estimate the count in an unbiased way

• Accept that the count may have a little error, but limit the probability that theerror is large

Page 57

Flajolet-Martin Approach

• Pick a hash function h that maps each of the N elements to at least log2 N bits

• For each stream element a, let r(a) be the number of trailing 0s in h(a)

• r(a) = position of the first 1 counting from the right
• E.g., say h(a) = 12; 12 is 1100 in binary, so r(a) = 2

• Record R = the maximum r(a) seen
• R = max_a r(a), over all items a seen so far

• Estimated number of distinct elements = 2^R (a Python sketch follows)
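A single-hash Python sketch of the estimator above; real deployments combine many hash functions to control the variance, which these slides do not cover:

```python
import hashlib

def trailing_zeros(x):
    """r(a): number of trailing 0s in the binary form of x (0 treated as 0)."""
    r = 0
    while x and x & 1 == 0:
        x >>= 1
        r += 1
    return r

def flajolet_martin(stream):
    """Estimate the number of distinct elements as 2^R."""
    R = 0
    for a in stream:
        h = int.from_bytes(
            hashlib.blake2b(str(a).encode(), digest_size=4).digest(), "big")
        R = max(R, trailing_zeros(h))
    return 2 ** R

print(flajolet_martin([1, 2, 3, 2, 1, 4, 5, 5, 5]))  # rough estimate of 5 distinct
```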

Page 58

Why It Works: Intuition

• h(a) hashes a with equal prob. to any of N values

• Then h(a) is a sequence of log2 N bits, where a 2^(-r) fraction of all a's have a tail of r zeros

• About 50% of a's hash to ***0

• About 25% of a's hash to **00

• So, if the longest tail we saw has r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far

• In other words, it takes about 2^r hashed items before we see one with a zero-suffix of length r

Page 59

Computing Moments

Page 60

Generalization: Moments

• Suppose a stream has elements chosen from a set A of N values

• Let m_i be the number of times value i occurs in the stream

• The kth moment is Σ_{i∈A} (m_i)^k
• (k = 0: the number of distinct elements; k = 1: the length of the stream; k = 2: the "surprise number")

Page 61

AMS Method – 2nd moment

• AMS method works for all moments

• Gives an unbiased estimate

• We pick and keep track of many variables X:

• For each variable X we store X.el and X.val
• X.el corresponds to the item i
• X.val corresponds to the count of item i

• Note this requires a count in main memory, so the number of Xs is limited

• Our goal is to compute S = Σ_i (m_i)^2

[Alon, Matias, and Szegedy]
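A toy two-pass Python version of the AMS second-moment estimator (a true one-pass version picks positions reservoir-style; this one materializes the stream for clarity, and all names are ours):

```python
import random
import statistics

def ams_second_moment(stream, num_vars, seed=0):
    """Unbiased estimate of S = sum_i m_i^2 from randomly placed variables."""
    rng = random.Random(seed)
    items = list(stream)
    n = len(items)
    estimates = []
    for _ in range(num_vars):
        pos = rng.randrange(n)                        # X.el = item at a random position
        el = items[pos]
        val = sum(1 for x in items[pos:] if x == el)  # X.val = count from pos onward
        estimates.append(n * (2 * val - 1))           # per-variable estimate
    return statistics.mean(estimates)

stream = [1, 2, 1, 3, 1, 2, 4]  # m = {1:3, 2:2, 3:1, 4:1}; true S = 9 + 4 + 1 + 1 = 15
print(ams_second_moment(stream, num_vars=200))
```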

Page 62

Counting Itemsets

Page 63

Counting Itemsets

• Given a stream, which items appear more than s times in the window?

• Possible solution: Think of the stream of baskets as one binary stream per item

• 1 = item present; 0 = not present

• Use DGIM to estimate counts of 1s for all items

0 1 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 0 1 0   (per-item 0/1 stream over a window of length N)

Page 64

Extensions

• In principle, you could count frequent pairs or even larger sets the same way

• One stream per itemset

• Drawbacks:

• Only approximate

• Number of itemsets is way too big

Page 65

What we have

Categorization of data storage systems

Page 66

What we have

Categorization of data-processing systems

Page 67

Requirements: Machine Learning for Big Data

• Distributed ML systems on clusters are critical to applying advanced ML algorithms at industrial scale for data analysis.

• ML models that can reach billions of parameters, trained on massive amounts of data.

• A common formalism for data/model parallelism, as guidance for future system development.

• Convergence of Big Data and HPC

• Most importantly:
• Big Data-specific algorithms
• Energy-efficient algorithms

Page 68

Other (more important!!) Requirements

• Novel Data Model

• with increasing scale and heterogeneous data

• Novel Processing Model

• different classes of applications with different resource requirements.

Page 69

Thank you