Learning with Big Data: What is it all about…
IEEE CIS Summer School 2019
Data Analytics and Stream Processing:
Tools, Techniques and Applications
IIIT Allahabad
Learning with Big Data
• ML algorithms were designed for smaller datasets, with the assumption that
• the entire dataset can fit in memory.
• the entire dataset is available for processing at the time of training.
• Big Data break these assumptions, rendering traditional algorithms unusable or greatly impeding their performance.
L'Heureux, Alexandra, et al. "Machine Learning with Big Data: Challenges and Approaches." IEEE Access (2017).
Learning with Big Data
• Big Data are described by their dimensions:
• volume, velocity, variety and veracity
• value is often added as a 5th V
Volume
• The amount, size, and scale of the data
• Size
• vertically by the number of records or samples in a dataset
• horizontally by the number of features or attributes it contains
Gandomi, Amir, and Murtaza Haider. "Beyond the hype: Big data concepts, methods, and analytics." International Journal of Information Management 35.2 (2015): 137-144.
Processing Performance
• As the scale becomes large, even trivial operations can become costly.
• SVM: training time complexity of O(m^3) and a space complexity of O(m^2)
• PCA: O(mn^2 + n^3)
• Logistic Regression: O(mn^2 + n^3)
• The time needed to perform the computations grows steeply (polynomially, for the algorithms above) with increasing data size.
• Performance becomes dependent upon the data structure used to store and move data.
Chu, Cheng-Tao, et al. "Map-Reduce for Machine Learning on Multicore." NIPS (2006).
Curse of Modularity
• Algorithms rely on the assumption that the data being processed can be held entirely in memory or in a single file on a disk.
• When data size leads to the failure of this premise, entire families of algorithms are affected.
• Solution: MapReduce (may not be useful for all ML algorithms); a toy sketch of the pattern follows the citation below.
Kumar, K. Ashwin, et al. "Hone: Scaling down Hadoop on shared-memory systems." Proceedings of the VLDB Endowment 6.12 (2013): 1354-1357.
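To make the pattern concrete, here is a toy map/reduce sketch in Python (not Hadoop's actual API; the chunking and function names are our own). A global mean is assembled from per-chunk partial results, so no single chunk ever needs to exceed memory:

```python
# Minimal sketch of the map/reduce pattern: compute a mean over chunked data.
# Each chunk fits in memory even when the full dataset does not.
from functools import reduce

def map_chunk(chunk):
    """Map: emit a partial (sum, count) for one in-memory chunk."""
    return (sum(chunk), len(chunk))

def reduce_partials(a, b):
    """Reduce: combine two partial results."""
    return (a[0] + b[0], a[1] + b[1])

chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]   # stand-in for distributed blocks
total, count = reduce(reduce_partials, map(map_chunk, chunks))
print(total / count)  # 3.5
```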
Class Imbalance
• As datasets grow larger, the assumption that the data are uniformly distributed across all classes is often broken.
• Performance of an ML algorithm can be negatively affected when datasets contain data from classes with various probabilities of occurrence.
• Especially prominent when some classes are represented by a large number of samples and some by very few.
Ghanavati, Mojgan, et al. "An effective integrated method for learning big imbalanced data." Big Data (BigData Congress), 2014 IEEE International Congress on. IEEE, 2014.
Curse of Big Dimensionality
• "Big Dimensionality": the explosion and variety of features.
• Dimensionality affects processing performance
• time & space complexity of ML algorithms is closely related to data dimensionality
• Feature Engineering
• As the dataset grows, both vertically and horizontally, it becomes more difficult to create new, highly relevant features.
• Feature Selection becomes difficult.
Domingos, Pedro. "A few useful things to know about machine learning." Communications of the ACM 55.10 (2012): 78-87.
Myriad of Features
Bonferroni's Principle
• If one looks for a specific type of event within a large enough amount of data, the likelihood of finding this event is high.
• More often than not, these occurrences are bogus
• Preventing false positives becomes important.
Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.
Variance & Bias
• Generalization error has two components: variance and bias
• Variance describes a learner's tendency to learn random things irrespective of the real signal
• Bias describes a learner's tendency to consistently learn the same wrong thing
• With Big Data, a learner may fit the training set too closely and be unable to generalize adequately to new data.
• Solution:
• Regularization to improve generalization and reduce overfitting
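As a minimal illustration, the sketch below fits ridge (L2) regression in closed form, w = (X^T X + λI)^(-1) X^T y: the penalty λ shrinks the weights, trading a little bias for lower variance. The function name and synthetic data are our own, not from the slides:

```python
# Minimal sketch: L2 (ridge) regularization to reduce overfitting.
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge solution; lam controls the amount of shrinkage."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=100)
print(ridge_fit(X, y, lam=0.1))   # weights shrunk slightly toward zero
```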
Data Heterogeneity & Noisy Data
• Heterogeneity
• Syntactic & semantic
• Noise
Swan, Melanie. "The quantified self: Fundamental disruption in big data science and biological discovery." Big Data 1.2 (2013): 85-99.
Data Availability
• ML approach depends on data availability
• before learning begins, the entire dataset is assumed to be present.
• typically learns from the training set and then performs the learned task.
• To adapt to new information, algorithms must support incremental learning
Gu, Bin, et al. "Incremental learning for ν-support vector regression." Neural Networks 67 (2015): 140-150.
Real-time Processing/Streaming
• need for real-time or near-real-time processing of fast-arriving data.
Neumeyer, Leonardo, et al. "S4: Distributed stream computing platform." Data Mining Workshops (ICDMW), 2010 IEEE International Conference on. IEEE, 2010.
Concept Drift
• CD: changes in the conditional distribution of the target output given the input, while the distribution of the input itself may remain unchanged
• Big Data are non-stationary
• new data are arriving continuously.
• It cannot be determined whether the current data follow the same distribution asfuture data.
Gama, João, et al. "A survey on concept drift adaptation." ACM Computing Surveys (CSUR) 46.4 (2014): 44.
i.i.d. Random Variables
• i.i.d. requires data to be in random order while many datasets have a pre-existing non-random order.
• Solution:
• randomize the data before applying the algorithms
• Problems
• Big Data are fast and continuous.
• Not realistic to randomize a dataset that is still incomplete
• Not possible to wait for all the data to arrive.
Ghahramani, Zoubin. "Probabilistic machine learning and artificial intelligence." Nature 521.7553 (2015): 452-459.
Veracity and Data Uncertainty
• Veracity
• reliability of the data forming a dataset
• inherent unreliability of data sources
• Data are being gathered about various aspects in different ways
• Methods used to gather data can introduce uncertainty and impact the veracity of a dataset.
Gandomi, Amir, and Murtaza Haider. "Beyond the hype: Big data concepts, methods, and analytics." International Journal of Information Management 35.2 (2015): 137-144.
Cao, Nan, et al. "SocialHelix: visual analysis of sentiment divergence in social media." Journal of Visualization 18.2 (2015): 221-235.
Algorithm Modifications for Big Data
Big data vary in shape. These call for different approaches.
Wide Data
• Thousands / millions of variables, hundreds of samples
• Methods: screening and FDR, lasso, SVM, stepwise selection
• We have too many variables; prone to overfitting. Need to remove variables, or regularize, or both.
Tall Data
• Tens / hundreds of variables, thousands / millions of samples
• Methods: GLM, random forests, boosting, deep learning
• Sometimes simple (linear) models don't suffice. We have enough samples to fit nonlinear models with many interactions, and not too many variables. Good automatic methods for doing this.
Big data vary in shape. These call for different approaches.
Tall and Wide Data
• Thousands / millions of variables; millions to billions of samples
Tricks of the Trade
• Exploit sparsity
• Random projections / hashing
• Variable screening
• Subsample rows
• Divide and recombine
• Case/control sampling
• MapReduce
• ADMM (divide and conquer)
Examples of Big Data Learning Problems
• Click-through rate: based on the search term, knowledge of this user (IP address), and the webpage about to be served, what is the probability that each of the 30 candidate ads in an ad campaign would be clicked if placed in the right-hand panel?
• Logistic regression with billions of training observations. Each ad exchange does this, then bids on their top candidates, and if they win, serves the ad, all within 10 ms!
Big Data Learning Problems
• Recommender systems: Amazon online store, online DVD rentals, Kindle books…
• Based on my past experiences, and those of others like me, what else would I choose?
ML approaches & the challenges they address
Infinite Data
Filtering data streams
Queries on streams
Data Streams
• In many data mining situations, we do not know the entire data set in advance
• Stream Management is important when the input rate is controlled externally:
• Google queries
• Twitter or Facebook status updates
• We can think of the data as infinite and non-stationary
• the distribution changes over time (Concept Drift)
The Stream Model
• Input elements enter at a rapid rate, at one or more input ports (i.e., streams)
• We call the elements of the stream tuples
• The system cannot store the entire stream accessibly
What is a stream?
• Unbounded data
• Conceptually infinite, ever growing set of data items / events
• Practically continuous stream of data, which needs to be processed / analyzed
• Push model
• Data production and processing is controlled by the source
• Concept of time
• Often need to reason about when data is produced and when processed data should be output
• Time agnostic, processing time, ingestion time, event time
based on Tyler Akidau's great blog on streaming - https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
The Stream Model
• The question is:
How do you make critical calculations about the stream using a limited amount of (secondary) memory?
Example: SGD is a Streaming Algorithm
• Machine Learning
• Allows for modeling problems where we have a continuous stream of data
• We want an algorithm to learn from it and slowly adapt to the changes in data
• Idea: Do slow updates to the model
• SGD (SVM, Perceptron) makes small updates
• So: First train the classifier on training data.
• Then: For every example from the stream, we slightly update the model (using a small learning rate)
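A minimal sketch of this pattern in Python, assuming a logistic model; the offline pretraining step is elided and the update shown is the standard single-example gradient step (all names are ours):

```python
# Minimal sketch of streaming SGD: small per-example updates to a linear
# (logistic) model, so it slowly adapts to changes in the data.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_update(w, x, y, lr=0.01):
    """One small gradient step for logistic loss on a single example."""
    return w + lr * (y - sigmoid(w @ x)) * x

w = np.zeros(3)   # in practice, weights pretrained on a batch of training data
stream = [(np.array([1.0, 0.2, -0.5]), 1),
          (np.array([1.0, -0.7, 0.3]), 0)]   # (features, label) arriving over time
for x, y in stream:
    w = sgd_update(w, x, y, lr=0.01)   # small learning rate => slow adaptation
```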
General Stream Processing Model
[Diagram: several streams of elements/tuples (e.g., … 1, 5, 2, 7, 0, 9, 3; … a, r, v, t, y, h, b; … 0, 0, 1, 0, 1, 1, 0) enter a processor with limited working storage as time passes. The processor serves standing queries and ad-hoc queries, produces output, and may spill to archival storage.]
Problems on Data Streams
• Types of queries one wants to answer on a data stream:
• Sampling data from a stream: construct a random sample
• Queries over sliding windows: number of items of type x in the last k elements of the stream
Problems on Data Streams
• Types of queries one wants to answer on a data stream:
• Filtering a data stream: select elements with property x from the stream
• Counting distinct elements: number of distinct elements in the last k elements of the stream
• Estimating moments: estimate avg./std. dev. of last k elements
• Performing dimensionality reduction (streaming PCA)
Applications
• Mining query streams
• Google wants to know what queries are more frequent today than yesterday
• Mining click streams
• Bing wants to know which of its pages are getting an unusual number of hits in the past hour
• Mining social network news feeds
• E.g., look for trending topics on Twitter, Facebook
Applications
• Sensor Networks
• Many sensors feeding into a central controller
• Telephone call records
• Data feeds into customer bills as well as settlements between telephone companies
• IP packets monitored at a switch
• Gather information for optimal routing
• Detect denial-of-service attacks
Sampling from a Data Stream: Sampling a fixed-size sample
As the stream grows, the sample is of fixed size
Maintaining a fixed-size sample
• Problem: Fixed-size sample
• Suppose we need to maintain a random sample S of size exactly s tuples
• E.g., main memory size constraint
• Why? Don’t know length of stream in advance
• Suppose at time n we have seen n items
• Each item is in the sample S with equal prob. s/n
Solution: Fixed Size Sample
Algorithm (a.k.a. Reservoir Sampling)
• Store all the first s elements of the stream to S
• Suppose we have seen n-1 elements, and now the nth element arrives (n > s)
• With probability s/n, keep the nth element, else discard it
• If we picked the nth element, then it replaces one of the s elements in the sample S, picked uniformly at random
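A minimal sketch of the algorithm above (`reservoir_sample` is our own name); the invariant is that after n items, every item seen so far is in S with probability s/n:

```python
# Minimal sketch of reservoir sampling over a stream of unknown length.
import random

def reservoir_sample(stream, s):
    S = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            S.append(item)                   # keep the first s elements
        elif random.random() < s / n:        # keep the nth item w.p. s/n
            S[random.randrange(s)] = item    # evict a uniform random victim
    return S

print(reservoir_sample(range(10_000), s=5))  # 5 items, each kept w.p. 5/10000
```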
Queries over a (long) Sliding Window
Sliding Windows
• A useful model of stream processing is that queries are about a window of length N: the N most recent elements received
• Interesting case: N is so large that the data cannot be stored in memory, or even on disk
• Or, there are so many streams that windows for all cannot be stored
• Amazon example:
• For every product X we keep a 0/1 stream of whether that product was sold in the nth transaction
• We want to answer queries: how many times have we sold X in the last k sales?
Sliding Window: 1 Stream
• Sliding window on a single stream:
[Diagram: a window of size N = 6 sliding along the stream q w e r t y u i o p a s d f g h j k l z x c v b n m; past to the left, future to the right.]
Counting Bits
Problem:
• Given a stream of 0s and 1s
• Be prepared to answer queries of the form
• How many 1s are in the last k bits? where k ≤ N
Obvious solution:
• Store the most recent N bits
• When new bit comes in, discard the N+1st bit
[Example: the bit stream 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0, with the window covering the N = 6 most recent bits; past to the left, future to the right.]
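A minimal sketch of the obvious solution above, assuming O(N) memory per stream is affordable (the class name is ours):

```python
# Minimal sketch of the exact solution: store the most recent N bits.
from collections import deque

class ExactBitWindow:
    """O(N) memory; appending past N discards the oldest bit automatically."""
    def __init__(self, N):
        self.window = deque(maxlen=N)

    def add(self, bit):
        self.window.append(bit)

    def count_ones(self, k):
        """How many 1s among the last k bits (k <= N)? O(k) scan."""
        return sum(list(self.window)[-k:])

w = ExactBitWindow(N=6)
for b in [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]:
    w.add(b)
print(w.count_ones(6))  # -> 5
```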
Counting Bits
• You cannot get an exact answer without storing the entire window
Real Problem:
• What if we cannot afford to store N bits?
• E.g., we’re processing 1 billion streams and N = 1 billion
• But we are happy with an approximate answer
An attempt: Simple solution
• Q: How many 1s are in the last N bits?
• A simple solution that does not really solve our problem: Uniformity assumption
• Maintain 2 counters:
• S: number of 1s from the beginning of the stream
• Z: number of 0s from the beginning of the stream
• How many 1s are in the last N bits? Estimate: N · S / (S + Z)
• But, what if stream is non-uniform?
• What if distribution changes over time?
DGIM Method
• DGIM is a solution that does not assume uniformity
• We store O(log^2 N) bits per stream
• Solution gives approximate answer, never off by more than 50%
• The error factor can be reduced to any fraction > 0, with a more complicated algorithm and proportionally more stored bits
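The following is a simplified, illustrative sketch of the DGIM bucket idea: timestamped buckets whose sizes are powers of two, at most two buckets of each size, with the oldest two merged whenever a third appears. A real implementation would store timestamps modulo N and bit-pack the buckets:

```python
# Simplified DGIM sketch: O(log^2 N) bits, answer off by at most ~50%.
class DGIM:
    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = []            # (end_timestamp, size), newest first

    def add(self, bit):
        self.t += 1
        # drop buckets that have slid out of the window
        self.buckets = [(ts, sz) for ts, sz in self.buckets
                        if ts > self.t - self.N]
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            i = 0                    # merge oldest two of any three equal sizes
            while i + 2 < len(self.buckets):
                sizes = [self.buckets[j][1] for j in (i, i + 1, i + 2)]
                if sizes[0] == sizes[1] == sizes[2]:
                    ts, sz = self.buckets[i + 1]
                    self.buckets[i + 1:i + 3] = [(ts, 2 * sz)]
                else:
                    i += 1

    def estimate_ones(self):
        """Approximate count of 1s in the last N bits."""
        if not self.buckets:
            return 0
        *newer, oldest = self.buckets
        return sum(sz for _, sz in newer) + oldest[1] / 2

d = DGIM(N=100)
for i in range(1_000):
    d.add(i % 2)                     # alternating 0/1 stream
print(d.estimate_ones())             # exact answer would be 50
```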
Min-wise Sampling
• For each item, pick a random fraction between 0 and 1
• Store item(s) with the smallest random tag
Example tags: 0.391 0.908 0.291 0.555 0.619 0.273
• Each item has the same chance of receiving the least tag, so the sample is uniform
• Can run on multiple streams separately, then merge
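A minimal sketch, including the merge of samples from two streams (`minwise_one` is our own name):

```python
# Minimal sketch of min-wise sampling: keep the item with the smallest
# uniform random tag; tagged samples from separate streams merge by min.
import random

def minwise_one(stream):
    best_tag, best_item = float("inf"), None
    for item in stream:
        tag = random.random()            # uniform tag in [0, 1)
        if tag < best_tag:
            best_tag, best_item = tag, item
    return best_tag, best_item

# merging two streams: keep whichever sample has the smaller tag
s1, s2 = minwise_one("abcdef"), minwise_one("uvwxyz")
print(min(s1, s2))                       # tuple comparison picks the lower tag
```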
Sketches
• Not every problem can be solved with sampling
• Example: counting how many distinct items in the stream
• If a large fraction of items isn't sampled, we don't know whether the unsampled items are all the same or all different
• These techniques take advantage of the fact that the algorithm can "see" all the data even if it can't "remember" it all
• “Sketch”: essentially, a linear transform of the input
• Model the stream as defining a vector; the sketch is the result of multiplying the stream vector by an (implicit) matrix, i.e., a linear projection
Count-Min Sketch
• A simple sketch idea that can be used as the basis of many different stream-mining tasks
• Join aggregates, range queries, moments, …
1. Model the input stream as a vector A of dimension N
2. Create a small summary as an array CM[i, j] of size w × d
3. Use d hash functions to map vector entries to [1..w]
4. Works on arrivals-only streams as well as arrivals & departures streams
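A minimal, illustrative Count-Min sketch in Python; salting Python's built-in `hash` stands in for the pairwise-independent hash functions a real implementation would use:

```python
# Minimal Count-Min sketch: d rows of w counters; point queries return the
# min over rows, which for arrivals-only streams never underestimates.
import random

class CountMin:
    def __init__(self, w, d, seed=42):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.CM = [[0] * w for _ in range(d)]
        self.salts = [rng.random() for _ in range(d)]   # one "hash" per row

    def _hash(self, item, row):
        return hash((self.salts[row], item)) % self.w

    def add(self, item, count=1):
        for j in range(self.d):
            self.CM[j][self._hash(item, j)] += count

    def query(self, item):
        """Estimated count of item (an overestimate, bounded w.h.p.)."""
        return min(self.CM[j][self._hash(item, j)] for j in range(self.d))

cm = CountMin(w=100, d=5)
for word in "to be or not to be".split():
    cm.add(word)
print(cm.query("to"))   # -> 2 (exact here; collisions inflate the estimate)
```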
(More) Algorithms for Streams
1. Filtering a data stream: Bloom filters
• Select elements with property x from the stream
2. Counting distinct elements: Flajolet-Martin
• Number of distinct elements in the last k elements of the stream
3. Estimating moments: AMS method
• Estimate std. dev. of last k elements
Filtering Data Streams
• Each element of data stream is a tuple
Problem
• Given a list of keys S
• Determine which tuples of the stream are in S
Obvious solution: Hash table
• But suppose we do not have enough memory to store all of S in a hash table
• E.g., we might be processing millions of filters on the same stream
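A Bloom filter addresses exactly this: set membership with no false negatives and a tunable false-positive rate, in far less memory than a hash table of S. A minimal, illustrative sketch, where the salted built-in `hash` again stands in for independent hash functions:

```python
# Minimal Bloom filter sketch: m bits, k hash functions.
import random

class BloomFilter:
    def __init__(self, m, k, seed=7):
        rng = random.Random(seed)
        self.m = m
        self.bits = [False] * m
        self.salts = [rng.random() for _ in range(k)]

    def add(self, key):
        for s in self.salts:
            self.bits[hash((s, key)) % self.m] = True

    def __contains__(self, key):
        # all k bits set => "probably in S"; any bit clear => definitely not
        return all(self.bits[hash((s, key)) % self.m] for s in self.salts)

good = BloomFilter(m=10_000, k=4)
good.add("alice@example.com")                  # hypothetical "good" address
print("alice@example.com" in good)             # True (no false negatives)
print("spammer@example.com" in good)           # almost surely False
```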
Applications
Example:
• Email spam filtering
• We know 1 billion “good” email addresses
• If an email comes from one of these, it is NOT spam
• Publish-subscribe systems
• You are collecting lots of messages (news articles)
• People express interest in certain sets of keywords
• Determine whether each message matches user’s interest
Counting Distinct Elements
Problem:
• Data stream consists of a universe of elements chosen from a set of size N
• Maintain a count of the number of distinct elements seen so far
Obvious approach:
• Maintain the set of elements seen so far
• That is, keep a hash table of all the distinct elements seen so far
Applications
• How many different words are found among the Web pages being crawled at a site?
• Unusually low or high numbers could indicate artificial pages (spam?)
• How many different Web pages does each customer request in a week?
• How many distinct products have we sold in the last week?
Using Small Storage
• Real problem: What if we do not have space to maintain the set of elements seen so far?
• Estimate the count in an unbiased way
• Accept that the count may have a little error, but limit the probability that the error is large
Flajolet-Martin Approach
• Pick a hash function h that maps each of the N elements to at least log2 N bits
• For each stream element a, let r(a) be the number of trailing 0s in h(a)
• r(a) = position of the first 1 counting from the right. E.g., if h(a) = 12, then 12 is 1100 in binary, so r(a) = 2
• Record R = the maximum r(a) seen
• R = max_a r(a), over all the items a seen so far
• Estimated number of distinct elements = 2^R
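A minimal sketch of the estimator; MD5 here is just a convenient stand-in for a well-mixing hash, and a single estimator is only accurate up to a power-of-two factor (practical versions combine many hash functions):

```python
# Minimal Flajolet-Martin sketch: track max trailing zeros R; estimate 2^R.
import hashlib

def trailing_zeros(x):
    if x == 0:
        return 32                    # cap for an (unlikely) all-zero hash
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream):
    R = 0
    for a in stream:
        h = int(hashlib.md5(str(a).encode()).hexdigest(), 16) & 0xFFFFFFFF
        R = max(R, trailing_zeros(h))
    return 2 ** R

# 100k items but only 500 distinct values; estimate ~500 up to a power of 2
print(fm_estimate(i % 500 for i in range(100_000)))
```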
Why It Works: Intuition
• h(a) hashes a with equal prob. to any of N values
• Then h(a) is a sequence of log2 N bits, where a 2^-r fraction of all a's have a tail of r zeros
• About 50% of a's hash to ***0
• About 25% of a's hash to **00
• So, if the longest tail seen is r = 2 (i.e., some item hash ends in *100), we have probably seen about 4 distinct items so far
• It takes about 2^r items hashed before we see one with a zero-suffix of length r
Computing Moments
Generalization: Moments
• Suppose a stream has elements chosen from a set A of N values
• Let m_i be the number of times value i occurs in the stream
• The k-th moment is Σ_{i∈A} (m_i)^k
• E.g., the 2nd moment Σ_i (m_i)^2 (the "surprise number") measures how uneven the distribution of values is
AMS Method – 2nd moment
• AMS method works for all moments
• Gives an unbiased estimate
• We pick and keep track of many variables X:
• For each variable X we store X.el and X.val
• X.el corresponds to the item i
• X.val corresponds to the count of item i
• Note this requires a count in main memory, so the number of Xs is limited
• Our goal is to compute S = Σ_i (m_i)^2
[Alon, Matias, and Szegedy]
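A minimal sketch of the AMS estimator for S, under the simplifying assumption that the stream length n is known in advance so sample positions can be drawn up front; variable names follow the X.el / X.val convention above:

```python
# Minimal AMS sketch for the 2nd moment: each variable X picks a uniform
# random position, X.el is the item there, X.val counts it from then on,
# and n * (2 * X.val - 1) is an unbiased estimate of S = sum_i m_i^2.
import random
from collections import Counter

def ams_second_moment(stream, num_vars=100):
    n = len(stream)
    positions = sorted(random.randrange(n) for _ in range(num_vars))
    variables = []                          # [X.el, X.val] pairs
    next_pos = 0
    for t, item in enumerate(stream):
        while next_pos < len(positions) and positions[next_pos] == t:
            variables.append([item, 0])     # X.el = item at a random position
            next_pos += 1
        for v in variables:
            if v[0] == item:
                v[1] += 1                   # X.val: count of X.el so far
    estimates = [n * (2 * val - 1) for _, val in variables]
    return sum(estimates) / len(estimates)  # average to reduce variance

stream = [random.choice("abcde") for _ in range(10_000)]
exact = sum(c * c for c in Counter(stream).values())   # S = sum_i m_i^2
print(ams_second_moment(stream), exact)
```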
Counting Itemsets
• Given a stream, which items appear more than s times in the window?
• Possible solution: Think of the stream of baskets as one binary stream per item
• 1 = item present; 0 = not present
• Use DGIM to estimate counts of 1s for all items
Extensions
• In principle, you could count frequent pairs or even larger sets the same way
• One stream per itemset
• Drawbacks:
• Only approximate
• Number of itemsets is way too big
What we have
Categorization of data storage systems
What we have
Categorization of data-processing systems
Requirements: Machine Learning for Big Data
• Distributed ML systems on clusters are critical to applying advanced ML algorithms at industrial scale for data analysis.
• ML models that can reach billions of parameters, trained on massive amounts of data.
• A common formalism for data/model parallelism as guidance for future system development.
• Convergence of Big Data and HPC
• Most importantly:
• Big-Data-specific algorithms
• energy-efficient algorithms
Other (more important!!) Requirements
• Novel Data Model
• with increasing scale and heterogeneous data
• Novel Processing Model
• different classes of applications with different resource requirements.
Thank you