
Page 1: Big Data, Computation and Statistics

Big Data, Computation and Statistics

Michael I. Jordan

February 23, 2012


Page 2: Big Data, Computation and Statistics

What Is the Big Data Phenomenon?

• Big Science is generating massive datasets to be used both for classical testing of theories and for exploratory science

• Measurement of human activity, particularly online activity, is generating massive datasets that can be used, e.g., for personalization and for creating markets

• Sensor networks are becoming pervasive


Page 7: Big Data, Computation and Statistics

What Is the Big Data Problem?

• Computer science studies the management of resources, such as time and space and energy
• Data has not been viewed as a resource, but as a “workload”
• The fundamental issue is that data now needs to be viewed as a resource
  – the data resource combines with other resources to yield timely, cost-effective, high-quality decisions and inferences
• Just as with time or space, it should be the case (to first order) that the more of the data resource the better
  – is that true in our current state of knowledge?


Page 9: Big Data, Computation and Statistics

• No, for two main reasons:
  – query complexity grows faster than the number of data points
    • the more rows in a table, the more columns
    • the more columns, the more hypotheses that can be considered
    • indeed, the number of hypotheses grows exponentially in the number of columns
    • so, the more data the greater the chance that random fluctuations look like signal (e.g., more false positives; see the sketch below)
  – the more data the less likely a sophisticated algorithm will run in an acceptable time frame
    • and then we have to back off to cheaper algorithms that may be more error-prone
    • or we can subsample, but this requires knowing the statistical value of each data point, which we generally don’t know a priori
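To make the false-positive point concrete, here is a minimal simulation sketch; the table sizes and the 5% significance threshold are illustrative assumptions. Every column is pure noise, so every rejection is a false positive, and their count grows in proportion to the number of columns (hypotheses) tested.

```python
# Sketch: spurious "signal" grows with the number of hypotheses tested.
# Pure-noise data, so every rejection at level alpha is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_rows, alpha = 1000, 0.05

for n_cols in [10, 100, 1000, 10000]:
    X = rng.standard_normal((n_rows, n_cols))      # no real signal anywhere
    # One-sample t-test per column: H0 says the column mean is zero (true here).
    _, pvals = stats.ttest_1samp(X, popmean=0.0, axis=0)
    false_positives = int((pvals < alpha).sum())
    print(f"{n_cols:6d} columns -> ~{false_positives} false positives")
```

With roughly 5% of columns rejected by chance alone, the number of spurious discoveries scales directly with the number of hypotheses considered.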


Page 12: Big Data, Computation and Statistics

A Formulation of the Problem

• Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound)
  – this is far from being achieved in the current state of the literature!
  – a solution will blend computer science and statistics, and fundamentally change both fields
• I believe that it is an achievable goal, but one that will require a major, sustained effort


Page 18: Big Data, Computation and Statistics

An Example

• The bootstrap is a generic statistical method for computing error bars (and other measures of risk) for statistical estimates
• It requires resampling the data with replacement a few hundred times, and computing the estimator independently on each resampled data set
  – seemingly a wonderful connection with modern multi-core and cloud computing architectures
• But if we have a terabyte of data, each resampled data set is 632 gigabytes…
  – and error bars computed on subsamples have the wrong scale
• So the classical bootstrap is dead in the water for massive data
  – this issue hasn’t been considered before in the statistics literature

Page 19: Big Data, Computation and Statistics

The Big Data Bootstrap

Ariel Kleiner, Purnamrita Sarkar, Ameet Talwalkar, and Michael I. Jordan

University of California, Berkeley

Page 20: Big Data, Computation and Statistics

Assessing the Quality of Inference

Observe data $X_1, \ldots, X_n$.

Form a parameter estimate $\hat{\theta}_n = \theta(X_1, \ldots, X_n)$.

Want to compute an assessment $\xi$ of the quality of our estimate $\hat{\theta}_n$

(e.g., a confidence region)

Page 21: Big Data, Computation and Statistics

The Goal

A procedure for quantifying estimator quality which is:

• accurate
• automatic
• scalable


Page 24: Big Data, Computation and Statistics

The Unachievable Frequentist Ideal

Ideally, we would (as simulated in the sketch below):

① Observe many independent datasets of size n.
② Compute $\hat{\theta}_n$ on each.
③ Compute $\xi$ based on these multiple realizations of $\hat{\theta}_n$.

But, we only observe one dataset of size n.
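To make the ideal concrete, here is a minimal simulation sketch; the Gamma population and the sample-mean estimator are illustrative assumptions, not part of the slides.

```python
# Sketch of the (unachievable) frequentist ideal: when we can draw from the
# population itself, we can see the true sampling variability of the estimator.
import numpy as np

rng = np.random.default_rng(0)
n, n_datasets = 10_000, 500

# 1. Observe many independent datasets of size n (only possible in simulation).
estimates = np.array([
    rng.gamma(shape=2.0, scale=3.0, size=n).mean()   # 2. compute theta_hat_n on each
    for _ in range(n_datasets)
])

# 3. Compute xi from the multiple realizations, e.g. a 95% confidence interval width.
lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"true sampling spread of theta_hat_n: CI width ~ {hi - lo:.4f}")
```

In practice we see only one dataset of size n, so the bootstrap (below) approximates this procedure by resampling from the observed data.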

Page 25: Big Data, Computation and Statistics

The Underlying Population

Page 26: Big Data, Computation and Statistics

Sampling

Page 27: Big Data, Computation and Statistics

Approximation

Page 28: Big Data, Computation and Statistics

Pretend The Sample Is The Population

Page 29: Big Data, Computation and Statistics

The Bootstrap

Use the observed data to simulate multiple datasets of size n:

① Repeatedly resample n points with replacement from the original dataset of size n.
② Compute $\hat{\theta}^{*}_n$ on each resample.
③ Compute $\xi$ based on these multiple realizations of $\hat{\theta}^{*}_n$ as our estimate of $\xi$ for $\hat{\theta}_n$.

(Efron, 1979)
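A minimal sketch of this three-step procedure; the dataset, the sample-mean estimator, and B = 200 resamples are illustrative assumptions.

```python
# Bootstrap sketch: resample n points with replacement, recompute the estimator,
# and use the spread of the replicates as error bars for the original estimate.
import numpy as np

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=3.0, size=10_000)   # stand-in for the observed X_1..X_n
theta_hat = data.mean()                               # the point estimate theta_hat_n

B = 200                                               # "a few hundred" resamples
replicates = np.empty(B)
for i in range(B):
    resample = rng.choice(data, size=data.size, replace=True)   # step 1
    replicates[i] = resample.mean()                              # step 2: theta_hat*_n
# Step 3: xi from the replicates, e.g. a standard error and a 95% percentile interval.
se = replicates.std(ddof=1)
ci = np.percentile(replicates, [2.5, 97.5])
print(f"estimate {theta_hat:.3f}, bootstrap SE {se:.3f}, 95% CI {ci}")
```

Each replicate requires touching a resample of the full dataset, which is exactly the computational issue raised on the next slide.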

Page 30: Big Data, Computation and Statistics

The Bootstrap: Computational Issues

• Expected number of distinct points in a bootstrap resample is ~ 0.632n.
• Resources required to compute $\hat{\theta}^{*}_n$ generally scale in the number of distinct data points.
  – this is true of many commonly used algorithms (e.g., SVM, logistic regression, linear regression, kernel methods, general M-estimators, etc.)
  – use a weighted representation of resampled datasets to avoid physical data replication
  – example: if the original dataset has size 1 TB, then expect a resample to have size ~ 632 GB
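One way to realize the weighted-representation idea is to draw multinomial counts over the original points and pass them to a weight-aware estimator, rather than materializing the resampled rows. A sketch, assuming a least-squares estimator that can absorb per-point weights:

```python
# Sketch: represent a bootstrap resample by per-row multinomial counts (weights)
# instead of physically replicating rows; weighted least squares absorbs the weights.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)

# Multinomial counts: how many times each original row appears in the resample.
counts = rng.multinomial(n, np.full(n, 1.0 / n))   # sums to n; ~63.2% of rows get count > 0

# Weighted least squares with the counts as weights: no physical row duplication.
sw = np.sqrt(counts)
beta_star = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
print("one bootstrap replicate of the coefficients:", beta_star)
```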

Page 31: Big Data, Computation and Statistics

Subsampling

(Politis, Romano & Wolf, 1999)


Page 34: Big Data, Computation and Statistics

Subsampling

Compute $\hat{\theta}$ only on smaller subsets of the data of size b(n) < n, and analytically correct our estimates of $\xi$:

① Repeatedly subsample b(n) < n points without replacement from the original dataset of size n.
② Compute $\hat{\theta}_{b(n)}$ on each subsample.
③ Compute $\xi$ based on these multiple realizations of $\hat{\theta}_{b(n)}$.
④ Analytically correct to produce the final estimate of $\xi$ for $\hat{\theta}_n$.

Much more favorable computational profile than the bootstrap.
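A sketch of steps ① through ④, assuming a √n-consistent estimator so that the analytic correction in step ④ is a simple rescaling of the interval width by √(b/n); the dataset, the sample-mean estimator, and the choice b(n) = n^0.6 are illustrative assumptions.

```python
# Subsampling sketch: estimate CI width on subsamples of size b < n, then rescale
# by sqrt(b/n), the analytic correction appropriate for a sqrt(n)-rate estimator.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
data = rng.gamma(shape=2.0, scale=3.0, size=n)       # stand-in for the full dataset

b = int(n ** 0.6)                                    # subsample size b(n) < n
S = 100                                              # number of subsamples
estimates_b = np.empty(S)
for s in range(S):
    sub = rng.choice(data, size=b, replace=False)    # step 1: subsample without replacement
    estimates_b[s] = sub.mean()                      # step 2: theta_hat_{b(n)}

# Step 3: xi on the subsampled scale, e.g. a 95% interval width.
theta_hat = data.mean()
width_b = np.percentile(estimates_b, 97.5) - np.percentile(estimates_b, 2.5)

# Step 4: analytic correction from scale b to scale n (assumes sqrt(n) convergence).
width_n = width_b * np.sqrt(b / n)
print(f"corrected 95% CI: [{theta_hat - width_n / 2:.3f}, {theta_hat + width_n / 2:.3f}]")
```

The correction requires knowing the estimator's convergence rate, which is one reason subsampling is less automatic than the bootstrap.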

Page 35: Big Data, Computation and Statistics

Empirical Results: Bootstrap and Subsampling

• Multivariate linear regression with d = 100 and n = 50,000 on synthetic data.
• x coordinates sampled independently from StudentT(3).
• $y = w^T x + \epsilon$, where $w \in \mathbb{R}^d$ is a fixed weight vector and $\epsilon$ is Gaussian noise.
• Estimate $\hat{\theta}_n = \hat{w}_n \in \mathbb{R}^d$ via least squares.
• Compute a marginal confidence interval for each component of $\hat{w}_n$ and assess accuracy via relative mean (across components) absolute deviation from the true confidence interval size.
• For subsampling, use $b(n) = n^{\gamma}$ for various values of $\gamma$.
• Similar results are obtained with Normal and Gamma data-generating distributions, as well as if we estimate a misspecified model.
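A sketch of this data-generating setup; the unit noise scale, the random weight vector, and the seed are illustrative assumptions that the slides do not specify.

```python
# Sketch of the synthetic regression setup: d = 100, n = 50,000,
# x ~ StudentT(3) per coordinate, y = w^T x + Gaussian noise, OLS estimate of w.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50_000, 100

X = rng.standard_t(df=3, size=(n, d))          # heavy-tailed covariates
w = rng.standard_normal(d)                     # fixed but unknown weight vector
y = X @ w + rng.standard_normal(n)             # Gaussian noise with unit scale (assumed)

w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # theta_hat_n = w_hat_n via least squares
print("max coefficient error:", np.abs(w_hat - w).max())
```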

Page 36: Big Data, Computation and Statistics

Empirical Results: Bootstrap and Subsampling

Page 37: Big Data, Computation and Statistics

Subsampling



Page 39: Big Data, Computation and Statistics

Approximation

Page 40: Big Data, Computation and Statistics

Pretend the Subsample is the Population

Page 41: Big Data, Computation and Statistics

Pretend the Subsample is the Population

• And bootstrap the subsample!
• This means resampling n times with replacement, not b times as in subsampling

Page 42: Big Data, Computation and Statistics

The Bag of Little Bootstraps (BLB)

① Repeatedly subsample b(n) < n points without replacement from the original dataset of size n.
② For each subsample do:
   a) Repeatedly resample n points with replacement from the subsample.
   b) Compute $\hat{\theta}^{*}_n$ on each resample.
   c) Compute an estimate of $\xi$ based on these multiple resampled realizations of $\hat{\theta}^{*}_n$.
③ We now have one estimate of $\xi$ per subsample. Output their average as our final estimate of $\xi$ for $\hat{\theta}_n$.
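A minimal sketch of the BLB steps above; the dataset, the sample-mean estimator, and the choices s = 10 subsamples, r = 100 inner resamples, and b(n) = n^0.6 are illustrative assumptions. The inner resamples of size n are represented by multinomial weights, so only b distinct points are ever touched.

```python
# Bag of Little Bootstraps sketch: outer subsamples of size b, weighted inner
# resamples of size n, then average the per-subsample quality estimates.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
data = rng.gamma(shape=2.0, scale=3.0, size=n)        # stand-in dataset
b = int(n ** 0.6)                                     # subsample size b(n)
s, r = 10, 100                                        # outer subsamples, inner resamples

xi_per_subsample = []
for _ in range(s):                                    # step 1: subsample without replacement
    sub = rng.choice(data, size=b, replace=False)
    replicates = np.empty(r)
    for j in range(r):                                # step 2a: resample n points, as weights
        counts = rng.multinomial(n, np.full(b, 1.0 / b))
        replicates[j] = np.average(sub, weights=counts)   # step 2b: weighted theta_hat*_n
    # Step 2c: xi for this subsample, e.g. a 95% percentile-interval width.
    lo, hi = np.percentile(replicates, [2.5, 97.5])
    xi_per_subsample.append(hi - lo)

# Step 3: average the per-subsample estimates of xi.
print(f"BLB estimate of 95% CI width for the mean: {np.mean(xi_per_subsample):.4f}")
```

Because each inner replicate is a weighted computation over at most b distinct points, the working set stays near the subsample size rather than the full dataset size.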

Page 43: Big Data, Computation and Statistics

The Bag of Little Bootstraps (BLB)

Page 44: Big Data, Computation and Statistics

Bag of Little Bootstraps (BLB): Computational Considerations

A key point:

• Resources required to compute $\hat{\theta}^{*}_n$ generally scale in the number of distinct data points.
• This is true of many commonly used estimation algorithms (e.g., SVM, logistic regression, linear regression, kernel methods, general M-estimators, etc.).
• Use a weighted representation of resampled datasets to avoid physical data replication.

Example: if the original dataset has size 1 TB with each data point 1 MB, and we take $b(n) = n^{0.6}$, then expect

• subsampled datasets to have size ~ 4 GB
• resampled datasets to have size ~ 4 GB

(in contrast, bootstrap resamples have size ~ 632 GB)
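The sizes quoted above follow from a few lines of arithmetic (a sketch that only restates the slide's numbers):

```python
# Size arithmetic behind the slide: 1 TB of data at 1 MB per data point.
n = 1_000_000                      # 1 TB / 1 MB per point
point_mb = 1

b = round(n ** 0.6)                # subsample size b(n) = n^0.6, about 3,981 points
blb_gb = b * point_mb / 1_000      # subsampled (and weighted-resampled) size, ~4 GB

distinct = 0.632 * n               # distinct points in a classical bootstrap resample
boot_gb = distinct * point_mb / 1_000   # ~632 GB

print(f"BLB working set ~ {blb_gb:.1f} GB; bootstrap resample ~ {boot_gb:.0f} GB")
```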

Page 45: Big Data, Computation and Statistics

Empirical Results: Bag of Little Bootstraps (BLB)

Page 46: Big Data, Computation and Statistics

BLB: Theoretical Results
Higher-Order Correctness

Then:

Page 47: Big Data, Computation and Statistics

BLB: Theoretical Results

BLB is asymptotically consistent and higher-order correct (like the bootstrap), under essentially the same conditions that have been used in prior analysis of the bootstrap.

Theorem (asymptotic consistency): Under standard assumptions (particularly that $\theta$ is Hadamard differentiable and $\xi$ is continuous), the output of BLB converges to the population value of $\xi$ as $n, b \to \infty$.

Page 48: Big Data, Computation and Statistics

BLB: Theoretical Results
Higher-Order Correctness

Assume:

• $\hat{\theta}$ is a studentized statistic.
• $\xi(Q_n(P))$, the population value of $\xi$ for $\hat{\theta}_n$, can be written as
  $\xi(Q_n(P)) = z + \frac{p_1}{\sqrt{n}} + \frac{p_2}{n} + \cdots$
  where the $p_k$ are polynomials in population moments.
• The empirical version of $\xi$ based on resamples of size n from a single subsample of size b can also be written as
  $z + \frac{\hat{p}^{(j)}_1}{\sqrt{n}} + \frac{\hat{p}^{(j)}_2}{n} + \cdots$
  where the $\hat{p}^{(j)}_k$ are polynomials in the empirical moments of subsample j.
• $b \leq n$ and

Page 49: Big Data, Computation and Statistics

BLB: Theoretical Results
Higher-Order Correctness

Also, if BLB’s outer iterations use disjoint chunks of data rather than random subsamples, then

Page 50: Big Data, Computation and Statistics

Conclusions

• An initial foray into the evaluation of estimators for Big Data

• Many further directions to pursue
  – BLB for time series and graphs
  – streaming BLB
  – m out of n / BLB hybrids
  – super-scalability

• Technical Report available at:

www.cs.berkeley.edu/~jordan/publications.html