yo. big data. understanding data science in the era of big data

Yo. Big Data understanding data science in the era of big data. Natalino Busa @natalinobusa

Upload: natalino-busa

Post on 12-May-2015

Category:

Science


DESCRIPTION

We talk a lot these days about data science, and how it will pave our paths with beautiful insights and unexpected new relations and connections within our datasets, and even across datasets. But how do we maintain the "Science" part in "Data Science"? After some time working in this field, I appreciate more and more the critical thinking that has characterized the progress of science. Hypothesis, facts, prove and/or disprove the thesis: this is how science has progressed over the past centuries. This method was formalized by Popper, who categorized as non-science all disciplines whose statements cannot be falsified. In other words, if a statement cannot be disproved, we cannot talk of science, since there is no mechanism left to verify the solution or to prove it wrong. When that happens the argument can still be accepted, but not scientifically accepted. Ways of accepting or refuting a non-falsifiable statement are, for instance, based on aesthetic, authority, pragmatic or philosophical considerations. All valid, but not scientific. This applies, for instance, to statements in the disciplines of politics, theology, ethics, etc.

Science has definitely progressed since then. For instance, Bayesian networks and statistical induction are now part of the (data) scientist's arsenal. But no matter how the baseline is set, critical thinking and a rigorous method are definitely helpful in understanding the results produced by science, in particular when it is based on large amounts of data and is computational in nature rather than formula/model driven.

Data Science currently has many different connotations. On one side it praises the "artistry", the genius of laying out connections between disciplines and concepts. This is a truly great aspect of scientists, and creativity is definitely very welcome in all data science profiles. Along with the fun of creating new insights and new data golden eggs, a data scientist has to put up with those annoying criteria of reproducibility, falsifiability and peer review. Sometimes these elements are postponed or left behind in the name of artistry. Granted, it is hard to find metrics and baselines in order to compare models and data science solutions. But the scientific method has proven solid over the centuries: it allows factual discussion between scientists and a selection between models based on objective, agreed criteria.

TRANSCRIPT

Page 1:

Yo. Big Data: understanding data science in the era of big data.

Natalino Busa @natalinobusa

Page 2:

Parallelism Mathematics Programming

Languages Machine Learning Statistics

Big Data Algorithms Cloud Computing

Natalino Busa @natalinobusa

www.natalinobusa.com

Page 3:

Understanding Big Data

Page 4:

What is life?

Page 5:

Why are we?

Page 6:

What is reality?

Page 7:

● (almost) everything is a number

● A few guys came up with some good ideas: Aristotle, Galileo, Popper, Fisher, Pearson, Bayes

What has changed in 2500 years?

Page 8:

Aristotle

Analytical reasoning

induction

deduction

Causality

Ontology

Page 9:

Galileo

Scientific method

experiment

reproducibility

math formulas as models

Page 10:

Popper

Falsification

Exact sciences

Models have to adhere to reality

Statistical inference:

Can we falsify beliefs?

Page 11:

Pearson

Statistical method

Null hypothesis

hypothesis testing

Principal Component Analysis

Correlation Coefficient
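
As an illustration of the last item on this slide, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the sample numbers are assumptions made for this example; they are not taken from the talk.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances up to a common 1/n factor, which cancels out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: the second series is exactly twice the first.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]
print(pearson_r(hours, score))
```

A value close to +1 or -1 indicates a strong linear relation, a value close to 0 a weak one. It says nothing about causality.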

Page 12:

Fisher

Statistical method

Likelihood function

Significance

Distribution

Sufficient statistics
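
One way to make "significance" concrete is an exact permutation test: if the group labels did not matter, how often would a random relabelling produce a difference at least as large as the observed one? The sketch below is an assumption-laden illustration and is not taken from the talk. The function name `permutation_p_value` and the data are hypothetical, and enumerating every relabelling is only feasible for small samples.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference of means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    extreme = 0
    relabellings = 0
    # Enumerate every way of labelling n_a of the pooled values as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        relabellings += 1
        # Small tolerance so that ties with the observed value are counted.
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            extreme += 1
    return extreme / relabellings

# Hypothetical measurements for two clearly separated groups.
treated = [10, 11, 12, 13, 14]
control = [1, 2, 3, 4, 5]
print(permutation_p_value(treated, control))
```

A small p-value means that the null hypothesis "the labels are interchangeable" explains the data poorly. It does not prove the alternative.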

Page 13:

Bayes

Math of belief

belief inference

network of beliefs

hypothesis -> beliefs
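
The step from hypothesis to belief can be sketched with Bayes' rule: a prior belief in a hypothesis H is updated each time new data arrives. The function name `bayes_update` and all probabilities below are assumptions chosen for illustration only.

```python
def bayes_update(prior, p_data_given_h, p_data_given_not_h):
    """Return P(H | data) from the prior P(H) and the two likelihoods."""
    evidence = prior * p_data_given_h + (1 - prior) * p_data_given_not_h
    if evidence == 0:
        raise ValueError("the data has zero probability under both hypotheses")
    return prior * p_data_given_h / evidence

# Hypothetical numbers: start undecided, then observe the same kind of
# evidence three times. It is twice as likely under H as under not-H.
belief = 0.5
for _ in range(3):
    belief = bayes_update(belief, p_data_given_h=0.8, p_data_given_not_h=0.4)
    print(round(belief, 3))
```

Each observation doubles the odds in favour of H, so the belief grows with every update but never reaches certainty.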

Page 14:

What about it?

The shocking truth:

1) we use these concepts every day

2) we have a pre-scientific intuition of these ideas

Page 15:

Why do we bother?

New problems are related to understanding human behavior:

understand needs, desires, dreams, ambitions, cravings, and hopes.

Models have a great side effect: they help us predict the future.

Three weapons:

Processing power: models become faster and can be unrolled for everybody’s profile

Sources: extract more data features, use different data

Context: explore information in order to understand the person

Page 16:

So, why data?

Data is our way of understanding life and reality.

Page 17:

How to deal with it?

Well, it’s quite simple, in a nutshell:

This is what (data) science is about:

data -> hypothesis -> validation

Page 18:

… but what we (mostly) really do is:

Use very little data

-> apply it to pre-formulated beliefs

-> come up with some “gut feeling”

Validate it:

It didn’t work? “Well, I am still right.”

Page 19:

Just buy the damn’d thing.

Page 20:

What’s the problem with it?

● Context

○ we could use some more data

○ insufficient feature engineering

● Add more hypotheses

○ we could explore more scenarios, “pivoting”

○ look at the problem from other angles

○ need data “artistry”

Page 21:
Page 22:

Big data to the rescue?

Big Data is the domain which:

transforms numbers to insights

services to experiences

Page 23:

Big data to the rescue?

by aggregating data sources across users across applications across domains

Page 24:

Big data to the rescue?

in order to provide personalized and relevant results

to the consumer of the given service anywhere, anytime.

Page 25:

Some small headaches

users != consumers

N=all: doesn’t mean you don’t need to clean it

Not all data is born equal

you don’t know what you don’t know

Page 26:

Keep exploring.

Your problem might not be captured by your data features.

Page 27:

Some small headaches

Tough to inspect big data.

Tough to reason about big data.

representativity/bias, support, and segmentation

signal to noise ratio:

look at GFT (Google Flu Trends) for instance

Page 28:
Page 29:

Diminishing returns

Most models are pretty good after a few weeks

the winner added just about 5% more after 1 year, with an ensemble of 300 models

moral: move on, get a new angle

Page 30:

How to compare?

You know the answer (supervised methods)

confusion matrix

ROC (Receiver Operating Characteristic)

Mean Square Error (MSE)

You don’t know the answer (unsupervised methods)

objective function

access ground truth

A/B testing
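
The supervised metrics listed above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the function names, the binary labels with 1 as the positive class, the 0.5 threshold and the sample scores. A full ROC curve is obtained by repeating `roc_point` over many thresholds.

```python
def confusion_matrix(actual, predicted):
    """Count outcomes for binary labels, with 1 as the positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["fn" if a == 1 else "tn"] += 1
    return counts

def roc_point(actual, scores, threshold):
    """One (false positive rate, true positive rate) point of the ROC curve."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    cm = confusion_matrix(actual, predicted)
    # Assumes `actual` contains at least one positive and one negative label.
    tpr = cm["tp"] / (cm["tp"] + cm["fn"])
    fpr = cm["fp"] / (cm["fp"] + cm["tn"])
    return fpr, tpr

def mean_squared_error(actual, predicted):
    """Average squared difference between true and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical classifier scores and true labels.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.7, 0.3, 0.2, 0.1]
print(confusion_matrix(labels, [1 if s >= 0.5 else 0 for s in scores]))
print(roc_point(labels, scores, threshold=0.5))
print(mean_squared_error([3, 5, 7], [2, 5, 9]))
```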

Page 31:

Which is right?

Page 32:

Beware the modeling risks

Overfitting the training data

Not enough “support” in the population

Not enough features available/discovered

Not well defined objective function

Page 33:

Objective functions

“you can please some of the people some of the time”

Page 34:

Objective functions

Many want a slice of the cake when it’s about objective functions

● what the user wants

● what the community wants

● what marketing wants

● what business wants

● what finance/monetization wants

Page 35:

Data scientists

Data artists, data analysts, data scientists, data engineers

confirmatory analysis: domain knowledge, statisticians and data analysts

exploratory analysis: data artists/scientists

operational analysis: data engineers, data technologists

Page 36:

When is data science cool?

Page 37:
Page 38:

What do we look for in the haystack?

outliers

outliers are indicators and/or noise

groups

(Similarity metrics, PCA, SVD)
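
Looking for outliers in the haystack can start very simply. The sketch below flags values that lie far from the median, measured in units of the median absolute deviation (MAD). The function name `mad_outliers`, the cut-off of 5 MADs and the sample readings are assumptions for illustration. Whether a flagged value is an indicator or just noise still needs human judgment.

```python
from statistics import median

def mad_outliers(values, threshold=5.0):
    """Return the values lying more than `threshold` MADs from the median."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # At least half of the values sit on the median: flag the rest.
        return [v for v in values if v != center]
    return [v for v in values if abs(v - center) / mad > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 11, 10, 95]
print(mad_outliers(readings))
```

The median and the MAD are used instead of the mean and the standard deviation because a single extreme value barely moves them.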

Big data as pragmatic approach to:

cheap storage

distributed computing

Page 39:

How to enjoy and compare data science?

enjoy the artistry: appreciate the genius

cross-validation: avoid falling into the trap of over-fitted models

define a baseline: avoid qualitative methods

define a metric, put the models to the bench, compare results
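
The recipe above (baseline, metric, cross-validation, compare) can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the baseline always predicts the training mean, the candidate is a least-squares line, the metric is mean squared error, and the folds are simple interleaved splits. None of these names or choices come from the talk.

```python
def mse(actual, predicted):
    """Metric: mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def fit_mean(xs, ys):
    """Baseline: ignore x and always predict the training mean of y."""
    mean_y = sum(ys) / len(ys)
    def predict(x):
        return mean_y
    return predict

def fit_line(xs, ys):
    """Candidate: ordinary least-squares line y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    def predict(x):
        return slope * x + intercept
    return predict

def cross_validated_mse(xs, ys, fit, k=3):
    """Average held-out MSE over k interleaved folds."""
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        test_x = [xs[i] for i in sorted(held_out)]
        test_y = [ys[i] for i in sorted(held_out)]
        model = fit(train_x, train_y)  # the model never sees the held-out fold
        fold_scores.append(mse(test_y, [model(x) for x in test_x]))
    return sum(fold_scores) / k

# Synthetic data: a straight line plus a small, fixed disturbance.
xs = list(range(12))
noise = [0.5, -0.5, 0.25, -0.25] * 3
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

baseline = cross_validated_mse(xs, ys, fit_mean)
candidate = cross_validated_mse(xs, ys, fit_line)
print(f"baseline MSE:  {baseline:.3f}")
print(f"candidate MSE: {candidate:.3f}")
```

Because both models are scored on data they were not fitted on, with the same metric, the comparison is quantitative rather than a matter of taste.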

Page 40:

Parallelism Mathematics Programming

Languages Machine Learning Statistics

Big Data Algorithms Cloud Computing

Natalino Busa @natalinobusa

www.natalinobusa.com

Thanks! Any questions?