data science

73
Data Science joaquin vanschoren

Upload: joaquin-vanschoren

Post on 15-Jan-2015

401 views

Category:

Data & Analytics


2 download

DESCRIPTION

Tutorial on data science, what's it like to be a data scientist, big data, the data scientific method, probabilistic algorithms, map-reduce, sensor data analysis, visualization of twitter and foursquare feeds, open source tools (R, Python, NoSQL)

TRANSCRIPT

Page 1: Data science

Data Sciencejoaquin vanschoren

Page 2: Data science
Page 3: Data science

WHAT IS DATA SCIENCE?

Page 4: Data science

Hacking skills

Maths & StatsExpertise

Page 5: Data science

Expertise

!

Maths &

Stats

Hacking skills !

!

!Danger zone!

Machine Learning

Research

Data Science

[Drew Conway]

Page 6: Data science

(data) science officer

Page 7: Data science

Maths &

Stats !

!

Hacking skills

!

!

!

!

Evil

Expertise

danger zone!

!machine learning

!James Bond

villain

data science outside

committee member

identity thief

thesis advisor

office mate

NSA

[Joel Grus]

Page 8: Data science

– David Coallier

“Whenever you read about data science or data analysis, it’s about the ability to store petabytes of data, retrieve that data in nanoseconds, then turn it into a

rainbow with a unicorn dancing on it.”

THE HYPE

– Harvard Business Review“Data Scientist: The Sexiest Job of the 21st Century”

Page 9: Data science

THE REALITY

• You’ll clean a lot of data. A LOT

• A lot of mathematics. Get over it

• Some days will be long. Get more coffee

• Not everything is about Big Data

• Most people don’t care about data

• Spend time finding the right questions

[David Coallier]

Page 10: Data science

BIG DATA THE END OF THEORY?

– Chris Anderson, WIRED

Out with every theory of human behavior, from linguistics to sociology. Who knows why people do what they do? The point is

they do it… With enough data, the numbers speak for themselves.

Page 11: Data science

All models are wrong.

But some are useful.

Page 12: Data science

All models are wrong, and

increasingly you can succeed

without them.

Page 13: Data science

Big data is a step forward.

But our problems are

not lack of access to data, but

understanding them.

Page 14: Data science

Data Scientific Method

[DJ Patil, J Elman]

Page 15: Data science

START WITH A QUESTION

Based on an observation

Page 16: Data science

START WITH A QUESTION

What (just) happened? Why did it happen?

What will/could happen next?

Page 17: Data science

ANALYSE CURRENT DATA

Create an Hypothesis

Page 18: Data science

CREATE FEATURES, EXPERIMENT

Test Hypothesis

Page 19: Data science

ANALYSE RESULTS

Won’t be pretty, repeat

Page 20: Data science

LET DATA FRAME THE CONVERSATION

Data gives you the what Humans give you the why

Page 21: Data science

LET DATA FRAME THE CONVERSATION

Let the dataset change your mindset

Page 22: Data science

CONVERSE• What data is missing? Where can

we get it?

• Automate data collection

• Clean data, then clean it more

• Visualize data: the brain sees

• Merge various sources of information

• Reformulate hypotheses

• Reformulate questions

Page 23: Data science

DATA SCIENCE TOOLS

Page 24: Data science

We can't solve problems at the same level of thinking with which

we've created them…

Page 25: Data science

Probabilistic algorithmsWhen polynomial time is just too slow

Page 26: Data science

[1,54,853,23,4,73,…] Have we seen 73?

min H(1,54,853,…) < H(73) ?min H2(1,54,853,…) < H2(73) ?min H3(1,54,853,…) < H3(73) ?

Page 27: Data science

MINHASHINGBloom filters

Find similar documents, photos, …

Page 28: Data science

MapReduce

Page 29: Data science

map

map

map

split

reduce

reduce

reduce

shuffle (remote read)

read

data

(H

DFS

)

data

(H

DFS

)w

rite

read

read

worker nodes (local) worker nodes (local)

split

1sp

lit 2

split

n

Page 30: Data science

mapper node

reducer node

remote read

Page 31: Data science
Page 32: Data science
Page 33: Data science
Page 34: Data science
Page 35: Data science
Page 36: Data science
Page 37: Data science

Mapper

<a,apple> <a’,slices>

Reducer

Input file

Intermediate file (local)

Output file

Page 38: Data science

<p,pineapple><p’,slices>

<a,apple>

<o,orange>

<a’,slices>

<o’,slices>

Input file Output fileIntermediate file

Page 39: Data science

split 0

split 1

split 2

1 mapper/split 1 reducer/key(set)

shufflesplit

Page 40: Data science

map

map

map

split

reduce

reduce

reduce

shuffle (remote read)

read

data

(H

DFS

)

data

(H

DFS

)w

rite

read

read

worker nodes (local) worker nodes (local)

split

1sp

lit 2

split

n

Page 41: Data science

map

map

map

split

reduce

reduce

shuffle + parallel sort (remote read)

read

data

(H

DFS

)

data

(H

DFS

)w

rite

read

read

split

1sp

lit 2

split

n

master nodeassigns map/reduce jobs

reassigns if nodes fail

Page 42: Data science
Page 43: Data science
Page 44: Data science
Page 45: Data science
Page 46: Data science

Nearest bar

? nearest within distance d?

Input graph (node,label)

Page 47: Data science

Nearest bar

Map ∀ , search graph with radius d < ,{ ,distance} >

Input graph (node,label)

Page 48: Data science

Nearest bar

Map ∀ , search graph < ,{ ,distance} >

Shuffle/ Sort by id

Input graph (node,label)

Page 49: Data science

Nearest bar

Map ∀ , search graph < ,{ ,distance} >

Input graph (node,label)

Shuffle/ Sort by id

Reduce < ,[{ ,distance}, { ,distance}] > -> min()

Output < , > < , > marked graph

Page 50: Data science

Sensor data

Page 51: Data science

Vibration

Strain (longitudinal)

Strain (transverse)

Temperature

Page 52: Data science

NOISE

Page 53: Data science

MATHSConvolution

signal

kernel

convolution

✐✐

“nr3” — 2007/5/1 — 20:53 — page 642 — #664 ✐✐

✐ ✐

642 Chapter 13. Fourier and Spectral Applications

sj0

0

0

N − 1rk

(r*s)j

N − 1

N − 1

Figure 13.1.2. Convolution of discretely sampled functions. Note how the response function for negativetimes is wrapped around and stored at the extreme right end of the array rk .

Example: A response function with r0 D 1 and all other rk’s equal to zerois just the identity filter. Convolution of a signal with this response function givesidentically the signal. Another example is the response function with r14 D 1:5 andall other rk’s equal to zero. This produces convolved output that is the input signalmultiplied by 1:5 and delayed by 14 sample intervals.

Evidently, we have just described in words the following definition of discreteconvolution with a response function of finite duration M :

.r ! s/j "M=2X

kD!M=2C1sj!k rk (13.1.1)

If a discrete response function is nonzero only in some range #M=2 < k $ M=2,where M is a sufficiently large even integer, then the response function is called afinite impulse response (FIR), and its duration is M . (Notice that we are defining Mas the number of nonzero values of rk ; these values span a time interval of M # 1sampling times.) In most practical circumstances the case of finite M is the case ofinterest, either because the response really has a finite duration, or because we chooseto truncate it at some point and approximate it by a finite-duration response function.

The discrete convolution theorem is this: If a signal sj is periodic with periodN , so that it is completely determined by theN values s0; : : : ; sN!1, then its discreteconvolution with a response function of finite durationN is a member of the discreteFourier transform pair,

✐✐

“nr3” — 2007/5/1 — 20:53 — page 602 — #624 ✐✐

✐ ✐

602 Chapter 12. Fast Fourier Transform

h.at/” 1

jajH.f

a/ time scaling (12.0.5)

1

jbjh.t

b/” H.bf / frequency scaling (12.0.6)

h.t ! t0/” H.f / e2! if t0 time shifting (12.0.7)

h.t/ e!2! if0t” H.f ! f0/ frequency shifting (12.0.8)

With two functions h.t/ and g.t/, and their corresponding Fourier transformsH.f / andG.f /, we can form two combinations of special interest. The convolutionof the two functions, denoted g " h, is defined by

g " h #Z 1

!1g.!/h.t ! !/ d! (12.0.9)

Note that g " h is a function in the time domain and that g " h D h " g. It turns outthat the function g " h is one member of a simple transform pair,

g " h” G.f /H.f / convolution theorem (12.0.10)

In other words, the Fourier transform of the convolution is just the product of theindividual Fourier transforms.

The correlation of two functions, denoted Corr.g; h/, is defined by

Corr.g; h/ #Z 1

!1g.! C t /h.!/ d! (12.0.11)

The correlation is a function of t , which is called the lag. It therefore lies in the timedomain, and it turns out to be one member of the transform pair:

Corr.g; h/” G.f /H".f / correlation theorem (12.0.12)

[More generally, the second member of the pair is G.f /H.!f /, but we are restrict-ing ourselves to the usual case in which g and h are real functions, so we take theliberty of setting H.!f / D H".f /.] This result shows that multiplying the Fouriertransform of one function by the complex conjugate of the Fourier transform of theother gives the Fourier transform of their correlation. The correlation of a functionwith itself is called its autocorrelation. In this case (12.0.12) becomes the transformpair

Corr.g; g/”jG.f /j2 Wiener-Khinchin theorem (12.0.13)

The total power in a signal is the same whether we compute it in the timedomain or in the frequency domain. This result is known as Parseval’s theorem:

total power #Z 1

!1jh.t/j2 dt D

Z 1

!1jH.f /j2 df (12.0.14)

Frequently one wants to know “how much power” is contained in the frequencyinterval between f and f C df . In such circumstances, one does not usually distin-guish between positive and negative f , but rather regards f as varying from 0 (“zerofrequency” or D.C.) toC1. In such cases, one defines the one-sided power spectraldensity (PSD) of the function h as

Ph.f / # jH.f /j2 C jH.!f /j2 0 $ f <1 (12.0.15)

Page 54: Data science

COMPLEXITY

Page 55: Data science

MATHS, AGAINscale space decomposition

signal

kernel 1

convolution 2

kernel 2

convolution 1

Page 56: Data science

SCALE-SPACE

Page 57: Data science

Baseline (σ64)

Traffic Jams (σ16 -σ64)

Slowdown (σ4 -σ16)Vehicles (σ0 -σ4)

SCALE-SPACE DECOMPOSITION

Page 58: Data science

VOLUMEMapReduce

145 sensors 100Hz 5GB/day 2TB/year 50MB/s disk I/O

Page 59: Data science

Reduce Reduce Reduce Reduce Reduce

Map Map Map MapMapBuild

windowsBuild

windowsBuild

windowsBuild

windowsBuild

windows

Shuffle

Convolute Convolute Convolute Convolute Convolute

CONVOLUTION

Page 60: Data science

CONVOLUTE-ADD

Map (convolute with 0-padding)

Reduce (add)

Add values in overlapping regions

0 00 0

0 0

A A+B B B+C C

Page 61: Data science

SEGMENTATION

• You don’t need 100Hz data for everything

• Approximate signal with linear segments

• Key points: 0-crossings of 1st, 2nd, 3rd derivative

• Maths: derivative of smoothed signal = convolution with derivative of kernel

Page 62: Data science

1st, 2nd,3rd degree derivatives

signalconvolutionsegmentation

Page 63: Data science

SEGMENTATION RESULT

Page 64: Data science

VISUALIZATION

Page 65: Data science

Twitter data

Page 66: Data science

TRACKING NEWS STORIES

Page 67: Data science

Geospacial data

Page 68: Data science

OPEN SOURCE TOOLS

Page 69: Data science

modelling, testing, prototyping

lubridate, zoo: dates, time series reshape2: reshape data ggplot2: visualize data RCurl, RJSONIO: find more data HMisc: miscellaneous DMwR, mlr: machine learning Forecast: time series forecasting garch: time series modelling quantmod: statistical financial trading xts: extensible time series igraph: study networks maptools: read and view maps

R

Page 70: Data science

scientific computing

numpy: linear algebra scipy: optimization, signal/image processing, … scikits: toolkits for scipy scikit-learn: machine learning toolkit statsmodels: advanced statistic modelling matplotlib: plotting NLTK: natural language processing PyBrain: more machine learning PyMC: Bayesian inference Pattern: Web mining NetworkX: Study networks Pandas: easy-to-use data structures

PYTHON

Page 71: Data science

CouchDB

OTHER

D3.js

Page 72: Data science
Page 73: Data science

@joavanschoren

[email protected]