On the code of data science

Upload: gael-varoquaux

Post on 16-Apr-2017

TRANSCRIPT

Page 1: On the code of data science

On the code of data science

Gaël Varoquaux

import data_science
data_science.discover()

Disclaimer:
A bit of a Python bias
Key ideas carry over

Page 3: On the code of data science

What is data science?

G Varoquaux 2

Page 5: On the code of data science

Data science = statistics + code

Larger datasets

Automating insights

Data science is disruptive when it enables real-time or personalized decisions

Page 6: On the code of data science

1 Writing code for data science

2 Machine learning in Python

3 Outlook: infrastructure

Goal: make you more productive as a data scientist

Page 7: On the code of data science

1 Writing code for data science

Productivity & quality

Page 8: On the code of data science

1 A data-science workflow

Work based on intuition and experimentation

Conjecture ↔ Experiment

→ Interactive & framework-less

Yet needs consolidation, keeping flexibility

Page 9: On the code of data science

1 A design pattern in computational experiments

MVC pattern, from Wikipedia:

Model: manages the data and rules of the application
  Photo-editing software: filters
  Typical web application: database
  For data science: numerical, data-processing, & experimental logic
    → module with functions

View: output representation, possibly several views
  Photo-editing software: canvas
  Typical web application: web pages
  For data science: results, as files (data & plots)
    → post-processing script; CSV & data files

Controller: accepts input and converts it to commands for model and view
  Photo-editing software: tool palettes
  Typical web application: URLs
  For data science: imperative API. Avoid input as files: not expressive
    → script with for loops

A recipe
3 types of files:
• modules
• command scripts
• post-processing scripts
plus CSVs & intermediate data files

Separate computation from analysis / plotting
Code and text (and data) → version control

Decouple steps
Goals: reuse code, mitigate compute time
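
The recipe above can be sketched as follows. This is only an illustration under stated assumptions, not the author's code: the function `simulate`, the parameter grid, and the file name `results.csv` are hypothetical, and the three parts are shown in one file here although the recipe puts them in three separate files (a module, a command script, and a post-processing script).

```python
import csv
import os
import statistics
import tempfile


# --- "module": pure functions holding the numerical logic (hypothetical) ---
def simulate(n_points, scale):
    """Toy computation standing in for an expensive experiment."""
    values = [scale * (i % 7) for i in range(n_points)]
    return statistics.mean(values)


# --- "command script": for loops over parameters, results written to CSV ---
def run_experiments(csv_path):
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["n_points", "scale", "mean"])
        for n_points in (10, 100):
            for scale in (1.0, 2.0):
                writer.writerow([n_points, scale, simulate(n_points, scale)])


# --- "post-processing script": reads the CSV, never recomputes ---
def summarize(csv_path):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return max(rows, key=lambda row: float(row["mean"]))


csv_path = os.path.join(tempfile.mkdtemp(), "results.csv")
run_experiments(csv_path)
best = summarize(csv_path)
print(best["n_points"], best["scale"], best["mean"])
```

Because the post-processing step only reads the CSV, plots and summaries can be redone many times without paying the compute time again.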

Page 13: On the code of data science

1 How I work: progressive consolidation

Start with a script, playing to understand the problem

Identify blocks/operations → move to a function

As they stabilize, move to a module

Clean: delete code & files; you have version control

Why is it hard?
Long compute times make us unadventurous

Know your tools
Refactoring editor
Version control

Page 14: On the code of data science

1 How I work: progressive consolidation

Start with a script, playing to understand the problem

Identify blocks/operations → move to a function
  Use functions
  Obstacle: local scope requires identifying input and output variables
  That's a good thing
  Solution: interactive debugging / understanding inside a function: %debug in IPython
  Functions are the basic reusable abstraction

As they stabilize, move to a module

Clean: delete code & files; you have version control

Why is it hard?
Long compute times make us unadventurous

Know your tools
Refactoring editor
Version control

Page 15: On the code of data science

1 How I work: progressive consolidation

Start with a script, playing to understand the problem

Identify blocks/operations → move to a function

As they stabilize, move to a module
  Modules
  enable sharing between experiments → avoid 1000-line scripts + commented code
  enable testing
  Fast experiments as tests → gives confidence, hence refactorings

Clean: delete code & files; you have version control

Why is it hard?
Long compute times make us unadventurous

Know your tools
Refactoring editor
Version control
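
The idea of "fast experiments as tests" can be sketched like this. It is a minimal, hypothetical example: `fit_line` stands in for a function that has stabilized and moved to a module, and the test is a tiny synthetic experiment whose answer is known in advance.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept (hypothetical module function)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def test_fit_line_recovers_known_model():
    # A tiny synthetic experiment with a known answer: runs in milliseconds.
    rng = random.Random(0)
    xs = [rng.uniform(-1, 1) for _ in range(200)]
    ys = [3.0 * x + 0.5 for x in xs]  # noiseless data: slope 3, intercept 0.5
    slope, intercept = fit_line(xs, ys)
    assert abs(slope - 3.0) < 1e-8
    assert abs(intercept - 0.5) < 1e-8


test_fit_line_recovers_known_model()
```

A test runner such as pytest would collect the `test_` function automatically; calling it directly, as here, keeps the sketch self-contained.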

Page 16: On the code of data science

1 How I work: progressive consolidation

Start with a script, playing to understand the problem

Identify blocks/operations → move to a function

As they stabilize, move to a module

Clean: delete code & files; you have version control
  Attentional load makes it impossible to find or understand things
  Where's Waldo?

Why is it hard?
Long compute times make us unadventurous

Know your tools
Refactoring editor
Version control

Page 18: On the code of data science

1 Caching to tame computation time

The memoize pattern

    import joblib

    mem = joblib.Memory(cachedir='.')
    g = mem.cache(f)
    b = g(a)  # computes b using f
    c = g(a)  # retrieves results from memory

For data science
a & b can be big
a & b arbitrary objects
No change in workflow
Results stored on disk

Page 19: On the code of data science

1 Caching to tame computation time

The memoize pattern

    import joblib

    mem = joblib.Memory(cachedir='.')
    g = mem.cache(f)
    b = g(a)  # computes b using f
    c = g(a)  # retrieves results from memory

For data science
Fits in experimentation loop
Helps decrease re-run times

Black-boxy, persistence only implicit
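
A runnable variant of the memoize pattern, as a sketch. Assumptions: a recent joblib release, where the cache directory is passed as `location` (the slide's `cachedir` is the older name); the function `slow_square` and the `calls` list are hypothetical and only there to make the cache hit visible.

```python
import tempfile

from joblib import Memory

# Recent joblib releases take `location`; the slide's `cachedir` is the older name.
mem = Memory(location=tempfile.mkdtemp(), verbose=0)

calls = []  # records each real execution of the function


def slow_square(x):
    calls.append(x)
    return x ** 2


cached_square = mem.cache(slow_square)

b = cached_square(3)  # computed by slow_square, result stored on disk
c = cached_square(3)  # same argument: result loaded from the cache
print(b, c, len(calls))
```

The caller's code does not change between the first and second call, which is what "no change in workflow" refers to; the price is that persistence stays implicit.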

Page 20: On the code of data science

Adopting software-engineering best practices

Page 21: On the code of data science

1 The ladder of code quality

(in order of increasing cost)

Use linting/code analysis in your editor, seriously

Coding convention, good naming

Version control: use git + GitHub

Code review

Unit testing: if it's not tested, it's broken or soon will be

Make a package: controlled dependencies and compilation...

Page 23: On the code of data science

1 The ladder of code quality

(in order of increasing cost)

Use linting/code analysis in your editor, seriously

Coding convention, good naming

Version control: use git + GitHub

Code review

Unit testing: if it's not tested, it's broken or soon will be

Make a package: controlled dependencies and compilation...

Avoid premature software engineering

Page 24: On the code of data science

1 The ladder of code quality

(in order of increasing cost)

Use linting/code analysis in your editor, seriously

Coding convention, good naming

Version control: use git + GitHub

Code review

Unit testing: if it's not tested, it's broken or soon will be

Make a package: controlled dependencies and compilation...

Avoid premature software engineering

Over versus under engineering
When the goal is generating insights:
Experimentation to develop intuitions → new ideas
As the path becomes clear: consolidation
Heavy engineering too early freezes bad ideas

Page 25: On the code of data science

1 Libraries

(in order of increasing cost)

Use linting/code analysis in your editor, seriously

Coding convention, good naming

Version control: use git + GitHub

Code review

Unit testing: if it's not tested, it's broken or soon will be

Make a package: controlled dependencies and compilation...

A library

Page 26: On the code of data science

2 Machine learning in Python

scikit-learn

Page 27: On the code of data science

2 Tradeoffs

Experimentation ↔ Production

[Figure: scikit-learn logo]

Page 28: On the code of data science

2 My stack for data science

Python, what else?
General-purpose language
Interactive
Easy to read / write

Page 29: On the code of data science

2 My stack for data science

The scientific Python stack

numpy arrays
Mostly a float**
No annotation / structure
Universal across applications
Easily shared across languages

[Figure: a 2D array of numbers]

Page 30: On the code of data science

2 My stack for data science

The scientific Python stack

numpy arrays

Connecting to
pandas: columnar data
scikit-image: images
scipy: numerics, signal processing...

Page 31: On the code of data science

2 Machine learning in a nutshell

Machine learning is about making predictions from data

e.g. learning to distinguish apples from oranges

"Prediction is very difficult, especially about the future." Niels Bohr

Learn as much as possible from the data, but not too much

Page 33: On the code of data science

2 Machine learning in a nutshell

Machine learning is about making predictions from data

e.g. learning to distinguish apples from oranges

"Prediction is very difficult, especially about the future." Niels Bohr

Learn as much as possible from the data, but not too much

[Figure: two plots of y against x]

Which model do you prefer?

Page 34: On the code of data science

2 Machine learning in a nutshell

Machine learning is about making predictions from data

e.g. learning to distinguish apples from oranges

"Prediction is very difficult, especially about the future." Niels Bohr

Learn as much as possible from the data, but not too much

[Figure: two plots of y against x]

Minimizing train error ≠ generalization: overfit

Page 35: On the code of data science

2 Machine learning in a nutshell

Machine learning is about making predictions from data

e.g. learning to distinguish apples from oranges

"Prediction is very difficult, especially about the future." Niels Bohr

Learn as much as possible from the data, but not too much

[Figure: two plots of y against x]

Adapting model complexity to data: regularization
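
The gap between train error and generalization can be illustrated with a small numerical sketch. Everything here is an assumption made for illustration: the synthetic data (a noisy sine curve), the sample sizes, and the use of polynomial degree as a stand-in for model complexity. It is not the example plotted on the slides.

```python
import numpy as np

rng = np.random.RandomState(0)


def make_data(n):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=n)  # hypothetical ground truth + noise
    return x, y


x_train, y_train = make_data(15)
x_test, y_test = make_data(200)


def errors(degree):
    """Fit a polynomial on the train set; return (train, test) mean squared error."""
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_err = np.mean((poly(x_train) - y_train) ** 2)
    test_err = np.mean((poly(x_test) - y_test) ** 2)
    return train_err, test_err


for degree in (1, 3, 10):
    train_err, test_err = errors(degree)
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The train error can only go down as the degree grows, since each model contains the previous one. The error on held-out data typically does not follow, which is why complexity has to be adapted to the amount of data.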

Page 36: On the code of data science

2 Machine learning without learning the machinery

A library, not a program
More expressive and flexible
Easy to include in an ecosystem

let's disrupt something new

As easy as py

    from sklearn import svm
    classifier = svm.SVC()
    classifier.fit(X_train, y_train)
    Y_test = classifier.predict(X_test)
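
The four lines above assume that `X_train`, `y_train`, and `X_test` already exist. A self-contained sketch could look as follows; the choice of the iris dataset and of `train_test_split` is an assumption made here to obtain data, not something taken from the slides.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Assumption: iris is used only as a convenient built-in dataset
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifier = svm.SVC()
classifier.fit(X_train, y_train)      # learn from the training samples
y_pred = classifier.predict(X_test)   # predict on samples not seen during fit
print("accuracy:", (y_pred == y_test).mean())
```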

Page 39: On the code of data science

2 Show me your data: the samples × features matrix

Data input: a 2D numerical array
Requires transforming your problem

With text documents:

[Figure: a matrix of numbers; rows are samples, columns are features]

sklearn.feature_extraction.text.TfidfVectorizer

Page 40: On the code of data science

2 Show me your data: the samples × features matrix

Data input: a 2D numerical array
Requires transforming your problem

With text documents:

[Figure: a matrix of numbers; rows are documents, columns are the words "the", "Python", "performance", "profiling", "module", "is", "code", "can", "a"]

sklearn.feature_extraction.text.TfidfVectorizer
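
A minimal sketch of turning text documents into a samples × features matrix. The three toy documents are invented for illustration, loosely reusing the words shown as column labels on the slide.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: one string per document
documents = [
    "the Python profiling module",
    "code performance is a module",
    "the code can profile the code",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # sparse matrix: one row per document
print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

With the default settings the vectorizer lowercases the text and ignores one-letter tokens, so "a" does not get a column.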

Page 41: On the code of data science

"Big" data
Engineering efficient processing pipelines

Many samples or many features

[Figure: samples × features matrices, one with many samples and one with many features]

See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns

Page 42: On the code of data science

2 Many samples: on-line algorithms

estimator.partial_fit(X_train, Y_train)

[Figure: the data matrix split into chunks of samples]

Page 43: On the code of data science

2 Many samples: on-line algorithms

estimator.partial_fit(X_train, Y_train)

Supervised models: predicting
sklearn.naive_bayes...
sklearn.linear_model.SGDRegressor
sklearn.linear_model.SGDClassifier

Clustering: grouping samples
sklearn.cluster.MiniBatchKMeans
sklearn.cluster.Birch

Linear decompositions: finding new representations
sklearn.decomposition.IncrementalPCA
sklearn.decomposition.MiniBatchDictionaryLearning
sklearn.decomposition.LatentDirichletAllocation
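
An on-line learning loop built on `partial_fit` might look like the sketch below. The generator `batches` and its synthetic two-cluster data are hypothetical; in a real setting each chunk would be read from disk or from a stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)


def batches(n_batches, batch_size=100):
    """Yield synthetic (X, y) chunks, standing in for data read piece by piece."""
    for _ in range(n_batches):
        y = rng.randint(0, 2, size=batch_size)
        X = rng.normal(size=(batch_size, 2)) + 3.0 * (2 * y[:, None] - 1)
        yield X, y


estimator = SGDClassifier(random_state=0)
for X_batch, y_batch in batches(20):
    # `classes` lists every label up front: a single batch may not contain them all
    estimator.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

X_new, y_new = next(batches(1))
print("held-out accuracy:", estimator.score(X_new, y_new))
```

Only one chunk is held in memory at a time, so the total number of samples is bounded by storage or patience rather than by RAM.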

Page 44: On the code of data science

2 Many features: on-the-fly data reduction

→ Reduce the data as it is loaded

X_small = estimator.transform(X_big, y)

Page 45: On the code of data science

2 Many features: on-the-fly data reduction

Random projections (will average features)
sklearn.random_projection
random linear combinations of the features

Fast clustering of features
sklearn.cluster.FeatureAgglomeration
on images: super-pixel strategy

Hashing when observations have varying size (e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
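
Two of the reductions listed above, hashing and random projections, can be combined in a short sketch. The toy text chunks and the sizes (`n_features=2 ** 10`, `n_components=50`) are arbitrary choices made for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.random_projection import SparseRandomProjection

# Stateless: no fit needed, so each chunk can be vectorized independently
vectorizer = HashingVectorizer(n_features=2 ** 10)
chunk_1 = ["the Python profiling module", "code performance"]
chunk_2 = ["a module is code", "profiling can help", "Python code"]
X_1 = vectorizer.transform(chunk_1)
X_2 = vectorizer.transform(chunk_2)

# Random projection: fit only draws the random matrix for the input width
projection = SparseRandomProjection(n_components=50, random_state=0)
X_1_small = projection.fit_transform(X_1)
X_2_small = projection.transform(X_2)
print(X_1.shape, X_2.shape, X_1_small.shape, X_2_small.shape)
```

Because neither step needs to see the whole dataset, chunks can be reduced as they are loaded, possibly in parallel workers.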

Page 47: On the code of data science

More gems in scikit-learn

SAG:
linear_model.LogisticRegression(solver='sag')
Fast linear model on biggish data

PCA == RandomizedPCA (0.18)
Heuristic to switch PCA to random linear algebra
Fights global warming
Huge speed gains for biggish data

Page 48: On the code of data science

More gems in scikit-learn

New cross-validation objects (0.18)

Old API (sklearn.cross_validation):

    from sklearn.cross_validation import StratifiedKFold

    cv = StratifiedKFold(y, n_folds=2)
    for train, test in cv:
        X_train = X[train]
        y_train = y[train]

Data-independent → better nested-CV

Page 49: On the code of data science

More gems in scikit-learn

New cross-validation objects (0.18)

New API (sklearn.model_selection):

    from sklearn.model_selection import StratifiedKFold

    cv = StratifiedKFold(n_splits=2)
    for train, test in cv.split(X, y):
        X_train = X[train]
        y_train = y[train]

Data-independent → better nested-CV
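
A runnable sketch of the data-independent cross-validation object. The synthetic `X` and `y` and the use of `LogisticRegression` are assumptions made to have something to split and score; note that `sklearn.model_selection.StratifiedKFold` names its parameter `n_splits`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 3)) + y[:, None]  # hypothetical toy data

cv = StratifiedKFold(n_splits=2)  # built without the data

for train, test in cv.split(X, y):
    X_train, y_train = X[train], y[train]
    print(len(train), len(test), np.bincount(y[test]))

# The same object can be handed to other tools, which call split() themselves
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
print(scores)
```

Since the splitter no longer holds the data, it can be reused on whatever subset an outer loop provides, which is what makes nested cross-validation easier to write.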

Page 50: On the code of data science

More gems in scikit-learn

Outlier detection and isolation forests (0.18)

Page 51: On the code of data science

3 Outlook: infrastructure

No strings attached

Page 52: On the code of data science

3 Dataflow is key to scale

Array computing (CPU)

Data parallel

Streaming

[Figure: blocks of numbers illustrating how data flows in each pattern]

Parallel computing
Data + code transfer
Out-of-memory persistence

These patterns can yield horrible code

Page 53: On the code of data science

3 Parallel-computing engine: joblib

sklearn.Estimator(n_jobs=2)

Under the hood: joblib

    >>> from math import sqrt
    >>> from joblib import Parallel, delayed
    >>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
    ...                    for i in range(8))
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

Threading and multiprocessing mode

New: distributed computing backends:
Yarn, dask.distributed, IPython.parallel

    import distributed.joblib
    from joblib import Parallel, parallel_backend
    with parallel_backend('dask.distributed',
                          scheduler_host='HOST:PORT'):
        # normal Joblib code

Page 57: On the code of data science

3 Parallel-computing engine: joblib

sklearn.Estimator(n_jobs=2)

Under the hood: joblib

    >>> from math import sqrt
    >>> from joblib import Parallel, delayed
    >>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
    ...                    for i in range(8))
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

Threading and multiprocessing mode

New: distributed computing backends:
Yarn, dask.distributed, IPython.parallel

    import distributed.joblib
    from joblib import Parallel, parallel_backend
    with parallel_backend('dask.distributed',
                          scheduler_host='HOST:PORT'):
        # normal Joblib code

Algorithmic plain-Python code with optional bells and whistles

Page 58: On the code of data science

3 Persistence to disk

Persisting any Python object fast, with little overhead

Streamed compression
I/O from/in open file handles
→ in S3, HDFS

Page 59: On the code of data science

3 joblib.Memory as a storage pool (in dev)

S3/HDFS/cloud backend:

    joblib.Memory('uri', backend='s3')
    https://github.com/joblib/joblib/pull/397

Cache replacement:

    mem = joblib.Memory('uri', bytes_limit='1G')
    mem.reduce_size()  # Remove oldest

Out-of-memory computing

    >>> result = mem.cache(g).call_and_shelve(a)
    >>> result
    MemorizedResult(cachedir="...", func="g", argument_hash="...")
    >>> c = result.get()

Page 62: On the code of data science

3 joblib as bricks of a compute engine: vision

[Figure: data arrays]

Parallel on a cloud/cluster

Memory + shelving on distributed stores as out-of-core & common computation

Simple algorithmic code
Using infrastructure without buying into it

Page 64: On the code of data science

Time to wrap up

Page 65: On the code of data science

Code, code, code

Page 66: On the code of data science

Scipy-lectures: learning numerical Python

Many problems are better solved by documentation than new code

Page 67: On the code of data science

Scipy-lectures: learning numerical Python

Comprehensive document: numpy, scipy, ...

1. Getting started with Python for science
2. Advanced topics
3. Packages and applications

http://scipy-lectures.org

Page 68: On the code of data science

Scipy-lectures: learning numerical Python

Code examples

Page 69: On the code of data science

On the code of data science

@GaelVaroquaux

I believe in code

without compromises

Page 70: On the code of data science

On the code of data science

@GaelVaroquaux

I believe in code without compromises

Libraries
with side-effect free code
tight APIs
decoupling of analysis and plotting

Software engineering
version control

Page 71: On the code of data science

On the code of data science

@GaelVaroquaux

I believe in code without compromises

Code empowers
enables automation
opens new applications

Disrupt something new

We use scikit-learn for markers of neuropsychiatric diseases

Page 72: On the code of data science

On the code of data science

@GaelVaroquaux

I believe in code without compromises

Code empowers

Software purity holds back
Interactivity: brittle and costly but necessary for insights

Refusing syntax shortcuts → verbosity

Page 73: On the code of data science

On the code of data science

@GaelVaroquaux

I believe in code without compromises

Code empowers

Software purity holds back

The cost is worth the benefit in the long run