scikit-learn: the state of the union 2016

Scikit-learn: The state of the union. Gaël Varoquaux, Open Source Innovation Spring 2016. Personal point of view, as an opening to scikit-learn days 2016 in Paris.

Upload: gael-varoquaux

Post on 13-Apr-2017

TRANSCRIPT

Page 1: Scikit-learn: the state of the union 2016

Scikit-learn: The state of the union

Gaël Varoquaux

Open Source Innovation Spring 2016

Personal point of view, as an opening to scikit-learn days 2016 in Paris

Page 2: Scikit-learn: the state of the union 2016

1 Some history

Scikit-learn canal historique

G Varoquaux 2

Page 3: Scikit-learn: the state of the union 2016

1 scikit-learn growth: users

Website users (weekly): Google analytics

Debian popcon: ∼ 1% of the Debian users

Web searches: Google trends


Page 5: Scikit-learn: the state of the union 2016

1 scikit-learn growth: lines of code

Lines of code:

Huge feature set

https://www.openhub.net/p/scikit-learn


Page 6: Scikit-learn: the state of the union 2016

1 scikit-learn growth: contributors

Contributors:

759 contributors

https://www.openhub.net/p/scikit-learn

Page 7: Scikit-learn: the state of the union 2016

1 Started as David Cournapeau’s failed PhD project

David then preferred improving numpy/scipy

That’s David sprinting in 2011

Page 8: Scikit-learn: the state of the union 2016

1 2009: We (Inria Parietal) need machine learning

My team takes over the development

Hire a young guy (Fabian Pedregosa)

Put post-docs and PhDs (Alexandre Gramfort, Vincent Michel...)

Work in the open

Pythonic, fast, documented

Page 9: Scikit-learn: the state of the union 2016

1 2010: ICML MLOSS workshop

Machine Learning Open Source Software

“The examples in the tutorial are pretty, but not particularly useful for the serious user.”

“For the sustainability of the project it might be better to narrow the focus...”

Page 10: Scikit-learn: the state of the union 2016

1 2011: NIPS sprint

People that I didn’t know were solving my problems

The project took off because of the community...

Page 12: Scikit-learn: the state of the union 2016

2 Upcoming cool stuff

Upcoming 0.18 release

Page 13: Scikit-learn: the state of the union 2016

2 Less code: Cython no longer embedded

Lines of code:

Generated C no longer embedded in git
⇒ opens the door to fused types (polymorphism)
⇒ multiple dtypes support in algorithms
= memory saver

Arthur Mensch
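A rough illustration of the "memory saver" point, using NumPy as a stand-in for Cython fused types. The helper `column_means` is hypothetical and not part of scikit-learn; it only shows what it means for a routine to compute in the dtype it is given instead of upcasting everything to float64.

```python
import numpy as np

rng = np.random.RandomState(0)
X64 = rng.rand(1000, 50)          # NumPy default: float64
X32 = X64.astype(np.float32)      # same values, half the memory


def column_means(X):
    """Hypothetical dtype-preserving routine: accumulate in X's own dtype."""
    return X.sum(axis=0, dtype=X.dtype) / X.shape[0]


means32 = column_means(X32)
means64 = column_means(X64)
print(X64.nbytes, X32.nbytes)     # bytes used by each copy of the data
print(means32.dtype, means64.dtype)
```

An algorithm that only accepts float64 would force a conversion (and a full copy) of `X32`; supporting both dtypes avoids it, at the price of lower precision in float32.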

Page 15: Scikit-learn: the state of the union 2016

2 Faster code: better algorithmics

RandomizedPCA → PCA

Automatic choice between: randomized linear algebra / power iteration (arpack) / full (lapack)

For large data: up to 20× speed-up

https://github.com/scikit-learn/scikit-learn/issues/5243

Giorgio Patrini
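A minimal sketch of the merged estimator, assuming the `svd_solver` parameter as it exists in released scikit-learn (0.18 and later); the exact names may differ from what the issue discussed at the time of the talk. The data is random and only serves to show the calls.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(500, 40)

# "auto" lets PCA pick a solver from the data shape and n_components
pca_auto = PCA(n_components=5, svd_solver="auto", random_state=0).fit(X)

# The solvers can also be requested explicitly
pca_full = PCA(n_components=5, svd_solver="full").fit(X)                # LAPACK
pca_arpack = PCA(n_components=5, svd_solver="arpack", random_state=0).fit(X)
pca_rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

print(pca_full.explained_variance_)
print(pca_arpack.explained_variance_)
```

The randomized solver trades a little accuracy for speed, which is where the large-data speed-up comes from.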

Elkan’s K-means

For large data: ∼ 2× speed-up.

https://github.com/scikit-learn/scikit-learn/pull/5414

Andreas Müller
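A small sketch of how Elkan's variant can be selected, assuming the `algorithm="elkan"` option of `KMeans` in released scikit-learn. The blobs dataset is only illustrative; on data this small no speed-up should be expected.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Elkan's algorithm uses the triangle inequality to skip distance computations
km_elkan = KMeans(n_clusters=3, algorithm="elkan", n_init=10, random_state=0).fit(X)

# Default algorithm, same initializations, for comparison
km_default = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km_elkan.inertia_, km_default.inertia_)
```

Both variants optimize the same objective, so with identical initializations they should reach essentially the same inertia.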

Page 17: Scikit-learn: the state of the union 2016

2 New cross-validation objects

from sklearn.cross_validation import StratifiedKFold

# Old API: the labels y are passed to the constructor
cv = StratifiedKFold(y, n_folds=2)
for train, test in cv:
    X_train = X[train]
    y_train = y[train]

Data-independent ⇒ nested-CV possible

https://github.com/scikit-learn/scikit-learn/pull/4294

Raghav R V

Page 18: Scikit-learn: the state of the union 2016

2 New cross-validation objects

from sklearn.model_selection import StratifiedKFold

# New API: the data is passed to split(), not to the constructor
# (n_splits replaces n_folds in sklearn.model_selection)
cv = StratifiedKFold(n_splits=2)
for train, test in cv.split(X, y):
    X_train = X[train]
    y_train = y[train]

Data-independent ⇒ nested-CV possible

https://github.com/scikit-learn/scikit-learn/pull/4294

Raghav R V
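A minimal sketch of what data-independent CV objects make possible: the same kind of splitter can be handed to an inner hyper-parameter search and to an outer evaluation loop. The dataset, the estimator and the grid of `C` values are illustrative choices, not taken from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Neither splitter has seen the data yet: they only describe a strategy
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: choose C; outer loop: estimate the accuracy of the whole procedure
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```

With the old objects, the inner splitter would have needed the labels of each outer training fold at construction time, which made this nesting awkward.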

Page 19: Scikit-learn: the state of the union 2016

2 Sequential / Bayesian search CV

See hyper-parameter selection as a Bayesian optimization / noisy fit problem.

⇒ choose hyper-parameters cleverly, not on a grid

Pull request stalled

https://github.com/scikit-learn/scikit-learn/pull/5491

Fabian Pedregosa, Sebastien Dubois, & Manoj Kumar
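A toy sketch of the idea, not the implementation in the pull request: fit a Gaussian process to the cross-validation scores seen so far, then evaluate next the candidate with the highest upper confidence bound. The dataset, the searched parameter (`gamma` of an `SVC`), the kernel, the 1.96 factor and the helper `cv_score` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]                      # keep the example fast


def cv_score(log_gamma):
    """Noisy objective: mean 3-fold accuracy for a given log10(gamma)."""
    return cross_val_score(SVC(gamma=10 ** log_gamma), X, y, cv=3).mean()


candidates = np.linspace(-6, 0, 61).reshape(-1, 1)
tried = [-6.0, -3.0, 0.0]                    # a few initial evaluations
scores = [cv_score(g) for g in tried]

for _ in range(5):
    # Surrogate model of score as a function of log10(gamma)
    gp = GaussianProcessRegressor(
        kernel=Matern(nu=2.5) + WhiteKernel(noise_level=1e-3), normalize_y=True
    )
    gp.fit(np.array(tried).reshape(-1, 1), scores)
    mean, std = gp.predict(candidates, return_std=True)
    # Upper confidence bound: favor points that look good or are uncertain
    next_log_gamma = float(candidates[np.argmax(mean + 1.96 * std), 0])
    tried.append(next_log_gamma)
    scores.append(cv_score(next_log_gamma))

best_log_gamma = tried[int(np.argmax(scores))]
print(best_log_gamma, max(scores))
```

Eight fits are spent here, against 61 for an exhaustive pass over the same candidate grid.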


Page 20: Scikit-learn: the state of the union 2016

3 Vision(s): the future


Page 21: Scikit-learn: the state of the union 2016

Mission statement

Enable progress via data science

Lower the costs, less technicalities

Machine learning for everybody and for everything

Small hardware, medium data

Page 23: Scikit-learn: the state of the union 2016

3 Deep learning

sklearn.neural_network.MLPClassifier

Architecture-specification language, GPUs: unbounded technicality

keras, caffe...
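A short sketch of what `MLPClassifier` offers: a plain multi-layer perceptron behind the usual fit/score interface, with no architecture language and no GPU. The dataset, the hidden layer size and `max_iter` are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
# Pixel values range from 0 to 16; rescale them to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(X / 16.0, y, random_state=0)

# One hidden layer of 64 units, trained on CPU
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
mlp.fit(X_train, y_train)
accuracy = mlp.score(X_test, y_test)
print(accuracy)
```

Anything beyond stacked dense layers (convolutions, custom architectures, GPU training) is left to dedicated libraries such as keras or caffe.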

Page 25: Scikit-learn: the state of the union 2016

3 AutoML

Automatic model selection

Better hyper-parameter selection

Better description and uniformization of estimators

Integrate feedback from auto-sklearn

Page 26: Scikit-learn: the state of the union 2016

3 Better, faster, stronger

Faster models
From lightning, back to sklearn
Inspiration from XGBoost (the paper is out!)

Larger data
More partial_fit
Online forests?
Fewer copies
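A minimal sketch of the `partial_fit` pattern for data that does not fit in memory, using `SGDClassifier`, one of the estimators that already supports it. The synthetic dataset and the batch size are assumptions; in practice each batch would be read from disk or a stream.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)
classes = np.unique(y)            # must be known up front for partial_fit

clf = SGDClassifier(random_state=0)
for start in range(0, 1500, 250):
    batch = slice(start, start + 250)
    # Each call updates the model with one batch, without revisiting old data
    clf.partial_fit(X[batch], y[batch], classes=classes)

accuracy = clf.score(X[1500:], y[1500:])
print(accuracy)
```

Only one batch needs to be in memory at a time, which is what "larger data" on small hardware requires.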

Page 28: Scikit-learn: the state of the union 2016

3 Scaling up (out?)

I don’t want java/scala
Less fluid prototyping
Cross-VM debugging is hard
Numerics in java are slower than Lapack

Need C somewhere

Page 30: Scikit-learn: the state of the union 2016

3 Scaling up (out?)

I don’t want java/scala

They have:
Coupling distributed store to computation
Distributed job management

Create new stack? Ride on this one?

Blaze, Ibis, dask: require rewrite of algorithms
dask promising for ETL

New backends for joblib parallel and storage: distributed, ssh
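A sketch of the backend mechanism in joblib, assuming the `parallel_backend` context manager of released joblib. The built-in "threading" backend is used here because it runs anywhere; the distributed and ssh backends mentioned on the slide are not shown, and their names are not assumed.

```python
from math import sqrt

from joblib import Parallel, delayed, parallel_backend

# The calling code does not change; only the backend that executes it does
with parallel_backend("threading"):
    results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(8))

print(results)
```

Because scikit-learn delegates its parallelism to joblib, a new backend could in principle spread existing estimators over several machines without rewriting the algorithms.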

Page 31: Scikit-learn: the state of the union 2016

Sustainable growth
Reviewing is the bottleneck
User support drowns core devs
Users need stability (Airbus)

Coding is not the only thing
Sprints, GSoC management, tutorials...

Structure & stability
How to organize funding and governance?
Process/meetings/reports/funding proposals... ≠ work on project

Passionate coders get a lot done
unless they get drowned by meetings

Page 33: Scikit-learn: the state of the union 2016

@GaelVaroquaux

Funding: Inria, Nexedi, Paris-Saclay CDS, NYU CDS, GSoC