Data Science with R for Java Developers
As presented at JavaOne 2013
Data Science With R ~ for ~ Java Developers
@Sander_Mak
Agenda
Data Science
The R language
Gimme some Java!
90% of the world’s data was produced in the last 2 years
- SINTEF/ScienceDaily, June 2013
We need more than just CRUD
Stand back.
I know Data Science!
Software Engineering
Domain Expertise
Math & Statistics
Data Science
Machine Learning
Operations Research
Danger! Perl ahead!
Today: R, RStudio
Data Science: Achievement Unlocked
Agenda
Data Science
The R language
Gimme some Java!
Language Designers? Statisticians?
The best thing about R is that it was developed by statisticians. The worst thing about R is that... it was developed by statisticians. - Bo Cowgill, Google
Why R, then?
Open Source
De-facto standard (in statistical research)
“It’s a DSL posing as general purpose language”
Interactive data exploration
Why not R, then?
‘If you are using R and you think you’re in hell, this is a map for you.’
- The R Inferno
Slow
Memory Bound
(Did I mention it’s a quirky language?)
Try googling for R...
Apparently, statisticians aren’t designers, either...
R                           vs.  Java
Dynamic (eval)                   Static types
Interpreted                      Compiled
Functional/OO/Procedural         OO
factor                           Enum
numeric                          Integer/Double/...
character                        String
vector, list, dataframe
1-based (R): 1 2 3 4
0-based (Java): 0 1 2 3
for-loops (Java)
higher-order functions (R):
sapply(vec, function(elm) { elm + 1; })
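For comparison, here is a minimal Java sketch of the same "add 1 to every element" operation, once as a plain for-loop and once with a function passed to `map()`. The class and method names are made up for illustration, and the stream version assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class SapplyExample {
    // Loop version: visit each element and collect elm + 1.
    static List<Integer> incrementWithLoop(List<Integer> vec) {
        List<Integer> result = new ArrayList<>();
        for (Integer elm : vec) {
            result.add(elm + 1);
        }
        return result;
    }

    // Higher-order version: hand the function to map(), similar in spirit to sapply.
    static List<Integer> incrementWithMap(List<Integer> vec) {
        return vec.stream().map(elm -> elm + 1).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> vec = Arrays.asList(1, 2, 3);
        System.out.println(incrementWithLoop(vec));
        System.out.println(incrementWithMap(vec));
    }
}
```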
lazy evaluation (R) vs. eager evaluation (Java)
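A small Java sketch of the difference: Java evaluates method arguments before the call, whereas R only evaluates a function argument when it is first used. Wrapping the argument in a `Supplier` is one way to approximate the lazy behaviour in Java. All names here are hypothetical, and the counter exists only to make the evaluations visible.

```java
import java.util.function.Supplier;

final class LazyArgumentExample {
    // Counts how often the "expensive" computation actually runs.
    static int evaluations = 0;

    static int expensive() {
        evaluations++;
        return 42;
    }

    // Eager: the caller has already computed 'value' before this method starts.
    static int pickEager(boolean use, int value) {
        return use ? value : 0;
    }

    // Deferred: the Supplier is only invoked when the value is really needed.
    static int pickLazy(boolean use, Supplier<Integer> value) {
        return use ? value.get() : 0;
    }

    public static void main(String[] args) {
        pickEager(false, expensive());                    // expensive() runs anyway
        pickLazy(false, LazyArgumentExample::expensive);  // expensive() is skipped
        System.out.println("evaluations: " + evaluations);
    }
}
```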
pass-by-value, copy-on-write (R) vs. pass-by-reference (Java)
[Diagram: call F(A); when function F modifies value A, it gets a copy A'; the caller's value A is unchanged]
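The Java side of this comparison can be sketched as follows. Strictly speaking, Java passes object references by value, which is why a method can modify the caller's array. The second method copies first, roughly imitating what the caller observes in R. The names are illustrative only.

```java
import java.util.Arrays;

final class ArgumentPassingExample {
    // Modifies the very array the caller passed in: both sides share one object.
    static void incrementInPlace(int[] values) {
        for (int i = 0; i < values.length; i++) {
            values[i] += 1;
        }
    }

    // Copies before modifying, so the caller's array stays untouched (R-like result).
    static int[] incrementCopy(int[] values) {
        int[] copy = Arrays.copyOf(values, values.length);
        for (int i = 0; i < copy.length; i++) {
            copy[i] += 1;
        }
        return copy;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = incrementCopy(a);
        System.out.println(Arrays.toString(a) + " " + Arrays.toString(b));
        incrementInPlace(a);
        System.out.println(Arrays.toString(a));
    }
}
```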
RStudio
Comprehensive R Archive Network (CRAN)
Central
Coding time!
Titanic Competition: Machine Learning from Disaster
Decision Tree
[Diagram: decision tree with splits Sex == Female, Age > 50, Age > 16 and Fare > 100; leaves labeled T/F]
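To make the idea concrete for Java readers: a decision tree is just nested conditions over the features. The sketch below reuses the split conditions from the slide, but how the splits are arranged and which leaf is T or F are assumptions made for illustration, not the tree from the talk. The `Passenger` class is likewise hypothetical.

```java
final class TitanicTreeSketch {
    // Hypothetical holder for the three features used in the splits.
    static final class Passenger {
        final boolean female;
        final double age;
        final double fare;

        Passenger(boolean female, double age, double fare) {
            this.female = female;
            this.age = age;
            this.fare = fare;
        }
    }

    // Assumed arrangement of the splits; the leaf labels are illustrative only.
    static boolean predictSurvived(Passenger p) {
        if (p.female) {
            return !(p.age > 50);      // assumed: F above 50, T otherwise
        }
        if (p.age > 16) {
            return p.fare > 100;       // assumed: T only for high fares
        }
        return true;                   // assumed: T for the remaining branch
    }

    public static void main(String[] args) {
        System.out.println(predictSurvived(new Passenger(true, 30, 10)));
        System.out.println(predictSurvived(new Passenger(false, 40, 10)));
    }
}
```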
Random Forest
[Diagram: several decision trees side by side, each with its own T/F leaves]
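A random forest combines many trees by letting them vote. The following sketch shows only that voting step, with each "tree" reduced to a `Predicate`. The three example trees and their thresholds are invented for illustration. A real random forest also trains each tree on a random sample of rows and features, which is not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class ForestVoteSketch {
    // Returns true when more than half of the trees predict true for the sample.
    static <T> boolean majorityVote(List<Predicate<T>> trees, T sample) {
        long yes = trees.stream().filter(tree -> tree.test(sample)).count();
        return yes * 2 > trees.size();
    }

    // Three hypothetical one-split "trees" that look only at a passenger's age.
    static List<Predicate<Double>> exampleTrees() {
        return Arrays.asList(
                age -> age < 16,
                age -> age < 50,
                age -> age < 10);
    }

    public static void main(String[] args) {
        System.out.println(majorityVote(exampleTrees(), 12.0));
        System.out.println(majorityVote(exampleTrees(), 60.0));
    }
}
```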
Demo time!
...
...
Agenda
Data Science
The R language
Gimme some Java!
Bridging R and Java
Integrate
Assimilate
Replace
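Integrate: rJava & Java/R Interface (JRI)
Two-way native interface (JNI: libjri) or TCP to Rserve
import org.rosuda.JRI.REXP;
import org.rosuda.JRI.RVector;
import org.rosuda.JRI.Rengine;

// start R without its own main loop and without callbacks
Rengine re = new Rengine(new String[] {}, false, null);
// wait until engine is ready
if (!re.waitForR()) {
    throw new IllegalStateException("Can't load R engine");
}
// load the built-in cars dataset (convert = false: result is not converted)
re.eval("data(cars)", false);
REXP cars = re.eval("cars");
RVector carsVector = cars.asVector();
// dissect carsVector...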
Assimilate
Reimplementation of R on JVM
Fast & lean
Parallelized
Just-another-lib
... not production ready yet...
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// create a script engine manager
ScriptEngineManager factory = new ScriptEngineManager();
// create an R engine
ScriptEngine engine = factory.getEngineByName("Renjin");
// load package from classpath
engine.eval("library(survey)");
// evaluate R code from String
engine.eval("print('Hello from R')");
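One practical detail when looking up a script engine by name: `ScriptEngineManager.getEngineByName` returns `null` when no matching engine is on the classpath, so it is worth checking before calling `eval`. The helper below is a minimal sketch using only the JDK's `javax.script` API. The class and method names are hypothetical.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

final class EngineLookupSketch {
    // True when a JSR-223 engine with this name is registered on the classpath.
    static boolean isAvailable(String engineName) {
        return new ScriptEngineManager().getEngineByName(engineName) != null;
    }

    // Returns the engine, or fails with a clear message instead of a later NullPointerException.
    static ScriptEngine requireEngine(String engineName) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (engine == null) {
            throw new IllegalStateException(
                    "No script engine named '" + engineName + "' found on the classpath");
        }
        return engine;
    }

    public static void main(String[] args) {
        // Prints false unless the Renjin script engine jar is on the classpath.
        System.out.println("Renjin available: " + isAvailable("Renjin"));
    }
}
```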
Share data:
Integer[] data = {1, 2, 3};
engine.put("data", data);
engine.eval("print(sum(data))");
Use Java from Renjin:
import(com.foo.User)
# instantiate Java beans
tim <- User$new(name='Tim', age=23)
tom <- User$new(name='Tom', age=45)
# invoke setter
tim$name <- "Timmy"
Big Data?
Replace: JVM libraries/platforms
Replace: Scalable R distributions (non-JVM)
Revolution Analytics
Oracle Enterprise R
Wrap-up
Data Science
The R language
Gimme some Java!
Sanitize, Explore, Model, Predict, Scale
Next steps
Computing for Data Analysis starts Sept. 23rd
Install R
Read
Questions?
@Sander_Mak
branchandbound.net