Copyright © 2016, Oracle and/or its affiliates. All rights reserved. |
Simulating a Data Science Pipe-Line on your Laptop
Confidential – Oracle Internal/Restricted/Highly Restricted 1
Ed Bullen, Oracle UK
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Oracle
Open Source Projects at Oracle
http://openjdk.java.net/projects/graal/
Motivation
[Diagram: Maths, Science, Programming, Engineering]
Data Science and Engineering
A Simple Data Science Pipe-Line
Engineering a Data Processing Pipe-Line
Source Raw Data: UK Crime Data
Pre-Process: Hadoop Map (Python)
Summarise: Hadoop Reduce (Python)
Consumers: HDFS, Hive, R Studio
Map and Reduce steps run through the Hadoop Streaming API
A Simple Approach – well known (not latest cutting-edge tech) … but …
Stable – effective, easy to implement, static technology components
The Oracle Big Data Lite VM
Free, Simple to Install – Fast Track Access to Hadoop Stack Technologies
Main Download Site: http://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html
Personal Blog (Additional Assistance and Network Configuration Tips): https://pygot.wordpress.com/2016/07/08/getting-started-with-the-oracle-hadoop-vm/
Map Reduce
A Quick Refresher – Map, Shuffle, Reduce
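[Diagram: input blocks in HDFS are read by MAP tasks on Node 1 and Node 2; the mapped pairs are shuffled by key to REDUCE tasks on Node 1 and Node 2, which output one count per key (=3, =1, =2, =2)]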
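The three phases in the refresher above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Hadoop code: the function names, the two "node" lists and the crime labels are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(records):
    """Map: emit a (key, 1) pair for every input record."""
    for record in records:
        yield record, 1


def shuffle_phase(pairs):
    """Shuffle: group all values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input, split across two "nodes"
node1 = ["burglary", "robbery", "burglary", "drugs"]
node2 = ["burglary", "robbery", "shoplifting", "shoplifting"]

# Each node maps its own records; the shuffle brings equal keys together
pairs = chain(map_phase(node1), map_phase(node2))
counts = reduce_phase(shuffle_phase(pairs))
print(counts)
```

On a real cluster the shuffle also moves data between nodes so that each reducer sees every value for the keys it owns; here a single dictionary stands in for that step.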
Hadoop Streaming API
Deploy Python and R Straight to Hadoop
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file my_python_mapper.py -mapper "python my_python_mapper.py" \
-file my_python_reducer.py -reducer "python my_python_reducer.py" \
-input /user/hadoopuser/source_HDFS_dir \
-output /user/hadoopuser/dest_HDFS_dir
Hadoop HDFS (input)
Mapper executed in OS shell, writes to STD-OUT
Hadoop SORT and NODE PARTITION, passes sorted records to STD-IN
Reducer executed in OS shell, reads STD-IN and writes STD-OUT
Hadoop HDFS (output)
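The flow above means a streaming mapper and reducer are just programs that read lines and write lines. A minimal sketch of that contract, assuming tab-separated key/value lines and a simple word count (the function names and the `run_local` stand-in for Hadoop's sort step are illustrative only):

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'key<TAB>1' line per word, as a mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the values for each run of identical keys; relies on sorted input."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Stand-in for Hadoop: map, sort the mapper output, then reduce."""
    return list(reducer(sorted(mapper(lines))))


result = run_local(["a b a", "b a c"])
print(result)
```

In a real streaming script the mapper would iterate over `sys.stdin` and `print` each line. The reducer can use `groupby` only because Hadoop delivers the records sorted by key.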
Hadoop Streaming API
Sample Code – Python Map-Reduce for UK Crime Data
Example Code on GitHub: https://github.com/edbullen/py-mapred
Source data: https://data.police.uk/data/
Input (raw CSV):
Crime ID,Month,Reported by,Falls within,..,LSOA, Crime type...
,2012-01,Avon and Somerset Constabulary,..,E01014399, Anti-social behaviour...
…
…
,2012-02,Avon and Somerset Constabulary,..,E01014400, Burglary...
…
Output (summarised per month and LSOA):
DATE, LSOA , LSOA_Name , crime[0], crime[1], crime[2], ... crime[n]
2012-01,e01014399, LSOA Desc , 1 , 2 , 0 , ... 4
2012-02,e01014400, LSOA Desc , 1 , 2 , 0 , ... 4
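To make the input-to-output transformation concrete, here is a rough sketch of a map step and a reduce step for this data. It is not the code from the GitHub repository: the column names `Month`, `LSOA code` and `Crime type`, the cut-down sample header and the function names are assumptions for illustration, so check them against the header of the real CSV files.

```python
import csv
from collections import Counter, defaultdict

# Assumed column names; verify against the header of the real CSV files.
MONTH_COL, LSOA_COL, TYPE_COL = "Month", "LSOA code", "Crime type"


def map_crimes(csv_lines):
    """Map step: emit ((month, lsoa), crime_type) for each usable row."""
    for row in csv.DictReader(csv_lines):
        lsoa = (row.get(LSOA_COL) or "").strip().lower()
        if not lsoa:
            continue  # skip rows that carry no LSOA code
        yield (row[MONTH_COL].strip(), lsoa), row[TYPE_COL].strip()


def reduce_crimes(pairs):
    """Reduce step: count crime types per (month, lsoa) key."""
    summary = defaultdict(Counter)
    for key, crime_type in pairs:
        summary[key][crime_type] += 1
    return summary


# Hypothetical sample with a reduced set of columns
sample = [
    "Crime ID,Month,LSOA code,Crime type",
    ",2012-01,E01014399,Anti-social behaviour",
    ",2012-01,E01014399,Burglary",
    ",2012-01,E01014399,Anti-social behaviour",
    ",2012-02,E01014400,Burglary",
    ",2012-02,,Burglary",
]
summary = reduce_crimes(map_crimes(sample))
for (month, lsoa), crimes in sorted(summary.items()):
    print(month, lsoa, dict(crimes))
```

Each `(month, lsoa)` key ends up with one counter per crime type, which corresponds to the `crime[0] ... crime[n]` columns in the summarised output.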
Accessing the Data in R Studio
A Simple Approach
Personal Blog – Connecting R Studio to Hadoop via Hive: https://pygot.wordpress.com/2016/10/13/connecting-r-studio-to-hadoop-via-hive/
# Load Libraries and setup Java ClassPath
library("DBI")
library("rJava")
library("RJDBC")
# Java ClassPath for HIVE Access
cp = c("./hive-jdbc.jar"
, "./hadoop-common.jar"
, "./libthrift-0.9.2.jar"
, "./hive-service.jar"
, "./httpclient-4.2.5.jar"
, "./httpcore-4.2.5.jar"
, "./hive-jdbc-standalone.jar")
# Connect to Hive datastore in Hadoop
.jinit(classpath=cp)
drv <- JDBC("org.apache.hive.jdbc.HiveDriver"
, "hive-jdbc.jar")
conn <- dbConnect(drv
, "jdbc:hive2://bigdatalite:10000/default"
, "oracle", "")
# Query Data using SQL
ukcrimesum <- dbGetQuery(conn
, "select * from ukcrimesum")
Analysis of the Data-Set
A quick first-pass…
Analysis of the Data-Set
ukcrimesum <- dbGetQuery(conn
, "select * from ukcrimesum")
#which crimes show correlation?
crimesM <- data.matrix(ukcrimesum[,4:17])
corM <- cor(crimesM)
diag(corM) <- 0
heatmap(corM)
Correlation and Clustering
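For readers who want to see what the `cor()` and `diag(corM) <- 0` steps compute, here is a small pure-Python sketch. It uses the Pearson coefficient (the default method of R's `cor`) and zeroes the diagonal, presumably so that each crime's perfect correlation with itself does not dominate the heatmap. The column values are invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise correlations between named columns, with a zeroed diagonal."""
    names = list(columns)
    return {
        a: {b: 0.0 if a == b else pearson(columns[a], columns[b]) for b in names}
        for a in names
    }


# Hypothetical per-area counts for three crime types
counts = {
    "burglary": [1, 2, 3, 4],
    "robbery": [2, 4, 6, 8],
    "bicycle_theft": [4, 3, 2, 1],
}
cor_m = correlation_matrix(counts)
for name, row in cor_m.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```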
Analysis of the Data-Set
Seasonality
monthagg <- aggregate(cbind(robbery
, burglary
, bicycle_theft
, social) ~ date
, data=monthcrimes
, FUN=sum)
centered <- cbind(monthagg$date
, as.data.frame(apply(monthagg[-1]
, 2
, function(y) y - mean(y))) )
par(mfrow = c(4,1))
attach(centered)
for (name in names(centered)[-1] ) {
barplot(as.vector(centered[name][,1])
, main = paste(name))
}
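The seasonality step above does two things: it sums each crime type per month, then subtracts the overall mean so that bars above and below zero show months that are higher or lower than average. A minimal Python sketch of the same idea, with invented rows and hypothetical helper names:

```python
from collections import defaultdict


def aggregate_by_month(rows, field):
    """Sum one numeric field per month, returned in date order."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["date"]] += row[field]
    return dict(sorted(totals.items()))


def center(values):
    """Subtract the mean so the series is centred on zero."""
    mean = sum(values) / len(values)
    return [value - mean for value in values]


# Hypothetical rows: one per (month, area)
rows = [
    {"date": "2015-01", "burglary": 10},
    {"date": "2015-01", "burglary": 5},
    {"date": "2015-02", "burglary": 9},
    {"date": "2015-03", "burglary": 6},
]
monthly = aggregate_by_month(rows, "burglary")
centered = center(list(monthly.values()))
print(monthly)
print(centered)
```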
Analysis of the Data-Set
Mapping
library("rgeos")
library("maptools")
ukshapefileDETAIL <- "./LSOA_2011_EW_BFE_V2.shp"
ukmap <- readShapeSpatial(ukshapefileDETAIL)
lonmap <- ukmap[match(lonLSOA, ukmap@data$LSOA11CD),]
loncrime <- dbGetQuery(conn, "select LSOA,
sum(total_classified) from ukcrimesum
where date in <...>
and lsoa in <...> group by LSOA")
#Combined Map Data (shapeFile) with added data
lonmap.crime <- SpatialPolygonsDataFrame(lonmap
,loncrime ,match.ID=FALSE)
plot(lonmap.crime
, col = countcols[findInterval(counts
, breaks, all.inside = TRUE)]
, axes = FALSE
, border = "transparent"
, main = "2015 Total Crimes" )
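The colour of each area in the plot comes from binning its count with `findInterval(counts, breaks, all.inside = TRUE)`. The sketch below mimics that binning in Python with `bisect`, clamping out-of-range values into the first or last bin as `all.inside = TRUE` does. The break points, palette and counts are made-up values, and `find_interval` is a hypothetical helper, not a library function.

```python
from bisect import bisect_right


def find_interval(value, breaks):
    """0-based bin index for value, clamped to the valid bins."""
    position = bisect_right(breaks, value)  # number of breaks <= value
    return min(max(position, 1), len(breaks) - 1) - 1


# Hypothetical break points and one colour per bin
breaks = [0, 10, 50, 100]
colours = ["#ffffcc", "#fd8d3c", "#800026"]

counts = [3, 10, 49, 100, 250, -5]
bins = [find_interval(count, breaks) for count in counts]
area_colours = [colours[index] for index in bins]
print(bins)
print(area_colours)
```

Four break points define three bins, so the palette needs three colours; values below the first break or above the last fall into the end bins instead of producing an invalid index.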
Analysis of the Data-Set
Mapping
library("rgeos")
library("maptools")
ukshapefileDETAIL <- "./LSOA_2011_EW_BFE_V2.shp"
ukmap <- readShapeSpatial(ukshapefileDETAIL)
lonmap <- ukmap[match(lonLSOA, ukmap@data$LSOA11CD),]
loncrime <- dbGetQuery(conn, "select LSOA,
sum(bicycle_theft) from ukcrimesum
where date in <...>
and lsoa in <...> group by LSOA")
#Combined Map Data (shapeFile) with added data
lonmap.crime <- SpatialPolygonsDataFrame(lonmap
,loncrime ,match.ID=FALSE)
plot(lonmap.crime
, col = countcols[findInterval(counts
, breaks, all.inside = TRUE)]
, axes = FALSE
, border = "transparent"
, main = "2015 Bicycle Theft" )
Thank You
Social Media and Blog (all personal views, not representing my employer):
@bullened
http://pygot.wordpress.com
http://github.com/edbullen
https://www.meetup.com/Oracle-UK-BigData/