Page 1: Simulating a Data Science Pipe-Line on your Laptop (document metadata: Title "How to Use the PowerPoint Template"; Author: ebullen; Subject: Corporate Presentation Template; Created Date: 10/25/2016)

Copyright © 2016, Oracle and/or its affiliates. All rights reserved. |

Simulating a Data Science Pipe-Line on your Laptop

Confidential – Oracle Internal/Restricted/Highly Restricted 1

Ed Bullen, Oracle UK

Page 2:

Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Page 3:

Oracle

Page 4:

Open Source Projects at Oracle

http://openjdk.java.net/projects/graal/

Page 5:

Motivation

[Diagram: MATHS, SCIENCE, PROGRAMMING, ENGINEERING; caption: Data Science and Engineering]

Page 6:

A Simple Data Science Pipe-Line

Engineering a Data Processing Pipe-Line

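[Diagram: pipe-line stages and the technology used at each stage]

Source Raw Data: UK Crime Data
Pre-Process: Hadoop Map (Python)
Summarise: Hadoop Reduce (Python)
Consumers: HDFS, Hive; R Studio

The Map and Reduce stages run through the Hadoop Streaming API.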

A Simple Approach – well known (not latest cutting-edge tech) … but …

Stable – effective, easy to implement, static technology components

Page 7:

The Oracle Big Data Lite VM

Free, Simple to Install – Fast Track Access to Hadoop Stack Technologies

Main Download Site: http://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html

Personal Blog – Additional Assistance and Network Configuration Tips: https://pygot.wordpress.com/2016/07/08/getting-started-with-the-oracle-hadoop-vm/

Page 8:

Map Reduce

A Quick Refresher – Map, Shuffle, Reduce

[Diagram: HDFS -> Node 1 - MAP, Node 2 - MAP -> Node 1 - REDUCE, Node 2 - REDUCE; reduce outputs: =3, =1, =2, =2]
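The three phases can be imitated in a few lines of plain Python. This is a minimal single-process sketch, not Hadoop code: the input records, the two lists standing in for Node 1 and Node 2, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, 1) pair for every input record."""
    return [(record, 1) for record in records]

def shuffle_phase(pairs):
    """Shuffle: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input, split in two to stand in for two map nodes.
node1_input = ["burglary", "robbery", "burglary"]
node2_input = ["robbery", "burglary", "bicycle_theft"]

mapped = map_phase(node1_input) + map_phase(node2_input)
counts = reduce_phase(shuffle_phase(mapped))
print(counts)
```

On a real cluster the shuffle also decides which reduce node receives each key; here a single dictionary plays that role.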

Page 9:

Hadoop Streaming API

Deploy Python and R Straight to Hadoop

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -file my_python_mapper.py -mapper "python my_python_mapper.py" \
  -file my_python_reducer.py -reducer "python my_python_reducer.py" \
  -input /user/hadoopuser/source_HDFS_dir \
  -output /user/hadoopuser/dest_HDFS_dir

[Diagram: Hadoop HDFS -> Mapper Executed in OS Shell -> STD-OUT -> Hadoop SORT and NODE PARTITION -> STD-IN -> Reducer Executed in OS Shell -> STD-OUT -> Hadoop HDFS]
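Because the mapper and reducer only exchange text lines, the flow above can be rehearsed locally before anything is submitted to Hadoop. The sketch below assumes the common streaming convention of tab-separated key/value lines. The function names and sample records are hypothetical, and `sorted()` stands in for the Hadoop sort step.

```python
from itertools import groupby

def run_mapper(lines):
    """Mapper: turn each non-empty input line into a 'key<TAB>1' line."""
    for line in lines:
        key = line.strip()
        if key:
            yield f"{key}\t1"

def run_reducer(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"

def simulate(lines):
    """Local stand-in for the streaming job: map, sort, then reduce."""
    return list(run_reducer(sorted(run_mapper(lines))))

# Hypothetical records; in a real job they would arrive on standard input.
sample = ["burglary\n", "robbery\n", "burglary\n"]
result = simulate(sample)
print("\n".join(result))
```

In the real scripts each function would read `sys.stdin` and print its output, which is all the Streaming API requires of a mapper or reducer.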

Page 10:

Hadoop Streaming API

Sample Code – Python Map-Reduce for UK Crime Data

Example Code on GitHub: https://github.com/edbullen/py-mapred

Source data: https://data.police.uk/data/

Input (raw CSV):
Crime ID,Month,Reported by,Falls within,..,LSOA, Crime type...
,2012-01,Avon and Somerset Constabulary,..,E01014399, Anti-social behaviour...
,2012-02,Avon and Somerset Constabulary,..,E01014400, Burglary...

Output (summarised):
DATE, LSOA , LSOA_Name , crime[0], crime[1], crime[2], ... crime[n]
2012-01,e01014399, LSOA Desc , 1 , 2 , 0 , ... 4
2012-02,e01014400, LSOA Desc , 1 , 2 , 0 , ... 4
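The reshaping from one row per crime to one row per month and LSOA can be sketched as follows. This is not the code from the GitHub repository. The column names ("Month", "LSOA code", "LSOA name", "Crime type"), the two crime categories, the lower-casing of the LSOA code and the sample rows are all assumptions chosen to resemble the samples above.

```python
import csv
import io
from collections import Counter, defaultdict

# Assumed crime categories, in output column order.
CRIME_TYPES = ["Anti-social behaviour", "Burglary"]

def summarise(csv_text):
    """Count crimes per (month, LSOA) and return one output row per key."""
    counts = defaultdict(Counter)
    names = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["Month"], row["LSOA code"].lower())
        names[key] = row["LSOA name"]
        counts[key][row["Crime type"]] += 1
    rows = []
    for key in sorted(counts):
        month, lsoa = key
        per_type = [counts[key][crime] for crime in CRIME_TYPES]
        rows.append([month, lsoa, names[key]] + per_type)
    return rows

# Invented sample rows; values are illustrative only.
sample = """Crime ID,Month,LSOA code,LSOA name,Crime type
,2012-01,E01014399,Example LSOA A,Anti-social behaviour
,2012-01,E01014399,Example LSOA A,Burglary
,2012-01,E01014399,Example LSOA A,Burglary
,2012-02,E01014400,Example LSOA B,Burglary
"""

summary = summarise(sample)
for row in summary:
    print(row)
```

In a streaming job the counting would be split in two: the mapper emits the (month, LSOA) key with the crime type, and the reducer accumulates the per-type counts.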

Page 11:

Accessing the Data in R Studio

A Simple Approach

Personal Blog – Connecting R Studio to Hadoop via Hive: https://pygot.wordpress.com/2016/10/13/connecting-r-studio-to-hadoop-via-hive/

# Load Libraries and setup Java ClassPath
library("DBI")
library("rJava")
library("RJDBC")

# Java ClassPath for HIVE Access
cp = c("./hive-jdbc.jar"
     , "./hadoop-common.jar"
     , "./libthrift-0.9.2.jar"
     , "./hive-service.jar"
     , "./httpclient-4.2.5.jar"
     , "./httpcore-4.2.5.jar"
     , "./hive-jdbc-standalone.jar")

# Connect to Hive datastore in Hadoop
.jinit(classpath=cp)
drv <- JDBC("org.apache.hive.jdbc.HiveDriver"
          , "hive-jdbc.jar")
conn <- dbConnect(drv
          , "jdbc:hive2://bigdatalite:10000/default"
          , "oracle", "")

# Query Data using SQL
ukcrimesum <- dbGetQuery(conn
          , "select * from ukcrimesum")

Page 12:

Analysis of the Data-Set

A quick first-pass…

Page 13:

Analysis of the Data-Set

Correlation and Clustering

ukcrimesum <- dbGetQuery(conn
          , "select * from ukcrimesum")

# which crimes show correlation?
crimesM <- data.matrix(ukcrimesum[,4:17])
corM <- cor(crimesM)
diag(corM) <- 0
heatmap(corM)
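The correlation step can be reproduced without R to see what the matrix contains. The sketch below computes pairwise Pearson correlations in plain Python and zeroes the diagonal as `diag(corM) <- 0` does, presumably so that each crime type's trivial self-correlation of 1 does not dominate the heatmap. The column values and function names are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise correlations as nested dicts, with the diagonal set to 0."""
    names = list(columns)
    return {
        a: {b: 0.0 if a == b else pearson(columns[a], columns[b]) for b in names}
        for a in names
    }

# Hypothetical per-area counts for three crime types.
columns = {
    "burglary": [1, 2, 3, 4],
    "robbery": [2, 4, 6, 8],
    "bicycle_theft": [4, 3, 2, 1],
}
cor = correlation_matrix(columns)
print(cor["burglary"])
```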

Page 14:

Analysis of the Data-Set

Seasonality

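# Sum the selected crime counts for each month
# (monthcrimes is not defined on this slide)
monthagg <- aggregate(cbind(robbery
                          , burglary
                          , bicycle_theft
                          , social) ~ date
                    , data=monthcrimes
                    , FUN=sum)

# Centre each series on zero by subtracting its mean
centered <- cbind(monthagg$date
                , as.data.frame(apply(monthagg[-1]
                                    , 2
                                    , function(y) y - mean(y))) )

# One bar plot per crime type, stacked vertically
par(mfrow = c(4,1))
for (name in names(centered)[-1] ) {
  barplot(as.vector(centered[name][,1])
        , main = paste(name))
}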
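The two steps, summing per month and then subtracting the mean, can be written out in plain Python to make the arithmetic explicit. This is a sketch with invented figures and hypothetical function names. It handles a single crime type, whereas the R code processes four columns at once.

```python
from collections import defaultdict

def monthly_totals(rows):
    """Sum counts per month; rows are (date, count) pairs."""
    totals = defaultdict(int)
    for date, count in rows:
        totals[date] += count
    return dict(sorted(totals.items()))

def centre(totals):
    """Subtract the mean monthly total from every month."""
    mean = sum(totals.values()) / len(totals)
    return {date: total - mean for date, total in totals.items()}

# Hypothetical per-area counts for one crime type.
rows = [("2015-01", 10), ("2015-01", 20), ("2015-02", 15), ("2015-03", 45)]
totals = monthly_totals(rows)
centred = centre(totals)
print(centred)
```

Months above the average come out positive and months below it negative, which is what lets the bar plots show a seasonal pattern around zero.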

Page 15:

Analysis of the Data-Set

Mapping

library("rgeos")
library("maptools")

ukshapefileDETAIL <- "./LSOA_2011_EW_BFE_V2.shp"
ukmap <- readShapeSpatial(ukshapefileDETAIL)
# lonLSOA, countcols, counts and breaks are not defined on this slide
lonmap <- ukmap[match(lonLSOA, ukmap@data$LSOA11CD),]

loncrime <- dbGetQuery(conn, "select LSOA,
                       sum(total_classified) from ukcrimesum
                       where date in <...>
                       and lsoa in <...> group by LSOA")

# Combined Map Data (shapeFile) with added data
lonmap.crime <- SpatialPolygonsDataFrame(lonmap
                      , loncrime, match.ID=FALSE)

plot(lonmap.crime
   , col = countcols[findInterval(counts
                   , breaks, all.inside = TRUE)]
   , axes = FALSE
   , border = "transparent"
   , main = "2015 Total Crimes" )
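The colour lookup in the `plot` call relies on `findInterval(..., all.inside = TRUE)` to turn each count into a bin index. The sketch below mirrors that binning in Python with `bisect`, using a zero-based index. The break points, colours and counts are invented, since the slide does not show the values of `breaks`, `countcols` or `counts`.

```python
from bisect import bisect_right

def find_interval(x, breaks):
    """Zero-based bin index for x, clamped into the first or last bin.

    Mirrors the effect of R's findInterval(x, breaks, all.inside = TRUE),
    which returns a one-based index between 1 and length(breaks) - 1.
    """
    index = bisect_right(breaks, x) - 1
    return min(max(index, 0), len(breaks) - 2)

# Hypothetical break points and one colour per bin.
breaks = [0, 10, 50, 100]
colours = ["#ffffcc", "#fd8d3c", "#800026"]

# Hypothetical crime counts, one per map area.
counts = [3, 10, 75, 250]
fill = [colours[find_interval(c, breaks)] for c in counts]
print(fill)
```

Clamping matters for areas whose count falls outside the break range: without it the index would fall off the end of the colour vector.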

Page 16:

Analysis of the Data-Set

Mapping

library("rgeos")
library("maptools")

ukshapefileDETAIL <- "./LSOA_2011_EW_BFE_V2.shp"
ukmap <- readShapeSpatial(ukshapefileDETAIL)
lonmap <- ukmap[match(lonLSOA, ukmap@data$LSOA11CD),]

loncrime <- dbGetQuery(conn, "select LSOA,
                       sum(bicycle_theft) from ukcrimesum
                       where date in <...>
                       and lsoa in <...> group by LSOA")

# Combined Map Data (shapeFile) with added data
lonmap.crime <- SpatialPolygonsDataFrame(lonmap
                      , loncrime, match.ID=FALSE)

plot(lonmap.crime
   , col = countcols[findInterval(counts
                   , breaks, all.inside = TRUE)]
   , axes = FALSE
   , border = "transparent"
   , main = "2015 Bicycle Theft" )

Page 17:

Thank You

Social Media and Blog: ** all personal views, not representing my employer **
@bullened
http://pygot.wordpress.com
http://github.com/edbullen

Page 18:

https://www.meetup.com/Oracle-UK-BigData/