Data Science for Big Data

Post on 21-Jan-2018

TRANSCRIPT

1. Data Science for Big Data with Anaconda Enterprise
   Let Anaconda Take Your Organization to the Next Level
   Daniel Rodriguez, Data Scientist
   Gus Cavanaugh, Product Marketing Manager

2. Data Scientist: Daniel Rodriguez
   Daniel Rodriguez is a data scientist and software developer with over five years' experience in areas ranging from DevOps to machine learning. He has performed data analysis and data engineering in big data environments across various industries. Daniel holds a degree in Electrical Engineering from Universidad de los Andes, Colombia, and an MS in IT Management from UT Dallas. He is passionate about open source data technologies and has spoken at PyData and Spark Summit.
   2017 Anaconda, Inc. - Confidential & Proprietary

3. Product Marketing Manager: Gus Cavanaugh
   Gus Cavanaugh is a Product Marketing Manager at Anaconda, where he focuses on translating technical capabilities into user benefits. He has over five years' experience in analytics and consulting for enterprises. Prior to joining Anaconda, he worked on projects ranging from small-scale data apps and dashboards to distributed Hadoop clusters at companies including IBM and Booz Allen Hamilton. Gus holds an MS in Systems Engineering from George Washington University and a BS in Business Administration from Washington & Lee University. He is a frequent speaker on analytics topics for non-technical audiences.

4. Agenda
   - Install Anaconda Distribution on a cluster
   - Review the data and ETL process
   - Analyze data with Spark (Python & R) and Impala
   - One-click deploy an application with Anaconda Enterprise, in Python and R

5. Install Anaconda Distribution on a Cluster
   Python & R runtime on Hadoop. Two options:
   - Build a custom Cloudera CDH parcel or Ambari management pack
   - Create and ship an on-the-fly runtime distribution

6. CDH Parcel and Ambari Management Pack Generation
   Anaconda Enterprise offers a UI for building custom distributions.

7. CDH Parcel and Ambari Management Pack Generation (continued)
   Add packages and versions to the distribution.

8. Install the Anaconda Parcel on a CDH Cluster
   Add the Anaconda parcel to CDH via Cloudera Manager.
   https://docs.anaconda.com/anaconda/user-guide/tasks/integration/cloudera

9. Connect Spark to Anaconda Enterprise
   Connect notebooks to Spark via Apache Livy & Sparkmagic:
   - Install Livy on an edge node
   - Start the Livy server

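Once Livy is running, Sparkmagic finds it through its config file. A minimal sketch of ~/.sparkmagic/config.json, where the Livy URL is a placeholder for your own edge node (the deck does not specify a host):

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-edge-node:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-edge-node:8998"
  }
}
```

With this in place, notebook kernels using Sparkmagic route their Spark sessions through the Livy server rather than needing Spark installed locally.
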
10. Connect Spark to Anaconda Enterprise (continued)
    - Add the Livy server to the Sparkmagic config in your project
    - Start doing your analysis using Spark inside the notebooks

11. Review the Data
    3 billion Reddit comments (2007-2017)
    Format: line-delimited JSON
    We transferred the data to S3; using our Hadoop cluster, we can load the data from S3.
    Source: s3://anaconda-public-datasets/reddit/json

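In line-delimited JSON, each line is one complete comment record. A quick sketch of how a single line parses; the sample record below is fabricated for illustration, with field names taken from the table schema shown later:

```python
import json

# One line of the line-delimited JSON file: a single Reddit comment record.
# This sample is invented; real records carry the full 22-field schema.
line = ('{"author": "someuser", "body": "Nice plot!", '
        '"subreddit": "dataisbeautiful", "score": 42, '
        '"created_utc": "1485960000"}')

comment = json.loads(line)
print(comment["subreddit"], comment["score"])  # → dataisbeautiful 42
```

Because every line stands alone, the file splits cleanly across HDFS blocks, which is what lets Hive and Spark parse it in parallel.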
12. Review the Data: ETL
    A simple ETL process:
    - Distributed copy from Hadoop
    - Download a JSON serializer (SerDe) for Hive
    - Transform the data into Parquet using Hive
    Parquet is a columnar data store that enables fast reads.

13. Review the Data: ETL
    Move data:
      hadoop distcp s3n://{{ AWS_KEY }}:{{ AWS_SECRET }}@anaconda-publi
    Get the JSON serializer:
      wget http://s3.amazonaws.com/elasticmapreduce/samples/hive-ads/libs/jsonserde.jar

14. Review the Data: ETL
    hive> ADD JAR jsonserde.jar;
    hive> CREATE TABLE reddit_json (
            archived boolean,
            author string,
            author_flair_css_class string,
            author_flair_text string,
            body string,
            controversiality int,
            created_utc string,
            distinguished string,
            downs int,
            edited boolean,
            gilded int,
            id string,
            link_id string,
            name string,
            parent_id string,
            removal_reason string,
            retrieved_on timestamp,
            score int,
            score_hidden boolean,
            subreddit string,
            subreddit_id string,
            ups int
          )
          ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
          WITH SERDEPROPERTIES ('paths'='archived,author,author_flair_css_class,author_flair_text,body,controversiality,created_utc,distinguished,downs,edited,gilded,id,link_id,name,parent_id,removal_reason,retrieved_on,score,score_hidden,subreddit,subreddit_id,ups');
    hive> LOAD DATA INPATH '/user/centos/RC_*' INTO TABLE reddit_json;

15. Review the Data: ETL
    hive> CREATE TABLE reddit_parquet (
            archived boolean,
            author string,
            author_flair_css_class string,
            author_flair_text string,
            body string,
            controversiality int,
            created_utc string,
            distinguished string,
            downs int,
            edited boolean,
            gilded int,
            id string,
            link_id string,
            name string,
            parent_id string,
            removal_reason string,
            retrieved_on timestamp,
            score int,
            score_hidden boolean,
            subreddit string,
            subreddit_id string,
            ups int,
            created_utc_t timestamp
          )
          PARTITIONED BY (date_str string)
          STORED AS PARQUET;

16. Review the Data: ETL
    hive> set dfs.block.size=1g;
          set hive.exec.dynamic.partition=true;
          set hive.exec.dynamic.partition.mode=nonstrict;
          set hive.exec.max.dynamic.partitions=1000;
          set hive.exec.max.dynamic.partitions.pernode=1000;
          set hive.optimize.sort.dynamic.partition=true;
    hive> INSERT OVERWRITE TABLE reddit_parquet PARTITION (date_str)
          SELECT *,
                 cast(cast(created_utc as double) as timestamp) AS created_utc_t,
                 date_format(cast(cast(created_utc as double) as timestamp), 'yyyy-MM') AS date_str
          FROM reddit_json;

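The partition key is derived by casting the epoch-seconds string created_utc to a timestamp and formatting it as yyyy-MM, so each partition holds one month of comments. The same transformation in plain Python, as a sketch (assuming UTC, which Hive's behavior may differ from depending on session timezone):

```python
from datetime import datetime, timezone

def partition_key(created_utc: str) -> str:
    """Mirror Hive's date_format(cast(cast(created_utc as double) as timestamp), 'yyyy-MM')."""
    ts = datetime.fromtimestamp(float(created_utc), tz=timezone.utc)
    return ts.strftime("%Y-%m")

print(partition_key("1487424000"))  # → 2017-02
```

Partitioning by month means queries filtered on date_str only read the relevant Parquet files, which matters at 3 billion rows.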
17. Analyze Data with Python and R
    Using PySpark and sparklyr:
    - sparklyr is an R API for Spark
    - PySpark is the Python API for Spark

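A hedged PySpark sketch of the kind of query run against the Parquet table, not the deck's own code: the table path is an assumed HDFS location, and the columns follow the schema above. The import is kept inside the function so the sketch loads even without a Spark install:

```python
def top_subreddits(parquet_path="/user/hive/warehouse/reddit_parquet", n=10):
    """Sketch: load the Reddit Parquet data and rank subreddits by comment count."""
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("reddit-analysis").getOrCreate()
    comments = spark.read.parquet(parquet_path)
    return (comments
            .groupBy("subreddit")
            .agg(F.count("*").alias("n_comments"))
            .orderBy(F.desc("n_comments"))
            .limit(n))
```

The sparklyr side looks analogous: spark_read_parquet() followed by dplyr verbs (group_by, count, arrange), which sparklyr translates to the same Spark SQL plan.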
18. Build an Application
    Impala is great for SQL queries on Hadoop. With Anaconda Enterprise, you aren't limited to just Spark, Python, and R; you can use whichever tools you are familiar with.

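One way to reach Impala from a Python application is the impyla client; the deck does not name a client, so this is an assumption, and the hostname is a placeholder. A sketch querying the monthly partitions built above:

```python
def comments_per_month(host="impala-daemon-host", port=21050):
    """Sketch: count comments per monthly partition via Impala (impyla client)."""
    from impala.dbapi import connect  # impyla package; imported lazily

    conn = connect(host=host, port=port)
    cur = conn.cursor()
    cur.execute(
        "SELECT date_str, COUNT(*) AS n "
        "FROM reddit_parquet GROUP BY date_str ORDER BY date_str"
    )
    return cur.fetchall()
```

Because Impala reads the same Hive metastore tables, no extra ETL is needed to serve these queries from a dashboard or API.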
19. Deploy the Application
    Anaconda Enterprise 5 offers one-click deployments in Python or R. Easily deploy notebooks, APIs, dashboards, and web applications.

20. DEMO
