Data Science for Big Data with Anaconda Enterprise
Let Anaconda Take Your Organization to the Next Level
Daniel Rodriguez, Data Scientist
Gus Cavanaugh, Product Marketing Manager
Data Scientist
Daniel Rodriguez
Daniel Rodriguez is a Data Scientist and Software Developer
with over five years’ experience in areas ranging from DevOps
to machine learning. He has performed data analysis and data
engineering in big data environments across various industries.
Daniel holds a degree in Electrical Engineering from
Universidad de los Andes, Colombia, and an MS in IT Management
from UT Dallas. He is passionate about open
source data technologies and has spoken at PyData and Spark
Summit.
Product Marketing Manager
Gus Cavanaugh
Gus Cavanaugh is a Product Marketing Manager at Anaconda, where he
focuses on translating technical capabilities into user benefits. He has
over five years’ experience in analytics and consulting for enterprises.
Prior to joining Anaconda, he worked on projects ranging from small
scale data apps and dashboards to distributed Hadoop clusters at
companies including IBM and Booz Allen Hamilton.
Gus holds an MS in Systems Engineering from George Washington
University and a BS in Business Administration from Washington & Lee
University. He is a frequent speaker on analytics topics for non-technical audiences.
Agenda
• Install Anaconda Distribution on a cluster
• Review the data and ETL process
• Analyze data with:
• Spark: Python & R
• Impala
• One-click deploy an application with Anaconda Enterprise using Python and R
Install Anaconda Distribution on a Cluster
• Two options:
• Build a custom Cloudera CDH parcel or Ambari management pack
• Create and ship a runtime distribution on the fly (see the sketch below)
Python & R runtime on Hadoop
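For the second option, one common way to build and ship a relocatable runtime on the fly is conda-pack; the following is only a sketch under that assumption (the slides do not name the exact mechanism), and the environment name is hypothetical:

# Package an existing conda environment into a relocatable archive that can
# be shipped to the cluster nodes (e.g. via spark-submit --archives).
import conda_pack

conda_pack.pack(name="analysis_env",            # hypothetical environment name
                output="analysis_env.tar.gz")   # archive to distribute to the cluster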
CDH Parcel and Ambari Mgmt Pack Generation
Anaconda Enterprise offers a UI for building custom distributions
Add packages and versions to distribution
CDH Parcel and Ambari Mgmt Pack Generation
Install Anaconda Parcel on a CDH Cluster
Add Anaconda parcel to CDH via Cloudera Manager
https://docs.anaconda.com/anaconda/user-guide/tasks/integration/cloudera
Connect Spark to Anaconda Enterprise
• Install Livy on edge node
• Start the Livy server
Connect Notebooks to Spark via Apache Livy & Sparkmagic
• Add the Livy server to the Sparkmagic config in your project (a config sketch follows below)
• Start doing your analysis with Spark inside the notebooks
Connect Notebooks to Spark via Apache Livy & Sparkmagic
Connect Spark to Anaconda Enterprise
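A minimal sketch of that configuration step, written as Python that creates ~/.sparkmagic/config.json; the Livy host and port below are placeholders for your edge node:

import json
import os

livy_url = "http://livy-edge-node:8998"   # placeholder Livy server address

# Point both the Python and R Sparkmagic kernels at the Livy server
config = {
    "kernel_python_credentials": {"username": "", "password": "", "url": livy_url},
    "kernel_r_credentials": {"username": "", "password": "", "url": livy_url},
}

path = os.path.expanduser("~/.sparkmagic/config.json")
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    json.dump(config, f, indent=2)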
Review the Data
• Format: line-delimited JSON
• We transferred the data to S3
• Using our Hadoop cluster, we can load the data from S3 (see the sketch below)
3 Billion Reddit comments (2007-2017)
• Source: s3://anaconda-public-datasets/reddit/json
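To get a first look at the raw data, a minimal PySpark sketch like the following could be used; it assumes an S3 connector and credentials are already configured on the cluster, and the s3a:// scheme is an assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reddit-peek").getOrCreate()

# Read the line-delimited JSON comments directly from the public bucket
comments = spark.read.json("s3a://anaconda-public-datasets/reddit/json")
comments.printSchema()
comments.select("subreddit", "author", "body").show(5)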
Review the Data: ETL
• Distributed copy of the raw data from S3 into HDFS (hadoop distcp)
• Download a JSON SerDe so Hive can read the raw JSON
• Transform the data into Parquet using Hive
  ◦ Parquet is a columnar storage format that enables fast reads
Simple ETL process
Review the Data: ETL
Move data:
hadoop distcp s3n://{{ AWS_KEY }}:{{ AWS_SECRET }}@anaconda-public

Get JSON serializer:
wget http://s3.amazonaws.com/elasticmapreduce/samples/hive-ads/libs/jsonserde.jar
Review the Data: ETL
hive > ADD JAR jsonserde.jar;
hive > CREATE TABLE reddit_json (
archived boolean,
author string,
author_flair_css_class string,
author_flair_text string,
body string,
controversiality int,
created_utc string,
distinguished string,
downs int,
edited boolean,
gilded int,
id string,
link_id string,
name string,
parent_id string,
removal_reason string,
retrieved_on timestamp,
score int,
score_hidden boolean,
subreddit string,
subreddit_id string,
ups int
)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES (
  'paths'='archived,author,author_flair_css_class,author_flair_text,body,controversiality,created_utc,distinguished,downs,edited,gilded,id,link_id,name,parent_id,removal_reason,retrieved_on,score,score_hidden,subreddit,subreddit_id,ups'
);
hive > LOAD DATA INPATH '/user/centos/RC_*' INTO TABLE reddit_json;
Review the Data: ETL
hive > CREATE TABLE reddit_parquet (
archived boolean,
author string,
author_flair_css_class string,
author_flair_text string,
body string,
controversiality int,
created_utc string,
distinguished string,
downs int,
edited boolean,
gilded int,
id string,
link_id string,
name string,
parent_id string,
removal_reason string,
retrieved_on timestamp,
score int,
score_hidden boolean,
subreddit string,
subreddit_id string,
ups int,
created_utc_t timestamp
)
PARTITIONED BY (date_str string)
STORED AS PARQUET;
Review the Data: ETL
hive > set dfs.block.size=1g;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.optimize.sort.dynamic.partition=true;
hive > INSERT OVERWRITE TABLE reddit_parquet PARTITION (date_str)
       SELECT *,
              cast(cast(created_utc as double) as timestamp) as created_utc_t,
              date_format(cast(cast(created_utc as double) as timestamp), 'yyyy-MM') as date_str
       FROM reddit_json;
Analyze Data with Python and R
• sparklyr is an R API for Spark
• PySpark is the Python API for Spark (a short example follows below)
Using PySpark and sparklyr
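A minimal PySpark sketch of the kind of analysis shown in the demo, counting comments per subreddit from the reddit_parquet table built earlier; it assumes a SparkSession with Hive support, e.g. one created through Sparkmagic/Livy:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("reddit-analysis")
         .enableHiveSupport()
         .getOrCreate())

# Top 10 subreddits by number of comments
top_subreddits = (spark.table("reddit_parquet")
                  .groupBy("subreddit")
                  .agg(F.count("*").alias("comments"))
                  .orderBy(F.desc("comments"))
                  .limit(10))

top_subreddits.show()

The same aggregation can be expressed in R through sparklyr with dplyr verbs against the same table.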
Build an Application
• Impala is great for SQL queries on Hadoop (a query sketch follows below)
• With Anaconda Enterprise, you aren't limited to just Spark, Python, and R; you can use whichever tools you're familiar with
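A minimal sketch of querying the same reddit_parquet table from Python via Impala using the impyla package; the Impala daemon host below is a placeholder:

from impala.dbapi import connect

# Connect to an Impala daemon (host is a placeholder; 21050 is the default port)
conn = connect(host="impala-daemon-host", port=21050)
cur = conn.cursor()
cur.execute("""
    SELECT subreddit, COUNT(*) AS comments
    FROM reddit_parquet
    GROUP BY subreddit
    ORDER BY comments DESC
    LIMIT 10
""")
for subreddit, comments in cur.fetchall():
    print(subreddit, comments)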
Deploy Application
• Anaconda Enterprise 5 offers one-click deployments in Python or R
• Easily deploy notebooks, APIs, dashboards, and web applications (a minimal API sketch follows below)
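As an illustration only, here is a minimal, hypothetical Flask API of the kind that could be deployed this way; the endpoint and port are placeholders and this is not the demo application:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # Trivial endpoint; a real app would serve results from the Reddit data
    return jsonify(status="ok")

if __name__ == "__main__":
    # The port is a placeholder; the deployment platform supplies the real binding
    app.run(host="0.0.0.0", port=8086)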
DEMO