Transcript
Page 1: Data science for Big Data

Data Science for Big Data with Anaconda Enterprise

Let Anaconda Take Your Organization to the Next Level

Daniel Rodriguez, Data Scientist

Gus Cavanaugh, Product Marketing Manager

Page 2: Data science for Big Data

Data Scientist

Daniel Rodriguez

Daniel Rodriguez is a Data Scientist and Software Developer with over five years’ experience in areas ranging from DevOps to machine learning. He has performed data analysis and data engineering in big data environments across various industries.

Daniel holds a degree in Electrical Engineering from Universidad de los Andes, Colombia, and an MS in IT Management from UT Dallas. He is passionate about open source data technologies and has spoken at PyData and Spark Summit.


Page 3: Data science for Big Data

Product Marketing Manager

Gus Cavanaugh

Gus Cavanaugh is a Product Marketing Manager at Anaconda, where he focuses on translating technical capabilities into user benefits. He has over five years’ experience in analytics and consulting for enterprises. Prior to joining Anaconda, he worked on projects ranging from small-scale data apps and dashboards to distributed Hadoop clusters at companies including IBM and Booz Allen Hamilton.

Gus holds an MS in Systems Engineering from George Washington University and a BS in Business Administration from Washington & Lee University. He is a frequent speaker on analytics topics for non-technical audiences.


Page 4: Data science for Big Data


Agenda

• Install Anaconda Distribution on a cluster

• Review the data and ETL process

• Analyze data with:

• Spark: Python & R

• Impala

• One-click deploy an application with Anaconda Enterprise, using Python and R


Page 5: Data science for Big Data


Install Anaconda Distribution on a Cluster

• Two options:

• Build a custom Cloudera CDH Parcel or Ambari Management pack

• Create and ship an on-the-fly runtime distribution (a sketch follows below)


Python & R runtime on Hadoop
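For the on-the-fly option, one way to build and ship a runtime is the conda-pack library, which archives a conda environment so YARN can distribute it to the worker nodes. A minimal sketch is below; the environment name, archive name, and the spark-submit flags in the comment are illustrative assumptions, not taken from this deck.

import conda_pack

# Pack an existing conda environment (the name here is a hypothetical example)
# into a relocatable archive that can be shipped alongside a job.
conda_pack.pack(name="anaconda-runtime", output="anaconda-runtime.tar.gz")

# The archive can then be distributed with a Spark job, e.g. (illustrative):
#   spark-submit --master yarn \
#     --archives anaconda-runtime.tar.gz#environment \
#     --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
#     job.py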

Page 6: Data science for Big Data


CDH Parcel and Ambari Mgmt Pack Generation


Anaconda Enterprise offers a UI for building custom distributions

Page 7: Data science for Big Data


CDH Parcel and Ambari Mgmt Pack Generation

Add packages and versions to the distribution

Page 8: Data science for Big Data


Install Anaconda Parcel on a CDH Cluster

Add Anaconda parcel to CDH via Cloudera Manager


https://docs.anaconda.com/anaconda/user-guide/tasks/integration/cloudera

Page 9: Data science for Big Data


Connect Spark to Anaconda Enterprise

• Install Livy on an edge node

• Start the Livy server (a quick REST smoke test follows below)

Connect Notebooks to Spark via Apache Livy & Sparkmagic
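Sparkmagic talks to Livy over its REST API, so a quick way to verify the server is up is to create a session and run a statement with plain HTTP. A minimal sketch, assuming Livy runs on an edge node at its default port 8998 (the hostname is a placeholder):

import json
import time
import requests

LIVY = "http://edge-node:8998"  # placeholder edge-node address, default Livy port
headers = {"Content-Type": "application/json"}

# Create a PySpark session; Livy returns its URL in the Location header.
resp = requests.post(LIVY + "/sessions", headers=headers,
                     data=json.dumps({"kind": "pyspark"}))
session_url = LIVY + resp.headers["Location"]

# Poll until the session is idle, then submit a trivial statement.
while requests.get(session_url, headers=headers).json()["state"] != "idle":
    time.sleep(5)
resp = requests.post(session_url + "/statements", headers=headers,
                     data=json.dumps({"code": "1 + 1"}))
print(requests.get(LIVY + resp.headers["Location"], headers=headers).json())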


Page 10: Data science for Big Data


Connect Spark to Anaconda Enterprise

• Add the Livy server to the Sparkmagic config in your project (a config sketch follows below)

• Start doing your analysis using Spark inside the notebooks

Connect Notebooks to Spark via Apache Livy & Sparkmagic
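A minimal sketch of wiring the Livy server into Sparkmagic's configuration, written as Python for convenience. The Livy URL is a placeholder, and the credential keys follow Sparkmagic's example_config.json; adapt both to your project.

import json
import os

LIVY_URL = "http://edge-node:8998"  # placeholder; use your Livy server address

# Point both the Python and R Sparkmagic kernels at the Livy server.
config = {
    "kernel_python_credentials": {"username": "", "password": "", "url": LIVY_URL},
    "kernel_r_credentials": {"username": "", "password": "", "url": LIVY_URL},
}

path = os.path.expanduser("~/.sparkmagic/config.json")
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    json.dump(config, f, indent=2)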

Page 11: Data science for Big Data


Review the Data

3 Billion Reddit Comments (2007-2017)

• Format: line-delimited JSON

• Source: s3://anaconda-public-datasets/reddit/json

• We transferred the data to S3

• Using our Hadoop cluster, we can load the data from S3

Page 12: Data science for Big Data


Review the Data: ETL

Simple ETL process

• Distributed copy with Hadoop (distcp)

• Download a JSON serializer (SerDe) for Hive

• Transform the data into Parquet using Hive

◦ Parquet is a columnar data store that enables fast reads

Page 13: Data science for Big Data


Review the Data: ETL


Move data:

hadoop distcp s3n://{{ AWS_KEY }}:{{ AWS_SECRET }}@anaconda-public

Get the JSON serializer:

wget http://s3.amazonaws.com/elasticmapreduce/samples/hive-ads/libs/jsonserde.jar

Page 14: Data science for Big Data


Review the Data: ETL


hive > ADD JAR jsonserde.jar;

hive > CREATE TABLE reddit_json (
    archived boolean,
    author string,
    author_flair_css_class string,
    author_flair_text string,
    body string,
    controversiality int,
    created_utc string,
    distinguished string,
    downs int,
    edited boolean,
    gilded int,
    id string,
    link_id string,
    name string,
    parent_id string,
    removal_reason string,
    retrieved_on timestamp,
    score int,
    score_hidden boolean,
    subreddit string,
    subreddit_id string,
    ups int
)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES ('paths'='archived,author,author_flair_css_class,author_flair_text,body,controversiality,created_utc,distinguished,downs,edited,gilded,id,link_id,name,parent_id,removal_reason,retrieved_on,score,score_hidden,subreddit,subreddit_id,ups');

hive > LOAD DATA INPATH '/user/centos/RC_*' INTO TABLE reddit_json;

Page 15: Data science for Big Data


Review the Data: ETL


hive > CREATE TABLE reddit_parquet (
    archived boolean,
    author string,
    author_flair_css_class string,
    author_flair_text string,
    body string,
    controversiality int,
    created_utc string,
    distinguished string,
    downs int,
    edited boolean,
    gilded int,
    id string,
    link_id string,
    name string,
    parent_id string,
    removal_reason string,
    retrieved_on timestamp,
    score int,
    score_hidden boolean,
    subreddit string,
    subreddit_id string,
    ups int,
    created_utc_t timestamp
)
PARTITIONED BY (date_str string)
STORED AS PARQUET;

Page 16: Data science for Big Data


Review the Data: ETL


hive > set dfs.block.size=1g;
hive > set hive.exec.dynamic.partition=true;
hive > set hive.exec.dynamic.partition.mode=nonstrict;
hive > set hive.exec.max.dynamic.partitions=1000;
hive > set hive.exec.max.dynamic.partitions.pernode=1000;
hive > set hive.optimize.sort.dynamic.partition=true;

hive > INSERT OVERWRITE TABLE reddit_parquet PARTITION (date_str)
SELECT *,
       cast(cast(created_utc as double) as timestamp) as created_utc_t,
       date_format(cast(cast(created_utc as double) as timestamp), 'yyyy-MM') as date_str
FROM reddit_json;

Page 17: Data science for Big Data


Analyze Data with Python and R

• sparklyr is one of the R APIs for Spark

• PySpark is the Python API for Spark

Using PySpark and sparklyr
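As a sketch of what the notebook analysis can look like, the PySpark snippet below counts comments per subreddit in the reddit_parquet table built during ETL. It assumes a SparkSession with Hive support (in a Sparkmagic notebook, the session is usually provided for you as spark); the aggregation itself is an illustrative example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive support makes the reddit_parquet table from the ETL step visible.
spark = (SparkSession.builder
         .appName("reddit-analysis")
         .enableHiveSupport()
         .getOrCreate())

comments = spark.table("reddit_parquet")

# Top 10 subreddits by comment volume.
(comments
 .groupBy("subreddit")
 .agg(F.count("*").alias("num_comments"))
 .orderBy(F.desc("num_comments"))
 .show(10))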

Page 18: Data science for Big Data


Build an Application

• Impala is great for SQL queries on Hadoop (a query sketch from Python follows below)

• With Anaconda Enterprise, you aren’t limited to just Spark, Python and R. You can use whichever tools you are familiar with
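A minimal sketch of querying Impala from Python with the impyla library; the hostname is a placeholder, 21050 is Impala's default client port, and the monthly-count query is an illustrative example against the table built earlier.

from impala.dbapi import connect

# Placeholder impalad address; 21050 is the default client port.
conn = connect(host="impala-host", port=21050)
cur = conn.cursor()

# Comment volume per month, using the date_str partition column.
cur.execute("""
    SELECT date_str, COUNT(*) AS num_comments
    FROM reddit_parquet
    GROUP BY date_str
    ORDER BY date_str
""")
for date_str, num_comments in cur.fetchall():
    print(date_str, num_comments)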


Page 19: Data science for Big Data


Deploy Application

• Anaconda Enterprise 5 offers one-click deployments in Python or R

• Easily deploy notebooks, APIs, dashboards, and web applications


Page 20: Data science for Big Data


DEMO

