Data Science for Big Data

Post on 21-Jan-2018

TRANSCRIPT

1. Data Science for Big Data with Anaconda Enterprise
   Let Anaconda Take Your Organization to the Next Level
   Daniel Rodriguez, Data Scientist
   Gus Cavanaugh, Product Marketing Manager

2. Data Scientist: Daniel Rodriguez
   Daniel Rodriguez is a data scientist and software developer with over five years' experience in areas ranging from DevOps to machine learning. He has performed data analysis and data engineering in big data environments across various industries. Daniel holds a degree in Electrical Engineering from Universidad de los Andes, Colombia, and an MS in IT Management from UT Dallas. He is passionate about open source data technologies and has spoken at PyData and Spark Summit.
   2017 Anaconda, Inc. - Confidential & Proprietary

3. Product Marketing Manager: Gus Cavanaugh
   Gus Cavanaugh is a Product Marketing Manager at Anaconda, where he focuses on translating technical capabilities into user benefits. He has over five years' experience in analytics and consulting for enterprises. Prior to joining Anaconda, he worked on projects ranging from small-scale data apps and dashboards to distributed Hadoop clusters at companies including IBM and Booz Allen Hamilton. Gus holds an MS in Systems Engineering from George Washington University and a BS in Business Administration from Washington & Lee University. He is a frequent speaker on analytics topics for non-technical audiences.

4. Agenda
   - Install Anaconda Distribution on a cluster
   - Review the data and ETL process
   - Analyze data with Spark (Python & R) and Impala
   - One-click deploy an application with Anaconda Enterprise, in Python and R

5. Install Anaconda Distribution on a Cluster
   Python & R runtime on Hadoop. Two options:
   - Build a custom Cloudera CDH parcel or Ambari management pack
   - Create and ship an on-the-fly runtime distribution

6. CDH Parcel and Ambari Management Pack Generation
   Anaconda Enterprise offers a UI for building custom distributions.

7. CDH Parcel and Ambari Management Pack Generation (continued)
   Add packages and versions to the distribution.

8. Install the Anaconda Parcel on a CDH Cluster
   Add the Anaconda parcel to CDH via Cloudera Manager.
   https://docs.anaconda.com/anaconda/user-guide/tasks/integration/cloudera

9. Connect Spark to Anaconda Enterprise
   Connect notebooks to Spark via Apache Livy & Sparkmagic:
   - Install Livy on an edge node
   - Start the Livy server

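Once Livy is running, Sparkmagic finds it through its config file. A minimal sketch of ~/.sparkmagic/config.json, where the Livy URL is a placeholder for your own edge node (the deck does not specify a host):

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-edge-node:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-edge-node:8998"
  }
}
```

With this in place, notebook kernels using Sparkmagic route their Spark sessions through the Livy server rather than needing Spark installed locally.
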
10. Connect Spark to Anaconda Enterprise (continued)
    - Add the Livy server to the Sparkmagic config in your project
    - Start doing your analysis using Spark inside the notebooks

11. Review the Data
    3 billion Reddit comments (2007-2017)
    Format: line-delimited JSON
    We transferred the data to S3; using our Hadoop cluster, we can load the data from S3.
    Source: s3://anaconda-public-datasets/reddit/json

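In line-delimited JSON, each line is one complete comment record. A quick sketch of how a single line parses; the sample record below is fabricated for illustration, with field names taken from the table schema shown later:

```python
import json

# One line of the line-delimited JSON file: a single Reddit comment record.
# This sample is invented; real records carry the full 22-field schema.
line = ('{"author": "someuser", "body": "Nice plot!", '
        '"subreddit": "dataisbeautiful", "score": 42, '
        '"created_utc": "1485960000"}')

comment = json.loads(line)
print(comment["subreddit"], comment["score"])  # → dataisbeautiful 42
```

Because every line stands alone, the file splits cleanly across HDFS blocks, which is what lets Hive and Spark parse it in parallel.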
12. Review the Data: ETL
    A simple ETL process:
    - Distributed copy from Hadoop
    - Download a JSON serializer (SerDe) for Hive
    - Transform the data into Parquet using Hive
    Parquet is a columnar data store that enables fast reads.

13. Review the Data: ETL
    Move data:
      hadoop distcp s3n://{{ AWS_KEY }}:{{ AWS_SECRET }}@anaconda-publi
    Get the JSON serializer:
      wget http://s3.amazonaws.com/elasticmapreduce/samples/hive-ads/libs/jsonserde.jar

14. Review the Data: ETL
    hive> ADD JAR jsonserde.jar;
    hive> CREATE TABLE reddit_json (
            archived boolean,
            author string,
            author_flair_css_class string,
            author_flair_text string,
            body string,
            controversiality int,
            created_utc string,
            distinguished string,
            downs int,
            edited boolean,
            gilded int,
            id string,
            link_id string,
            name string,
            parent_id string,
            removal_reason string,
            retrieved_on timestamp,
            score int,
            score_hidden boolean,
            subreddit string,
            subreddit_id string,
            ups int
          )
          ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
          WITH SERDEPROPERTIES ('paths'='archived,author,author_flair_css_class,author_flair_text,body,controversiality,created_utc,distinguished,downs,edited,gilded,id,link_id,name,parent_id,removal_reason,retrieved_on,score,score_hidden,subreddit,subreddit_id,ups');
    hive> LOAD DATA INPATH '/user/centos/RC_*' INTO TABLE reddit_json;

15. Review the Data: ETL
    hive> CREATE TABLE reddit_parquet (
            archived boolean,
            author string,
            author_flair_css_class string,
            author_flair_text string,
            body string,
            controversiality int,
            created_utc string,
            distinguished string,
            downs int,
            edited boolean,
            gilded int,
            id string,
            link_id string,
            name string,
            parent_id string,
            removal_reason string,
            retrieved_on timestamp,
            score int,
            score_hidden boolean,
            subreddit string,
            subreddit_id string,
            ups int,
            created_utc_t timestamp
          )
          PARTITIONED BY (date_str string)
          STORED AS PARQUET;

16. Review the Data: ETL
    hive> set dfs.block.size=1g;
          set hive.exec.dynamic.partition=true;
          set hive.exec.dynamic.partition.mode=nonstrict;
          set hive.exec.max.dynamic.partitions=1000;
          set hive.exec.max.dynamic.partitions.pernode=1000;
          set hive.optimize.sort.dynamic.partition=true;
    hive> INSERT OVERWRITE TABLE reddit_parquet PARTITION (date_str)
          SELECT *,
                 cast(cast(created_utc as double) as timestamp) AS created_utc_t,
                 date_format(cast(cast(created_utc as double) as timestamp), 'yyyy-MM') AS date_str
          FROM reddit_json;

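The partition key is derived by casting the epoch-seconds string created_utc to a timestamp and formatting it as yyyy-MM, so each partition holds one month of comments. The same transformation in plain Python, as a sketch (assuming UTC, which Hive's behavior may differ from depending on session timezone):

```python
from datetime import datetime, timezone

def partition_key(created_utc: str) -> str:
    """Mirror Hive's date_format(cast(cast(created_utc as double) as timestamp), 'yyyy-MM')."""
    ts = datetime.fromtimestamp(float(created_utc), tz=timezone.utc)
    return ts.strftime("%Y-%m")

print(partition_key("1487424000"))  # → 2017-02
```

Partitioning by month means queries filtered on date_str only read the relevant Parquet files, which matters at 3 billion rows.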
17. Analyze Data with Python and R
    Using PySpark and sparklyr:
    - sparklyr is an R API for Spark
    - PySpark is the Python API for Spark

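A hedged PySpark sketch of the kind of query run against the Parquet table, not the deck's own code: the table path is an assumed HDFS location, and the columns follow the schema above. The import is kept inside the function so the sketch loads even without a Spark install:

```python
def top_subreddits(parquet_path="/user/hive/warehouse/reddit_parquet", n=10):
    """Sketch: load the Reddit Parquet data and rank subreddits by comment count."""
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("reddit-analysis").getOrCreate()
    comments = spark.read.parquet(parquet_path)
    return (comments
            .groupBy("subreddit")
            .agg(F.count("*").alias("n_comments"))
            .orderBy(F.desc("n_comments"))
            .limit(n))
```

The sparklyr side looks analogous: spark_read_parquet() followed by dplyr verbs (group_by, count, arrange), which sparklyr translates to the same Spark SQL plan.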
18. Build an Application
    Impala is great for SQL queries on Hadoop. With Anaconda Enterprise, you aren't limited to just Spark, Python, and R; you can use whichever tools you are familiar with.

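One way to reach Impala from a Python application is the impyla client; the deck does not name a client, so this is an assumption, and the hostname is a placeholder. A sketch querying the monthly partitions built above:

```python
def comments_per_month(host="impala-daemon-host", port=21050):
    """Sketch: count comments per monthly partition via Impala (impyla client)."""
    from impala.dbapi import connect  # impyla package; imported lazily

    conn = connect(host=host, port=port)
    cur = conn.cursor()
    cur.execute(
        "SELECT date_str, COUNT(*) AS n "
        "FROM reddit_parquet GROUP BY date_str ORDER BY date_str"
    )
    return cur.fetchall()
```

Because Impala reads the same Hive metastore tables, no extra ETL is needed to serve these queries from a dashboard or API.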
19. Deploy the Application
    Anaconda Enterprise 5 offers one-click deployments in Python or R. Easily deploy notebooks, APIs, dashboards, and web applications.

20. DEMO
