big data and analytics on aws

49
AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Big Data and Analytics on AWS KD Singh Solutions Architect Amazon Web Services ©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Upload: amazon-web-services

Post on 14-Aug-2015

313 views

Category:

Technology


8 download

TRANSCRIPT

  1. 1. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Big Data and Analytics on AWS KD Singh Solutions Architect Amazon Web Services 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. 2. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 What is big data? When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze, and share it Velocity Rate of data flow in Latency High or Low Volume High or Low Variety Diversity of source data Item Size KB or MB Request Rate Access patterns Change Rate How much is the data changing? Processing Requirements How much computation? Durability Preservation of source data? Availability Tolerance for downtime? Growth Rate Rate of data growth? Views The diversity of consumers?
  3. 3. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Plethora of tools Amazon Glacier Amazon S3 Amazon DynamoDB Amazon RDS Amazon EMR Amazon Redshift AWS Data Pipeline Amazon Kinesis Cassandra Amazon CloudSearch
  4. 4. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Ingest Store Analyze Visualize Data Answers Time Multiple stages Storage decoupled from processing Simplify data analytics flow
  5. 5. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Amazon GlacierS3 DynamoDB RDS Amazon Kinesis Spark Streaming EMR Ingest Store Process/Analyze Visualize Data Pipeline Storm Kafka Amazon Redshift Cassandra Amazon CloudSearch Amazon Kinesis Connector Kinesis enabled app App Server Web Server Devices
  6. 6. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Collect / Ingest Amazon Kinesis Process / Analyze Amazon EMR Amazon EC2 Amazon Redshift AWS Data Pipeline Visualize / ReportStore Amazon Glacier Amazon S3 Amazon DynamoDB Amazon RDS AWS Import/Export AWS Direct Connect Amazon SQS AWS big data portfolio
  7. 7. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Mobile / Cable Telecom Oil and Gas Industrial Manufacturing Retail/Consumer Entertainment Hospitality Life Sciences Scientific Exploration Financial Services Publishing Media Advertising Online Media Social Network Gaming Industries using AWS for data analysis
  8. 8. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Ingest: The act of collecting and storing data
  9. 9. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Types of data ingest Transactional Database reads/writes File Media files; log files Stream Click-stream logs (sets of events) Database Cloud Storage Stream Storage LoggingFrameworksDevicesApps
  10. 10. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Real-time processing of streaming data High throughput Elastic Easy to use Connectors for EMR, S3, Amazon Redshift, DynamoDB Amazon Kinesis
  11. 11. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Sending and reading data from Amazon Kinesis streams HTTP Post AWS SDK LOG4J Flume Fluentd Get* APIs Kinesis Client Library + Connector Library Apache Storm Amazon Elastic MapReduce Sending Reading Write Read
  12. 12. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Hparser, Big Data Edition Flume, Sqoop AWS Partners for data ingest, load, and transformation
  13. 13. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Storage
  14. 14. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 App/Web Tier Client Tier Database & Storage Tier Cloud database and storage tier anti-pattern
  15. 15. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 App/Web Tier Client Tier Data TierDatabase & Storage Tier Search Hadoop/HDFS Cache Blob Store SQL NoSQL Cloud database and storage tier use the right tool for the job!
  16. 16. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Database & Storage Tier Amazon RDSAmazon DynamoDB Amazon ElastiCache Amazon S3 Amazon Glacier Amazon CloudSearch HDFS on Amazon EMR Cloud database and storage tier use the right tool for the job!
  17. 17. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Store anything Object storage Scalable Designed for 99.999999999% durability Amazon S3
  18. 18. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Aggregate all data in S3 surrounded by a collection of the right tools Amazon EMR Amazon Kinesis Amazon Redshift Amazon DynamoDB Amazon RDS AWS Data Pipeline Spark Streaming Cassandra Storm Amazon S3 No limit on the number of objects Object size up to 5 TB Central data storage for all systems High bandwidth 99.999999999% durability Versioning; lifecycle policies Amazon Glacier integration
  19. 19. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Fully managed NoSQL database service Built on solid-state drives (SSDs) Consistent low-latency performance Any throughput rate No storage limits Amazon DynamoDB
  20. 20. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Scaling without downtime Automatic sharding Security inspections, patches, upgrades Automatic hardware failover Multi-AZ replication Hardware configuration designed specifically for DynamoDB Performance tuning DynamoDB: managed high availability and durability
  21. 21. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Relational databases Fully managed; zero admin MySQL, PostgreSQL, Oracle, SQL Server Aurora Amazon RDS
  22. 22. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Process and analyze
  23. 23. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Processing frameworks Batch processing Take large amount (>100 TB) of cold data and ask questions Takes minutes or hours to get answers back Example: Generating hourly, daily, weekly reports Stream processing (real-time) Take small amount of hot data and ask questions Takes short amount of time to get your answer back Example: 1 min metrics
  24. 24. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Processing frameworks Batch processing/analytic Amazon Redshift Amazon EMR (Hadoop) Spark, Hive/Tez, Pig, Impala, Presto, . Stream processing Amazon Kinesis client and connector library Spark Streaming Storm (+Trident) MPPMPPHadoopStreamProcessing
  25. 25. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Columnar data warehouse ANSI SQL compatible Massively parallel Petabyte scale Fully managed Very cost-effective Amazon Redshift
  26. 26. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Amazon Redshift architecture Leader Node SQL endpoint Stores metadata Coordinates query execution Compute Nodes Local, columnar storage Execute queries in parallel Load, backup, restore via Amazon S3 Parallel load from Amazon DynamoDB Hardware optimized for data processing Two hardware platforms DS2 (dense storage): HDD; scale to 1.6PB DC1 (dense compute): SSD; scale to 256TB 10 GigE (HPC) Ingestion Backup Restore JDBC/ODBC
  27. 27. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Hadoop/HDFS clusters Hive, Pig, Impala, HBase Easy to use; fully managed On-demand and spot pricing Tight integration with S3, DynamoDB, and Amazon Kinesis Amazon Elastic MapReduce
  28. 28. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 EMR Cluster S3 1. Put the data into S3 2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase 4. Get the output from S3 3. Launch the cluster using the EMR console, CLI, SDK, or APIs How does EMR work?
  29. 29. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 The Hadoop ecosystem works with EMR
  30. 30. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Partners advanced analytics
  31. 31. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Visualize
  32. 32. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 AWS Partners for BI & data visualization
  33. 33. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Putting it all together
  34. 34. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015
  35. 35. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Amazon EMR as ETL Grid and Analysis Amazon Redshift Production DWH VisualizationLogs Traffic Statistics Demo
  36. 36. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Demo
  37. 37. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 ICAO and Hadoop Marco Merens Chief (Acting) Integrated Analysis International Civil Aviation Organization
  38. 38. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 ICAO in the cloud
  39. 39. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Cloudability principles at ICAO 1. What comes from the cloud, can stay in the cloud 2. What comes from in-house A. should stay in-house if private, or B. can be synced with the cloud if public
  40. 40. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Data sync Data Basic UI Create Read Update Delete sync Data FancyUI Read Metrics
  41. 41. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Collect Map Reduce Publish Key Priority EMR example: blended accident list
  42. 42. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Input format XML AscendS1982045MU-2, Collision with high ground, (near) Kelowna . CSV |26/12/2001|Germany|Germany|"ICE:Icing"|Accident|Fatal|8|Germany|Bremerhav en|D-IAAI|"BRITTEN NORMAN"||||"2 251 to 5 700 Kg"|Scheduled|Airplane|Take- off||
  43. 43. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 #!/bin/sh wget "http://somexml" -qO- | tr -d "n" | tr -d "r" | sed "s##n#g" > tmp aws s3 put tmp s3://accidents/input/source1 . Amazon S3 Amazon EC2 Use linux crontab to schedule Make One XML element per line for EMR Collect
  44. 44. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 EMR command line elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/samples/node/install-node-bin- x86.sh --instance-type m1.small --instance-count 3 --json job.json --put /home/ec2-user/key/newtest.pem --to /home/hadoop --enable-debugging Put ssh key to hadoop if you need to remote sh
  45. 45. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 EMR json config file [{ "Name": "Make accident map", "ActionOnFailure": "CANCEL_AND_WAIT", "HadoopJarStep": { "Jar":"/home/hadoop/contrib/streaming/hadoop-streaming.jar", "Args": [ "-input", "s3://accidentstats/input/*", ]},{ "Name": "Store in mongo", "ActionOnFailure": "CANCEL_AND_WAIT", "HadoopJarStep": { "Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar" , "Args": [ "s3://edmscripts/uploadtomongo.sh", "accidentstats/output", "NEWACCIDENTLIST" ]}} Move the results from S3 to somewhere else
  46. 46. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Map sourceX #!/usr/bin/env node function treatline(line) { If (line.indexOf()) { source1(line) } Function source1(line) { Var data=xml2json(line) data.records.forEach(function(v){ Var el={ Date:v.Date, Registration:v.Registration, Model:v.Model, Source:Source1, Priority:1 } var key=el.Date+#+el.Registration process.stdout.write(key+/t+JSON.stringify(el) ) }) } mapped Amazon Elastic MapReduce Amazon S3
  47. 47. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Reduce Mapped and sorted #!/usr/bin/env node Var oldkey,key,array=[] function treatline(line) { key=line.split(/t)[0] data=JSON.parse(line.split(/t)[1]) If ((key==oldkey) || !oldkey) { array.push(data)} Else { treat(array) array=[]} oldkey=key } Function treat(array) { el={} array=array.sort(prioritysort) array.forEach(function(v){ el=updateresult(el,v) }) process.stdout.write(JSON.stringify(el)+n) } Reduced Amazon Elastic MapReduce Amazon S3
  48. 48. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Real-time statistics Amazon Elastic MapReduce
  49. 49. AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015 Thank You. This presentation will be loaded to SlideShare the week following the Symposium. http://www.slideshare.net/AmazonWebServices AWS Government, Education, and Nonprofit Symposium Washington, DC I June 25-26, 2015