![Page 1: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/1.jpg)
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rahul Bhartia, Ecosystem Solution Architect
15th September 2015
Building your first Big Data application on AWS
![Page 2: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/2.jpg)
Amazon S3
Amazon Kinesis
Amazon DynamoDB
Amazon RDS (Aurora)
AWS Lambda
KCL Apps
Amazon EMR
Amazon Redshift
Amazon Machine Learning
Collect Process Analyze Store
Data Collection and Storage
Data Processing
Event Processing
Data Analysis
Data Answers
Big Data ecosystem on AWS
![Page 3: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/3.jpg)
Your first Big Data application on AWS
?
![Page 4: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/4.jpg)
Big Data ecosystem on AWS - Collect
Collect Process Analyze Store
Data Answers
![Page 5: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/5.jpg)
Big Data ecosystem on AWS - Process
Collect Process Analyze Store
Data Answers
![Page 6: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/6.jpg)
Big Data ecosystem on AWS - Analyze
Collect Process Analyze Store
Data Answers
SQL
![Page 7: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/7.jpg)
Setup
![Page 8: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/8.jpg)
Resources
1. AWS Command Line Interface (aws-cli) configured
2. Amazon Kinesis stream with a single shard
3. Amazon S3 bucket to hold the files
4. Amazon EMR cluster (two nodes) with Spark and Hive
5. Amazon Redshift data warehouse cluster (single node)
![Page 9: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/9.jpg)
Amazon Kinesis
Create an Amazon Kinesis stream to hold incoming data:
aws kinesis create-stream --stream-name AccessLogStream --shard-count 1
![Page 10: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/10.jpg)
Amazon S3
![Page 11: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/11.jpg)
Amazon EMR
![Page 12: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/12.jpg)
Amazon Redshift
![Page 13: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/13.jpg)
Your first Big Data application on AWS
1. COLLECT: Stream data into Kinesis with Log4J
2. PROCESS: Process data with EMR using Spark & Hive
3. ANALYZE: Analyze data in Redshift using SQL
STORE
SQL
![Page 14: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/14.jpg)
1. Collect
![Page 15: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/15.jpg)
Amazon Kinesis Log4J Appender
![Page 16: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/16.jpg)
![Page 17: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/17.jpg)
Log file format
![Page 18: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/18.jpg)
Spark
•Fast and general engine for large-scale data processing
•Write applications quickly in Java, Scala or Python
•Combine SQL, streaming, and complex analytics.
![Page 19: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/19.jpg)
Using Spark on EMR
![Page 20: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/20.jpg)
Amazon Kinesis and Spark Streaming
Producer
Amazon Kinesis
Amazon S3
DynamoDB
KCL
Spark-Streaming uses KCL for Kinesis
Amazon EMR
Spark-Streaming application to read from Kinesis and write to S3
![Page 21: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/21.jpg)
Spark-streaming - Reading from Kinesis
![Page 22: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/22.jpg)
Spark-streaming – Writing to S3
![Page 23: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/23.jpg)
![Page 24: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/24.jpg)
View the output files in Amazon S3
![Page 25: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/25.jpg)
![Page 26: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/26.jpg)
2. Process
![Page 27: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/27.jpg)
Amazon EMR’s Hive
Adapts a SQL-like (HiveQL) query to run on Hadoop
Schema on read: map table to the input data
Access data in Amazon S3, Amazon DynamoDB, and Amazon Kinesis
Query complex input formats using SerDe
Transform data with User Defined Functions (UDF)
![Page 28: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/28.jpg)
Using Hive on Amazon EMR
![Page 29: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/29.jpg)
Create a table that points to your Amazon S3 bucket
CREATE EXTERNAL TABLE access_log_raw(
  host STRING, identity STRING, user STRING,
  request_time STRING, request STRING, status STRING,
  size STRING, referrer STRING, agent STRING)
PARTITIONED BY (year INT, month INT, day INT, hour INT, min INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?")
LOCATION 's3://YOUR-S3-BUCKET/access-log-raw';
msck repair table access_log_raw;
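To see what the `input.regex` above does to a single log line, the sketch below applies the same pattern with Python's `re` module. The sample line and the helper name `parse_line` are illustrative assumptions, not part of the webinar; the column names come from the table definition. `fullmatch` is used on the assumption that the whole line has to match the pattern.

```python
import re

# Same pattern as the Hive "input.regex", with the HiveQL string escaping removed.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# Column names in the order they appear in access_log_raw.
COLUMNS = ["host", "identity", "user", "request_time", "request",
           "status", "size", "referrer", "agent"]


def parse_line(line):
    """Return a dict of column -> captured text, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical log line in the usual Apache combined format.
SAMPLE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/5.0"')

fields = parse_line(SAMPLE)
for name in COLUMNS:
    print(f"{name:>12}: {fields[name]}")
```

Every capture group is kept as text, quotes and brackets included, which is why all nine columns in the table are declared as STRING.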
![Page 30: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/30.jpg)
![Page 31: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/31.jpg)
Process data using Hive
We will transform the data returned by the query before writing it to our external Hive table stored in Amazon S3.
Hive User Defined Functions (UDFs) used for the text transformations: from_unixtime, unix_timestamp, and hour
The “hour” value is important: it is used to split and organize the output files before they are written to Amazon S3. These splits let us load the data into Amazon Redshift more efficiently later in the lab, using the parallel “COPY” command.
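The hour-based split can be pictured as grouping rows by their hour value and writing each group under its own prefix. The sketch below is illustrative only: the rows, the function name `group_by_hour`, and the `hour=N/` prefix layout (Hive's usual partition directory naming) are assumptions, not the lab's actual output.

```python
from collections import defaultdict


def group_by_hour(rows):
    """Group (hour, line) pairs under Hive-style partition prefixes such as 'hour=13/'."""
    partitions = defaultdict(list)
    for hour, line in rows:
        partitions[f"hour={hour}/"].append(line)
    return dict(partitions)


# Hypothetical, already-parsed rows: (hour of the request, rest of the record).
rows = [
    (13, "GET /index.html 200"),
    (13, "GET /favicon.ico 404"),
    (14, "GET /about.html 200"),
]

partitions = group_by_hour(rows)
for prefix in sorted(partitions):
    print(prefix, len(partitions[prefix]), "row(s)")
```

Because each hour ends up in separate files, a later bulk load has several independent inputs to read at once.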
![Page 32: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/32.jpg)
Create an external Hive table in Amazon S3
![Page 33: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/33.jpg)
Configure partition and compression
![Page 34: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/34.jpg)
Query Hive and write output to Amazon S3
-- convert the Apache log timestamp to a UNIX timestamp
-- split files in Amazon S3 by the hour in the log lines
INSERT OVERWRITE TABLE access_log_processed PARTITION (hour)
SELECT
  from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')),
  host,
  request,
  status,
  referrer,
  agent,
  hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) as hour
FROM access_log_raw;
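The timestamp handling in the query above can be tried out locally. The following is a rough Python equivalent, not the Hive implementation: it assumes the result should be expressed in UTC (Hive formats `from_unixtime` output in the cluster's time zone), that month abbreviations are parsed under an English locale, and the function name `convert_request_time` is made up for this sketch.

```python
from datetime import datetime, timezone

# Python counterpart of the Hive pattern '[dd/MMM/yyyy:HH:mm:ss Z]'.
LOG_TIME_FORMAT = "[%d/%b/%Y:%H:%M:%S %z]"


def convert_request_time(request_time):
    """Return (timestamp string, hour) for an Apache log timestamp, expressed in UTC."""
    parsed = datetime.strptime(request_time, LOG_TIME_FORMAT)
    in_utc = parsed.astimezone(timezone.utc)
    return in_utc.strftime("%Y-%m-%d %H:%M:%S"), in_utc.hour


# Hypothetical request_time value as captured by the raw table.
converted, hour = convert_request_time("[10/Oct/2000:13:55:36 -0700]")
print(converted, hour)
```

The second value plays the role of the `hour` partition column, so the time zone used for the conversion decides which partition a row lands in.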
![Page 35: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/35.jpg)
![Page 37: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/37.jpg)
View the output files in Amazon S3
![Page 38: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/38.jpg)
![Page 39: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/39.jpg)
Spark SQL
Spark's module for working with structured data using SQL
Run unmodified Hive queries on existing data.
![Page 40: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/40.jpg)
Using Spark-SQL on Amazon EMR
![Page 41: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/41.jpg)
Query the data with Spark
![Page 42: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/42.jpg)
![Page 43: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/43.jpg)
3. Analyze
![Page 44: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/44.jpg)
Connect to Amazon Redshift
![Page 45: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/45.jpg)
Create an Amazon Redshift table to hold your data
![Page 46: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/46.jpg)
Loading data into Amazon Redshift
“COPY” command loads files in parallel
COPY accesslogs
FROM 's3://YOUR-S3-BUCKET/access-log-processed'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS_KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t' IGNOREHEADER 0 MAXERROR 0 GZIP;
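The `DELIMITER '\t'` and `GZIP` options tell COPY to expect gzip-compressed, tab-separated text. A minimal sketch of producing and re-reading such a file locally follows. The file name, the sample row, and the helper names are assumptions made for illustration; in the lab the real files are written by Hive in the Process step. The column order follows the SELECT list of the Hive query.

```python
import csv
import gzip
import os
import tempfile


def write_tsv_gz(path, rows):
    """Write rows as gzip-compressed, tab-delimited text."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


def read_tsv_gz(path):
    """Read the rows back to confirm the layout."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t"))


# Hypothetical processed row: request time, host, request, status, referrer, agent.
rows = [["2000-10-10 20:55:36", "127.0.0.1", "GET /apache_pb.gif HTTP/1.0",
         "200", "-", "Mozilla/5.0"]]

path = os.path.join(tempfile.mkdtemp(), "sample.tsv.gz")
write_tsv_gz(path, rows)
round_trip = read_tsv_gz(path)
print(round_trip)
```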
![Page 47: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/47.jpg)
![Page 48: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/48.jpg)
![Page 49: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/49.jpg)
Amazon Redshift test queries
![Page 50: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/50.jpg)
Your first Big Data application on AWS
A favicon would fix 398 of the total 977 PAGE NOT FOUND (404) errors
![Page 51: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/51.jpg)
Try it yourself on the AWS Cloud…
…around the same cost as a cup of coffee

| Service | Est. Cost* |
|---|---|
| Amazon Kinesis | $1.00 |
| Amazon S3 (free tier) | $0 |
| Amazon EMR | $0.44 |
| Amazon Redshift | $1.00 |
| Est. Total | $2.44 |

*Estimated costs assume: use of free tier where available, lower cost instances, dataset no bigger than 10MB and instances running for less than 4 hours. Costs may vary depending on options selected, size of dataset, and usage.
$3.50
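As a quick check of the estimate, the per-service figures from the slide can be summed directly. This is only the arithmetic on the listed numbers, using `Decimal` to avoid floating-point rounding; it is not a pricing calculation, and the $3.50 figure is presumably the cup of coffee used for comparison.

```python
from decimal import Decimal

# Per-service estimates exactly as listed on the slide.
ESTIMATED_COSTS = {
    "Amazon Kinesis": Decimal("1.00"),
    "Amazon S3 (free tier)": Decimal("0"),
    "Amazon EMR": Decimal("0.44"),
    "Amazon Redshift": Decimal("1.00"),
}

total = sum(ESTIMATED_COSTS.values(), Decimal("0"))
print(f"Est. total: ${total}")
```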
![Page 52: AWS September Webinar Series - Building Your First Big Data Application on AWS](https://reader035.vdocument.in/reader035/viewer/2022070518/58e57e6b1a28abbf5d8b556d/html5/thumbnails/52.jpg)
Thank you
AWS Big Data blog
blogs.aws.amazon.com/bigdata