(BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
DESCRIPTION
Want to get ramped up on how to use Amazon's big data web services and launch your first big data application on AWS? Join us on our journey as we build a big data application in real time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. We review architecture design patterns for big data solutions on AWS, and give you access to a take-home lab so that you can rebuild and customize the application yourself.
TRANSCRIPT
November 12th, 2014 | Las Vegas, NV
Matt Yanchyshyn, Principal Solutions Architect
The architecture design pattern for big data on AWS:
• Collect: Amazon Kinesis, AWS Direct Connect, AWS Import/Export
• Store: Amazon S3, Amazon DynamoDB, Amazon Glacier
• Process & Analyze: Amazon EMR, Amazon Redshift, Amazon EC2
• Automate: AWS Data Pipeline
The application built in this session:
Log4J → Amazon Kinesis → EMR-Kinesis Connector → Hive on Amazon EMR → Amazon S3 → parallel COPY into Amazon Redshift, with Amazon Kinesis processing state tracked along the way
Launch a 3-instance Hadoop 2.4 cluster with Hive installed, using m3.xlarge instances; substitute YOUR-AWS-REGION, YOUR-AWS-SSH-KEY, and YOUR-BUCKET-NAME with your own values.
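A minimal sketch of the launch command, assuming the AWS CLI's emr create-cluster syntax of that era; the cluster name and AMI version here are illustrative:

aws emr create-cluster --name "BDT205-demo" \
    --region YOUR-AWS-REGION \
    --ami-version 3.3 \
    --instance-type m3.xlarge --instance-count 3 \
    --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
    --applications Name=Hive \
    --log-uri s3://YOUR-BUCKET-NAME/emr-logs/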
Create an Amazon Kinesis stream to hold incoming data:
aws kinesis create-stream \
--stream-name AccessLogStream \
--shard-count 2
Create a single-node Amazon Redshift data warehouse, choosing your own master password in place of CHOOSE-A-REDSHIFT-PASSWORD:
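A sketch of that step with aws redshift create-cluster; the identifier, database name, node type, and port are illustrative:

aws redshift create-cluster \
    --cluster-identifier demo \
    --db-name demo \
    --node-type dw2.large \
    --cluster-type single-node \
    --master-username master \
    --master-user-password CHOOSE-A-REDSHIFT-PASSWORD \
    --publicly-accessible \
    --port 8192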
Configure the Log4J appender with your IAM credentials (YOUR-IAM-ACCESS-KEY, YOUR-IAM-SECRET-KEY) so it can publish Apache access log lines to the Amazon Kinesis stream.
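To sanity-check the stream without wiring up the Log4J appender, you can push a test record with the AWS CLI; the log line below is made up:

aws kinesis put-record \
    --stream-name AccessLogStream \
    --partition-key host1 \
    --data '127.0.0.1 - - [12/Nov/2014:00:00:00 -0800] "GET / HTTP/1.1" 200 2216'
# note: AWS CLI v2 additionally needs --cli-binary-format raw-in-base64-out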
SSH to the Amazon EMR master node using YOUR-AWS-SSH-KEY; YOUR-EMR-MASTER-PRIVATE-DNS is shown in the EMR console once the cluster is running.
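A typical connection command, assuming the default hadoop user on the master node:

ssh -i YOUR-AWS-SSH-KEY.pem hadoop@YOUR-EMR-MASTER-PRIVATE-DNS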
Start Hive:
hive
In Hive, point the EMR-Kinesis connector at your IAM credentials (YOUR-IAM-ACCESS-KEY, YOUR-IAM-SECRET-KEY) and at YOUR-AWS-REGION.
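As a sketch, assuming the connector reads credentials from the kinesis.accessKey and kinesis.secretKey properties described in the Amazon EMR documentation (treat the exact property names as an assumption):

hive> SET kinesis.accessKey=YOUR-IAM-ACCESS-KEY;
hive> SET kinesis.secretKey=YOUR-IAM-SECRET-KEY;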
Create a Hive table that maps onto the Amazon Kinesis stream. The DDL lists the access-log columns and ends with the connector's storage handler:
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");
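A sketch of the full statement, using the Apache access-log fields; the column names and the regex-based SerDe are illustrative:

hive> CREATE TABLE apache_log (
        host STRING, identity STRING, user STRING, request_time STRING,
        request STRING, status STRING, size STRING,
        referrer STRING, agent STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?")
      STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
      TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");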
Query the stream from Hive:
-- return the first row in the stream
-- count all records in the stream
-- count all rows with a given host
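The queries behind those prompts, sketched against the table above; the host value is illustrative:

hive> SELECT * FROM apache_log LIMIT 1;
hive> SELECT COUNT(1) FROM apache_log;
hive> SELECT COUNT(1) FROM apache_log WHERE host='66.249.67.3';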
Progress so far: Log4J → Amazon Kinesis → EMR-Kinesis Connector → Hive
Create an external Hive table on Amazon S3 to receive the processed log data, with LOCATION 's3://YOUR-S3-BUCKET/emroutput':
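A sketch of the table, assuming a subset of the log columns plus an hour partition for the hourly split set up below:

hive> CREATE EXTERNAL TABLE apache_log_s3 (
        host STRING, request_time STRING, request STRING,
        status STRING, referrer STRING)
      PARTITIONED BY (hour INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 's3://YOUR-S3-BUCKET/emroutput';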
-- set up Hive's "dynamic partitioning",
-- which splits output files when writing to Amazon S3
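The standard Hive settings for dynamic partitioning:

hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;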
-- compress output files on Amazon S3 using Gzip
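Again as a sketch, using Hadoop's Gzip codec:

hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;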
-- convert the Apache log timestamp to a UNIX timestamp, and
-- split files in Amazon S3 by the hour in the log lines
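One way to write that INSERT, using the hypothetical tables above; the format string follows the Apache access-log timestamp:

hive> INSERT OVERWRITE TABLE apache_log_s3 PARTITION (hour)
      SELECT host, request_time, request, status, referrer,
             hour(from_unixtime(unix_timestamp(request_time,
                  '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
      FROM apache_log;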
Progress: Log4J → Amazon Kinesis → EMR-Kinesis Connector → Hive → Amazon S3
Verify the output files under s3://YOUR-S3-BUCKET/emroutput.
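A quick check from the AWS CLI:

aws s3 ls s3://YOUR-S3-BUCKET/emroutput/ --recursive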
Connect to Amazon Redshift, either with the PostgreSQL CLI (psql) against YOUR-REDSHIFT-ENDPOINT, or with any JDBC or ODBC SQL client that ships PostgreSQL 8.x drivers or has native Amazon Redshift support:
• Aginity Workbench for Amazon Redshift
• SQL Workbench/J
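A connection sketch, assuming the demo database, master user, and port from the cluster-creation step:

# using the PostgreSQL CLI
psql -h YOUR-REDSHIFT-ENDPOINT -p 8192 -U master demo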
Load the processed files from Amazon S3 into Amazon Redshift with a parallel COPY, substituting YOUR-S3-BUCKET, YOUR-IAM-ACCESS-KEY, and YOUR-IAM-SECRET-KEY:
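A sketch of the target table and the COPY; the table name and column types are illustrative, while the CREDENTIALS string and GZIP option follow Redshift's documented COPY syntax:

CREATE TABLE accesslogs (
    host VARCHAR(50), request_time VARCHAR(50), request VARCHAR(1024),
    status INT, referrer VARCHAR(1024));

COPY accesslogs
FROM 's3://YOUR-S3-BUCKET/emroutput'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS-KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t' GZIP;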
-- show all requests from a given IP address
-- count all requests on a given day
-- show all requests referred from other sites
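Sketches of those queries against the hypothetical accesslogs table; the host, date, and site values are illustrative:

-- show all requests from a given IP address
SELECT * FROM accesslogs WHERE host = '66.249.67.3';
-- count all requests on a given day
SELECT COUNT(1) FROM accesslogs WHERE request_time LIKE '[12/Nov/2014%';
-- show all requests referred from other sites
SELECT request, referrer FROM accesslogs
WHERE referrer <> '-' AND referrer NOT LIKE '%your-own-site.com%';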
Progress: Log4J → Amazon Kinesis → EMR-Kinesis Connector → Hive → Amazon S3 → parallel COPY into Amazon Redshift
Bonus: enable the EMR-Kinesis connector's checkpointing, so that each successive Hive query picks up reading the stream where the previous one left off. The checkpoint state lives in an Amazon DynamoDB table and is configured with a handful of SET statements.
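A sketch of that configuration, using the connector's documented kinesis.checkpoint.* settings; the metastore table, key, and logical names are illustrative:

hive> SET kinesis.checkpoint.enabled=true;
hive> SET kinesis.checkpoint.metastore.table.name=MyEMRKinesisTable;
hive> SET kinesis.checkpoint.metastore.hash.key.name=HashKey;
hive> SET kinesis.checkpoint.metastore.range.key.name=RangeKey;
hive> SET kinesis.checkpoint.logical.name=AccessLogGroup;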
-- create an external table on Amazon S3 to hold query results,
-- partitioned (files split on Amazon S3) by iteration, under s3://YOUR-S3-BUCKET
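A sketch of that results table; the name, columns, and S3 prefix are illustrative:

hive> CREATE EXTERNAL TABLE error_count_by_os (os STRING, error_count INT)
      PARTITIONED BY (iteration INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 's3://YOUR-S3-BUCKET/os_error_count';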
-- set up a first iteration
-- create the OS-ERROR_COUNT result (404 error codes) under dynamic partition 0
-- set up a second iteration over the data in the Kinesis stream
-- create the OS-ERROR_COUNT result under dynamic partition 1;
-- if the file is empty, the previous iteration read all remaining stream data
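One way the two passes could look, combining the checkpoint iteration number with dynamic partitioning; extracting the OS from the user agent with regexp_extract is a simplification:

hive> SET kinesis.checkpoint.iteration.no=0;
hive> INSERT OVERWRITE TABLE error_count_by_os PARTITION (iteration)
      SELECT regexp_extract(agent, '\\(([^;]+);', 1) AS os,
             COUNT(1) AS error_count, 0 AS iteration
      FROM apache_log WHERE status = '404'
      GROUP BY regexp_extract(agent, '\\(([^;]+);', 1);

hive> SET kinesis.checkpoint.iteration.no=1;
hive> INSERT OVERWRITE TABLE error_count_by_os PARTITION (iteration)
      SELECT regexp_extract(agent, '\\(([^;]+);', 1) AS os,
             COUNT(1) AS error_count, 1 AS iteration
      FROM apache_log WHERE status = '404'
      GROUP BY regexp_extract(agent, '\\(([^;]+);', 1);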
Progress: the full pipeline is now in place: Log4J → Amazon Kinesis → EMR-Kinesis Connector → Hive → Amazon S3 → parallel COPY into Amazon Redshift, with Amazon Kinesis processing state checkpointed between queries.
Download one of the compressed output files from Amazon S3 and inspect it locally, substituting YOUR-S3-BUCKET and YOUR-PREFIX.gz:
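A sketch with the AWS CLI:

# copy one Gzip'd output file from the bucket to the current directory
aws s3 cp s3://YOUR-S3-BUCKET/YOUR-PREFIX.gz .
# view its contents
zcat YOUR-PREFIX.gz | head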
Take-home lab: http://bit.ly/aws-bdt205
Learn from AWS big data experts: blogs.aws.amazon.com/bigdata
Session evaluations: http://bit.ly/awsevals