November 12th, 2014 | Las Vegas, NV
Matt Yanchyshyn, Principal Solutions Architect
[Diagram: AWS big data services]
Collect: Amazon Kinesis, AWS Direct Connect, AWS Import/Export
Store: Amazon S3, Amazon Glacier, Amazon DynamoDB
Process & Analyze: Amazon EMR, Amazon Redshift, Amazon EC2
Automate: AWS Data Pipeline
[Diagram: demo application architecture]
Log4J appender → Amazon Kinesis → EMR-Kinesis Connector (Hive on Amazon EMR) → Amazon S3 → parallel COPY into Amazon Redshift
Amazon DynamoDB holds the Amazon Kinesis processing state (checkpoints)
Launch a 3-instance Hadoop 2.4 cluster with Hive installed, using m3.xlarge instances. Substitute YOUR-AWS-REGION, YOUR-AWS-SSH-KEY, and YOUR-BUCKET-NAME with your own values; a sketch of the CLI call follows.
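A minimal sketch of the launch command, assuming the EMR CLI options of that era (AMI 3.3 ships Hadoop 2.4); the cluster name and log prefix are illustrative:

aws emr create-cluster \
  --region YOUR-AWS-REGION \
  --name "big-data-demo" \
  --ami-version 3.3 \
  --applications Name=Hive \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
  --use-default-roles \
  --log-uri s3://YOUR-BUCKET-NAME/logs/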
Create an Amazon Kinesis stream to hold incoming data:
aws kinesis create-stream \
--stream-name AccessLogStream \
--shard-count 2
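To confirm the stream is ready before publishing to it, you can poll its status with a standard CLI call (this check is an addition, not part of the original walkthrough):

aws kinesis describe-stream \
  --stream-name AccessLogStream \
  --query 'StreamDescription.StreamStatus'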
Create a single-node Amazon Redshift data warehouse, choosing your own value for CHOOSE-A-REDSHIFT-PASSWORD; a sketch of the command follows.
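A minimal sketch, assuming the database name (demo), master username (master), and port (8192) that the psql connection later in this walkthrough relies on; the node type is also an assumption:

aws redshift create-cluster \
  --cluster-identifier demo \
  --db-name demo \
  --node-type dw2.large \
  --cluster-type single-node \
  --master-username master \
  --master-user-password CHOOSE-A-REDSHIFT-PASSWORD \
  --publicly-accessible \
  --port 8192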
Configure the Kinesis Log4J appender with YOUR-IAM-ACCESS-KEY and YOUR-IAM-SECRET-KEY and use it to publish a sample Apache access log into the stream (a sketch follows the SSH command below), then SSH into the EMR master node:

ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-MASTER-PRIVATE-DNS
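A sketch of the appender setup on the publishing machine, assuming the kinesis-log4j-appender jar and the sample access log that AWS hosted for this demo (URLs, jar version, and class name may have changed):

# credentials read by the Log4J appender
cat > AwsCredentials.properties <<EOF
accessKey=YOUR-IAM-ACCESS-KEY
secretKey=YOUR-IAM-SECRET-KEY
EOF

# fetch the appender jar and a sample Apache access log
wget http://emr-kinesis.s3.amazonaws.com/publisher/kinesis-log4j-appender-1.0.0.jar
wget http://elasticmapreduce.s3.amazonaws.com/samples/pig-apache/input/access_log_1

# push the log lines into AccessLogStream
java -cp .:kinesis-log4j-appender-1.0.0.jar \
  com.amazonaws.services.kinesis.log4j.FilePublisher access_log_1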
Start Hive and hand the EMR-Kinesis connector your IAM credentials and AWS Region (the property names below follow the EMR 3.x Kinesis connector documentation; verify them for your connector version):

hive

hive> set kinesis.accessKey=YOUR-IAM-ACCESS-KEY;
hive> set kinesis.secretKey=YOUR-IAM-SECRET-KEY;
hive> set kinesis.endpoint.region=YOUR-AWS-REGION;
-- create a Hive table backed by the Kinesis stream;
-- the column list and RegexSerDe pattern are a reconstruction
-- that parses the Apache combined log format
hive> CREATE TABLE apache_log (
    host STRING, identity STRING, user STRING, request_time STRING,
    request STRING, status STRING, size STRING,
    referrer STRING, agent STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?")
  STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
  TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");
-- return the first row in the stream
hive> SELECT * FROM apache_log LIMIT 1;

-- return the count of all items in the stream
hive> SELECT COUNT(1) FROM apache_log;

-- return the count of all rows with a given host
hive> SELECT COUNT(1) FROM apache_log WHERE host = 'YOUR-HOST-IP';
-- create an external table on Amazon S3 for the processed logs
-- (the column list is a reconstruction)
hive> CREATE EXTERNAL TABLE apache_log_s3 (
    host STRING, request_time BIGINT, request STRING,
    status STRING, referrer STRING, agent STRING)
  PARTITIONED BY (hour INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://YOUR-S3-BUCKET/emroutput';

-- set up Hive's "dynamic partitioning"
-- splits output files when writing to Amazon S3
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

-- compress output files on Amazon S3 using Gzip
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compress=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- convert the Apache log timestamp to a UNIX timestamp
-- split files in Amazon S3 by the hour in the log lines
hive> INSERT OVERWRITE TABLE apache_log_s3 PARTITION (hour)
  SELECT host, unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'),
         request, status, referrer, agent,
         hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
  FROM apache_log;
# list the Hive output files on Amazon S3 (a reconstructed check)
hadoop fs -ls s3://YOUR-S3-BUCKET/emroutput/
# using the PostgreSQL CLI (database, user, and port from the cluster created earlier)
psql -h YOUR-REDSHIFT-ENDPOINT -p 8192 -U master demo

Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Amazon Redshift support:
• Aginity Workbench for Amazon Redshift
• SQL Workbench/J
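Create a target table before loading; a minimal sketch whose columns mirror the Hive output written to Amazon S3 (names and types are assumptions):

CREATE TABLE accesslog (
  host VARCHAR(50),
  request_time TIMESTAMP,
  request VARCHAR(2048),
  status VARCHAR(10),
  referrer VARCHAR(2048),
  agent VARCHAR(4096)
);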
-- load the processed logs from Amazon S3
-- (TIMEFORMAT 'epochsecs' converts the UNIX timestamps written by Hive)
COPY accesslog
FROM 's3://YOUR-S3-BUCKET/emroutput'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS-KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t' TIMEFORMAT 'epochsecs' GZIP;
-- show all requests from a given IP address
-- count all requests on a given day
-- show all requests referred from other sites
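Sketches of the three queries above, in order, assuming the accesslog schema sketched earlier; the host IP and date literals are placeholders:

SELECT * FROM accesslog WHERE host = 'YOUR-HOST-IP';

SELECT COUNT(1) FROM accesslog WHERE TRUNC(request_time) = 'YYYY-MM-DD';

-- the RegexSerDe kept the surrounding quotes, so "-" marks an empty referrer
SELECT request, referrer FROM accesslog WHERE referrer <> '"-"';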
Bonus: checkpoint your position in the stream so that successive Hive queries resume where the previous one stopped. Enable checkpointing and name the Amazon DynamoDB table that holds the connector's processing state (property names follow the EMR Kinesis connector documentation; the table, key, and logical names, plus the OS-extraction regex below, are illustrative):

hive> set kinesis.checkpoint.enabled=true;
hive> set kinesis.checkpoint.metastore.table.name=MyEMRKinesisTable;
hive> set kinesis.checkpoint.metastore.hash.key.name=HashKey;
hive> set kinesis.checkpoint.metastore.range.key.name=RangeKey;
hive> set kinesis.checkpoint.logical.name=AccessLogCheckpoint;

-- Create an external table on Amazon S3 to hold query results.
-- Partition (split files on Amazon S3) by iteration
hive> CREATE EXTERNAL TABLE error_count_s3 (os STRING, error_count INT)
  PARTITIONED BY (iteration_no INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://YOUR-S3-BUCKET/error_count/';

-- set up a first iteration
-- create an OS/ERROR_COUNT result (404 error codes) under partition 0
hive> set kinesis.checkpoint.iteration.no=0;
hive> INSERT OVERWRITE TABLE error_count_s3 PARTITION (iteration_no=0)
  SELECT regexp_extract(agent, '\\(([^;]*);', 1), COUNT(1) FROM apache_log
  WHERE status = '404' GROUP BY regexp_extract(agent, '\\(([^;]*);', 1);

-- set up a second iteration over the data in the Kinesis stream
-- create the OS/ERROR_COUNT result under partition 1;
-- if the output file is empty, the previous iteration read all remaining stream data
hive> set kinesis.checkpoint.iteration.no=1;
hive> INSERT OVERWRITE TABLE error_count_s3 PARTITION (iteration_no=1)
  SELECT regexp_extract(agent, '\\(([^;]*);', 1), COUNT(1) FROM apache_log
  WHERE status = '404' GROUP BY regexp_extract(agent, '\\(([^;]*);', 1);
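To inspect the checkpoint state the connector stores, you can scan the DynamoDB metastore table named above (a standard CLI call, not part of the original walkthrough):

aws dynamodb scan --table-name MyEMRKinesisTable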
Download the results from Amazon S3 and inspect them (a reconstruction of the closing step):

aws s3 ls s3://YOUR-S3-BUCKET --recursive
aws s3 cp s3://YOUR-S3-BUCKET/YOUR-PREFIX.gz .
zcat YOUR-PREFIX.gz
Try the full walkthrough yourself: http://bit.ly/aws-bdt205

Learn from AWS big data experts: blogs.aws.amazon.com/bigdata

Session evaluations: http://bit.ly/awsevals