(BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
DESCRIPTION
Want to get ramped up on how to use Amazon's big data web services and launch your first big data application on AWS? Join us on our journey as we build a big data application in real time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. We review architecture design patterns for big data solutions on AWS, and give you access to a take-home lab so that you can rebuild and customize the application yourself.
TRANSCRIPT
November 12th, 2014 | Las Vegas, NV
Matt Yanchyshyn, Principal Solutions Architect
The architecture design pattern for big data on AWS:
• Collect: Amazon Kinesis, AWS Direct Connect, AWS Import/Export
• Store: Amazon S3, Amazon DynamoDB, Amazon Glacier
• Process & Analyze: Amazon EMR, Amazon Redshift, Amazon EC2
• Automate: AWS Data Pipeline
The application built in this session:
Log4J → Amazon Kinesis → EMR-Kinesis Connector → Hive on Amazon EMR → Amazon S3 → parallel COPY into Amazon Redshift, with Amazon Kinesis processing state tracked along the way
Launch a 3-instance Hadoop 2.4 cluster with Hive installed, using m3.xlarge instances; substitute YOUR-AWS-REGION, YOUR-AWS-SSH-KEY, and YOUR-BUCKET-NAME with your own values.
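A minimal sketch of the launch command, assuming the AWS CLI's emr create-cluster syntax of that era; the cluster name and AMI version here are illustrative:

aws emr create-cluster --name "BDT205-demo" \
    --region YOUR-AWS-REGION \
    --ami-version 3.3 \
    --instance-type m3.xlarge --instance-count 3 \
    --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
    --applications Name=Hive \
    --log-uri s3://YOUR-BUCKET-NAME/emr-logs/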
Create an Amazon Kinesis stream to hold incoming data:
aws kinesis create-stream \
--stream-name AccessLogStream \
--shard-count 2
Create a single-node Amazon Redshift data warehouse, choosing your own master password in place of CHOOSE-A-REDSHIFT-PASSWORD:
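A sketch of that step with aws redshift create-cluster; the identifier, database name, node type, and port are illustrative:

aws redshift create-cluster \
    --cluster-identifier demo \
    --db-name demo \
    --node-type dw2.large \
    --cluster-type single-node \
    --master-username master \
    --master-user-password CHOOSE-A-REDSHIFT-PASSWORD \
    --publicly-accessible \
    --port 8192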
Configure the Log4J appender with your IAM credentials (YOUR-IAM-ACCESS-KEY, YOUR-IAM-SECRET-KEY) so it can publish Apache access log lines to the Amazon Kinesis stream.
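To sanity-check the stream without wiring up the Log4J appender, you can push a test record with the AWS CLI; the log line below is made up:

aws kinesis put-record \
    --stream-name AccessLogStream \
    --partition-key host1 \
    --data '127.0.0.1 - - [12/Nov/2014:00:00:00 -0800] "GET / HTTP/1.1" 200 2216'
# note: AWS CLI v2 additionally needs --cli-binary-format raw-in-base64-out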
SSH to the Amazon EMR master node using YOUR-AWS-SSH-KEY; YOUR-EMR-MASTER-PRIVATE-DNS is shown in the EMR console once the cluster is running.
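A typical connection command, assuming the default hadoop user on the master node:

ssh -i YOUR-AWS-SSH-KEY.pem hadoop@YOUR-EMR-MASTER-PRIVATE-DNS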
Start Hive:
hive
In Hive, point the EMR-Kinesis connector at your IAM credentials (YOUR-IAM-ACCESS-KEY, YOUR-IAM-SECRET-KEY) and at YOUR-AWS-REGION.
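As a sketch, assuming the connector reads credentials from the kinesis.accessKey and kinesis.secretKey properties described in the Amazon EMR documentation (treat the exact property names as an assumption):

hive> SET kinesis.accessKey=YOUR-IAM-ACCESS-KEY;
hive> SET kinesis.secretKey=YOUR-IAM-SECRET-KEY;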
Create a Hive table that maps onto the Amazon Kinesis stream. The DDL lists the access-log columns and ends with the connector's storage handler:
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");
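A sketch of the full statement, using the Apache access-log fields; the column names and the regex-based SerDe are illustrative:

hive> CREATE TABLE apache_log (
        host STRING, identity STRING, user STRING, request_time STRING,
        request STRING, status STRING, size STRING,
        referrer STRING, agent STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?")
      STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
      TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");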
Query the stream from Hive:
-- return the first row in the stream
-- count all records in the stream
-- count all rows with a given host
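The queries behind those prompts, sketched against the table above; the host value is illustrative:

hive> SELECT * FROM apache_log LIMIT 1;
hive> SELECT COUNT(1) FROM apache_log;
hive> SELECT COUNT(1) FROM apache_log WHERE host='66.249.67.3';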
Progress so far: Log4J → Amazon Kinesis → EMR-Kinesis Connector → Hive
Create an external Hive table on Amazon S3 to receive the processed log data, with LOCATION 's3://YOUR-S3-BUCKET/emroutput':
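A sketch of the table, assuming a subset of the log columns plus an hour partition for the hourly split set up below:

hive> CREATE EXTERNAL TABLE apache_log_s3 (
        host STRING, request_time STRING, request STRING,
        status STRING, referrer STRING)
      PARTITIONED BY (hour INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 's3://YOUR-S3-BUCKET/emroutput';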
-- set up Hive's "dynamic partitioning",
-- which splits output files when writing to Amazon S3
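The standard Hive settings for dynamic partitioning:

hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;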
-- compress output files on Amazon S3 using Gzip
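Again as a sketch, using Hadoop's Gzip codec:

hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;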
-- convert the Apache log timestamp to a UNIX timestamp, and
-- split files in Amazon S3 by the hour in the log lines
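One way to write that INSERT, using the hypothetical tables above; the format string follows the Apache access-log timestamp:

hive> INSERT OVERWRITE TABLE apache_log_s3 PARTITION (hour)
      SELECT host, request_time, request, status, referrer,
             hour(from_unixtime(unix_timestamp(request_time,
                  '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
      FROM apache_log;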
Progress: Log4J → Amazon Kinesis → EMR-Kinesis Connector → Hive → Amazon S3
Verify the output files under s3://YOUR-S3-BUCKET/emroutput.
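A quick check from the AWS CLI:

aws s3 ls s3://YOUR-S3-BUCKET/emroutput/ --recursive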
Connect to Amazon Redshift, either with the PostgreSQL CLI (psql) against YOUR-REDSHIFT-ENDPOINT, or with any JDBC or ODBC SQL client that ships PostgreSQL 8.x drivers or has native Amazon Redshift support:
• Aginity Workbench for Amazon Redshift
• SQL Workbench/J
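A connection sketch, assuming the demo database, master user, and port from the cluster-creation step:

# using the PostgreSQL CLI
psql -h YOUR-REDSHIFT-ENDPOINT -p 8192 -U master demo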
Load the processed files from Amazon S3 into Amazon Redshift with a parallel COPY, substituting YOUR-S3-BUCKET, YOUR-IAM-ACCESS-KEY, and YOUR-IAM-SECRET-KEY:
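A sketch of the target table and the COPY; the table name and column types are illustrative, while the CREDENTIALS string and GZIP option follow Redshift's documented COPY syntax:

CREATE TABLE accesslogs (
    host VARCHAR(50), request_time VARCHAR(50), request VARCHAR(1024),
    status INT, referrer VARCHAR(1024));

COPY accesslogs
FROM 's3://YOUR-S3-BUCKET/emroutput'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS-KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t' GZIP;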
-- show all requests from a given IP address
-- count all requests on a given day
-- show all requests referred from other sites
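Sketches of those queries against the hypothetical accesslogs table; the host, date, and site values are illustrative:

-- show all requests from a given IP address
SELECT * FROM accesslogs WHERE host = '66.249.67.3';
-- count all requests on a given day
SELECT COUNT(1) FROM accesslogs WHERE request_time LIKE '[12/Nov/2014%';
-- show all requests referred from other sites
SELECT request, referrer FROM accesslogs
WHERE referrer <> '-' AND referrer NOT LIKE '%your-own-site.com%';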
Progress: Log4J → Amazon Kinesis → EMR-Kinesis Connector → Hive → Amazon S3 → parallel COPY into Amazon Redshift
Bonus: enable the EMR-Kinesis connector's checkpointing, so that each successive Hive query picks up reading the stream where the previous one left off. The checkpoint state lives in an Amazon DynamoDB table and is configured with a handful of SET statements.
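A sketch of that configuration, using the connector's documented kinesis.checkpoint.* settings; the metastore table, key, and logical names are illustrative:

hive> SET kinesis.checkpoint.enabled=true;
hive> SET kinesis.checkpoint.metastore.table.name=MyEMRKinesisTable;
hive> SET kinesis.checkpoint.metastore.hash.key.name=HashKey;
hive> SET kinesis.checkpoint.metastore.range.key.name=RangeKey;
hive> SET kinesis.checkpoint.logical.name=AccessLogGroup;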
-- create an external table on Amazon S3 to hold query results,
-- partitioned (files split on Amazon S3) by iteration, under s3://YOUR-S3-BUCKET
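A sketch of that results table; the name, columns, and S3 prefix are illustrative:

hive> CREATE EXTERNAL TABLE error_count_by_os (os STRING, error_count INT)
      PARTITIONED BY (iteration INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 's3://YOUR-S3-BUCKET/os_error_count';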
-- set up a first iteration
-- create the OS-ERROR_COUNT result (404 error codes) under dynamic partition 0
-- set up a second iteration over the data in the Kinesis stream
-- create the OS-ERROR_COUNT result under dynamic partition 1;
-- if the file is empty, the previous iteration read all remaining stream data
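One way the two passes could look, combining the checkpoint iteration number with dynamic partitioning; extracting the OS from the user agent with regexp_extract is a simplification:

hive> SET kinesis.checkpoint.iteration.no=0;
hive> INSERT OVERWRITE TABLE error_count_by_os PARTITION (iteration)
      SELECT regexp_extract(agent, '\\(([^;]+);', 1) AS os,
             COUNT(1) AS error_count, 0 AS iteration
      FROM apache_log WHERE status = '404'
      GROUP BY regexp_extract(agent, '\\(([^;]+);', 1);

hive> SET kinesis.checkpoint.iteration.no=1;
hive> INSERT OVERWRITE TABLE error_count_by_os PARTITION (iteration)
      SELECT regexp_extract(agent, '\\(([^;]+);', 1) AS os,
             COUNT(1) AS error_count, 1 AS iteration
      FROM apache_log WHERE status = '404'
      GROUP BY regexp_extract(agent, '\\(([^;]+);', 1);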
Progress: the full pipeline is now in place: Log4J → Amazon Kinesis → EMR-Kinesis Connector → Hive → Amazon S3 → parallel COPY into Amazon Redshift, with Amazon Kinesis processing state checkpointed between queries.
Download one of the compressed output files from Amazon S3 and inspect it locally, substituting YOUR-S3-BUCKET and YOUR-PREFIX.gz:
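A sketch with the AWS CLI:

# copy one Gzip'd output file from the bucket to the current directory
aws s3 cp s3://YOUR-S3-BUCKET/YOUR-PREFIX.gz .
# view its contents
zcat YOUR-PREFIX.gz | head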
Take-home lab: http://bit.ly/aws-bdt205
Learn from AWS big data experts: blogs.aws.amazon.com/bigdata
Session evaluations: http://bit.ly/awsevals