Transcript
Page 1: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

November 12th, 2014 | Las Vegas, NV

Matt Yanchyshyn, Principal Solutions Architect

Page 2: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Redshift

EMR

EC2

Process & Analyze

Store

AWS Direct Connect

S3

Amazon Kinesis

Glacier

AWS Import/Export

DynamoDB

Collect

Automate

AWS Data Pipeline

Page 3: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 4: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 5: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Log4J

EMR-Kinesis Connector

Hive with

Amazon S3

Amazon Redshift

parallel COPY from

Amazon S3

Amazon Kinesis

processing state

Page 6: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 7: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Launch a 3-instance Hadoop 2.4 cluster with Hive installed:

m3.xlarge

YOUR-AWS-REGION

YOUR-AWS-SSH-KEY

Page 8: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

YOUR-BUCKET-NAME

Page 9: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Create an Amazon Kinesis stream to hold incoming data:

aws kinesis create-stream \
  --stream-name AccessLogStream \
  --shard-count 2
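The shard count decides how incoming records are spread across the stream. The sketch below illustrates the idea only: Kinesis maps each record's partition key to a 128-bit value with MD5, and each shard owns a range of that hash space. The function name `shard_for_key` and the even split of the hash space between the two shards are assumptions made for illustration; the slide states neither.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key: str, shard_count: int = 2) -> int:
    """Return the index of the shard whose hash range covers the key.

    Assumption: the hash space is divided evenly between the shards,
    as for a freshly created stream that has not been resharded.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


if __name__ == "__main__":
    for key in ("a", "abc", "192.0.2.10"):
        print(key, "-> shard", shard_for_key(key))
```

A partition key with many distinct values (a client IP address, for example) tends to spread records over both shards, while a constant key sends every record to the same shard.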

Page 10: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

\

CHOOSE-A-REDSHIFT-PASSWORD

Page 11: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

YOUR-IAM-ACCESS-KEY

YOUR-IAM-SECRET-KEY

Page 12: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 13: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Log4J

Page 14: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

YOUR-AWS-SSH-KEY

YOUR-EMR-MASTER-PRIVATE-DNS

YOUR-EMR-MASTER-PRIVATE-DNS

YOUR-EMR-HOSTNAME

Start Hive:

hive

Page 15: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 16: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

YOUR-IAM-ACCESS-KEY

YOUR-IAM-SECRET-KEY;

YOUR-AWS-REGION

hive>

hive>

hive>

hive>

hive>

hive>

Page 17: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 18: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

hive>

STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'

TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");
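The table definition on this slide is only partly preserved, so the column list is unknown. As a rough illustration of what mapping an access-log record onto named columns involves, here is a minimal sketch that parses one line. It assumes the records are in the Apache combined log format; the field names and the sample line are made up for the example and are not taken from the slides.

```python
import re

# Assumed layout: Apache combined log format. Field names are hypothetical.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line: str):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Invented sample record, for illustration only.
SAMPLE = (
    '192.0.2.10 - - [12/Nov/2014:10:15:32 -0800] '
    '"GET /index.html HTTP/1.1" 200 5262 "-" "ExampleAgent/1.0"'
)

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```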

Page 19: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

-- return the first row in the stream

hive>

-- return a count of all items in the stream

hive>

-- return a count of all rows with a given host

hive>

Page 20: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Log4J

EMR-Kinesis Connector

Page 21: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

http://127.0.0.1:19026/cluster

http://127.0.0.1:19101

Page 22: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 23: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

hive>

YOUR-S3-BUCKET/emroutput

Page 24: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

-- set up Hive's "dynamic partitioning"

-- splits output files when writing to Amazon S3

hive>

hive>

Page 25: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

-- compress output files on Amazon S3 using Gzip

hive>

hive>

hive>

hive>
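The Hive settings for this step are not preserved in the transcript. To show what Gzip-compressed text output amounts to, here is a small local sketch that compresses a few lines and reads them back. It uses only the Python standard library and does not touch Hive or Amazon S3; the helper names are hypothetical.

```python
import gzip
import io


def gzip_lines(lines):
    """Compress an iterable of text lines into one Gzip byte string."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        for line in lines:
            gz.write((line + "\n").encode("utf-8"))
    return buffer.getvalue()


def gunzip_lines(data):
    """Decompress a Gzip byte string back into a list of text lines."""
    return gzip.decompress(data).decode("utf-8").splitlines()


if __name__ == "__main__":
    blob = gzip_lines(["first row", "second row"])
    print(len(blob), "compressed bytes")
    print(gunzip_lines(blob))
```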

Page 26: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 27: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

-- convert the Apache log timestamp to a UNIX timestamp

-- split files in Amazon S3 by the hour in the log lines

hive>
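The Hive statement for this step is missing from the transcript. The two comments above can be illustrated with a minimal sketch: parse an Apache-style timestamp, convert it to a UNIX timestamp, and derive an hour value that could serve as a partition. The partition column name `hour` and both function names are assumptions; Hive's `column=value` directory naming for partitions is standard behavior.

```python
from datetime import datetime

# Apache access-log time format, e.g. 12/Nov/2014:10:15:32 -0800.
# Note: %b expects English month abbreviations (the default C locale).
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def to_unix_timestamp(apache_time: str) -> int:
    """Convert an Apache log timestamp to seconds since the UNIX epoch."""
    return int(datetime.strptime(apache_time, APACHE_TIME_FORMAT).timestamp())


def hour_partition(apache_time: str) -> str:
    """Build a Hive-style partition directory from the hour in the log line.

    The hour is taken as written in the log line, not converted to UTC.
    """
    parsed = datetime.strptime(apache_time, APACHE_TIME_FORMAT)
    return f"hour={parsed.hour}"


if __name__ == "__main__":
    stamp = "01/Jan/2015:01:30:00 +0100"
    print(to_unix_timestamp(stamp), hour_partition(stamp))
```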

Page 28: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Log4J

EMR-Kinesis Connector

Hive with

Amazon S3

Page 29: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 30: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

YOUR-S3-BUCKET

YOUR-S3-BUCKET

Page 31: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

# using the PostgreSQL CLI

YOUR-REDSHIFT-ENDPOINT

Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Redshift support:

• Aginity Workbench for Amazon Redshift

• SQL Workbench/J

Page 32: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 33: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 34: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

YOUR-S3-BUCKET

YOUR-IAM-ACCESS-KEY

YOUR-IAM-SECRET-KEY

Page 35: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

-- show all requests from a given IP address

-- count all requests on a given day

-- show all requests referred from other sites
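The queries behind these three comments are not preserved in the transcript. The sketch below shows what they might look like, run against an in-memory SQLite database as a local stand-in for Amazon Redshift. The table name `accesslogs`, its columns, and the sample rows are all hypothetical, and SQLite's `date()` function may differ from the equivalent Redshift expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, for illustration only.
conn.execute(
    "CREATE TABLE accesslogs ("
    "host TEXT, request_time TEXT, request TEXT, status INTEGER, referrer TEXT)"
)
conn.executemany(
    "INSERT INTO accesslogs VALUES (?, ?, ?, ?, ?)",
    [
        ("192.0.2.10", "2014-11-12 10:15:00", "GET /index.html", 200, "-"),
        ("192.0.2.10", "2014-11-12 10:16:00", "GET /missing", 404, "http://example.com/"),
        ("198.51.100.7", "2014-11-13 09:00:00", "GET /index.html", 200, "-"),
    ],
)

# Show all requests from a given IP address.
from_ip = conn.execute(
    "SELECT request FROM accesslogs WHERE host = ? ORDER BY request_time",
    ("192.0.2.10",),
).fetchall()

# Count all requests on a given day.
on_day = conn.execute(
    "SELECT COUNT(*) FROM accesslogs WHERE date(request_time) = ?",
    ("2014-11-12",),
).fetchone()[0]

# Show all requests referred from other sites ("-" means no referrer).
referred = conn.execute(
    "SELECT host, referrer FROM accesslogs WHERE referrer <> '-'"
).fetchall()

if __name__ == "__main__":
    print(from_ip, on_day, referred)
```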

Page 36: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 37: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Log4J

EMR-Kinesis Connector

Hive with

Amazon S3

Amazon Redshift

parallel COPY from

Amazon S3

Page 38: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Bonus:

Page 39: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014
Page 40: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

hive>

hive>

hive>

hive>

hive>

Page 41: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

-- Create an external table on Amazon S3

-- to hold query results.

-- Partition (split files on Amazon S3) by iteration

hive>

YOUR-S3-BUCKET

Page 42: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

-- set up a first iteration

-- create OS-ERROR_COUNT result (404 error codes) under dynamic partition 0

Page 43: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

-- set up a second iteration over the data in the Kinesis Stream

-- create OS-ERROR_COUNT result under dynamic partition 1.

-- if file is empty, the previous iteration read all remaining stream data
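The iteration idea on these two slides can be pictured with a toy in-memory model. Each iteration reads from where the previous one stopped up to the current end of the stream, so a second iteration returns nothing when no new records have arrived. This is a sketch of the concept only, not the EMR-Kinesis connector: the class `IterationReader`, the list standing in for the stream, and the integer checkpoint are all hypothetical.

```python
class IterationReader:
    """Toy model of iteration-based reads with a saved checkpoint."""

    def __init__(self, stream):
        self.stream = stream  # a plain list standing in for the stream
        self.checkpoint = 0   # position where the next iteration starts

    def read_iteration(self):
        """Return records added since the last iteration and advance."""
        end = len(self.stream)
        batch = self.stream[self.checkpoint:end]
        self.checkpoint = end
        return batch


def count_404(status_codes):
    """Count the records whose HTTP status code is 404."""
    return sum(1 for status in status_codes if status == 404)


if __name__ == "__main__":
    stream = [200, 404, 404]
    reader = IterationReader(stream)
    print("iteration 0:", count_404(reader.read_iteration()))
    stream.extend([404, 200])
    print("iteration 1:", count_404(reader.read_iteration()))
    print("iteration 2:", reader.read_iteration())  # empty: nothing new
```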

Page 44: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Log4J

EMR-Kinesis Connector

Hive with

Amazon S3

Amazon Redshift

parallel COPY from

Amazon S3

Amazon Kinesis

processing state

Page 45: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

YOUR-S3-BUCKET

YOUR-S3-BUCKET

Page 46: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

YOUR-S3-BUCKET YOUR-PREFIX.gz .

YOUR-PREFIX.gz

Page 47: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Big Data software on AWS Marketplace:

http://amzn.to/1va4KQ6

Page 48: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

http://bit.ly/aws-bdt205

Page 49: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

Learn from AWS big data experts

blogs.aws.amazon.com/bigdata

Page 50: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

http://bit.ly/awsevals

