(BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

November 12th, 2014 | Las Vegas, NV

Matt Yanchyshyn, Principal Solutions Architect

Upload: amazon-web-services

Post on 02-Jul-2015

DESCRIPTION

Want to get ramped up on Amazon's big data web services and launch your first big data application on AWS? Join us as we build a big data application in real time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. We review architecture design patterns for big data solutions on AWS and give you access to a take-home lab so that you can rebuild and customize the application yourself.

TRANSCRIPT

Page 1: (BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014

November 12th, 2014 | Las Vegas, NV

Matt Yanchyshyn, Principal Solutions Architect

Page 2:

Collect: AWS Direct Connect, AWS Import/Export, Amazon Kinesis

Store: S3, Glacier, DynamoDB

Process & Analyze: EMR, EC2, Redshift

Automate: AWS Data Pipeline

Page 3:
Page 4:
Page 5:

Diagram labels: Log4J, EMR-Kinesis Connector, Hive with Amazon S3, Amazon Redshift, parallel COPY from Amazon S3, Amazon Kinesis processing state

Page 6:
Page 7:

Launch a 3-instance Hadoop 2.4 cluster with Hive installed:

m3.xlarge

YOUR-AWS-REGION

YOUR-AWS-SSH-KEY

Page 8:

YOUR-BUCKET-NAME

Page 9:

Create an Amazon Kinesis stream to hold incoming data:

aws kinesis create-stream \

--stream-name AccessLogStream \

--shard-count 2
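
The `--shard-count 2` setting determines how incoming records are spread across the stream. Kinesis maps each record's partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that integer. Below is a minimal sketch of that routing, assuming the two shards split the hash space evenly (a resharded stream may not). The function name `shard_for_key` and the sample keys are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def shard_for_key(partition_key: str, shard_count: int = 2) -> int:
    """Return the index of the shard that would receive this partition key.

    Assumes the hash key space is divided into equal contiguous ranges,
    one per shard; a resharded stream may have uneven ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the remainder of an uneven division lands in the last shard.
    return min(hash_value // range_size, shard_count - 1)


if __name__ == "__main__":
    # Hypothetical partition keys, for illustration only.
    for key in ("host-a", "host-b", "host-c"):
        print(key, "-> shard", shard_for_key(key))
```

A partition key with many distinct values, such as the client host, tends to spread writes across both shards, while a constant key would send every record to the same shard.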

Page 10:

CHOOSE-A-REDSHIFT-PASSWORD

Page 11:

YOUR-IAM-ACCESS-KEY

YOUR-IAM-SECRET-KEY

Page 12:
Page 13:

Log4J

Page 14:

YOUR-AWS-SSH-KEY

YOUR-EMR-MASTER-PRIVATE-DNS

YOUR-EMR-MASTER-PRIVATE-DNS

YOUR-EMR-HOSTNAME

Start Hive:

hive

Page 15:
Page 16:

YOUR-IAM-ACCESS-KEY

YOUR-IAM-SECRET-KEY;

YOUR-AWS-REGION

hive>

hive>

hive>

hive>

hive>

hive>

Page 17:
Page 18:

hive>

STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'

TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");

Page 19:

-- return the first row in the stream

hive>

-- return the count of all items in the stream

hive>

-- return the count of all rows with a given host

hive>

Page 20:

Log4J

EMR-Kinesis Connector

Page 21:

http://127.0.0.1:19026/cluster

http://127.0.0.1:19101

Page 22:
Page 23:

hive>

YOUR-S3-BUCKET/emroutput

Page 24:

-- set up Hive's "dynamic partitioning"

-- splits output files when writing to Amazon S3

hive>

hive>

Page 25:

-- compress output files on Amazon S3 using Gzip

hive>

hive>

hive>

hive>

Page 26:
Page 27:

-- convert the Apache log timestamp to a UNIX timestamp

-- split files in Amazon S3 by the hour in the log lines

hive>
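
The comments above describe two steps: converting the Apache log timestamp to a UNIX timestamp, and splitting output by the hour found in each log line. The original Hive statement did not survive in this transcript, so the following only illustrates the same conversion in Python. It assumes the common Apache format `dd/Mon/yyyy:HH:mm:ss ±zzzz`. The function name `parse_apache_time` and the sample value are hypothetical.

```python
from datetime import datetime
from typing import Tuple

# Assumed layout, e.g. 12/Nov/2014:09:30:15 +0000
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_apache_time(raw: str) -> Tuple[int, int]:
    """Return (unix_timestamp, hour) for an Apache access-log timestamp.

    Surrounding square brackets, as found in raw log lines, are stripped.
    The hour is taken from the log line's own local time, not from UTC.
    """
    parsed = datetime.strptime(raw.strip("[]"), APACHE_TIME_FORMAT)
    return int(parsed.timestamp()), parsed.hour


if __name__ == "__main__":
    unix_ts, hour = parse_apache_time("[12/Nov/2014:09:30:15 +0000]")
    print(unix_ts, hour)  # 1415784615 9
```

The returned hour is the kind of value that could serve as a partition key, so that each hour's records land in their own set of output files.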

Page 28:

Diagram labels: Log4J, EMR-Kinesis Connector, Hive with Amazon S3

Page 29:
Page 30:

YOUR-S3-BUCKET

YOUR-S3-BUCKET

Page 31:

# using the PostgreSQL CLI

YOUR-REDSHIFT-ENDPOINT

Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Redshift support:

• Aginity Workbench for Amazon Redshift

• SQL Workbench/J

Page 32:
Page 33:
Page 34:

YOUR-S3-BUCKET

YOUR-IAM-ACCESS-KEY

YOUR-IAM-SECRET-KEY

Page 35:

-- show all requests from a given IP address

-- count all requests on a given day

-- show all requests referred from other sites
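
The three comments above outline typical queries against the loaded access logs; the SQL itself was lost from this transcript. As a stand-in that runs without a Redshift cluster, the sketch below expresses the same three queries against an in-memory SQLite table. The table name `apachelogs`, its columns, the sample rows, and the site domain `example.com` are all hypothetical, and SQLite's dialect differs from Redshift's in places.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apachelogs (host TEXT, request_time TEXT, request TEXT, "
    "status INTEGER, referrer TEXT)"
)
# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO apachelogs VALUES (?, ?, ?, ?, ?)",
    [
        ("10.0.0.1", "2014-11-12 09:30:15", "GET /index.html", 200, "-"),
        ("10.0.0.2", "2014-11-12 10:01:02", "GET /missing", 404,
         "http://other.example.org/"),
        ("10.0.0.1", "2014-11-13 08:00:00", "GET /about.html", 200,
         "http://example.com/index.html"),
    ],
)

# Show all requests from a given IP address.
from_host = conn.execute(
    "SELECT request FROM apachelogs WHERE host = ? ORDER BY request_time",
    ("10.0.0.1",),
).fetchall()

# Count all requests on a given day.
(day_count,) = conn.execute(
    "SELECT COUNT(*) FROM apachelogs WHERE date(request_time) = ?",
    ("2014-11-12",),
).fetchone()

# Show all requests referred from other sites: a referrer is present and
# does not point back at our own (hypothetical) domain.
external = conn.execute(
    "SELECT request, referrer FROM apachelogs "
    "WHERE referrer <> '-' AND referrer NOT LIKE 'http://example.com/%'"
).fetchall()

print(from_host, day_count, external)
```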

Page 36:
Page 37:

Diagram labels: Log4J, EMR-Kinesis Connector, Hive with Amazon S3, Amazon Redshift, parallel COPY from Amazon S3

Page 38:

Bonus:

Page 39:
Page 40:

hive>

hive>

hive>

hive>

hive>

Page 41:

-- Create an external table on Amazon S3

-- to hold query results.

-- Partition (split files on Amazon S3) by iteration

hive>

YOUR-S3-BUCKET

Page 42:

-- set up a first iteration

-- create OS-ERROR_COUNT result (404 error codes) under dynamic partition 0

Page 43:

-- set up a second iteration over the data in the Kinesis Stream

-- create OS-ERROR_COUNT result under dynamic partition 1

-- if the file is empty, the previous iteration read all remaining stream data
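
The iteration comments above rely on checkpointing: each run records how far into the stream it read, and the next run resumes from that position, so an empty result means the earlier run already consumed everything. The following toy model illustrates that idea only. It is not the EMR-Kinesis connector's API; the class `CheckpointedReader`, the in-memory list standing in for the stream, and the dict standing in for the stored processing state are hypothetical.

```python
from typing import Dict, List


class CheckpointedReader:
    """Toy model of iteration-based reads over an append-only stream."""

    def __init__(self, stream: List[str]) -> None:
        self.stream = stream
        # Maps iteration number -> position just past the last record read.
        self.checkpoints: Dict[int, int] = {}

    def read(self, iteration: int) -> List[str]:
        """Read one iteration's records, resuming after the previous one."""
        start = self.checkpoints.get(iteration - 1, 0)
        if iteration not in self.checkpoints:
            # First run of this iteration: snapshot the stream's current end.
            self.checkpoints[iteration] = len(self.stream)
        # Re-running a finished iteration replays the same window.
        return self.stream[start:self.checkpoints[iteration]]


if __name__ == "__main__":
    stream = ["GET /a 200", "GET /b 404", "GET /c 404"]
    reader = CheckpointedReader(stream)
    print(len(reader.read(0)))  # 3: iteration 0 reads everything present
    print(len(reader.read(1)))  # 0: nothing new arrived since iteration 0
    stream.append("GET /d 404")
    print(len(reader.read(2)))  # 1: only the record added after iteration 1
```

Keeping the checkpoints outside the reader's process is what lets a later, separate run pick up where the previous one stopped.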

Page 44:

Diagram labels: Log4J, EMR-Kinesis Connector, Hive with Amazon S3, Amazon Redshift, parallel COPY from Amazon S3, Amazon Kinesis processing state

Page 45:

YOUR-S3-BUCKET

YOUR-S3-BUCKET

Page 46:

YOUR-S3-BUCKET YOUR-PREFIX.gz .

YOUR-PREFIX.gz

Page 47:

Big Data software on AWS Marketplace:

http://amzn.to/1va4KQ6

Page 48:

http://bit.ly/aws-bdt205

Page 49:

Learn from AWS big data experts

blogs.aws.amazon.com/bigdata

Page 50:

http://bit.ly/awsevals