AWS August Meetup
Evolving Your Big Data Use Cases from Batch to Real Time

Steve Abraham – Solutions Architect
August 17, 2016

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.




Agenda

Common Use Cases for Real-Time Analytics

Patterns & Practices

Tools

Q&A


Common Use Cases


Streaming Data Scenarios Across Verticals

| Scenarios / Verticals | Accelerated Ingest-Transform-Load | Continuous Metrics Generation | Responsive Data Analysis |
|---|---|---|---|
| Digital Ad Tech/Marketing | Publisher, bidder data aggregation | Advertising metrics like coverage, yield, and conversion | User engagement with ads, optimized bid/buy engines |
| IoT | Sensor, device telemetry data ingestion | Operational metrics and dashboards | Device operational intelligence and alerts |
| Gaming | Online data aggregation, e.g., top 10 players | Massively multiplayer online game (MMOG) live dashboard | Leader board generation, player-skill match |
| Consumer Online | Clickstream analytics | Metrics like impressions and page views | Recommendation engines, proactive care |


Customer Use Cases

Sonos runs near real-time streaming analytics on device data logs from their connected hi-fi audio equipment.

Analyzing 30TB+ clickstream data enabling real-time insights for Publishers.

Zillow uses Lambda and Kinesis to manage a global ingestion pipeline and produce quality analytics in real time without building infrastructure.

The Nordstrom recommendation team built an online stylist using Amazon Kinesis Streams and AWS Lambda.


Patterns & Practices


Big Data: Batch or Stream?

| | Batch Processing (Data Lake) | Real-time Processing (Stream) |
|---|---|---|
| Log Analysis | What went wrong an hour ago? | Take corrective action / notify admin |
| Billing Analysis | How much did you spend during the last billing cycle? | Notify user that billing limit is approaching |
| Customer Preferences | Which ads were most effective yesterday? | Which ad should the customer see right now? |
| Fraud | Audit trail / forensic evidence | Stop fraud now |


Data Streaming and Big Data

[Figure: chart with axes Velocity and Size; region labels Big Data, Data Streaming, and Batch Processing; service labels Kinesis and AWS Lambda.]


Two Main Processing Patterns

Stream processing (real time)
  Real-time response to events in data streams
  Examples:
    Proactively detect hardware errors in device logs
    Notify when inventory drops below a threshold
    Fraud detection

Micro-batching (near real time)
  Near real-time operations on small batches of events in data streams
  Examples:
    Aggregate and archive events
    Monitor performance SLAs
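The difference between the two patterns can be sketched in plain Python. This is only an illustration, not AWS code: the function names, the event dictionaries, and the inventory threshold are all hypothetical and chosen for the example.

```python
from typing import Callable, Iterable, Iterator, List


def process_stream(events: Iterable[dict], on_alert: Callable[[dict], None],
                   threshold: int = 10) -> None:
    """Stream processing: react to each event as soon as it arrives."""
    for event in events:
        # Hypothetical rule: alert when inventory drops below a threshold.
        if event["inventory"] < threshold:
            on_alert(event)


def micro_batches(events: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Micro-batching: group events into small batches before processing."""
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


events = [{"sku": "a", "inventory": 12}, {"sku": "b", "inventory": 3},
          {"sku": "c", "inventory": 25}]

# Stream pattern: one decision per event.
alerts: List[dict] = []
process_stream(events, alerts.append)

# Micro-batch pattern: one aggregate result per small batch.
averages = [sum(e["inventory"] for e in batch) / len(batch)
            for batch in micro_batches(events, batch_size=2)]
```

The per-event path suits alerting and fraud checks, while the batched path suits aggregation and archiving, where a short delay is acceptable.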


Data Streaming Design Pattern

[Figure: stage labels Collect, Store, Process, Analyze; component labels Data Collection and Storage, Data Processing, Event Processing, Data Analysis.]


Batch / Stream Architecture

[Figure: architecture diagram. Labels: data, Amazon Kinesis, Applications. Batch Layer: Amazon Kinesis Firehose, Amazon S3, Amazon Redshift, Amazon EMR (Presto, Hive, Pig, Spark). Speed Layer: KCL, AWS Lambda, Spark Streaming, Storm. Serving Layer: Amazon ES, Amazon DynamoDB, Amazon RDS, Amazon ElastiCache. Each layer is marked with an "answer" output. Legend: process, store.]


Primitive: Multi-Stage Decoupled “Data Bus”

Multiple Stages
Storage Decouples Processing Stages

[Figure: Data → Store → Process → Store → Process → Answers. Legend: process, store.]
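As a rough sketch of this primitive, the following Python uses in-memory queues as stand-ins for the durable stores between stages. The queue names, the `stage` helper, and the two transforms are hypothetical; in a real pipeline the stores would be services such as a stream or an object store.

```python
import queue
import threading

SENTINEL = object()  # marks end of data so each stage can shut down


def stage(source: queue.Queue, sink: queue.Queue, transform) -> None:
    """Read from one store, process each item, and write to the next store."""
    while True:
        item = source.get()
        if item is SENTINEL:
            sink.put(SENTINEL)  # propagate shutdown downstream
            return
        sink.put(transform(item))


# Three stores decouple two processing stages.
raw, parsed, answers = queue.Queue(), queue.Queue(), queue.Queue()

workers = [
    threading.Thread(target=stage, args=(raw, parsed, lambda s: int(s))),
    threading.Thread(target=stage, args=(parsed, answers, lambda n: n * n)),
]
for w in workers:
    w.start()

for record in ["1", "2", "3"]:
    raw.put(record)
raw.put(SENTINEL)

for w in workers:
    w.join()

# Drain the final store into a list of answers.
results = []
while True:
    item = answers.get()
    if item is SENTINEL:
        break
    results.append(item)
```

Because each stage only talks to the stores on either side of it, a slow or restarted stage does not block the producer; the store absorbs the backlog.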


Primitive: Multi-Stream Processing

Read from Kinesis
Write to Multiple Data Stores

[Figure: labels Data, Amazon Kinesis, AWS Lambda, Amazon DynamoDB, Amazon Firehose, Amazon S3. Legend: process, store.]
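A minimal sketch of the read-once, write-to-many idea, written as the body of a Lambda-style function. It relies on the event shape Lambda uses for Kinesis sources (`Records[].kinesis.data`, base64-encoded). The JSON payloads and the sink callables are assumptions for illustration; real sinks would wrap calls to the data stores.

```python
import base64
import json
from typing import Callable, Dict, List


def decode_kinesis_records(event: dict) -> List[dict]:
    """Decode the base64 payloads delivered for a Kinesis event source."""
    # Assumption: each payload is a JSON document.
    return [json.loads(base64.b64decode(r["kinesis"]["data"]))
            for r in event["Records"]]


def fan_out(event: dict, sinks: Dict[str, Callable[[dict], None]]) -> int:
    """Write every decoded record to every configured sink."""
    records = decode_kinesis_records(event)
    for record in records:
        for write in sinks.values():
            write(record)
    return len(records)


# Hypothetical stand-ins for real writers (a table put, a delivery-stream put).
table_rows: List[dict] = []
archive: List[dict] = []


def _encode(payload: dict) -> str:
    return base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")


sample_event = {"Records": [
    {"kinesis": {"data": _encode({"id": 1})}},
    {"kinesis": {"data": _encode({"id": 2})}},
]}

count = fan_out(sample_event,
                {"table": table_rows.append, "archive": archive.append})
```

Passing the sinks in as callables keeps the handler testable without any AWS access.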


Primitive: Analysis Frameworks

Can Read from Multiple Inputs

[Figure: labels Data, Amazon Kinesis, AWS Lambda, Amazon Firehose, Amazon S3, Amazon DynamoDB, Amazon EMR (Spark Streaming, Spark SQL), Answer. Legend: process, store.]


Real-time Analytics

[Figure: labels Data, Amazon Kinesis; processing options KCL, AWS Lambda, Spark Streaming, Apache Storm; services Amazon SNS, Amazon ML, Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES; outputs Notifications, Alert, App state, Real-time Prediction, KPI. Legend: process, store.]


Tools


Plethora of Tools

Amazon Glacier
Amazon S3
Amazon DynamoDB
Amazon RDS
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon Kinesis-enabled app
AWS Lambda
Amazon ML
Amazon SQS
Amazon ElastiCache
DynamoDB Streams
Amazon Elasticsearch Service
AWS IoT


Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Amazon Redshift and Amazon Elasticsearch

Capture and submit streaming data to Firehose
Firehose loads streaming data continuously into S3, Amazon Redshift and Amazon Elasticsearch
Analyze streaming data using your favorite BI tools

Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift and Amazon Elasticsearch without writing an application or managing infrastructure.

Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.

Seamless elasticity: Seamlessly scales to match data throughput without intervention.


Amazon Kinesis Streams
Managed service for real-time streaming

[Figure: labels Data Sources, AWS Endpoint, three Availability Zones, Shard 1, Shard 2, Shard N; consuming applications App.1 [Aggregate & De-Duplicate], App.2 [Metric Extraction], App.3 [Sliding Window Analysis], App.4 [Machine Learning]; downstream services Amazon S3, Amazon Redshift, AWS Lambda, Amazon EMR.]


Amazon Kinesis Streams
Managed ability to capture and store data

Streams are made of shards
Each shard ingests up to 1 MB/sec and 1000 records/sec
Each shard emits up to 2 MB/sec
All data is stored for 24 hours by default; storage can be extended for up to 7 days
Scale Kinesis streams using scaling utility
Replay data inside of 24-hour window or extended window


Amazon Kinesis Firehose vs. Amazon Kinesis Streams

Amazon Kinesis Streams is for use cases that require custom processing, per incoming record, with sub-1 second processing latency, and a choice of stream processing frameworks.

Amazon Kinesis Firehose is for use cases that require zero administration, ability to use existing analytics tools based on Amazon S3, Amazon Redshift and Amazon Elasticsearch, and a data latency of 60 seconds or higher.


Streaming Data Ingestion


Putting Data into Amazon Kinesis Streams

Determine your partition key strategy
  Managed buffer or streaming MapReduce job
  Ensure high cardinality for your shards

Provision adequate shards
  For ingress needs
  Egress needs for all consuming applications: if more than two simultaneous applications
  Include headroom for catching up with data in stream
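The provisioning guidance above can be turned into a back-of-the-envelope calculation. The sketch below uses the per-shard limits quoted in this deck (1 MB/sec and 1000 records/sec in, 2 MB/sec out). The function name and the 25% default headroom are assumptions for illustration, not an AWS recommendation.

```python
import math


def required_shards(ingress_mb_per_sec: float, records_per_sec: float,
                    consumer_apps: int, headroom: float = 0.25) -> int:
    """Estimate a shard count from ingress, egress, and headroom needs."""
    by_ingress_bytes = ingress_mb_per_sec / 1.0    # 1 MB/sec in per shard
    by_ingress_records = records_per_sec / 1000.0  # 1000 records/sec in per shard
    # Every consuming application reads the full stream, and all of them
    # share the 2 MB/sec that each shard can emit.
    by_egress = ingress_mb_per_sec * consumer_apps / 2.0
    base = max(by_ingress_bytes, by_ingress_records, by_egress)
    # Assumed headroom factor so consumers can catch up after falling behind.
    return max(1, math.ceil(base * (1.0 + headroom)))


# Example: 10 MB/sec, 5000 records/sec, three consuming applications.
shards_needed = required_shards(10, 5000, consumer_apps=3)
```

With one or two consumers, ingress usually dominates; with more than two, the egress term becomes the limiting factor, which matches the note about more than two simultaneous applications.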


Putting Data into Amazon Kinesis

Amazon Kinesis Agent (supports pre-processing)
  http://docs.aws.amazon.com/firehose/latest/dev/writing-with-agents.html

Pre-batch before Puts for better efficiency
  Consider Flume, Fluentd as collectors/agents
  See https://github.com/awslabs/aws-fluent-plugin-kinesis

Make a tweak to your existing logging
  log4j appender option
  See https://github.com/awslabs/kinesis-log4j-appender


Amazon Kinesis Producer Library

Writes to one or more Amazon Kinesis streams with automatic, configurable retry mechanism
Collects records and uses PutRecords to write multiple records to multiple shards per request
Aggregates user records to increase payload size and improve throughput
Integrates seamlessly with KCL to de-aggregate batched records
Use Amazon Kinesis Producer Library with AWS Lambda (New!)
Submits Amazon CloudWatch metrics on your behalf to provide visibility into producer performance
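To illustrate the "collect records, then write many per request" idea, here is a small batching sketch in Python. It is not the Kinesis Producer Library and makes no AWS calls. The limits used (500 records and 5 MB per PutRecords request) are assumed from the API documentation and should be checked against current quotas; the function and type names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumed PutRecords request limits; verify against the current API reference.
MAX_RECORDS_PER_REQUEST = 500
MAX_BYTES_PER_REQUEST = 5 * 1024 * 1024

Record = Tuple[str, bytes]  # (partition key, data blob)


def batch_for_put_records(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Group records into batches that respect the count and size limits."""
    batch: List[Record] = []
    size = 0
    for key, data in records:
        # The partition key counts toward the request size along with the data.
        record_size = len(key.encode("utf-8")) + len(data)
        if batch and (len(batch) == MAX_RECORDS_PER_REQUEST
                      or size + record_size > MAX_BYTES_PER_REQUEST):
            yield batch
            batch, size = [], 0
        batch.append((key, data))
        size += record_size
    if batch:  # flush the remainder
        yield batch


small = [(f"key-{i}", b"x" * 100) for i in range(1200)]
batch_sizes = [len(b) for b in batch_for_put_records(small)]
```

Each yielded batch could then be handed to a single PutRecords call, so the per-request overhead is paid once per batch instead of once per record.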


Record Order and Multiple Shards

Unordered processing
  Randomize partition key to distribute events over many shards and use multiple workers

Exact order processing
  Control partition key to ensure events are grouped into the same shard and read by the same worker

Need both? Use global sequence number

[Figure: labels Producer, Get Global Sequence, Unordered Stream, Campaign Centric Stream, Fraud Inspection Stream, Get Event Metadata.]
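The effect of the partition key on shard placement can be sketched as follows. Kinesis hashes the partition key with MD5 into a 128-bit hash key and routes the record to the shard whose hash key range contains it. The sketch assumes the shards split that range evenly, which is a simplification, and the helper names are hypothetical.

```python
import hashlib
import uuid

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit hash key


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) * shard_count // HASH_SPACE


def random_partition_key() -> str:
    """Unordered processing: a random key spreads events across shards."""
    return uuid.uuid4().hex


def ordered_partition_key(entity_id: str) -> str:
    """Exact-order processing: a stable key keeps one entity on one shard."""
    return entity_id


# The same campaign always lands on the same shard, so its events stay ordered.
campaign_shards = {shard_for_key(ordered_partition_key("campaign-42"), 4)
                   for _ in range(100)}

# Random keys end up spread over the available shards.
random_shards = {shard_for_key(random_partition_key(), 4) for _ in range(1000)}
```

A stable key buys ordering at the cost of possible hot shards; a random key buys even distribution at the cost of ordering.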


Sample Code for Scaling Shards

java -cp KinesisScalingUtils.jar-complete.jar -Dstream-name=MyStream -Dscaling-action=scaleUp -Dcount=10 -Dregion=eu-west-1 ScalingClient

Options:
  stream-name - The name of the stream to be scaled
  scaling-action - The action to be taken to scale. Must be one of "scaleUp", "scaleDown" or "resize"
  count - Number of shards by which to absolutely scale up or down, or resize

See https://github.com/awslabs/amazon-kinesis-scaling-utils


Amazon Kinesis Stream Processing


Amazon Kinesis Client Library

Build Kinesis Applications with Kinesis Client Library (KCL)
Open source client library available for Java, Ruby, Python, and Node.js developers
Deploy on your EC2 instances

KCL Application includes three components:
1. Record Processor Factory – Creates the record processor
2. Record Processor – Processor unit that processes data from a shard in Amazon Kinesis Streams
3. Worker – Processing unit that maps to each application instance


State Management with Kinesis Client Library

One record processor maps to one shard and processes data records from that shard
One worker maps to one or more record processors
Balances shard-worker associations when worker / instance counts change
Balances shard-worker associations when shards split or merge
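The mapping rules above (one record processor per shard, one or more record processors per worker, rebalancing when counts change) can be illustrated with a simple round-robin assignment. This is not the KCL's own coordination logic, which is more involved; the function, shard IDs, and worker names are hypothetical.

```python
from typing import Dict, List


def assign_shards(shards: List[str], workers: List[str]) -> Dict[str, List[str]]:
    """Spread shards over workers round-robin; each shard gets one processor."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment: Dict[str, List[str]] = {w: [] for w in workers}
    for i, shard in enumerate(sorted(shards)):
        assignment[workers[i % len(workers)]].append(shard)
    return assignment


shards = ["shard-0", "shard-1", "shard-2", "shard-3"]

# Two workers: each runs two record processors.
two_workers = assign_shards(shards, ["w1", "w2"])

# A third worker joins: the shard-worker associations are rebalanced.
three_workers = assign_shards(shards, ["w1", "w2", "w3"])
```

Recomputing the assignment whenever the worker list or the shard list changes mirrors the two rebalancing cases listed on the slide.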


Other Options

Third-party connectors (for example, Splunk)

AWS IoT platform

AWS Lambda

Amazon EMR with Apache Spark, Pig or Hive


Apache Spark and Amazon Kinesis Streams

Apache Spark is an in-memory cluster computing framework that uses RDDs (resilient distributed datasets) for fast processing
Spark Streaming can read directly from an Amazon Kinesis stream
Amazon software license linking – Add ASL dependency to SBT/Maven project, artifactId = spark-streaming-kinesis-asl_2.10

Example: Counting tweets on a sliding window

// Abbreviated as on the slide; the full createStream call takes more
// arguments (for example the StreamingContext, endpoint, and region).
KinesisUtils.createStream("twitter-stream")
  .filter(_.getText.contains("Open-Source"))
  .countByWindow(Seconds(5))
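For readers without a Spark cluster at hand, the sliding-window count can be approximated in plain Python. This is an analogue of the idea, not Spark's implementation: it emits a count on every event rather than on a fixed slide interval, and it assumes events arrive in timestamp order. The function name and the sample events are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def sliding_window_counts(events: Iterable[Tuple[float, str]], keyword: str,
                          window_seconds: float) -> Iterator[Tuple[float, int]]:
    """Yield (timestamp, number of matching events in the trailing window)."""
    window: Deque[float] = deque()  # timestamps of matching events
    for timestamp, text in events:
        if keyword in text:
            window.append(timestamp)
        # Drop matches that have fallen out of the trailing window.
        while window and window[0] <= timestamp - window_seconds:
            window.popleft()
        yield timestamp, len(window)


tweets = [(0, "Open-Source rocks"), (1, "hello"), (3, "Open-Source again"),
          (6, "Open-Source"), (9, "nothing to see")]

counts = list(sliding_window_counts(tweets, "Open-Source", window_seconds=5))
```

The deque keeps only the timestamps still inside the window, so memory use is bounded by the number of matches in one window.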


Common Integration Pattern with Amazon EMR
Tumbling Window Reporting

[Figure: Streaming Input (Amazon Kinesis Streams) → Tumbling/Fixed Window Aggregation (Amazon EMR) → Periodic Output → COPY from Amazon EMR (Amazon Redshift).]
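Tumbling (fixed) windows differ from sliding windows in that they do not overlap: each event belongs to exactly one window. The following Python sketch shows the aggregation step only. The function name, event tuples, and 60-second window are assumptions for illustration, and the periodic load into a warehouse is left out.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(events: Iterable[Tuple[float, str]],
                           window_seconds: int) -> Dict[int, Counter]:
    """Count events per key in fixed, non-overlapping windows.

    The result is keyed by the start time of each window.
    """
    windows: Dict[int, Counter] = {}
    for timestamp, key in events:
        # Every timestamp maps to exactly one window start.
        window_start = int(timestamp // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[key] += 1
    return windows


clicks = [(1, "view"), (30, "click"), (59, "view"), (60, "view"), (125, "click")]

report = tumbling_window_counts(clicks, window_seconds=60)
```

Each completed window yields one row set for the periodic output, which is what makes this pattern a natural fit for scheduled bulk loads.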


Amazon Kinesis Streams with AWS Lambda


Amazon Kinesis - KCL


Amazon Kinesis - Lambda


Example Architecture


Questions?


Thank you!