Data & Analytics - Session 1 - Big Data Analytics

Big Data Analytics. Jon Einkauf, Senior Product Manager, Amazon Elastic MapReduce

Upload: amazon-web-services

Post on 27-Jan-2015

Category:

Technology


DESCRIPTION

Learn more about the tools, techniques and technologies for working productively with data at any scale. This presentation introduces the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

Speakers: Jon Einkauf, Senior Product Manager, Elastic MapReduce, AWS; Alan Priestley, Marketing Manager, Intel; and Bob Harris, CTO, Channel 4.

TRANSCRIPT

Page 1: Data & Analytics - Session 1 -  Big Data Analytics

Big Data Analytics

Jon Einkauf

Senior Product Manager, Amazon Elastic MapReduce

Page 2: Data & Analytics - Session 1 -  Big Data Analytics

Overview

1. Introducing Big Data

2. From data to actionable information

3. Analytics and Cloud Computing

Page 3: Data & Analytics - Session 1 -  Big Data Analytics

1. Introducing Big Data

Page 4: Data & Analytics - Session 1 -  Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Page 5: Data & Analytics - Session 1 -  Big Data Analytics

The cost of data generation is falling

Page 6: Data & Analytics - Session 1 -  Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower cost, higher throughput

Page 7: Data & Analytics - Session 1 -  Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower cost, higher throughput

Highly constrained

Page 8: Data & Analytics - Session 1 -  Big Data Analytics

Chart: data volume over time, generated data vs. data available for analysis

Sources:

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011

IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Page 9: Data & Analytics - Session 1 -  Big Data Analytics

Elastic and highly scalable

+ No upfront capital expense

+ Only pay for what you use

+ Available on-demand

= Remove constraints

Page 10: Data & Analytics - Session 1 -  Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower cost, higher throughput

Highly constrained

Page 11: Data & Analytics - Session 1 -  Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Accelerated

Page 12: Data & Analytics - Session 1 -  Big Data Analytics

Big Data

Technologies and techniques for working productively with data, at any scale.

Page 13: Data & Analytics - Session 1 -  Big Data Analytics

2. From data to actionable information

Page 14: Data & Analytics - Session 1 -  Big Data Analytics

“Who buys video games?”

Page 15: Data & Analytics - Session 1 -  Big Data Analytics

Per day:

3.5 billion records

13 TB of click stream logs

71 million unique cookies

Page 16: Data & Analytics - Session 1 -  Big Data Analytics
Page 17: Data & Analytics - Session 1 -  Big Data Analytics
Page 18: Data & Analytics - Session 1 -  Big Data Analytics

Results:

500% return on ad spend

17,000% reduction in procurement time

Page 19: Data & Analytics - Session 1 -  Big Data Analytics

“Who is using our service?”

Page 20: Data & Analytics - Session 1 -  Big Data Analytics

Identified early mobile usage

Invested heavily in mobile development

Finding signal in the noise of logs

Page 21: Data & Analytics - Session 1 -  Big Data Analytics

In January 2013:

9,432,061 unique mobile devices used the Yelp mobile app.

4 million+ calls. 5 million+ directions.
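A unique-device count like the one on this slide is a natural fit for the map and reduce steps that Hadoop runs at scale. The following is a minimal single-machine sketch of that idea, not Yelp's code. The tab-separated log layout, the `MOBILE_MARKERS` tuple, and the sample lines are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical substrings used to classify a user agent as mobile.
MOBILE_MARKERS = ("iPhone", "Android")


def map_line(line):
    """Map step: emit one (platform, device_id) pair per log line.

    Assumes a made-up layout: timestamp <TAB> device_id <TAB> user_agent.
    """
    _timestamp, device_id, user_agent = line.rstrip("\n").split("\t")
    is_mobile = any(marker in user_agent for marker in MOBILE_MARKERS)
    yield ("mobile" if is_mobile else "other"), device_id


def reduce_unique(pairs):
    """Reduce step: count distinct device ids seen for each platform."""
    seen = defaultdict(set)
    for platform, device_id in pairs:
        seen[platform].add(device_id)
    return {platform: len(ids) for platform, ids in seen.items()}


# Invented sample data; dev-1 appears twice and must be counted once.
sample_log = [
    "2013-01-05T10:00:00\tdev-1\tMozilla/5.0 (iPhone; CPU iPhone OS 6_0)",
    "2013-01-05T10:00:02\tdev-2\tMozilla/5.0 (Windows NT 6.1)",
    "2013-01-05T10:01:10\tdev-1\tMozilla/5.0 (iPhone; CPU iPhone OS 6_0)",
    "2013-01-05T10:02:30\tdev-3\tMozilla/5.0 (Linux; Android 4.1)",
]

pairs = [pair for line in sample_log for pair in map_line(line)]
unique_devices = reduce_unique(pairs)
```

On a real cluster the map step would run in parallel over log files and the framework would group pairs by key before the reduce step; here a list comprehension stands in for both.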

Page 22: Data & Analytics - Session 1 -  Big Data Analytics

Open web index.

3.4 billion records.

Available to all.

Page 23: Data & Analytics - Session 1 -  Big Data Analytics

Full parse for impact of social networks

300 lines of Ruby code.

14 hours.

$100.

Page 24: Data & Analytics - Session 1 -  Big Data Analytics

You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011

Tweeting about Flu

Page 25: Data & Analytics - Session 1 -  Big Data Analytics

3. Analytics and Cloud Computing

Page 26: Data & Analytics - Session 1 -  Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Page 27: Data & Analytics - Session 1 -  Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase

Page 28: Data & Analytics - Session 1 -  Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

EC2 & Elastic MapReduce

Page 29: Data & Analytics - Session 1 -  Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift

Page 30: Data & Analytics - Session 1 -  Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift

EC2 & Elastic MapReduce

S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase

AWS Data Pipeline

Page 31: Data & Analytics - Session 1 -  Big Data Analytics

Elastic MapReduce

Page 32: Data & Analytics - Session 1 -  Big Data Analytics

How does it work?

EMR

EMR Cluster S3

1. Put the data into S3 (or HDFS)

2. Launch your cluster. Choose:

• Hadoop distribution

• How many nodes

• Node type (hi-CPU, hi-memory, etc.)

• Hadoop apps (Hive, Pig, HBase)

3. Get the results
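The launch choices on this slide (node count, node type, Hadoop apps) can be sketched as a request for boto3's EMR `run_job_flow` call. This is an illustrative sketch, not part of the original deck: the cluster name, bucket, instance type and release label are placeholder assumptions, and the block only builds the request so that it runs without AWS credentials.

```python
def build_cluster_request(name, node_count, node_type, apps, log_uri):
    """Build keyword arguments in the shape boto3's EMR run_job_flow accepts."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")

    # One master node, with any remaining nodes placed in a core group.
    groups = [{
        "Name": "master",
        "InstanceRole": "MASTER",
        "InstanceType": node_type,
        "InstanceCount": 1,
    }]
    if node_count > 1:
        groups.append({
            "Name": "core",
            "InstanceRole": "CORE",
            "InstanceType": node_type,
            "InstanceCount": node_count - 1,
        })

    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick one that suits you
        "LogUri": log_uri,
        "Applications": [{"Name": app} for app in apps],
        "Instances": {
            "InstanceGroups": groups,
            # False means the cluster shuts down once its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EMR role names
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request(
    name="clickstream-analysis",           # hypothetical name
    node_count=10,
    node_type="m5.xlarge",                 # hypothetical node type
    apps=["Hive", "Pig"],
    log_uri="s3://example-bucket/emr-logs/",  # hypothetical bucket
)

# With AWS credentials configured, the request could be submitted with:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as plain data makes it easy to vary node count or node type per workload before anything is launched.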

Page 33: Data & Analytics - Session 1 -  Big Data Analytics

EMR

EMR Cluster

How does it work?

S3

You can easily resize the cluster

Page 34: Data & Analytics - Session 1 -  Big Data Analytics

EMR

EMR Cluster

How does it work?

S3

Use Spot nodes to save time and money

Page 35: Data & Analytics - Session 1 -  Big Data Analytics

EMR

EMR Cluster

How does it work?

S3

Launch parallel clusters against the same data source (tune for the workload)

Page 36: Data & Analytics - Session 1 -  Big Data Analytics

How does it work?

EMR Cluster S3

When the work is complete, you can terminate the cluster (and stop paying)

Page 37: Data & Analytics - Session 1 -  Big Data Analytics

EMR Cluster

How does it work?

You can store everything in HDFS (local disk)

High Storage nodes = 48 TB/node
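Raw disk is not the same as usable HDFS capacity, because HDFS stores several copies of each block. A rough sizing sketch, assuming the 48 TB/node figure from the slide and HDFS's stock default replication factor of 3 (a given cluster may be configured differently, and space reserved for the operating system and temporary data is ignored here):

```python
def usable_hdfs_tb(nodes, raw_tb_per_node=48, replication=3):
    """Estimate usable HDFS capacity in TB for a cluster of storage nodes.

    raw_tb_per_node defaults to the slide's figure; replication=3 is an
    assumption based on the HDFS default, not a value from the deck.
    """
    if nodes < 1 or replication < 1:
        raise ValueError("nodes and replication must be at least 1")
    return nodes * raw_tb_per_node / replication


ten_node_capacity = usable_hdfs_tb(10)          # 480 TB raw, three copies
single_copy = usable_hdfs_tb(10, replication=1)  # no redundancy
```

Under these assumptions ten nodes hold 480 TB of raw disk but about a third of that in unique data.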

Page 38: Data & Analytics - Session 1 -  Big Data Analytics

EMR Cluster

How does it work?

Launch in a Virtual Private Cloud for extra security

Page 39: Data & Analytics - Session 1 -  Big Data Analytics

Thousands of Customers, 5+ Million Clusters

Page 40: Data & Analytics - Session 1 -  Big Data Analytics

Give it a try.

Cost to run a 100-node EMR cluster:

£4.90 / hour
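The hourly figure makes it easy to estimate what a job might cost. The sketch below works only from the slide's £4.90 for 100 nodes and assumes the price scales linearly with node count and running time; the slide does not state the instance type or pricing model, so treat the results as rough estimates.

```python
from decimal import Decimal

CLUSTER_RATE_GBP = Decimal("4.90")  # per hour for the whole cluster (from the slide)
CLUSTER_NODES = 100                 # cluster size quoted on the slide


def node_hour_rate():
    """Implied price of one node for one hour."""
    return CLUSTER_RATE_GBP / CLUSTER_NODES


def job_cost(nodes, hours):
    """Estimated cost in GBP, assuming linear scaling (an assumption)."""
    if nodes < 0 or hours < 0:
        raise ValueError("nodes and hours must be non-negative")
    return node_hour_rate() * nodes * hours


three_hour_job = job_cost(nodes=100, hours=3)   # the slide's cluster for 3 hours
small_cluster = job_cost(nodes=20, hours=3)     # a fifth of the nodes
```

Because a terminated cluster stops accruing cost, the estimate depends only on the hours the cluster is actually running.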

Page 41: Data & Analytics - Session 1 -  Big Data Analytics

AWS Data Pipeline

Data-intensive orchestration and automation

Reliable and scheduled

Easy to use, drag and drop

Execution and retry logic

Map data dependencies

Create and manage temporary compute resources
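Two of the ideas on this slide, mapping data dependencies and execution with retry logic, can be illustrated with a few lines of plain Python. This is a conceptual sketch only: it is not the AWS Data Pipeline API, and the task names and the `run_pipeline` helper are hypothetical. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, depends_on, max_attempts=3):
    """Run tasks in dependency order, retrying each up to max_attempts times.

    tasks: dict mapping a task name to a zero-argument callable.
    depends_on: dict mapping a task name to the set of tasks it waits for.
    Returns a dict mapping each task name to the number of attempts it took.
    """
    graph = {name: depends_on.get(name, set()) for name in tasks}
    attempts_used = {}
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                attempts_used[name] = attempt
                break
            except Exception:
                # Give up and surface the error once retries are exhausted.
                if attempt == max_attempts:
                    raise
    return attempts_used


# Hypothetical three-stage pipeline; the middle task fails once, then succeeds.
log = []
flaky_calls = {"count": 0}


def copy_logs():
    log.append("copy_logs")


def run_emr_job():
    flaky_calls["count"] += 1
    if flaky_calls["count"] < 2:
        raise RuntimeError("transient failure")
    log.append("run_emr_job")


def load_results():
    log.append("load_results")


attempts = run_pipeline(
    tasks={"load_results": load_results, "run_emr_job": run_emr_job, "copy_logs": copy_logs},
    depends_on={"run_emr_job": {"copy_logs"}, "load_results": {"run_emr_job"}},
)
```

A managed service adds scheduling, notifications and temporary compute resources on top of this ordering-and-retry core.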

Page 42: Data & Analytics - Session 1 -  Big Data Analytics

Anatomy of a pipeline

Page 43: Data & Analytics - Session 1 -  Big Data Analytics

Additional checks and notifications

Page 44: Data & Analytics - Session 1 -  Big Data Analytics

Arbitrarily complex pipelines

Page 45: Data & Analytics - Session 1 -  Big Data Analytics

Thanks. [email protected]

To Learn More:

aws.amazon.com/elasticmapreduce

aws.amazon.com/datapipeline

aws.amazon.com/big-data

Page 46: Data & Analytics - Session 1 -  Big Data Analytics

Back to the Future

Big Data at Channel 4

Bob Harris

Chief Technology Officer – Channel 4 Television

April 2013

Page 47: Data & Analytics - Session 1 -  Big Data Analytics

The Disclaimer

<IMHO>

blah blah blah…..

</IMHO>

Page 48: Data & Analytics - Session 1 -  Big Data Analytics

C4 in the Cloud

• 2008 – Started investigations into Cloud Computing

• 2008 – Launched our first applications on AWS

• 2009 – Entered into an Enterprise Agreement with Amazon for AWS

Rapid growth of AWS based offerings during 2009/2010

• 2011 – AWS established as the default platform of choice for new websites

Page 49: Data & Analytics - Session 1 -  Big Data Analytics

C4 in the Cloud

Page 50: Data & Analytics - Session 1 -  Big Data Analytics

C4 in the Cloud

• 2008 – Started investigations into Cloud Computing

• 2008 – Launched our first applications on AWS

• 2009 – Entered into an Enterprise Agreement with Amazon for AWS

Rapid growth of AWS based offerings during 2009/2010

• 2011 – AWS established as the default platform of choice for new websites

• 2012 – Adopted cloud-based analytics

• 2013 – Investigating cloud-based back-up and archiving

Page 51: Data & Analytics - Session 1 -  Big Data Analytics

Why Big Data?

Page 52: Data & Analytics - Session 1 -  Big Data Analytics

Business Intelligence at C4

• Well-established Business Intelligence capability

• Based on industry-standard proprietary products

• Real-time data warehousing

• Comprehensive business reporting

• Excellent internal skills

• Good external skills availability

Page 53: Data & Analytics - Session 1 -  Big Data Analytics

Big Data at C4

2011

• Embarked on Big Data initiative in 2011

• Ran in-house and cloud-based PoCs

• Selected AWS Elastic MapReduce

2012

• Ran EMR in parallel with conventional BI stack

• Hive deployed to Data Analysts in 2012

• EMR workflows deployed to production in 2012

2013

• EMR confirmed as primary Big Data platform

• EMR usage growing, focus on automation

• Experimenting with R and Mahout

Page 54: Data & Analytics - Session 1 -  Big Data Analytics

Big Data at C4 – Elastic MapReduce

• AWS EMR established as our Big Data platform of choice

• Friendly front-end developed to allow Data Analysts to start/stop clusters and submit/track queries.

Page 55: Data & Analytics - Session 1 -  Big Data Analytics

Big Data at C4 – Big Data Control Panel

Page 56: Data & Analytics - Session 1 -  Big Data Analytics

Big Data at C4 – Elastic MapReduce

• AWS EMR established as our Big Data platform of choice

• Friendly front-end developed to allow Data Analysts to start/stop clusters and submit/track queries.

• Production workflows written predominantly in Python and Pig

• Fully integrated with our conventional BI stack, making EMR outputs available for reporting

• Experimenting with ADP (AWS Data Pipeline)

• Next steps – MapR and HBase

Page 57: Data & Analytics - Session 1 -  Big Data Analytics

Big Data – Improving Viewer Experience

Personalising the viewer experience

Most popular dramas

Drama collections

US drama

Single view of the viewer recognising them across devices and serving relevant content

Page 58: Data & Analytics - Session 1 -  Big Data Analytics

Myths or Truths? – It’s all about Perspective!

• Nothing that can’t be done with an RDBMS

• It’s a completely different approach

• It’s really difficult

• It’s immature and lacks good tools

• It’s totally incompatible with your current BI platform and tools

• It’s difficult to find skilled and experienced staff

Image by Tayrawr Fortune

Elastic MapReduce has provided a cost-effective approach to establishing our Big Data platform

Page 59: Data & Analytics - Session 1 -  Big Data Analytics

That’s all folks…

[email protected]

@bobharrisuk

uk.linkedin.com/in/bobharrisuk01

Page 60: Data & Analytics - Session 1 -  Big Data Analytics

Alan Priestley

EMEA Enterprise Marketing

Intel Corporation

Page 61: Data & Analytics - Session 1 -  Big Data Analytics

Analysis of Data Can Transform Society

Create new business models and improve organizational processes.

Enhance scientific understanding, drive innovation, and accelerate medical cures.

Increase public safety and improve energy efficiency with smart grids.

Page 62: Data & Analytics - Session 1 -  Big Data Analytics

Democratizing Analytics gets Value out of Big Data

Unlock Value in Silicon

Support Open Platforms

Deliver Software Value

Page 63: Data & Analytics - Session 1 -  Big Data Analytics

Intel at the Intersection of Big Data

HPC: Enabling exascale computing on massive data sets

Cloud: Helping enterprises build open interoperable clouds

Open Source: Contributing code and fostering ecosystem

Page 64: Data & Analytics - Session 1 -  Big Data Analytics

Intel at the Heart of the Cloud

Server

Storage

Network

Page 65: Data & Analytics - Session 1 -  Big Data Analytics

Scale-Out Platform Optimizations for Big Data

Cost-effective performance

• Intel® Advanced Vector Extensions Technology

• Intel® Turbo Boost Technology 2.0

• Intel® Advanced Encryption Standard New Instructions Technology

Page 66: Data & Analytics - Session 1 -  Big Data Analytics

Intel® Advanced Vector Extensions Technology

• Newest in a long line of processor instruction innovations

• Increases floating point operations per clock, up to 2X performance [1]

[1] Performance comparison using Linpack benchmark. See backup for configuration details.

For more legal information on performance forecasts go to http://www.intel.com/performance

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Page 67: Data & Analytics - Session 1 -  Big Data Analytics

Intel® Turbo Boost Technology 2.0

More Performance

Higher turbo speeds maximize performance for single and multi-threaded applications

Page 68: Data & Analytics - Session 1 -  Big Data Analytics

Intel® Advanced Encryption Standard New Instructions

• Processor assistance for performing AES encryption: 7 new instructions

• Makes enabled encryption software faster and stronger

Page 69: Data & Analytics - Session 1 -  Big Data Analytics

Power of the Platform built by Intel

Richer user experiences

Chart: TeraSort for 1TB sort, from 4 HRS on the previous Intel® Xeon® Processor to 10 MIN

Intel® Xeon® Processor E5 2600: 50% Reduction

Solid-State Drive: 80% Reduction

10G Ethernet: 50% Reduction

Intel® Apache Hadoop: 40% Reduction

Page 70: Data & Analytics - Session 1 -  Big Data Analytics

Virtuous Cycle of Data-Driven Experience

Cloud

Intelligent Systems

Clients

Page 71: Data & Analytics - Session 1 -  Big Data Analytics

Get 600 Hours of free supercomputing time!

www.powerof60.com

Page 72: Data & Analytics - Session 1 -  Big Data Analytics

Thank you!