data & analytics - session 1 - big data analytics

Post on 27-Jan-2015

DESCRIPTION

Learn more about the tools, techniques and technologies for working productively with data at any scale. This presentation introduces the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

Speakers: Jon Einkauf, Senior Product Manager, Elastic MapReduce, AWS; Alan Priestley, Marketing Manager, Intel; and Bob Harris, CTO, Channel 4

TRANSCRIPT

Big Data Analytics

Jon Einkauf

Senior Product Manager, Amazon Elastic MapReduce

Overview

1. Introducing Big Data

2. From data to actionable information

3. Analytics and Cloud Computing

1. Introducing Big Data

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

The cost of data generation is falling

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower cost, higher throughput

Highly constrained

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011

IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Generated data

Available for analysis

Data volume

Elastic and highly scalable

+ No upfront capital expense

+ Only pay for what you use

+ Available on-demand

= Remove constraints

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Accelerated

Big Data: technologies and techniques for working productively with data, at any scale.

2. From data to actionable information

“Who buys video games?”

Per day:

3.5 billion records

13 TB of click stream logs

71 million unique cookies

Results:

500% return on ad spend

17,000% reduction in procurement time

“Who is using our service?”

Identified early mobile usage

Invested heavily in mobile development

Finding signal in the noise of logs

In January 2013: 9,432,061 unique mobile devices used the Yelp mobile app. 4 million+ calls. 5 million+ directions.
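As a rough illustration of how a "unique devices per month" figure can be pulled out of noisy logs, here is a minimal MapReduce-style sketch in plain Python. The log format (date, device ID, action), the device IDs and the function names are all hypothetical; the slides do not describe Yelp's actual log schema or jobs.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (month, device_id) pairs, skipping lines that do not parse."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3:          # noise: not "date device action"
            continue
        date, device_id, _action = parts
        yield date[:7], device_id    # key on "YYYY-MM"

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Count distinct device IDs for each month."""
    return {month: len(set(devices)) for month, devices in groups.items()}

# Hypothetical sample log lines (assumed format, not real data).
logs = [
    "2013-01-03 dev-a call",
    "2013-01-03 dev-b directions",
    "malformed line",
    "2013-01-09 dev-a directions",
    "2013-02-01 dev-c call",
]

unique_devices = reduce_phase(shuffle(map_phase(logs)))
print(unique_devices)
```

On a real cluster the map and reduce functions would run in parallel across nodes; the single-process version here only shows the shape of the computation.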

Open web index.

3.4 billion records.

Available to all.

Full parse for impact of social networks

300 lines of Ruby code.

14 hours.

$100.

You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011

Tweeting about Flu

3. Analytics and Cloud Computing

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

EC2 & Elastic MapReduce

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift

Generation

Collection & storage: S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase

Analytics & computation: EC2 & Elastic MapReduce

Collaboration & sharing: EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift

AWS Data Pipeline

Elastic MapReduce

How does it work?

1. Put the data into S3 (or HDFS)

2. Launch your cluster. Choose:

• Hadoop distribution

• How many nodes

• Node type (hi-CPU, hi-memory, etc.)

• Hadoop apps (Hive, Pig, HBase)

3. Get the results
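The choices made when launching a cluster can be sketched as a request for the `run_job_flow` call in boto3, the AWS SDK for Python. This is an illustration under assumptions, not the interface shown in the talk: the helper name, release label, instance type and role names below are placeholders, and the release label stands in for the "Hadoop distribution" choice. The function only builds the request dictionary, so it runs without AWS credentials.

```python
def build_cluster_request(name, node_count, node_type, apps,
                          release_label="emr-6.15.0", log_uri=None):
    """Build keyword arguments shaped for boto3's EMR run_job_flow call.

    All default values are illustrative assumptions, not recommendations.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")

    # One master node; any remaining nodes become core nodes.
    groups = [{
        "Name": "master",
        "InstanceRole": "MASTER",
        "InstanceType": node_type,
        "InstanceCount": 1,
        "Market": "ON_DEMAND",
    }]
    if node_count > 1:
        groups.append({
            "Name": "core",
            "InstanceRole": "CORE",
            "InstanceType": node_type,
            "InstanceCount": node_count - 1,
            "Market": "ON_DEMAND",
        })

    request = {
        "Name": name,
        "ReleaseLabel": release_label,                        # software release
        "Applications": [{"Name": app} for app in apps],      # e.g. Hive, Pig
        "Instances": {
            "InstanceGroups": groups,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",                 # assumed role names
        "ServiceRole": "EMR_DefaultRole",
    }
    if log_uri:
        request["LogUri"] = log_uri
    return request

request = build_cluster_request("demo-cluster", 10, "m5.xlarge", ["Hive", "Pig"])
# With credentials configured, this could be submitted with:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Setting "Market" to "SPOT" on an instance group is how the same request would ask for Spot capacity.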

• You can easily resize the cluster

• Use Spot nodes to save time and money

• Launch parallel clusters against the same data source (tune for the workload)

• When the work is complete, you can terminate the cluster (and stop paying)

• You can store everything in HDFS (local disk). High Storage nodes = 48 TB/node

• Launch in a Virtual Private Cloud for extra security

Thousands of Customers, 5+ Million Clusters

Give it a try.

Cost to run a 100-node EMR cluster:

£4.90 / hour
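To put the quoted figure in per-node terms, the arithmetic can be sketched as below. Only the £4.90 per hour and the 100 nodes come from the slide; the three-hour job is a made-up example, and the sketch assumes cost scales linearly with running time, ignoring billing granularity, storage and data transfer.

```python
from decimal import Decimal

CLUSTER_NODES = 100                    # from the slide
CLUSTER_RATE_GBP = Decimal("4.90")     # from the slide, per cluster-hour

def per_node_hourly_rate(cluster_rate, nodes):
    """Hourly cost of a single node, assuming all nodes cost the same."""
    return cluster_rate / nodes

def run_cost(cluster_rate, hours_running):
    """Cost of keeping the whole cluster up for the given number of hours."""
    return cluster_rate * hours_running

node_rate = per_node_hourly_rate(CLUSTER_RATE_GBP, CLUSTER_NODES)
three_hour_job = run_cost(CLUSTER_RATE_GBP, 3)   # hypothetical job length
print(f"per node-hour: £{node_rate}, three-hour job: £{three_hour_job}")
```

Because the cluster can be terminated when the work is complete, the job cost is the whole bill under these assumptions; nothing accrues while no cluster is running.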

AWS Data Pipeline

Data-intensive orchestration and automation

Reliable and scheduled

Easy to use, drag and drop

Execution and retry logic

Map data dependencies

Create and manage temporary compute resources

Anatomy of a pipeline

Additional checks and notifications

Arbitrarily complex pipelines
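The ideas of mapping data dependencies and applying execution and retry logic can be illustrated with a small, self-contained Python sketch. This is not how AWS Data Pipeline is implemented or configured; the `run_pipeline` helper, the activity names and the retry limit are hypothetical, and the ordering relies on the standard library's `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

def run_pipeline(activities, depends_on, max_attempts=3):
    """Run activities in dependency order, retrying each one on failure.

    activities: dict of name -> zero-argument callable
    depends_on: dict of name -> set of activity names that must finish first
    Returns a dict of name -> number of attempts used.
    """
    graph = {name: depends_on.get(name, set()) for name in activities}
    attempts = {}
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                activities[name]()
                break                      # success: move to the next activity
            except Exception:
                if attempt == max_attempts:
                    raise                  # retries exhausted: fail the pipeline
    return attempts

# Hypothetical three-step pipeline; the middle step fails once, then succeeds.
log = []
emr_calls = {"count": 0}

def copy_logs():
    log.append("copy_logs")

def emr_job():
    emr_calls["count"] += 1
    if emr_calls["count"] == 1:
        raise RuntimeError("transient failure")
    log.append("emr_job")

def load_report():
    log.append("load_report")

attempts = run_pipeline(
    {"load_report": load_report, "emr_job": emr_job, "copy_logs": copy_logs},
    {"emr_job": {"copy_logs"}, "load_report": {"emr_job"}},
)
print(log, attempts)
```

The activities are deliberately listed out of order to show that the declared dependencies, not the listing order, decide what runs first.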

Thanks. jeinkauf@amazon.com

To Learn More:

aws.amazon.com/elasticmapreduce

aws.amazon.com/datapipeline

aws.amazon.com/big-data

Back to the Future

Big Data at Channel 4

Bob Harris

Chief Technology Officer – Channel 4 Television

April 2013

The Disclaimer

<IMHO>

blah blah blah…..

</IMHO>

C4 in the Cloud

• 2008 – Started investigations into Cloud Computing

• 2008 – Launched our first applications on AWS

• 2009 – Entered into an Enterprise Agreement with Amazon for AWS

• 2009/2010 – Rapid growth of AWS-based offerings

• 2011 – AWS established as the default platform of choice for new websites

• 2012 – Adopted cloud-based analytics

• 2013 – Investigating cloud-based back-up and archiving

Why Big Data?

Business Intelligence at C4

• Well established Business Intelligence capability

• Based on industry standard proprietary products

• Real-time data warehousing

• Comprehensive business reporting

• Excellent internal skills

• Good external skills availability

Big Data at C4

2011

• Embarked on Big Data initiative in 2011

• Ran in-house and cloud-based PoCs

• Selected AWS Elastic MapReduce

2012

• Ran EMR in parallel with conventional BI stack

• Hive deployed to Data Analysts in 2012

• EMR workflows deployed to production in 2012

2013

• EMR confirmed as primary Big Data platform

• EMR usage growing, focus on automation

• Experimenting with R and Mahout

Big Data at C4 – Big Data Control Panel

Big Data at C4 – Elastic MapReduce

• AWS EMR established as our Big Data platform of choice

• Friendly front-end developed to allow Data Analysts to start/stop clusters and submit/track queries.

• Production workflows written predominantly in Python and Pig

• Fully integrated with our conventional BI stack making EMR outputs available for reporting

• Experimenting with ADP (AWS Data Pipeline)

• Next steps – MapR and HBase

Personalising the viewer experience

Most popular dramas

Drama collections

US drama

Single view of the viewer recognising them across devices and serving relevant content

Big Data – Improving Viewer Experience

Myths or Truths? – It’s all about Perspective!

• Nothing that can’t be done with an RDBMS

• It’s a completely different approach

• It’s really difficult

• It’s immature and lacks good tools

• It’s totally incompatible with your current BI platform and tools

• It’s difficult to find skilled and experienced staff

Image by Tayrawr Fortune

Elastic MapReduce has provided a cost-effective approach to establishing our Big Data platform

That’s all folks…

bharris@channel4.co.uk

@bobharrisuk

uk.linkedin.com/in/bobharrisuk01

Alan Priestley

EMEA Enterprise Marketing

Intel Corporation

Analysis of Data Can Transform Society

Create new business models and improve organizational processes.

Enhance scientific understanding, drive innovation, and accelerate medical cures.

Increase public safety and improve energy efficiency with smart grids.

Democratizing Analytics gets Value out of Big Data

Unlock Value in Silicon

Support Open Platforms

Deliver Software Value

Intel at the Intersection of Big Data

HPC: Enabling exascale computing on massive data sets

Cloud: Helping enterprises build open interoperable clouds

Open Source: Contributing code and fostering ecosystem

Intel at the Heart of the Cloud

Server

Storage

Network

Scale-Out Platform Optimizations for Big Data

Cost-effective performance

• Intel® Advanced Vector Extensions Technology

• Intel® Turbo Boost Technology 2.0

• Intel® Advanced Encryption Standard New Instructions Technology

Intel® Advanced Vector Extensions Technology

• Newest in a long line of processor instruction innovations

• Increases floating point operations per clock, up to 2X performance [1]

[1] Performance comparison using Linpack benchmark. See backup for configuration details.

For more legal information on performance forecasts go to http://www.intel.com/performance

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel® Turbo Boost Technology 2.0

More performance: higher turbo speeds maximize performance for single- and multi-threaded applications

Intel® Advanced Encryption Standard New Instructions

• Processor assistance for performing AES encryption: 7 new instructions

• Makes enabled encryption software faster and stronger

Power of the Platform built by Intel

Richer user experiences

TeraSort for 1TB sort: from 4 hrs (previous Intel® Xeon® processor) to 10 min

• Intel® Xeon® Processor E5 2600: 50% reduction

• Solid-State Drive: 80% reduction

• 10G Ethernet: 50% reduction

• Intel® Apache Hadoop: 40% reduction

Cloud

Intelligent Systems

Clients

Virtuous Cycle of Data-Driven Experience

Get 600 hours of free supercomputing time!

www.powerof60.com

Thank you!
