performance management in ‘big data’ applications

Performance Management in ‘Big Data’ ApplicationsIt’s still about the Application

Michael Kopp, Technology Strategist

[email protected]

@mikopp

blog.dynatrace.com

Edward Capriolo

[email protected]

@edwardcapriolo

m6d.com/blog

High Volume/Low Latency DBs

3

JavaWeb

BigData

Key Benefits1) Fast Read/Write2) Horizontal Scalability3) Redundancy and High

Availability

Key Challenges1) Even Distribution2) Correct Schema and Access

patterns3) Understanding Application Impact

Hive high-levelmap/reducequery JOB

123

BigData

JOB

1234...

754Key Benefits1) Massive Horizontal Batch Job2) Split big Problems into smaller

ones

Hive Server

batchtrigger

Large Parallel Batch Processing

Key Challenges1) Optimal Distribution 2) Unwieldy Configuration3) Can easily waste your

resources

What is m6d?

5

6

Impressions look like…

7

Map Reduce Performance

8

Typical MapReduce Job at m6d

9

Hadoop at m6d

• Critical piece of infrastructure

• Long Term Data Storage

– Raw logs

– Aggregations

– Reports

– Generated data (feed back loops)

• Numerous ETL (Extract Transform Load)

• Scheduled and adhoc processes

• Used directly by Tech-Team, Ad Ops, Data Science

Hadoop at m6d

• Two deployments 'production' and 'research'– ~ 500 TB - 40+ Nodes

– ~ 350 TB – 20+ Nodes

• Thousands of jobs – <5 minute jobs and 12 hour Job Flows

– Mostly Hive Jobs

– Some custom code and streaming jobs

Hadoop Design Tenants

• Linear scalability by adding more hardware

• HDFS Distributed file system

– User space file system

– Blocks are replicated across nodes

– Limited semantics

• MapReduce

– Paradigm which models using map/reduce

– Data Locality

– Split Job into Tasks by Data

– Retry in failure

12

Schema Design Challenges

• Partition data for good distribution

– By time interval (optionally a second level)

• Partition pruning with WHERE

– Clustering (aka bucketing)

• Optimized sampling and joins

– Columnar

• Column oriented • Raw Data Growth

• Data features change (more distinct X)

13

Key Performance Challenges

• Intermediate I/O

– Compression codec

– Block size

– Split-table formats

• Contentions between jobs

• Data and Map/Reduce Distribution• Data Skew

• Non Uniform Computation (long running tasks)

• ‘Cost' of new feature – is this justified?

• Tuning variables (spills, buffers, Etc, etc)

14

How to handle Performance Issues?

• Profile the Job / Query?– Who should do this?

(DBA, Dev, Ops, DevOps , NoOps, Big Data Guru)

– How should we do this?• Look at job run times day over day?

• Look at code and micro-benchmark?

• Collect Job Counters?

• Upgrade often for latest performance features?

• Investigate/purchase newer better hardware– More cores? RAM? 10G Ethernet? SSD

• Read blogs?Test Data is not like

Real Data

15

But how to optimize the job itself?

16

Understanding Map/Reduce Performance

Maximum Parallelism

Actual Mapping Parallelism

Also your own Code

Attention Data Volume!

Attention Potential Choke

Point!

Maximum Reduce

Parallelism

Actual Reduce Parallelism

Also your own Code

Millions of Executions!!!

Understanding Map/Reduce Performance

18

Map/Reduce Performance

19

Map/Reduce behind the scenesSerialize

De-Serialize and Serialize

again

Potentionally Inefficient

Too Many Files, Same Key

spread all over

Expensive Synchronous

Combine

De-Serialize and Serialize

again

20

Map/Reduce Combine and Spill Performance

1) Pre Combine in Mapping Step2) Avoid many intermediate files and combines

21

Map/Reduce “Map” Performance

Focus on Big HotspotsAvoid Brute ForceSave a lot of HardwareThen Optimize Hadoop

22

Map/Reduce to the Max!

• Ensure Data Locality

• Optimize Map/Reduce Hotspots

• Reduce Intermediate Data and “Overhead”

• Ensure optimal Data and Compute Distribution

• Tune Hadoop Environment

23

Cassandra and

Application Performance

24

1. Browsers visit Publishers and create impressions.2. Publishers sell impressions via Exchanges.3. Exchanges serve as auction houses for the impressions4. On behalf of the marketer, m6d bids the impressions via

the auction house. If m6d wins, we display our ad to the browser.

A High Level look at RTB

25

Cassandra at m6d for Real Time Bidding

• RTB limited data is provided from exchange

• System to store information on users

– Frequency Capping

– Visit History

– Segments (product service affinity)

• Low latency Requirements

– Less then 100ms

– Requires fast read/write on discrete data

26

Cassandra design

Key Cassandra Design Tennents

• Swap/paging not possible

• Mostly schema-less

• Writes do not read– Read/Write is an anti-pattern

• Optimize around put and get– Not for scan and query

• De-Normalize data– Attempt to get all data in single read*

28

Cassandra Design Challenges

• De-normailize

– Store data to optimize reads

– Composite (multi-column) keys

• Multi-column family and Multi-tenant scenarios

• Compress settings

– Disk and cache savings

– CPU and JVM costs

• Data/Compaction settings

– Size tiered vs LevelDB

• Caching, Memtable and other tuning

29

How to handle performance issues?

• Monitor standard vitals (cpu,disk) ?

• Read blogs and documentation?

• Use Cassandra JMX to track req/sec

• Use Cassandra JMX to track size of Column Families, rows and columns

• Upgrade often to get latest performance enhancements? *

What about the Application?

30

APM for Cassandra

NoSQL APM is not so different after all…

31

JavaWeb

Key APM Problems Identified1) Response Time Contribution2) data access patterns3) transaction to query

relationship (transaction flow)

Database

32

Response Time Contribution

Access PatternAccess PatternAccess Pattern

Contribution to Business Transaction Connection Pool

33

Statement Analysis

Contribution to Business Transaction

Executions per Transactions and

Total

Average and Total Execution Time

34

Where, Why, How and which Transaction…

Where and why in my Transaction

Single Statement Performance

Which Web Service

Which Business Transaction

How does this apply to NoSQL Databases?

35

Key APM Problems Identified1) Response Time Contribution2) data access patterns3) transaction to query

relationship (transaction flow)

1) Data Access Distribution2) End-to-End Monitoring3) Storage (I/O, GC) Bottlenecks4) Consistency Level

JavaWeb

37

Real End-to-End Application Performance

Third Party

Services

External

End User

Our Application

End User Response Time Contribution

38

Understanding Cassandra’s Contribution

Which statements did the Transaction Execute?Which node where they executed against?Which Consistency Level was used?

Contribution of each StatmentToo many calls? Data Access patterns

39

Understand Response Time Contribution

4 Calls~15 ms Contribution

5 Calls~50-80 ms Contribution?

Access and Data Distribution

40

Why and how was a statement executed?

45ms latency? 60ms waiting on the server?

41

Any Hotspots on the Cassandra Nodes?

Much more load on Node3?Which Transactions are

responsible

42

Specific Cassandra Health Metrics

43

General Health of Cassandra

Too much GC Suspensions?

Memory Issues?

Too many requests?

44

Conclusion

45

Extend Performance Focus on Application

JavaWeb

A Fast Database doesn’t make a fast Application

Hive high-levelmap/reducequery JOB

123

JOB

1234...

754

master node

Hive Server

batchtrigger

data/task node

data/task node

Intelligent MapReduce APM

data/task node

Simple Optimizations with big impact

47

Big Data is about solving Application Problems

APM is about Application Performance and Efficiency

THANK YOU

48

Michael Kopp, Technology Strategist

[email protected]

@mikopp

blog.dynatrace.com

Edward Capriolo

[email protected]

@edwardcapriolo

m6d.com/blog

performance management in ‘big data’ applications

Technology

data science9

generated data feed

typical mapreduce job

job queryjob

nodes thousands of jobs

fast readwrite2 correct

split big problems

loops numerous etl extract