ten things you can do to make your cluster easy to monitor in … · 2013. 2. 5. · see oracle...

34
1 Privileged and Confidential Ten Things You Can Do to Make Your Cluster Easy to Monitor in Production Bay Area Coherence SIG Tom Lubinski SL Corporation 29 April 2010

Upload: others

Post on 20-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

1 Privileged and Confidential

Ten Things You Can Doto Make Your Cluster Easy to Monitor in Production

Bay Area Coherence SIG

Tom LubinskiSL Corporation

29 April 2010

2 Privileged and Confidential

Agenda

bull Introduction

bull Disclaimer

bull Why Monitor Coherence (and Applications)

bull Ten Things You Can Do

bull Questions

3 Privileged and Confidential

About SL Corporation

bull A Leader in Real-time Application Performance Monitoring

bull Located near San Francisco CA with office in Tokyo

bull RTView platform for APM and operational visibility

bull Oracle Coherence Monitor (OCM) and Viewer (OCV) based on RTView

4 Privileged and Confidential

RTView ndash Partial Customer List

RTViewCustomers

5 Privileged and Confidential

SL - Background

Connecticut Valley Power Grid Management System

Extensive background in real-time process monitoring

Critical Tax Season Applications at Intuit

Large volumes of dynamic data

OOCL World WideShipment Tracking

Visualization technologies

NASA Space Shuttle Launch Control System

Mission-critical applications

6 Privileged and Confidential

Disclaimer (1)

bull Wersquove learned a lot

bull We donrsquot know everything

bull We could be wrong

bull Built Coherence monitoring product in 2006

7 Privileged and Confidential

Disclaimer (2)

bull This stuff is boring -Z Z Z-

bull Questions banter even hecklingare welcome

8 Privileged and Confidential

Why Monitor Coherence (and Applications)

hellip so they stay up and are fast and reliable

(of course)

9 Privileged and Confidential

What typically happens hellip

The Three Phases of a Coherence Project

10 Privileged and Confidential

Why Monitor Coherence (and Applications)

Your Coherence Vision

Faster better smarter applications

Large volumes of in-memory data

Fast caching to buffer database

In-location parallel processing

11 Privileged and Confidential

Why Monitor Coherence (and Applications)

Development

Download install experiment

Initial results promising

Difficult to test full user and data load

Pressure to deploy quickly

Monitoring takes back seat

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 2: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

2 Privileged and Confidential

Agenda

bull Introduction

bull Disclaimer

bull Why Monitor Coherence (and Applications)

bull Ten Things You Can Do

bull Questions

3 Privileged and Confidential

About SL Corporation

bull A Leader in Real-time Application Performance Monitoring

bull Located near San Francisco CA with office in Tokyo

bull RTView platform for APM and operational visibility

bull Oracle Coherence Monitor (OCM) and Viewer (OCV) based on RTView

4 Privileged and Confidential

RTView ndash Partial Customer List

RTViewCustomers

5 Privileged and Confidential

SL - Background

Connecticut Valley Power Grid Management System

Extensive background in real-time process monitoring

Critical Tax Season Applications at Intuit

Large volumes of dynamic data

OOCL World WideShipment Tracking

Visualization technologies

NASA Space Shuttle Launch Control System

Mission-critical applications

6 Privileged and Confidential

Disclaimer (1)

bull Wersquove learned a lot

bull We donrsquot know everything

bull We could be wrong

bull Built Coherence monitoring product in 2006

7 Privileged and Confidential

Disclaimer (2)

bull This stuff is boring -Z Z Z-

bull Questions banter even hecklingare welcome

8 Privileged and Confidential

Why Monitor Coherence (and Applications)

hellip so they stay up and are fast and reliable

(of course)

9 Privileged and Confidential

What typically happens hellip

The Three Phases of a Coherence Project

10 Privileged and Confidential

Why Monitor Coherence (and Applications)

Your Coherence Vision

Faster better smarter applications

Large volumes of in-memory data

Fast caching to buffer database

In-location parallel processing

11 Privileged and Confidential

Why Monitor Coherence (and Applications)

Development

Download install experiment

Initial results promising

Difficult to test full user and data load

Pressure to deploy quickly

Monitoring takes back seat

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 3: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

3 Privileged and Confidential

About SL Corporation

bull A Leader in Real-time Application Performance Monitoring

bull Located near San Francisco CA with office in Tokyo

bull RTView platform for APM and operational visibility

bull Oracle Coherence Monitor (OCM) and Viewer (OCV) based on RTView

4 Privileged and Confidential

RTView ndash Partial Customer List

RTViewCustomers

5 Privileged and Confidential

SL - Background

Connecticut Valley Power Grid Management System

Extensive background in real-time process monitoring

Critical Tax Season Applications at Intuit

Large volumes of dynamic data

OOCL World WideShipment Tracking

Visualization technologies

NASA Space Shuttle Launch Control System

Mission-critical applications

6 Privileged and Confidential

Disclaimer (1)

bull Wersquove learned a lot

bull We donrsquot know everything

bull We could be wrong

bull Built Coherence monitoring product in 2006

7 Privileged and Confidential

Disclaimer (2)

bull This stuff is boring -Z Z Z-

bull Questions banter even hecklingare welcome

8 Privileged and Confidential

Why Monitor Coherence (and Applications)

hellip so they stay up and are fast and reliable

(of course)

9 Privileged and Confidential

What typically happens hellip

The Three Phases of a Coherence Project

10 Privileged and Confidential

Why Monitor Coherence (and Applications)

Your Coherence Vision

Faster better smarter applications

Large volumes of in-memory data

Fast caching to buffer database

In-location parallel processing

11 Privileged and Confidential

Why Monitor Coherence (and Applications)

Development

Download install experiment

Initial results promising

Difficult to test full user and data load

Pressure to deploy quickly

Monitoring takes back seat

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 4: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

4 Privileged and Confidential

RTView ndash Partial Customer List

RTViewCustomers

5 Privileged and Confidential

SL - Background

Connecticut Valley Power Grid Management System

Extensive background in real-time process monitoring

Critical Tax Season Applications at Intuit

Large volumes of dynamic data

OOCL World WideShipment Tracking

Visualization technologies

NASA Space Shuttle Launch Control System

Mission-critical applications

6 Privileged and Confidential

Disclaimer (1)

bull Wersquove learned a lot

bull We donrsquot know everything

bull We could be wrong

bull Built Coherence monitoring product in 2006

7 Privileged and Confidential

Disclaimer (2)

bull This stuff is boring -Z Z Z-

bull Questions banter even hecklingare welcome

8 Privileged and Confidential

Why Monitor Coherence (and Applications)

hellip so they stay up and are fast and reliable

(of course)

9 Privileged and Confidential

What typically happens hellip

The Three Phases of a Coherence Project

10 Privileged and Confidential

Why Monitor Coherence (and Applications)

Your Coherence Vision

Faster better smarter applications

Large volumes of in-memory data

Fast caching to buffer database

In-location parallel processing

11 Privileged and Confidential

Why Monitor Coherence (and Applications)

Development

Download install experiment

Initial results promising

Difficult to test full user and data load

Pressure to deploy quickly

Monitoring takes back seat

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 5: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

5 Privileged and Confidential

SL - Background

Connecticut Valley Power Grid Management System

Extensive background in real-time process monitoring

Critical Tax Season Applications at Intuit

Large volumes of dynamic data

OOCL World WideShipment Tracking

Visualization technologies

NASA Space Shuttle Launch Control System

Mission-critical applications

6 Privileged and Confidential

Disclaimer (1)

bull Wersquove learned a lot

bull We donrsquot know everything

bull We could be wrong

bull Built Coherence monitoring product in 2006

7 Privileged and Confidential

Disclaimer (2)

bull This stuff is boring -Z Z Z-

bull Questions banter even hecklingare welcome

8 Privileged and Confidential

Why Monitor Coherence (and Applications)

hellip so they stay up and are fast and reliable

(of course)

9 Privileged and Confidential

What typically happens hellip

The Three Phases of a Coherence Project

10 Privileged and Confidential

Why Monitor Coherence (and Applications)

Your Coherence Vision

Faster better smarter applications

Large volumes of in-memory data

Fast caching to buffer database

In-location parallel processing

11 Privileged and Confidential

Why Monitor Coherence (and Applications)

Development

Download install experiment

Initial results promising

Difficult to test full user and data load

Pressure to deploy quickly

Monitoring takes back seat

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 6: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

6 Privileged and Confidential

Disclaimer (1)

bull Wersquove learned a lot

bull We donrsquot know everything

bull We could be wrong

bull Built Coherence monitoring product in 2006

7 Privileged and Confidential

Disclaimer (2)

bull This stuff is boring -Z Z Z-

bull Questions banter even hecklingare welcome

8 Privileged and Confidential

Why Monitor Coherence (and Applications)

hellip so they stay up and are fast and reliable

(of course)

9 Privileged and Confidential

What typically happens hellip

The Three Phases of a Coherence Project

10 Privileged and Confidential

Why Monitor Coherence (and Applications)

Your Coherence Vision

Faster better smarter applications

Large volumes of in-memory data

Fast caching to buffer database

In-location parallel processing

11 Privileged and Confidential

Why Monitor Coherence (and Applications)

Development

Download install experiment

Initial results promising

Difficult to test full user and data load

Pressure to deploy quickly

Monitoring takes back seat

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 7: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

7 Privileged and Confidential

Disclaimer (2)

bull This stuff is boring -Z Z Z-

bull Questions banter even hecklingare welcome

8 Privileged and Confidential

Why Monitor Coherence (and Applications)

hellip so they stay up and are fast and reliable

(of course)

9 Privileged and Confidential

What typically happens hellip

The Three Phases of a Coherence Project

10 Privileged and Confidential

Why Monitor Coherence (and Applications)

Your Coherence Vision

Faster better smarter applications

Large volumes of in-memory data

Fast caching to buffer database

In-location parallel processing

11 Privileged and Confidential

Why Monitor Coherence (and Applications)

Development

Download install experiment

Initial results promising

Difficult to test full user and data load

Pressure to deploy quickly

Monitoring takes back seat

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 8: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

8 Privileged and Confidential

Why Monitor Coherence (and Applications)

hellip so they stay up and are fast and reliable

(of course)

9 Privileged and Confidential

What typically happens hellip

The Three Phases of a Coherence Project

10 Privileged and Confidential

Why Monitor Coherence (and Applications)

Your Coherence Vision

Faster better smarter applications

Large volumes of in-memory data

Fast caching to buffer database

In-location parallel processing

11 Privileged and Confidential

Why Monitor Coherence (and Applications)

Development

Download install experiment

Initial results promising

Difficult to test full user and data load

Pressure to deploy quickly

Monitoring takes back seat

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 9: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

9 Privileged and Confidential

What typically happens hellip

The Three Phases of a Coherence Project

10 Privileged and Confidential

Why Monitor Coherence (and Applications)

Your Coherence Vision

Faster better smarter applications

Large volumes of in-memory data

Fast caching to buffer database

In-location parallel processing

11 Privileged and Confidential

Why Monitor Coherence (and Applications)

Development

Download install experiment

Initial results promising

Difficult to test full user and data load

Pressure to deploy quickly

Monitoring takes back seat

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 10: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

10 Privileged and Confidential

Why Monitor Coherence (and Applications)

Your Coherence Vision

Faster better smarter applications

Large volumes of in-memory data

Fast caching to buffer database

In-location parallel processing

11 Privileged and Confidential

Why Monitor Coherence (and Applications)

Development

Download install experiment

Initial results promising

Difficult to test full user and data load

Pressure to deploy quickly

Monitoring takes back seat

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 11: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

11 Privileged and Confidential

Why Monitor Coherence (and Applications)

Development

Download install experiment

Initial results promising

Difficult to test full user and data load

Pressure to deploy quickly

Monitoring takes back seat

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 12: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

12 Privileged and Confidential

Why Monitor Coherence (and Applications)

Reality

User and data load causes problem

Latency bottlenecks timeouts

Did Coherence survive my changes

Which app is causing excessive load

What events lead up to the problem

How do I tell Oracle what happened

Difficult to change once in production

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 13: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

13 Privileged and Confidential

There is an hellip

Alternative Reality

ldquoBlack Boxrdquo is opened up

Alerts notify users of problems

Monitoring metrics at your fingertips

Event history shows what happened

Life is good

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 14: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

14 Privileged and Confidential

Ten Eleven Things You Can Do

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 15: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

15 Privileged and Confidential

Ten Things You Can Do

Monitoring is a ldquoNice-to-haverdquo

0 - Change the mindset(most important)

NO

Monitoring is a ldquoMust-haverdquo

YES

No hellip you canrsquot just get by with log files

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 16: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

16 Privileged and Confidential

Ten Things You Can Do

1 ndash Itrsquos your code hellip not Coherence

ldquoThere must be a bug in the compilerrdquo ndash Donn Combelic

Understand how your code executes in the cluster

Memory dies hellip you replace the memory

App dies hellip you change your code (or config)

All ldquoopsrdquo can do is restart the cluster

No ldquoone size fits allrdquo monitoring

(usually)

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 17: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

17 Privileged and Confidential

Ten Things You Can Do

2 ndash Know JMX and its sluggish nature

Coherence is high-performance JMX is not

Node Cluster Service MBean counts low

Cache Storage counts can be huge hellip gt 20000

Monitor the monitor ndash with JMX easy to get bad data

Separate node for JMX MBeanServer

Separate hardware for monitoring

Donrsquot put management=all on every node

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 18: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

18 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Good communication critical to Coherence

Run the datagram test to get throughput baseline

Storage Process node behaviors different

See Oracle documents re buffers switches etc

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 19: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

19 Privileged and Confidential

Ten Things You Can Do

3 ndash Understand Network Metrics

Success rates published in JMX are asymptotic

Pub Succ Rate = Pkts Resent Pkts Sent

Instead calculate rate from deltas

Delta Succ Rate = Delta Pkts Resent Delta Pkts Sent

Cluster tries to self-correct by reducing rates

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 20: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

20 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles)

Track nodes across restarts ndash use member names(ids change on restart)

Proxy services on separate nodes(node metrics apply to proxy only)

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 21: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

21 Privileged and Confidential

Ten Things You Can Do

4 ndash Identify the Players (Roles) Do cache operations (getput) on non-storage nodes

ndashlocalstorage=false

Process Nodes

request history

Storage Nodes

request history ndash

rogue process

node cannot be

identified here

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 22: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

22 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

Define one service for each important cache(or group of related caches)

Service Metrics

-Cpu load-Requests-Messages-Task Backlog-Requests Pending-etc

Cache Metrics

-Total GetsHitsMisses-Total Puts-Hit Rate-Store WritesReads-etc

Especially important for entry processors

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 23: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

23 Privileged and Confidential

Ten Things You Can Do

5 ndash Configure Services for Monitoring

With a few large caches and many small caches -Use heterogeneous scaling to reduce MBean counts

MBean count = nodes caches 2 for each service

100 nodes X 100 caches = 20000 MBeans

By running service on smaller of nodes fewer MBeans

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 24: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

24 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Avoid Out-Of-Memory Errors at all costs

Setup node death on OOME

No more than 30 heap for data

Avoid swapping total JVM heap lt physical memory

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 25: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

25 Privileged and Confidential

Ten Things You Can Do

6 ndash Monitor Capacity Carefully

Binary Unit Calculator useful for cache sizes

Use High Units to limit cache sizes

Binary units not helpful for front caches

Backup data and index sizes not shown

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 26: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

26 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Service is like database Cache is like table ndash Gene Gleyzer

Dynamic Caches = Scratch Space

Createdestroy cache = costly MBean registration

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 27: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

27 Privileged and Confidential

Ten Things You Can Do

7 ndash Stomp Out Cluster Abuse

Hundreds of Caches costly to monitor = many MBeans

Eg Cache Storage MBeans for 10 nodes with 375 caches

Cannot respond any faster than you can get the data

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 28: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

28 Privileged and Confidential

Ten Things You Can Do

8 ndash Refine Monitoring During Testing

Monday Morning Problem ndash validate after changes

Develop monitoring for your load tests ndash automated reports

Use monitoring to get familiar with failure modes during test

Separate batch vs operational scenarios

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 29: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

29 Privileged and Confidential

Ten Things You Can Do

9 ndash Monitor Your Resources

The effect your app has on resources (cpu mem network)

CPU monitoring using external tools

Java BCI tools like Wiley useful

Network signature tools

End user experience monitoring tools

Develop holistic view of app and Coherence component

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 30: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

30 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

Coherence JMX provides a lot of info but much is missing

No info about time taken to perform operations or the counts

No way to differentiate requests in the cluster clients

Custom JMX beans useful to augment Coherence

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 31: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

31 Privileged and Confidential

Ten Things You Can Do

10 ndash Instrument Your Applications

ltmbeansgt

ltmbean id=10gtltmbean-querygtMyDemoDataltmbean-querygtltmbean-namegttype=MyDemoDataltmbean-namegtltenabledgttrueltenabledgt

ltmbeangt

ltmbeansgt

Coherence collects MBeans from each client

Use custom-mbeansxml to specify MBeans in app

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 32: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

32 Privileged and Confidential

Ten Things You Can Do

Final Notes

Happy clusters are all alikeEvery unhappy cluster is unhappy in its own way

- var Tolstoy

Keep your cluster singing hellip

Itrsquos up to you not ldquoopsrdquo

Donn did find one bug in the compiler

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 33: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

33 Privileged and Confidential

Questions and Discussion

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom

Page 34: Ten Things You Can Do to Make Your Cluster Easy to Monitor in … · 2013. 2. 5. · See Oracle documents re: buffers, switches, etc. ... MyDemoData:*

34 Privileged and Confidential

RTView andOracle Coherence Monitor

SL Corporation

AdditionalInformation

wwwslcom