P.C. Burkimsher IT-CO-BE July 2004 Scaling Up PVSS Showstopper Tests Paul Burkimsher IT-CO


Page 1

P.C. Burkimsher IT-CO-BE July 2004

Scaling Up PVSS

Showstopper Tests

Paul Burkimsher IT-CO

Page 2

Aim of the Scaling Up Project

WYSIWYAF

Investigate functionality and performance of large PVSS systems

Reassure ourselves that PVSS scales to support large systems

Provide detail rather than bland reassurances

Page 3

What has been achieved?

Over 18 months PVSS has gone through many pre-release versions:
– “2.13”
– 3.0 Alpha
– 3.0 Pre-Beta
– 3.0 Beta
– 3.0 RC1
– 3.0 RC1.5

Lots of feedback to ETM. ETM have incorporated:
– Design fixes and bug fixes

Page 4

Progress of the project

Has closely followed the different versions. Some going over the same ground, repeating tests as bugs were fixed.

Good news: V3.0 Official Release is now here (even 3.0.1)

Aim of this talk:
– Summarise where we’ve got to today.
– Show that the list of potential “showstoppers” has been addressed.

Page 5

What were the potential showstoppers?

Basic functionality
– Synchronised types in V2!

Sheer number of systems
– Can the implementation cope?

Sheer number of displays

Alert Avalanches
– How does PVSS degrade?

Is load of many Alerts reasonable?

Is load of many Trends reasonable?

Page 6

What were the potential showstoppers?

Basic functionality
– Synchronised types in V2!

Sheer number of systems
– Can the implementation cope?

Alert Avalanches
– How does PVSS degrade?

Is load of many Alerts reasonable?

Is load of many Trends reasonable?

(Slide annotation: “Skip”)

Page 7

Sheer number of systems

130 systems simulated on 5 machines

40,000 DPEs per system, ~5 million DPEs in total

Interconnected successfully
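The “~5 million” figure is consistent with the per-system count used in the later test configurations. A quick check, assuming every one of the 130 simulated systems holds the same 40,000 DPEs (the slide does not state this explicitly):

```python
# Assumption: all simulated systems hold the same number of DPEs.
SYSTEMS = 130
DPES_PER_SYSTEM = 40_000

total_dpes = SYSTEMS * DPES_PER_SYSTEM
print(f"{total_dpes:,} DPEs in total")
```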

Page 8

What were the potential showstoppers?

Basic functionality
– Synchronised types in V2!

Sheer number of systems
– Can the implementation cope?

Alert Avalanches
– How does PVSS degrade?

Is load of many Alerts reasonable?

Is load of many Trends reasonable?

(Slide annotation: “Skip”)

Page 9

Alert Avalanche Configuration

2 WXP machines

Each machine = 1 system

Each system has 5 crates declared x 256 channels x 2 alerts in each channel (“voltage” and “current”)

40,000 DPEs total in each system

Each system showed alerts from both systems

[Diagram: machines 94 and 91, each with a UI]

Page 10

Traffic & Alert Generation

Simple UI script

Repeat
– Delay D ms
– Change N DPEs

Traffic rate is quoted as D \ N (delay in ms \ number of DPEs changed)
– Bursts.
– Not changes/sec.

Option to provoke alerts
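The loop described on this slide can be sketched as follows. This is a minimal Python illustration, not the original PVSS UI script: the function name, the `set_value` callback standing in for a datapoint write, and the datapoint names are all hypothetical.

```python
import time
from typing import Callable, Sequence


def generate_traffic(
    dpes: Sequence[str],
    delay_ms: float,
    n_changes: int,
    bursts: int,
    set_value: Callable[[str, float], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `bursts` cycles of "delay D ms, then change N DPEs"; return changes made."""
    changed = 0
    for burst in range(bursts):
        sleep(delay_ms / 1000.0)            # Delay D ms
        for i in range(n_changes):          # Change N DPEs back to back (one burst)
            name = dpes[i % len(dpes)]
            set_value(name, float(burst))   # hypothetical write, new value per burst
            changed += 1
    return changed


if __name__ == "__main__":
    store = {}                               # stands in for the datapoint store
    total = generate_traffic(
        dpes=[f"crate1.ch{i}.voltage" for i in range(256)],  # made-up names
        delay_ms=10, n_changes=100, bursts=3,
        set_value=store.__setitem__,
    )
    print(total, "changes written in 3 bursts")
```

Because the N changes are written back to back after each delay, the load arrives in bursts, which is why D \ N is not the same thing as a smooth changes/sec figure.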

Page 11

Alert Avalanche Test Results - I

You can select which systems’ alerts you wish to view

UI caches ALL alerts from ALL selected systems.

Needs sufficient RAM! (5,000 CAME + 5,000 WENT alerts needed 80 MB)

Screen update is CPU hungry and an avalanche takes time(!)
– 30 sec for 10,000 lines.
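For sizing a UI machine, the two measurements above can be turned into rough per-alert figures. The sketch assumes that the whole 80 MB is attributable to the cached alerts and that 1 MB = 1024 KB; neither is stated on the slide.

```python
# Figures quoted on the slide.
alerts_cached = 5_000 + 5_000     # CAME + WENT alerts held by the UI
ram_mb = 80                       # RAM needed for that cache
lines_drawn = 10_000
redraw_seconds = 30

# Assumption: all 80 MB is alert cache, and 1 MB = 1024 KB.
kb_per_alert = ram_mb * 1024 / alerts_cached
lines_per_second = lines_drawn / redraw_seconds

print(f"~{kb_per_alert:.1f} KB of UI memory per cached alert")
print(f"~{lines_per_second:.0f} alert lines drawn per second")
```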

Page 12

Alert Avalanche Test Results - II

Too many alerts -> progressive degradation

1) Screen update suspended
– Message shown

2) Evasive Action. Event Manager eventually cuts the connection to the UI; UI suicides.
– EM correctly processed ALL alerts and LOST NO DATA.
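The second stage can be pictured with a toy model. This is only a sketch of the general idea (a bounded per-client queue, with the slow client dropped rather than waited for); the class, the queue limit and the mechanism are assumptions for illustration and do not describe PVSS internals. The first stage (suspended screen update) is not modelled.

```python
from collections import deque


class ToyEventManager:
    """Toy model: every alert is archived; a slow UI is dropped, not waited for."""

    def __init__(self, max_queue: int):
        self.max_queue = max_queue      # hypothetical per-client queue limit
        self.archive = []               # everything the EM has processed
        self.ui_queue = deque()         # alerts waiting for the UI to draw
        self.ui_connected = True

    def on_alert(self, alert: str) -> None:
        self.archive.append(alert)      # EM processing never depends on the UI
        if not self.ui_connected:
            return
        self.ui_queue.append(alert)
        if len(self.ui_queue) > self.max_queue:
            # Evasive action: cut the connection instead of stalling the EM.
            self.ui_connected = False
            self.ui_queue.clear()

    def ui_draw(self, count: int) -> int:
        """The UI consumes up to `count` queued alerts; returns how many it drew."""
        drawn = min(count, len(self.ui_queue))
        for _ in range(drawn):
            self.ui_queue.popleft()
        return drawn


em = ToyEventManager(max_queue=100)
for i in range(1_000):                  # avalanche: the UI draws nothing meanwhile
    em.on_alert(f"alert-{i}")
print(em.ui_connected, len(em.archive))
```

In this model the UI connection is lost but the archive still holds all 1,000 alerts, which mirrors the observation that the EM lost no data.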

Page 13

Alert Avalanche Test Results - III

Alert screen update is CPU intensive

Scattered alert screens behave the same as local ones. (TCP)

“Went” alerts that are acknowledged on one alert screen disappear from the other alert screens, as expected.
– Bugs we reported have now been fixed.

Page 14

What were the potential showstoppers?

Basic functionality
– Synchronised types in V2!

Sheer number of systems
– Can the implementation cope?

Alert Avalanches
– How does PVSS degrade?

Is load of many Alerts reasonable?

Is load of many Trends reasonable?

Page 15

Agreed Realistic Configuration

3 level hierarchy of machines

Only ancestral connections, no peer links. Only direct connections allowed.

40,000 DPEs in each system, 1 system per machine

Mixed platform (W=Windows, L=Linux)

[Diagram: three-level tree of 16 machines labelled by platform; 13 are marked L and 3 are marked W]
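The connection rule can be expressed as a small check. This sketch reads “ancestral” as “one machine lies on the other's path to the top node”, so the top node may connect directly to a leaf. The machine numbers are those used on the following slides; which leaves hang under which mid-level machine is not given in the talk, so the grouping below is made up for illustration.

```python
# Parent of each machine. Top and middle levels follow the slides; the grouping
# of leaves under 92, 93 and 94 is NOT given in the talk and is hypothetical.
PARENT = {
    "92": "91", "93": "91", "94": "91",
    "95": "92", "03": "92", "04": "92", "05": "92",
    "06": "93", "07": "93", "08": "93", "09": "93",
    "10": "94", "11": "94", "12": "94", "13": "94",
}


def ancestors(machine: str) -> set:
    """All machines on the path from `machine` up to the top node."""
    found = set()
    while machine in PARENT:
        machine = PARENT[machine]
        found.add(machine)
    return found


def connection_allowed(a: str, b: str) -> bool:
    """True if one machine is an ancestor of the other (so no peer links)."""
    return a in ancestors(b) or b in ancestors(a)


print(connection_allowed("91", "03"))   # top node to a leaf
print(connection_allowed("92", "93"))   # two mid-level peers
```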

Page 16

Viewing Alerts coming from leaf systems

[Diagram: machine hierarchy with 91 at the top; 92, 93, 94 in the middle; 95 and 03 to 13 as leaves]

1,000 “came” alerts generated on PC94 took 15 sec to be absorbed by PC91. All 4(2) CPUs in PC91 shouldered the load.

Additional alerts then fed from PC93 to the top node.
– Same graceful degradation and evasive action seen as before. PC91’s EM killed PC91’s Alert Screen.

Display is again the bottleneck.

Page 17

Rate supportable from 2 systems

Set up a high, but supportable rate of traffic (10,000 \ 1,000) on each of PC93 and PC94, feeding PC91.

PC93 itself was almost saturated, but PC91 coped (~200 alerts/sec average, dual CPU)

[Diagram: machine hierarchy with 91 at the top; 92, 93, 94 in the middle; 95 and 03 to 13 as leaves]
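The ~200 alerts/sec average follows from the traffic setting if D \ N is read as “delay in ms \ DPE changes per burst”. The sketch below assumes each change provokes one alert and ignores the time a burst itself takes; the function name is illustrative.

```python
def average_changes_per_second(delay_ms: float, n_changes: int) -> float:
    """Average rate of a D \\ N setting, ignoring the time a burst itself takes."""
    return n_changes * 1000.0 / delay_ms


# 10,000 \ 1,000 on each feeder: 1,000 changes every 10 seconds.
per_feeder = average_changes_per_second(10_000, 1_000)   # PC93 or PC94
combined = 2 * per_feeder                                 # both feed PC91
print(per_feeder, combined)
```

The traffic still arrives in bursts of 1,000, so the instantaneous load on PC91 is well above this average.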

Page 18

Surprise Overload (manual)

Manually stop PC93

PC91 pops up a message

Manually restart PC93

Rush of traffic to PC91 caused PC93 to overload

PC93’s EM killed PC93’s DistM

PC91 pops up a message

[Diagram: machine hierarchy with 91 at the top; 92, 93, 94 in the middle; 95 and 03 to 13 as leaves]

Page 19

PVSS Self-healing property

PVSS self-healing algorithm
– Pmon on PC93 restarts PC93’s DistM

Page 20

Remarks

Evasive action taken by EM, cutting connection, is very good. Localises problems, keeping the overall system intact.

Self-healing action is very good. Automatic restart of dead managers

BUT…

Page 21

Evasive action and Self-healing

Manually stop PC93

PC91 pops up a message

Manually restart PC93

Rush of traffic to PC91 causes PC93 to overload

PC93’s EM killed PC93’s DistM

PC91 pops up a message

Pmon restarts PC93’s DistM

[Diagram: machines 91, 92, 93, 94]

Page 22

Self-healing Improvement

To avoid the infinite loop, ETM’s Pmon eventually gives up.

How soon it gives up is configurable. Still not ideal!

ETM are currently considering my suggestion for improvement:
– Pmon should issue the restart, but not immediately.
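One way to picture “restart, but not immediately” combined with “eventually gives up” is a restart policy like the one below. It is only a sketch: the doubling delay is an illustrative choice, not ETM's design, and the numbers are not Pmon defaults.

```python
from typing import Optional


def next_restart_delay(
    restarts_so_far: int,
    base_delay_s: float = 5.0,
    max_restarts: int = 3,
) -> Optional[float]:
    """Return seconds to wait before the next restart, or None to give up.

    All numbers are illustrative; they are not Pmon defaults.
    """
    if restarts_so_far >= max_restarts:
        return None                                  # stop the restart loop
    return base_delay_s * (2 ** restarts_so_far)     # wait longer each time


for attempt in range(5):
    print(attempt, next_restart_delay(attempt))
```

Delaying the restart gives the rush of queued traffic time to drain before the restarted manager reconnects, which is the situation that caused the repeated overload in the manual test.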

Page 23

(Old) Alert Screen

We fed back many problems with the Alert Screen during the pre-release trials.
– E.g. leaves stale information on-screen when systems leave and come back.

Page 24

New Alert/Event Screen in V3.0

The 3.0 Official release now has a completely new Alert/Event Screen which fixes most of the problems.

It’s new and still has some bugs, but the ones we have seen are neither design problems nor showstoppers.

Page 25

More work for ETM:

When DistM is killed by EM taking evasive action, the only indication is in the log.

But Log viewer, like Alert viewer, is heavy on CPU and shouldn’t be left running when it’s not needed.

Page 26

Reconnection Behaviour

No gaps in the Alert archive of the machine that isolated itself by taking evasive action. No data was lost.

It takes about 20 sec for 2 newly restarted Distribution Managers to get back in contact.

Existing (new-style!) alert screens are updated with the alerts of new systems that join (or re-join) the cluster.

Page 27

Is load of many Alerts reasonable?

A sustained average of ~200 alerts/sec would itself be rather worrying in a production system, so supporting that rate is more than enough. So I believe “Yes”.

The response to an overload is very good, though it can still be tweaked.

Data integrity is preserved throughout.

Page 28

What were the potential showstoppers?

Basic functionality
– Synchronised types in V2!

Sheer number of systems
– Can the implementation cope?

Alert Avalanches
– How does PVSS degrade?

Is load of many Alerts reasonable?

Is load of many Trends reasonable?

Page 29

Can you see the baby?

Page 30

What were the potential showstoppers?

Basic functionality
– Synchronised types in V2!

Sheer number of systems
– Can the implementation cope?

Alert Avalanches
– How does PVSS degrade?

Is load of many Alerts reasonable?

Is load of many Trends reasonable?

Page 31

Is the load of many Trends reasonable?

Same configuration:

[Diagram: machine hierarchy with 91 at the top; 92, 93, 94 in the middle; 95 and 03 to 13 as leaves]

Trend windows were opened on PC91 displaying data from more and more systems. Mixed platform.

Page 32

Is Memory Usage Reasonable?

Step: RAM (MB)
Steady state, no trends open on PC91: 593
Open plot ctrl panel on 91: 658
On PC91, open a 1 channel trend window from PC03: 658
On PC91, open a 1 channel trend window from PC04: 657
On PC91, open a 1 channel trend window from PC05: 657
On PC91, open a 1 channel trend window from PC06: 658
On PC91, open a 1 channel trend window from PC07: 658

Yes

Page 33

Is Memory Usage Reasonable?

Step: RAM
Steady state, no trends open on PC91: 602
On PC91, open 16 single channel trend windows from PC95Crate1Board1: 604
On PC91, open 16 single channel trend windows from PC03Crate1Board1: 607
On PC91, open 16 single channel trend windows from PC04Crate1Board1: 610

Yes

Page 34

Test 34: Looked at top node plotting data from leaf machines’ archives

Performed excellently.

Test ceased when we ran out of screen real estate to show even the iconised trends (48 of them).

Page 35

Bland result? No!

Did the tests go smoothly? No!
– But there was good news at the end

Page 36

Observed gaps in the trend!!

Investigation showed gap was correct
– Remote Desktop start-up caused CPU load
– Data changes were not generated at this time

Zzzzzzz

Page 37

Proof with a Scattered Generator

Steady traffic generation

No gaps in the recorded archive
– Even when CPU was deliberately soaked up

Gaps were seen in the display
– Need a “Trend Refresh” button (ETM)

[Diagram labels: Scattered UI on PC93; Traffic; EM; Trend UI on PC94]

Zzzzzzz

Page 38

Would sustained overload give trend problems?

High traffic (400 ms delay \ 1000 changes) on PC93, as a scattered member of PC94’s system.

PC94’s own trend plot could not keep up.

PC91’s trend plot could not keep up.

“Not keep up” means…

Zzzzzzz

Page 39

“Display can’t keep up” means…

[Figure: trend plot against time, annotated “Trend screen values updated to here” and “Time now”]

Zzzzzzz

Page 40

Evasive action

[Figure: trend plot against time, annotated “Trend screen values finally updated to here” and “Time now”]

EM took evasive action (disconnected the traffic generator) just here

Last 65 sec queued in Traffic Generator. Lost when it suicided.

Zzzzzzz

Page 41

Summary of Multiple Trending

PVSS can cope

PVSS is very resilient to overload

Successful tests.

Wakey!

Page 42

Test 31 DP change rates

Measured saturation rates on different platform configurations.

No surprises. Faster machines with more memory are better. Linux is better than Windows.

Numbers on the Web.

Page 43

Test 32 DP changes with alerts

Measured saturation rates; no surprises again.

Dual CPU can help in processing when there are a lot of alert screen (user interface) updates.

Page 44

What were the potential showstoppers?

Basic functionality
– Synchronised types in V2!

Sheer number of systems
– Can the implementation cope?

Alert Avalanches
– How does PVSS degrade?

Is load of many Alerts reasonable?

Is load of many Trends reasonable?

Conclusions

Page 45

Conclusions

No showstoppers.

We have seen nothing to suggest that PVSS cannot be used to build a very big system.

Page 46

Further work - I

Further “informational” tests will be conducted to assist in making configuration recommendations, e.g. understanding the configurability of the message queuing and evasive action mechanism.

Follow up issues such as “AES needed more CPU when scattered”.

Traffic overload from a SIM driver rather than a UI

Collaborate with Peter C. to perform network overload tests.

Page 47

Further work – II

Request a Use Case from experiments for a non-stressed configuration:
– Realistic sustained alert rates
– Realistic peak alert rate + realistic duration
  • i.e. not a sustained avalanche
– How many users connected to control room machine?
– % viewing alerts; % viewing trends; % viewing numbers (e.g. CAEN voltages)
– Terminal Server UI connections
– How many UIs can control room cope with?

What recommendations do you want?

Page 48

In greater detail…

The numbers behind these slides will soon be available on the Web at http://itcobe.web.cern.ch/itcobe/Projects/ScalingUpPVSS/welcome.html

Any questions?

Page 50

Can you see the baby?

Page 51

Example Numbers

Name   O/S     GHz       GB     Rate @ ~70% CPU
PC92   Linux   2.2 x 2   2      1000\1000
PC93   W2000   1.8       0.5    1000\500
PC94   WXP     2.4       1      2000\1000
PC95   Linux   2.4       1      1000\1000
PC03   Linux   0.7       0.25   2000\1000

Table showing the Traffic Rates on different machine configurations, that gave rise to 70% CPU usage on those machines. See the Web links for the original table and details on how to interpret the figures.
