CERN - Real Life RAC Scalability Experiences


Page 1: CERN - Real Life RAC Scalability Experiences

CERN IT Department, CH-1211 Genève 23, Switzerland

www.cern.ch/it

Eric Grancher [email protected]

CERN IT

Anton Topurov [email protected]

openlab, CERN IT

Real life experiences of RAC scalability with a 6 node cluster

Page 2: CERN - Real Life RAC Scalability Experiences


Outline

• CERN computing challenge
• Oracle RDBMS and Oracle RAC @ CERN
• RAC scalability – what, why and how?
• Real life scalability examples
• Conclusions
• References

Page 3: CERN - Real Life RAC Scalability Experiences


LHC gets ready …

Jürgen Knobloch / CERN

Page 4: CERN - Real Life RAC Scalability Experiences


The LHC Computing Challenge

• Signal/Noise: 10⁻⁹

• Data volume
  – High rate × large number of channels × 4 experiments
  → 15 PetaBytes of new data each year

• Compute power
  – Event complexity × number of events × thousands of users
  → 100 k of (today's) fastest CPUs

• Worldwide analysis & funding
  – Computing funded locally in major regions & countries
  – Efficient analysis everywhere
  → GRID technology

Jürgen Knobloch / CERN

Page 5: CERN - Real Life RAC Scalability Experiences


WLCG Collaboration

• The Collaboration
  – 4 LHC experiments
  – ~250 computing centres
  – 12 large centres (Tier-0, Tier-1)
  – 38 federations of smaller "Tier-2" centres

• Growing to ~40 countries
  – Grids: EGEE, OSG, NorduGrid

• Technical Design Reports
  – WLCG, 4 experiments: June 2005

• Memorandum of Understanding
  – Agreed in October 2005

• Resources
  – 5-year forward look

Jürgen Knobloch / CERN

Page 6: CERN - Real Life RAC Scalability Experiences


Centers around the world form a Supercomputer

• The EGEE and OSG projects are the basis of the Worldwide LHC Computing Grid Project WLCG

Inter-operation between Grids is working!

Jürgen Knobloch / CERN

Page 7: CERN - Real Life RAC Scalability Experiences


CPU & Disk Requirements 2006

[Charts: CPU requirements (MSI2000) and disk requirements (PB) per year, 2007-2010, broken down by experiment (ALICE, ATLAS, CMS, LHCb) and by tier (CERN, Tier-1, Tier-2). CERN accounts for ~10% of the total.]

Jürgen Knobloch / CERN

Page 8: CERN - Real Life RAC Scalability Experiences


Oracle databases at CERN

• 1982: CERN starts using Oracle
• 1996: OPS on Sun SPARC Solaris
• 2000: Use of Linux x86 for Oracle RDBMS

• Today: Oracle RAC for the most demanding services:
  – CASTOR mass storage system (15 PB / year)
  – Administrative applications (AIS)
  – Accelerators and controls, etc.
  – LHC Computing Grid (LCG)

Page 9: CERN - Real Life RAC Scalability Experiences


(our view of) RAC basics

• Shared disk infrastructure, all disk devices have to be accessible from all servers

• Shared buffer cache (with coherency!)

[Diagram: Clients → DB Servers → Storage]

Page 10: CERN - Real Life RAC Scalability Experiences


Linux RAC deployment example

• RAC for the Administrative Services move (2006)
  – 4 nodes × 4 CPUs, 16 GB RAM per node
  – Red Hat Enterprise Linux 4 / x86_64
  – Oracle RAC 10.2.0.3

• Consolidation of administrative applications, starting with:
  – CERN expenditure tracking
  – HR management, planning and follow-up, plus a self-service application
  – Computer resource allocation, central repository of information ("foundation") …

• Goals: reduce the number of databases, benefit from HA, and benefit from information sharing between the applications (same database).

Page 11: CERN - Real Life RAC Scalability Experiences


Linux RAC deployment for administrative applications

• Use of Oracle 10g services (see the sketch after this list)
  – One service per application OLTP workload
  – One service per application batch workload
  – One service per application external access

• Use of Virtual IPs
• Raw devices only for the OCR and quorum devices
• ASM disk management (2 disk groups)
• RMAN backup to TSM

• Things became much easier and more stable over time (GC_FILES_TO_LOCKS, shared raw device volumes …); good experience with consolidation on RAC.
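
As an illustration of the services setup, a minimal sketch using the DBMS_SERVICE PL/SQL package (on RAC the services were typically managed with srvctl; the service names here are hypothetical):

BEGIN
  -- One service per workload type of an application
  DBMS_SERVICE.CREATE_SERVICE(service_name => 'ais_oltp',
                              network_name => 'ais_oltp');
  DBMS_SERVICE.CREATE_SERVICE(service_name => 'ais_batch',
                              network_name => 'ais_batch');
  -- Start the services so that clients can connect to them
  DBMS_SERVICE.START_SERVICE('ais_oltp');
  DBMS_SERVICE.START_SERVICE('ais_batch');
END;
/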

Page 12: CERN - Real Life RAC Scalability Experiences


RAC Scalability

Frits Ahlefeldt-Laurvig / http://downloadillustration.com/

Page 13: CERN - Real Life RAC Scalability Experiences


RAC Scalability (1)

Two ways of upgrading database performance:

• Upgrading the hardware (scale up)
  - Expensive and inflexible

• Adding more hardware (scale out)
  - Less expensive hardware and more flexible

Page 14: CERN - Real Life RAC Scalability Experiences


RAC Scalability (2)

Adding more hardware (scale out)

Pros:
- Cheap and flexible
- Getting more popular
- Oracle RAC is there to help you

Cons:
- More is not always better
- Achieving the desired scalability is not easy

Why?

Page 15: CERN - Real Life RAC Scalability Experiences


Reason 1

• Shared disk infrastructure, all disk devices have to be accessible from all servers

• Shared buffer cache (with coherency!)

[Diagram: Clients → DB Servers → Storage]

Page 16: CERN - Real Life RAC Scalability Experiences


Reason 2

Page 17: CERN - Real Life RAC Scalability Experiences


Reason 3

Page 18: CERN - Real Life RAC Scalability Experiences


Examples

Two real-life RAC scalability examples:

• CASTOR Name Server

• PVSS

Page 19: CERN - Real Life RAC Scalability Experiences


CASTOR Name Server

• CASTOR
  – CERN Advanced STORage manager
  – Stores physics production files and user files

• CASTOR Name Server
  – Implements the hierarchical view of the CASTOR name space
  – Multithreaded software
  – Uses an Oracle database to store file metadata

Page 20: CERN - Real Life RAC Scalability Experiences


Stress Test Application

• Multithreaded
• Used with up to 40 threads
• Each thread loops 5000 times on:
  – Creating a file
  – Checking its parameters
  – Changing the size of the file

• Test made:
  – Single instance vs. 2-node RAC
  – No changes to schema or application code

Page 21: CERN - Real Life RAC Scalability Experiences


Result

CNS: single instance vs. RAC

[Chart: throughput (ops/s, 0-600) as a function of the number of threads (1-40), for a single instance and for a 2-instance RAC.]

Page 22: CERN - Real Life RAC Scalability Experiences


Analysis 1/2

Problem:
• Contention on the CNS_FILE_METADATA table

Change:
• Hash partitioning with a local PK index (sketched below)

Result:
• 10% gain, but still worse than a single instance
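
A minimal sketch of the change, assuming a simplified, hypothetical version of the CNS_FILE_METADATA schema (only the partitioning idea comes from the slide):

-- Hash-partition the table on the file identifier
CREATE TABLE cns_file_metadata (
  fileid   NUMBER        NOT NULL,
  parent   NUMBER,
  name     VARCHAR2(255),
  filesize NUMBER
)
PARTITION BY HASH (fileid) PARTITIONS 32;

-- Back the primary key with a local (per-partition) index
CREATE UNIQUE INDEX cns_file_pk ON cns_file_metadata (fileid) LOCAL;
ALTER TABLE cns_file_metadata
  ADD CONSTRAINT cns_file_pk PRIMARY KEY (fileid) USING INDEX cns_file_pk;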

Page 23: CERN - Real Life RAC Scalability Experiences


Analysis (2/2)

• Top event:
  – enq: TX - row lock contention
  – Again on CNS_FILE_METADATA

• Findings:
  – Application logic causes the row lock contention
  – Table structure reorganization cannot help

• Follow-up:
  – No simple solution
  – Work in progress

Page 24: CERN - Real Life RAC Scalability Experiences


PVSS II

• Commercial SCADA application – critical for LHC and experiments

• Archiving in Oracle Database

• Out-of-the-box performance: 100 "changes" per second

• CERN needs: 150 000 changes per second = 1500 times faster!

Page 25: CERN - Real Life RAC Scalability Experiences


The Tuning Process

1. Run the workload; gather ASH/AWR information, 10046 traces, …

2. Find the top event that slows down the processing.

3. Understand why time is spent on this event.

4. Modify the client code, database schema, database code, or hardware configuration.
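
For steps 1 and 2, the standard Oracle instrumentation can be driven as follows (a sketch; trace levels and snapshot frequency are a matter of choice):

-- Take an AWR snapshot before and after the workload run
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;

-- Enable extended SQL trace (event 10046, level 8 = with wait events)
-- in the session under test
ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

-- ... run the workload ...

ALTER SESSION SET EVENTS '10046 trace name context off';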

Page 26: CERN - Real Life RAC Scalability Experiences


PVSS Tuning (1/6)

• Shared resource: EVENTS_HISTORY (ELEMENT_ID, VALUE, …)

• Each client "measures" input and registers the history with a "merge" operation in the EVENTS_HISTORY table

Performance:
• 100 "changes" per second

[Diagram: 150 clients → DB servers → storage. Each client issues "UPDATE eventlastval SET …" against the EVENTLASTVAL table; a trigger on update then merges the change into the EVENTS_HISTORY table.]
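
A minimal sketch of this per-change pattern, with hypothetical column names (the real PVSS schema is more involved): the client updates the last-value row and a trigger records the change in the history table.

-- Client side: one statement per measured change
UPDATE eventlastval
   SET value = :val, ts = :ts
 WHERE element_id = :eid;

-- Trigger keeping the history table up to date (simplified)
CREATE OR REPLACE TRIGGER eventlastval_to_history
AFTER UPDATE ON eventlastval
FOR EACH ROW
BEGIN
  MERGE INTO events_history h
  USING (SELECT :NEW.element_id AS element_id,
                :NEW.ts         AS ts,
                :NEW.value      AS value
           FROM dual) s
     ON (h.element_id = s.element_id AND h.ts = s.ts)
  WHEN MATCHED THEN
    UPDATE SET h.value = s.value
  WHEN NOT MATCHED THEN
    INSERT (element_id, ts, value) VALUES (s.element_id, s.ts, s.value);
END;
/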

Page 27: CERN - Real Life RAC Scalability Experiences


PVSS Tuning (2/6)

Initial state observation:

Initial state observation:

• The database is waiting on the clients: "SQL*Net message from client"
• Use of a generic C++/DB library
• Individual inserts (one statement per entry)
• Update of a table which keeps the "latest state", through a trigger

Page 28: CERN - Real Life RAC Scalability Experiences


PVSS Tuning (3/6)

Changes:
• Bulk insert into a temporary table with OCCI, then a PL/SQL call to load the data into the history table (sketched after the table below)

Performance:
• 2000 "changes" per second

Now top event: "db file sequential read"

Top 5 timed events (awrrpt_1_5489_5490.html):

Event                     Waits   Time (s)  % Total DB Time  Wait Class
db file sequential read  29,242        137            42.56  User I/O
enq: TX - contention         41        120            37.22  Other
CPU time                                61            18.88
log file parallel write   1,133         19             5.81  System I/O
db file parallel write    3,951         12             3.73  System I/O
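
The server side of the change can be sketched as follows, with hypothetical names: clients array-insert into a global temporary table, then call a procedure that moves the rows into the history table in a single statement.

-- Staging table, private to each session
CREATE GLOBAL TEMPORARY TABLE events_temp (
  element_id NUMBER,
  ts         TIMESTAMP,
  value      NUMBER
) ON COMMIT DELETE ROWS;

-- Called by the client after each bulk insert
CREATE OR REPLACE PROCEDURE load_events_history IS
BEGIN
  INSERT INTO events_history (element_id, ts, value)
    SELECT element_id, ts, value FROM events_temp;
  COMMIT;  -- also empties the temporary table
END;
/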

Page 29: CERN - Real Life RAC Scalability Experiences


PVSS Tuning (4/6)

Changes:
• Index usage analysis and reduction
• Table structure changes: IOT
• Replacement of the merge by an insert
• Use of "direct path load" with ETL (two of these changes are sketched after the table below)

Performance:
• 16 000 "changes" per second

Now top event: cluster-related wait events

Top 5 timed events (test5_rac_node1_8709_8710.html):

Event                     Waits   Time (s)  Avg Wait (ms)  % Total Call Time  Wait Class
gc buffer busy           27,883        728             26               31.6  Cluster
CPU time                              369                               16.0
gc current block busy     6,818        255             37               11.1  Cluster
gc current grant busy    24,370        228              9                9.9  Cluster
gc current block 2-way  118,454        198              2                8.6  Cluster
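
Two of these changes sketched under the same hypothetical schema: an index-organized history table, and a direct-path insert in place of the merge.

-- Index-organized table: the rows live in the primary key index itself
CREATE TABLE events_history (
  element_id NUMBER,
  ts         TIMESTAMP,
  value      NUMBER,
  CONSTRAINT events_history_pk PRIMARY KEY (element_id, ts)
) ORGANIZATION INDEX;

-- Direct-path load from the staging table (replaces the per-row merge);
-- whether Oracle honours the APPEND hint depends on the table organization
INSERT /*+ APPEND */ INTO events_history (element_id, ts, value)
  SELECT element_id, ts, value FROM events_temp;
COMMIT;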

Page 30: CERN - Real Life RAC Scalability Experiences


PVSS Tuning (5/6)

Changes:
• Each "client" receives a unique number
• Partitioned table (partitioning sketch after the table below)
• Use of "direct path load" into the client's partition, with ETL

Performance:
• 150 000 changes per second

Now top event: "freezes" once in a while

Top 5 timed events (rate75000_awrrpt_2_872_873.html):

Event                             Waits   Time (s)  Avg Wait (ms)  % Total Call Time  Wait Class
row cache lock                      813        665            818               27.6  Concurrency
gc current multi block request    7,218        155             22                6.4  Cluster
CPU time                                       123                                5.1
log file parallel write           1,542        109             71                4.5  System I/O
undo segment extension          785,439         88              0                3.6  Configuration
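
A sketch of the partitioning scheme (hypothetical names; list partitioning is one way to realize it): the table is partitioned on the client number, so each client's direct-path load touches only its own partition.

-- One partition per client number
CREATE TABLE events_history (
  client_id  NUMBER NOT NULL,
  element_id NUMBER,
  ts         TIMESTAMP,
  value      NUMBER
)
PARTITION BY LIST (client_id) (
  PARTITION p_client_1 VALUES (1),
  PARTITION p_client_2 VALUES (2),
  PARTITION p_client_3 VALUES (3)
);

-- Each client loads straight into its own partition
INSERT /*+ APPEND */ INTO events_history PARTITION (p_client_1)
       (client_id, element_id, ts, value)
  SELECT 1, element_id, ts, value FROM events_temp;
COMMIT;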

Page 31: CERN - Real Life RAC Scalability Experiences


PVSS Tuning (6/6)

Problem investigation:
• Link between the foreground process and the ASM processes
• Difficult to interpret the ASH report and the 10046 trace

Problem identification:
• ASM space allocation is blocking some operations

Changes:
• Space pre-allocation by a background task (sketched below)

Result:
• A stable 150 000 "changes" per second
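
The pre-allocation can be sketched as a scheduled background job that extends the active segment ahead of the writers (hypothetical names, size and frequency):

BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'preallocate_eventh_space',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN
                          EXECUTE IMMEDIATE
                            ''ALTER TABLE events_history
                                MODIFY PARTITION p_client_1
                                ALLOCATE EXTENT (SIZE 100M)'';
                        END;',
    repeat_interval => 'FREQ=MINUTELY; INTERVAL=10',
    enabled         => TRUE);
END;
/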

Page 32: CERN - Real Life RAC Scalability Experiences


PVSS Tuning Schema

[Diagram: 150 clients → DB servers → storage. Each client bulk-inserts its changes into a temporary table; a PL/SQL call then performs a direct-path load into the client's partition of the history table:

  insert /*+ APPEND */ into eventh (…)
  PARTITION (1)
  select … from temp

The original path (clients issuing "UPDATE eventlastval SET …", with a trigger on update merging into the events_history table) is shown alongside for comparison.]

Page 33: CERN - Real Life RAC Scalability Experiences


PVSS Tuning Summary

Conclusion:
• From 100 "changes" per second to 150 000 "changes" per second
• 6-node RAC (dual CPU, 4 GB RAM per node), 32 SATA disks with an FCP link to the hosts
• 4 months of effort:
  – Rewriting part of the application, with interface changes (C++ code)
  – Changes to the database code (PL/SQL)
  – Schema changes
  – Numerous work sessions, joint work with other CERN IT groups

Page 34: CERN - Real Life RAC Scalability Experiences


Scalability Conclusions

• ASM / cluster filesystem / NAS allow:
  – a much easier deployment
  – far less complexity and risk of human error

• 10g RAC is much easier to tune

• RAC can boost your application's performance, but it also exposes weak points in the design and magnifies their impact

• Proper application design is the key to almost linear scalability for a "non-read-only" application

Page 35: CERN - Real Life RAC Scalability Experiences


Recommendations

• Understand the extra features

• Take advantage of RAC connection management
  – Use "services" with resource management
  – Load balancing: client side, server side, client + server side …

• Use connection failover mechanisms:
  – "Transparent Application Failover": re-connection; modifications not yet committed are lost (a configuration sketch follows).
  – "Fast Connection Failover" (10g): notification based; has to be implemented on the client side.
  – At a minimum: re-connection.
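
A configuration sketch for TAF on a service (10gR2; the service name is hypothetical): SELECT-type failover lets open queries resume on a surviving instance after re-connection.

BEGIN
  DBMS_SERVICE.MODIFY_SERVICE(
    service_name     => 'ais_oltp',
    failover_method  => DBMS_SERVICE.FAILOVER_METHOD_BASIC,
    failover_type    => DBMS_SERVICE.FAILOVER_TYPE_SELECT,
    failover_retries => 180,
    failover_delay   => 5);
END;
/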

Page 36: CERN - Real Life RAC Scalability Experiences


The Message to You

• RAC can scale even for write intensive applications

• As you know, RAC scalability tuning is challenging

• Create RAC-aware applications from the beginning
  – Application design is key

• The effort pays off:
  – Better application performance
  – A highly available system
  – Flexibility and a lower hardware price

Page 37: CERN - Real Life RAC Scalability Experiences


References

• "Pro Oracle Database 10g RAC on Linux", Julian Dyke and Steve Shaw

• "Connection management in RAC", James Morle

• "RAC awareness SQL", pythian.com

• "ETL 10gR2 loading", Tom Kyte

• CERN IT-DES group web site

Page 38: CERN - Real Life RAC Scalability Experiences


Q&A