CERN, Geneva - November 2008
© 2008 Julian Dyke, juliandyke.com
Julian Dyke, Independent Consultant
RAC Internals
About me
  20 years Oracle experience as DBA, developer and consultant
  Independent Consultant specializing in
    Kernel Performance Tuning
    RAC and High Availability
  Chair of UKOUG RAC & HA SIG
  Regular presenter at conferences, seminars and user group meetings in UK, Europe and USA
  Member of Oak Table Network
  Website http://www.juliandyke.com specializing in Oracle internals
About the book
  Pro Oracle Database 10g RAC on Linux
  Co-authored with Steve Shaw of Intel Corporation
  Published by Apress
  Available August 2006
  ISBN: 1-59059-524-6
  New edition planned for 2009 (Oracle 11gR2)
Agenda
  Interconnect
  RAC Background Processes
  Global Cache Services
RAC - 4-node cluster
  [Diagram: Node 1 to Node 4, each running one instance (Instance 1 to Instance 4), attached to the Public Network, the Private Network (Interconnect) and, through the Storage Network, to Shared Storage]
Interconnect - Overview
  Instances communicate with each other over the interconnect (network)
  Information transferred between instances includes
    data blocks
    locks
    SCNs
  Typically 1Gb Ethernet
    UDP protocol
    Often teamed in pairs to avoid SPOFs (single points of failure)
  Can also use Infiniband
    Fewer levels in stack
  Other proprietary protocols are available
Interconnect - TCP/IP Five Layer Model
  All messages travel down through the layers, across the physical layer, then up again
  [Diagram: two five-layer stacks (5 Application, 4 Transport, 3 Network, 2 Data Link, 1 Physical) joined at the physical layer]
Interconnect - TCP/IP Five Layer Model
  TCP/IP has a four or five layer model
  Five-layer model shown below

  Layer            TCP/IP Suite
  5 Application    DHCP, DNS, FTP, HTTP, SSH, NFS, NTP, SMTP, SNMP, TELNET, RPC, SOAP
  4 Transport      TCP, UDP
  3 Network        IP (IPv4, IPv6), ICMP, ARP, RARP
  2 Data Link      Ethernet, Token Ring, 802.11, Wi-Fi, FDDI, PPP
  1 Physical       10BASE-T, 100BASE-T, 1000BASE-T, Optical Fibre, Twisted Pair

  Four-layer model combines data link and physical layers
Interconnect - TCP/IP Transport Layer
  Transport Layer
    Connection-oriented (TCP)
    Connectionless (UDP)
  [Diagram: protocol stack with Clusterware and RAC above TCP and UDP, which run over IP, Ethernet and the Physical Layer]
Interconnect - Encapsulation
  [Diagram: Data is wrapped in a UDP Header (8 bytes), then an IP Header (20 bytes), then an Ethernet Header (14 bytes) and Ethernet Trailer (4 bytes). The MTU Size covers the IP Header, UDP Header and Data]
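The header sizes on the encapsulation slide can be turned into a quick overhead calculation. The following Python sketch assumes a standard 1500-byte Ethernet MTU and an IPv4 header without options; the function names are illustrative and not part of any Oracle tool.

```python
# Header and trailer sizes taken from the encapsulation slide (bytes).
ETHERNET_HEADER = 14
IP_HEADER = 20         # assumption: IPv4 header with no options
UDP_HEADER = 8
ETHERNET_TRAILER = 4   # frame check sequence


def max_udp_payload(mtu: int = 1500) -> int:
    """Largest UDP payload that fits in one unfragmented IP packet."""
    return mtu - IP_HEADER - UDP_HEADER


def frame_size(payload: int) -> int:
    """Bytes in one Ethernet frame carrying `payload` bytes of UDP data."""
    return (ETHERNET_HEADER + IP_HEADER + UDP_HEADER
            + payload + ETHERNET_TRAILER)


if __name__ == "__main__":
    capacity = max_udp_payload()
    print("UDP payload per packet:", capacity)
    print("Frame size on the wire:", frame_size(capacity))
```

With these sizes a database block of several kilobytes cannot fit in a single packet, which is consistent with the multi-part block transfers listed later in the deck.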
Oracle Clusterware - Node Heartbeat Messages
  Sent to each node in cluster every second in both directions
    Checks nodes are still members of cluster
  Sent by ocssd.bin using TCP well-known port 49895
    Outgoing message is 134 bytes (80 byte payload)
    Incoming message is 66 bytes (12 byte payload)
  [Diagram: Node 1 to Node 4 exchanging Outgoing and Incoming heartbeat messages]
Oracle Clusterware - Node Status Messages
  Number of packets exchanged by a node is determined by number of nodes in cluster
  Number of packets per node per hour is
    (#nodes - 1) * 4 messages * 3600 seconds

  Number of nodes    Packets per hour
  2                  14,400
  3                  28,800
  4                  43,200
  5                  57,600
  6                  72,000
  7                  86,400
  8                  100,800
  16                 216,000
  32                 446,400
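The packets-per-hour table follows directly from the formula on the slide. This short Python sketch recomputes it; the function name is hypothetical and the figure of 4 messages per peer per second is taken from the slide.

```python
def heartbeat_packets_per_hour(nodes: int) -> int:
    """Packets per node per hour: (#nodes - 1) * 4 messages * 3600 seconds."""
    if nodes < 2:
        raise ValueError("a cluster needs at least two nodes")
    return (nodes - 1) * 4 * 3600


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 6, 7, 8, 16, 32):
        print(f"{n:>2} nodes: {heartbeat_packets_per_hour(n):,} packets per hour")
```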
Global Services - Overview
  Resource
    Object to which access must be controlled at instance level
  Enqueue
    Memory structure that serializes access to a resource
  Global Resources
    Object to which access must be controlled at cluster level
  Global Enqueue
    Locks and enqueues which need to be consistent between all instances
Global Services - Overview
  Global Resource Directory (GRD)
    Records current state and owner of each resource
    Contains convert and write queues
    Distributed across all instances in cluster
    Maintained by GCS and GES
  Global Cache Services (GCS)
    Implements cache coherency for database
    Coordinates access to database blocks for instances
  Global Enqueue Services (GES)
    Controls access to other resources (locks) including library cache and dictionary cache
    Performs deadlock detection
RAC Background Processes - Overview
  [Diagram: Node 1 runs Instance 1 and Node 2 runs Instance 2. Each instance has a Shared Pool, a Buffer Cache, the background processes PMON, SMON, DBWR, LGWR, CKPT and ARCn, and the RAC background processes DIAG, LMON, LCK0, LMD0 and LMSn. Storage shown: Datafiles, Controlfiles and one set of Redo Logs per instance]
RAC Background Processes - LMSn
  LMSn - Global Cache Service Process
  Manages requests for data access across cluster
  Up to 20 in Oracle 10.1
    LMS0-LMS9, LMSa-LMSj
  Up to 36 in Oracle 10.2
    LMS0-LMS9, LMSa-LMSz
  In Oracle 10.1 and above, number of GCS server processes can be configured using gcs_server_processes parameter
    Default value is 1 (single CPU system)
    Can also be configured using _lm_lms parameter
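The LMS naming scheme above (digits first, then lowercase letters) can be sketched in Python as follows. The helper name is hypothetical; the limits of 20 and 36 come from the slide.

```python
import string


def lms_process_names(max_processes: int) -> list[str]:
    """Return LMS process names in order: LMS0-LMS9, then LMSa onwards."""
    suffixes = string.digits + string.ascii_lowercase  # 36 possible suffixes
    if not 1 <= max_processes <= len(suffixes):
        raise ValueError("between 1 and 36 LMS processes are supported")
    return [f"LMS{suffix}" for suffix in suffixes[:max_processes]]


if __name__ == "__main__":
    print(lms_process_names(20))  # Oracle 10.1 limit quoted on the slide
    print(lms_process_names(36))  # Oracle 10.2 limit quoted on the slide
```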
RAC Background Processes - LMSn
  In Oracle 10.2 and above LMS processes run in real-time mode
    Remaining processes run in time-share mode
  Check using:
    [oracle@server3 ~]$ ps -eo pid,user,opri,cmd | grep ora_lm
    8596 oracle 75 ora_lmon_TEST1
    8598 oracle 75 ora_lmd0_TEST1
    8601 oracle 58 ora_lms0_TEST1
  58 is real time; 75 or 76 is time share
  You can also check process scheduling policies using chrt
    [oracle@server3 ~]$ chrt -p 8601    # lms0 - Real Time
    pid 8601's current scheduling policy: SCHED_RR
    pid 8601's current scheduling priority: 1
    [oracle@server3 ~]$ chrt -p 8596    # lmon - Time Share
    pid 8596's current scheduling policy: SCHED_OTHER
    pid 8596's current scheduling priority: 0
RAC Background Processes - LCK0
  LCK0 - Instance Enqueue Process
  Part of KCL (Kernel Cache Library)
  Manages
    instance resource requests
    cross-instance call operations
  Assists LMS processes
  Formerly known as lock process
  One LCK0 process per instance
  In 9.0.1 and below, number of lock processes may be configurable using _gc_lck_procs parameter
RAC Background Processes - LMD0
  LMD0 - Global Enqueue Service Daemon
  Manages requests for global enqueues
    Updates status of enqueues when granted to / revoked from an instance
  Responsible for deadlock detection
  One LMD0 process per instance
  In 8.1.7 and below, number of lock daemons may be configurable using _lm_dlmd_processes parameter
RAC Background Processes - LMON
  LMON - Global Enqueue Service Monitor
  One LMON process per instance
  Monitors cluster to maintain global enqueues and resources
  Manages
    instance and process expirations
    recovery processing for cluster enqueues
RAC Background Processes - DIAG
  DIAG - Diagnosability Process
  Collects diagnostic data in the event of a failure
  Creates subdirectories in BACKGROUND_DUMP_DEST directory
  In Oracle 9.0.1 and above can be disabled using _diag_daemon parameter
    Do not try this on a production system
Global Cache Services - Introduction
  Global Cache Services exist to implement Cache Fusion
  Cache Fusion allows blocks to be updated by multiple instances
  Only one instance can have the updatable (current) version of a block
  GCS must ensure that only one instance can update a block at any time
  Many instances can have read-only (consistent read) versions of a block
  Instances can have multiple copies of same block at different SCNs
Global Cache Services - 2-way Consistent Read
  Instance 2 requests current read on block
  [Diagram: Instance 1 to Instance 4, with Instance 3 as Resource Master. Numbered steps 1 to 4 with message labels: Request shared resource; Request granted; Read request; Block returned. Resource modes shown: S, N. Block label shown: 1318]
Global Cache Services - 3-way Current Read
  Instance 1 requests exclusive read on block
  [Diagram: Instance 1 to Instance 4, with Instance 3 as Resource Master. Numbered steps 1 to 4 with message labels: Request exclusive resource; Transfer block to Instance 1 for exclusive access; Block and resource status; Resource status. Resource modes shown: S, N, X. Block labels shown: 1318, 1320]
Global Cache Services - 3-way Current Read (Dirty Block)
  Instance 4 requests exclusive read on block
  [Diagram: Instance 1 to Instance 4, with Instance 3 as Resource Master. Numbered steps 1 to 4 with message labels: Request block in exclusive mode; Transfer block to Instance 4 in exclusive mode; Block and resource status; Resource status. Resource modes shown: S, N, X. Block labels shown: 1318, 1320, 1323]
  Note that Instance 1 will create a past image (PI) of the dirty block
Global Cache Services - 3-way Current (Without Downgrade)
  Instance 2 requests current read on block
  [Diagram: Instance 1 to Instance 4, with Instance 3 as Resource Master. Numbered steps 1 to 4 with message labels: Request block in shared mode; Transfer block to Instance 2 in shared mode; Block and resource status; Resource status. Resource modes shown: N, X. Block labels shown: 1318, 1320, 1323]
  In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions
Global Cache Services - 3-way Current (With Downgrade)
  Instance 2 requests current read on block
  [Diagram: Instance 1 to Instance 4, with Instance 3 as Resource Master. Numbered steps 1 to 4 with message labels: Request block in shared mode; Transfer block to Instance 2 in shared mode; Block and resource status; Resource status. Resource modes shown: N, X, S. Block labels shown: 1318, 1320, 1323]
  In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions
Global Cache Services - Wait Events
  Wait events show reads where messages have been exchanged with other instances
  Can include:
    gc cr grant 2-way
    gc cr block 2-way
    gc cr block 3-way
    gc cr multi block request
    gc current grant 2-way
    gc current block 2-way
    gc current block 3-way
    gc current multi block request
Global Cache Services - Cache Fusion Example
  [Diagram: RAC1 to RAC4, with RAC3 as Resource Master. Two statements are executed in sequence:
    1  UPDATE t1 SET c2 = 42 WHERE c1 = 1;
    2  UPDATE t1 SET c2 = 50 WHERE c1 = 2;
  Block row values (c1,c2) shown: 1,40 2,44 then 1,42 2,44 then 1,42 2,50]
Global Cache Services - Cache Fusion Example
  RAC4 executes
    UPDATE t1 SET c2 = 42 WHERE c1 = 2;
  No statistics so dynamic sampling required
  No indexes so full table scan required
  Steps are:

  Dynamic Sampling    Table block 15    Consistent Read    3-way
                      Undo block 89     Consistent Read    2-way
                      Undo block 239    Consistent Read    2-way
  Consistent Read     Table block 15    Consistent Read    3-way
                      Undo block 89     Consistent Read    2-way
  Current Read        Table block 15    Current Read       3-way
Global Cache Services - Cache Fusion Example
  Dynamic Sampling - 10046/8

PARSING IN CURSOR #4 len=433 dep=1 uid=55 oct=3 lid=55 hv=574971495 ad='2b8da360'
SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB) opt_param('parallel_execution_enabled', 'false') NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),:"SYS_B_0"), NVL(SUM(C2),:"SYS_B_1") FROM (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("T7") FULL("T7") NO_PARALLEL_INDEX("T7") */ :"SYS_B_2" AS C1, CASE WHEN "T7"."C1"=:"SYS_B_3" THEN :"SYS_B_4" ELSE :"SYS_B_5" END AS C2 FROM "T7" "T7") SAMPLESUB
END OF STMT
PARSE #4:c=0,e=423,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=1
EXEC #4:c=1999,e=10615,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=1
WAIT #4: nam='gc cr block 3-way' ela= 836 p1=8 p2=15 p3=1 obj#=51836
WAIT #4: nam='gc cr block 2-way' ela= 442 p1=6 p2=89 p3=67 obj#=51836
WAIT #4: nam='gc cr block 2-way' ela= 453 p1=6 p2=239 p3=68 obj#=51836
FETCH #4:c=0,e=2540,p=0,cr=10,cu=0,mis=0,r=1,dep=1,og=1
STAT #4 id=1 cnt=1 pid=0 pos=1 obj=0 op='SORT AGGREGATE (cr=10 pr=0 pw=0 time=3903 us)'
STAT #4 id=2 cnt=32 pid=1 pos=1 obj=51836 op='TABLE ACCESS FULL T7 (cr=10 pr=0 pw=0 time=2650 us)'
Global Cache Services - Cache Fusion Example
  UPDATE statement - 10046/8

PARSING IN CURSOR #1 len=34 dep=0 uid=55 oct=6 lid=55 tim=1168417842291309 hv=3829255502 ad='2b8d04dc'
UPDATE t7 SET c2 = 20 WHERE c1 = 5
END OF STMT
PARSE #1:c=10998,e=61121,p=0,cr=11,cu=0,mis=1,r=0,dep=0,og=1
WAIT #1: nam='gc cr block 3-way' ela= 702 p1=8 p2=15 p3=1 obj#=51836
WAIT #1: nam='gc cr block 2-way' ela= 447 p1=6 p2=89 p3=67 obj#=0
WAIT #1: nam='gc current block 3-way' ela= 650 p1=8 p2=15 p3=33554433 obj#=51836
EXEC #1:c=0,e=2931,p=0,cr=10,cu=1,mis=0,r=1,dep=0,og=1
WAIT #1: nam='SQL*Net message to client' ela= 5 driver id=1650815232 #bytes=1 p3=0 obj#=51836
WAIT #1: nam='SQL*Net message from client' ela= 7807082 driver id=1650815232 #bytes=1 p3=0 obj#=51836
STAT #1 id=1 cnt=0 pid=0 pos=1 obj=0 op='UPDATE T7 (cr=10 pr=0 pw=0 time=2875 us)'
STAT #1 id=2 cnt=1 pid=1 pos=1 obj=51836 op='TABLE ACCESS FULL T7 (cr=10 pr=0 pw=0 time=1665 us)'
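When reading 10046 traces like the ones above, it can help to total the global cache waits per event. The Python sketch below is one possible way to do that; it only looks at the `nam=` and `ela=` fields of WAIT lines, and the function name and sample list are illustrative (the sample lines are copied from the dynamic sampling trace in this deck).

```python
import re
from collections import defaultdict

# Matches the cursor number, event name and elapsed time of a WAIT line.
WAIT_RE = re.compile(r"WAIT #(\d+): nam='([^']+)' ela= *(\d+)")


def summarize_waits(trace_lines):
    """Return {event name: (wait count, total ela)} for all WAIT lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in trace_lines:
        match = WAIT_RE.search(line)
        if match is None:
            continue  # not a WAIT line (PARSE, EXEC, FETCH, STAT, ...)
        _cursor, event, ela = match.groups()
        totals[event][0] += 1
        totals[event][1] += int(ela)
    return {event: (count, total) for event, (count, total) in totals.items()}


SAMPLE = [
    "EXEC #4:c=1999,e=10615,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=1",
    "WAIT #4: nam='gc cr block 3-way' ela= 836 p1=8 p2=15 p3=1 obj#=51836",
    "WAIT #4: nam='gc cr block 2-way' ela= 442 p1=6 p2=89 p3=67 obj#=51836",
    "WAIT #4: nam='gc cr block 2-way' ela= 453 p1=6 p2=239 p3=68 obj#=51836",
]

if __name__ == "__main__":
    for event, (count, total) in summarize_waits(SAMPLE).items():
        print(f"{event}: {count} waits, total ela {total}")
```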
Global Cache Services - gc cr block 3-way wait event

  Source           Destination      Description                     Bytes
  RAC4 - Server    RAC2 - LMS1      Request file 8 block 15         456
  RAC2 - LMS1      RAC4 - Server    OK                              212
  RAC2 - LMS1      RAC3 - LMS1      Send file 8 block 15 to RAC4    480
  RAC3 - LMS1      RAC2 - LMS1      OK                              212
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 1    1500
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 2    1500
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 3    1500
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 4    1500
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 5    1500
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 6    868
Global Cache Services - gc cr block 3-way wait event
  [Diagram: RAC1 to RAC4, with RAC3 as Resource Master. The statement UPDATE t1 SET c2 = 50 WHERE c1 = 2; triggers numbered message steps 1 to 10. Block row values (c1,c2) shown: 1,40 2,44 and 1,42 2,44]
Global Cache Services - gc cr block 2-way wait event
  2-way Consistent Read

  Source           Destination      Description                     Bytes
  RAC4 - Server    RAC3 - LMS1      Request file 6 block 69         400
  RAC3 - LMS1      RAC4 - Server    OK                              212
  RAC3 - LMS1      RAC4 - Server    Block file 6 block 69 part 1    1500
  RAC3 - LMS1      RAC4 - Server    Block file 6 block 69 part 2    1500
  RAC3 - LMS1      RAC4 - Server    Block file 6 block 69 part 3    1500
  RAC3 - LMS1      RAC4 - Server    Block file 6 block 69 part 4    1500
  RAC3 - LMS1      RAC4 - Server    Block file 6 block 69 part 5    1500
  RAC3 - LMS1      RAC4 - Server    Block file 6 block 69 part 6    868
Global Cache Services - gc cr block 2-way wait event
  [Diagram: RAC1 to RAC4, with RAC3 as Resource Master. The statement UPDATE t1 SET c2 = 50 WHERE c1 = 2; triggers numbered message steps 1 to 8. Block row values (c1,c2) shown: 1,40 2,44]
Global Cache Services - gc current block 3-way wait event
  3-way Current Read

  Source           Destination      Description                     Bytes
  RAC4 - Server    RAC2 - LMS1      Request file 8 block 15         456
  RAC2 - LMS1      RAC4 - Server    OK                              212
  RAC2 - LMS1      RAC3 - LMS1      Send file 8 block 15 to RAC4    480
  RAC3 - LMS1      RAC2 - LMS1      OK                              212
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 1    1500
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 2    1500
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 3    1500
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 4    1500
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 5    1500
  RAC3 - LMS1      RAC4 - Server    Block file 8 block 15 part 6    868
  RAC4 - LMS1      RAC2 - LMS1      Received file 8 block 15        244
  RAC2 - LMS1      RAC4 - LMS1      OK                              212
Global Cache Services - gc current block 3-way wait event
  [Diagram: RAC1 to RAC4, with RAC3 as Resource Master. The statements UPDATE t1 SET c2 = 42 WHERE c1 = 1; and UPDATE t1 SET c2 = 50 WHERE c1 = 2; trigger numbered message steps 1 to 12. Block row values (c1,c2) shown: 1,40 2,44 then 1,42 2,44 then 1,42 2,50]
  RAC3 saves past image of the dirty block until RAC4 writes the block to disk
Global Cache Services - Past Images
  When an instance passes a dirty block to another instance it
    Flushes redo buffer to redo log
    Retains past image (PI) of block in buffer cache
  PI is retained until another instance writes block to disk
  Used to reduce recovery times
  Recorded in V$BH.STATUS as PI
    Based on X$BH.STATE (value 8 in Oracle 10.2)
Global Cache Services - Past Images
  [Animated diagram: Instance 1 and Instance 2, each with a Buffer Cache and its own redo log (Redo Log 1 and Redo Log 2). A series of UPDATE t1 SET c1 = ...; COMMIT; statements changes the column value from 7123 to 7129]
  Assume table t1 contains a single row in block 42
  Instance 1 updates column to 7124
    Block 42 is read from disk
    Undo/redo written to Redo Log 1
    Block 42 is updated in buffer cache
  Instance 1 updates column to 7125
    Undo/redo written to Redo Log 1
    Block 42 is updated in buffer cache
  Instance 1 updates column to 7126
    Undo/redo written to Redo Log 1
    Block 42 is updated in buffer cache
  Instance 1 updates column to 7127
    Undo/redo written to Redo Log 1
    Block 42 is updated in buffer cache
  Instance 1 updates column to 7128
    Undo/redo written to Redo Log 1
    Block 42 is updated in buffer cache
  Instance 2 updates column to 7129
    GCS transfers block from Instance 1 to Instance 2
    Instance 1 makes block 42 a Past Image block
    Undo/redo written to Redo Log 2
    Block 42 is updated in buffer cache
  Instance 2 crashes
    Contents of buffer cache are lost
    DBWR has not written changes to block 42 back to disk yet
  Instance 1 must perform recovery for Instance 2
    Block 42 needs recovery
    Instance 1 uses Past Image
    Undo/redo is applied from Redo Log 2
    Block 42 is subsequently written back to disk by DBWR
Global Cache Services - gc cr grant 2-way wait event
  2-way Consistent Read

  Source           Destination      Description                   Bytes
  RAC4 - Server    RAC3 - LMS1      Request file 6 block 69       400
  RAC3 - LMS1      RAC4 - Server    OK                            212
  RAC3 - LMS1      RAC4 - Server    Grant read file 6 block 69    276
  RAC4 - Server    RAC3 - LMS1      OK                            212
Global Cache Services - gc cr grant 2-way wait event
  [Diagram: RAC1 to RAC4, with RAC3 as Resource Master. The statement SELECT c2 FROM t1 WHERE c1 = 1; triggers numbered message steps 1 to 6. Block row values (c1,c2) shown: 1,40 2,44]
Global Cache Services - gc cr multi block request wait event

  Source           Destination      Description                          Bytes
  RAC4 - Server    RAC3 - LMS1      Request file 8 blocks 69-73          1872
  RAC3 - LMS1      RAC4 - Server    OK                                   212
  RAC3 - LMS1      RAC4 - Server    Grant file 8 blocks 69-73 to RAC4    772
  RAC4 - Server    RAC3 - LMS1      OK                                   212
Global Cache Services - gc cr multi block request wait event
  [Diagram: RAC1 to RAC4, with RAC3 as Resource Master. The statement SELECT c2 FROM t1 WHERE c1 = 1; triggers numbered message steps 1 to 6 covering several blocks, each shown with row values (c1,c2) 1,40 2,44]
Global Cache Services - gc cr multi block request wait event
  The following 10046/8 trace is for a gc cr multi block request
    WAIT #2: nam='gc cr multi block request' ela= 722 file#=4 block#=248 class#=1 obj#=51866 tim=1169728375495574
    WAIT #2: nam='db file scattered read' ela= 10437 file#=4 block#=244 blocks=5 obj#=51866 tim=1169728375506092
  This trace can be misleading because:
    the gc cr multi block request specifies the LAST block in the range
    the gc cr multi block request does not specify how many blocks should be read
    the gc cr multi block request does not specify how many blocks have been returned from another instance
Global Cache Services - UDP Messages
  There are two types of message exchanged within RAC
  These are PROBABLY defined as follows
  Synchronous
    These messages require an acknowledgement for each packet
    In some cases the acknowledgement packet can be larger than the original request
    e.g. SCN synchronization
  Asynchronous
    These messages do not require an individual acknowledgement for each packet
    e.g. block transfers between instances
Global Cache Services - Lock Modes
  Lock modes can be:
    Null
      Another instance can hold an exclusive or shared lock
    Shared
      Another instance can hold a shared lock but not an exclusive lock
    Exclusive
      No other instances can hold shared or exclusive locks
  Locks can also be:
    Local
      No other instance has held an exclusive lock
    Global
      Another instance has held an exclusive lock in the past
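The three lock modes above imply a simple compatibility rule between the locks two instances hold on the same block. The Python sketch below encodes only what the slide states; the enum and function names are hypothetical and this is not how Oracle represents lock modes internally.

```python
from enum import Enum


class Mode(Enum):
    NULL = "N"
    SHARED = "S"
    EXCLUSIVE = "X"


def can_coexist(held: Mode, other: Mode) -> bool:
    """True if one instance may hold `held` while another holds `other`."""
    if Mode.NULL in (held, other):
        return True  # a null lock does not restrict other instances
    # Shared is compatible with shared; exclusive excludes shared and exclusive.
    return held is Mode.SHARED and other is Mode.SHARED


if __name__ == "__main__":
    for held in Mode:
        for other in Mode:
            print(held.value, other.value, can_coexist(held, other))
```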
Global Cache Services - Fairness Threshold
  Intended to prevent unnecessary lock downgrades when other instances only require read-only copies
  For write to read transfers
    Writing instance retains X lock
    Reading instance retains null lock
  If _fairness_threshold reached then
    Writing instance downgrades X lock to S lock
    Reading instance receives S lock
  _fairness_threshold default value is 4
Global Cache Services - Lock Elements
  Lock elements are externalized in the V$LOCK_ELEMENT dynamic performance view
    Based on X$LE
    Additional information is available in the X$LE view
  Past image buffers do not have a lock element
  In OPS one lock element could manage a contiguous range of blocks
  Still can in RAC using GC_FILES_TO_LOCKS parameter
    Disables Cache Fusion
Global Cache Services - Lock Elements
  Contain embedded GCS Client structures (KJBL)
  [Diagram: three Lock Elements, each with an embedded GCS Client and linked to one or more Buffer Headers]
Global Cache Services - Memory Structures
  [Diagram: GCS Resource (KJBR), GCS Shadow and GCS Client (KJBL), Lock Element (LE) and Block Header (BH) structures and the links between them]
  GCS Shadow describes blocks held by other instances, but mastered locally
Global Cache Services - Memory Structures
  GCS Resources (KJBR)
    Stored in segmented array
    Number of GCS resource structures determined by _gcs_resources parameter
    Externalized in X$KJBR
    Number of free GCS resource structures in X$KJBRFX
  GCS Enqueues (Clients / Shadows) (KJBL)
    GCS clients embedded in lock elements
    GCS shadows stored in segmented array
    Number of GCS shadow structures determined by _gcs_shadow_locks parameter
    Externalized in X$KJBL
    Number of free GCS shadow structures in X$KJBLFX
Global Cache Services - Dumps
  To dump the contents of the global cache use:
    ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME GC_ELEMENTS LEVEL 1';

GLOBAL CACHE ELEMENT DUMP (address: 0x21fecd18):
  id1: 0x3591 id2: 0x10000 obj: 181 block: (1/13713)
  lock: SL rls: 0x0000 acq: 0x0000 latch: 0
  flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp'
  bscn: 0x0.18a9c bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
  GCS CLIENT 0x21fecd60,1 sq[(nil),(nil)] resp[(nil),0x3591.10000] pkey 181
    grant 1 cvt 0 mdrole 0x21 st 0x20 GRANTQ rl LOCAL
    master 1 owner 0 sid 0 remote[(nil),0] hist 0x7c
    history 0x3c.0x1.0x0.0x0.0x0.0x0. cflag 0x0 sender 2 flags 0x0 replay# 0
    disk: 0x0000.00000000 write request: 0x0000.00000000
    pi scn: 0x0000.00000000
    msgseq 0x1 updseq 0x0 reqids[1,0,0] infop 0x0
  pkey 181
    hv 107 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 6, dom 0]
    kjga st 0x4, step 0.0.0, cinc 8, rmno 10, flags 0x0
    lb 0, hb 0, myb 178, drmb 178, apifrz 0
Global Cache Services - Dumps
  Continued

GLOBAL CACHE ELEMENT DUMP (address: 0x237f4358):
  id1: 0x6a39 id2: 0x10000 obj: 74 block: (1/27193)
  lock: SL rls: 0x0000 acq: 0x0000 latch: 0
  flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp'
  bscn: 0x0.26992 bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
  GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74
    grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
    master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5
    .....
  GCS RESOURCE 0x2ee64e74 hashq [0x2ee61894,0x2ff57390] name[0x6a39.10000] pkey 74
    grant 0x2eff3858 cvt (nil) send (nil),0 write (nil),0@65535
    flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
    .....
  GCS SHADOW 0x2eff3858,1 sq[0x237f43a0,0x2ee64e8c] resp[0x2ee64e74,0x6a39.10000] pkey 74
    grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
    master 0 owner 1 sid 0 remote[0x23fea160,1] hist 0x65f
    .....
  GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74
    grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
    master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5
    .....
Global Cache Services - Block Mastering
  Each block is mastered on one instance
  Block DBA is reported by X$KJBR.KJBRNAME
  Names have the format:
    [<block_number>][<file_number>][BL]
  For example
    [0x137][0x40000][BL]
  is file# 4, block# 311
  Ordering by X$KJBR.KJBRNAME is difficult because the resource names do not collate when sorted e.g.:
    [0x12E][0x40000][BL]
    [0x12F][0x40000][BL]
    [0x13][0x40000][BL]
    [0x130][0x40000][BL]
    [0x131][0x40000][BL]
    etc...
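To illustrate the name format and the collation problem, here is a small Python sketch that decodes a resource name into file and block numbers and sorts names numerically instead of as strings. The function name is hypothetical; the division of the second field by 65536 follows the example above, where 0x40000 corresponds to file# 4.

```python
import re

# [<block_number>][<file_number>][BL], both numbers in hexadecimal.
NAME_RE = re.compile(r"\[0x([0-9A-Fa-f]+)\]\[0x([0-9A-Fa-f]+)\]\[BL\]")


def parse_resource_name(name: str) -> tuple[int, int]:
    """Return (file number, block number) for a KJBRNAME-style string."""
    match = NAME_RE.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"unexpected resource name: {name!r}")
    block_number = int(match.group(1), 16)
    file_number = int(match.group(2), 16) // 65536
    return file_number, block_number


if __name__ == "__main__":
    names = [
        "[0x12E][0x40000][BL]",
        "[0x12F][0x40000][BL]",
        "[0x13][0x40000][BL]",
        "[0x130][0x40000][BL]",
        "[0x131][0x40000][BL]",
    ]
    # Sorting on the decoded numbers puts block 0x13 first.
    for name in sorted(names, key=parse_resource_name):
        print(name, parse_resource_name(name))
```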
Global Cache Services - Block Mastering
  Some useful functions

-- File number: the second hexadecimal field of the resource name, divided by 65536
CREATE OR REPLACE FUNCTION get_file_number (p_resource_name VARCHAR2)
RETURN INTEGER
IS
  pos1 INTEGER := INSTR (p_resource_name,'x',1,2);
  pos2 INTEGER := INSTR (p_resource_name,']',1,2);
  s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
BEGIN
  RETURN TO_NUMBER (s,'XXXXXXXX') / 65536;
END;
/

-- Block number: the first hexadecimal field of the resource name
CREATE OR REPLACE FUNCTION get_block_number (p_resource_name VARCHAR2)
RETURN INTEGER
IS
  pos1 INTEGER := INSTR (p_resource_name,'x',1,1);
  pos2 INTEGER := INSTR (p_resource_name,']',1,1);
  s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
BEGIN
  RETURN TO_NUMBER (s,'XXXXXXXX');
END;
/
Global Cache Services - Block Mastering
  In Oracle 10.2 block mastering is determined by _lm_contiguous_res_count
    Specifies number of contiguous blocks that will hash to the same HV bucket
    Defaults to 128
  For example

  Instance 0          Instance 1
  Start    End        Start    End
  0x000    0x07F      0x080    0x0FF
  0x100    0x17F      0x180    0x1FF
  0x200    0x27F      0x280    0x2FF
  0x300    0x37F      0x380    0x3FF
  0x400    0x47F      0x480    0x4FF
  0x500    0x57F      0x580    0x5FF
  etc      etc        etc      etc
Global Cache Services - Block Mastering
  In Oracle 10.1 and below block mastering is determined by a hash function
  Algorithm applied to groups of 1289 contiguous blocks
  In two node cluster
    Instance 0 has 645 blocks
    Instance 1 has 644 blocks
    etc
  In three node cluster
    Instance 0 has 430 blocks
    Instance 2 has 215 blocks
    Instance 1 has 430 blocks
    Instance 2 has 214 blocks
    etc
  Beware of small hot tables and indexes....
Global Cache Services - Block Mastering
  The following table shows that masters are still assigned to ranges of 128 contiguous blocks in a four-node cluster

  Start Block    End Block    Master
  0              127          1
  128            255          2
  256            383          2
  384            511          3
  512            639          3
  640            767          3
  768            895          1
  896            1023         0
  1024           1279         2
  1280           1407         1
Global Cache Services - Dynamic Remastering
  In Oracle 9.2
    documentation describes dynamic remastering
    not implemented in code
  In Oracle 10.1
    works at data file level
    very high threshold so difficult to test
    does occur on some customer sites
  In Oracle 10.2
    works at segment level
    thresholds are relatively low
Global Cache Services - Dynamic Remastering
  Example
    SELECT data_object_id FROM dba_objects
    WHERE owner = 'US01'
    AND object_name = 'T1';

    OBJECT_ID
    ---------
    52084
  To remaster object at current instance use:
    ORADEBUG LKDEBUG -m pkey 52084
  All blocks now mastered by the current instance
  To redistribute masters to all available instances use:
    ORADEBUG LKDEBUG -m dpkey 52084
  Blocks mastered by both (all) instances again
Global Cache Services - Dynamic Remastering
  Object remastering is recorded in V$GCSPFMASTER_INFO
  Instances are internally numbered 0, 1 etc
  Initially contains no rows
    SELECT object_id, current_master, previous_master
    FROM v$gcspfmaster_info;
  After remastering object 52084 to instance 0

    Object ID    Current Master    Previous Master
    52084        0                 32767

  After remastering object 52084 to instance 1

    Object ID    Current Master    Previous Master
    52084        1                 0
Global Cache Services - Dynamic Remastering
  In Oracle 10.2 and above, information about Dynamic Remastering operations is also reported in the following fixed views
    X$KJDRMREQ - Dynamic Remastering Requests
    X$KJDRMAFNSTATS - File Remastering Statistics
    X$KJDRMHVSTATS - Hash Value Statistics
Global Cache Services - Dynamic Remastering
  In Oracle 11.1 and above, Dynamic Remastering statistics are reported in V$DYNAMIC_REMASTER_STATS

  Column Name                Data Type
  REMASTER_OPS               NUMBER
  REMASTER_TIME              NUMBER
  REMASTERED_OBJECTS         NUMBER
  QUIESCE_TIME               NUMBER
  FREEZE_TIME                NUMBER
  CLEANUP_TIME               NUMBER
  REPLAY_TIME                NUMBER
  FIXWRITE_TIME              NUMBER
  SYNC_TIME                  NUMBER
  RESOURCES_CLEANED          NUMBER
  REPLAYED_LOCKS_SENT        NUMBER
  REPLAYED_LOCKS_RECEIVED    NUMBER
  CURRENT_OBJECTS            NUMBER
Global Cache Services - Dynamic Remastering
  Dynamic remastering is coordinated by the LMD0 background process
  The LMD0 background process reports limited details of dynamic remastering operations
  Excessive dynamic remastering can cause instance freezes
    Observed in both Oracle 10.1 and 10.2
  Oracle Support occasionally recommends that dynamic remastering is disabled using the following parameters:
    _gc_affinity_time = 0
    _gc_undo_affinity = FALSE
Global Cache Services - System Change Number
  In RAC clusters SCN must be maintained across all nodes in cluster
  SCN propagation scheme differs according to version
  In Oracle 10.1 and below defaults to Lamport algorithm
    Lamport in alert.log
    SCN piggy-backed on GCS/GES messages
    Recorded in redo log
    Default delay of 7 seconds
  In Oracle 10.2 and above defaults to Broadcast on Commit algorithm
    SCN negotiated immediately
    Apparently no delay
Global Cache Services - System Change Number
  System Change Number algorithm is determined by the MAX_COMMIT_PROPAGATION_DELAY parameter
  In Oracle 10.1 and below
    Initialization parameter specified in centiseconds
    Default value is 700 centiseconds (7 seconds)
    Specifies maximum time taken for a COMMIT on one node to be reflected on other nodes in the cluster
    For some applications performing rapid updates and queries of the same data from different instances, value must be set to 0 (Broadcast on commit)
    Examples include:
      E-Business suite
      SAP
Global Cache Services - System Change Number
  In Oracle 10.2 and above
    Default value of MAX_COMMIT_PROPAGATION_DELAY parameter is 0
    SCN broadcast on commit method is used
    SCN updates are synchronized immediately
  SCN is synchronized
    after current read
    before block updated
  This ensures correct SCN is written to block
Global Cache Services - Broadcast on Commit
  Ethernet broadcast is not used
  SCN is synchronized by updating instance
    Sends UDP SCN synchronization message to each remote instance
    Remote instances respond with their current SCN
  Another round of messages may be required if remote SCNs are more recent than local SCN
  Synchronization occurs every time an instance needs a new SCN
  Synchronization is always performed by the updating instance
  Number of messages = 4 x (number of instances - 1)
Global Cache Services - Broadcast on Commit
  In a 4-node cluster 12 messages are exchanged

  Source       Destination    Description         Bytes
  RAC4-LMS0    RAC1-LMS0      Send current SCN    192
  RAC1-LMS0    RAC4-LMS0      OK                  212
  RAC4-LMS0    RAC2-LMS0      Send current SCN    192
  RAC2-LMS0    RAC4-LMS0      OK                  212
  RAC4-LMS0    RAC3-LMS0      Send current SCN    192
  RAC3-LMS0    RAC4-LMS0      OK                  212
  RAC1-LMS0    RAC4-LMS0      Send current SCN    192
  RAC4-LMS0    RAC1-LMS0      OK                  212
  RAC2-LMS0    RAC4-LMS0      Send current SCN    192
  RAC4-LMS0    RAC2-LMS0      OK                  212
  RAC3-LMS0    RAC4-LMS0      Send current SCN    192
  RAC4-LMS0    RAC3-LMS0      OK                  212
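The message count for broadcast on commit follows the formula given earlier in the deck, 4 x (number of instances - 1). The Python sketch below recomputes the count and, as a rough estimate, the bytes exchanged per synchronization. The function names are hypothetical, and the default message sizes (192 and 212 bytes) are simply the two values that appear in the table above; treating them as fixed sizes is an assumption.

```python
def scn_sync_messages(instances: int) -> int:
    """Messages per SCN synchronization: 4 x (number of instances - 1)."""
    if instances < 1:
        raise ValueError("at least one instance is required")
    return 4 * (instances - 1)


def scn_sync_bytes(instances: int, scn_bytes: int = 192, ok_bytes: int = 212) -> int:
    """Estimated bytes per synchronization.

    Assumes each remote instance accounts for two SCN messages and two
    acknowledgements, with the sizes taken from the 4-node example.
    """
    return (instances - 1) * 2 * (scn_bytes + ok_bytes)


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, "instances:", scn_sync_messages(n), "messages,",
              scn_sync_bytes(n), "bytes")
```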
Global Cache Service - Read Consistency
  When a read consistent version of a block is requested it may be necessary to apply undo to a more recent version of that block
  Undo can be applied by LMSn background process in
    Remote instance
    Local instance
  If undo applied by remote instance, any outstanding redo must first be flushed from redo buffer of remote instance to redo log
  Can have significant performance impact on consistent reads
    Particularly on extended clusters
73 © 2008 Julian Dyke juliandyke.com
Global Cache ServiceRead Consistency
Statistics on inter-instance consistent reads are reported in V$CR_BLOCK_SERVER
Reports statistics for blocks served by local instances to remote instances including
  Number of consistent reads served
  Number of current reads served
  Number of data blocks served
  Number of undo blocks served
  Number of undo headers served
  Number of fairness down converts
  Number of log flushes
  Number of times light works rule invoked
74 © 2008 Julian Dyke juliandyke.com
Global Cache Service: Read Consistency
In theory, once a block has been written to disk, the LMS process will not attempt to read it again when responding to a consistent read request
Light Works Rule
  Prevents LMS processes from going to disk when responding to CR requests for data, undo or undo segment blocks
  Can prevent LMS process from completing its response to a CR request
75 © 2008 Julian Dyke juliandyke.com
Global Cache Service: Read Consistency
Uncommitted changes MUST be flushed to the redo log before the LMS process can ship a consistent block to another instance
Reading process must wait until redo log changes have been written to redo log by LMS process
Bad for standard RAC databases
  Reads must wait for redo log writes
Worse for extended / stretch RAC clusters
  Increased latency of cross site disk communications
76 © 2008 Julian Dyke juliandyke.com
Global Cache Service: Read Consistency
For each block on which a consistent read is performed, a redo log flush must first be performed
Number of redo log flushes is recorded in the FLUSHES column of V$CR_BLOCK_SERVER
Redo log flush time is recorded in the gc cr block flush time statistic for the LMS process
  will increase time taken to serve consistent block
  will increase time taken to perform consistent read
If LMS processes become very busy, consistent reads will experience high wait times, e.g. gc cr multi block request for a full table scan
77 © 2008 Julian Dyke juliandyke.com
Global Cache Services: Read Consistency
Committed transaction on RAC2 - All blocks still in buffer cache
[Diagram: RAC1 and RAC2, each with a Redo Buffer and Buffer Cache, and the Redo Log; steps numbered 1 to 3; values 108, 109 and 110]
78 © 2008 Julian Dyke juliandyke.com
Global Cache Services: Read Consistency
Committed transaction on RAC2 - Some blocks written to disk
[Diagram: RAC1 and RAC2, each with a Redo Buffer and Buffer Cache, and the Redo Log; steps numbered 1 to 4; values 108, 109 and 110]
79 © 2008 Julian Dyke juliandyke.com
Global Cache Services: Read Consistency
Uncommitted transaction on RAC2 - All blocks still in buffer cache
[Diagram: RAC1 and RAC2, each with a Redo Buffer and Buffer Cache, and the Redo Log; steps numbered 1 to 6; values 108, 109 and 110]
80 © 2008 Julian Dyke juliandyke.com
Global Cache Services: Read Consistency
Uncommitted transaction on RAC2 - Some blocks written to disk
[Diagram: RAC1 and RAC2, each with a Redo Buffer and Buffer Cache, and the Redo Log; steps numbered 1 to 8; values 108, 109 and 110]
81 © 2008 Julian Dyke juliandyke.com
Global Cache Services: Jumbo Frames
By default Maximum Transmission Unit (MTU) is 1500 bytes
MTU includes
  IP header
  UDP header
  Data
Requires six packets to transmit one 8192 byte block
On some adapters MTU can be increased to around 9000
  e.g. Intel PRO/1000
At command line
ifconfig eth1 mtu 9000 up
or in /etc/sysconfig/ifcfg-eth<x>
MTU=9000
82 © 2008 Julian Dyke juliandyke.com
Global Cache Services: Jumbo Frames
Example - cost of sending one 8192 byte block

MTU=1500 (default)
Frame#  Ethernet Header  IP Header  UDP Header  Data  Ethernet Trailer  Total
1       14               20         8           1472  4                 1518
2       14               20         8           1472  4                 1518
3       14               20         8           1472  4                 1518
4       14               20         8           1472  4                 1518
5       14               20         8           1472  4                 1518
6       14               20         8           840   4                 886
Total   84               120        48          8200  24                8476

MTU=9000
Frame#  Ethernet Header  IP Header  UDP Header  Data  Ethernet Trailer  Total
1       14               20         8           8200  4                 8246
Total   14               20         8           8200  4                 8246
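The frame counts and byte totals in the two tables above can be reproduced with a short calculation. The sketch below is illustrative only: it assumes, as the tables do, that every frame carries its own IP and UDP header, and that the MTU covers the IP header, UDP header and data but not the Ethernet header or trailer. The function name `frame_cost` is hypothetical.

```python
import math
from typing import Tuple

# Per-frame overheads in bytes, taken from the tables above.
ETHERNET_HEADER = 14
IP_HEADER = 20
UDP_HEADER = 8
ETHERNET_TRAILER = 4


def frame_cost(data_bytes: int, mtu: int) -> Tuple[int, int]:
    """Return (frames, total bytes on the wire) for one message.

    Assumption: each frame has its own IP and UDP header, and the MTU
    limits IP header + UDP header + data (Ethernet framing is extra).
    """
    max_data = mtu - IP_HEADER - UDP_HEADER
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    frames = math.ceil(data_bytes / max_data)
    per_frame = ETHERNET_HEADER + IP_HEADER + UDP_HEADER + ETHERNET_TRAILER
    return frames, data_bytes + frames * per_frame


if __name__ == "__main__":
    # 8200 data bytes is the figure shown in the tables' Data column.
    for mtu in (1500, 9000):
        frames, total = frame_cost(8200, mtu)
        print(f"MTU={mtu}: {frames} frame(s), {total} bytes")
```

With 8200 data bytes this gives 6 frames and 8476 bytes at MTU 1500, against 1 frame and 8246 bytes at MTU 9000. The saving in bytes is small (230), so the main gain from jumbo frames is sending one frame instead of six.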
83 © 2008 Julian Dyke juliandyke.com
Global Cache Services: Jumbo Frames
Not all network adapter drivers support jumbo frames
  Particularly cheap ones....
All network adapters in private interconnect must have same MTU size
Switch must also be configured to support jumbo frames
Lots of bugs and compatibility issues e.g.
  Bug 4447620: RAC UDP MTU size restricted to 1500 or 9000
    affects 10.1.0.5, 10.2.0.1
    fixed in 10.2.0.2 and above