In this Document
Purpose |
Scope |
Details |
Issue #1: High counts of lost blocks(gc lost blocks, gc current/cr lost blocks) |
Issue #2: High log file sync waits |
Issue #3: High Waits on Mutex |
Issue #4: High gc buffer busy, enq: TX -row lock contention, enq: TX - index contention, enq: TX - ITL allocate entry waits |
Issue #5: High CPU and memory consumption |
References |
APPLIES TO:
Oracle Database - Enterprise Edition - Version 10.1.0.2 to 12.1.0.1 [Release 10.1 to 12.1]Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Gen 1 Exadata Cloud at Customer (Oracle Exadata Database Cloud Machine) - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Information in this document applies to any platform.
PURPOSE
The purpose of this document is to provide a summary of top database and/or instance performance issues and the possible solutions in a RAC environment.
Please note that Issue #3 (High Waits on Mutexes) and Issue #5 (High CPU and Memory Consumption) are generic database issues and not RAC-specific. However, they have been included in this document because they are one of the TOP issues, and can occur locally on an instance in a RAC environment.
SCOPE
DBAs
DETAILS
Issue #1: High counts of lost blocks(gc lost blocks, gc current/cr lost blocks)
Symptoms
II. netstat -s reports increasing number of packet reassembly failure, dropped packets.
Solutions
Document 563566.1 - gc block lost diagnostics
Issue #2: High log file sync waits
Symptoms
II. Average log file sync is high ( > 20 ms).
III. Average log file parallel write is high ( > 10 ms).
Iv. Average redo write broadcast ack time or wait for scn ack is high ( > 10 ms).
V. Average log file sync is low, but there are too many log file sync waits.
Background
When a user session COMMITs or ROLLBACKs, the session's redo information needs to be flushed by LGWR to the redo logfile. The user session waits on 'log file sync' while waiting for LGWR to post it back to confirm that all redo changes are safely on disk.
Example:
WAIT #0: nam='log file sync' ela= 977744 buffer#=754 p2=0 p3=0 obj#=114927 tim=1212384670107387
Parameters:
P1 = buffer#
P2 = Not used
P3 = Not used
obj# = object_id
All changes up to this buffer# (in the redo log buffer) must be flushed to disk and the writes confirmed to ensure that the transaction is committed, and will remain committed upon an instance crash.
A typical life cycle of 'log file sync' wait
1. A user sessions issues a commit or rollback and starts waiting on log file sync.
2. LGWR gather redo information to be written to redo log files, issues IO and queues BOC to an LMS process and posts LMS process.
3. LGWR waits for redo buffers to be flushed to disk and SCN to be ACK'd
4. Completion of IO and receiving ACK from LMS signal write complete. LGWR then posts foreground process to continue.
5. Foreground process wakes up and log file sync wait ends.
Important log file sync related statistics and events
redo write time - Total elapsed time of the write from the redo log buffer to the current redo log file in 10s of milliseconds.
redo writes - Total number of writes by LGWR to the redo log files. "redo blocks written" divided by this statistic equals the number of blocks per write.
log file parallel write - Time it takes for the I/Os to complete. Even though redo records are written in parallel, the parallel write is not complete until the last I/O is on disk.
redo write broadcast ack time - Total amount of the latency associated with broadcast on commit beyond the latency of the log write (in microseconds).
user commits - Number of user commits. When a user commits a transaction, the redo generated that reflects the changes made to database blocks must be written to disk. Commits often represent the closest thing to a user transaction rate.
user rollbacks - Number of times users manually issue the ROLLBACK statement or an error occurs during a user's transactions.
The script provided in Document 1064487.1 - Script to Collect Log File Sync Diagnostic Information (lfsdiag.sql) collects useful information for troubleshooting log file sync issues.
Possible Causes
II. Oracle bugs. Please review WAITEVENT: "log file sync" Reference (Document 34592.1) for known oracle bugs.
III. LMS not running in RT class.
IV. Scheduling delays with LGWR process.
V. Excessive number of commits.
VI. OS resource starvation.
Solutions
II. Apply fixes for the known oracle bugs, which are applicable to your environment. The most effective way to get those fixes in is to apply the latest PSU patches. Document 756671.1 has more information on the latest PSUs.
III. Ensure that LMS processes are running in RT class. LMS processes run in RT class by default.
IV. Reduce number of commits by using batch commits instead of committing after every DML operation.
V. See if any activity can safely be done with NOLOGGING / UNRECOVERABLE options.
VI. See if COMMIT NOWAIT option can be used. Please refer to Document 857576.1 for more information.
Issue #3: High Waits on Mutex
Mutexes are a lighter-weight and more granular concurrency mechanism than latches.The reason for obtaining a mutex is to ensure that certain operations are properly managed for concurrency. e.g., if one session is changing a data structure in memory, then other session must wait to acquire the mutex before it can make a similar change. The following are most common mutex related waits:
A. cursor: pin S wait on X
B. cursor: pin S
C. library cache: Mutex X
Symptoms (A)
Background (A)
A session may wait for this event when it is trying to get a mutex pin in Share mode but another session is holding the mutex pin on the same cursor object in exclusive. Frequently, waits for 'Cursor: pin S wait on X' is a symptom and not the cause. There may be underlying tuning requirements or known issues.
Possible Causes (A)
Solutions (A)
Symptoms (B)
Background (B)
A session waits for "cursor: pin S" when it wants a specific mutex in S (share) mode on a specific cursor and there is no concurrent X holder but it could not acquire that mutex immediately. This may seem a little strange as one might question why there should be any form of wait to get a mutex which has no session holding it in an incompatible mode. The reason for the wait is that in order to acquire the mutex in S mode (or release it) the session has to increment (or decrement) the mutex reference count and this requires an exclusive atomic update to the mutex structure itself. If there are concurrent sessions trying to make such an update to the mutex then only one session can actually increment (or decrement) the reference count at a time. A wait on "cursor: pin S" thus occurs if a session cannot make that atomic change immediately due to other concurrent requests.
Mutexes are local to the current instance in RAC environments.
Parameters:
P1 = idn
P2 = value
P3 = where (where|sleeps in 10.2)
idn is the mutex identifier value which matches to the HASH_VALUE of the SQL statement that we are waiting to get the mutex on. The SQL can usually be found using the IDN value in a query of the form:
SELECT sql_id, sql_text, version_count
FROM V$SQLAREA where HASH_VALUE=&IDN;
If the SQL_TEXT shown for the cursor is of the form "table_x_x_x_x" then this is a special internal cursor - see Document 1298471.1 for information about mapping such cursors to objects.
P1RAW is the same value in hexadecimal and it can be used to search in trace files for SQL matching to that hash value.
Possible Causes (B)
II. Waits for very many different "idn" values when under load.
III. Oracle Bugs
Bug 10270888 - ORA-600[kgxEndExamine-Bad-State] / mutex waits after a self deadlock
Bug 9591812 - Wrong wait events in 11.2 ("cursor: mutex S" instead of "cursor: mutex X")
Bug 9499302 - Improve concurrent mutex request handling
Bug 7441165 - Prevent preemption while holding a mutex (fix only works on Solaris)
Bug 8575528 - Missing entries in V$MUTEX_SLEEP.location
Solutions (B)
II. For any identified "hot" SQLs, one can reduce the concurrency on specific cursors by replacing the one SQL with some variants which are executed by different sessions. Please review WAITEVENT: cursor: pin S Reference (Document 1310764.1) for further details.
III. Apply fixes for other known oracle bugs. The most effective way to get the fixes in is to apply the latest PSU patches. Document 756671.1 has more information on recommended patches.
Symptoms (C)
Background (C)
The library cache mutex is acquired for similar purposes that the library cache latches were acquired in prior versions of Oracle. In 10g, mutexes were introduced for certain operations in the library cache. Starting with 11g, the library cache latches were replaced by mutexes. This wait event is present whenever a library cache mutex is held in exclusive mode by a session and other sessions need to wait for it to be released.
Individual Waits:
Parameters:
P1 = "idn" = Unique Mutex Identifier
P2 = "value" = Session ID holding the mutex
P3 = "where" = location in code (internal identifier) where mutex is being waited for
Systemwide Waits:
At a systemwide level, there are two views which will help diagnose this wait:
GV$MUTEX_SLEEP (or V$MUTEX_SLEEPS for non-RAC)
and GV$MUTEX_SLEEP_HISTORY (or V$MUTEX_SLEEP_HISTORY for non-RAC)
These views track the instance wide usage of mutexes since instance startup. Since these views show values that are total values since startup, they are most meaningful when you obtain the difference in values during a short time interval when there was problem. The easiest way to see this information is through an AWR or statspack report in the "Mutex Sleep Summary" section.
Possible Causes (C)
II. High Version Counts.
III. Invalidations and reloads.
IV. Oracle Bugs. Please review WAITEVENT: "library cache: mutex X" (Document 727400.1) for the
list of known oracle bugs.
Solutions (C)
II. Review Troubleshooting: High Version Count Issues (Document 296377.1) to resolve high version count issue.
III. Apply fixes for the known oracle bugs. The most effective way to get the fixes in is to apply the latest PSU patches. Document 756671.1 has more information on recommended patches.
Issue #4: High gc buffer busy, enq: TX -row lock contention, enq: TX - index contention, enq: TX - ITL allocate entry waits
A. enq: TX - Index Contention
B. enq: TX - ITL allocate entry waits
C. enq: TX - row lock contention
Symptoms (A)
II. AWR reports show high values for branch node splits, leaf node splits and leaf node 90-10 splits
III. AWR reports (Segments by Row Lock Waits) show high row lock waits for a particular segment
IV. AWR reports show other waits such as gc buffer busy waits on index branch or Leaf blocks, gc current block busy on Remote Undo Headers and gc current split.
Example:
Top 5 Timed Foreground Events
Event Waits Time(s) Avg wait (ms) % DB time Wait Class
enq: TX - index contention 29,870 1,238 41 9.52 Concurrency
Instance Activity Stats:
Statistic Total per Second per Trans
branch node splits 945 0.26 0.00
leaf node 90-10 splits 1,670 0.46 0.00
leaf node splits 35,603 9.85 0.05
Segments by Row Lock Waits:
Owner Tablespace Object Name Obj.Type Row Lock Waits % of Capture
ACSSPROD ACSS_IDX03 ACSS_ORDER_HEADER_PK INDEX 3,425 43.62
ACSSPROD ACSS_IDX03 ACSS_ORDER_HEADER_ST INDEX 883 11.25
ACSSPROD ACSS_IDX03 ACSS_ORDER_HEADER_DT INDEX 682 8.69
Background (A)
When a transaction inserting a row in an index has to wait for the end of an index block split being done by another transaction the session would wait on event enq: TX - index contention.
Possible causes (A)
II. Right-growing index with key generated from a sequence.
Solutions (A)
I. Global Hash partition the index
II. Recreate the index as reverse key index (not suitable for large table, could require buffer cache increased accordingly)
III. If index key is generated from a sequence, increase cache size of the sequence and make the sequence 'no order' if application supports it.
IV. Beginning in 18c, "Scalable sequences significantly reduce the sequence and index block contention and provide better data load scalability, compared to the solution of configuring a very large sequence cache using the CACHE clause of ... ALTER SEQUENCE."
Reference: https://docs.oracle.com/en/database/oracle/oracle-database/18/newft/new-features.html#GUID-CB2428B8-A5BA-4B13-B437-ECB5F0C2C84E
Symptom (B)
Background (B)
A session waits on enq: TX - allocate ITL entry when it wants to lock a row in the block but one or more other sessions have rows locked in the same block, and there is no free ITL slot in the block. Usually, Oracle Database dynamically adds another ITL slot. This may not be possible if there is insufficient free space in the block to add an ITL.
Possible Causes (B)
Solutions (B)
SELECT OWNER, OBJECT_NAME, OBJECT_TYPE
FROM V$SEGMENT_STATISTICS
WHERE STATISTIC_NAME = 'ITL waits' AND VALUE > 0
ORDER BY VALUE;
Increase inittrans value of the segments, which are seeing high ITL waits.
Symptom (C)
Background (C)
A session waits on enq: TX - row lock contention, when it waits for a row level lock that is held by another session. This happens when one user has updated or deleted a row, but has not committed or rolled back yet, and another session wants to update or delete the same row.
Solution (C)
Document 102925.1 - Tracing sessions: waiting on an enqueue
Document 179582.1 - How to Find TX Enqueue Contention in RAC or OPS
Document 1020008.6 - SCRIPT: FULLY DECODED LOCKING
Document 62354.1 - TX Transaction locks - Example wait scenarios
Document 224305.1 -Autonomous Transaction can cause locking
Issue #5: High CPU and memory consumption
A. High CPU Utilization
B. High Memory Utilization
Symptoms (A)
processes are either oracle shadow or background processes.
II. AWR reports show top waits are one or more of the following:
latch free
cursor pin S wait on X or cursor pin S wait or library cache mutex X
latch: cache buffers chains
resmgr: cpu quantum
enq: RO - fast object reuse
DFS lock handle
III. AWR reports (SQLs ordered by buffer gets) show SQLs with very high buffer gets per execution
and cpu time.
IV. Process stack of the high cpu consuming process shows that the process is spinning.
Possible Causes (A)
Bug 12431716 - Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU / GI PSU
Bug 8199533 - NUMA enabled by default can cause high CPU
Bug 9226905 - Excessive CPU usage / OERI:kkfdPaPrm from Parallel Query / High Version
count on PX_MISMATCH
Bug 7385253 - Slow Truncate / DBWR uses high CPU / CKPT blocks on RO enqueue
Bug 10326338 - High "resmgr:cpu quantum" but CPU has idle time
Bug 6455161 - Higher CPU / Higher "cache buffer chains" latch gets / Higher "consistent gets"
after truncate/Rebuild
II. Very large SGA on Linux x86-64 platform without the implementation of Hugepages.
III. Expensive SQL queries with sub-optimal execution plans.
IV. Runaway processes.
Solutions (A)
II. Implement hugepages. Please refer to Document 361670.1 - Slow Performance with High CPU Usage on 64-bit Linux with Large SGA for further explanation.
III. Tune the SQLs, which are incurring excessive buffer gets and cpu time. Please refer to Document 404991.1 - How to Tune Queries with High CPU Usage for more info.
Symptoms (B)
#pmap -x 26677
Address Kbytes RSS Anon Locked Mode Mapped File
00010000 496 480 - - r-x-- bash
0009A000 80 80 24 - rwx-- bash
000AE000 160 160 40 - rwx-- [ heap ]
FF100000 688 688 - - r-x-- libc.so.1
II. Session uga and/or pga memory of a oracle shadow process is growing.
select se.sid,n.name,max(se.value) maxmem
from v$sesstat se,
v$statname n
where n.statistic# = se.statistic#
and n.name in ('session pga memory','session pga memory max',
'session uga memory','session uga memory max')
group by n.name,se.sid
order by 3;
III. Number of open cursors for a session/process is growing monotonically.
Possible Causes (B)
Bug 9919654 - High resource / memory use optimizing SQL with UNION/set functions with
many branches
Bug 10042937 HIGH MEMORY GROUP IN GES_CACHE_RESS AND ORA-4031 ERRORS
Bug 7429070 BIG PGA MEM ALLOC DURING PARSE TIME - KXS-HEAP-C
Bug 8031768 - ORA-04031 SHARED POOL "KKJ JOBQ WOR"
Bug 10220046 - OCI memory leak using non-blocking connection for fetches
Bug 6356566 - Memory leak / high CPU selecting from V$SQL_PLAN
Bug 7199645 - Long parse time / high memory use from query rewrite
Bug 10636231 - High version count for INSERT .. RETURNING statements with
reason INST_DRTLD_MISMATCH
Bug 9412660 - PLSQL cursor leak / ORA-600[kglLockOwnersListDelete]
Bug 6051972 - Memory leak if connection to database is frequently opened and closed
Bug 4690571 - Memory leak using BULK COLLECT within cursor opened using native
dynamic sql
II. Cursor leaks caused by application not closing cursors explicitly.
III. SQLs with abnormally large hash join and/or sort operation.
Possible Solutions (B)
II. Ensure cursors are explicitly closed by application.
III. Avoid very large hash join and/or sort operation.
REFERENCES
NOTE:1064487.1 - Script to Collect Log File Sync Diagnostic Information (lfsdiag.sql)NOTE:1356828.1 - FAQ: 'cursor: mutex ..' / 'cursor: pin ..' / 'library cache: mutex ..' Type Wait Events
NOTE:1357946.1 - Troubleshooting 'library cache: mutex X' Waits.
NOTE:1310764.1 - WAITEVENT: "cursor: pin S" Reference Note
NOTE:1349387.1 - Troubleshooting 'cursor: pin S wait on X' waits.
BUG:9591812 - INCORRECT WAIT EVENTS IN 11.2
BUG:7429070 - BIG PGA MEM ALLOC DURING PARSE TIME - KXS-HEAP-C
BUG:8031768 - ORA-04031 SHARED POOL "KKJ JOBQ WOR"
NOTE:873243.1 - Troubleshooting 'enq: TX - index contention' Waits
BUG:10042937 - HIGH MEMORY GROUP IN GES_CACHE_RESS AND ORA-4031 ERRORS
NOTE:404991.1 - How to Tune Queries with High CPU Usage
NOTE:563566.1 - Troubleshooting gc block lost and Poor Network Performance in a RAC Environment
NOTE:34592.1 - WAITEVENT: "log file sync" Reference Note
BUG:4690571 - MEMORY LEAK USING BULK COLLECT WITHIN CURSOR OPENED USING NATIVE DYNAMIC SQL
NOTE:164768.1 - Troubleshooting: High CPU Utilization
BUG:7199645 - QUERY_REWRITE_ENABLED TAKE LONG PARSE TIME AND CONSUME ALL SERVER MEMORY
BUG:7385253 - DBWR IS CONSUMING HIGH CPU
NOTE:361670.1 - Slow Performance with High CPU Usage on 64-bit Linux with Large SGA
NOTE:1511700.1 - ORA-00060 Single-Resource Deadlock Occurs on Autonomous Transaction
BUG:6051972 - MEMORY LEAK IF CONNECTION TO DATABASE SERVER IS FREQUENTLY OPENED AND CLOSED
BUG:6356566 - STATSPACK.SNAP COSUMES ALMOST 100% CPU
NOTE:1298471.1 - H$PSEUDO_CURSOR
NOTE:786507.1 - How to Determine the Blocking Session for Event: 'cursor: pin S wait on X'
NOTE:62143.1 - Troubleshooting: Understanding and Tuning the Shared Pool
NOTE:727400.1 - WAITEVENT: "library cache: mutex X"
BUG:10411618 - ADD DIFFERENT WAIT SCHEMES FOR MUTEX WAITS
NOTE:102925.1 - Tracing Sessions that are Waiting on an Enqueue or a Lock
BUG:10636231 - HIGH VERSION COUNT FOR INSERT STATEMENTS WITH REASON INST_DRTLD_MISMATCH
BUG:6455161 - HIGHER "CONSISTENT GETS" AFTER TRUNCATE
NOTE:62354.1 - Waits for 'Enq: TX - ...' Type Events - Transaction (TX) Lock Example Scenarios
NOTE:857576.1 - Alternative and Specialised Options as to How to Avoid Waiting for Redo Log Synchronization
NOTE:756671.1 - Master Note for Database Proactive Patch Program
BUG:10220046 - MEMORY LEAK OCCURS WITH NON-BLOCKING APP IN THE LATER VERSION 10.2.0.4
NOTE:1298015.1 - WAITEVENT: "cursor: pin S wait on X" Reference Note
NOTE:1020008.6 - SCRIPT: FULLY DECODED LOCKING
No comments:
Post a Comment