Tuesday, May 11, 2021

Top 5 Database and/or Instance Performance Issues in RAC Environment (Doc ID 1373500.1)

 In this Document

Purpose
Scope
Details
 Issue #1: High counts of lost blocks(gc lost blocks, gc current/cr lost blocks)
 Issue #2: High log file sync waits
 Issue #3: High Waits on Mutex
 Issue #4: High gc buffer busy, enq: TX -row lock contention, enq: TX - index contention, enq: TX - ITL allocate entry waits
 Issue #5: High CPU and memory consumption
References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.1.0.2 to 12.1.0.1 [Release 10.1 to 12.1]
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Gen 1 Exadata Cloud at Customer (Oracle Exadata Database Cloud Machine) - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Information in this document applies to any platform.

PURPOSE

The purpose of this document is to provide a summary of  top database and/or instance performance issues and the possible solutions in a RAC environment.

Please note that Issue #3 (High Waits on Mutexes) and Issue #5 (High CPU and Memory Consumption) are generic database issues and not RAC-specific. However, they have been included in this document because they are one of the TOP issues, and can occur locally on an instance in a RAC environment.

SCOPE

DBAs

DETAILS

Issue #1: High counts of lost blocks(gc lost blocks, gc current/cr lost blocks)

Symptoms

I. AWR reports show high number of lost blocks.

II. netstat -s reports increasing number of packet reassembly failure, dropped packets.


Solutions

Use the following document to troubleshoot and resolve lost block issue. The document describes symptoms, possible causes and solutions.

Document 563566.1 - gc block lost diagnostics

 

Issue #2: High log file sync waits

Symptoms

I. AWR reports show that log file sync is one of the TOP 5 waits consistently.

II. Average log file sync is high ( > 20 ms).

III. Average log file parallel write is high ( > 10 ms).

Iv. Average redo write broadcast ack time or wait for scn ack is high ( > 10 ms).

V. Average log file sync is low, but there are too many log file sync waits.


Background

When a user session COMMITs or ROLLBACKs, the session's redo information needs to be flushed by LGWR to the redo logfile. The user session waits on 'log file sync' while waiting for LGWR to post it back to confirm that all redo changes are safely on disk.

Example:

WAIT #0: nam='log file sync' ela= 977744 buffer#=754 p2=0 p3=0 obj#=114927 tim=1212384670107387

Parameters:

  P1 = buffer#
  P2 = Not used
  P3 = Not used
  obj# = object_id

All changes up to this buffer# (in the redo log buffer) must be flushed to disk and the writes confirmed to ensure that the transaction is committed, and will remain committed upon an instance crash.

A typical life cycle of 'log file sync' wait

1. A user sessions issues a commit or rollback and starts waiting on log file sync.
2. LGWR gather redo information to be written to redo log files, issues IO and queues BOC to an LMS process and posts LMS process.
3. LGWR waits for redo buffers to be flushed to disk and SCN to be ACK'd
4. Completion of IO and receiving ACK from LMS signal write complete. LGWR then posts foreground process to continue.
5. Foreground process wakes up and log file sync wait ends.

Important log file sync related statistics and events

redo write time - Total elapsed time of the write from the redo log buffer to the current redo log file in 10s of milliseconds.
redo writes - Total number of writes by LGWR to the redo log files. "redo blocks written" divided by this statistic equals the number of blocks per write.
log file parallel write - Time it takes for the I/Os to complete. Even though redo records are written in parallel, the parallel write is not complete until the last I/O is on disk.
redo write broadcast ack time - Total amount of the latency associated with broadcast on commit beyond the latency of the log write (in microseconds).
user commits - Number of user commits. When a user commits a transaction, the redo generated that reflects the changes made to database blocks must be written to disk. Commits often represent the closest thing to a user transaction rate.
user rollbacks - Number of times users manually issue the ROLLBACK statement or an error occurs during a user's transactions.
The script provided in Document 1064487.1 - Script to Collect Log File Sync Diagnostic Information (lfsdiag.sql)  collects useful information for troubleshooting log file sync issues.

Possible Causes

I. Poor IO service time and throughput of the devices where redo logs reside.

II. Oracle bugs. Please review WAITEVENT: "log file sync" Reference (Document 34592.1) for known oracle bugs.

III. LMS not running in RT class.

IV. Scheduling delays with LGWR process.

V. Excessive number of commits.

VI. OS resource starvation.


Solutions

I. Move redo log files to disks with higher throughput and better response time if log file parallel write is consistently high ( > 10 ms). log file parallel write should ideally be within 1-2 ms range

II. Apply fixes for the known oracle bugs, which are applicable to your environment. The most effective way to get those fixes in is to apply the latest PSU patches. Document 756671.1 has more information on the latest PSUs.

III. Ensure that LMS processes are running in RT class. LMS processes run in RT class by default.

IV. Reduce number of commits by using batch commits instead of committing after every DML operation.

V. See if any activity can safely be done with NOLOGGING / UNRECOVERABLE options.

VI. See if COMMIT NOWAIT option can be used. Please refer to Document 857576.1 for more information.          
@For Support Only: Renice LGWR to run at higher priority or run LGWR in RT class by adding LGWR to the parameter: _high_priority_processes='VKTM|LMS|LGWR'. Consider doing this only if log file sync is high and scheduling delay of LGWR is found to be causing it. Be prepared to test it thoroughly. See Doc IDs 12951619.8 and 1523164.1 where LGWR is already added to to _high_priority_processes in 11.2.0.4 (with a backport available for 11.2.0.3.5) and so there will not be any need to adjust LGWR priority manually.

Issue #3: High Waits on Mutex

Mutexes are a lighter-weight and more granular concurrency mechanism than latches.The reason for obtaining a mutex is to ensure that certain operations are properly managed for concurrency. e.g., if one session is changing a data structure in memory, then other session must wait to acquire the mutex before it can make a similar change. The following are most common mutex related waits:

A. cursor: pin S wait on X
B. cursor: pin S
C. library cache: Mutex X

Symptoms (
A)

AWR reports show cursor: pin S wait on X as one of the top wait.


Background (A)


A session may wait for this event when it is trying to get a mutex pin in Share mode but another session is holding the mutex pin on the same cursor object in exclusive. Frequently, waits for 'Cursor: pin S wait on X' is a symptom and not the cause. There may be underlying tuning requirements or known issues.

Possible Causes (A)

Please review Troubleshooting 'cursor: pin S wait on X' waits (Document 1349387.1).


Solutions (A)

Please review Troubleshooting 'cursor: pin S wait on X' waits (Document 1349387.1).


Symptoms (B)

AWR reports show cursor: pin S as one of the top waits


Background (B)

A session waits for "cursor: pin S" when it wants a specific mutex in S (share) mode on a specific cursor and there is no concurrent X holder but it could not acquire that mutex immediately. This may seem a little strange as one might question why there should be any form of wait to get a mutex which has no session holding it in an incompatible mode. The reason for the wait is that in order to acquire the mutex in S mode (or release it) the session has to increment (or decrement) the mutex reference count and this requires an exclusive atomic update to the mutex structure itself. If there are concurrent sessions trying to make such an update to the mutex then only one session can actually increment (or decrement) the reference count at a time. A wait on "cursor: pin S" thus occurs if a session cannot make that atomic change immediately due to other concurrent requests.
Mutexes are local to the current instance in RAC environments. 

Parameters:
   P1 = idn
   P2 = value
   P3 = where (where|sleeps in 10.2)

 idn is the mutex identifier value which matches to the HASH_VALUE of the SQL statement that we are waiting to get the mutex on. The SQL can usually be found using the IDN value in a query of the form:

SELECT sql_id, sql_text, version_count
FROM V$SQLAREA where HASH_VALUE=&IDN;

If the SQL_TEXT shown for the cursor is of the form "table_x_x_x_x" then this is a special internal cursor - see Document 1298471.1 for information about mapping such cursors to objects.

P1RAW is the same value in hexadecimal and it can be used to search in trace files for SQL matching to that hash value.

Possible Causes (B)

I. Heavy concurrency for a specific mutex, especially on systems with multiple CPUs.

II. Waits for very many different "idn" values when under load.

III. Oracle Bugs
      Bug 10270888 - ORA-600[kgxEndExamine-Bad-State] / mutex waits after a self deadlock
      Bug 9591812 - Wrong wait events in 11.2 ("cursor: mutex S" instead of "cursor: mutex X")
      Bug 9499302 - Improve concurrent mutex request handling
      Bug 7441165 - Prevent preemption while holding a mutex (fix only works on Solaris)
      Bug 8575528 - Missing entries in V$MUTEX_SLEEP.location


Solutions (B)

I. Apply fix for Bug 10411618.

II. For any identified "hot" SQLs, one can reduce the concurrency on specific cursors by replacing the one SQL with some variants which are executed by different sessions. Please review WAITEVENT: cursor: pin S Reference (Document 1310764.1) for further details.

III. Apply fixes for other known oracle bugs. The most effective way to get the fixes in is to apply the latest PSU patches. Document 756671.1 has more information on recommended patches.


Symptoms (C)

AWR reports show library cache: Mutex X as one of the TOP waits.


Background (C)

The library cache mutex is acquired for similar purposes that the library cache latches were acquired in prior versions of Oracle. In 10g, mutexes were introduced for certain operations in the library cache. Starting with 11g, the library cache latches were replaced by mutexes. This wait event is present whenever a library cache mutex is held in exclusive mode by a session and other sessions need to wait for it to be released.

Individual Waits:
Parameters:
  P1 = "idn" = Unique Mutex Identifier
  P2 = "value" = Session ID holding the mutex
  P3 = "where" = location in code (internal identifier) where mutex is being waited for

Systemwide Waits:
At a systemwide level, there are two views which will help diagnose this wait:

GV$MUTEX_SLEEP (or V$MUTEX_SLEEPS for non-RAC)
and GV$MUTEX_SLEEP_HISTORY (or V$MUTEX_SLEEP_HISTORY for non-RAC)

These views track the instance wide usage of mutexes since instance startup. Since these views show values that are total values since startup, they are most meaningful when you obtain the difference in values during a short time interval when there was problem. The easiest way to see this information is through an AWR or statspack report in the "Mutex Sleep Summary" section.

Possible Causes (C)

I. Frequent Hard Parses.

II. High Version Counts.

III. Invalidations and reloads.

IV. Oracle Bugs. Please review WAITEVENT: "library cache: mutex X" (Document 727400.1) for the
      list of known oracle bugs.


Solutions (C)

I. Reduce hard parsing, invalidations and reloads. Please review Troubleshooting: Tuning the Shared Pool and Tuning Library Cache Latch Contention (Document 62143.1) for more information.

II. Review  Troubleshooting: High Version Count Issues (Document 296377.1) to   resolve high version count issue.

III. Apply fixes for the known oracle bugs. The most effective way to get the fixes in is to apply the latest PSU patches. Document 756671.1 has more information on recommended patches.

Issue #4: High gc buffer busy, enq: TX -row lock contention, enq: TX - index contention, enq: TX - ITL allocate entry waits

A. enq: TX - Index Contention
B. enq: TX - ITL allocate entry waits
C. enq: TX - row lock contention

Symptoms
 (A)

I. AWR reports show high wait on enq: TX - index contention and enq: TX - row lock contention in mode 4.

II. AWR reports show high values for branch node splits, leaf node splits and leaf node 90-10 splits

III. AWR reports (Segments by Row Lock Waits) show high row lock waits for a particular segment

IV. AWR reports show other waits such as gc buffer busy waits on index branch or Leaf blocks, gc current block busy on Remote Undo Headers and gc current split.

Example:

Top 5 Timed Foreground Events

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
enq: TX - index contention 29,870 1,238 41 9.52 Concurrency

Instance Activity Stats:

Statistic Total per Second per Trans
branch node splits 945 0.26 0.00
leaf node 90-10 splits 1,670 0.46 0.00
leaf node splits 35,603 9.85 0.05

Segments by Row Lock Waits:

Owner Tablespace Object Name Obj.Type Row Lock Waits % of Capture
ACSSPROD ACSS_IDX03 ACSS_ORDER_HEADER_PK INDEX 3,425 43.62
ACSSPROD ACSS_IDX03 ACSS_ORDER_HEADER_ST INDEX 883 11.25
ACSSPROD ACSS_IDX03 ACSS_ORDER_HEADER_DT INDEX 682 8.69


Background (A)

When a transaction inserting a row in an index has to wait for the end of an index block split being done by another transaction the session would wait on event enq: TX - index contention.

Possible causes (A)

I. High number of concurrent inserts leading to excessive index block splits.

II. Right-growing index with key generated from a sequence.


Solutions (A)

Solution is to reorganize the index in a way to avoid the contention or hot spots. The options are:

I. Global Hash partition the index
II. Recreate the index as reverse key index (not suitable for large table, could require buffer cache increased accordingly)
III. If index key is generated from a sequence, increase cache size of the sequence and make the sequence 'no order' if application supports it.
IV.  Beginning in 18c, "Scalable sequences significantly reduce the sequence and index block contention and provide better data load scalability, compared to the solution of configuring a very large sequence cache using the CACHE clause of ... ALTER SEQUENCE."
Reference: https://docs.oracle.com/en/database/oracle/oracle-database/18/newft/new-features.html#GUID-CB2428B8-A5BA-4B13-B437-ECB5F0C2C84E


Symptom (B)

AWR reports show high wait on enq: TX - allocate ITL entry and enq: TX - row lock contention in mode 4.


Background (B)

A session waits on enq: TX - allocate ITL entry when it wants to lock a row in the block but one or more other sessions have rows locked in the same block, and there is no free ITL slot in the block. Usually, Oracle Database dynamically adds another ITL slot. This may not be possible if there is insufficient free space in the block to add an ITL.

Possible Causes (B)

Low inittrans setting relative to the number of concurrent transactions.


Solutions (B)

Find the segments those have high ITL waits from AWR reports or using the following SQL:

SELECT OWNER, OBJECT_NAME, OBJECT_TYPE
  FROM V$SEGMENT_STATISTICS
 WHERE STATISTIC_NAME = 'ITL waits' AND VALUE > 0
ORDER BY VALUE;

Increase inittrans value of the segments, which are seeing high ITL waits.


Symptom (C)

AWR reports show high wait on enq: TX - row lock contention in Exclusive mode (6).


Background (C)

A session waits on enq: TX - row lock contention, when it waits for a row level lock that is held by another session. This happens when one user has updated or deleted a row, but has not committed or rolled back yet, and another session wants to update or delete the same row.

Solution (C)

This is an application issue and application developer needs to be engaged to look into the SQL statements involved. The following documents might be helpful in drilling down further:

Document 102925.1 - Tracing sessions: waiting on an enqueue
Document 179582.1 - How to Find TX Enqueue Contention in RAC or OPS
Document 1020008.6 - SCRIPT: FULLY DECODED LOCKING
Document 62354.1 - TX Transaction locks - Example wait scenarios
Document 224305.1 -Autonomous Transaction can cause locking

Issue #5: High CPU and memory consumption

A. High CPU Utilization
B. High Memory Utilization

Symptoms (A)

I. OS tools such as TOP, prstat, vmstat shows user CPU usage is very high, and top cpu consuming
   processes are either oracle shadow or background processes.

II. AWR reports show top waits are one or more of the following:
    latch free
    cursor pin S wait on X or cursor pin S wait or library cache mutex X
    latch: cache buffers chains
    resmgr: cpu quantum
    enq: RO - fast object reuse
    DFS lock handle

III. AWR reports (SQLs ordered by buffer gets) show SQLs with very high buffer gets per execution
      and cpu time.

IV. Process stack of the high cpu consuming process shows that the process is spinning.


Possible Causes
 (A)

I. Oracle bugs:
  Bug 12431716 - Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU / GI PSU
  Bug 8199533 - NUMA enabled by default can cause high CPU
  Bug 9226905 - Excessive CPU usage / OERI:kkfdPaPrm from Parallel Query / High Version
  count on PX_MISMATCH
  Bug 7385253 - Slow Truncate / DBWR uses high CPU / CKPT blocks on RO enqueue
  Bug 10326338 - High "resmgr:cpu quantum" but CPU has idle time
  Bug 6455161 - Higher CPU / Higher "cache buffer chains" latch gets / Higher "consistent gets"
  after truncate/Rebuild

II. Very large SGA on Linux x86-64 platform without the implementation of Hugepages.

III. Expensive SQL queries with sub-optimal execution plans.

IV. Runaway processes.


Solutions (A)

I. Apply fix for the bug(s) that you are encountering. Most effective way to get all those fixes in is to apply the latest PSU patches. Document 756671.1 has more information on the latest PSUs.

II. Implement hugepages. Please refer to Document 361670.1 -  Slow Performance with High CPU Usage on 64-bit Linux with Large SGA for further explanation.

III. Tune the SQLs, which are incurring excessive buffer gets and cpu time. Please refer to Document 404991.1 - How to Tune Queries with High CPU Usage for more info.


Symptoms (B)

I. OS tools such as ps, pmap show process is growing in memory. pmap shows the growth is in heap and/or stack area of the process. For example,

#pmap -x 26677

Address        Kbytes   RSS   Anon   Locked Mode   Mapped File
00010000    496        480     -             - r-x--             bash
0009A000   80          80       24          - rwx--             bash
000AE000  160        160     40           - rwx--             [ heap ]
FF100000   688        688     -             - r-x--              libc.so.1

II. Session uga and/or pga memory of a oracle shadow process is growing.
       
        select se.sid,n.name,max(se.value) maxmem
           from v$sesstat se,
                    v$statname n
        where n.statistic# = se.statistic#
            and n.name in ('session pga memory','session pga memory max',
                                    'session uga memory','session uga memory max')
         group by n.name,se.sid
         order by 3;

III. Number of open cursors for a session/process is growing monotonically.


Possible Causes (B)

I. Oracle bugs:
      Bug 9919654 - High resource / memory use optimizing SQL with UNION/set functions with  
      many branches
      Bug 10042937 HIGH MEMORY GROUP IN GES_CACHE_RESS AND ORA-4031 ERRORS
      Bug 7429070 BIG PGA MEM ALLOC DURING PARSE TIME - KXS-HEAP-C
      Bug 8031768 - ORA-04031 SHARED POOL "KKJ JOBQ WOR"
      Bug 10220046 - OCI memory leak using non-blocking connection for fetches
      Bug 6356566 - Memory leak / high CPU selecting from V$SQL_PLAN
      Bug 7199645 - Long parse time / high memory use from query rewrite
      Bug 10636231 - High version count for INSERT .. RETURNING statements with
      reason INST_DRTLD_MISMATCH
      Bug 9412660 - PLSQL cursor leak / ORA-600[kglLockOwnersListDelete]
      Bug 6051972 - Memory leak if connection to database is frequently opened and closed
      Bug 4690571 - Memory leak using BULK COLLECT within cursor opened using native
      dynamic sql

II. Cursor leaks caused by application not closing cursors explicitly.

III.  SQLs with abnormally large hash join and/or sort operation.


Possible Solutions (B)

I. Apply fix for the applicable bug(s). The most effective way to get those fixes in is to apply the latest PSU patches. Document 756671.1 has more information on the latest PSUs.

II. Ensure cursors are explicitly closed by application.

III. Avoid very large hash join and/or sort operation.

REFERENCES

NOTE:1064487.1 - Script to Collect Log File Sync Diagnostic Information (lfsdiag.sql)
NOTE:1356828.1 - FAQ: 'cursor: mutex ..' / 'cursor: pin ..' / 'library cache: mutex ..' Type Wait Events
NOTE:1357946.1 - Troubleshooting 'library cache: mutex X' Waits.
NOTE:1310764.1 - WAITEVENT: "cursor: pin S" Reference Note
NOTE:1349387.1 - Troubleshooting 'cursor: pin S wait on X' waits.




BUG:9591812 - INCORRECT WAIT EVENTS IN 11.2
BUG:7429070 - BIG PGA MEM ALLOC DURING PARSE TIME - KXS-HEAP-C

BUG:8031768 - ORA-04031 SHARED POOL "KKJ JOBQ WOR"


NOTE:873243.1 - Troubleshooting 'enq: TX - index contention' Waits
BUG:10042937 - HIGH MEMORY GROUP IN GES_CACHE_RESS AND ORA-4031 ERRORS
NOTE:404991.1 - How to Tune Queries with High CPU Usage
NOTE:563566.1 - Troubleshooting gc block lost and Poor Network Performance in a RAC Environment
NOTE:34592.1 - WAITEVENT: "log file sync" Reference Note
BUG:4690571 - MEMORY LEAK USING BULK COLLECT WITHIN CURSOR OPENED USING NATIVE DYNAMIC SQL
NOTE:164768.1 - Troubleshooting: High CPU Utilization
BUG:7199645 - QUERY_REWRITE_ENABLED TAKE LONG PARSE TIME AND CONSUME ALL SERVER MEMORY
BUG:7385253 - DBWR IS CONSUMING HIGH CPU
NOTE:361670.1 - Slow Performance with High CPU Usage on 64-bit Linux with Large SGA

NOTE:1511700.1 - ORA-00060 Single-Resource Deadlock Occurs on Autonomous Transaction
BUG:6051972 - MEMORY LEAK IF CONNECTION TO DATABASE SERVER IS FREQUENTLY OPENED AND CLOSED
BUG:6356566 - STATSPACK.SNAP COSUMES ALMOST 100% CPU
NOTE:1298471.1 - H$PSEUDO_CURSOR


NOTE:786507.1 - How to Determine the Blocking Session for Event: 'cursor: pin S wait on X'
NOTE:62143.1 - Troubleshooting: Understanding and Tuning the Shared Pool
NOTE:727400.1 - WAITEVENT: "library cache: mutex X"
BUG:10411618 - ADD DIFFERENT WAIT SCHEMES FOR MUTEX WAITS
NOTE:102925.1 - Tracing Sessions that are Waiting on an Enqueue or a Lock
BUG:10636231 - HIGH VERSION COUNT FOR INSERT STATEMENTS WITH REASON INST_DRTLD_MISMATCH
BUG:6455161 - HIGHER "CONSISTENT GETS" AFTER TRUNCATE
NOTE:62354.1 - Waits for 'Enq: TX - ...' Type Events - Transaction (TX) Lock Example Scenarios
NOTE:857576.1 - Alternative and Specialised Options as to How to Avoid Waiting for Redo Log Synchronization
NOTE:756671.1 - Master Note for Database Proactive Patch Program
BUG:10220046 - MEMORY LEAK OCCURS WITH NON-BLOCKING APP IN THE LATER VERSION 10.2.0.4

NOTE:1298015.1 - WAITEVENT: "cursor: pin S wait on X" Reference Note
NOTE:1020008.6 - SCRIPT: FULLY DECODED LOCKING

No comments:

Post a Comment

Database Options/Management Packs Usage Reporting for Oracle Databases 11.2 and later (Doc ID 1317265.1)

  Database Options/Management Packs Usage Report You can determine whether an option is currently in use in a database by running options_pa...