Monday, December 2, 2019

Oracle Exadata Database Machine Setup/Configuration Best Practices (Doc ID 1274318.1)

Copyright (c) 2019, Oracle. All rights reserved. Oracle Confidential.

APPLIES TO:

Oracle Database Backup Service - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Oracle Exadata Storage Server Software - Version 11.2.1.2.0 to 12.2.1.1.2 [Release 11.2 to 12.2]
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Cloud Schema Service - Version N/A and later
Information in this document applies to any platform.

PURPOSE

The goal of this document is to present the best practices for the deployment of Sun Oracle Database Machine V2/X2-2/X2-8/X3-2/X3-8/X4-2/X4-8/X5-2 in the area of Setup and Configuration.

SCOPE

General audience working on Oracle Exadata X2-2/X2-8/X3-2/X3-8/X4-2/X4-8/X5-2/X6-2/X6-8/X7-2/X-8

DETAILS

Primary and standby databases should NOT reside on the same IB Fabric
Use hostname and domain name in lower case
Verify ILOM Power Up Configuration
Verify Hardware and Firmware on Database and Storage Servers
Verify InfiniBand Cable Connection Quality
Verify Ethernet Cable Connection Quality
Verify InfiniBand Fabric Topology (verify-topology)
Verify key InfiniBand fabric error counters are not present
Verify InfiniBand switch software version is 1.3.3-2 or higher
Verify InfiniBand subnet manager is running on an InfiniBand switch
Disable InfiniBand subnet manager service where subnet manager master should never run
Verify key parameters in the InfiniBand switch /etc/opensm/opensm.conf file
Verify There Are No Memory (ECC) Errors
Verify celldisk configuration on disk drives
Verify celldisk configuration on flash memory devices
Verify there are no griddisks configured on flash memory devices
Verify griddisk count matches across all storage servers where a given prefix name exists
Verify griddisk ASM status
Verify that griddisks are distributed as expected across celldisks
Verify the percent of available celldisk space used by the griddisks
Verify Database Server ZFS RAID Configuration
Verify InfiniBand is the Private Network for Oracle Clusterware Communication
Verify InfiniBand Address Resolution Protocol (ARP) Configuration on Database Servers
Verify Oracle RAC Databases use RDS Protocol over InfiniBand Network.
Verify Database and ASM instances use same SPFILE
Verify Berkeley Database location for Cloned GI homes
Configure Storage Server alerts to be sent via email
Configure NTP and Timezone on the InfiniBand switches
Configure NTP slew_always settings as SMF property for Solaris
Verify NUMA Configuration
Enable Xeon Turbo Boost
Verify Exadata Smart Flash Log is Created
Verify Exadata Smart Flash Cache is Created
Verify Exadata Smart Flash Cache status is "normal"
Verify Master (Rack) Serial Number is Set
Verify Management Network Interface (eth0) is on a Separate Subnet
Verify RAID disk controller CacheVault capacitor condition
Verify RAID Disk Controller Battery Condition
Verify Ambient Air Temperature
Verify operating system hugepages count satisfies total SGA requirements
Verify MaxStartups 100 in /etc/ssh/sshd_config on all database servers
Verify all datafiles have AUTOEXTEND attribute ON
Verify all BIGFILE tablespaces have non-default MAXBYTES values set
Ensure Temporary Tablespace is correctly defined
Enable portmap service if app requires it
Enable proper services on database nodes to use NFS
Be Careful when Combining the InfiniBand Network across Clusters and Database Machines
Set fast_start_mttr_target=300 to optimize run time performance of writes
Enable auditd on database servers
Verify AUD$ and FGA_LOG$ tables use Automatic Segment Space Management
Use dbca templates provided for current best practices
Updating database node OEL packages to match the cell
Disable cell level flash caching for grid disks that don't need it when using Write Back Flash Cache
Gather system statistics in Exadata mode if needed
Verify Hidden Database Initialization Parameter Usage
Verify BDB location for Cloned GI homes
Verify Shared Servers do not perform serial full table scans
Verify Write Back Flash Cache minimum version requirements
Verify bundle patch version installed matches bundle patch version registered in database
Verify database server file systems have "Maximum mount count" = "-1"
Verify database server file system have "Check interval" = "0"
Verify Automated Service Request (ASR) configuration
Verify ZFS File System User and Group Quotas are configured

Verify the file /.updfrm_exact does not exist
Verify the vm.min_free_kbytes configuration
Validate key sysctl.conf parameters on database servers
Remove "fix_control=32" from dbfs mount options
Set Linux kernel log buffer size to 1MB
Verify IP routing configuration on DB nodes
Set SQLNET.EXPIRE_TIME=10 in DB Home
Verify there are no .fuse_hidden files under the dbfs mount
Verify that the SDP over IB option "sdp_apm_enable(d)" is set to "0"
Verify /etc/oratab
Verify consistent software and configuration across nodes
Verify all database and storage servers time server configuration
Verify Sar files have read permissions for non-root user
Verify that the patch for bug 16618055 is applied
Verify the Name Service Cache Daemon (NSCD) is Running
Verify kernels and initrd in /boot/grub/grub.conf are available on the system
Verify basic Logical Volume(LVM) system devices configuration
Ensure db_unique_name is unique across the enterprise
Verify average ping times to DNS nameserver
Verify Running-config and Startup-config are the same on the Cisco switch
Validate SSH is installed and configured on Cisco management switch
Verify Database Memory Allocation is not Greater than Physical Memory Installed on Database node
Verify Cluster Verification Utility(CVU) Output Directory Contents Consume < 500MB of Disk Space
Verify active system values match those defined in configuration file "cell.conf"
Verify that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED"
Verify TCP Segmentation Offload (TSO) is set to off
Check alerthistory for stateful alerts not cleared
Check alerthistory for non-test open stateless alerts
Verify clusterware state is "Normal"
Verify the grid Infrastructure management database (MGMTDB) does not use hugepages
Verify the "localhost" alias is pingable
Verify database is not in DST upgrade state
Verify there are no failed diskgroup rebalance operations
Verify the CRS_HOME is properly locked
Verify storage server data (non-system) disks have no partitions
Verify db_unique_name is used in I/O Resource Management (IORM) interdatabase plans
Verify Datafiles are Placed on Diskgroups consisting of griddisks with cachingPolicy = DEFAULT
Verify all datafiles are placed on griddisks that are cached on flash disks
Validate key sysctl.conf parameters on database servers
Detect duplicate files in /etc/*init* directories
Verify Database Server Quorum Disks configuration
Verify Oracle Clusterware files are placed appropriately
Verify "_reconnect_to_cell_attempts=9" on database servers which access X6 storage servers
Verify passwordless SSH connectivity for Enterprise Manager (EM) agent owner userid to target component userids
Check /EXAVMIMAGES on dom0s for possible over allocation by sparse files
Verify active kernel version matches expected version for installed Exadata Image
Verify Storage Server user "CELLDIAG" exists
Verify installed rpm(s) kernel type match the active kernel version
Verify Flex ASM Cardinality is set to "ALL"
Verify "downdelay" is set correctly for bonded client interfaces
Verify ExaWatcher is executing
Verify non-Default services are created for all Pluggable Databases
Verify Grid Infrastructure Management Database (MGMTDB) configuration
Verify Automatic Storage Management Cluster File System (ACFS) file systems do not contain critical database files
Verify the ownership and permissions of the "oradism" file
Verify the SYSTEM, SYSAUX, USERS and TEMP tablespaces are of type bigfile
Verify the storage servers in use configuration matches across the cluster
Verify "asm_power_limit" is greater than zero
Verify the recommended patches for Adaptive features are installed
Verify initialization parameter cluster_database_instances is at the default value
Verify the database server NVME device configuration
Verify that Automatic Storage Management Cluster File System (ACFS) uses 4K metadata block size
Evaluate Automated Maintenance Tasks configuration
Verify proper ACFS drivers are installed for Spectre v2 mitigation
Verify Exafusion Memory Lock Configuration
Verify there are no unhealthy InfiniBand switch sensors
Refer to MOS 1682501.1 if non-Exadata components are in use on the InfiniBand fabric
Verify the ib_sdp module is not loaded into the kernel
Verify all voting disks are online
Verify available ksplice fixes are installed
Archived Best Practices
Revision History

Primary and standby databases should NOT reside on the same IB Fabric

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8, X4-2 | Linux | 11.2.x + | 11.2.x +

Benefit / Impact:
To properly protect the primary databases residing on the "primary" Exadata Database Machine, the physical standby database requires fault isolation from IB switch maintenance issues,
IB switch failures and software issues, RDS bugs and timeouts or any issue resulting from a complete IB fabric failure. To protect the standby from these failures that impact the primary's
availability, we highly recommend that at least one viable standby database resides on a separate IB fabric.

Risk:
If the primary and standby reside on the same IB fabric, both systems can become unavailable due to a bug that causes an IB fabric failure.
Action / Repair:
The primary and at least one viable standby database must not reside on the same inter-racked Exadata Database Machine. The communication between the primary and standby
Exadata Database Machines must use GigE or 10GigE. The trade-off is lower network bandwidth. The higher network bandwidth is desirable for standby database instantiation
(should only be done first time) but that requirement is eliminated for post-failover operations when flashback database is enabled.

Use hostname and domain name in lower case

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8, X4-2 | Linux, Solaris | 11.2.x + | 11.2.x +

Benefit / Impact:
Using lowercase will avoid known deployment time issues.
Risk:
OneCommand deployment will fail in step 16 if this is not done. This will abort the installation with:
"ERROR: unable to locate file to check for string 'Configure Oracle Grid Infrastructure for a Cluster ... succeeded' #Step 16#"
Action / Repair:
As a best practice, use lowercase for hostnames and domain names.

Verify ILOM Power Up Configuration

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 11/11/12 | <Name> | Production | Exadata, SSC | 14281920 - exachk
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | 11.2.2.2.0+ | Solaris - 11, Linux x86-64 UEK5.8 | exachk 2.2.2 |
Benefit / Impact:
Verifying the ILOM power up configuration helps to ensure that one or more servers boot as quickly as possible after a power interruption.
Risk:
Not verifying the ILOM power up configuration may result in unexpected server boot behavior after a power interruption.
Action / Repair:
To verify the ILOM power up configuration, as the root userid enter the following command on each database and storage server:
if [ -x /usr/bin/ipmitool ]
then
#Linux
ipmitool sunoem cli force "show /SP/policy" | grep -i power
else
#Solaris
/opt/ipmitool/bin/ipmitool sunoem cli force "show /SP/policy" | grep -i power
fi;
The output varies by Exadata software version and should be similar to:
Exadata software version 11.2.3.2.1 or higher:
HOST_AUTO_POWER_ON=disabled
HOST_LAST_POWER_STATE=enabled
Exadata software version 11.2.3.2.0 or lower:
HOST_AUTO_POWER_ON=enabled
HOST_LAST_POWER_STATE=disabled
If the output is not as expected, as the root userid use the ipmitool "set /SP/policy" command. For example:
# ipmitool sunoem cli force "set /SP/policy HOST_AUTO_POWER_ON=enabled"
Connected. Use ^D to exit.
-> set /SP/policy HOST_AUTO_POWER_ON=enabled
Set 'HOST_AUTO_POWER_ON' to 'enabled'
-> Session closed
Disconnected
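
The version-dependent expectation above can be sketched as a small helper. This is an illustrative sketch only and is not part of exachk: the function names `version_ge` and `check_ilom_policy` are hypothetical, the Exadata software version is supplied by the caller, and the version comparison assumes GNU `sort -V` is available. The policy text is read from standard input, and whitespace is stripped in case the ILOM output pads the "=" sign.

```shell
# Hypothetical helper: succeeds if version $1 >= version $2 (GNU sort -V ordering).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: compare "show /SP/policy" output (stdin) with the
# values this note expects for the Exadata software version given as $1.
check_ilom_policy() {
  exa_version="$1"
  if version_ge "$exa_version" "11.2.3.2.1"; then
    want_auto=disabled; want_last=enabled
  else
    want_auto=enabled; want_last=disabled
  fi
  # Remove blanks so "NAME = value" and "NAME=value" are treated alike.
  policy=$(tr -d ' \t')
  if printf '%s\n' "$policy" | grep -q "HOST_AUTO_POWER_ON=${want_auto}" &&
     printf '%s\n' "$policy" | grep -q "HOST_LAST_POWER_STATE=${want_last}"; then
    echo "SUCCESS: ILOM power up configuration is as expected"
  else
    echo "FAILURE: expected HOST_AUTO_POWER_ON=${want_auto} and HOST_LAST_POWER_STATE=${want_last}"
  fi
}
```

A possible invocation on Linux, reusing the command from this section, would be: ipmitool sunoem cli force "show /SP/policy" | grep -i power | check_ilom_policy 11.2.3.2.1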

 Verify Hardware and Firmware on Database and Storage Servers


Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8, X4-2 | Linux | 11.2.x + | 11.2.x +
Benefit / Impact:
The Oracle Exadata Database Machine is tightly integrated, and verifying the hardware and firmware before the Oracle Exadata Database Machine is placed into or returned to
production status can avoid problems related to the hardware or firmware modifications.
The impact for these verification steps is minimal.
Risk:
If the hardware and firmware are not validated, inconsistencies between database and storage servers can lead to problems and outages.
Action / Repair:
To verify the hardware and firmware configuration for a database server, execute the following command as the "root" userid:
/opt/oracle.SupportTools/CheckHWnFWProfile 
The output will contain a line similar to the following:
[SUCCESS] The hardware and firmware profile matches one of the supported profile 
If any result other than "SUCCESS" is returned, investigate and correct the condition.
To verify the hardware and firmware configuration for a storage server, execute the following "cellcli" command as the "cellmonitor" userid:
CellCLI> alter cell validate configuration
The output will be similar to:
Cell <cell> successfully altered 
If any result other than "successfully altered" is returned, investigate and correct the condition.
NOTE: CheckHWnFWProfile is also executed at each boot of a database server.

NOTE: "alter cell validate configuration" is also executed once a day on a storage server by the MS process and the result is written into the storage server alert history.

Verify InfiniBand Cable Connection Quality

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8, X4-2 | Linux | 11.2.x + | 11.2.x +
Benefit / Impact:
InfiniBand cables require proper connections for optimal efficiency. Verifying the InfiniBand cable connection quality helps to ensure that the InfiniBand network operates at optimal efficiency.
There is minimal impact to verify InfiniBand cable connection quality.
Risk:
InfiniBand cables that are not properly connected may negotiate to a lower speed, work intermittently, or fail.
Action / Repair:
Linux

Execute the following command as the "root" userid on all database and storage servers:

for ib_cable in `ls /sys/class/net | grep ^ib`; do printf "$ib_cable: "; cat /sys/class/net/$ib_cable/carrier; done 
The output should look similar to:
ib0: 1
ib1: 1
If anything other than "1" is reported, investigate that cable connection.

Solaris

Execute the following command as the "root" userid on all database servers:
dladm show-ib | grep -v LINK | sed -e 's/  */ /g' -e 's/^ *//' | awk '{print $1":", $5}'| sort
The output should be similar to:
ib0: up
ib1: up 
If anything other than "up" is reported, investigate that cable connection.
NOTE: Storage servers should report 2 connections. X2-2(4170) and X2-2 database servers should report 2 connections. X2-8 database servers should report 8 connections.
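
For Linux servers, the expected connection counts in the note above can be checked with a short sketch like the following. It is illustrative only: the function name `check_link_count` and its arguments are hypothetical, and the optional third argument exists solely so the logic can be exercised against a directory other than /sys/class/net.

```shell
# Hypothetical helper: count interfaces whose carrier file reads "1".
# $1 = interface name prefix (for example ib)
# $2 = expected number of connected links
# $3 = sysfs network directory (defaults to /sys/class/net)
check_link_count() {
  prefix="$1"; expected="$2"; netdir="${3:-/sys/class/net}"
  up=0
  for dev in "$netdir"/"$prefix"*; do
    # Skip the unexpanded pattern when no interface matches.
    if [ ! -e "$dev/carrier" ]; then
      continue
    fi
    state=$(cat "$dev/carrier" 2>/dev/null)
    if [ "$state" = "1" ]; then
      up=$((up + 1))
    fi
  done
  if [ "$up" -eq "$expected" ]; then
    echo "SUCCESS: $up of $expected $prefix links report carrier 1"
  else
    echo "FAILURE: $up of $expected $prefix links report carrier 1"
  fi
}
```

Under these assumptions, a storage server or an X2-2 database server would be checked with "check_link_count ib 2", and an X2-8 database server with "check_link_count ib 8".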

Verify Ethernet Cable Connection Quality

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8, X4-2 | Linux | 11.2.x + | 11.2.x +
Benefit / Impact:
Ethernet cables require proper connections for optimal efficiency. Verifying the Ethernet cable connection quality helps to ensure that the Ethernet network operates at optimal efficiency.
There is minimal impact to verify Ethernet cable connection quality.
Risk:
Ethernet cables that are not properly connected may negotiate to a lower speed, work intermittently, or fail.
Action / Repair:
Execute the following command as the root userid on all database and storage servers:
for cable in `ls /sys/class/net | grep ^eth`; do printf "$cable: "; cat /sys/class/net/$cable/carrier; done 
The output should look similar to:
eth0: 1
eth1: cat: /sys/class/net/eth1/carrier: Invalid argument
eth2: cat: /sys/class/net/eth2/carrier: Invalid argument
eth3: cat: /sys/class/net/eth3/carrier: Invalid argument
eth4: 1
eth5: 1 
"Invalid argument" usually indicates the device has not been configured and is not in use. If a device reports "0", investigate that cable connection.
NOTE: Within machine types, the output of this command will vary by customer depending on how the customer chooses to configure the available ethernet cards.
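
The three outcomes described above ("1", "0", and "Invalid argument") can be told apart mechanically. The following is a minimal sketch, assuming the command output is supplied on standard input in the "ethN: value" form shown in this section; the function name `classify_eth_links` and its message texts are hypothetical.

```shell
# Hypothetical helper: classify lines such as "eth0: 1" read from stdin.
# Returns non-zero if any device reports "0" or an unrecognized state.
classify_eth_links() {
  awk -F': ' '
    { sub(/[ \t]+$/, "") }    # drop trailing blanks before comparing fields
    $2 == "1"          { print $1 ": link up"; next }
    $2 == "0"          { print $1 ": NO CARRIER, investigate cable"; bad = 1; next }
    /Invalid argument/ { print $1 ": not configured / not in use"; next }
                       { print $1 ": unrecognized state"; bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}
```

Because the "Invalid argument" text is written by cat to standard error, the loop shown above would need "2>&1" appended after "done" before its output is piped into this function.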

Verify the InfiniBand Fabric Topology (verify-topology)

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | WARN | 09/05/18 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, RA | ALL | 20144798 - exachk
DB Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 18.4.0 | N/A
Benefit / Impact:
Verifying that the InfiniBand network is configured with the correct topology for an Oracle Exadata Database Machine helps to ensure that the InfiniBand network operates at maximum efficiency.
Risk:
An incorrect InfiniBand topology will cause the InfiniBand network to operate at degraded efficiency, intermittently, or fail to operate.
Action / Repair:
To verify the InfiniBand Fabric Topology, execute the following code set as the "root" userid on one database server in the Exadata environment:
unset VT_ERRORS
unset VT_WARNINGS
VT_OUTPUT=$(/opt/oracle.SupportTools/ibdiagtools/verify-topology)
VT_WARNINGS=$(echo "$VT_OUTPUT" | egrep WARNING)
VT_ERRORS=$(echo "$VT_OUTPUT" | egrep ERROR)
if [ -n "$VT_ERRORS" ]
then
  echo -e "FAILURE: verify-topology returned one or more errors (and perhaps warnings).\nDETAILS:\n$VT_OUTPUT"
elif [ -n "$VT_WARNINGS" ]
then
  echo -e "WARNING: verify-topology returned one or more warnings.\nDETAILS:\n$VT_OUTPUT"
else
  echo -e "SUCCESS: verify-topology returned no errors or warnings."
fi
The expected output is:
SUCCESS: verify-topology returned no errors or warnings.
An example of a "FAILURE:" message:
FAILURE: verify-topology returned one or more errors (and perhaps warnings).
DETAILS:
   [ DB Machine Infiniband Cabling Topology Verification Tool. ]
Every node is connected to two leaf switches in a single rack.......................................................[FAILED]
[ERROR] 
Node randomcel06 (Guid: 21280001f00464 ) is connected to just one leaf switch randomsw-ib2(Guid: 2128f57723a0a0 )
Error found in following rack
<output truncated>
An example of a "WARNING:" message:
WARNING: verify-topology returned one or more warnings.
DETAILS:

   [ DB Machine Infiniband Cabling Topology Verification Tool ]
        [Version IBD VER 2.b ]

[WARNING] - Non-Exadata nodes detected! Please ensure this is OK
Approximating classification into cells and db hosts

Software UPGRADE required for the tool to be accurate

Looking at 1 rack(s).....
<output truncated>
If anything other than "SUCCESS:" is reported, investigate and correct the underlying fault(s).

Verify key InfiniBand fabric error counters are not present

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | WARN | 09/28/16 | <Name> | Production | Exadata-Management Domain, Exadata-Physical, SSC, Exalogic |
DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X6-2, X6-8 | 11.2.x+ | Linux x86-64, Solaris - 11 | exachk 12.1.0.2.8 |

Benefit / Impact:

Verifying key InfiniBand fabric error counters are not present helps to maintain the InfiniBand fabric at peak efficiency.
The impact of verifying key InfiniBand fabric error counters are not present is minimal. The impact of correcting key InfiniBand fabric error counters varies depending upon the root cause of the specific error counter present, and cannot be estimated here.

Risk:
If key InfiniBand fabric error counters are present, the fabric may be running in degraded condition or lack redundancy.
NOTE: Uncorrected symbol errors increase the risk of node evictions and application outages.

Action / Repair:
To verify key InfiniBand fabric error counters are not present, execute the following command set as the "root" userid on one database server:
NOTE: This will not work in the user domain of a virtualized environment.
if [[ -d /proc/xen && ! -f /proc/xen/capabilities ]]
then
     echo -e "\nThis check will not run in a user domain of a virtualized environment. Execute this check in the management domain.\n"
else
    RAW_DATA=$(ibqueryerrors | egrep 'SymbolError|LinkDowned|RcvErrors|RcvRemotePhys|LinkIntegrityErrors');
    CRITICAL_DATA=$(echo "$RAW_DATA" | egrep 'SymbolError|RcvErrors');
    WARNING_DATA=$(echo "$RAW_DATA" | egrep -v 'SymbolError|RcvErrors');
    if [ -z "$RAW_DATA" ]
    then
        echo -e "SUCCESS: Key InfiniBand fabric error counters were not found"
    else
        # Any symbol error or receive error counter makes the result a FAILURE
        if [ $(echo "$RAW_DATA" | egrep 'SymbolError|RcvErrors' | wc -l) -gt 0 ]
        then
            echo -e "FAILURE: receive errors or symbol errors or both were found:\n\nCounters found:\n"
            echo -e "CRITICAL DATA:\n\n$CRITICAL_DATA\n\n\nWARNING DATA:\n\n$WARNING_DATA"
        else
            echo -e "WARNING: Key InfiniBand fabric error counters were found\n\nCounters Found:\n";
            echo -e "CRITICAL DATA:\n\n$CRITICAL_DATA\n\n\nWARNING DATA:\n\n$WARNING_DATA"
        fi;
    fi;
fi;

The expected output should be:
SUCCESS: Key InfiniBand fabric error counters were not found

- OR -

This check will not run in a user domain of a virtualized environment. Execute this check in the management domain.

Example of a FAILURE result:
FAILURE: receive errors or symbol errors or both were found:
Counters found:
CRITICAL DATA:
GUID 0x10e00001451161 port 1: [SymbolErrorCounter == 1367] [PortRcvErrors == 1367]
GUID 0x10e08027b8a0a0 port ALL: [SymbolErrorCounter == 54679] [LinkErrorRecoveryCounter == 76]
<output truncated>
WARNING DATA:
GUID 0x21280001fca219 port 1: [LinkDownedCounter == 1]
GUID 0x21280001fca21a port 2: [LinkDownedCounter == 1]
<output truncated>
Example of a WARNING result:
WARNING: Key InfiniBand fabric error counters were found

CRITICAL DATA:
WARNING DATA:
GUID 0x10e00001886289 port 1: [LinkDownedCounter == 1] [PortXmitDiscards == 272] [PortXmitWait == 2021116]
GUID 0x10e0802617a0a0 port ALL: [LinkErrorRecoveryCounter == 63]
GUID 0x10e0802617a0a0 port 1: [LinkErrorRecoveryCounter == 10]
GUID 0x10e0802617a0a0 port 2: [LinkErrorRecoveryCounter == 11]
<output truncated>
In general, if the output is not "SUCCESS...", follow the diagnostic guidance in the following documents:
Special Notes on Symbol errors:

Symbol errors create a much higher risk of node evictions if the error rate is too high. On the InfiniBand switches, there is a mechanism that will automatically down a port if the error rate becomes too high. On the database and storage servers, there is no such mechanism at this time, so it is recommended to examine the Symbol error rate manually, using ExaWatcher data.

NOTE: In the following example, all data pertaining to InfiniBand switches has been filtered out for brevity.

As the "root" userid, the following example demonstrates how to examine the Symbol error rate using ExaWatcher.

1) From the manual output, make note of the GUIDs with SymbolErrorCounter present:

FAILURE: receive errors or symbol errors or both were found:
  Counters found:   
<output truncated>   
GUID 0x10e00001451161 port 1: [SymbolErrorCounter == 1123] [PortRcvErrors == 1123] [PortXmitWait == 230121020]
<output truncated>

2) Use the following command to identify the server with the symbol errors present:

[root@randomadm01 ~]# ibqueryerrors -G 0x10e00001451161 | head -1
Errors for "randomadm02 S 192.168.8.16,192.168.8.17 HCA-4"

3) Log onto the database server identified in the command above, randomadm02.

4) Change to the ExaWatcher directory for IB hca information (the default is in use here):

# cd /opt/oracle.ExaWatcher/archive/IBCardInfo.ExaWatcher

5) Using the port identified in 1), use the following command to condense all relevant available ExaWatcher data (it removes "0" entries):

[root@randomadm02 IBCardInfo.ExaWatcher]# cat <(bzcat *.bz2) <(cat *.dat) | egrep "port 1" -A23 | egrep SymbolError | grep -v '0[[:blank:][:cntrl:]]*$' | sort -k1.2,10 -k2.1,8
<output truncated>            
[09/13/2016 02:38:18] SymbolErrorCounter           999                    1
[09/13/2016 02:43:20] SymbolErrorCounter           1030                   31             
 <output truncated>           
[09/13/2016 17:28:56] SymbolErrorCounter           1062                   1
[09/13/2016 17:39:00] SymbolErrorCounter           1085                   23
[09/13/2016 17:59:10] SymbolErrorCounter           1100                   5
<output truncated>

6) Calculate the symbol error rate per minute. By default, ExaWatcher data intervals are 5 minutes, but that can be changed. Using these two lines:

[09/13/2016 17:28:56] SymbolErrorCounter           1062                   1               
[09/13/2016 17:39:00] SymbolErrorCounter           1085                   23 

The delta between 17:28 and 17:39 is "23". The time interval is 10 minutes, so 23 / 10 is 2.3 symbol errors per minute.

NOTE ESPECIALLY!! If the symbol error rate is consistently greater than 2 per minute, investigate for root cause and take corrective action!
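
The arithmetic in step 6 can be sketched as two small helpers. These are illustrative only: the function names `symbol_error_rate` and `rate_exceeds_guideline` are hypothetical, the counter values and elapsed minutes are read off the ExaWatcher lines by hand, and a single interval above the guideline does not by itself show that the rate is "consistently" high.

```shell
# Hypothetical helper: symbol errors per minute between two ExaWatcher samples.
# $1 = earlier counter value, $2 = later counter value, $3 = minutes elapsed.
symbol_error_rate() {
  awk -v a="$1" -v b="$2" -v m="$3" 'BEGIN {
    if (m <= 0) { print "interval must be positive" > "/dev/stderr"; exit 2 }
    printf "%.1f\n", (b - a) / m
  }'
}

# Hypothetical helper: succeeds when a rate is above the 2-per-minute
# guideline stated in this note.
rate_exceeds_guideline() {
  awk -v r="$1" 'BEGIN { exit !(r > 2) }'
}
```

With the two sample lines above, "symbol_error_rate 1062 1085 10" prints 2.3, and "rate_exceeds_guideline 2.3" succeeds, so this interval would warrant a closer look at the surrounding samples.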

NOTE: The InfiniBand fabric error counters should be cleared and validated after any maintenance activity.

NOTE: The InfiniBand fabric error counters are cumulative and the errors may have occurred at any time in the past. This check is the result at one point in time, and cannot advise anything about history or an error rate.

NOTE: This check should not be considered complete validation of the InfiniBand fabric. Even if this check indicates success, there may still be issues on the InfiniBand fabric caused by other, rarer InfiniBand fabric error counters being present. If there are or appear to be issues with the InfiniBand fabric while this check passes, perform a full evaluation of the "ibqueryerrors" command output and the output of other commands such as "ibdiagnet".

NOTE: Depending upon the Exadata version, the key InfiniBand fabric error counters have different names. In the following list, the older version of the counter name is shown in square brackets.

Key InfiniBand fabric error counters list:
SymbolErrorCounter [SymbolErrors]
LinkErrorRecoveryCounter [LinkRecovers]
LinkDownedCounter [LinkDowned]
PortRcvErrors [RcvErrors]
PortRcvRemotePhysicalErrors [RcvRemotePhysErrors]
LocalLinkIntegrityErrors [LinkIntegrityErrors]
NOTE: Some InfiniBand fabric error counters (for example, "SymbolErrorCounter [SymbolErrors]", "PortRcvErrors [RcvErrors]") can increment when nodes are rebooted. Small values for these InfiniBand fabric error counters which are less than the "LinkDownedCounter [LinkDowned]" counters are generally not a problem. The "LinkDownedCounter [LinkDowned]" counters indicate the number of times the port has gone down (usually for valid reasons, such as a node reboot) and are not typically an error indicator by themselves.

NOTE: Links reporting high, persistent error rates (especially "SymbolErrorCounter [SymbolErrors]", "LinkErrorRecoveryCounter [LinkRecovers]", "PortRcvErrors [RcvErrors]", "LocalLinkIntegrityErrors [LinkIntegrityErrors]") often indicate a bad or loose cable or port issues.

 

Verify InfiniBand switch software version is 1.3.3-2 or higher

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 11/01/11 | X2-2(4170), X2-2, X2-8, X4-2 | Linux, [WIP:VW]Solaris | 11.2.x + | 11.2.x +

Benefit / Impact:
The impact of verifying that the InfiniBand switch software is at version 1.3.3-2 or higher is minimal. The impact of upgrading the InfiniBand switch(es) to 1.3.3-2 varies depending upon the upgrade method
chosen and your current InfiniBand switch software level.
Risk:
InfiniBand switch software version 1.3.3-2 fixes several potential InfiniBand fabric stability issues. Remaining on an InfiniBand switch software version below 1.3.3-2 raises the risk of experiencing a potential outage.
Action / Repair:
To verify the InfiniBand switch software version, log onto the InfiniBand switch and execute the following command as the "root" userid:
version | head -1 | cut -d" " -f5
The output should be similar to:
1.3.3-2
If the output is not 1.3.3-2 or higher, upgrade the InfiniBand switch software to at least version 1.3.3-2.
NOTE: Patch 12373676 provides InfiniBand software version 1.3.3-2 and instructions.
NOTE: Upgrading to 1.3.3-2 may be performed as a rolling upgrade without an outage. The InfiniBand switch software is not dependent upon any other components in the Oracle Exadata Database Machine
and may be upgraded at any time.
NOTE: If your InfiniBand switch is at software version 1.0.1-1, it will need to first be upgraded to 1.1.3-1 or 1.1.3-2 before it can be upgraded to 1.3.3-2. The InfiniBand switch software cannot be upgraded
directly from 1.0.1-1 to 1.3.3-2.
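
The version threshold and the two-step path for 1.0.1-1 can be sketched as follows. This is an illustrative sketch rather than a supported tool: the function names `version_ge` and `ib_switch_upgrade_path` are hypothetical, the comparison assumes GNU `sort -V`, and for versions other than 1.0.1-1 the supported path should still be confirmed against the patch instructions.

```shell
# Hypothetical helper: succeeds if version $1 >= version $2 (GNU sort -V ordering).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: print the upgrade steps this note implies for an
# InfiniBand switch currently at software version $1.
ib_switch_upgrade_path() {
  current="$1"
  if version_ge "$current" "1.3.3-2"; then
    echo "OK: $current meets the 1.3.3-2 minimum"
  elif [ "$current" = "1.0.1-1" ]; then
    # This note states 1.0.1-1 cannot be upgraded directly to 1.3.3-2.
    echo "UPGRADE: $current -> 1.1.3-1 or 1.1.3-2 -> 1.3.3-2"
  else
    echo "UPGRADE: $current -> 1.3.3-2"
  fi
}
```

On a switch, the argument could be taken from the command shown above, for example: ib_switch_upgrade_path "$(version | head -1 | cut -d" " -f5)"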

Verify the Master Subnet Manager is running on an InfiniBand switch

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 11/28/18 | <Name> | Development | Exadata - Physical, Exadata - Management Domain | ALL | 28862740 - exachk
DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 18.4.0 | N/A
Benefit / Impact:
Having the Master Subnet Manager reside in the correct location improves the stability, availability and performance of the InfiniBand fabric. The impact of verifying the Master Subnet Manager is running on an InfiniBand switch is minimal. The impact of moving the Master Subnet Manager varies depending upon where it is currently executing and to where it will be relocated.
Risk:
If the Master Subnet Manager is not running on an InfiniBand switch, the InfiniBand fabric may crash during certain fabric management transitions.
Action / Repair:
To verify the Master Subnet Manager is located on an InfiniBand switch, execute the following command set as the "root" userid on a database server:
SUBNET_MGR_MSTR_OUTPUT=$(sminfo)
IBSWITCHES_OUTPUT=$(ibswitches)
SUBNET_MGR_MSTR_GID=$(echo "$SUBNET_MGR_MSTR_OUTPUT" | cut -d" " -f7 | cut -c3-16)
SUBNET_MGR_MSTR_LOC_RESULT=1
for IB_NODE_GID in $(echo "$IBSWITCHES_OUTPUT" | cut -c14-27)
do
  if [ "$SUBNET_MGR_MSTR_GID" = "$IB_NODE_GID" ]
  then
    SUBNET_MGR_MSTR_LOC_RESULT=0
    SUBNET_MGR_MSTR_LOC_SWITCH=$(echo "$IBSWITCHES_OUTPUT" | grep $IB_NODE_GID)
  fi
done
if [ $SUBNET_MGR_MSTR_LOC_RESULT -eq 0 ]
then
  echo -e "SUCCESS: the Master Subnet Manager is executing on InfiniBand switch:\n$(echo "$SUBNET_MGR_MSTR_LOC_SWITCH")"
else
  echo -e "FAILURE: the Master Subnet Manager does not appear to be executing on an InfiniBand switch:\n$(echo "$SUBNET_MGR_MSTR_OUTPUT")"
fi

The output should be similar to:

SUCCESS: the Master Subnet Manager is executing on InfiniBand switch:
Switch  : 0x002128469b03a0a9 ports 36 "SUN DCS 36P QDR randomsw-iba0 <IP>" enhanced port 0 lid 1 lmc 0

Example of a "FAILURE" result:

FAILURE: the Master Subnet Manager does not appear to be executing on an InfiniBand switch:
sminfo: sm lid 3 sm guid 0x10e0cdce81a0a9, activity count 3362634 priority 8 state 3 SMINFO_MASTER

If the result is "FAILURE", investigate the guid provided, relocate the Master Subnet Manager to a correct InfiniBand switch, and prevent the Subnet Manager from starting on the component where the Master Subnet Manager was found to be executing.

NOTES:
  1. The InfiniBand network can have more than one Subnet Manager, but only one Subnet Manager is active at a time. The active Subnet Manager is the Master Subnet Manager. The other Subnet Managers are the Standby Subnet Managers. If a Master Subnet Manager is shut down or fails, then a Standby Subnet Manager automatically becomes the Master Subnet Manager.
  2. There are typically several Standby Subnet Managers waiting to take over should the current Master Subnet Manager either fail or be manually moved to some other component with an available Standby Subnet Manager. Only run Subnet Managers on the InfiniBand switches specified for use in Oracle Exadata Database Machine, Oracle Exalogic Elastic Cloud, Oracle Big Data Appliance, and Oracle SuperCluster. Running Subnet Manager on any other device is not supported.
  3. For pure multirack Exadata deployments with fewer than 4 racks, the Subnet Manager should run on all spine and leaf InfiniBand switches. For deployments with 4 or more Exadata racks, the Subnet Manager should run only on spine InfiniBand switches. For additional configuration information, please see section "4.6.7 Understanding the Network Subnet Manager Master" of the "Exadata Database Machine Maintenance Guide".
  4. For InfiniBand fabric configurations that involve a mix of different Oracle Engineered Systems, please refer to: MOS note 1682501.1
  5. Moving the Master Subnet Manager is sometimes required during maintenance and patching operations. For additional guidance on maintaining the Master Subnet Manager, please see section "4.6 Maintaining the InfiniBand Network" of the "Exadata Database Machine Maintenance Guide".
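
The command set above compares GUIDs by fixed character positions. As a hedged alternative, the sketch below extracts the GUIDs by pattern and normalizes them (dropping "0x" and leading zeros) before comparing. The function names are hypothetical helpers, not part of any Oracle tool, and the input formats are assumed to match the sample "sminfo" and "ibswitches" output shown in this section.

```shell
# Normalize a GUID: lowercase, drop the "0x" prefix and any leading zeros.
normalize_guid() {
  printf '%s\n' "$1" | tr 'A-F' 'a-f' | sed -e 's/^0[xX]//' -e 's/^0*//'
}

# Print the GUID that follows "sm guid" in a line of sminfo output.
sm_guid_from_sminfo() {
  printf '%s\n' "$1" | sed -n 's/.*sm guid \(0x[0-9a-fA-F]*\).*/\1/p'
}

# Print the GUID of every "Switch : 0x..." line in ibswitches output.
switch_guids() {
  printf '%s\n' "$1" | sed -n 's/^Switch[[:space:]]*:[[:space:]]*\(0x[0-9a-fA-F]*\).*/\1/p'
}

# Return 0 if GUID $1 belongs to one of the switches listed in $2.
guid_is_switch() {
  want=$(normalize_guid "$1")
  for sw in $(switch_guids "$2"); do
    [ "$(normalize_guid "$sw")" = "$want" ] && return 0
  done
  return 1
}
```

On a database server this could be used as: guid_is_switch "$(sm_guid_from_sminfo "$(sminfo)")" "$(ibswitches)" && echo SUCCESS || echo FAILURE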

 

  Verify the Subnet Manager is properly disabled

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 11/28/18 | <Name> | Development | Exadata - Physical, Exadata - Management Domain | ALL | 28768896 - exachk, 14534296 - exachk, 16270663 - exachk, 16795289 - exachk

DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 18.4.0 | N/A
Benefit / Impact:
NOTE: The Subnet Manager should only execute on InfiniBand switches. It should be disabled on any other component attached to an InfiniBand fabric.

Having the Subnet Manager executing in the correct locations improves the stability, availability and performance of the InfiniBand fabric. The impact of verifying that the Subnet Manager is disabled on components where the Master Subnet Manager should never reside is minimal. The impact of disabling the Subnet Manager varies depending upon the component type where it is found to be incorrectly executing, and whether or not the Master Subnet Manager is incorrectly executing on that component.

Risk:
Unexpected behavior, such as connectivity or performance loss, can occur if the Subnet Manager is executing on an unexpected component in the InfiniBand fabric.
Action / Repair:
To verify the Subnet Manager is disabled on components where the Master Subnet Manager should never reside, execute the following command set as the "root" userid on all database and storage servers:
unset COMMAND_OUTPUT
COMMAND_OUTPUT=$(ps -ef | grep -i '[o]pensm')
if [ -n "$COMMAND_OUTPUT" ]
then
  echo -e "FAILURE: the Subnet Manager is executing.\nDETAILS:\n$COMMAND_OUTPUT"
else
  echo -e "SUCCESS: the Subnet Manager is not executing."
fi

The expected output is:

SUCCESS: the Subnet Manager is not executing.

Example of a "FAILURE" output:

FAILURE: the Subnet Manager is executing.
DETAILS:
root      2627     1  0 Mar24 ?        12:14:31 /usr/sbin/opensm --daemon

If the result is "FAILURE", investigate why the Subnet Manager is executing, relocate the Master Subnet Manager if necessary, and prevent the Subnet Manager from starting in the future.

NOTES:
  1. The command set provided is for Oracle Exadata Database Machines only. If there are non-Exadata components residing on the InfiniBand fabric (e.g., a media server), refer to the provided documentation for that component.
  2. There are typically several Standby Subnet Managers waiting to take over should the current Master Subnet Manager either fail or be manually moved to some other component with an available Standby Subnet Manager. Only run Subnet Managers on the InfiniBand switches specified for use in Oracle Exadata Database Machine, Oracle Exalogic Elastic Cloud, Oracle Big Data Appliance, and Oracle SuperCluster. Running Subnet Manager on any other device is not supported.
  3. For pure multirack Exadata deployments with fewer than 4 racks, the Subnet Manager should run on all spine and leaf InfiniBand switches. For deployments with 4 or more Exadata racks, the Subnet Manager should run only on spine InfiniBand switches. For additional configuration information, please see section "4.6.7 Understanding the Network Subnet Manager Master" of the "Exadata Database Machine Maintenance Guide".
  4. For InfiniBand fabric configurations that involve a mix of different Oracle Engineered Systems, please refer to: MOS note 1682501.1
  5. Moving the Master Subnet Manager is sometimes required during maintenance and patching operations. For additional guidance on maintaining the Master Subnet Manager, please see section "4.6 Maintaining the InfiniBand Network" of the "Exadata Database Machine Maintenance Guide".
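
As a hedged variation on the check above, the sketch below matches the command column of "ps -ef" output rather than grepping the whole line, so that processes which merely mention "opensm" in their arguments are not reported. The function name is hypothetical, and it assumes the command path is the 8th whitespace-separated column, as in the sample FAILURE output.

```shell
# Read "ps -ef" style lines on stdin and report whether opensm is running.
# Assumption: column 8 holds the command path (UID PID PPID C STIME TTY TIME CMD).
opensm_check() {
  awk '$8 ~ /(^|\/)opensm$/ { found = found $0 "\n" }
       END {
         if (found != "") {
           printf "FAILURE: the Subnet Manager is executing.\nDETAILS:\n%s", found
           exit 1
         }
         print "SUCCESS: the Subnet Manager is not executing."
       }'
}
```

Usage on a database or storage server: ps -ef | opensm_check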

Verify There Are No Memory (ECC) Errors

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 11/16/16 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, SSC | TBD

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8 | 11.2.2.2.0+ | Solaris - 11, Linux x86-64 | EXAchk 12.2.0.1.2, exachk 2.2.4
Benefit / Impact:
Memory modules that have corrected memory (ECC) errors can show degraded performance, IPMI driver timeouts, and BMC error messages in the /var/log/messages file.
Correcting the condition restores optimal performance.
The impact of checking for memory ECC errors is slight. Correction will likely require node downtime for hardware diagnostics or repair.
Risk:
If not corrected, the faulty memory will lead to performance degradation and other errors.
Action / Repair:
To verify there are no memory (ECC) errors, run the following commands as the "root" userid on all database and storage servers:
if [ -x /usr/bin/ipmitool ]
then
  #Linux
  IPMI_COMMAND=ipmitool;
else
  #Solaris
  IPMI_COMMAND=/opt/ipmitool/bin/ipmitool
fi;
ECC_OUTPUT=$($IPMI_COMMAND sel list | grep Memory | grep ECC)
if [ -z "$ECC_OUTPUT" ]
then
  echo -e "SUCCESS: No memory ECC errors were found.\nECC list:\n\n$ECC_OUTPUT"
else
  echo -e "FAILURE: Memory ECC errors were found.\nECC list:\n\n$ECC_OUTPUT"
fi

The expected output should be:
SUCCESS: No memory ECC errors were found.
ECC list:

Example of a FAILURE result:
FAILURE: Memory ECC errors were found.
ECC list:

 24f | 09/16/2016 | 09:32:59 | Memory #0x53 | Correctable ECC | Asserted
If any errors are reported, take the following corrective actions in order:
1) Reseat the DIMM.
2) Open an SR for hardware replacement.
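
When deciding which DIMM to reseat, it can help to see how many ECC events each memory sensor has logged. The following is a minimal sketch, assuming the pipe-separated "ipmitool sel list" layout shown in the FAILURE example above; the function name is hypothetical.

```shell
# Count ECC events per memory sensor from "ipmitool sel list" output on stdin.
# Assumption: fields are separated by "|", with the sensor in field 4
# and the event description in field 5.
ecc_by_sensor() {
  awk -F' *[|] *' '$4 ~ /^Memory/ && $5 ~ /ECC/ { n[$4]++ }
       END { for (s in n) printf "%s: %d\n", s, n[s] }' | sort
}
```

Usage: $IPMI_COMMAND sel list | ecc_by_sensor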

Verify celldisk configuration on disk drives

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 12/06/11 | X2-2(4170), X2-2, X2-8, X4-2 | Linux, Solaris | 11.2.x + | 11.2.x +
Benefit / Impact:
The definition and maintenance of storage server celldisks is critical for optimal performance and outage avoidance.
The impact of verifying the basic storage server celldisk configuration is minimal. Correcting any abnormalities is dependent upon the reason for the anomaly, so the impact cannot be estimated here.
Risk:
If the basic storage server celldisk configuration is not verified, poor performance or unexpected outages may occur.
Action / Repair:
To verify the basic storage server celldisk configuration on disk drives, execute the following command as the "celladmin" user on each storage server:
cellcli -e "list celldisk where disktype=harddisk and status=normal" | wc -l

The output should be:

12

If the output is not as expected, investigate the condition and take corrective action based upon the root cause of the unexpected result.

NOTE: On a storage server configured according to Oracle best practices, there should be 12 celldisks on disk drives with a status of "normal".
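
The count check above prints a bare number. As a sketch only, a small wrapper can turn it into the SUCCESS/FAILURE style used by the other checks in this note. The function name and message wording are hypothetical.

```shell
# Compare an observed count with the expected one and print a result line.
# $1 = label, $2 = observed count, $3 = expected count
check_count() {
  if [ "$2" -eq "$3" ]; then
    echo "SUCCESS: $1 count is $2."
  else
    echo "FAILURE: $1 count is $2, expected $3."
    return 1
  fi
}
```

Usage on a storage server: check_count "hard disk celldisk" "$(cellcli -e "list celldisk where disktype=harddisk and status=normal" | wc -l)" 12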

Verify celldisk configuration on flash memory devices

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 11/15/2017 | <Name> | Production | Exadata | 27119016 - exachk, 24514400 - exachk

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | X2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, X7-2, X7-8 | 11.2+ | Linux x86-64 | exachk 12.2.0.1.4 |

Benefit / Impact:
The definition and maintenance of storage server celldisks is critical for optimal performance and outage avoidance. The number of celldisks configured on flash memory devices varies by hardware version. Each celldisk configured on flash memory devices should have a status of "normal".
The impact of verifying the celldisk configuration on flash memory devices is minimal. The impact of correcting any anomalies is dependent upon the reason for the anomaly and cannot be estimated here.
Risk:
If the celldisk configuration on flash memory devices is not verified, poor performance or unexpected outages may occur.
Action / Repair:
To verify the celldisk configuration on flash memory devices, execute the following command as the "root" userid on each storage server:
cellcli -e "list celldisk where disktype=flashdisk and status=normal" | wc -l
The output should be similar to the following and match one of the rows in the "Celldisk on Flash Memory Devices Mapping Table":
16
Celldisk on Flash Memory Devices Mapping Table

System Description | Common Name | Disk Type | Number of Devices
X4275 | X2-2(4170) | MIXED | 16
X4270 M2 | X2-2, X2-8 | MIXED | 16
X4270 M3 | X3-2, X3-8 | MIXED | 16
X4270 M3 | EIGHTH | MIXED |
X4-2L | X4-2 | MIXED | 16
X4-2L | EIGHTH | MIXED |
X5-2L | X5-2, X5-8 | MIXED |
X5-2L | X5-2, X5-8 | FLASH |
X5-2L | EIGHTH | MIXED |
X5-2L | EIGHTH | FLASH |
X6-2L | X6-2, X6-8 | MIXED |
X6-2L | X6-2, X6-8 | FLASH |
X6-2L | EIGHTH | MIXED |
X6-2L | EIGHTH | FLASH |
X7-2L | X7-2, X7-8 | MIXED |
X7-2L | X7-2, X7-8 | FLASH |
X7-2L | EIGHTH | MIXED |
X7-2L | EIGHTH | FLASH |

If the output is not as expected, execute the following command as the "root" userid:
cellcli -e "list celldisk where disktype=flashdisk and status!=normal"
Perform your root cause analysis and corrective actions based upon the key words returned in the "status" field. For additional information, please reference the following:
The "Maintaining Flash Disks" section of "Oracle® Exadata Database Machine, Owner's Guide 11g Release 2 (11.2), E13874-24"
Troubleshooting guide for Sick or underperforming storage cell/Performance Issue (Doc ID 1348736.1)
Troubleshooting guide for Underperforming FlashDisks (Doc ID 1348938.1)


Verify there are no griddisks configured on flash memory devices

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 12/08/15 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, BDA, Exalogic, Exalytics, SSC, ZDLRA | TBD

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8 | 11.2.2.2.0+ | Linux x86-64 | exachk 12.1.0.2.6
Benefit / Impact:
The definition and maintenance of storage server griddisks is critical for optimal performance and outage avoidance.
The impact of verifying the storage server griddisk configuration is minimal. Correcting any abnormalities is dependent upon the reason for the anomaly, so the impact cannot be estimated here.
Risk:
If the storage server griddisk configuration is not verified, poor performance or unexpected outages may occur.
Action / Repair:
To verify there are no storage server griddisks configured on flash memory devices, execute the following command as the "celladmin" user on each storage server:
cellcli -e "list griddisk where disktype=flashdisk" | wc -l
The output should be:
0
If the output is not as expected, investigate the condition and take corrective action based upon the root cause of the unexpected result.
NOTE:
Experience has shown that the Oracle recommended Best Practice of using all available flash device space for Smart Flash Log and Smart Flash Cache provides the highest overall performance benefit with lowest maintenance overhead for an Oracle Exadata Database Machine.

In some very rare cases for certain highly write-intensive applications, there may be some performance benefit to configuring grid disks onto the flash devices for datafile writes only. With the release of the Smart Flash Log feature in 11.2.2.4, redo logs should never be placed on flash grid disks. Smart Flash Log leverages both hard disks and flash devices with intelligent caching to achieve the fastest possible redo write performance, optimizations which are lost if redo logs are simply placed on flash grid disks.

The space available to Smart Flash Cache and Smart Flash Log is reduced by the amount of space allocated to the grid disks deployed on flash devices. The usable space in the flash grid disk group is either half or one-third of the space allocated for grid disks on flash devices, depending on whether the flash grid disks are configured with ASM normal or high redundancy.

If after thorough performance and recovery testing, a customer chooses to deploy grid disks on flash devices, it would be a supported, but not Best Practice, configuration.
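
To make the redundancy arithmetic in the note above concrete, here is a minimal sketch. The function name and the 1200 GB figure are illustrative assumptions; the halves and thirds come from the note itself (ASM normal redundancy keeps two copies, high redundancy keeps three).

```shell
# Estimate usable space in a flash grid disk group.
# $1 = space allocated to flash grid disks, in GB
# $2 = ASM redundancy: normal (2 copies) or high (3 copies)
usable_flash_gb() {
  case "$2" in
    normal) copies=2 ;;
    high)   copies=3 ;;
    *) echo "unknown redundancy: $2" >&2; return 1 ;;
  esac
  awk -v a="$1" -v c="$copies" 'BEGIN { printf "%.1f\n", a / c }'
}
```

For example, usable_flash_gb 1200 normal prints 600.0 and usable_flash_gb 1200 high prints 400.0. In both cases the full 1200 GB is no longer available to Smart Flash Cache and Smart Flash Log.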

Verify griddisk count matches across all storage servers where a given prefix name exists

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 12/06/11 | X2-2(4170), X2-2, X2-8, X4-2 | Linux, Solaris | 11.2.x + | 11.2.x +
Benefit / Impact:
The definition and maintenance of storage server griddisks is critical for optimal performance and outage avoidance.
The impact of verifying the basic storage server griddisk configuration is minimal. Correcting any abnormalities is dependent upon the reason for the anomaly, so the impact cannot be estimated here.
Risk:
If the storage server griddisk configuration as designed is not verified, poor performance or unexpected outages may occur.
Action / Repair:
To verify the storage server griddisk count matches across all storage servers where a given prefix name exists, execute the following command as the "root" userid on the database server from which the onecommand script was executed during initial deployment:

for GD_PREFIX in `dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "cellcli -e list griddisk attributes name" | cut -d" " -f2 | gawk -F "_CD_" '{print $1}' | sort -u`;
do
GD_PREFIX_RESULT=`dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "cellcli -e list griddisk where name like \'$GD_PREFIX\_.*\' | wc -l" | cut -d" " -f2 | sort -u | wc -l`;
if [ $GD_PREFIX_RESULT = 1 ]
then
echo -e "$GD_PREFIX: SUCCESS"
else
echo -e "$GD_PREFIX: FAILURE"
dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "cellcli -e list griddisk where name like \'$GD_PREFIX\_.*\' | wc -l";
fi
done

The output should be similar to:

DATA_SLCC16: SUCCESS
DBFS_DG: SUCCESS
RECO_SLCC16: SUCCESS

If the output is not as expected, investigate the condition and take corrective action based upon the root cause of the unexpected result.

NOTE: On a storage server configured according to Oracle best practices, the total number of griddisks per storage server for a given prefix name (e.g., DATA) should match across all storage servers where the given prefix name exists.

NOTE: Not all storage servers are required to have all prefix names in use. This is possible where for security reasons a customer has segregated the storage servers, is using a data lifecycle management methodology,
or an Oracle Storage Expansion Rack is in use. For example, when an Oracle Storage Expansion Rack is in use for data lifecycle management, those storage servers will likely have griddisks with unique names that
differ from the griddisk names used on the storage servers that contain real time data, yet all griddisks are visible to the same cluster.

NOTE: This command requires that SSH equivalence exists for the "root" userid from the database server upon which it is executed to all storage servers in use by the cluster.
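
The two parsing steps in the command above (deriving prefixes from griddisk names, then checking that every cell reports the same count) can be sketched as standalone helpers. The function names are hypothetical, and the input formats are assumptions: griddisk names of the form PREFIX_CD_nn_cell (or PREFIX_FD_nn_cell), and dcli-style lines of the form "cellname: count".

```shell
# Print the unique griddisk prefixes from griddisk names read on stdin.
griddisk_prefixes() {
  sed -e 's/^[[:space:]]*//' -e 's/_CD_.*$//' -e 's/_FD_.*$//' | grep -v '^$' | sort -u
}

# Read "cellname: count" lines on stdin; return 0 only if every count is the same.
counts_match() {
  distinct=$(awk '{ print $2 }' | sort -u | wc -l)
  [ "$distinct" -eq 1 ]
}
```

Because both helpers read standard input, they can be tried against saved output before being run against live cells.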

Verify griddisk ASM status

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 12/06/11 | X2-2(4170), X2-2, X2-8, X4-2 | Linux, Solaris | 11.2.x + | 11.2.x +
Benefit / Impact:
The definition and maintenance of storage server griddisks is critical for optimal performance and outage avoidance.
The impact of verifying the storage server griddisk configuration is minimal. Correcting any abnormalities is dependent upon the reason for the anomaly, so the impact cannot be estimated here.
Risk:
If the storage server griddisk configuration as designed is not verified, poor performance or unexpected outages may occur.
Action / Repair:
To verify the storage server griddisk ASM status, execute the following command as the "celladmin" user on each storage server:
ASM_STAT_RESLT=`cellcli -e "list griddisk attributes name,status, asmmodestatus,asmdeactivationoutcome" | egrep -v ".*\<active\>.*\<ONLINE\>.*\<Yes\>" | wc -l`;
if [ $ASM_STAT_RESLT = 0 ]
then
echo -e "\nSUCCESS\n"
else
echo -e "\nFAILURE:";
cellcli -e "list griddisk attributes name,status, asmmodestatus,asmdeactivationoutcome" | egrep -v ".*\<active\>.*\<ONLINE\>.*\<Yes\>";
echo -e "\n";
fi;

The output should be:
SUCCESS

If the output is not as expected, investigate the condition and take corrective action based upon the root cause of the unexpected result.

NOTE: On a storage server configured according to Oracle best practices, all griddisks should have "status" of "active", "asmmodestatus" of "ONLINE" and "asmdeactivationoutcome" of "Yes".
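
The command above filters with a single regular expression over the whole line. As a hedged alternative, the sketch below checks each attribute column separately, which makes it explicit which values are expected. The function name is hypothetical, and it assumes the four attributes are printed as whitespace-separated columns in the order name, status, asmmodestatus, asmdeactivationoutcome.

```shell
# Read griddisk attribute lines on stdin and print those that are not
# exactly: status=active, asmmodestatus=ONLINE, asmdeactivationoutcome=Yes.
nonconforming_griddisks() {
  awk 'NF > 0 && !($2 == "active" && $3 == "ONLINE" && $4 == "Yes")'
}
```

Usage: cellcli -e "list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome" | nonconforming_griddisks

No output means every griddisk matched the expected values.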

Verify that griddisks are distributed as expected across celldisks

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 10/11/17 | <Name> | Production | Exadata - Physical, Exadata - Management Domain | ALL | Bug 26651266 - exachk

DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 12.2.0.1.4 | N/A

Benefit / Impact:
The definition and maintenance of storage server griddisks is critical for optimal performance and outage avoidance.
The impact of verifying the storage server griddisk configuration is minimal. Correcting any abnormalities is dependent upon the reason for the anomaly, so the impact cannot be estimated here.
Risk:
If the storage server griddisk configuration as designed is not verified, poor performance or unexpected outages may occur.
Action / Repair:
NOTE: The recommended best practice is to have each griddisk distributed across all celldisks. For older versions of Exadata storage server software and hardware, the griddisks "SYSTEM" or "DBFS_DG" had a slightly different distribution, and the code below correctly accounts for those cases.
To verify that griddisks are distributed as expected across celldisks, execute the following command as the "root" userid on each storage server:
RAW_CELLDISK=$(cellcli -e "list celldisk attributes name" | sed -e 's/^[ \t]*//')
RAW_GRIDDISK=$(cellcli -e "list griddisk attributes name" | sed -e 's/^[ \t]*//')
if [ `echo -e $RAW_CELLDISK | grep CD | wc -l` -ge 1 ]
then
  PARSED_CELLDISK=$(echo -e "$RAW_CELLDISK" | grep CD)
else
  PARSED_CELLDISK=$(echo -e "$RAW_CELLDISK")
fi
CELLDISK_COUNT=$(echo -e "$PARSED_CELLDISK" | wc -l)
if [ `echo -e $RAW_GRIDDISK | grep CD | wc -l` -ge 1 ]
then
  SHORT_GD_NAME_ARRAY=$(echo -e "$RAW_GRIDDISK" | awk -F "_CD_" '{print $1}' | sort -u)
else
  SHORT_GD_NAME_ARRAY=$(echo -e "$RAW_GRIDDISK" | awk -F "_FD_" '{print $1}' | sort -u)
fi
RETURN_RESULT=0
for GD_SHORT_NAME in $SHORT_GD_NAME_ARRAY 
do
  if [[ $GD_SHORT_NAME = "SYSTEM" || $GD_SHORT_NAME = "DBFS_DG" || $GD_SHORT_NAME = "CATALOG"  ]]
  then
    GD_COUNT=$(expr `echo "$RAW_GRIDDISK" | grep $GD_SHORT_NAME | wc -l` + 2)
  else
    GD_COUNT=$(echo "$RAW_GRIDDISK" | grep $GD_SHORT_NAME | wc -l)
  fi
  if [ $GD_COUNT -eq $CELLDISK_COUNT ]
  then 
    :
  else
    OUTPUT_ARRAY+=`echo -e "\n$GD_SHORT_NAME: FAILURE:\tGriddisk count:  $GD_COUNT\tCelldisk count:  $CELLDISK_COUNT"`
    RETURN_RESULT=1
  fi
done
if [ $RETURN_RESULT -eq 0 ]
then
    echo -e "SUCCESS: All griddisks are distributed as expected across celldisks."
else
    echo -e -n "FAILURE: One or more griddisks are not distributed as expected across celldisks. Details:"
    echo -e "${OUTPUT_ARRAY[@]}"
fi
The expected output should be:
SUCCESS: All griddisks are distributed as expected across celldisks.
Example of a "FAILURE" result:
FAILURE: One or more griddisks are not distributed as expected across celldisks. Details:
C_DATA:  FAILURE:       Griddisk count:  7      Celldisk count:  8
If the output is not as expected, investigate the condition and take corrective action based upon the root cause of the unexpected result.

Verify the percent of available celldisk space used by the griddisks

Priority | Alert Level | Date | Owner | Status | Engineered System
Critical | INFO | 11/09/16 | <Name> | Production | Exadata - Physical, Exadata - Management Domain

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8 | 11.2+ | Linux x86-64 | exachk 12.2.0.1.2
Benefit / Impact:
The impact of verifying the percent of available celldisk space used by the griddisks is minimal.
Risk:
If the percent of available celldisk space used by the griddisks is not verified, an unexpected configuration change may be missed.
Action / Repair:
To verify the percent of available celldisk space used by the griddisks, execute the following command set as the "root" userid on each storage server:
ALLFLASHCELL=$(cellcli -e "list cell attributes makemodel"|egrep  -ic 'ALLFLASH|EXTREME_FLASH');
RAW_GRIDDISK_SIZE=$(cellcli -e "list griddisk attributes size");
TOTAL_GRIDDISK_SIZE=$(echo "$RAW_GRIDDISK_SIZE" | sed 's/\s//g'|awk '/G$/ { print $0 } /T$/ { size=substr($0, 1, length($0)-1); size=size*1024; print size "" "G";}'|awk '{ SUM += $1} END { print SUM}');
if [ $ALLFLASHCELL -eq 0 ]
then
  RAW_CELLDISK_SIZE=$(cellcli -e "list celldisk attributes size where disktype=harddisk");
else
  RAW_CELLDISK_SIZE=$(cellcli -e "list celldisk attributes size where disktype=flashdisk");
fi;
TOTAL_CELLDISK_SIZE=$(echo "$RAW_CELLDISK_SIZE" | sed 's/\s//g'|awk '/G$/ { print $0 } /T$/ { size=substr($0, 1, length($0)-1); size=size*1024; print size "" "G";}'| awk '{ SUM += $1} END { print SUM}');
GRIDDISK_CELLDISK_PCT=$(echo $TOTAL_GRIDDISK_SIZE $TOTAL_CELLDISK_SIZE | awk '{ printf("%d", ($1/$2)*100) }');
echo -e "INFO:  The percent of available celldisk space used by the griddisks is: $GRIDDISK_CELLDISK_PCT\nThe total griddisk size found is: $TOTAL_GRIDDISK_SIZE\nThe total celldisk size found is: $TOTAL_CELLDISK_SIZE";
The expected output will be similar to:
INFO:  The percent of available celldisk space used by the griddisks is: 99
The total griddisk size found is: 87818.7
The total celldisk size found is: 87819.3
If the output is not as expected for a given known configuration, investigate and take corrective action based upon the root cause of the unexpected result.

NOTE: On a storage server not in an Oracle Virtual Machine environment configured according to Oracle best practices, the percent utilization will typically be >= 99 for spinning disk and between 94 and 95 for Extreme Flash. The lower percentage of utilization for Extreme Flash is because the griddisks, Flash Log, and Flash Cache are all built on the same flash hardware.

NOTE: In an Oracle Virtual Machine environment, it is not unusual for the percentage of available celldisk space used by the griddisks to be in the mid-60s. This is due in part to the fact that the DBFS griddisk is not created by default, and in part to user requirements to reserve free space for future use. For example:
INFO:  The percent of available celldisk space used by the griddisks is: 63
The total griddisk size found is: 4236
The total celldisk size found is: 6636.06
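
The size arithmetic used above (convert T to G by multiplying by 1024, sum, then take the truncated percentage) can be sketched as two small helpers and checked against the sample figures in this section. The function names are hypothetical, and the input is assumed to be one size per line ending in "G" or "T".

```shell
# Sum sizes read on stdin (one per line, suffixed G or T) and print the total in GB.
sum_sizes_gb() {
  awk '{
         gsub(/[ \t]/, "")                          # drop padding
         unit  = substr($0, length($0), 1)          # trailing G or T
         value = substr($0, 1, length($0) - 1) + 0  # numeric part
         if (unit == "T") value *= 1024             # terabytes to gigabytes
         else if (unit != "G") next                 # ignore anything else
         sum += value
       }
       END { print sum + 0 }'
}

# Print griddisk size as a whole-number percentage of celldisk size (truncated).
# $1 = total griddisk size in GB, $2 = total celldisk size in GB
pct_used() {
  awk -v g="$1" -v c="$2" 'BEGIN { printf "%d\n", (g / c) * 100 }'
}
```

With the sample totals above, pct_used 87818.7 87819.3 prints 99 and pct_used 4236 6636.06 prints 63, matching the example output.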
 

 Verify Database Server ZFS RAID Configuration

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 01/27/12 | X2-2, X2-8, X4-2 | Solaris | 11.2.x + | 11.2.x +
Benefit / Impact:
For a database server running Solaris deployed according to Oracle standards, there will be two ZFS RAID-1 pools, named "rpool" and "data". Each mirror in the pool contains two disk drives. For an X2-2, there is one mirror for each name. For an X2-8, there is one mirror for "rpool" and 3 for "data". Verifying the database server ZFS RAID configuration helps to avoid a possible performance impact, or an outage.
The impact of validating the ZFS RAID configuration is minimal. The impact of corrective actions will vary depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.
Risk:
Not verifying the ZFS RAID configuration increases the chance of a performance degradation or an outage.
Action / Repair:
To verify the database server ZFS RAID configuration, execute the following command as the "root" userid:
/opt/oracle.SupportTools/disks_map.pl | ggrep mirror -A3
The output will be similar to:
------------------- mirror-0 ---------------------
16:5 c1t2d0s0 rpool
16:4 c1t1d0s0 rpool
--------------------------------------------------
------------------- mirror-0 ---------------------
16:6 c1t3d0 data
16:7 c1t4d0 data
--------------------------------------------------
------------------- mirror-2 ---------------------
16:0 c1t5d0 data
16:2 c1t6d0 data
--------------------------------------------------
------------------- mirror-1 ---------------------
16:3 c1t0d0 data
16:1 c1t7d0 data
--------------------------------------------------
For an X2-2, the expected output is one pool named "rpool", and one named "data", each comprised of 1 mirror with 2 disk drives.
For an X2-8, the expected output is one pool named "rpool", comprised of 1 mirror with 2 disk drives, and one pool named "data" comprised of 3 mirrors each with 2 disk drives.
If the reported output differs, investigate and correct the condition.
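
Rather than counting mirrors by eye, the listing can be summarized per pool. The following is a minimal sketch, assuming the three-column "slot device pool" layout shown in the sample output above and two disk drives per mirror, as stated in this section. The function name is hypothetical.

```shell
# Summarize disks and mirrors per ZFS pool from disks_map.pl output on stdin.
# Assumption: disk lines look like "16:5 c1t2d0s0 rpool"; each mirror has 2 disks.
mirrors_per_pool() {
  awk 'NF == 3 && $1 ~ /^[0-9]+:[0-9]+$/ { disks[$3]++ }
       END { for (p in disks) printf "%s disks=%d mirrors=%d\n", p, disks[p], disks[p] / 2 }' | sort
}
```

Usage: /opt/oracle.SupportTools/disks_map.pl | mirrors_per_pool

For the X2-8 sample output above, this reports 6 disks and 3 mirrors for "data", and 2 disks and 1 mirror for "rpool".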

Verify InfiniBand is the Private Network for Oracle Clusterware Communication

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8, X4-2 | Linux | 11.2.x + | 11.2.x +
Benefit / Impact:
The InfiniBand network in an Oracle Exadata Database Machine provides superior performance and throughput characteristics that allow Oracle Clusterware to operate at optimal efficiency.
The overhead for these verification steps is minimal.
Risk:
If the InfiniBand network is not used for Oracle Clusterware communication, performance will be sub-optimal.
Action / Repair:
The InfiniBand network is preconfigured on the storage servers. Perform the following on the database servers:
Verify the InfiniBand network is the private network used for Oracle Clusterware communication with the following command:
$GI_HOME/bin/oifcfg getif -type cluster_interconnect
For X2-2 the output should be similar to:
bondib0 192.168.8.0 global cluster_interconnect
For X2-8 the output should be similar to:
bondib0 192.168.8.0 global cluster_interconnect
bondib1 192.168.8.0 global cluster_interconnect 
bondib2 192.168.8.0 global cluster_interconnect 
bondib3 192.168.8.0 global cluster_interconnect
If the InfiniBand network is not the private network used for Oracle Clusterware communication, configure it following the instructions in MOS Note 283684.1, "How to Modify Private Network Interface in 11.2 Grid Infrastructure".
NOTE: It is important to ensure that your public interface is properly marked as public and not private. This can be checked with the oifcfg getif command. If it is inadvertently marked private, you can get errors such as "OS system dependent operation:bind failed with status" and "OS failure message: Cannot assign requested address".
It can be corrected with a command like oifcfg setif -global eth0/<public IP address>:public
In each database, verify that it is using the private IB interconnect with the following query:
SQL> select name,ip_address from v$cluster_interconnects;
NAME IP_ADDRESS
--------------- ----------------
bondib0 192.168.40.25
Or in the database alert log you can look for the following message:
Cluster communication is configured to use the following interface(s) for this instance
192.168.40.26
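
The oifcfg check earlier in this section is read by eye. As a sketch only, the helper below lists any cluster_interconnect entry whose interface name does not look like an InfiniBand interface. The function name is hypothetical, and the naming pattern (ibN or bondibN) is an assumption based on the interface names shown in this document.

```shell
# Read "oifcfg getif" output on stdin and print cluster_interconnect
# interfaces whose names are not ibN or bondibN.
non_ib_interconnects() {
  awk '$NF == "cluster_interconnect" && $1 !~ /^(bond)?ib[0-9]+$/ { print $1 }'
}
```

Usage: $GI_HOME/bin/oifcfg getif -type cluster_interconnect | non_ib_interconnects

No output means every cluster interconnect entry uses an InfiniBand-style interface name.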

Verify InfiniBand Address Resolution Protocol (ARP) Configuration on Database Servers

Priority | Alert Level | Date | Owner | Status | Engineered System
Critical | FAIL | 7/13/16 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, Exadata - User Domain

DB Version | DB Role | Engineered System Platform | Exadata Version | OS Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8 | 11.2.2.2.0+ | Linux x86-64 | exachk 12.1.0.2.7
Benefit / Impact:
There are specific ARP configurations required for Real Application Clusters (RAC) to work correctly, and they differ between active/passive and active/active configurations.
For an active/passive configuration, the settings for all IB interfaces should be:
  accept_local = 1
  rp_filter = 0
For an active/active configuration, the settings for all IB interfaces should be:
  accept_local = 1
  rp_filter = 0
  arp_announce = 2 (8 socket only!)
  - AND the three single attributes -
  net.ipv4.conf.all.rp_filter = 0
  net.ipv4.conf.default.rp_filter = 0
  net.ipv4.conf.all.accept_local = 1
The impact of verifying the ARP configuration is minimal. Correcting a configuration requires editing "/etc/sysctl.conf" and restarting the interface(s).
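
To make the settings listed above concrete, the sketch below prints the expected sysctl lines for a given mode and set of InfiniBand interfaces, which can then be compared with "sysctl -a" output or with "/etc/sysctl.conf". The function name and the mode keywords (passive, active, active8) are hypothetical; the values are the ones listed above.

```shell
# Print the expected ARP-related sysctl lines.
# $1 = mode: passive (active/passive), active (active/active, 2 socket),
#            active8 (active/active, 8 socket)
# remaining arguments = InfiniBand interface names
expected_arp_settings() {
  mode=$1
  shift
  for ifc in "$@"; do
    echo "net.ipv4.conf.$ifc.accept_local = 1"
    echo "net.ipv4.conf.$ifc.rp_filter = 0"
    if [ "$mode" = "active8" ]; then
      echo "net.ipv4.conf.$ifc.arp_announce = 2"
    fi
  done
  if [ "$mode" = "active" ] || [ "$mode" = "active8" ]; then
    echo "net.ipv4.conf.all.rp_filter = 0"
    echo "net.ipv4.conf.default.rp_filter = 0"
    echo "net.ipv4.conf.all.accept_local = 1"
  fi
}
```

For example, expected_arp_settings passive ib0 ib1 bondib0 prints the accept_local and rp_filter lines for the three interfaces, and expected_arp_settings active ib0 ib1 adds the three "all" and "default" lines.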
Risk:
Incorrect ARP configurations may prevent RAC from starting, or result in dropped packets and inconsistent RAC operation.
Action / Repair:
To verify the InfiniBand interface ARP settings for a database server, use the following command as the "root" userid:
RAW_OUTPUT=$(sysctl -a)
RF_OUTPUT=$(echo "$RAW_OUTPUT" | egrep -i "\.ib|bondib" | egrep -i "\.rp_filter")
AL_OUTPUT=$(echo "$RAW_OUTPUT" | egrep -i "\.ib|bondib" | egrep -i "\.accept_local")
if [ `echo "$RAW_OUTPUT" | grep -i bondib | wc -l` -ge 1 ]
then #active/passive case
  if [[ `echo "$AL_OUTPUT" | cut -d" " -f3 | sort -u | wc -l` -eq 1 && `echo "$AL_OUTPUT" | cut -d" " -f3 | sort -u | head -1` -eq 1 ]]
  then
    AL_RSLT=0 #all AL same value and value is 1
  else 
    AL_RSLT=1
  fi;
  if [[ `echo "$RF_OUTPUT" | cut -d" " -f3 | sort -u | wc -l` -eq 1 && `echo "$RF_OUTPUT" | cut -d" " -f3 | sort -u | head -1` -eq 0 ]]
  then
    RF_RSLT=0 #all RF same value and value is 0
  else 
    RF_RSLT=1
  fi;
  if [[ $AL_RSLT -eq 0 && $RF_RSLT -eq 0 ]]
  then
    echo -e "Success:  The active/passive ARP configuration is as recommended:\n"
  else
    echo -e "Failure:  The active/passive ARP configuration is not as recommended:\n"
  fi;
  echo -e "$AL_OUTPUT\n\n$RF_OUTPUT"
else #active/active case
  NICARF_OUTPUT=$(echo "$RAW_OUTPUT" | egrep -i "net.ipv4.conf.all.rp_filter")
  NICDRF_OUTPUT=$(echo "$RAW_OUTPUT" | egrep -i "net.ipv4.conf.default.rp_filter")
  NICAAL_OUTPUT=$(echo "$RAW_OUTPUT" | egrep -i "net.ipv4.conf.all.accept_local")
  NICARF_RSLT=$(echo "$NICARF_OUTPUT" | cut -d" " -f3)
  NICDRF_RSLT=$(echo "$NICDRF_OUTPUT" | cut -d" " -f3)
  NICAAL_RSLT=$(echo "$NICAAL_OUTPUT" | cut -d" " -f3)
  IB_INTRFCE_CNT=$(echo "$RAW_OUTPUT" | egrep "\.ib.\." | cut -d"." -f4 | sort -u | wc -l)
  if [[ `echo "$AL_OUTPUT" | cut -d" " -f3 | sort -u | wc -l` -eq 1 && `echo "$AL_OUTPUT" | cut -d" " -f3 | sort -u | head -1` -eq 1 ]]
  then
    AL_RSLT=0 #all AL same value and value is 1
  else 
    AL_RSLT=1
  fi;
  if [[ `echo "$RF_OUTPUT" | cut -d" " -f3 | sort -u | wc -l` -eq 1 && `echo "$RF_OUTPUT" | cut -d" " -f3 | sort -u | head -1` -eq 0 ]]
  then
    RF_RSLT=0 #all RF same value and value is 0
  else 
    RF_RSLT=1
  fi;
  if [ $IB_INTRFCE_CNT -eq 2 ] # 2 socket case
  then
    if [[ $AL_RSLT -eq 0 && $RF_RSLT -eq 0 && $NICARF_RSLT -eq 0 && $NICDRF_RSLT -eq 0 && $NICAAL_RSLT -eq 1 ]]
    then
      echo -e "Success:  The active/active ARP configuration is as recommended:\n"
    else
      echo -e "Failure:  The active/active ARP configuration is not as recommended:\n"
    fi;
    echo -e "$AL_OUTPUT\n\n$RF_OUTPUT\n\n$NICARF_OUTPUT\n$NICDRF_OUTPUT\n$NICAAL_OUTPUT"
  else # 8 socket case
    NICIAA_OUTPUT=$(echo "$RAW_OUTPUT" | egrep "\.ib.\." | egrep arp_announce)
    if [[ `echo "$NICIAA_OUTPUT" | cut -d" " -f3 | sort -u | wc -l` -eq 1 && `echo "$NICIAA_OUTPUT" | cut -d" " -f3 | sort -u | head -1` -eq 2 ]]
    then
      NICIAA_RSLT=0 #all arp_announce same value and value is 2
    else
      NICIAA_RSLT=1
    fi;
    if [[ $AL_RSLT -eq 0 && $RF_RSLT -eq 0 && $NICIAA_RSLT -eq 0 && $NICARF_RSLT -eq 0 && $NICDRF_RSLT -eq 0 && $NICAAL_RSLT -eq 1 ]]
    then
      echo -e "Success:  The active/active ARP configuration is as recommended:\n"
    else
      echo -e "Failure:  The active/active ARP configuration is not as recommended:\n"
    fi;
    echo -e "$AL_OUTPUT\n\n$RF_OUTPUT\n\n$NICIAA_OUTPUT\n\n$NICARF_OUTPUT\n$NICDRF_OUTPUT\n$NICAAL_OUTPUT"
  fi;
fi;


The expected output should be similar to:

Success: The active/passive ARP configuration is as recommended:
net.ipv4.conf.ib0.accept_local = 1
net.ipv4.conf.ib1.accept_local = 1
net.ipv4.conf.bondib0.accept_local = 1
net.ipv4.conf.ib0.rp_filter = 0
net.ipv4.conf.ib1.rp_filter = 0
net.ipv4.conf.bondib0.rp_filter = 0
- OR -
Success: The active/active ARP configuration is as recommended:
net.ipv4.conf.ib0.accept_local = 1
net.ipv4.conf.ib1.accept_local = 1
net.ipv4.conf.ib0.rp_filter = 0
net.ipv4.conf.ib1.rp_filter = 0
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.accept_local = 1
- OR -
Success: The active/active ARP configuration is as recommended:
net.ipv4.conf.ib0.accept_local = 1
<output truncated>
net.ipv4.conf.ib7.accept_local = 1
net.ipv4.conf.ib0.rp_filter = 0
<output truncated>
net.ipv4.conf.ib7.rp_filter = 0
net.ipv4.conf.ib0.arp_announce = 2
<output truncated>
net.ipv4.conf.ib7.arp_announce = 2
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.accept_local = 1
If a "Failure: ..." message appears, investigate for root cause, make the necessary edits to "/etc/sysctl.conf", and restart the interface(s).
NOTE: These recommendations are for the InfiniBand interfaces on database servers only! They do not apply to the Ethernet interfaces on the database servers. No changes are permitted on the storage servers.
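
The check above repeats the same test several times: every selected "name = value" line must carry one identical value. As a minimal sketch only, that test could be factored into a helper. The function name all_values_equal is hypothetical and is not part of the original check.

```shell
# Hypothetical helper (an illustration, not part of the original check).
# Succeeds only when every "name = value" line in $1 carries the single
# expected value given in $2.
all_values_equal() {
  # An empty selection means nothing matched, so treat it as a failure.
  [ -n "$1" ] || return 1
  # Field 3 of "net.ipv4.conf.ib0.rp_filter = 0" is the value.
  distinct=$(printf '%s\n' "$1" | cut -d" " -f3 | sort -u)
  # When all values agree, sort -u leaves exactly one line to compare.
  [ "$distinct" = "$2" ]
}
```

With this helper, a test such as all_values_equal "$AL_OUTPUT" 1 would stand in for the paired wc/head comparisons on the accept_local lines.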

Verify Oracle RAC Databases use RDS Protocol over InfiniBand Network.

PriorityAlert LevelDateOwnerStatusEngineered System     Bug(s)        
CriticalFAIL03/01/2017 <Name>ProductionExadata - Physical,
Exadata - User Domain,
SSC
25490898 - exachk
24958292 - exachk
Reference: 23039723
DB VersionDB RoleEngineered System PlatformExadata VersionOS & VersionValidation Tool VersionTBD
11.2.0.2+Primary,
Standby, ASM
X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, SL6AllLinux x86-64,
Sparc Linux
exachk 12.2.0.1.3 
Benefit / Impact:
The RDS protocol over InfiniBand provides superior performance because it avoids additional memory buffering operations when moving data from process memory to the network interface for IO operations.
This includes both IO operations between the Oracle instance and the storage servers, as well as instance to instance block transfers via Cache Fusion.
There is minimal impact to verify that the RDS protocol is in use. Implementing the RDS protocol over InfiniBand requires an outage to relink the Oracle software.
Risk:
If the Oracle RAC databases do not use RDS protocol over the InfiniBand network, IO operations will be sub-optimal.
Action / Repair:
To verify the RDS protocol is in use by a given Oracle instance, set the ORACLE_HOME and LD_LIBRARY_PATH variables properly for the instance and execute the following command as the oracle userid on each database server where the instance is running:
$ORACLE_HOME/bin/skgxpinfo
The output should be:
rds
Note: For Oracle software versions below 11.2.0.2, the skgxpinfo command is not present. For 11.2.0.1, you can copy skgxpinfo from an available 11.2.0.2 environment to the proper path in your 11.2.0.1 environment and execute it against the 11.2.0.1 database home(s) using the provided command.
Note: An alternative check (regardless of Oracle software version) is to scan each instance's alert log (must contain a startup sequence!) for the following line:

Cluster communication is configured to use the following interface(s) for this instance
  192.168.20.21
cluster interconnect IPC version:Oracle RDS/IP (generic)
 
If the instance is not using the RDS protocol over InfiniBand, relink the Oracle binary using the following commands (with variables properly defined for each home being linked):

  • (as oracle) Shutdown any processes using the Oracle binary
  • If and only if relinking the grid infrastructure home, then (as root) GRID_HOME/crs/install/rootcrs.pl -unlock
  • (as oracle) cd $ORACLE_HOME/rdbms/lib
  • (as oracle) make -f ins_rdbms.mk ipc_rds ioracle
  • If and only if relinking the Grid Infrastructure home, then (as root) GRID_HOME/crs/install/rootcrs.pl -patch
Note: Avoid using the relink all command due to various issues. Use the make commands provided.
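
The alert log alternative described in this section can be sketched as a small script. This is an illustration under assumptions: the function names are hypothetical, the caller supplies the alert log path, and only the most recent startup sequence in the log is considered.

```shell
# Sketch: report the IPC protocol from the most recent startup sequence
# recorded in an alert log. Function names are hypothetical.
last_ipc_version() {
  # $1 is the path to an instance alert log; keep only the last matching
  # line, which belongs to the latest instance startup.
  grep "cluster interconnect IPC version" "$1" | tail -1
}

uses_rds() {
  # RDS is in use when the latest IPC line names "Oracle RDS".
  last_ipc_version "$1" | grep -q "Oracle RDS"
}
```

For example, uses_rds /path/to/alert_<SID>.log returns success when the last recorded startup used RDS; the path shown is a placeholder.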

Verify Database and ASM instances use same SPFILE

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | March 2013 | All | All | All | All
Benefit / Impact:
All instances for a particular database or ASM cluster should be using the same spfile. Making changes to databases and ASM instances needs to be done in a reliable and consistent way across all instances.
Risk:
Multiple 'sources of truth' can cause confusion and possibly unintended values being set.
Action / Repair:
Verify what spfile is used across all instances of one particular ASM or database cluster. If multiple spfiles for one database are found, provide a recommendation to consolidate them into one.
Scope includes all machine types, os types and db versions
SQL> select name, value from gv$parameter where name = 'spfile';

NAME                           VALUE
------------------------------ ------------------------------------------------------------
spfile                         +DATA/racone/spfileracone.ora

The value for pfile should be empty:
SQL> select name, value from gv$parameter where name = 'pfile';
no rows selected
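
Because gv$parameter returns one row per instance, the first query can show more than one spfile value. As a rough sketch, the number of distinct values can be counted from saved query output. The function name is hypothetical, and the input is assumed to be the "spfile  <value>" rows shown above, one per instance.

```shell
# Sketch: count distinct spfile values in saved gv$parameter output.
# Assumes input rows of the form "spfile  <value>" on stdin; header and
# separator lines are ignored because their first field is not "spfile".
count_distinct_spfiles() {
  awk '$1 == "spfile" && !seen[$2]++ { n++ } END { print n + 0 }'
}
```

A result of 1 indicates that all instances share one spfile; a higher number indicates that consolidation is needed.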

Verify Berkeley Database location for Cloned GI homes

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | March 2013 | X2-2(4170), X2-2, X2-8, X4-2 | Linux, Solaris | 11.2.x + | 11.2.x +
Benefit / Impact:
After cloning a Grid Infrastructure home, the Berkeley Database configuration file ($GI_HOME/crf/admin/crf<node>.ora) in the new home should not point to the previous GI home from which it was cloned. During previous patch set updates, Berkeley Database configuration files were found still pointing to the 'before' (previously cloned from) home. This was caused by an invalid cloning procedure in which the Berkeley Database location of the 'new home' was not updated during the out-of-place bundle patching procedure.
Risk:
Berkeley Database configurations that still point to the old GI home will cause GI upgrades to 11.2.0.3 to fail. Error messages appear in the $GRID_HOME/log/crflogd/crflogdOUT.log logfile.
Action / Repair:
Detect:
# cat $GI_HOME/crf/admin/crf`hostname -s`.ora | grep CRFHOME | grep $GI_HOME | wc -l 

# cat $GI_HOME/crf/admin/crf`hostname -s`.ora | grep BDBLOC | egrep "default|$GI_HOME" | wc -l

For each of the above commands, if '1' is not returned, the CRFHOME or BDBLOC entry in the crf<node>.ora file has the wrong reference to the GI_HOME.
To solve this, manually edit $GI_HOME/crf/admin/crf<node>.ora in the cloned Grid Infrastructure home and change the values for BDBLOC and CRFHOME so that neither points to the previous GI_HOME and both point to the current home. The same change needs to be done on all nodes in the cluster.
It is recommended to set BDBLOC to "default". This needs to be done prior to the upgrade.
Reference: 1485970.1 / 14168708
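
The two detection commands can be combined into a single pass/fail check. The following is a sketch only: the function name check_crf_file is hypothetical, and it applies the same grep logic as the commands above to a file and GI home supplied by the caller.

```shell
# Sketch of the two detection commands wrapped in one function.
# $1 is the crf<node>.ora file to inspect, $2 is the current GI home.
check_crf_file() {
  rc=0
  # CRFHOME must reference the current Grid Infrastructure home.
  if ! grep CRFHOME "$1" | grep -qF "$2"; then
    echo "FAILURE: CRFHOME does not reference $2"
    rc=1
  fi
  # BDBLOC must be "default" or a location under the current home.
  if ! grep BDBLOC "$1" | grep -qF -e default -e "$2"; then
    echo "FAILURE: BDBLOC does not reference default or $2"
    rc=1
  fi
  [ "$rc" -eq 0 ] && echo "SUCCESS: CRFHOME and BDBLOC reference the current home"
  return "$rc"
}
```

A possible invocation is check_crf_file "$GI_HOME/crf/admin/crf$(hostname -s).ora" "$GI_HOME", repeated on each node in the cluster.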


Configure Storage Server alerts to be sent via email

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8, X4-2 | Linux | 11.2.x + | 11.2.x +
Benefit / Impact:
Oracle Exadata Storage Servers can send various levels of alerts and clear messages via email, SNMP, or both. Sending these messages via email at a minimum helps to ensure that a problem is detected and corrected.
There is little impact to storage server operation to send these messages via email.
Risk:
If the storage servers are not configured to send alerts and clear messages via email at a minimum, there is an increased risk of a problem not being detected in a timely manner.
Action / Repair:
Use the following cellcli command to validate the email configuration by sending a test email:
alter cell validate mail; 
The output will be similar to:
Cell slcc09cel01 successfully altered 
If the output is not successful, configure a storage server to send email alerts using the following cellcli command (tailored to your environment):
ALTER CELL smtpServer='mailserver.maildomain.com', -
smtpFromAddr='firstname.lastname@maildomain.com', -
smtpToAddr='firstname.lastname@maildomain.com', -
smtpFrom='Exadata cell', -
smtpPort='<port for mail server>', -
smtpUseSSL='TRUE', -
notificationPolicy='critical,warning,clear', -
notificationMethod='mail';

NOTE: The recommended best practice to monitor an Oracle Exadata Database Machine is with Oracle Enterprise Manager (OEM) and the suite of OEM plugins developed for the Oracle Exadata Database Machine.
Please reference My Oracle Support (MOS) Note 1110675.1 for details.

Configure NTP and Timezone on the InfiniBand switches

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8, X4-2 | Linux | 11.2.x + | 11.2.x +
Benefit / Impact:
Synchronized timestamps are important to switch operation and message logging, both within an InfiniBand switch and between the InfiniBand switches. There is little impact to correctly configure the switches.
Risk:
If the InfiniBand switches are not correctly configured, there is a risk of improper operation and disjoint message timestamping.
Action / Repair:
The InfiniBand switches should be properly configured during the initial deployment process. If for some reason they were not, please consult the "Configuring Sun Datacenter InfiniBand Switch 36 Switch" section of the "Oracle® Exadata Database Machine Owner's Guide, 11g Release 2 (11.2)".

Configure NTP slew_always settings as SMF property for Solaris

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2, X2-8, X4-2 | Solaris | 11.2.x + | 11.2.0.2 +
Benefit / Impact:
Configuring NTP slew settings as an SMF property ensures that time is managed consistently on all systems, which prevents timing issues that may impact availability. It also helps in problem analysis and prevents error messages in the system log about an incorrect NTP setting.
Risk:
Not having a working NTP configuration using SMF will result in different time settings on the nodes. This may impact stability and makes problem analysis difficult.
"syntax error in /etc/inet/ntp.conf line 95, ignored"
Action / Repair:
As a best practice the ntp configuration setting slew_always should be configured as an SMF setting. After setting slew_always in SMF the other setting 'disable pll' is not required anymore.
On Solaris 11 Express and Solaris 11, neither setting should exist in ntp.conf.
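
As an illustration of the ntp.conf part of this recommendation, the sketch below lists any uncommented slew-related directives still present in a given file. The function name is hypothetical, and the directive spellings matched ("slewalways" and "disable pll") are assumptions about how the settings appear in the file.

```shell
# Sketch: list uncommented slew-related directives left in an ntp.conf file.
# The directive spellings matched here ("slewalways", "disable pll") are
# assumptions about how the settings appear in the file.
legacy_slew_lines() {
  # $1 is the path to the ntp.conf file; matching lines are printed with
  # their line numbers, and the exit status is non-zero when none are found.
  grep -nE '^[[:space:]]*(slewalways|disable[[:space:]]+pll)' "$1"
}
```

Running legacy_slew_lines /etc/inet/ntp.conf and getting no output suggests that neither directive remains in the file.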

Enable Xeon Turbo Boost

PriorityAlert LevelDateOwnerStatusScopeBug(s)
ImportantWARN2014-Jan-17<Name>ProductionExadata17898503
DB VersionDB RoleEngineered SystemExadata VersionOS & VersionValidation Tool VersionTBD
N/AN/AX4-2, X4-811.2.3.3.0+Allexachk TBD 
Benefit / Impact:
Xeon Turbo Boost automatically allows processor cores to run faster than their rated frequency if operating below power, current, and temperature specification limits, which may result in better performance for some applications. Turbo Boost is supported on X4 systems only.
Action / Repair:
Verify your system is using X4-based hardware using the dmidecode command:
# dmidecode -s system-product-name
The output on an X4-based database server is "SUN SERVER X4-2". The output on an X4-based storage server is "SUN SERVER X4-2L".
Verify Turbo Boost is enabled on X4 database and storage servers using the following command:
# ubiosconfig export all -E | fgrep Turbo_Mode
Turbo Boost is enabled if the output is the following:
 <Turbo_Mode>Enabled</Turbo_Mode>
Turbo Boost is disabled if the output is the following:
 <Turbo_Mode>Disabled</Turbo_Mode> 
If Turbo Boost is disabled, then enable it (on X4 systems only) by following the instructions in MOS Document 1487339.1, Issue 1.6 - Enable the Xeon Turbo Boost mode for X4 storage and database servers.
NOTE: Although it is possible to enable Turbo Boost on X3-based Exadata hardware, it is not supported.
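
For scripted checks, the Turbo_Mode value can be extracted from the exported configuration text rather than read by eye. This is a minimal sketch with a hypothetical function name; it assumes the <Turbo_Mode> element appears on a single line, as in the output shown above.

```shell
# Sketch: print the value between the <Turbo_Mode> tags read from stdin.
# Assumes the element sits on one line, as in the sample output above.
turbo_mode_state() {
  sed -n 's:.*<Turbo_Mode>\(.*\)</Turbo_Mode>.*:\1:p' | head -1
}
```

One possible use is piping the ubiosconfig command from this section into turbo_mode_state and comparing the result with "Enabled".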

Verify NUMA Configuration

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8, X4-2 | Linux | 11.2.x + | 11.2.x +
Benefit / Impact:
X2-2 Database servers in Oracle Exadata Database Machine by default are booted with operating system NUMA support enabled. Commands that manipulate large files without using direct I/O on ext3 file systems will cause low memory conditions on the NUMA node (Xeon 5500 processor) currently running the process.
By turning NUMA off, a potential local node low memory condition and subsequent performance drop is avoided.
X2-8 Database servers should have NUMA on.
The impact of turning NUMA off is minimal.
Risk:
Once local node memory is depleted, system performance as a whole will be severely impacted.
Action / Repair:
Follow the instructions in MOS Note 1053332.1 to turn NUMA off in the kernel for database servers.
NOTE: NUMA is configured to be off in the storage servers and should not be changed.
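
A quick way to see whether NUMA has been turned off for the running kernel is to inspect the kernel command line. The sketch below assumes that NUMA is disabled through a numa=off boot parameter; that is an assumption here, and MOS Note 1053332.1 remains the authoritative procedure. The function name is hypothetical.

```shell
# Sketch: succeed when a kernel command line contains "numa=off".
# Assumption: NUMA is turned off through the numa=off boot parameter.
numa_disabled_on_cmdline() {
  # $1 is a kernel command line, for example the contents of /proc/cmdline.
  # Padding with spaces makes the match apply to whole parameters only.
  case " $1 " in
    *" numa=off "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

On a Linux database server this could be called as numa_disabled_on_cmdline "$(cat /proc/cmdline)".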

Verify Exadata Smart Flash Log is Created


PriorityAlert LevelDateOwnerStatusScope
CriticalFAIL03/05/2013<Name> ProductionExadata, SSC, Exalogic
DB VersionDB RoleEngineered SystemExadata VersionOS & VersionValidation Tool Version
N/AN/AX2-2(4170), X2-2, X2-8, X3-2, EIGHTH, X3-8, X4-211.2.2.2.0+Solaris - 11
Linux x86-64 UEK5.8
exachk 2.2.1
Benefit / Impact:
When created, Exadata Smart Flash Log uses 512MB of flash memory per storage server by default to help minimize redo log write latency.

The impact of verifying that Exadata Smart Flash Log is created is minimal.
Risk:
Without Exadata Smart Flash Log, the LGWR process may be delayed causing longer "log file parallel write" and "log file sync" waits.
Action / Repair:
To verify that Exadata Smart Flash Log is created, execute the following cellcli command as the "celladmin" user on each storage server:
list flashlog attributes size,status
The output should be similar to:
 512M normal
If the size is not as expected, Exadata Smart Flash Log may not be created, or there may be a hardware issue, or there may be a configuration issue.
It is extremely important that the root cause for the size not being as expected is understood before attempting corrective action. Because Smart Flash Log and Smart Flash Cache share the same physical memory structure on the storage servers, both are likely to be impacted by hardware failures, for example. Corrective action is also impacted by whether or not Write Back Flash Cache is in use, and solutions for the same root cause may vary if Write Back Flash Cache is in use.
After determining the root cause, refer to the Database Machine Owner's Guide and the Exadata Software User's Guide for the appropriate corrective action steps.
Because they share the same storage server physical flash memory, there is a space usage relationship between Exadata Smart Flash Log and Exadata Smart Flash Cache. Exadata Smart Flash Log should be created before Exadata Smart Flash Cache, because the default configuration for Exadata Smart Flash Cache will use all available storage server flash memory. If Exadata Smart Flash Cache already exists, a subsequent attempt to create Exadata Smart Flash Log will fail because all the available storage server flash memory is in use.

NOTE: Exadata Smart Flash Log is created by default with Exadata Storage Server Software version 11.2.2.4.0 and above.
NOTE: Exadata Smart Flash Log will be used by Oracle software 11.2.0.2 Bundle Patch 9 (or higher) or 11.2.0.3.0. The recommended Oracle software version levels are 11.2.0.2 Bundle Patch 11 (or higher) or 11.2.0.3 Bundle Patch 1 (or higher).
NOTE: The default Exadata Smart Flash Log size of 512MB is the recommended value.
NOTE: See also "Configure Storage Server Flash Memory as Exadata Smart Flash Cache"
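
The flash log verification can be expressed as a pass/fail check in the style of the other checks in this document. This is a sketch with a hypothetical function name; it takes one line of "list flashlog attributes size,status" output as its argument and compares it with the default size of 512M and a status of "normal".

```shell
# Sketch: evaluate one output line of "list flashlog attributes size,status".
check_flashlog() {
  # Split the line on whitespace: first field is size, second is status.
  size=$(printf '%s\n' "$1" | awk '{ print $1 }')
  status=$(printf '%s\n' "$1" | awk '{ print $2 }')
  if [ "$size" = "512M" ] && [ "$status" = "normal" ]; then
    echo "SUCCESS: Exadata Smart Flash Log is $size and $status"
  else
    echo "FAILURE: Exadata Smart Flash Log reports: $1"
    return 1
  fi
}
```

For example, check_flashlog "$(cellcli -e 'list flashlog attributes size,status')" could be run on each storage server.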

Verify Exadata Smart Flash Cache is Created

PriorityAlert LevelDateOwnerStatusScopeBug(s)
CriticalFAILupdated 10/11/17<Name>ProductionExadata - Physical,
Exadata - Management Domain,
SSC, Exalogic
<26637216>- exachk
<24514430>- exachk
<23063691>- exachk
<22344656>- exachk
<18691846>- exachk 
DB VersionDB RoleEngineered SystemExadata VersionOS & VersionValidation Tool VersionTBD
N/AN/AALL11.2.3.2.0+
Linux x86-64
exachk 12.2.0.1.4 
Benefit / Impact:
For the vast majority of situations, maximum performance is achieved by configuring the storage server flash memory as cache, allowing the Exadata software to determine the content of the cache.
The impact of configuring storage server flash memory as cache at initial deployment is minimal. If there are already grid disks configured in the flash memory, consideration must be given as to the relocation of the data when converting the flash memory back to cache.
Risk:
Not configuring the storage server flash memory as cache may result in a degradation of overall performance.
Action / Repair:
To confirm all storage server flash memory is configured as smart flash cache, execute the command shown below:
cellcli -e "list flashcache detail" | grep size
The output will be similar to:
 size: 5.82122802734375T
Starting with Exadata software version 11.2.3.2.0, for an environment deployed according to Oracle standards, with the storage server "flashlog" feature in use at the default size of 512M, the size of the storage server "flashcache" should match one of the entries in this table:

Smart Flash Cache Expected Size Table
System Description | Common Name | Cache Size with Smart Flash Log | Cache Size without Smart Flash Log | Cache Size with flashCacheCompression and Smart Flash Log | Cache Size with flashCacheCompression and no Smart Flash Log
X4275 | X2-2(4170) | 0.356201171875T (364.75G) | 0.356689453125T (365.25G) | FCC not available on this hardware | FCC not available on this hardware
X4270 M2 | X2-2, X2-8 | 0.356201171875T (364.75G) | 0.356689453125T (365.25G) | FCC not available on this hardware | FCC not available on this hardware
X4270 M3 | X3-2, X3-8 | 1.453857421875T (1488.75G) | 1.454345703125T (1489.25G) | 2.908935546875T (2978.75G) | 2.909423828125T (2979.25G)
X4270 M3 | EIGHTH | 0.7266845703125T (744.125G) | 0.7271728515625T (744.625G) | 1.4542236328125T (1489.125G) | 1.4547119140625T (1489.625G)
X4-2L | X4-2 | 2.908935546875T (2978.75G) | 2.909423828125T (2979.25G) | 5.8193359375T (5959G) | 5.81982421875T (5959.5G)
X4-2L | EIGHTH | 1.4542236328125T (1489.125G) | 1.4547119140625T (1489.625G) | 2.909423828125T (2979.25G) | 2.909912109375T (2979.75G)
X5-2L | X5-2 | 5.82122802734375T | 5.82171630859375T | FCC not available on this hardware | FCC not available on this hardware
X5-2L | EIGHTH | 2.910369873046875T | 2.910858154296875T | FCC not available on this hardware | FCC not available on this hardware
X6-2L | X6-2, X6-8 | 11.64312744140625T | 11.64361572265625T | FCC not available on this hardware | FCC not available on this hardware
X6-2L | EIGHTH | 5.821319580078125T | 5.821807861328125T | FCC not available on this hardware | FCC not available on this hardware
X7-2L | X7-2 | 23.28692626953125T | 23.28741455078125T | FCC-NA | FCC-NA
X7-2L (all flash) | X7-2 | 2.3287353515625T | 2.3287353515625T | FCC-NA | FCC-NA
If the size is not as expected, some of the storage server flash memory may be configured as grid disks, or there may be a hardware issue, or there may be a configuration issue.
It is extremely important that the root cause for the size not being as expected is understood before attempting corrective action. Because Smart Flash Log and Smart Flash Cache share the same physical memory structure on the storage servers, both are likely to be impacted by hardware failures, for example. Corrective action is also impacted by whether or not Write Back Flash Cache is in use, and solutions for the same root cause may vary if Write Back Flash Cache is in use.
After determining the root cause, refer to the Database Machine Owner's Guide and the Exadata Software User's Guide for the appropriate corrective action steps.
NOTE: While not configuring the Exadata Smart Flash Log is permitted, it is recommended that the Exadata Smart Flash Log be configured. If a decision is made not to create the Exadata Smart Flash Log, the expected size for the Smart Flash Cache is shown in column "Cache Size without Smart Flash Log" and "Cache Size with flashCacheCompression and no Smart Flash Log".
NOTE: On storage servers that use only flash memory devices (no spinning disks), the Exadata Smart Flash Cache size is the same whether or not Exadata Smart Flash Log is created. Therefore, the order in which Exadata Smart Flash Log and Exadata Smart Flash Cache are created does not matter.
NOTE: See also "Verify Exadata Smart Flash Log is Created".
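
The expected size table lists some sizes in both T and G units. As a worked illustration of how the two relate (1T = 1024G), the sketch below converts a reported size to G so that values can be compared in one unit. The function name is hypothetical.

```shell
# Sketch: convert a cellcli size such as "0.356201171875T" or "364.75G"
# to a plain number of G, using 1T = 1024G.
flashcache_size_gib() {
  case "$1" in
    *T) awk -v v="${1%T}" 'BEGIN { printf "%.10g\n", v * 1024 }' ;;
    *G) awk -v v="${1%G}" 'BEGIN { printf "%.10g\n", v + 0 }' ;;
    *)  return 1 ;;   # unit not handled by this sketch
  esac
}
```

For the X2-2 entry, 0.356201171875T converts to 364.75G with Smart Flash Log and 0.356689453125T converts to 365.25G without it; the 0.5G difference corresponds to the default 512M flash log.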

Verify Exadata Smart Flash Cache status is "normal"

PriorityAlert LevelDateOwnerStatusEngineered System
CriticalFAIL10/13/15<Name> ProductionExadata-Physical,
Exadata-Management Domain,
SSC, ZDLRA
DB VersionDB RoleEngineered System PlatformExadata VersionOS & VersionValidation Tool Version
N/AN/AX2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X5-2,X5-8ALLLinux x86-64 el5uek,
Linux x86-64 el6uek
exachk 12.1.0.2.5

Benefit / Impact:
Verifying that the Exadata Smart Flash Cache status is "normal" helps to avoid a performance degradation.
The impact of verifying that the Exadata Smart Flash Cache status is "normal" is minimal. The impact of restoring the Exadata Smart Flash Cache status to "normal" varies, depending upon the reason for the abnormality, and cannot be estimated here.
Risk:
If the Exadata Smart Flash Cache status is not "normal", a performance degradation is likely.
Action / Repair:
To verify that the Exadata Smart Flash Cache status is "normal", as the root userid on each storage server, execute the following command set:
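unset CACHE_STATE;
# Trim surrounding whitespace so the comparison sees only the status text.
CACHE_STATE=$(cellcli -e "list flashcache attributes status" | sed -e 's/^[ \t]*//;s/[ \t]*$//');
# Quote the variable so a multi-word status such as "normal - flushed" compares cleanly.
if [ "$CACHE_STATE" = "normal" ]
then
echo -e "SUCCESS: the Exadata Smart Flash Cache state is: $CACHE_STATE";
else
echo -e "FAILURE: the Exadata Smart Flash Cache state is: $CACHE_STATE";
fi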
The expected output is:
SUCCESS: the Exadata Smart Flash Cache state is: normal
If the output is not as expected, investigate for root cause and correct the discovered cause.
NOTE: If the word "degraded" appears in the output, investigate the hardware condition as a memory module may have failed.
NOTE: If the word "flushed" appears in the output, a cache flush command was issued and was not subsequently cancelled. For example:
FAILURE: the Exadata Smart Flash Cache state is: normal - flushed
In this condition, the Exadata Smart Flash Cache is not in use for cache operations of any type!
To cancel a flash cache flush operation, as the root userid on the storage server with the issue, execute the following command:
cellcli -e "alter flashcache all cancel flush"
The output should be:
Flash cache randomcel05_FLASHCACHE altered successfully

Verify Master (Rack) Serial Number is Set

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 03/02/11 | X2-2(4170), X2-2, X2-8, X4-2 | Linux | 11.2.x + | 11.2.x +

Benefit/Impact
Setting the Master Serial Number (MSN) (aka Rack Serial Number) assists Oracle Support Services to resolve entitlement issues which may arise. The MSN is listed on a label on the front and the rear of the chassis but is not electronically readable unless this value is set.
The impact to set the MSN is minimal.
Risk
Not having the MSN set for the system may hinder entitlement when opening Service Requests.
Action/Repair
Use the following command as the "root" userid to verify that all the MSNs are set correctly and match on all servers:
ipmitool sunoem cli "show /SP system_identifier" | grep "system_identifier ="
The output should resemble one of the following:
For X2-2(4170):
system_identifier = Sun Oracle Database Machine xxxxAKyyyy
For X2-2:
system_identifier = Exadata Database Machine X2-2 xxxxAKyyyy
For X2-8:
system_identifier = Exadata Database Machine X2-8 xxxxAKyyyy
(MSNs have one of two formats: 4 numbers, the letters 'AK', followed by 4 more numbers or letters A-F; or the letters 'AK' followed by 8 numbers or letters A-F.)
On any server where the MSN is not set correctly, use the following command as the "root" userid to set it:
ipmitool sunoem cli 'set /SP system_identifier="text_identifier_string serial_number"'
Where "text_identifier_string" is one of:
For X2-2(4170): "Sun Oracle Database Machine"
For X2-2: "Exadata Database Machine X2-2"
For X2-8: "Exadata Database Machine X2-8"
and "serial_number" is the MSN from the label attached to the rack.

NOTE: The label with the Master Serial Number is located on the top left side wall (viewed from rear) inside the rack on the rear of the chassis.
NOTE: In the command to set the Master Serial Number there is a space between the "text_identifier_string" and the "serial_number".
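
The MSN format described above lends itself to a simple pattern check before a value is written to the service processor. This is a sketch only; the function name valid_msn is hypothetical, and the pattern encodes just the two formats stated in this section.

```shell
# Sketch: validate an MSN against the two formats described above:
#   4 digits, "AK", then 4 characters from 0-9 or A-F
#   "AK" followed by 8 characters from 0-9 or A-F
valid_msn() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{4}AK[0-9A-F]{4}|AK[0-9A-F]{8})$'
}
```

For example, valid_msn 1234AK5678 succeeds, while a value with a missing or misplaced 'AK' is rejected.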

Verify Management Network Interface (eth0) is on a Separate Subnet

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 03/02/11 | X2-2(4170), X2-2, X2-8, X4-2 | Linux | 11.2.x + | 11.2.x +
Benefit/Impact:
It is a requirement that the management network be on a different, non-overlapping subnet than the InfiniBand network and the client access network. This is necessary for better network security, better client access bandwidths, and for Auto Service Request (ASR) to work correctly.
The management network consists of the eth0 network interface in the database and storage servers, the ILOM network interfaces of the database and storage servers, and the Ethernet management interfaces of the InfiniBand switches and PDUs.
Risk:
Having the management network on the same subnet as the client access network will reduce network security, potentially restrict the client access bandwidth to/from the Database Machine to a single 1GbE link, and will prevent ASR from working correctly.
Action/Repair:
To verify that the management network interface (eth0) is on a separate network from other network interfaces, execute the following command as the "root" userid on both storage and database servers:
grep -i network /etc/sysconfig/network-scripts/ifcfg* | cut -f5 -d"/" | grep -v "#"
The output will be similar to:
ifcfg-bondeth0:NETWORK=10.204.77.0
ifcfg-bondib0:NETWORK=192.168.76.0
ifcfg-eth0:NETWORK=10.204.78.0
ifcfg-lo:NETWORK=127.0.0.0

The expected result is that the network values are different. If they are not, investigate and correct the condition.
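
Comparing the NETWORK values by eye becomes error-prone as the number of interfaces grows. As a rough sketch, the output of the command above can be piped through a filter that prints any network address used by more than one interface. The function name is hypothetical, and the filter only detects identical NETWORK values; it does not detect overlapping subnets that use different netmasks.

```shell
# Sketch: read "ifcfg-<name>:NETWORK=<address>" lines on stdin and print
# each network address that is used by more than one interface.
duplicate_networks() {
  # Keep the address after "=", then report values that occur repeatedly.
  cut -d= -f2 | sort | uniq -d
}
```

Empty output means that all listed NETWORK values are different, which is the expected result.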

Verify RAID disk controller CacheVault capacitor condition


PriorityAlert LevelDateOwnerStatusEngineered SystemEngineered System
Platform
Bug(s)
CriticalFAIL08/08/18 ProductionSSC, Exadata - Physical,
Exadata - Management Domain
X5-2, X5-8, X6-2, X6-8, X7-228438875 - exachk
27495768 - exachk
22911250 - exachk
DB VersionDB TypeDB RoleDB ModeExadata VersionOS & VersionValidation Tool VersionMAA Scorecard Section
N/AN/AN/AN/A18.1.0 or higherLinuxexachk 18.4.0N/A

Benefit/Impact:
The CacheVault capacitor loses its ability to support cache over time. Verifying the CacheVault capacitor condition helps to reasonably time proactive replacement.
The impact of verifying the CacheVault capacitor condition is minimal. Replacing the CacheVault will require downtime for the impacted server.
Risk:
A failed CacheVault capacitor will put the RAID controller into WriteThrough mode which significantly impacts write I/O performance.
Action/Repair:
NOTE: This check is not applicable to Extreme Flash Oracle Exadata Storage Servers nor X7-8 Oracle Exadata Database Servers as they contain no conventional disk drives!
Execute the following command as the "root" userid on all storage and database servers:
RAW_OUTPUT=$(/opt/MegaRAID/storcli/storcli64 /c0/cv show all)
if [[ $(echo "$RAW_OUTPUT" | egrep -i "^state" | egrep -ic optimal)  -eq 1 ]]
then
  echo -e "SUCCESS: raid controller CacheVault condition is optimal."
else
  echo -e "FAILURE: raid controller CacheVault condition is not optimal.  Details:\n\n$RAW_OUTPUT"
fi
The expected output should be:
SUCCESS: raid controller CacheVault condition is optimal.
If the output is a "FAILURE" message, upload the detailed information provided into a hardware service request for component replacement.

Verify RAID Disk Controller Battery Condition

PriorityAlert LevelDateOwnerStatusEngineered SystemEngineered System
Platform
Bug(s)
CriticalFAIL08/01/18<Name> ProductionSSC, Exadata - Physical,
Exadata - Management Domain
X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, X7-228280123 - exachk
27502799 - exachk
DB VersionDB TypeDB RoleDB ModeExadata VersionOS & VersionValidation Tool VersionMAA Scorecard Section
N/AN/AN/AN/A18.1.0 or higherLinuxexachk 18.4.0N/A
Benefit/Impact:
Maintaining optimal condition maximizes RAID controller battery life.
The impact of verifying RAID controller battery condition is minimal. The impact of correcting a non-optimal condition varies, and may include a server shutdown to replace batteries.
Risk:
A non-optimal battery condition may place the RAID controller into WriteThrough mode which significantly impacts write I/O performance.
Action/Repair:
NOTE: This check is not applicable to Extreme Flash Oracle Exadata Storage Servers nor X7-8 Oracle Exadata Database Servers as they contain no conventional disk drives!
To verify the RAID controller battery condition, execute the following command as the "root" userid on all database and storage servers:
RAW_OUTPUT=$(/opt/MegaRAID/storcli/storcli64 /c0/bbu show all)
if [[ $(echo "$RAW_OUTPUT" | egrep -i "battery state" | egrep -ic optimal)  -eq 1 ]]
then
  echo -e "SUCCESS: raid controller battery condition is optimal."
else
  echo -e "FAILURE: raid controller battery condition is not optimal.  Details:\n\n$RAW_OUTPUT"
fi
The expected output should be similar to:
SUCCESS: raid controller battery condition is optimal.
If the output is a "FAILURE" message, upload the detailed information provided into a hardware service request for component replacement.
 
Verify Ambient Air Temperature
 Alert LevelDateOwnerStatusEngineered SystemBug(s)
CriticalFail03/16/16<Name> ProductionExadata - Physical,
Exadata - Management Domain
 
DB VersionDB RoleEngineered System PlatformExadata VersionOS & VersionValidation Tool VersionTBD
N/AN/AX2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8AllLinux x86-64exachk 12.1.0.2.7 
Benefit / Impact:
Maintaining ambient air temperature conditions within design specification for an Oracle Exadata Database Machine helps to achieve maximum efficiency and targeted component service lifetimes.
The impact of verifying the ambient air temperature is minimal. The impact of correcting ambient air temperatures outside of design specification range will vary depending upon the root cause of the issue.

Risk:
Ambient air temperatures outside the design specification range affect all components within the chassis of an Oracle Exadata Database Machine, possibly manifesting performance problems and shortened service lifetimes.
Action / Repair:
To verify the ambient air temperature, execute the following command set as the "root" userid on each storage and database server:
unset AMBIENT_TEMP;
AMBIENT_TEMP=$(ipmitool sunoem cli "show /SYS/T_AMB" | grep value | sed -e 's/^[ \t]*//;s/[ \t]*$//' | cut -d" " -f3);
if [[ $(echo "${AMBIENT_TEMP//./}") -ge 5000 && $(echo "${AMBIENT_TEMP//./}") -le 32000 ]]
then
  echo "SUCCESS: Ambient air temperature is within the range of 5 to 32 degrees Centigrade: $AMBIENT_TEMP";
else
  echo -e "FAILURE: Ambient air temperature is outside the range of 5 to 32 degrees Centigrade: $AMBIENT_TEMP";
fi;
The output should be similar to:
SUCCESS: Ambient air temperature is within the range of 5 to 32 degrees Centigrade: 27.250

If the ambient air temperature is not within the recommended range, investigate for root cause and take appropriate corrective action.
NOTE: Since there is no one sensor in the physical rack for overall ambient temperature of the data center air, this check reads the ambient temperature from each storage and database server.
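
The range test above depends on the sensor value always carrying three decimal places (27.250 is compared as 27250). The following is a minimal sketch of the same 5 to 32 degree check done as a numeric comparison instead, assuming awk is available. The function name temp_in_range is illustrative only and is not part of exachk or ipmitool.

```shell
# Hypothetical helper (not part of exachk): returns 0 when $1 is a number
# between 5 and 32 inclusive, 1 when it is out of range, and 2 when the
# reading is empty or not numeric (for example when the sensor query fails).
temp_in_range() {
  local temp="$1"
  local re='^-?[0-9]+([.][0-9]+)?$'
  [[ "$temp" =~ $re ]] || return 2
  # awk compares the value numerically, so the number of decimals is irrelevant.
  awk -v t="$temp" 'BEGIN { exit !(t >= 5 && t <= 32) }'
}
```

It could be called with the value captured by the command set above, for example: temp_in_range "$AMBIENT_TEMP" && echo "SUCCESS".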

Verify Platform Configuration and Initialization Parameters for Consolidation

Platform Consolidation Considerations

Consolidation Parameters Reference Table

Critical, 08/02/11
Benefit / Impact: Experience and testing have shown that certain database initialization parameter settings should use the following formulas for platform consolidation. Using these formulas as recommended helps avoid known problems and maximize performance.
The performance-related settings provide guidance to maintain the highest stability without sacrificing performance. Change the default performance settings only after careful performance evaluation and with a clear understanding of the performance impact.
Risk: If the operating system and database parameters are not set as recommended, a variety of issues may be encountered that can lead to system and database instability.
Action / Repair: To verify the database initialization parameters, use the following guidance:
The following are important platform level considerations in a consolidated environment.
  • Operating System Configuration Recommendations
  • Hugepages, when set, should equal the sum of shared memory from all databases. See MOS Note 401749.1 for precise computations and MOS Note 361323.1
    for a description of Hugepages. Hugepages is generally required if "PageTables" in /proc/meminfo is > 2% of physical memory.
    • Benefits: Memory savings. Configuring Hugepages prevents the paging and swapping that can occur when it is not configured.
    • Tradeoffs: Hugepages must be set correctly and adjusted when another instance is added or dropped or when SGA sizes change.
    • As of 11.2.0.2, to disable hugepages on an instance, set the parameter "use_large_pages=false".
    • Note that as of the onecommand version that supports 11.2.0.2 BP9, hugepages is automatically configured upon deployment. The vm.nr_hugepages value
      may need to be adjusted if an instance's memory parameters are changed after initial deployment.
  • Amount of locked memory - 75% of physical memory
  • Number of Shared Memory Identifiers - set greater than the number of databases
  • Size of Shared Memory Segments - OS setting for max size = 85% of physical memory
  • Number of semaphores - the sum of processes cannot exceed the maximum number of semaphores. On Linux, the maximum can be obtained with cat /proc/sys/kernel/sem | awk '{print $2}'.
    The number of semaphores on the system should not be set so high that running the maximum number of Oracle processes causes performance problems.
  • Number of semaphores in a semaphore set - the number of semaphores in a semaphore set must be at least as high as the largest
    value for the processes parameter in all databases. On Linux, the maximum number of semaphores per set is the first field of /proc/sys/kernel/sem (cat /proc/sys/kernel/sem | awk '{print $1}'); the fourth field (awk '{print $4}') is the maximum number of semaphore sets.
  • Applications with similar SLA requirements are best suited to co-exist in a consolidated environment together. Do not mix mission critical applications with non mission critical applications in the same consolidated environment. Do not mix production and test/dev databases in the same environment.
  • It is possible to "over-subscribe" an application's resource requirements in a consolidated environment as long as the other applications "under-subscribe" at that time. The exception
    to this is mission critical applications. Do not "over-subscribe" in a consolidated environment that contains mission critical applications. Oracle Resource Manager can be used to
    manage varying degrees of IO and CPU requirements within one database and across databases. Within one database, Oracle Resource Manager can also manage parallel query processing.
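
Two of the operating system considerations above lend themselves to a quick check. The following is a minimal sketch, not an Oracle-supplied tool: the function names are hypothetical, the meminfo path is a parameter so that the logic can be tried against a copy of /proc/meminfo, and the list of processes values has to be supplied by the administrator.

```shell
# Hypothetical helper: prints "yes" when PageTables exceeds 2% of MemTotal
# in a meminfo-style file (default /proc/meminfo), otherwise "no".
pagetables_over_2pct() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemTotal:/   { total = $2 }
       /^PageTables:/ { pt = $2 }
       END { if (total > 0 && pt * 100 > total * 2) print "yes"; else print "no" }' "$meminfo"
}

# Hypothetical helper: succeeds when the sum of the "processes" parameters
# (arguments 2..n) does not exceed the system-wide semaphore maximum ($1).
processes_fit_semaphores() {
  local semmns="$1" sum=0 p
  shift
  for p in "$@"; do
    sum=$((sum + p))
  done
  [ "$sum" -le "$semmns" ]
}
```

For example, processes_fit_semaphores "$(awk '{print $2}' /proc/sys/kernel/sem)" 1024 2048 would test two databases whose processes parameters are 1024 and 2048 (illustrative values).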

Consolidation Parameters Reference Table

The performance related recommendations provide guidance to maintain highest stability without sacrificing performance. Changing these performance settings can be done after careful performance
evaluation and clear understanding of the performance impact.
This parameter consolidation health check table is a general reference for environments. This is not a hard prerequisite for a consolidated environment, rather a guideline used to establish
the formulas, maximum values, and notes below. It should suffice for most customers, but if you do not qualify for this formula, the table below can be used as a reference solely for important
parameters that must be considered. These values are per node.

Parameter: Sga_target / Pga_aggregate_target (11.2*); sga_target / pga_aggregate_limit (12.1*)
Formula:
  11.2 OLTP: Sum of all sga_target and pga_aggregate_target for all databases < 75% of physical memory
  11.2 DW/BI: Sum of Sga_target + (pga_aggregate_target x 3) < 75% of physical memory
  12.* (both OLTP and DW/BI): Sum of Sga_target + pga_aggregate_limit < 75% of physical memory
Max: 75% of total memory
Notes: Check aforementioned formula. Exceeding recommended memory usage can potentially cause performance problems. It is important to also ensure that the value computed from the formula is sufficient for the application using the associated database.
Pga_aggregate_target setting does not enforce a maximum PGA usage. For some data warehouse and BI applications, 3 X specified target has been observed. For OLTP applications, the spill over is much less. The 25% room provides insurance from any additional spill over and for non-SGA/PGA memory allocation. Process memory and non-memory allocations can add up to be 1-5 MB/process in some cases. Monitoring application and system memory utilization is required to ensure there's sufficient memory throughout your workload/business cycles. Oracle recommends at least 5% memory free at all times.
In 12c, the new parameter pga_aggregate_limit was introduced. It enforces a maximum PGA usage, so the specified parameter value should be used in calculations. pga_aggregate_limit is derived from pga_aggregate_target and defaults to the greater of 2 GB or 2 times the pga_aggregate_target setting.
DBM Machine Type | Memory Available | Oracle Memory Target
DBM V2 | 72 GB | 54 GB
X2-2 | 96 GB | 60.8 GB can be expanded to 144 GB
X2-8 | 1 TB | 768 GB
X3-2 | 256 GB | 192 GB
X3-8 | 2 TB | 1536 GB
X4-2 | 512 GB | 384 GB
X4-8 | 6 TB | 4608 GB
X5-2 | 1 TB | 768 GB

Parameter: Cpu_count
Formula:
  For mission critical applications: Sum of cpu_count of all databases <= 75% X Total CPUs
  Alternatively:
  For light-weight CPU usage applications, sum(CPU_COUNT) <= 3 X CPUs
  and
  CPU intensive applications, sum(CPU_COUNT) <= Total CPUs
Max: Refer to the formulas in the previous column
Notes: Rules of thumb:
1. Leverage CPU_COUNT and instance caging for platform consolidation (e.g. managing multiple databases within Exadata DBM). They are particularly helpful in preventing processes and jobs from over-consuming target CPU resources.
2. Most light weight applications are idle and consume < 3 CPUs.
3. Large reporting/DW/BI and some OLTP applications ("CPU intensive applications") can easily consume all the CPU so they need to be bounded with instance caging and resource management.
4. For consolidating mission critical applications, recommend not over-subscribing CPU resources to maximize stability and performance consistency.
For additional guidance and precautions, refer to <Doc ID 1362445.1>
Exadata DBM | # Cores | # CPUs
DBM V2 | 8 CPUs | 16 CPUs
X2-2 | 12 CPUs | 24 CPUs
X2-8 | 64 CPUs | 128 CPUs

Parameter: resource_manager_plan
Formula: NA
Max: NA
Notes: Ensure this is enabled. A good starting value is 'default_plan'.

Parameter: processes
Formula: Sum of processes of all databases < max
Max: Number of semaphores on the system
Notes: Check formula. Alert if > max.
Alert if # Active Processes > 4 X CPUs
Sum (all processes for all instances) < 21K

Parameter: Parallel parameters
Formula: Automatic
Notes: Adjusting CPU_COUNT parameter for platform consolidation or resource management will automatically update PARALLEL_MAX_SERVERS and PARALLEL_SERVERS_TARGET parameter values provided these are not explicitly specified in the parameter file.

Parameter: Db_recovery_file_dest_size
Formula: Sum of Db_recovery_file_dest_size <= Fast Recovery Area
Max: Size of Usable Fast Recovery Area
Notes: Check formula; Usable FRA space subtracts the space consumed by other files such as online log files in the case of RECO being the only high redundancy diskgroups
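
The memory and CPU formulas in the reference table can be evaluated with integer arithmetic. The following is a minimal sketch under stated assumptions: the function names and the workload keywords (oltp, dw, limit) are hypothetical, all sizes are passed in MB, and the per-database sums are computed by the caller.

```shell
# Hypothetical helper. Usage: memory_budget_ok <physical_mb> <sum_sga_mb> <sum_pga_mb> <workload>
#   oltp  : 11.2 OLTP,  SGA + PGA target               < 75% of physical memory
#   dw    : 11.2 DW/BI, SGA + (PGA target x 3)         < 75% of physical memory
#   limit : 12.x,       SGA + pga_aggregate_limit      < 75% of physical memory
#           (pass the sum of pga_aggregate_limit as the third argument)
memory_budget_ok() {
  local physical="$1" sga="$2" pga="$3" workload="$4" need
  case "$workload" in
    dw)         need=$((sga + 3 * pga)) ;;
    oltp|limit) need=$((sga + pga)) ;;
    *)          return 2 ;;
  esac
  # need < 0.75 * physical, multiplied through by 4 to stay in integers
  [ $((need * 4)) -lt $((physical * 3)) ]
}

# Hypothetical helper for the mission critical rule:
# sum(cpu_count) <= 75% of total CPUs.
cpu_budget_ok() {
  local total="$1" sum="$2"
  [ $((sum * 4)) -le $((total * 3)) ]
}
```

For a server with 96 GB (98304 MB) of memory, memory_budget_ok 98304 40000 10000 dw tests 40000 + 3 x 10000 = 70000 MB against the 73728 MB ceiling; the SGA and PGA figures here are illustrative only.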


Verify operating system hugepages count satisfies total SGA requirements


Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 10/23/15 | <Name> | Production | Exadata - Physical, Exadata - Management Domain |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2 | ALL | Linux x86-64 el5uek, Linux x86-64 el6uek | exachk 12.1.0.2.6 |

Benefit / Impact:

Properly configuring operating system hugepages on Linux and setting the database initialization parameter "use_large_pages" to "only" results in more efficient use of memory and reduced paging.

The impact of validating that the total current hugepages are greater than or equal to estimated requirements for all currently active SGAs is minimal. The impact of corrective actions will vary depending on the specific configuration, and because the hugepages pool must be contiguous, it is recommended to reboot the database server.

Risk:

The risk of not correctly configuring operating system hugepages in advance of setting the database initialization parameter "use_large_pages" to "only" is that if not enough huge pages are configured, some databases will not start after you have set the parameter.

Action / Repair:

PREREQUISITE: All database instances that are supposed to run concurrently on a database server must be up and running for this check to be accurate.

To verify that the total number of configured hugepages is greater than or equal to the estimated requirements of all currently active SGAs using large pages, as the root user copy the following block of commands to a shell script (e.g., /tmp/hugepages_calculation.sh) and execute it.

#!/bin/bash

# Configured hugepages and hugepage size (kB) as reported by the kernel
TOTAL_HUGEPAGES=$(grep HugePages_Total /proc/meminfo | cut -d":" -f 2 | sed -e 's/^[ \t]*//;s/[ \t]*$//')
HPG_SZ=$(grep Hugepagesize /proc/meminfo | awk '{print $2}')
NUM_PG=0
# Shared memory segments that belong to the management database are excluded
MGMT_PID=$(/usr/bin/pgrep -f mdb_pmon_)
if [ $? -eq 0 ]; then
  MGMT_SEGIDS=$(grep SYSV /proc/${MGMT_PID}/maps | awk '{print $5}' | uniq)
else
  MGMT_PID=0
fi
# One "segment id:bytes" entry per shared memory segment
IPCARR=($(ipcs -m | grep "^0x" | awk '{ print $2":"$5}'))
for SEGIDBYTES in "${IPCARR[@]}"
do
  SEG_ID=${SEGIDBYTES%:*}
  SEG_BYTES=${SEGIDBYTES##*:}
  if [[ $MGMT_PID -eq 0 || ! "$MGMT_SEGIDS" =~ "$SEG_ID" ]]; then
    MIN_PG=$(echo "$SEG_BYTES/($HPG_SZ*1024)" | bc -q)
    if [ $MIN_PG -gt 0 ]; then
      NUM_PG=$(echo "$NUM_PG+$MIN_PG+1" | bc -q)
    fi
  fi
done
if [ $TOTAL_HUGEPAGES -ge $NUM_PG ]
then
  echo -e "\nSUCCESS: Total current hugepages ($TOTAL_HUGEPAGES) are greater than or equal to"
  echo -e " estimated requirements for all currently active SGAs ($NUM_PG).\n"
else
  echo -e "\nFAILURE: Total current hugepages ($TOTAL_HUGEPAGES) should be greater than or equal to"
  echo -e " estimated requirements for all currently active SGAs ($NUM_PG).\n"
fi

The output should be similar to:

SUCCESS:  Total current hugepages (13004) are greater than or equal to         
                 estimated requirements for all currently active SGAs (632).

If the output is not "SUCCESS", investigate and correct the condition.

NOTE: Please refer to My Oracle Support notes MOS 401749.1, 361323.1, and 1392497.1 for additional details on configuring hugepages.

NOTE: If you have not reviewed notes 401749.1, 361323.1, and 1392497.1 and followed their guidance BEFORE using the database parameter "use_large_pages=only", this check will pass the environment but you will still not be able to start instances once the configured pool of operating system hugepages has been consumed by instance startups. If that should happen, you will need to change the "use_large_pages" initialization parameter to one of the other values, restart the instance, and follow the instructions in notes 401749.1 and 361323.1. The brute force alternative is to increase the huge page count until the newest instance will start, and then adjust the huge page count after you can see the estimated requirements for all currently active SGAs.

NOTE: While it is possible to modify the number of hugepages in active memory in the running kernel, it is not recommended for two reasons:
1) The hugepages pool must be contiguous, and it may not be possible to find enough contiguous pages to meet a request in the running kernel active memory.
2) Setting the value in the kernel configuration files and rebooting ensures the expected number of hugepages is properly configured and available. Misconfigurations in this area can impact server availability so following this operational best practice prevents an unexpected outage caused by user error.
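
The per-segment arithmetic used by the hugepages check can be isolated for a worked example. The following is a minimal sketch that assumes the same rule as the script in this section: whole pages per shared memory segment plus one page of headroom, with segments smaller than one huge page ignored. The function name estimate_hugepages is hypothetical.

```shell
# Hypothetical helper. Usage: estimate_hugepages <hugepage_size_kb> <segment_bytes>...
# Prints the estimated number of huge pages for the given segment sizes.
estimate_hugepages() {
  local hpg_kb="$1" total=0 bytes pages
  shift
  for bytes in "$@"; do
    pages=$((bytes / (hpg_kb * 1024)))
    # Segments smaller than one huge page do not count toward the estimate.
    if [ "$pages" -gt 0 ]; then
      total=$((total + pages + 1))
    fi
  done
  echo "$total"
}
```

With a 2048 kB huge page size, a single 4 GiB segment (4294967296 bytes) yields 2048 + 1 = 2049 pages. The MOS notes referenced above remain the authority for the value to configure.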

Verify "MaxStartups 100" in /etc/ssh/sshd_config on all database servers

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 03/21/12 | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux | 11.2.0.3+ | 11.2.0.3 +

Benefit / Impact:
Configuring "MaxStartups 100" helps to avoid the risk of certain cluster operations failing for clusters containing more than 10 database servers.
Cluster operations examples include installing or upgrading the grid infrastructure, and adding a cluster node.
The impact of verifying "MaxStartups 100" is minimal. The impact of correcting the setting is moderate, requiring a restart of the sshd service.
Risk:
With "MaxStartups" configured at the default value (10), certain cluster operations for clusters containing more than 10 database servers may fail.
For example, if the Oracle Universal Installer (OUI) calls the Cluster Verification Utility (CVU) and CVU starts ssh sessions across all nodes
concurrently, the operation fails because more than 10 concurrent ssh connections are required.
Action / Repair:
To verify that "MaxStartups 100" is set in /etc/ssh/sshd_config file, execute the following command as the "root" userid on the node where deploy112.sh was executed:
dcli -g /opt/oracle.SupportTools/onecommand/dbs_group -l root "egrep -i maxstartups /etc/ssh/sshd_config"
The output should be similar to:
randomdb01: MaxStartups 100
<output truncated>
randomdb16: MaxStartups 100
If the output is not as expected, as the root userid on each database server, edit the sshd_config file to include "MaxStartups 100" and restart the ssh service with the "service sshd restart" command.
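
The egrep above also matches commented-out lines, so its output still has to be read by eye. The following is a minimal sketch of a stricter per-server test. It assumes that the first uncommented MaxStartups line is the effective one, and it accepts any leading value of at least 100, which is an assumption beyond the exact "MaxStartups 100" setting this check looks for. The function name is hypothetical.

```shell
# Hypothetical helper: succeeds when the first uncommented MaxStartups line in
# an sshd_config-style file (default /etc/ssh/sshd_config) has a leading value
# of at least 100.
maxstartups_ok() {
  local conf="${1:-/etc/ssh/sshd_config}" value
  value=$(awk 'tolower($1) == "maxstartups" { print $2; exit }' "$conf")
  # A start:rate:full triple is reduced to its first field.
  value="${value%%:*}"
  [[ "$value" =~ ^[0-9]+$ ]] && [ "$value" -ge 100 ]
}
```

A missing or commented-out setting makes the function fail, which corresponds to the default value being in effect.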

Verify all datafiles have "AUTOEXTEND" attribute "ON"

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux, [WIP:VW]Solaris | 11.2.x + | 11.2.x +

Benefit / Impact
The benefit of having "AUTOEXTEND" on is that applications may avoid out of space errors.
The impact of verifying that the "AUTOEXTEND" attribute is "ON" is minimal. The impact of setting "AUTOEXTEND" to "ON" varies depending on whether it is done during database creation, when adding a file to a tablespace, or on an existing file.
Risk

The risk of running out of space in either the tablespace or diskgroup varies by application and cannot be quantified here. A tablespace that runs out
of space will interfere with an application, and a diskgroup running out of space could impact the entire database as well as ASM operations (e.g., rebalance operations).

Action / Repair

To obtain a list of datafiles and tempfiles, with their tablespaces, that are not set to "AUTOEXTEND", enter the following sqlplus command logged into the database as sysdba:
select file_id, file_name, tablespace_name from dba_data_files where autoextensible <>'YES'
union
select file_id, file_name, tablespace_name from dba_temp_files where autoextensible <> 'YES'; 
The output should be:
no rows selected
If any rows are returned, investigate and correct the condition.

NOTE: Configuring "AUTOEXTEND" to "ON" requires comparing space utilization growth projections at the tablespace level to space available in the diskgroups to permit the expected
projected growth while retaining sufficient storage space in reserve to account for ASM rebalance operations that occur either as a result of planned operations or component failure.
The resulting growth targets are implemented with the "MAXSIZE" attribute that should always be used in conjunction with the "AUTOEXTEND" attribute. The "MAXSIZE" settings should
allow for projected growth while minimizing the prospect of depleting a disk group. The "MAXSIZE" settings will vary by customer and a blanket recommendation cannot be given here.

NOTE: When configuring a file for "AUTOEXTEND" to "ON", the size specified for the "NEXT" attribute should cover all disks in the diskgroup to optimize balance. For example,
with a 4MB AU size and 168 disks, the size of the "NEXT" attribute should be a multiple of 672M (4*168).
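
The "NEXT" sizing rule in the note above is a simple product. The following is a minimal sketch of the arithmetic; the function name next_size_mb and the optional multiplier argument are hypothetical.

```shell
# Hypothetical helper. Usage: next_size_mb <au_size_mb> <disk_count> [multiplier]
# Prints a NEXT size in MB that covers every disk in the diskgroup
# multiplier times (default 1).
next_size_mb() {
  local au_mb="$1" disks="$2" multiplier="${3:-1}"
  echo $((au_mb * disks * multiplier))
}
```

For the example in the note, next_size_mb 4 168 prints 672, and next_size_mb 4 168 2 prints 1344.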

Enable portmap service if app requires it

By default, the portmap service is not enabled on the database nodes and it is required for things such as NFS. If needed, enable and start it using the following with dcli across required nodes:
chkconfig --level 345 portmap on
service portmap start

Enable proper services on database nodes to use NFS

In addition to the portmap service previously explained, the nfslock service must also be enabled and running to use NFS on database nodes. Below is a working example, showing the errors that will be encountered with
various utilities if it is not set up correctly. MOS Note 359515.1 can also be referenced.
SQL> create tablespace nfs_test_on_nfs datafile '/shared/dscbbg02/users/user/nfs_test/nfs_test_on_nfs.dbf' size 16M;
create tablespace nfs_test_on_nfs datafile '/shared/dscbbg02/users/user/nfs_test/nfs_test_on_nfs.dbf' size 16M
*
ERROR at line 1:
ORA-01119: error in creating database file
'/shared/dscbbg02/users/user/nfs_test/nfs_test_on_nfs.dbf'
ORA-27086: unable to lock file - already in use
Linux-x86_64 Error: 37: No locks available
Additional information: 10
Elapsed: 00:00:30.08
SQL> create tablespace nfs_test datafile '+D/user/datafile/nfs_test.dbf' size 16M;
Tablespace created.
SQL> create table nfs_test(n not null) tablespace nfs_test as select rownum from dual connect by rownum < 1e5 + 1;
Table created.
SQL> alter tablespace nfs_test read only;
Tablespace altered.
SQL> create directory nfs_test as '/shared/dscbbg02/users/user/nfs_test';
Directory created.
SQL> create table nfs_test_x organization external(type oracle_datapump default directory nfs_test location('nfs_test.dp')) as select * from nfs_test;
create table nfs_test_x organization external(type oracle_datapump default directory nfs_test location('nfs_test.dp')) as select * from nfs_test
*
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEPOPULATE callout
ORA-31641: unable to create dump file
"/shared/dscbbg02/users/user/nfs_test/nfs_test.dp"
ORA-27086: unable to lock file - already in use
Linux-x86_64 Error: 37: No locks available
Additional information: 10
Elapsed: 00:00:31.17
$ expdp userid=scott/tiger parfile=nfs_test.par
Export: Release 11.2.0.1.0 - Production on Wed Jun 2 10:44:51 2010
Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORA-39001: invalid argument value
ORA-39000: bad dump file specification
ORA-31641: unable to create dump file "/shared/dscbbg02/users/user/nfs_test/nfs_test.dmp"
ORA-27086: unable to lock file - already in use
Linux-x86_64 Error: 37: No locks available
Additional information: 10
RMAN works:
$ rman target=/
Recovery Manager: Release 11.2.0.1.0 - Production on Wed Jun 2 10:46:40 2010
Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.
connected to target database: USER (DBID=3710096878)
RMAN> backup as copy datafile '+D/user/datafile/nfs_test.dbf' format '/shared/dscbbg02/users/user/nfs_test/nfs_test.dbf';
Starting backup at 20100602104700
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=204 device type=DISK
channel ORA_DISK_1: starting datafile copy
input datafile file number=00007 name=+D/user/datafile/nfs_test.dbf
output file name=/shared/dscbbg02/users/user/nfs_test/nfs_test.dbf tag=TAG
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:01
Finished backup at 20100602104702
The solution is to ensure that the nfslock service (aka rpc.statd) is running:
# service nfslock status
rpc.statd (pid 10795) is running...
Of course, you'd want to enable the service via chkconfig too.

Be Careful when Combining the InfiniBand Network across Clusters and Database Machines

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
 | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux | 11.2.x + | 11.2.x +
If you want multiple database machines to run as separate environments yet be connected through the InfiniBand network, please be aware of the following items especially when the database machines
were deployed as separate environments.
The cell name, cell disk name, grid disk name, ASM diskgroup name, and ASM failgroup name should be unique to help avoid accidental damage during maintenance operations. For example, do not have
a diskgroup named DATA on both database machines; call them DATA_DM01 and DATA_DM02 instead.

IP Addresses

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
 | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux | 11.2.x + | 11.2.x +
All nodes on the InfiniBand network must have a unique IP address. When an Oracle Database Machine is deployed, the default InfiniBand network is 192.168.10.x and we start with 192.168.10.1.
If you used the default IP address on each Database Machine, you will have duplicate IP addresses. You must modify the IP addresses on one of the machines before re-configuring the InfiniBand Network.
Ensure any additional equipment ordered from Oracle is marked for an Oracle Exadata Database Machine and the hardware engineer is using the correct Multi-rack Cabling when the physical InfiniBand network is modified.
After the hardware engineer has modified the network, ensure that the network is working correctly by running verify-topology and infinicheck. Infinicheck will create load on the system and should not be run when
there is active workload on the system. Note: Infinicheck will need an input file of all IP addresses on the network.
For example, create a temporary file in /tmp that contains all cells for both database machines. Pass this file to the infinicheck command using the -c option, and also pass the -b option:
#cd /opt/oracle.SupportTools/ibdiagtools
#./verify-topology -t fattree
#./infinicheck -c /tmp/combined_cellip.ora -b
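
Building the combined input file and checking for duplicate addresses can be scripted. The following is a minimal sketch that assumes each machine already has a list file with one entry per line; the function names are hypothetical, and the exact entry format expected by infinicheck is not specified here, so the lines are treated as opaque text.

```shell
# Hypothetical helper. Usage: combine_cell_lists <output_file> <list_file>...
# Merges the list files into one file with one unique entry per line,
# skipping blank lines and comment lines.
combine_cell_lists() {
  local out="$1"
  shift
  grep -hv -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$@" | sort -u > "$out"
}

# Hypothetical helper: prints every entry that occurs more than once
# across the given list files (for example a duplicate IP address).
duplicate_addresses() {
  cat "$@" | grep -v -e '^[[:space:]]*$' | sort | uniq -d
}
```

Any entry printed by duplicate_addresses points to an address used on both machines, which must be changed before the InfiniBand networks are combined.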

CELLIP.ORA

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
 | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux, Solaris | 11.2.x + | 11.2.x +
The cellip.ora file in each database node of each cluster should only reference cells in use by that respective cluster.

Set fast_start_mttr_target=300 to optimize run time performance of writes

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
 | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8 | Linux | 11.2.x + | 11.2.x +
The deployment default for fast_start_mttr_target as of 12/22/2010 is 60. To optimize run time performance for write/redo generation intensive workloads, increase fast_start_mttr_target to 300.
This will reduce checkpoint writes from DBWR processes, making more room for LGWR IO. The trade-off is that instance recovery will run longer, so if instance recovery is more important than performance,
then keep fast_start_mttr_target low. Also keep in mind that an application with inadequately sized redo logs will likely not see an effect from this change due to frequent log switches.
Considerations for direct writes in a data warehouse type of application: Even though direct operations aren't using the buffer cache, fast_start_mttr_target is very effective at controlling crash recovery time because
it ensures adequate checkpointing for the few buffers that are resident (ex: undo segment headers). fast_start_mttr_target should be set to the desired RTO (Recovery Time Objective) while still maintaining performance SLAs.

Enable auditd on database servers

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
 | N/A | X2-2(4170), X2-2, X2-8, X4-2 | Linux | 11.2.x + | 11.2.x +
On database servers, when auditing is configured, as is done automatically by applying convenience pack 11.2.2.2.0 or higher, the audit records are logged in /var/log/messages if the auditd service is not running.
By logging these messages to /var/log/messages, it may cause more frequent rotation of the messages file which may result in losing historical data more quickly than necessary or desired. By enabling auditd, audit records
are sent to /var/log/audit/audit.log which is rotated and managed separately using settings in /etc/audit/auditd.conf.

The best practice is to run the auditd service whenever auditing is configured during kernel bootup by setting audit=1 on the kernel line in /boot/grub/grub.conf, as shown here:

title Trying_LABEL_DBSYS
root (hd0,0)
kernel /vmlinuz-2.6.18-194.3.1.0.2.el5 root=LABEL=DBSYS ro bootarea=dbsys loglevel=7 panic=60 debug rhgb audit=1 numa=off console=ttyS0,115200n8 console=tty1 crashkernel=128M@16M
initrd /initrd-2.6.18-194.3.1.0.2.el5.img

To configure auditd to be enabled, run the following commands as root on each database server:
chkconfig auditd on
chkconfig --list auditd
auditd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
service auditd start
service auditd status
auditd (pid 32582) is running...
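
Whether every boot entry carries audit=1 can be checked before a reboot. The following is a minimal sketch that assumes a legacy grub.conf layout like the one shown above, with one kernel line per boot entry; the function name is hypothetical.

```shell
# Hypothetical helper: prints the kernel lines of a grub.conf-style file
# (default /boot/grub/grub.conf) that lack audit=1. Succeeds when none do.
kernel_lines_missing_audit() {
  local conf="${1:-/boot/grub/grub.conf}" missing
  # -w keeps values such as audit=10 from being accepted as audit=1
  missing=$(grep -E '^[[:space:]]*kernel[[:space:]]' "$conf" | grep -vw 'audit=1' || true)
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
  return 0
}
```

Run as root on each database server, an empty result with a zero exit status indicates that all kernel lines already include the setting.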

Verify AUD$ and FGA_LOG$ tables use Automatic Segment Space Management


Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 02/27/2012 | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux, Solaris | 11.2.x + | 11.2.x +
Benefit / Impact:
With AUDIT_TRAIL set for database (AUDIT_TRAIL=db), and the AUD$ and FGA_LOG$ tables located in a dictionary segment space managed SYSTEM tablespace, "gc" wait events are sometimes observed
during heavy periods of database logon activity. Testing has shown that under such conditions, placing the AUD$ and FGA_LOG$ tables in the SYSAUX tablespace, which uses automatic segment space management,
reduces the space related wait events.
The impact of verifying that the AUD$ and FGA_LOG$ tables are in the SYSAUX tablespace is low. Moving them if they are not located in the SYSAUX tablespace does not require an outage, but should be done during a
scheduled maintenance period or slow audit record generation window.
Risk:
If the AUD$ and FGA_LOG$ tables are not verified to use automatic segment space management, there is a risk of a performance slowdown during periods of high database login activity.
Action / Repair:
To verify the segment space management policy currently in use by the AUD$ and FGA_LOG$ tables, use the following Sqlplus command:

select t.table_name,ts.segment_space_management from dba_tables t, dba_tablespaces ts where ts.tablespace_name = t.tablespace_name and t.table_name in ('AUD$','FGA_LOG$');

The output should be:
TABLE_NAME SEGMEN
------------------------------ ------
FGA_LOG$ AUTO
AUD$ AUTO 
If one or both of the AUD$ or FGA_LOG$ tables return "MANUAL", use the DBMS_AUDIT_MGMT package to move them to the SYSAUX tablespace:
BEGIN
  DBMS_AUDIT_MGMT.set_audit_trail_location(
    audit_trail_type           => DBMS_AUDIT_MGMT.AUDIT_TRAIL_AUD_STD, -- this moves table AUD$
    audit_trail_location_value => 'SYSAUX');
END;
/
BEGIN
  DBMS_AUDIT_MGMT.set_audit_trail_location(
    audit_trail_type           => DBMS_AUDIT_MGMT.AUDIT_TRAIL_FGA_STD, -- this moves table FGA_LOG$
    audit_trail_location_value => 'SYSAUX');
END;
/
The output should be similar to:
PL/SQL procedure successfully completed. 
If the output is not as above, investigate and correct the condition.
NOTE: This "DBMS_AUDIT_MGMT.set_audit_trail_location" command should be executed as part of the dbca template post processing scripts. For existing databases, the command can also be executed,
but since it moves the AUD$ and FGA_LOG$ tables using the "alter table ... move" command, it should be executed at a "quiet" time.

Use dbca templates provided for current best practices

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
 | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux, Solaris | 11.2.x + | 11.2.x +
Benefit / Impact:
Starting with onecommand version 11.2.2.3.1, dbca templates with built-in best practices are provided at deployment time for OLTP, DW/BI, and DBFS.
The database created at deployment time uses one of these templates. If other databases are created, the templates should be used to ensure
current database configuration best practices are implemented. If custom scripts are used to create databases, the templates can be used as a reference for those custom scripts.
Risk:
Not adhering to best practices can lead to unnecessary outages and performance problems
Action / Repair:
Run health check to assess diffs with current best practices. Check configuration assistant logs for template use.

Updating database node OEL packages to match the cell

MOS Note 1284070.1 provides a working example of updating the db host OEL packages to match those on the cell.

Disable cell level flash caching for grid disks that don't need it when using Write Back Flash Cache

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
n/a | August 2012 | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux | 11.2.3.2+ | 11.2.x +
Benefit / Impact
When using Write Back Flash Cache, disabling caching for grid disks that don't need it frees up cache space for more important objects.
The classic use-case for this is grid disks in the RECO diskgroup. Note that Exadata already has intelligence to not cache objects that
don't need it, but this extends that to the grid disk level in a Write Back Flash Cache configuration.
Risk:
Cache pollution (less caching benefit) leading to performance impact.
Action / Repair:
The following cellcli command displays the cell caching mode. It should be "WriteBack" for this best practice.
list cell attributes flashCacheMode
The following cellcli command displays the caching mode for all grid disks on a cell. A cachingPolicy of "none" indicates caching is turned off for that particular grid disk.
list griddisk attributes name,cachingPolicy
To disable caching for a particular grid disk, first flush the cache data for that grid disk, and then set the cachingPolicy attribute to "none" as illustrated in the cellcli commands below:
alter griddisk <grid disk name> flush
alter griddisk <grid disk name> cachingPolicy="none"
If caching needs to be enabled again after these steps, first cancel the prior flush, and then set the cachingPolicy attribute back to "default" as illustrated in the cellcli commands below:
alter griddisk <grid disk name> cancel flush
alter griddisk <grid disk name> cachingPolicy="default"

Gather system statistics in Exadata mode if needed

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
n/a | August 2012 | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux | 11.2.x + | 11.2.0.2 BP18 and 11.2.0.3 BP8
Benefit / Impact
Gathering Exadata specific system statistics ensures the optimizer is aware of Exadata scan speed. Accurately accounting for the speed of scan operations ensures the optimizer chooses an optimal execution plan in an Exadata environment. The following command gathers Exadata specific system statistics:
exec dbms_stats.gather_system_stats('EXADATA');
Note this best practice is not a general recommendation to gather system statistics in Exadata mode for all Exadata environments. For existing customers
who have acceptable performance with their current execution plans, do not gather system statistics in Exadata mode. For existing customers whose cardinality
estimates are accurate, but suffer from the optimizer over estimating the cost of a full table scan where the full scan performs better, then gather system
statistics in Exadata mode. For new applications where the impact can be assessed from the beginning, and dealt with easily if there is a problem, gather system statistics in Exadata mode.
Risk:
Lack of Exadata specific stats can lead to less performant optimizer plans.
Action / Repair:
To see if Exadata specific optimizer stats have been gathered, run the following query on a system with at least 11.2.0.2 BP18 or 11.2.0.3 BP8 Oracle software. If PVAL1 returns null or is not set, Exadata specific stats have not been gathered.
select pname, PVAL1 from aux_stats$ where pname='MBRC';

Verify Hidden Database Initialization Parameter Usage

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 08/01/18 | <Name> | Production | Exadata - Physical, Exadata - User Domain | 26638705 - exachk, 28321838 - exachk, 26136659 - exachk, 25143408 - exachk
DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
11.2.x, 12.1.x | Primary, Standby, ASM | ALL | 11.2.3.+ | Linux x86-64 | exachk 12.2.0.1.4, 18.3.0 |

Benefit / Impact
Hidden database initialization parameters are typically set as a workaround to solve a specific problem, and should be removed once a system has been upgraded to a version level that contains the fix for the specific problem. Often they are not removed during the upgrade process to the version level that contains the correct fix. Verifying the hidden database initialization parameter usage helps avoid hidden parameters being used any longer than necessary.
Risk:
Use of hidden ASM or database initialization parameters not recommended by Oracle development in an Exadata environment can cause instability, performance problems, corruptions, and crashes.
Action / Repair:
To verify the hidden database initialization parameter usage in each ASM and database instance, execute the following sqlplus command as the owner of the respective home with the environment properly set to access the instance:
select name,value from v$parameter where substr(name,1,1)='_';
NOTE: v$parameter only contains hidden parameters that have been changed from the default, which are the ones of interest here.
The expected output should be a list of any hidden parameters in use that have been changed from the default value, similar to:
_enable_NUMA_support  FALSE
There should be no hidden parameters in use that are not shown in the "Generally Acceptable Hidden Parameters Table":
Generally Acceptable Hidden Parameters Table
Parameter Name | Value | Oracle Version | Exadata Version | Instance Type | Notes
_file_size_increase_increment | 2143289344 | <= 11.2.0.3 BP11 | ALL | Database | Enables more performant rman backup allocation sizes.
_enable_NUMA_support | Set _enable_NUMA_support=TRUE for all hardware generation 8-socket database servers (Note - applies to non-OVM only - OVM is not supported on 8-socket servers). Set _enable_NUMA_support=TRUE for X5 and later 2-socket database servers deployed as non-OVM. In all other cases do not explicitly set _enable_NUMA_support. | < 12.1.0.2.6 | ALL | Database | For any Exadata system using Database 12.1.0.2.6 or higher, do not explicitly set _enable_NUMA_support (includes all hardware generations, 2-socket, 8-socket, non-OVM, and OVM). _enable_NUMA_support setting is automatically configured by the database. For any Exadata system using Database 12.1.0.2.5 or lower, reference the recommended setting in the Value column of this row.
_asm_resyncckpt | | 12.1.0.1 ONLY | ALL | ASM | Turns off resync checkpointing
_smm_auto_max_io_size | 1024 | 12.1 and lower | ALL | Database | This permits 1MB IOs for hash joins that spill to disk, which can increase performance up to 40% due to increased throughput. These performance increases can prevent the need to move TEMP to flash. Internal only note: this will no longer be needed when bug 20925115 is fixed.
_parallel_adaptive_max_users | | 12.1 and higher | ALL | Database | Check to ensure not more than the recommended value. Setting this higher than this recommended value can deplete memory and impact performance.*

* Parameter PARALLEL_MAX_SERVERS is evaluated based on the below calculation method:
parallel_threads_per_cpu*cpu_count*concurrent_parallel_users*5

Parameter PARALLEL_SERVERS_TARGET is evaluated based on the below calculation method:
parallel_threads_per_cpu*cpu_count*concurrent_parallel_users*2

_PARALLEL_ADAPTIVE_MAX_USERS provides the value of concurrent_parallel_users in the calculation. The value of this parameter is set to 4 in most cases which would result in a higher than recommended maximum number of parallel servers, therefore the recommended value is 2.

PARALLEL_MAX_SERVERS would be calculated as below assuming cpu_count is set to all available CPUs:
X2-2: 1 * 24 * 2 * 5 = 240
X6-2: 1 * 88 * 2 * 5 = 880
X2-8: 1 * 128 * 2 * 5 = 1280
X6-8: 1 * 288 * 2 * 5 = 2880

_assm_segment_repair_bg | FALSE | 12.2 and higher | ALL | Database | Workaround for bug 23734075
_asm_max_connected_clients | Dynamically changes | 12.2 and 18.1 ONLY | ALL | ASM | Used internally; removed in release 19c
_backup_disk_bufcnt | 64 | 12.1 and lower | ALL | Database | Only when ZFS based backups are in use
_backup_disk_bufsz | 1048576 | 12.1 and lower | ALL | Database | Only when ZFS based backups are in use
_backup_file_bufcnt | 64 | 12.1 and lower | ALL | Database | Only when ZFS based backups are in use
_backup_file_bufsz | 1048576 | 12.1 and lower | ALL | Database | Only when ZFS based backups are in use
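
The PARALLEL_MAX_SERVERS and PARALLEL_SERVERS_TARGET formulas in the table can be checked with a few lines of shell arithmetic. This is an illustrative sketch only; the function name parallel_servers is hypothetical, and the last argument is the multiplier from the formula (5 for PARALLEL_MAX_SERVERS, 2 for PARALLEL_SERVERS_TARGET).

```shell
# Hypothetical helper: evaluate
#   parallel_threads_per_cpu * cpu_count * concurrent_parallel_users * multiplier
# $1 parallel_threads_per_cpu, $2 cpu_count,
# $3 concurrent_parallel_users (_parallel_adaptive_max_users), $4 multiplier
parallel_servers() {
    echo $(( $1 * $2 * $3 * $4 ))
}

# X2-2 with the recommended value of 2 concurrent parallel users:
parallel_servers 1 24 2 5    # PARALLEL_MAX_SERVERS, prints 240
# The same server with the common value of 4, for comparison:
parallel_servers 1 24 4 5    # prints 480
```

The comparison shows why the table recommends 2: the value of 4 doubles the maximum number of parallel servers on the same hardware.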
NOTES:

1) For additional ZFS based backup configuration information, please see: Oracle ZFS Storage: FAQ: Exadata RMAN Backup with The Oracle ZFS Storage Appliance (Doc ID 1354980.1)
2) This best practice check does not include any application specific hidden parameters. If an application in use requires hidden parameters that are failed by this best practice, refer to the proper documentation for the application version in use. If the extra hidden parameters are correct, then ignore the failures reported for those specific parameters.

For Oracle E-Business Suite, please see: Database Initialization Parameters for Oracle E-Business Suite Release 12 (Doc ID 396009.1)
For Siebel CRM Application, please see: Performance Tuning Guidelines for Siebel CRM Application on Oracle Database (Doc ID 2077227.2)
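
The comparison against the "Generally Acceptable Hidden Parameters Table" can be sketched as a simple allow-list filter. This is a minimal illustration, not the exachk implementation: the function name unexpected_hidden_params is hypothetical, the input is assumed to be "name value" pairs as returned by the v$parameter query above, and the sketch checks parameter names only. It does not check the Value, Oracle Version, or Instance Type columns, which still need to be reviewed by hand.

```shell
# Parameter names copied from the table above, lower-cased for comparison.
ACCEPTABLE_HIDDEN_PARAMS="_file_size_increase_increment _enable_numa_support \
_asm_resyncckpt _smm_auto_max_io_size _parallel_adaptive_max_users \
_assm_segment_repair_bg _asm_max_connected_clients _backup_disk_bufcnt \
_backup_disk_bufsz _backup_file_bufcnt _backup_file_bufsz"

# Hypothetical helper: read "name value" lines on stdin and print every
# parameter name that is not in the allow-list.
unexpected_hidden_params() {
    local name value lower allowed
    # Collapse the list to single spaces so the pattern match below is exact.
    allowed=" $(echo $ACCEPTABLE_HIDDEN_PARAMS) "
    while read -r name value; do
        [ -z "$name" ] && continue
        lower=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
        case "$allowed" in
            *" $lower "*) ;;                  # listed in the table
            *) printf '%s\n' "$name" ;;       # needs investigation
        esac
    done
}

# Sample input only; "_some_other_param" is a made-up name for illustration.
printf '%s\n' '_enable_NUMA_support FALSE' '_some_other_param 1' |
    unexpected_hidden_params
```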

Verify BDB location for Cloned GI homes


Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
n/a | August 2012 | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux, Solaris | 11.2.x + | 11.2.x +
Benefit / Impact
After cloning a Grid Home, the $GI_HOME/crf/admin/crf<node>.ora configuration file in the new home still has the BDB location pointing to the GI home from which it was cloned. Verifying and correcting the location avoids the upgrade failure described below.
Risk:
A GI upgrade to 11.2.0.3 from 11.2.0.1 or 11.2.0.2 can fail.
Error messages appear in the $GRID_HOME/log/crflogd/crflogdOUT.log log file.
Action / Repair:
Manually edit $GI_HOME/crf/admin/crf<node>.ora in the cloned Grid Infrastructure Home and change the values for BDBLOC and CRFHOME.
This same change needs to be done on all nodes in the cluster to the file referenced above if it exists. Reference: 1485970.1 / 14168708
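
Before editing the file, the BDBLOC and CRFHOME entries can be compared against the cloned home. The following is a rough sketch under stated assumptions: the function name check_crf_home is hypothetical, the file is assumed to hold simple KEY=value lines, and an entry is treated as correct only when it equals the expected home or lies underneath it. The paths in the example are made up.

```shell
# Hypothetical helper: report whether BDBLOC and CRFHOME point inside the
# expected Grid Infrastructure home.
# $1: path to a crf<node>.ora file, $2: expected GI home
check_crf_home() {
    local file=$1 gi_home=$2 key value status=0
    while IFS='=' read -r key value; do
        case "$key" in
            BDBLOC|CRFHOME)
                case "$value" in
                    "$gi_home"|"$gi_home"/*) echo "OK: $key=$value" ;;
                    *) echo "MISMATCH: $key=$value"; status=1 ;;
                esac
                ;;
        esac
    done < "$file"
    return $status
}

# Example with a temporary file standing in for crf<node>.ora:
sample=$(mktemp)
printf 'BDBLOC=/u01/app/11.2.0.2/grid/crf/db/node1\nCRFHOME=/u01/app/11.2.0.2/grid\n' > "$sample"
check_crf_home "$sample" /u01/app/11.2.0.3/grid || echo "crf file still references the source home"
rm -f "$sample"
```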

Verify Shared Servers do not perform serial full table scans

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Warn | September 2012 | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux | 11.2.x + | 11.2.x +

Benefit / Impact
As an Oracle kernel design decision, shared servers are intended to perform quick transactions and therefore do not issue serial (non PQ) direct reads. Consequently, shared servers do not perform serial (non PQ) Exadata smart scans.
The impact of verifying that shared servers are not doing serial full table scans is minimal. Modifying the shared server environment to avoid shared server serial full table scans varies by configuration and application behavior, so the impact cannot be estimated here.
Risk:
Shared servers doing serial full table scans in an Exadata environment lead to a performance impact due to the loss of Exadata smart scans.
Action / Repair:
To verify shared servers are not in use, execute the following SQL query as the "oracle" userid:

SQL> select NAME,value from v$parameter where name='shared_servers';
The expected output is:
NAME VALUE
--------------- ------------------------------
shared_servers 0
If the output is not "0", use the following command as the "oracle" userid with properly defined environment variables and check the output for "SHARED" configurations:

$ORACLE_HOME/bin/lsnrctl service
If shared servers are confirmed to be present, check for serial full table scans performed by them. If shared servers performing serial full table
scans are found, the shared server environment and application behavior should be modified to favor the normal Oracle foreground processes so that
serial direct reads and Exadata smart scans can be used.

Verify Write Back Flash Cache minimum version requirements

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 02/06/13 | <Name> | Development | Exadata, SSC | 16012455 - exachk
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
11.2.0.3 BP9+ | ASM | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2; all engineered systems with Exadata Storage | 11.2.3.2.1+ | Solaris - 11, Linux x86-64 | exachk 2.2.1 |
Benefit / Impact
Oracle Write Back Flash Cache requires Oracle version 11.2.0.3 Bundle Patch 9 (BP9) in the Grid Infrastructure ORACLE_HOME or higher and Exadata version 11.2.3.2.1 or higher.
Oracle 11.2.0.3 BP9 or higher in the Grid Infrastructure ORACLE_HOME enables the resilvering feature, which drastically reduces the time required to restore redundancy after a flash disk (FDOM) failure.
Exadata software 11.2.3.2.1 has critical optimizations and fixes (e.g. fix for bug 16232581) to fully take advantage of Exadata Write Back Flash Cache.
Risk:
Without 11.2.0.3 BP9 in the Grid Infrastructure ORACLE_HOME, disks cached by the failed DOM will be dropped and added which significantly extends the repair time.
Without the fixes in Exadata cell 11.2.3.2.1, IO errors and possible data corruptions may appear for very large IO intensive workloads when using Write Back Flash Cache.
Action / Repair:
To check if Write Back Flash Cache is in use, run the following cellcli command on all storage servers and check for 'WriteBack'
CellCLI> list cell attributes flashCacheMode
WriteBack
To check the Grid Infrastructure ORACLE_HOME for BP9 or above, run the following command from the Grid Infrastructure ORACLE_HOME as the oracle userid:
$ $ORACLE_HOME/OPatch/opatch lspatches
The output should be similar to:
14307915;DISKMON PATCH FOR EXADATA (NOV 2012 - 11.2.0.3.12) : (14307915) 
14275572;CRS PATCH FOR EXADATA (NOV 2012 - 11.2.0.3.12) : (14275572) 
14662263;DATABASE PATCH FOR EXADATA (NOV 2012 - 11.2.0.3.12) : (14662263)
In this case, patch 14275572 is applied, which is 11.2.0.3 BP12, and therefore the proper fixes are in place.
If the Oracle version is less than 11.2.0.3 BP9, upgrade to 11.2.0.3 BP9 or higher.
To check the Exadata software version, execute the following command as the root userid on all storage servers:
imageinfo -version
The output should be similar to:
11.2.3.2.1.130109
If the Exadata software version is less than 11.2.3.2.1, upgrade to 11.2.3.2.1 or higher.
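
The comparison of the imageinfo output against the 11.2.3.2.1 minimum can be done field by field rather than by eye. This is a minimal sketch; the function name version_ge is hypothetical, and it assumes every dot-separated field is a plain decimal number, as in the sample output above.

```shell
# Hypothetical helper: succeed when dotted version $1 is greater than or
# equal to dotted version $2. Missing trailing fields count as 0.
version_ge() {
    local a=$1 b=$2 x y
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Take the leading field of each version string.
        x=${a%%.*}; y=${b%%.*}
        if [ "${x:-0}" -gt "${y:-0}" ]; then return 0; fi
        if [ "${x:-0}" -lt "${y:-0}" ]; then return 1; fi
        # Drop the field just compared; empty the string when none remain.
        case "$a" in *.*) a=${a#*.} ;; *) a= ;; esac
        case "$b" in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

# Using the sample imageinfo output shown above:
if version_ge 11.2.3.2.1.130109 11.2.3.2.1; then
    echo "Exadata software meets the Write Back Flash Cache minimum"
else
    echo "Exadata software is below 11.2.3.2.1"
fi
```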

Verify bundle patch version installed matches bundle patch version registered in database

Priority | Alert Level | Date | Owner | Status | Scope
Critical | FAIL | 11/04/15 | <Name> | Production | Exadata, Exalogic, SSC
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
>= 12.1.0.2 | ALL | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X5-2, X5-8 | 11.2.x + | Linux, Solaris | exachk 12.1.0.2.6
Benefit / Impact:
Crosschecking the software bundle patch version installed with the bundle patch registered in the database to make sure they match ensures software correctness and stability. If a bundle patch is being installed in a Data Guard configuration in a standby-first manner where the SQL portion of the bundle patch is not installed inside the database until the primary and all standby software homes have the same version installed, then this crosscheck is expected to fail until both the binary and SQL portion of the bundle patch application is fully installed.
Risk:
Incomplete bug fixes, software instability, and unexpected behavior
Action / Repair:
To verify that the bundle patch version installed matches bundle patch version registered in database, as the oracle home owner for the primary database, and with ORACLE_SID and ORACLE_HOME properly set, execute the following command:
opatch_bp=$($ORACLE_HOME/OPatch/opatch lspatches 2>/dev/null|grep -iwv javavm|grep -wi database|head -1|awk -F';' '{print $1}');
database_bp_status=$(echo -e "set heading off feedback off timing off \n select ACTION, STATUS from (select * from dba_registry_sqlpatch where PATCH_ID = $opatch_bp order by action_time desc) where rownum=1;"|$ORACLE_HOME/bin/sqlplus -s " / as sysdba" | sed -e '/^ *$/d');
database_bp_status=$(echo $database_bp_status);
if [ "$database_bp_status" == "APPLY SUCCESS" ];
then
echo "SUCCESS: Bundle patch installed in the database matches the software home and is installed successfully.";
else
echo "FAILURE: Bundle patch installed in the database does not match the software home, or is installed with errors.";
fi;
The output should be similar to:
SUCCESS: Bundle patch installed in the database matches the software home and is installed successfully.
If FAILURE is reported, then investigate and correct the discrepancy.

NOTE: For versions less than 12.1.0.2, please see this archived best practice:
Verify bundle patch version installed matches bundle patch version registered in database (ARCHIVE)

Verify database server file systems have "Maximum mount count" = "-1"


Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 03/16/16 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, Exadata - User Domain |
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, EIGHTH, X4-2, X4-8, X5-2, X5-8 | 11.2.2.2.0+ | Linux x86-64 | exachk 12.1.0.2.7 |

Benefit / Impact:
A filesystem will be checked for consistency (fsck) after the number of times it is mounted exceeds the "Maximum mount count" setting, typically at reboot time. On a database server, the "Maximum mount count" is set to "-1" by default.
Verifying that the database server file systems all have "Maximum mount count" set to "-1" helps to avoid an unexpectedly long reboot sequence while an fsck of the file system completes. The impact of verifying the database server file systems "Maximum mount count" is minimal. The impact of changing the "Maximum mount count" value is minimal as it can be changed dynamically.

Risk:
A database server reboot may take an unexpectedly long time as an fsck operation completes, potentially extending an outage or maintenance window.
Action / Repair:
To verify the database server disk devices maximum mount count configuration, execute the following command as the "root" userid on all database servers:
LVM_IN_USE=$(parted -ls 2>/dev/null | egrep -i lvm | wc -l);
if [ $LVM_IN_USE -ge 1 ]
then
if test -f /proc/xen/capabilities && grep -q "control_d" /proc/xen/capabilities
then
FS_COMMAND=tune4fs # dom0 case
else
FS_COMMAND=tune2fs # physical, domU case
fi;
LOGICAL_VOLUME_ARRAY=$(lvscan | cut -d"'" -f2);
for INDIVIDUAL_LOGICAL_VOLUME in $LOGICAL_VOLUME_ARRAY
do
if [ $(file -sL $INDIVIDUAL_LOGICAL_VOLUME | egrep -wc "ext3|ext4" 2> /dev/null) -eq 1 ]
then
FILESYSTEM_ARRAY+="$INDIVIDUAL_LOGICAL_VOLUME "
fi;
done;
MNT_CNT_CHK_RSLT=0;
for INDIVIDUAL_LOGICAL_VOLUME in $(echo ${FILESYSTEM_ARRAY[@]})
do
if [ "$($FS_COMMAND -l $INDIVIDUAL_LOGICAL_VOLUME | egrep "^Maximum mount" | cut -d ":" -f 2 | sed -e 's/^[ \t]*//')" -ne "-1" ]
then MNT_CNT_CHK_RSLT=1;
fi;
done;
if [ "$MNT_CNT_CHK_RSLT" -eq "0" ]
then
echo -e "\nSUCCESS: All database server logical volumes found with filesystems had \"Maximum mount count\" equal to -1";
else
echo -e "\nFAILURE: One or more database server logical volumes found with filesystems had \"Maximum mount count\" not equal to -1";
fi;
for INDIVIDUAL_LOGICAL_VOLUME in $(echo ${FILESYSTEM_ARRAY[@]})
do
echo "$INDIVIDUAL_LOGICAL_VOLUME: $($FS_COMMAND -l $INDIVIDUAL_LOGICAL_VOLUME | egrep "^Maximum mount" | cut -d ":" -f 2 | sed -e 's/^[ \t]*//')";
done;
else
export SWAP_DEVICE=$(swapon -s | grep -v Filename | cut -d" " -f1)
export PARTITIONED_DEVICE_ARRAY=$(fdisk -l 2>/dev/null | egrep ^/dev | egrep -v $SWAP_DEVICE | cut -d" " -f1);
export MNT_CNT_CHK_RSLT=0;
for INDIVIDUAL_PARTITIONED_DEVICE in $PARTITIONED_DEVICE_ARRAY
do
if [ "$(tune2fs -l $INDIVIDUAL_PARTITIONED_DEVICE | egrep "^Maximum mount" | cut -d ":" -f 2 | sed -e 's/^[ \t]*//')" -ne "-1" ]
then MNT_CNT_CHK_RSLT=1;
fi;
done;
if [ "$MNT_CNT_CHK_RSLT" -eq "0" ]
then
echo -e "\nSUCCESS: All database server partitioned devices (other than swap) found had \"Maximum mount count\" equal to -1";
for INDIVIDUAL_PARTITIONED_DEVICE in $PARTITIONED_DEVICE_ARRAY
do
echo "$INDIVIDUAL_PARTITIONED_DEVICE: $(tune2fs -l $INDIVIDUAL_PARTITIONED_DEVICE | egrep "^Maximum mount" | cut -d ":" -f 2 | sed -e 's/^[ \t]*//')";
done;
else
echo -e "\nFAILURE: One or more database partitioned devices (other than swap) found had \"Maximum mount count\" not equal to -1";
for INDIVIDUAL_PARTITIONED_DEVICE in $PARTITIONED_DEVICE_ARRAY
do
echo "$INDIVIDUAL_PARTITIONED_DEVICE: $(tune2fs -l $INDIVIDUAL_PARTITIONED_DEVICE | egrep "^Maximum mount" | cut -d ":" -f 2 | sed -e 's/^[ \t]*//')";
done;
fi;
fi;
The output should be similar to:
SUCCESS: All database server logical volumes found (other than swap) and the boot device had "Maximum mount count" equal to -1
Boot Device /dev/sda1: -1
/dev/VGExaDb/LVDbSys1: -1
/dev/VGExaDb/LVDbOra1: -1
/dev/VGExaDb/LVDbSys2: -1
- OR -
SUCCESS: All database server partitioned devices (other than swap) found had "Maximum mount count" equal to -1
/dev/sda1: -1
/dev/sda3: -1
If the output is not as expected, you can change the "Maximum mount count" value as the "root" userid using the appropriate command for your environment ("tune2fs" or "tune4fs") on the database server for either partitioned or logical volume devices. Only the device name portion of the command differs. For example, if the appropriate command for your environment is "tune2fs":
# tune2fs -c -1 /dev/mapper/VGExaDb-LVDbOra1
tune2fs 1.39 (29-May-2006)
Setting maximal mount count to -1
NOTE: fsck should be periodically executed as part of the regular maintenance schedule for an Oracle Exadata Database Machine, where the timing is controlled by the customer. This check only verifies that the timing of the run should be controlled and not unexpected.

NOTE: In Exadata versions 11.2.3.2.0, 11.2.3.2.1, and 11.2.3.2.2, the database server may reset "Maximum mount count" to 27 and "Check interval" to 15552000 for some devices upon reboot. This is due to a change introduced in bug 14223777. The recommended fix is to upgrade to 11.2.3.3.0 or higher.

Verify database server file systems have "Check interval" = "0"


Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 03/16/16 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, Exadata - User Domain |
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, EIGHTH, X4-2, X4-8, X5-2, X5-8 | 11.2.2.2.0+ | Linux x86-64 | exachk 12.1.0.2.7 |

Benefit / Impact:
A filesystem will be checked for consistency (fsck) after the elapsed time from the last fsck run exceeds the "Check interval" setting, typically at reboot time. On a database server, the "Check interval" is set to "0" by default.
Verifying that the database server file systems all have the "Check interval" set to "0" helps to avoid an unexpectedly long reboot sequence while an fsck of the file system completes. The impact of verifying the database server file system "Check interval" is minimal. The impact of changing the file system "Check interval" value is minimal as it can be changed dynamically.

Risk:
A database server reboot may take an unexpectedly long time as an fsck operation completes, potentially extending an outage or maintenance window.
Action / Repair:
To verify the database server disk devices check interval configuration, execute the following command as the "root" userid on all database servers:
LVM_IN_USE=$(parted -ls 2>/dev/null | egrep -i lvm | wc -l);
if [ $LVM_IN_USE -ge 1 ]
then
if test -f /proc/xen/capabilities && grep -q "control_d" /proc/xen/capabilities
then
FS_COMMAND=tune4fs # dom0 case
else
FS_COMMAND=tune2fs # physical, domU case
fi;
LOGICAL_VOLUME_ARRAY=$(lvscan | cut -d"'" -f2);
for INDIVIDUAL_LOGICAL_VOLUME in $LOGICAL_VOLUME_ARRAY
do
if [ $(file -sL $INDIVIDUAL_LOGICAL_VOLUME | egrep -wc "ext3|ext4" 2> /dev/null) -eq 1 ]
then
FILESYSTEM_ARRAY+="$INDIVIDUAL_LOGICAL_VOLUME "
fi;
done;
LVM_CHECK_INTERVAL_RSLT=0;
for INDIVIDUAL_LOGICAL_VOLUME in $(echo ${FILESYSTEM_ARRAY[@]})
do
if [ "$($FS_COMMAND -l $INDIVIDUAL_LOGICAL_VOLUME | grep "Check interval:"|awk '{print $3}')" -ne "0" ]
then LVM_CHECK_INTERVAL_RSLT=1;
fi;
done;
if [ "$LVM_CHECK_INTERVAL_RSLT" -eq "0" ]
then
echo -e "\nSUCCESS: All database server logical volumes found with filesystems had \"Check interval\" equal to 0";
else
echo -e "\nFAILURE: One or more database server logical volumes found with filesystems had \"Check interval\" not equal to 0";
fi;
for INDIVIDUAL_LOGICAL_VOLUME in $(echo ${FILESYSTEM_ARRAY[@]})
do
echo "$INDIVIDUAL_LOGICAL_VOLUME: $($FS_COMMAND -l $INDIVIDUAL_LOGICAL_VOLUME | grep "Check interval:"|awk '{print $3}')";
done;
else
export SWAP_DEVICE=$(swapon -s | grep -v Filename | cut -d" " -f1)
export PARTITIONED_DEVICE_ARRAY=$(fdisk -l 2>/dev/null | egrep ^/dev | egrep -v $SWAP_DEVICE | cut -d" " -f1);
export PRTN_CHECK_INTERVAL_RSLT=0;
for INDIVIDUAL_PARTITIONED_DEVICE in $PARTITIONED_DEVICE_ARRAY
do
if [ "$(tune2fs -l $INDIVIDUAL_PARTITIONED_DEVICE | grep "Check interval:"|awk '{print $3}')" -ne "0" ]
then PRTN_CHECK_INTERVAL_RSLT=1;
fi;
done;
if [ "$PRTN_CHECK_INTERVAL_RSLT" -eq "0" ]
then
echo -e "\nSUCCESS: All database server partitioned devices (other than swap) found had \"Check interval\" equal to 0";
for INDIVIDUAL_PARTITIONED_DEVICE in $PARTITIONED_DEVICE_ARRAY
do
echo "$INDIVIDUAL_PARTITIONED_DEVICE: $(tune2fs -l $INDIVIDUAL_PARTITIONED_DEVICE | grep "Check interval:"|awk '{print $3}')";
done;
else
echo -e "\nFAILURE: One or more database partitioned devices (other than swap) found had \"Check interval\" not equal to 0";
for INDIVIDUAL_PARTITIONED_DEVICE in $PARTITIONED_DEVICE_ARRAY
do
echo "$INDIVIDUAL_PARTITIONED_DEVICE: $(tune2fs -l $INDIVIDUAL_PARTITIONED_DEVICE | grep "Check interval:"|awk '{print $3}')";
done;
fi;
fi;
The output should be similar to:
SUCCESS: All database server disk devices found (other than swap) and the boot device had "Check interval" equal to 0
Boot Device /dev/sda1: 0
/dev/VGExaDb/LVDbSys1: 0
/dev/VGExaDb/LVDbOra1: 0
/dev/VGExaDb/LVDbSys2: 0
- OR -
SUCCESS: All database server partitioned devices (other than swap) found had "Check interval" equal to 0
/dev/cciss/c0d0p1: 0
/dev/cciss/c0d0p3: 0
If the output is not as expected, you can change the "Check interval" value as the "root" userid using the appropriate command for your environment ("tune2fs" or "tune4fs") on the database server for either partitioned or logical volume devices. Only the device name portion of the command differs. For example, if the appropriate command for your environment is "tune2fs":
# tune2fs -i 0 /dev/VGExaDb/LVDbOra1
tune2fs 1.39 (29-May-2006)
Setting interval between checks to 0 seconds
NOTE: fsck should be periodically executed as part of the regular maintenance schedule for an Oracle Exadata Database Machine, where the timing is controlled by the customer. This check only verifies that the timing of the run should be controlled and not unexpected.

NOTE: In Exadata versions 11.2.3.2.0, 11.2.3.2.1, and 11.2.3.2.2, the database server may reset "Maximum mount count" to 27 and "Check interval" to 15552000 for some devices upon reboot. This is due to a change introduced in bug 14223777. The recommended fix is to upgrade to 11.2.3.3.0 or higher.
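
Because "Maximum mount count" and "Check interval" are reported by the same tune2fs -l (or tune4fs -l) listing, both settings can be checked in one pass over that output. The following is a small sketch, not part of exachk: the function name fsck_settings_ok is hypothetical, and the sample lines imitate the two relevant lines of a tune2fs -l listing.

```shell
# Hypothetical helper: read "tune2fs -l" style output on stdin and succeed
# only when Maximum mount count is -1 and Check interval is 0.
fsck_settings_ok() {
    awk -F':' '
        /^Maximum mount count/ { gsub(/[ \t]/, "", $2); mount = $2 }
        /^Check interval/      { split($2, f, " "); interval = f[1] }
        END { exit !(mount == "-1" && interval == "0") }
    '
}

# Sample lines only; on a server the input would be piped from tune2fs -l.
if printf '%s\n' 'Maximum mount count:      27' \
                 'Check interval:           15552000 (6 months)' | fsck_settings_ok
then
    echo "SUCCESS: fsck will not be triggered by mount count or interval"
else
    echo "FAILURE: fsck may run unexpectedly at reboot"
fi
```

The sample values 27 and 15552000 are the ones the NOTE above says may be restored after a reboot on the affected Exadata versions, so the sample reports FAILURE.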


Verify Automated Service Request (ASR) configuration

Priority | Alert Level | Date | Owner | Status | Scope
Critical | FAIL | 11/11/12 | <Name> | Development | Exadata, SSC, Exalogic
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | 11.2.2.2.0+ | Solaris - 11, Linux x86-64 | exachk 2.1.6
Benefit / Impact:
Verifying the Automated Service Request (ASR) is necessary to ensure that an Oracle Exadata Database Machine can automatically open an Oracle support Service Request when a qualifying condition is detected.
The impact of verifying the ASR configuration is minimal. The impact of correcting deficiencies found varies by the corrective action required, and cannot be estimated here.
Risk: If the ASR configuration is not correct, service requests will not be correctly opened automatically when a qualifying condition is detected, leading to delays in correcting the qualifying condition.
Action / Repair:
There are two methods to verify that the ASR configuration is correct:
1) Read and follow the instructions in My Oracle Support Doc ID 1450112.1, which provides the asrexacheck script to verify the ASR configuration.
2) Download and execute the latest exachk from My Oracle Support Doc ID 1070954.1, which includes the asrexacheck script.
Refer to the output of the asrexacheck script, or the "Systemwide Automatic Service request (ASR) healthcheck" section of the exachk HTML report, for findings and corrective actions.

Verify ZFS File System User and Group Quotas are configured

Priority | Alert Level | Date | Owner | Status | Scope
Critical | WARN | 3/1/2013 | <Name> | Review | Exadata, SSC
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X3-2, X4-2 | 11.2.1.0.0 + | Solaris - 11 | exachk 2.2.0
Benefit / Impact:
Filesystem quotas enable control of filesystem space to users and groups. Especially on systems where the grid infrastructure and RDBMS software are managed through separate OS users, restrictions on space consumption are helpful to ensure that system stability and application availability are maximized.
Risk:
Without quotas, filesystems can fill up and application availability can be impacted. When quotas are used, soft limits enable warnings when the quota limits approach and hard limits keep the filesystem from filling to ensure that the system remains stable.
Action / Repair:
To verify ZFS file system user and group quotas are configured, as the "root" userid on all Solaris database servers, run the following commands:
# zfs get userquota@oracle data/u01
NAME      PROPERTY          VALUE  SOURCE
data/u01  userquota@oracle  none   local
# zfs get groupquota@oinstall data/u01
NAME      PROPERTY             VALUE  SOURCE
data/u01  groupquota@oinstall  none   local
NOTE: a value of "none" means quotas have not yet been created.

NOTE: This procedure only applies to Solaris database servers in Exadata database machine. No changes are permitted on Exadata storage cells. For instructions on how to implement ZFS quotas on Exadata, please refer to Chapter 7 of the Database Machine Owners Guide - "Resetting the Quota of a ZFS Storage Pool File System"
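
The "none" check described above can be applied to captured zfs get output with a short filter. This is a minimal sketch; the function name quotas_not_set is hypothetical, and the sample lines are the ones from the example output in this section.

```shell
# Hypothetical helper: read "zfs get" output on stdin and print the
# dataset and property of every quota whose VALUE column is "none".
quotas_not_set() {
    awk '$1 != "NAME" && $3 == "none" { print $1, $2 }'
}

# Sample input only; on a Solaris database server this would be piped
# from the zfs get commands shown above.
printf '%s\n' \
    'NAME      PROPERTY          VALUE  SOURCE' \
    'data/u01  userquota@oracle  none   local' \
    'NAME      PROPERTY             VALUE  SOURCE' \
    'data/u01  groupquota@oinstall  none   local' | quotas_not_set
```

An empty result means every listed quota has a value; any line printed identifies a quota that has not yet been created.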

Verify the file /.updfrm_exact does not exist

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 04/02/2014 | <Name> | Production | Exadata, SSC, Exalogic | 18746642 - exachk
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | All | All | exachk 2.2.5 |
Benefit / Impact:
To work around a firmware patching issue in an earlier Exadata release, the file /.updfrm_exact had to be manually created. This file should only be created temporarily during patching at the direction of Oracle Support, and should be removed immediately after patching is complete.
The impact of verifying the existence of the file /.updfrm_exact and removing it is minimal.
Risk:
If /.updfrm_exact exists, a manual firmware upgrade may be inadvertently rolled back when the server is next rebooted.
Action / Repair:
To verify that the file /.updfrm_exact does not exist, as the root userid on all database and storage servers, execute the following command:
bash -c '[ -f /.updfrm_exact ] && echo "FAIL: /.updfrm_exact exists"'
The output should be empty.
If the output is similar to the following:
randomdb01: FAIL: /.updfrm_exact exists
then remove the file /.updfrm_exact with the following command executed as the root userid:
 rm -f /.updfrm_exact

Verify the vm.min_free_kbytes configuration

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 04/10/19 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, Exadata - User Domain, RA | ALL | 29604454 - exachk, 27679610 - exachk, 26308040 - exachk, 17251233, 16984594, 17200041
DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 19.3.0 | N/A
Benefit / Impact:
Maintaining vm.min_free_kbytes as recommended helps a Linux system to reclaim memory faster. The parameter is expressed in kilobytes. For a database server with 1 NUMA node, the minimum value is 524288 (512MB). For database servers with more than 1 NUMA node, the minimum value is the number of NUMA nodes multiplied by 524288.
The impact of verifying the vm.min_free_kbytes configuration is minimal. Adjusting vm.min_free_kbytes should include a reboot to verify that the setting is correct and is retained across the boot cycle.
NOTE: It is possible, but NOT recommended, especially for a system already under memory pressure, to modify the setting interactively.
Risk:
Exposure to unexpected node eviction and reboot.
Action / Repair:
To verify the vm.min_free_kbytes configuration, as the "root" userid on each database server, execute the following command set:
MIN_FREE_KBYTES_SYSCTL=$(egrep ^vm.min_free_kbytes /etc/sysctl.conf | awk '{print $3}');
MIN_FREE_KBYTES_MEMORY=$(cat /proc/sys/vm/min_free_kbytes);
RAW_NUMA_DATA=$(numactl -s | egrep ^cpubind | awk '{$1=$1;print}')
FIELD=$(expr $(echo "$RAW_NUMA_DATA" | tr -cd ' ' | wc -c) + 1)
NUMA_NODE_COUNT=$(expr $(echo "$RAW_NUMA_DATA" | cut -d " " -f$FIELD) + 1)
if [[ $NUMA_NODE_COUNT = 1 ]]
then
  MINIMUM_SIZE=524288
else
  MINIMUM_SIZE=$(expr $NUMA_NODE_COUNT '*' 524288)
fi
DETAIL=$(
echo -e "NUMA node count:   $NUMA_NODE_COUNT";
echo -e "minimum size:      $MINIMUM_SIZE";
echo -e "in sysctl.conf:    $MIN_FREE_KBYTES_SYSCTL";
echo -e "in active memory:  $MIN_FREE_KBYTES_MEMORY";
)
if [[ $MIN_FREE_KBYTES_SYSCTL -eq $MIN_FREE_KBYTES_MEMORY && $MIN_FREE_KBYTES_SYSCTL -ge $MINIMUM_SIZE ]]
then
  echo -e "\nSUCCESS: vm.min_free_kbytes is set as recommended:\n$DETAIL";
else
  echo -e "\nFAILURE: vm.min_free_kbytes is not set as recommended:\n$DETAIL";
fi;
The output should be similar to:
SUCCESS: vm.min_free_kbytes is set as recommended:
NUMA node count:   8
minimum size:      4194304
in sysctl.conf:    4194304
in active memory:  4194304
-- OR --
SUCCESS: vm.min_free_kbytes is set as recommended:
NUMA node count:   2
minimum size:      1048576
in sysctl.conf:    1048576
in active memory:  1048576
-- OR --
SUCCESS: vm.min_free_kbytes is set as recommended:
NUMA node count:   1
minimum size:      524288
in sysctl.conf:    524288
in active memory:  524288
Example of a "FAILURE" result:
FAILURE: vm.min_free_kbytes is not set as recommended:
NUMA node count:   8
minimum size:      4194304
in sysctl.conf:    1048576
in active memory:  2097152
NOTE: In the above "FAILURE" example, it appears the sysctl.conf file setting is too low, and then the active kernel setting was expanded but still too low, and neither is close to the recommended minimum value.
If the output is a "FAILURE" result, investigate and take corrective action. Corrective action should include setting the minimum recommended vm.min_free_kbytes value for the given NUMA configuration in sysctl.conf and rebooting the database server.
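The sizing rule described in this section (524288 KB per NUMA node) can be sketched as a small shell helper. This is an illustrative sketch only: the function names recommended_min_free_kbytes and numa_node_count are hypothetical and not part of any Oracle tooling, and the sketch counts NUMA nodes from /sys/devices/system/node rather than parsing numactl output as the verification script does.

```shell
#!/bin/bash
# Hypothetical helper: print the recommended minimum vm.min_free_kbytes
# for a given NUMA node count (524288 KB per node, per the rule above).
recommended_min_free_kbytes() {
  local nodes=$1
  echo $(( nodes * 524288 ))
}

# Hypothetical helper: count NUMA nodes from sysfs; fall back to 1 if the
# node directories are absent.
numa_node_count() {
  local count
  count=$(ls -d /sys/devices/system/node/node[0-9]* 2>/dev/null | wc -l)
  count=$(( count + 0 ))
  [ "$count" -ge 1 ] || count=1
  echo "$count"
}

echo "recommended minimum: $(recommended_min_free_kbytes "$(numa_node_count)")"
```

The printed value can be compared with the "minimum size" line reported by the verification script.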

Validate key sysctl.conf parameters on database servers

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 5/8/13 | <Name> | Design | Exadata |
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | All | Linux |  |
Benefit / Impact:
Kernel parameter settings in /etc/sysctl.conf are applied to the kernel automatically at boot time and manually via the sysctl utility at runtime. The semantics of each kernel parameter are known only to the kernel, so the sysctl utility passes all values directly to the kernel with minimal processing and validation. Invalid values can be misinterpreted by the kernel, leading to unexpected results. For certain key parameters, such invalid values can have an immediate and critical impact on the system. Invalid values stored in /etc/sysctl.conf at boot time can prevent the system from booting, making it difficult to identify and correct the problem. Validating the format of some key parameters periodically or after changes to sysctl.conf can prevent unexpected outages due to human error.
Risk:
Applying improperly formatted values to kernel parameters can render a system unusable.
Action / Repair: Run the command "awk -f check_sysctl.awk /etc/sysctl.conf" and correct any parameters reported to be formatted incorrectly. The contents of check_sysctl.awk are shown below:
#########################################################################
# Notes:
#
# - The purpose of this script is to check certain kernel parameters in
# /etc/sysctl.conf that could prevent the server from booting if set
# incorrectly.
# - This script is only capable of checking the validity of the *syntax*
# of these parameters, but is not capable of assessing whether the
# values themselves are correct or optimal.
# - This script does not attempt to check all parameters in sysctl.conf.
# It only checks parameters which have been observed to cause severe
# impact on server stability.
#
# Revision history:
# 08-May-2014 - initial version
# 28-May-2014 - vm.nr_hugepages must be < 100% of physical memory
# 24-Jun-2014 - add corrective action guidance
#
#########################################################################

BEGIN {
 errcnt = 0
 BEGIN_memtotal_bytes()
}

END {
 if( !errcnt ) { print "All sysctl.conf formatting checks succeeded" }
 exit errcnt
}

function BEGIN_memtotal_bytes() {
 if( NR )
 {
 exit -1
 }

 cmd = "grep MemTotal /proc/meminfo"
 if( 1 != cmd | getline )
 {
 close( cmd )
 exit -1
 }
 else if( 3 != NF || $3 != "kB" )
 {
 print "Unexpected /proc/meminfo format"
 exit -1
 }
 close( cmd )
 memtotal_bytes = $2 * 1024

 cmd = "grep Hugepagesize /proc/meminfo"
 if( 1 != cmd | getline )
 {
 hugepage_size = 2048 * 1024
 }
 else if( 3 != NF || $3 != "kB" )
 {
 print "Unexpected /proc/meminfo format"
 exit -1
 }
 else
 {
 hugepage_size = $2 * 1024;
 }
 close( cmd )

 memtotal_hugepages = memtotal_bytes / hugepage_size
}

# This function extracts the value portion of the setting with whitespace
# before and after trimmed, as sysctl does
function extract_value( localval ) {
 localval = gensub( /^[^=]*=[[:space:]]*/, "", 1 )
 localval = gensub( /[[:space:]]*$/, "", 1, localval)
 return localval;
}

# This function verifies that the specified value consists entirely of
# numeric digits 0-9
function check_decimal_int( v ) {
 if( v !~ /^[[:digit:]]*$/ ) { return 0 }
 return 1;
}

# Check for comments first and skip to the next line if found
/^[[:space:]]*[#;]/ {
 next
}

/vm\.nr_hugepages/ {
 valstr = extract_value()
 if( !check_decimal_int(valstr) )
 {
 errcnt++
 print "Invalid hugepages line: '" $0 "'"
 print "ACTION: A valid hugepages line should look similar to the following example,"
 print " with no additional comments or other characters:"
 print ""
 print " vm.nr_hugepages = 10000"
 print ""
 next
 }

 # Add 0 to valstr to force it to numeric type. Otherwise
 # subsequent comparisons will use string comparisons,
 # which won't yield expected results
 valnum = 0 + valstr
 if( valnum >= memtotal_hugepages )
 {
 errcnt++
 print "Hugepages value '" valnum "' is larger than physical memory"
 print "ACTION: Reduce the hugepages value to something much less than the total size of"
 print " physical RAM in the server. For this server, a value of " memtotal_hugepages
 print " would consume all of physical RAM, and would prevent the server from"
 print " booting. Please refer to MOS Note 401749.1 for guidance on choosing"
 print " an appropriate value for this server."
 next
 }
}
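
The upper bound that check_sysctl.awk compares vm.nr_hugepages against is total physical memory divided by the hugepage size. As a rough sketch of that calculation outside of awk's rule processing, assuming a meminfo-format input file, the hypothetical helper below (hugepages_upper_bound is not part of the original script) prints the same bound, rounded down to a whole number of pages.

```shell
#!/bin/bash
# Hypothetical helper: print the number of hugepages that would consume all
# physical memory, read from a meminfo-format file (default: /proc/meminfo).
# Falls back to a 2048 kB hugepage size when no Hugepagesize line is present,
# as the awk script does.
hugepages_upper_bound() {
  local meminfo=${1:-/proc/meminfo}
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "Hugepagesize:" { size = $2 }
    END {
      if (size == 0) size = 2048
      if (total == 0) exit 1
      printf "%d\n", total / size
    }' "$meminfo"
}
```

Running hugepages_upper_bound with no argument on a database server reads /proc/meminfo; any vm.nr_hugepages value at or above the printed number would be rejected by the awk check.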

Remove "fix_control=32" from dbfs mount options


Priority | Alert Level | Date | Owner | Status | Scope
Critical | None | 5/2/2013 | <Name> |  | Exadata
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version
11.2.3.2.1+ | All | X2-2(4170), X2-2, X2-8, X3-2, X3-8, SSC, X4-2 | All | Linux x86-64 UEK5.8, SPARC Solaris 11 |
Benefit / Impact:
DBFS is designed to use an asynchronous statfs call to obtain filesystem information. An extra mount option, "fix_control=32", was added under bug 13340960 to allow statfs to be done asynchronously because of a timeout issue. If patch 13340960 is already applied, it is recommended to remove "fix_control=32".
Bug 13340960 is fixed in 11.2.0.3 BP5 and higher.
Risk:
Changes the statfs behavior if mount option "fix_control=32" is not removed
Action / Repair:
1) Check on the Exadata compute node(s) whether DBFS is mounted with "fix_control=32":
On Linux:
# ps -ef | grep -E 'dbfs_client' | grep -E 'fix_control'
On Solaris:
# ps -ef | grep dbfs_client
# pargs <pid>     (using the pid of dbfs_client from the output above)

2) Check whether the patch for bug 13340960 is installed, or 11.2.0.3 BP5+ is applied, in the RDBMS Oracle home:
$RDBMS/OPatch/opatch lspatches
3) Make sure you are using the latest mount-dbfs.sh script from note: Configuring DBFS on Oracle Database Machine [ID 1054431.1]
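
The Linux check in step 1 can be wrapped in a small filter so that it can be reused in scripts. This is only a sketch: the function name dbfs_with_fix_control is hypothetical, and it assumes the mount option appears literally as "fix_control=32" on the dbfs_client command line.

```shell
#!/bin/bash
# Hypothetical helper: read process command lines on stdin and print those
# belonging to dbfs_client that still carry the fix_control=32 mount option.
dbfs_with_fix_control() {
  grep 'dbfs_client' | grep 'fix_control=32'
}

# Typical use on Linux (prints nothing when no affected mounts exist):
ps -eo args 2>/dev/null | dbfs_with_fix_control || true
```

An empty result suggests that no running dbfs_client process was started with the option.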

Set Linux kernel log buffer size to 1MB

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | WARN | 7/31/13 | <Name> |  | Exadata | 17250965
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | All | Linux |  |
Benefit / Impact:
Set the kernel command line parameter "log_buf_len=1m" in /boot/grub/grub.conf to increase the size of the kernel's internal log buffer. This will help ensure all messages from the kernel's boot sequence can be captured to /var/log/messages by syslogd/klogd.
This is primarily a concern only on larger servers like the Sun Server X2-8, where the large number of hardware components causes the kernel to produce a larger volume of messages than the internal log buffer can hold during the boot sequence.
Risk:
The default size of the kernel's internal log buffer is not large enough to hold all messages from the entire boot sequence on some large hardware models.
Without this change, some messages from the kernel's boot sequence may be lost before they can be captured to /var/log/messages, which may make it difficult to diagnose some system issues.
Action / Repair:
Edit /boot/grub/grub.conf and add "log_buf_len=1m" (excluding quotes) to each kernel command line entry, as in the following example:
title Oracle Linux Server (2.6.32-400.21.1.el5uek)
root (hd0,0)
kernel /vmlinuz-2.6.32-400.21.1.el5uek root=LABEL=DBSYS ro bootarea=dbsys loglevel=7 panic=60 debug rhgb console=ttyS0,115200n8 console=tty1 crashkernel=512M bootfrom=BOOT audit=1 processor.max_cstate=1 log_buf_len=1m
initrd /initrd-2.6.32-400.21.1.el5uek.img 
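
To find kernel entries that still lack the parameter, a check along the following lines may help. It is a sketch under the assumption that the file uses the GRUB legacy layout shown above; the function name kernel_lines_missing_log_buf_len is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print every "kernel" line in a GRUB legacy
# configuration file that does not contain a log_buf_len= parameter.
kernel_lines_missing_log_buf_len() {
  local grub_conf=${1:-/boot/grub/grub.conf}
  awk '$1 == "kernel" && $0 !~ /log_buf_len=/' "$grub_conf"
}
```

Calling kernel_lines_missing_log_buf_len with no argument reads /boot/grub/grub.conf; no output means every kernel entry already carries the parameter. After a reboot, /proc/cmdline shows the parameters the running kernel was actually booted with.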

Verify IP routing configuration on DB nodes

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | WARN | 05/31/17 | <Name> | Production | RA, Exadata - Physical, Exadata - Management Domain | ALL | Bug 26138002 - exachk; Related to: Bug 17723513
DB Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | N/A | Linux | exachk 12.2.0.1.4 | N/A
Benefit / Impact:
The default IP routing configuration on Exadata database nodes has changed over time so that the latest configuration works well in all environments, but due to a kernel bug in kernels pre-2.6.31, the older configurations only worked in some cases. Since the configurations aren't changed during Exadata software upgrades, legacy configurations should be updated to avoid issues during future upgrades.
Risk:
If the Linux routing configuration is not updated before the kernel is upgraded from pre-2.6.31 (Exadata pre-11.2.3.2.0) to Exadata software version 11.2.3.2.0 or later, it is likely that routing/network issues will surface following the upgrade. The required changes (or potential changes) are outlined in MOS note 1306154.1.
Action / Repair:
To verify whether the routing configuration requires updating, execute the following as any userid on a database server:
cd /etc/sysconfig/network-scripts
. ./network-functions
# find all the interfaces besides loopback.  ignore aliases, alternative configurations, and editor backup files
interfaces=$(ls ifcfg* | grep -v -e ifcfg-ib -e ifcfg-bondib | LANG=C sed -e "$__sed_discard_ignored_files" -e '/\(ifcfg-lo$\|:\|ifcfg-.*-range\)/d' -e '/ifcfg-[A-Za-z0-9#\._-]\+$/ { s/^ifcfg-//g;s/[0-9]/ &/}' | LANG=C sort -k 1,1 -k 2n | LANG=C sed 's/ //')

for i in $interfaces
  do
    unset SLAVE
    unset IPADDR
    unset NETWORK
    unset CNT
    unset NETMASK
    unset RNT
    unset IPV6ADDR

    . /etc/sysconfig/network-scripts/ifcfg-$i
    AGREE=`/bin/grep ^SLAVE= ifcfg-$i | /bin/cut -d= -f2`
  if [ "$AGREE" = "yes" ]
    then echo " NOTICE: Slave Interfaces ($i) do not have rule or route files"
    else
# IPv4 check
      if [ -z $IPADDR ]
        then echo " NOTICE: $i is not configured for IPv4"
        else
          if [ -z $NETWORK ]
            then NETWORK=`/bin/ipcalc $IPADDR $NETMASK -n | /bin/cut -d= -f2`
          fi
# check the rule file exists and has the two rules that apply (to and from)
          if [ ! -f rule-$i ]
            then echo "FAILURE: Need to create the rule configuration for rule-$i per 1306154.1"
            else
              CNT=`/sbin/ip rule list | /bin/grep -e $NETWORK -e $IPADDR -e GATEWAY | wc -l`
              if [ $CNT -lt 2 ]
                then echo "FAILURE: Need to update rule configuration for rule-$i per 1306154.1"
                else echo "   PASS: rule-$i is configured with rules."
              fi
          fi
# check the route file exists and have the proper route
          if [ ! -f route-$i ]
            then echo "FAILURE: Need to create the route configuration for route-$i per 1306154.1"
            else
              RNT=`/sbin/ip route list table all | /bin/grep $NETWORK | grep -v local | wc -l`
              if [ $RNT -lt 2 ]
                then echo "FAILURE: Need to update route configuration for route-$i per 1306154.1"
                else echo "   PASS: route-$i is configured with routes."
              fi
          fi
      fi
# IPv6 check
      if [ -z $IPV6ADDR ]
        then echo " NOTICE: $i is not configured for IPv6"
        else
          if [ -z $NETWORK ]
            then 
              NETWORK=`echo $IPV6ADDR | /bin/cut -d: -f1,2,3,4` 
              NETWORK=$NETWORK:
# check the rule file exists and has the two rules that apply (to and from)
              if [ ! -f rule6-$i ]
                then echo "FAILURE: Need to create the rule configuration for rule6-$i per 1306154.1"
                else
                  CNT=`/sbin/ip -6 rule list | /bin/grep $NETWORK | wc -l`
                  if [ $CNT -lt 2 ]
                    then echo "FAILURE: Need to update rule configuration for rule6-$i per 1306154.1"
                    else echo "   PASS: rule6-$i is configured with rules."
                  fi
              fi
# check the route file exists and have the proper route
              if [ ! -f route6-$i ]
                then echo "FAILURE: Need to create the route configuration for route6-$i per 1306154.1"
                else
                  RNT=`/sbin/ip -6 route list table all | /bin/grep $NETWORK | grep -v local | grep table | wc -l`
                  if [ $RNT -lt 2 ]
                    then echo "FAILURE: Need to update route configuration for route6-$i per 1306154.1"
                    else echo "   PASS: route6-$i is configured with routes."
                  fi
              fi
          fi
      fi
  fi
done

The expected result will be similar to:
   PASS: rule-bondeth0 is configured with rules.
   PASS: route-bondeth0 is configured with routes.
 NOTICE: bondeth0 is not configured for IPv6
   PASS: rule-eth0 is configured with rules.
   PASS: route-eth0 is configured with routes.
 NOTICE: eth0 is not configured for IPv6
 NOTICE: eth1 is not configured for IPv4
 NOTICE: eth1 is not configured for IPv6
 NOTICE: eth2 is not configured for IPv4
 NOTICE: eth2 is not configured for IPv6
 NOTICE: eth3 is not configured for IPv4
 NOTICE: eth3 is not configured for IPv6
 NOTICE: Slave Interfaces (eth4) do not have rule or route files
 NOTICE: Slave Interfaces (eth5) do not have rule or route files

Example of a "FAILURE" result:
   PASS: rule-bondeth0 is configured with rules.
FAILURE: Need to create the route configuration for route-bondeth0 per 1306154.1
 NOTICE: bondeth0 is not configured for IPv6
   PASS: rule-eth0 is configured with rules.
   PASS: route-eth0 is configured with routes.
 NOTICE: eth0 is not configured for IPv6
 NOTICE: eth1 is not configured for IPv4
 NOTICE: eth1 is not configured for IPv6
 NOTICE: eth2 is not configured for IPv4
 NOTICE: eth2 is not configured for IPv6
 NOTICE: eth3 is not configured for IPv4
 NOTICE: eth3 is not configured for IPv6
 NOTICE: Slave Interfaces (eth4) do not have rule or route files
 NOTICE: Slave Interfaces (eth5) do not have rule or route files


NOTE: If any "FAILURE:" results are returned, follow the guidance provided in the message.
 

Set SQLNET.EXPIRE_TIME=10 in DB Home

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | WARNING | 12/4/2013 | <Name> | Production | Exadata, SSC, Exalogic | 17159324
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8 | 11.2.3.3.0+ | Solaris - 11, Linux x86-64 UEK5.8 |  |
Benefit / Impact:
Setting this in DB Home will prevent a connection over SQL*Plus from timing out
Risk:
If this is not set, the SQL*Net connection held by RMAN can time out while the database is backed up over the HTTP protocol.
Action / Repair:
To verify that the parameter is set, look in ${ORACLE_HOME}/network/admin/sqlnet.ora.
The output should be similar to:
SQLNET.EXPIRE_TIME=10
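
The manual inspection can also be scripted. The following is a minimal sketch, assuming ORACLE_HOME is set for the home being checked; the function name expire_time_is_10 is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed when the given sqlnet.ora sets
# SQLNET.EXPIRE_TIME to 10 (case-insensitive name, optional spaces around "=").
expire_time_is_10() {
  local sqlnet_ora=$1
  grep -Eiq '^[[:space:]]*SQLNET\.EXPIRE_TIME[[:space:]]*=[[:space:]]*10[[:space:]]*$' "$sqlnet_ora"
}

# Report on the current home only when its sqlnet.ora is readable.
if [ -n "${ORACLE_HOME:-}" ] && [ -r "${ORACLE_HOME}/network/admin/sqlnet.ora" ]; then
  if expire_time_is_10 "${ORACLE_HOME}/network/admin/sqlnet.ora"; then
    echo "SUCCESS: SQLNET.EXPIRE_TIME=10 is set"
  else
    echo "FAILURE: SQLNET.EXPIRE_TIME=10 is not set"
  fi
fi
```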

Verify there are no .fuse_hidden files under the dbfs mount

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Important | N/A | 12/10/13 | <Name> | Production | Exadata |
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | All | OEL5 | exachk TBD |

Benefit / Impact:
Checking for .fuse_hidden files under the dbfs mount point positively identifies whether a recommended bug fix is needed. The impact of verifying the existence of these files is minimal.
This problem is specific to fuse on OEL5 (which uses a 2.7.4-based version).
Risk:
When a file is opened under the dbfs mount and later removed while a process still holds the file descriptor, the fuse library may not unlink it correctly, leaving .fuse_hidden files under the dbfs mount. These files can accumulate, causing slow performance for simple filesystem commands such as "ls". Also, the number of files can grow quite large, taking up unnecessary space.
Action / Repair:
It is recommended to perform these actions during your next planned maintenance window, as dbfs will need to be restarted.
These instructions are applicable to environments that configured DBFS using MOS note 1054431.1.
1) While dbfs is mounted, manually delete any existing .fuse_hidden files under the dbfs mount as the patch does not clear these.
2) Stop and unmount dbfs:
$GI/bin/crsctl stop res <dbfs_mount> 
3) Obtain and install the new fuse rpms related to bug:17401424 from Oracle's public Yum Server
4) Verify the new rpm is installed <fuse-libs-2.7.4-8.0.1.1.el5>:
# rpm -qa|grep fuse
fuse-devel-2.7.4-8.0.1.1.el5 
fuse-2.7.4-8.0.1.1.el5 
fuse-libs-2.7.4-8.0.1.1.el5 
5) Start and remount dbfs:
$GI/bin/crsctl start res <dbfs_mount> 
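
Before and after the procedure, the number of leftover files can be counted with a short helper such as the one below. It is a sketch only; the function name count_fuse_hidden is hypothetical, and the mount point to pass in is whatever <dbfs_mount> resolves to in your environment.

```shell
#!/bin/bash
# Hypothetical helper: count .fuse_hidden* files under a directory tree.
count_fuse_hidden() {
  local mount_point=$1
  find "$mount_point" -type f -name '.fuse_hidden*' 2>/dev/null | wc -l | tr -d ' '
}
```

A count of 0 after step 1 indicates that the existing files were removed; a count that stays at 0 after the new rpms are installed suggests that the fix is effective.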

Verify that the SDP over IB option "sdp_apm_enable(d)" is set to "0"

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 06/03/15 | <Name> | Production | Exadata-Physical, Exadata-Management Domain, Exadata-user Domain, SSC, Exalogic |
DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2 | <=11.2.3.3.0 -or- <=12.1.1.1.0 | Linux x86-64 el5uek, Linux x86-64 el6uek | exachk 12.1.0.2.4 |

Benefit / Impact:
The impact of verifying that the SDP over IB option "sdp_apm_enable" is set to "0" is minimal. To set the option, a reboot is recommended to make sure the configuration file syntax is correct.
Risk:
If the SDP over IB option "sdp_apm_enable" is not set to "0" on all Exadata database servers and clients that communicate with each other using SDP, either the client or database server side of the connection request will eventually hang.
NOTE: While the original issue was reported in environments where Exalogic application servers were accessing an Oracle Exadata Database Machine using SDP, ANY client requesting a connection using SDP with Automatic Path Migration (APM) enabled to an Oracle Exadata Database Machine will cause the connection to hang on the database server. exachk cannot tell from querying an Oracle Exadata Database Machine if there is, or ever will be, an end user application accessing the database servers via SDP. The Best Practice recommendation for stability is therefore to turn off APM on all Oracle Exadata Database Machines and any clients that may seek to establish an SDP connection with them.
Action / Repair:
To verify that the SDP over IB option "sdp_apm_enable" is set to "0" in the proper configuration file and the running kernel, execute the following command as the "root" userid on all database servers.
unset IB_SDP_OUTPUT_FILE;
unset IB_SDP_OUTPUT_KERNEL_RSLT;
unset IB_SDP_FILE;
unset KERNEL_TYPE;
MODULE=ib_sdp;
OPTION=sdp_apm_enable
if ! /sbin/lsmod | grep -q "^${MODULE}[[:space:]]"; then
        echo "Module ${MODULE} is not loaded, so ${OPTION} will not be checked";
else
  echo "Module ${MODULE} is loaded, so ${OPTION} will be checked";
  KERNEL_TYPE=$(uname -r | cut -d"." -f6);
  if [ $KERNEL_TYPE = "el5uek" ]
  then
   IB_SDP_FILE="/etc/modprobe.conf";
   elif [ $KERNEL_TYPE = "el6uek" ]
   then
   IB_SDP_FILE="/etc/modprobe.d/ib_sdp.conf";
   else
   echo -e "ERROR: unable to determine IB_SDP_FILE: $KERNEL_TYPE";
   fi;
   IB_SDP_OUTPUT_FILE=$(egrep "ib_sdp" $IB_SDP_FILE);
   if [ -s /sys/module/ib_sdp/parameters/sdp_apm_enable ]
  then
     IB_SDP_OUTPUT_KERNEL_RSLT=$(cat /sys/module/ib_sdp/parameters/sdp_apm_enable);
  else
     IB_SDP_OUTPUT_KERNEL_RSLT="/sys/module/ib_sdp/parameters/sdp_apm_enable not found";
   fi;
   if [[ `echo "$IB_SDP_OUTPUT_FILE" | egrep "sdp_apm_enable*.=0" | wc -l | sed -e 's/^[ \t]*//'` = 1 && `echo "$IB_SDP_OUTPUT_FILE" | wc -l | sed -e 's/^[ \t]*//'` = 1 ]]
    then
    IB_SDP_OUTPUT_FILE_RSLT=0;
   fi;
   if [[ "$IB_SDP_OUTPUT_FILE_RSLT" = 0 && "$IB_SDP_OUTPUT_KERNEL_RSLT" = 0 ]]
     then
      echo -e "SUCCESS: sdp_apm_enable is set to 0 in $IB_SDP_FILE and running kernel.";
      echo -e "$IB_SDP_FILE: $IB_SDP_OUTPUT_FILE";
      echo -e "Running Kernel: $IB_SDP_OUTPUT_KERNEL_RSLT";
   else
    echo -e "FAILURE: sdp_apm_enable should be set to 0 in $IB_SDP_FILE and running kernel.";
    echo -e "$IB_SDP_FILE: $IB_SDP_OUTPUT_FILE";
    echo -e "Running Kernel: $IB_SDP_OUTPUT_KERNEL_RSLT";
   fi;
fi;

The output should be similar to:
Module ib_sdp is not loaded, so sdp_apm_enable will not be checked
- OR -
Module ib_sdp is loaded, so sdp_apm_enable will be checked
SUCCESS: sdp_apm_enable is set to 0 in /etc/modprobe.conf and running kernel.
/etc/modprobe.conf: options ib_sdp sdp_zcopy_thresh=0 recv_poll=0 sdp_apm_enable=0
Running Kernel: 0
If the output is not as expected, investigate the configuration for root cause and make appropriate corrections.
NOTE: The 11.x and 12.x series are separate code lines, which is why there are two entries under "Exadata Version". Above the versions listed in "Exadata Version", APM is off by default in the Linux kernel, but it can still be manually activated.
NOTE: For additional guidance on configuring sdp_apm_enable, please see "SDP Connection in inter-connected Exalogic and Exadata stopped working (Doc ID 1588546.1)"


Verify /etc/oratab

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Important | WARN | 02/06/14 | <Name> | Production | Exadata |
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | ALL | Solaris - 11, Linux x86-64 UEK5.8 | exachk 2.2.5 |

Benefit / Impact:
Validating the oratab contents guards against invalid entries that make automation difficult.
Risk:
Stale or invalid entries in oratab take away the ability to automate tasks such as relinking of Oracle homes.
Action / Repair:
Verify that:

  • all directories point to real locations with the $ORACLE_HOME/bin/oracle binary in place
  • there is only one GI home
  • one and only one +ASM entry exists
  • the +ASM entry matches the GI home, which has the $ORACLE_HOME/bin/crsd.bin binary in place
A quick script with 5 basic checks is made available here. The script was written quickly and serves only as an example of what we are trying to accomplish.
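
As an illustration of the kind of checks listed above, the following sketch covers three of them (binary in place, exactly one +ASM entry, and the +ASM home containing crsd.bin). It is not the script referred to above; the function name check_oratab and its messages are hypothetical, and it assumes the usual SID:ORACLE_HOME:Y|N line format with comments starting in column 1.

```shell
#!/bin/bash
# Hypothetical sketch of some oratab checks. Reads an oratab-format file
# (SID:ORACLE_HOME:Y|N) and prints one FAILURE line per problem found.
check_oratab() {
  local oratab=${1:-/etc/oratab}
  local sid home flag asm_count=0 failures=0
  while IFS=: read -r sid home flag || [ -n "$sid" ]; do
    # Skip blank lines and comments
    case "$sid" in ''|\#*) continue ;; esac
    # Every home must exist and contain an executable oracle binary
    if [ ! -x "$home/bin/oracle" ]; then
      echo "FAILURE: $sid: no oracle binary under $home"
      failures=$((failures + 1))
    fi
    case "$sid" in
      +ASM*)
        asm_count=$((asm_count + 1))
        # The +ASM entry should point at the GI home
        if [ ! -e "$home/bin/crsd.bin" ]; then
          echo "FAILURE: $sid: $home has no bin/crsd.bin, so it does not look like a GI home"
          failures=$((failures + 1))
        fi
        ;;
    esac
  done < "$oratab"
  if [ "$asm_count" -ne 1 ]; then
    echo "FAILURE: expected exactly one +ASM entry, found $asm_count"
    failures=$((failures + 1))
  fi
  if [ "$failures" -eq 0 ]; then
    echo "SUCCESS: oratab entries look consistent"
    return 0
  fi
  return 1
}
```

Calling check_oratab with no argument reads /etc/oratab; a non-zero return status indicates that at least one entry needs attention.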

Verify consistent software and configuration across nodes

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Important | WARN | 02/6/2014 | <Name> | Production | Exadata | See bug list in linked section below.
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | All | All | exachk various |

Benefit / Impact:
Consistent software and configuration across nodes increases stability and performance, and facilitates problem diagnosis.
Risk:
Inconsistent software and configuration across nodes can cause crashes and performance degradation, and can make problem diagnosis difficult.
Action / Repair:
Recommended consistency checks are provided at the following location:
Exadata Best Practices Cross Node Consistency

Verify all database and storage servers time server configuration

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | CRITICAL | 05/01/19 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, Exadata - User Domain, SSC | ALL | 29605287 - exachk, 29031050 - exachk, 27262264 - exachk, 24696447 - exachk
DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux, Sparc | exachk 19.3.0 | N/A
Benefit / Impact:
Verifying all database and storage servers time server configurations are as expected can help avoid issues such as impaired performance or node eviction.
The impact of verifying all database and storage servers time server configuration is minimal. The impact of making corrections varies depending upon the root cause of the difference.
Risk:
Significant time drift on database and storage servers may cause unexpected storage server crashes or database server node evictions.
Action / Repair:
NOTE: This check will only pass if the following are all true on each database or storage server:
1) There are one or more time servers specified in the configuration file (/etc/chrony.conf or /etc/ntp.conf).
2) Each storage or database server is synched with one of the set of available time sources in the configuration file.
3) The maximum time drift for each storage or database server from the synched time source reported is less than or equal to 1 second.
To verify all database and storage servers time server configuration, run exachk and review the provided report.
The expected output in the exachk report should be as follows:
In the "Cluster Wide" section of the report, the overall result should be "PASS":
PASS   All database and storage servers time server configuration is as expected  Cluster Wide   View
In the "View" detail section of the report for this check the expected output should be similar to:
Status on Cluster Wide:
PASS => Time services are properly configured

DATA FROM RANDOM05ADM05 - VERIFY ALL DATABASE AND STORAGE SERVERS TIME SERVER CONFIGURATION 

SUCCESS: time services are properly configured.
In the "View" detail section of the report for this check a "FAILURE" example will be similar to:
FAILURE: time services are not properly configured.  Details:


randomadm05:    FAILURE:      server count:  1        synched server in conf:  1      timedrift: 2
randomceladm07: FAILURE:      server count:  0        synched server in conf:  1      timedrift: 0
randomceladm08: FAILURE:      server count:  1        synched server in conf:  0      timedrift: 0
NOTE: A "FAILURE" result prints the gathered data from the cluster to help identify the issue.
NOTE: This configuration failed because
1) randomadm05 timedrift is too high.
2) randomceladm07 has no servers defined in the configuration file.
3) randomceladm08 is not synchronized to a server defined in the configuration file.
If the result is not as expected, investigate for root cause and take appropriate corrective action.
NOTE: If, after corrective actions are completed, you wish to run this one check without a full exachk run, execute the following command as the "root" userid in the directory in which exachk was installed:
./exachk -check 85C96EAB566F8F13E053D498EB0AE6F1,85C9BA643125E253E053D598EB0A6D07,85CEDB9B0FBF1262E053D298EB0A29F9
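
The first pass condition (one or more time servers specified in the configuration file) can also be inspected by hand. The sketch below is not how exachk implements the check; it simply counts "server" and "pool" directives, and the function name count_time_servers is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count "server" and "pool" directives in a chrony.conf-
# or ntp.conf-style file, ignoring commented-out lines.
count_time_servers() {
  local conf=$1
  grep -Ec '^[[:space:]]*(server|pool)[[:space:]]+[^[:space:]]' "$conf"
}

# Report on whichever configuration file is present on this server.
for conf in /etc/chrony.conf /etc/ntp.conf; do
  if [ -r "$conf" ]; then
    echo "$conf: $(count_time_servers "$conf") time server(s) configured"
  fi
done
```

A count of 0 corresponds to the "server count: 0" case shown in the failure example above.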

Verify Sar files have read permissions for non-root user


Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 1/24/2013 | <Name> | Draft | Exadata |
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | N/A | Solaris - 11, Linux x86-64 UEK5.8 | exachkx |


Benefit / Impact:
Ability for non-root users including EM to monitor System Activity Report (sar) files.
Risk:
Inability for non-root users including EM to monitor sar files.
Action / Repair:
To verify that read permissions are set for the sar files, execute the script below.
##### begin script 
#!/bin/bash

if [ `stat -c %A /var/log/sa/sa* | awk 'END{print}' | sed 's/.......\(.\).\+/\1/'` != "r" ]
then
 echo "Sar files do not have the proper read permission set for non-root users. To correct, issue this command as root: chmod o+r /var/log/sa/* "
else
 echo "Sar file permissions are correct and no further action is needed."
fi
#### end script 
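
The script above inspects only the last file returned by the stat listing. As a complementary sketch, the hypothetical helper below (sar_files_not_world_readable is not part of the original script) lists every regular file in the sar directory that lacks read permission for other users.

```shell
#!/bin/bash
# Hypothetical helper: list regular files under a sar directory that are not
# readable by "other" users (permission bit 004 not set).
sar_files_not_world_readable() {
  local sa_dir=${1:-/var/log/sa}
  find "$sa_dir" -type f ! -perm -004 2>/dev/null
}
```

Calling sar_files_not_world_readable with no argument checks /var/log/sa; an empty result means every sar file is already readable by non-root users.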


Verify that the patch for bug 16618055 is applied


Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Important | Warn | 05/29/14 | <Name> | Production | Exadata |
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
>= 11.2.0.4 and < 11.2.0.4.8 | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | N/A | N/A | exachk 2.2.5 |


Benefit / Impact:
Applying the patch for bug 16618055 allows recovery to utilize ASYNC I/O, providing greater recovery performance and a shorter Recovery Time Objective.
The impact of verifying that the patch for bug 16618055 is applied is minimal. The impact of applying the patch for bug 16618055 varies by method.
Risk:
Without the patch for bug 16618055 applied, recovery uses SYNC I/O for all log and block read operations which causes slower recovery slave performance and a longer Recovery Time Objective.
Action / Repair:
To verify that the patch for bug 16618055 is applied, as the owner of each RDBMS home, with the environment properly configured, execute the following command for each RDBMS home:

$ORACLE_HOME/OPatch/opatch lsinventory -bugs_fixed|egrep -w '^16618055|^Bug|Patch'|grep -v Installer

The output should be similar to:

Bug Fixed by Installed at Description
 Patch
16618055 18642122 Fri Jun 13 11:32:22 PDT 2014 SLOW REDO APPLY ON EXADATA DUE TO SYNC IOS

If the appropriate patch is not already applied, the database software version is 11.2.0.4, and the Bundle Patch applied is lower than Bundle Patch 8, then you must apply the patch for bug 16618055 to the appropriate database home.

NOTE: For additional detail, please see My Oracle Support note "ASYNC IO In Exadata Not Working (Doc ID 1642088.1)".


Verify the Name Service Cache Daemon (NSCD) is Running


Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform
Critical | FAIL | 07/17/17 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, Exadata - User Domain | ALL
DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | N/A | N/A | ALL | Linux | exachk 12.2.0.1.4
Benefit / Impact:
Verifying the NSCD configuration ensures the correct configuration when providing cache for the most common name service requests, such as passwords, groups, and hosts.
The impact of verifying the NSCD configuration is minimal. While configuring and starting the NSCD can be done without a reboot, a reboot is recommended to prove the configuration is correct and survives a boot procedure.
NOTE: The recommended NSCD attribute values vary depending upon whether or not the System Security Services Daemon (SSSD) is also in use.
Risk:
When the NSCD and SSSD daemons are running together, an incorrect configuration could cause processes to use the incorrect cache service. Typical problems include CRS start failure due to an invalid password and new connections to the database suddenly failing with invalid password errors (ORA-1031, ORA-1017), among others.
Action / Repair:
To verify the NSCD is properly configured, as the root userid on each database server, execute the following code:
NSCD_SERVICE_DATA=$(service nscd status 2>&1)
SSSD_SERVICE_DATA=$(service sssd status 2>&1)
NSCD_AUTOSTART_DATA=$(chkconfig --list nscd 2>&1 | sed -e 's/  */ /g' -e 's/ *//')
NSCD_AUTOSTART_CONFIGURED=$(echo $NSCD_AUTOSTART_DATA |awk '{if ($0 ~ /3:on/ || $0 ~ /5:on/) {print "1";exit 1}else{print "0";exit 0}}')
if [ -r /etc/nscd.conf ]
then
  NSCD_FILE_DATA=$(egrep "enable-cache" /etc/nscd.conf | grep -v "#" | awk '{print $2 ": " $3}')
else
  NSCD_FILE_DATA=$(ls -l /etc/nscd.conf 2>&1)
fi
NSCD_MEMORY_DATA=$(for CACHE_NAME in passwd group hosts services netgroup; do echo -e "$CACHE_NAME: `nscd -g 2>/dev/null | egrep -w "$CACHE_NAME" -A3 | egrep "is enabled" | cut -dc -f1 | sed -e 's/  */ /g' -e 's/ *//'`"; done)
NSCD_SERVICE_STATUS=$(echo $NSCD_SERVICE_DATA | grep running | wc -l)
SSSD_SERVICE_STATUS=$(echo $SSSD_SERVICE_DATA | grep running | wc -l)
NSCD_FILE_DATA_SHORT=$(echo "$NSCD_FILE_DATA" | awk '{print $2}' | tr -d " \t\n\r")
NSCD_MEMORY_DATA_SHORT=$(echo "$NSCD_MEMORY_DATA" | awk '{print $2}' | tr -d " \t\n\r")
if [ $NSCD_FILE_DATA_SHORT = $NSCD_MEMORY_DATA_SHORT ] 2>/dev/null
then
  MEMORY_MATCHES_FILE=1
else
  MEMORY_MATCHES_FILE=0
fi
if [ $SSSD_SERVICE_STATUS -eq 0 ] # only NSCD
then
  if [ "$NSCD_FILE_DATA_SHORT" == "yesyesyesyesno" ]
  then
    NSCD_ATTRIBUTES_CORRECT=1
  else
    NSCD_ATTRIBUTES_CORRECT=0
  fi
else # NSCD and SSSD
  if [ "$NSCD_FILE_DATA_SHORT" == "yesnononono" ]
  then
    NSCD_ATTRIBUTES_CORRECT=1
  else
    NSCD_ATTRIBUTES_CORRECT=0
  fi
fi
if [ $NSCD_SERVICE_STATUS -eq 1 ] && [ $NSCD_AUTOSTART_CONFIGURED -eq 1 ] && [ $MEMORY_MATCHES_FILE -eq 1 ] && [ $NSCD_ATTRIBUTES_CORRECT -eq 1 ]
then
  echo -e "SUCCESS:  The Name Service Cache Daemon (NSCD) configuration is correct:\n"
  echo -e "NSCD service data:       $NSCD_SERVICE_DATA\n"
  echo -e "SSSD service data:       $SSSD_SERVICE_DATA\n"  
  echo -e "NSCD autostart data:     $NSCD_AUTOSTART_DATA\n"
  echo -e "NSCD file data:\n$NSCD_FILE_DATA\n"
  echo -e "NSCD memory data:\n$NSCD_MEMORY_DATA\n"
else
  echo -e "FAILURE:  The Name Service Cache Daemon (NSCD) configuration is not correct:\n"
  echo -e "NSCD service data:       $NSCD_SERVICE_DATA\n"
  echo -e "SSSD service data:       $SSSD_SERVICE_DATA\n"
  echo -e "NSCD autostart data:     $NSCD_AUTOSTART_DATA\n"
  echo -e "NSCD file data:\n$NSCD_FILE_DATA\n"
  echo -e "NSCD memory data:\n$NSCD_MEMORY_DATA\n"
fi
The expected output should be similar to:
SUCCESS:  The Name Service Cache Daemon (NSCD) configuration is correct:

NSCD service data:       nscd (pid 69150) is running...
SSSD service data:       sssd: unrecognized service
NSCD autostart data:     nscd  0:off   1:off   2:on    3:on    4:on    5:on    6:off

NSCD file data:
passwd: yes
group: yes
hosts: yes
services: yes
netgroup: no

NSCD memory data:
passwd: yes 
group: yes 
hosts: yes 
services: yes 
netgroup: no 

-- OR --
SUCCESS:  The Name Service Cache Daemon (NSCD) configuration is correct:

NSCD service data:       nscd (pid 69150) is running...

SSSD service data:       sssd (pid 91505) is running...

NSCD autostart data:     nscd   0:off   1:off   2:on    3:on    4:on    5:on    6:off

NSCD file data:
passwd: yes
group: no
hosts: no
services: no
netgroup: no

NSCD memory data:
passwd: yes 
group: no 
hosts: no
services: no
netgroup: no 
If the output is not as expected take the following actions as the root userid:

1) If the NSCD is not set for autostart, enable the NSCD to autostart on reboots:
chkconfig --level 35 nscd on
NOTE: The autostart levels vary by Exadata Storage Server Software version. At least levels 3 and 5 should be set.
2) The entries for the /etc/nscd.conf file depend upon whether or not SSSD is in use with NSCD. For NSCD without SSSD, the following entries should be present in the /etc/nscd.conf file:
        enable-cache            passwd          yes
        enable-cache            group           yes
        enable-cache            hosts           yes
        enable-cache            services        yes
        enable-cache            netgroup        no
For NSCD with SSSD, the following entries should be present in the /etc/nscd.conf file:
        enable-cache            passwd          yes
        enable-cache            group           no
        enable-cache            hosts           no
        enable-cache            services        no
        enable-cache            netgroup        no
If the values are not as expected, modify the /etc/nscd.conf file.

NOTE: the /etc/nscd.conf file can be edited with the "vi" editor.
NOTE: these attributes are spread throughout the /etc/nscd.conf file, at the head of other attributes that pertain to each cache. They are not grouped together. For example:
        enable-cache            services        yes
        positive-time-to-live   services        28800
        negative-time-to-live   services        20
        suggested-size          services        211
        check-files             services        yes
        persistent              services        yes
        shared                  services        yes
        max-db-size             services        33554432

3) It is recommended to reboot the database server to ensure that the configuration is correct and is persistent across the reboot process.

4) If a reboot is not immediately possible, as a workaround, the service may be started or restarted manually:
service nscd start
Starting nscd:                                             [  OK  ]
- OR -
service nscd restart
Stopping nscd:                                             [  OK  ]
Starting nscd:                                             [  OK  ]
For additional guidance on NSCD, please see:
Oracle® Grid Infrastructure Installation Guide 11g Release 2 (11.2) for Linux
Oracle® Grid Infrastructure Installation Guide 12c Release 1 (12.1) for Linux
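The enable-cache comparison performed by the check above can be isolated into a small helper when only the file contents need to be reviewed. The following is a minimal sketch, not part of the Oracle-provided check: the function names nscd_cache_flags and nscd_flags_ok are hypothetical, and the file path is passed as an argument so the helper can be pointed at a copy of /etc/nscd.conf. Unlike the check above, it reads the caches in a fixed order rather than in file order.

```shell
#!/bin/bash
# Hypothetical helper (not part of the Oracle-provided check): print the
# enable-cache settings of an nscd.conf-style file as one concatenated
# string, in the fixed order passwd, group, hosts, services, netgroup.
nscd_cache_flags() {
  local conf_file="$1" cache flags=""
  for cache in passwd group hosts services netgroup
  do
    # Commented lines have "#" in the first field and are skipped;
    # the last active setting for each cache wins.
    flags="${flags}$(awk -v c="$cache" '$1 == "enable-cache" && $2 == c {v = $3} END {print v}' "$conf_file")"
  done
  echo "$flags"
}

# Hypothetical helper: return 0 when the flags match the values recommended
# in this section. Pass "sssd" as the second argument if SSSD is also in use.
nscd_flags_ok() {
  local expected="yesyesyesyesno"
  if [ "$2" = "sssd" ]
  then
    expected="yesnononono"
  fi
  [ "$(nscd_cache_flags "$1")" = "$expected" ]
}
```

For example, "nscd_flags_ok /etc/nscd.conf sssd && echo OK" reports OK only when the file carries the NSCD with SSSD values listed above.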

Verify kernels and initrd in /boot/grub/grub.conf are available on the system

Priority: Critical
Alert Level: FAIL
Date: 8/29/14
Owner: <Name>
Status: Production
Scope: Exadata
Bug(s): TBD
DB Version: N/A
DB Role: N/A
Engineered System: X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2
Exadata Version: 11.2.2.2.0+
OS & Version: Linux x86-64
Validation Tool Version:

Benefit / Impact:
The impact of verifying that the kernel and initrd listed in grub.conf are actually available on the system is minimal. When a kernel or initrd file is unavailable, the user should either remove the corresponding entry from grub.conf (if possible) or install the appropriate files in the right location (recommended).
Risk:
If entries in grub.conf refer to kernel and initrd files not installed on the system, the next reboot may fail. The system will 'hang' in the bootloader.
Action / Repair:
To verify that the entries in grub.conf match what is installed, the following approach can be used, shown here in pseudocode:
for each 'title' in /boot/grub/grub.conf 
do
 get the value for 'kernel' without other arguments; check if the file is found on disk in /boot; raise an alert when not found 
 get the value for 'initrd' without other arguments; check if the file is found on disk in /boot; raise an alert when not found 
done 
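The pseudocode above could be turned into a shell function along the following lines. This is a minimal sketch under stated assumptions, not an Oracle-provided check: the function name check_grub_files is hypothetical, the grub.conf path and boot directory are passed as arguments, and files are matched by name only so that "/vmlinuz-x" and "/boot/vmlinuz-x" both resolve to the boot directory. Entries that load files through "module" lines are not covered.

```shell
#!/bin/bash
# Hypothetical sketch of the pseudocode above.
# $1 = grub.conf path (default /boot/grub/grub.conf)
# $2 = directory holding the boot files (default /boot)
check_grub_files() {
  local grub_conf="${1:-/boot/grub/grub.conf}" boot_dir="${2:-/boot}"
  local missing=0 path file
  # Take the first argument of every active "kernel" and "initrd" line.
  for path in $(awk '$1 == "kernel" || $1 == "initrd" {print $2}' "$grub_conf" | sort -u)
  do
    # Compare by file name only, dropping any leading directory or device prefix.
    file="${path##*/}"
    if [ ! -f "$boot_dir/$file" ]
    then
      echo "FAILURE: $path listed in $grub_conf was not found in $boot_dir"
      missing=1
    fi
  done
  if [ "$missing" -eq 0 ]
  then
    echo "SUCCESS: all kernel and initrd files listed in $grub_conf exist in $boot_dir"
  fi
  return "$missing"
}
```

Run as the root userid with no arguments, the function checks /boot/grub/grub.conf against /boot and returns a non-zero status if any referenced file is missing.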

Verify basic Logical Volume(LVM) system devices configuration

Priority: Critical
Alert Level: FAIL
Date: 12/09/15
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - Management Domain
Bug(s): TBD
DB Version: N/A
DB Role: N/A
Engineered System Platform: X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8
Exadata Version: 11.2.2.2.0+
OS & Version: Linux x86-64
Validation Tool Version: exachk 12.1.0.2.6

Benefit / Impact:
The impact of verifying that the basic Logical Volume(LVM) system devices configuration is correct is minimal. The impact of correcting any abnormalities depends upon the specific abnormality.
Risk:
If the basic Logical Volume(LVM) system devices configuration is not correct, there may be risk of patching interruption or unexpected downtime.
Action / Repair:
The basic Logical Volume(LVM) system devices configuration varies by Exadata software version level and hardware type. exachk runs the appropriate checks based on Exadata software version levels and hardware type. To validate the basic Logical Volume(LVM) system devices configuration, run exachk and review the provided report.
The expected output in the exachk report should be as follows:
In the "Findings Passed" summary section of the report, the overall result should be "PASS":
PASS OS Check Basic Logical Volume(LVM) system devices configuration meets recommendations. All Database Servers View
In the "View" detail section of the report for each individual database server:
(*) PASS: This is an LV (Logical Volume) enabled system
(*) PASS: LVDbSys1 should reside in Volume Group (VG) VGExaDb.
(*) PASS: LVDbSys2 should reside in Volume Group (VG) VGExaDb.
(*) PASS: Minimum number of LVDbSys LV's
(*) PASS: Maximum number of LVDbSys LV's
(*) PASS: LVDbSys LV minimum size of /dev/mapper/VGExaDb-LVDbSys2
(*) PASS: LVDbSys LV size
(*) PASS: LVDbSys inactive LV minimum size of /dev/mapper/VGExaDb-LVDbSys1
(*) PASS: Inactive LVDbSys LV's not mounted
(*) PASS: Enough free space found for snapshot
(*) PASS: No filesystem label issues for DBSYS
(*) PASS: No reclaimdisk issues found
(*) PASS: No active lvm snapshots found
If the items reported are not all "PASS", investigate the root cause and take appropriate corrective action.

Ensure db_unique_name is unique across the enterprise

Priority: Critical
Alert Level: FAIL
Date: 02/25/2015
Owner: <Name>
Status: Production
Scope: Exadata
Bug(s): TBD
DB Version: 11.2+
DB Role: All
Engineered System: X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2
Exadata Version: N/A
OS & Version: Solaris - 11, Linux x86-64 UEK5.8
Validation Tool Version: exachk 12.1.0.2.2

Benefit / Impact:
db_unique_name is used extensively in many Clusterware, RDBMS, and Exadata code layers. Uniqueness is enforced within clusters but not across clusters. Ensuring db_unique_name is unique across clusters, especially those that are sharing the same Exadata storage, ensures that all code layers that use it work properly.
Risk:
Having databases with the same db_unique_name across different Real Application Clusters that share the same Exadata storage causes unexpected behavior such as database isolation, crashes, or failures to start.
Action / Repair:
The following is an example of a sqlplus command checking whether db_unique_name has been explicitly set:
SQL> select isdefault from v$parameter where name ='db_unique_name';

ISDEFAULT
---------
FALSE
If the output is "FALSE", then someone has explicitly set db_unique_name and not let it default to the value of db_name.
If the output is "TRUE", then db_unique_name is set to its default value, i.e. the same as db_name.
Oracle recommends that db_unique_name is unique across a customer's Oracle enterprise. exachk running on a given Real Application Cluster cannot check all values across a customer's enterprise. This exachk check assumes that "FALSE" means specific care has been taken to ensure uniqueness across the customer's enterprise and is considered the "PASS" condition. "TRUE" is assumed to imply that enterprise uniqueness may not have been considered and is the "FAIL" condition.
NOTE: the corrective action is to ensure all databases have a unique name across the customer's Oracle enterprise, especially those accessing the same Exadata storage. If every database is confirmed to have a unique name without setting db_unique_name universally, then this exachk check may be disabled or ignored.
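Because exachk cannot see across clusters, confirming enterprise-wide uniqueness is left to the administrator. One possible approach is sketched below, assuming a hypothetical inventory file with one line per database in the form "<cluster_name> <db_unique_name>". The file format and the function name duplicate_db_unique_names are illustrative assumptions and not part of exachk.

```shell
#!/bin/bash
# Hypothetical inventory format (an assumption for illustration): one line
# per database, "<cluster_name> <db_unique_name>", gathered by the
# administrator from every cluster in the enterprise.
duplicate_db_unique_names() {
  # Print each db_unique_name that appears more than once across the
  # inventory, comparing names case-insensitively.
  awk 'NF >= 2 {print tolower($2)}' "$1" | sort | uniq -d
}
```

An empty result means no name is repeated in the inventory. Any name printed should be investigated, especially if the databases involved share the same Exadata storage.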

Verify average ping times to DNS nameserver

Priority: Critical
Alert Level: WARN
Date: 01/14/2015
Owner: <Name>
Status: Production
Scope: Exadata
Bug(s): TBD
DB Version: N/A
DB Role: N/A
Engineered System: X2-2(4170), X2-2, X2-8, X3-2, EIGHTH, X3-8, X4-2
Exadata Version: 11.2.3.2.0+
OS & Version: Solaris - 11, Linux x86-64 UEK5.8
Validation Tool Version: exachk 12.1.0.2.2

Benefit / Impact:
Secure Shell (SSH) remote login procedures require communication between the remote target device and the DNS nameserver. Minimal average ping times to the DNS nameserver improve SSH login times and help to avoid problems such as timeouts or failed connection attempts.
The impact of verifying average ping times to the DNS nameserver is minimal. The impact required to minimize average ping times to the DNS nameserver varies by configuration and cannot be estimated here.
Risk:
Long ping times between remote SSH targets and the active DNS server may cause remote login failures, performance issues, or dropped application connections.
Action / Repair:
To verify average ping times to DNS nameserver, enter the following command set as the "root" userid on each database server, storage server, and InfiniBand switch:
HOST_NAME=$(hostname);
if [ -s /usr/local/bin/version ]
then
 DNS_SERVER=$(grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' /etc/resolv.conf | head -1);
else
 DNS_SERVER=$(nslookup $HOST_NAME | head -1 | cut -d: -f2 | sed -e 's/^[ \t]*//');
fi;
OS_TYPE=$(uname);
if [ $OS_TYPE = "Linux" ]
then
 PING_COMM="ping -c10 $DNS_SERVER";
else
 PING_COMM="ping -s $DNS_SERVER 56 5";
fi;
AVG_PING_TIME=$($PING_COMM | egrep avg | cut -d"/" -f5);
TRNC_AVG_PING_TIME=$(echo $AVG_PING_TIME | cut -d"." -f1); 
if [ "$TRNC_AVG_PING_TIME" -le "3" ];
then
 echo -e "SUCCESS: Average ping times to DNS nameserver should not be negatively impacting SSH operations: $AVG_PING_TIME";
 echo -e "Active DNS Server IP: $DNS_SERVER\n";
else
 echo -e "WARNING: Average ping times to DNS nameserver MAY be negatively impacting SSH operations: $AVG_PING_TIME";
 echo -e "Active DNS Server IP: $DNS_SERVER\n";
fi;
The output should be similar to the following:
SUCCESS: Average ping times to DNS nameserver should not be negatively impacting SSH operations: 3.255
Active DNS Server IP: 111.222.333.444
If the result is a "WARNING", first repeat the command set several times at different intervals to determine if the results are consistent. The command set is one spot check for ten pings. The environment could normally have a short delay and an execution just happened to catch a period of poor response, or it could normally have a long delay and an execution just happened to catch a period of good response. If the results are consistent, determine the root cause and take appropriate corrective action.

NOTE: The result of this command set is a reflection of how DNS is implemented in the environment and not evidence in itself of a defect in the Oracle Exadata Database Machine.

NOTE: A "WARNING" result does not prove that a delay is causing SSH connectivity problems in the environment. A "WARNING" result should always be evaluated in conjunction with a review of SSH connectivity issues in the environment. If there are other SSH connectivity issues present, evaluate if reducing or stabilizing the average ping times to the DNS nameserver may correct the issues.

NOTE: As with many other network performance metrics, the average ping times to DNS nameserver should be "minimal". However, it is possible that any given environment may return a result that exceeds the threshold used in this command set, yet it is satisfactory given the overall environment characteristics and lack of other related problems. IF NO OTHER PROBLEMS related to DNS exist other than this command set returning a "WARNING", and the numbers reported are acceptable after a "baseline" for the given environment has been established by repeated sampling, then the documented procedures for bypassing this check in exachk may be implemented.

NOTE: Due to the differences in available commands for the InfiniBand switch, the command set assumes the first "nameserver" in /etc/resolv.conf is the "active" DNS server.

NOTE: The use of the Name Service Cache Daemon (NSCD) may also mitigate the effects of long average ping times to DNS nameserver. For more information see: Verify the Name Service Cache Daemon (NSCD) is Running
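The way the command set above derives and evaluates the average can be seen in isolation with two small helpers. This is a minimal sketch: the function names ping_avg_ms and ping_avg_ok are hypothetical, and the parsing assumes the Linux ping summary line of the form "rtt min/avg/max/mdev = a/b/c/d ms".

```shell
#!/bin/bash
# Hypothetical helper: extract the average round-trip time from ping output
# read on stdin. In the Linux summary line
#   rtt min/avg/max/mdev = 0.211/3.255/9.876/1.234 ms
# the average is the fifth "/"-separated field.
ping_avg_ms() {
  grep -E 'avg' | cut -d"/" -f5
}

# Hypothetical helper: return 0 when the average, truncated to whole
# milliseconds, is at or below the threshold (default 3, as in the
# command set above).
ping_avg_ok() {
  local avg="$1" limit="${2:-3}"
  [ "${avg%%.*}" -le "$limit" ]
}

# Usage, with DNS_SERVER set as in the command set above:
#   AVG=$(ping -c10 "$DNS_SERVER" | ping_avg_ms)
#   ping_avg_ok "$AVG" || echo "WARNING: average ping time is $AVG"
```

Because the value is truncated before the comparison, an average of 3.255 passes a threshold of 3, which matches the sample output shown above.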

Verify Running-config and Startup-config are the same on the Cisco switch

Priority: Medium
Alert Level: WARN
Date: 12/01/14
Owner: <Name>
Status: Production
Scope: Exadata, SSC, Exalogic
Bug(s): TBD
DB Version: N/A
DB Role: N/A
Engineered System: X2-2, X2-8, X3-2, X3-8, X4-2
Exadata Version: ALL
OS & Version: cat4500-IPBASEK9-M, Version 15.0(2)SG8
Validation Tool Version: N/A

Benefit / Impact:
To keep the switch running the same configuration after it reboots, it is a best practice to have the running-config the same as the startup-config.
Risk:
Potential management network issues if the startup-config contains pre-install defaults, or other customizations made by the Customer.
Action / Repair:
Compare the startup-config and the running-config. The simplest way to do this is to capture the output from the switch and run diff on the capture files.

Capture output of an ssh session, use the tee command to create the log file:
unixhost ~ > ssh admin@randomsw-adm0 2>&1 | tee /tmp/running.out
From that new connection, go into enable, set the terminal so it does not pause its output, and show the running configuration (not the "all")
randomsw-adm0> enable
randomsw-adm0# terminal length 0 <- this causes the output to not pause
randomsw-adm0# show running-config all
randomsw-adm0# exit
Now, do the same to check the startup configuration:
unixhost ~ >ssh admin@randomsw-adm0 2>&1 | tee /tmp/start.out
From that new connection, go into enable, set the terminal so it does not pause its output, and show the startup configuration (not the "all")
randomsw-adm0> enable
randomsw-adm0# terminal length 0 <- this causes the output to not pause
randomsw-adm0# show startup-config all
randomsw-adm0# exit
Modify the two files by removing the lines before the version number:
Version 15.0
and the last entry from the show command:
end
This modification will make the file format more suited for the diff command:
unixhost ~ > diff /tmp/start.out /tmp/running.out > /tmp/diff.out
The two files should have identical parameters. In my first attempt to validate the two configs, I saved running to startup and then output both. The diff was:
194c194
< spanning-tree uplinkfast max-update-rate 444318408
---
> spanning-tree uplinkfast max-update-rate 444318920
It seems that no matter how many times I copy running to startup, it still differs by those few bytes. Your results may be similar, so by examining the diff.out file you should be able to determine whether the differences are significant.
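The manual trimming of the two capture files can also be scripted. The following is a minimal sketch, assuming the captures were made with ssh and tee as shown above: the function name trim_switch_capture is hypothetical, and the sketch assumes the configuration body starts at a line beginning with "version" (matched case-insensitively) and finishes at a line containing only "end".

```shell
#!/bin/bash
# Hypothetical helper: trim a captured switch session log to the
# configuration body. Lines before the "version" line are dropped, as are
# the final "end" line and everything after it. Carriage returns left by
# the ssh session are removed first.
trim_switch_capture() {
  tr -d '\r' < "$1" | awk '
    !body && tolower($0) ~ /^version / { body = 1 }
    body && /^end[ \t]*$/              { exit }
    body                               { print }'
}

# Usage (file names follow the example above):
#   trim_switch_capture /tmp/start.out   > /tmp/start.trim
#   trim_switch_capture /tmp/running.out > /tmp/running.trim
#   diff /tmp/start.trim /tmp/running.trim > /tmp/diff.out
```

Trimming both files the same way keeps login banners and prompts out of the diff, so only configuration differences remain.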

To make running and startup the same, go into the switch and then into the enable mode:
randomsw-adm0> enable
randomsw-adm0# copy running-config startup-config all
Destination filename [startup-config]?
Compressed configuration from 75923 bytes to 22210 bytes[OK]
To protect this setup, you should also copy the new config to a backup on the switch itself and to an external tftp server:
randomsw-adm0# copy running-config bootflash:cisco4948-ip-confg-before
Destination filename [cisco4948-ip-confg-before]?

13815 bytes copied in 1.376 secs (10040 bytes/sec)
Now to the external tftp server:
randomsw-adm0#copy running-config tftp
Address or name of remote host []? random-tftp-1
Destination filename [randomsw-adm0-confg]? cisco4948-ip-confg-before
!!
13815 bytes copied in 1.564 secs (8833 bytes/sec)

Validate SSH is installed and configured on Cisco management switch

Priority
Alert Level
Date
Owner
Status
Scope
Bug(s)
N/A
FAIL
12/03/14
<Name>
Production
Exadata, SSC, Exalogic
N/A
DB Version
DB Role
Engineered System
Exadata Version
OS & Version
Validation Tool Version
TBD
N/A
N/A
X2-2, X2-8, X3-2, X3-8, X4-2
N/A
N/A
N/A

Benefit / Impact:
Telnet has no security and should be avoided. Early versions of the Cisco Internetwork Operating System (IOS) for the Catalyst 4948 only had telnet available. Note 1415044.1 describes how to get the version of the IOS and how to configure SSH. This check validates that SSH is enabled, and also shows how to configure it and how to restrict the number of simultaneous sessions into the switch.
Risk:
By using telnet, one risks a network sniffer obtaining the administrative and enable passwords. Once these passwords are obtained, it is trivial to breach the switch and disable administrative access to Exadata. Depending on how this switch is integrated into a Customer's network, infiltration into the Customer's network becomes a possibility.

The versions which do not contain SSH are those with only the IP Base image, which have IPBASE rather than IPBASEK9 in the image name. For instance, Cat4500-IPBASE-M will not have SSH while Cat4500-IPBASEK9-M will.
Action / Repair:
The following was done on this version of the Cisco IOS which contains SSH (as it is cat4500-IPBASEK9-M).

Cisco IOS Software, Catalyst 4500 L3 Switch Software (cat4500-IPBASEK9-M), Version 15.0(2)SG8, RELEASE SOFTWARE (fc2)
One must first start a session into the switch. Once there, go into "enable" mode. One will notice the prompt change from ">" to "#" to represent the enable session.

randomsw-adm0>enable
randomsw-adm0#
Find if SSH is enabled on the switch.

randomsw-adm0#show ip ssh
SSH Enabled - version 2.0
Authentication timeout: 60 secs; Authentication retries: 3
Validate the SSH configuration:

randomsw-adm0#show running-config all | include transport
 no destination transport-method http
 destination transport-method email
 transport preferred none
 transport preferred telnet
 transport input telnet
 transport output telnet
 transport preferred none
 transport input none
 transport output none
In this case SSH is not listed so it is not configured to be used. A configuration that passes looks like this:

randomsw-adm0#show running-config all | include transport
 no destination transport-method http
 destination transport-method email
 transport preferred none
 transport preferred ssh
 transport input ssh
 transport output ssh
 transport preferred none
 transport input none
 transport output none
Validate that the startup configuration is the same as the running.

randomsw-adm0#show startup-config | include transport
 no destination transport-method http
 destination transport-method email
 transport preferred none
 transport preferred ssh
 transport input ssh
 transport output ssh
 transport preferred none
 transport input none
 transport output none
In this case they match. If further validation is needed, one will have to capture the running configuration and the startup configuration and compare them.
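If the "include transport" output has been saved to a file, the pass and fail conditions described above can be checked mechanically. This is a minimal sketch: the function name check_transport_lines is hypothetical, and it treats any remaining telnet transport entry, or the absence of a "transport input ssh" entry, as a failure.

```shell
#!/bin/bash
# Hypothetical helper: read saved "show running-config all | include transport"
# output on stdin and report whether the transports match the passing
# configuration shown above.
check_transport_lines() {
  local config telnet_lines
  config=$(cat)
  # Any preferred/input/output transport line that still names telnet fails.
  telnet_lines=$(printf '%s\n' "$config" | grep -E 'transport (preferred|input|output) .*telnet' || true)
  if [ -n "$telnet_lines" ]
  then
    echo "FAILURE: telnet is still configured:"
    echo "$telnet_lines"
    return 1
  fi
  # At least one line must accept SSH for inbound sessions.
  if ! printf '%s\n' "$config" | grep -Eq 'transport input .*ssh'
  then
    echo "FAILURE: no \"transport input ssh\" entry found"
    return 1
  fi
  echo "SUCCESS: SSH is configured and no telnet transport entries were found"
}
```

The same function can be run against the startup configuration output to confirm that both configurations pass.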
If SSH is not enabled and there still are telnet entries in the above output, then the system needs to be configured for SSH. The first step is to discover how many simultaneous sessions are available.

randomsw-adm0#show line
Tty Typ Tx/Rx A Modem Roty AccO AccI Uses Noise Overruns Int
 0 CTY - - - - - 0 0 0/0 -
 1 VTY - - - - - 66 0 0/0 -
 2 VTY - - - - - 20 0 0/0 -
 3 VTY - - - - - 6 0 0/0 -
 4 VTY - - - - - 0 0 0/0 -
 5 VTY - - - - - 0 0 0/0 -
There can be up to 16 VTY lines in this version of the IOS, so the list you see might be longer. This will allow up to 16 telnet/SSH sessions in the switch at the same time. Normally this is not a good idea, so in this document we will assume only five total sessions are needed and will disable the rest. So below we will configure vty 1 up to vty 4. We will disable vty 5 through 16. The vty 0 is the serial port in the back of the switch.

randomsw-adm0#configure terminal
Enter configuration commands, one per line. End with CNTL/Z.

randomsw-adm0(config)#
randomsw-adm0(config)#line vty 1 4
randomsw-adm0(config-line)#transport preferred ssh
randomsw-adm0(config-line)#transport input none
randomsw-adm0(config-line)#transport input ssh
randomsw-adm0(config-line)#transport output none
randomsw-adm0(config-line)#transport output ssh
randomsw-adm0(config-line)#exit
randomsw-adm0(config)#line vty 5 16
randomsw-adm0(config-line)#transport preferred none
randomsw-adm0(config-line)#transport input none
randomsw-adm0(config-line)#transport output none
randomsw-adm0(config-line)#exit
randomsw-adm0(config)#exit

randomsw-adm0#show line vty 0 | include transport
 Allowed input transports are ssh.
 Allowed output transports are ssh.
 Preferred transport is ssh.
randomsw-adm0#show line vty 1 | include transport
 Allowed input transports are ssh.
 Allowed output transports are ssh.
 Preferred transport is ssh.
randomsw-adm0#show line vty 2 | include transport
 Allowed input transports are ssh.
 Allowed output transports are ssh.
 Preferred transport is ssh.
randomsw-adm0#show line vty 3 | include transport
 Allowed input transports are ssh.
 Allowed output transports are ssh.
 Preferred transport is ssh.
randomsw-adm0#show line vty 4 | include transport
 Allowed input transports are ssh.
 Allowed output transports are ssh.
 Preferred transport is ssh.
randomsw-adm0#show line vty 5 | include transport
 Allowed input transports are none.
 Allowed output transports are none.
 Preferred transport is none.
The rest of the "show line vty #" commands will show all transport options set to none. Because they are set to none, you will only be able to have up to five SSH sessions. You will also not be able to get a telnet session on any of the vty's. We will test this in later steps.
We now need to save the running configuration to the startup configuration so these changes persist across a reboot.

randomsw-adm0#copy running-config startup-config all
Destination filename [startup-config]?
randomsw-adm0#exit
Now that you have exited from the session to the switch, it is time to test that the configuration is really working. First try to telnet to the switch:

user@host ~ >telnet randomsw-adm0
Trying 111.222.333.444...
telnet: connect to address 111.222.333.444: Connection refused
telnet: Unable to connect to remote host: Connection refused
Now try SSH:

user@host ~ >ssh admin@randomsw-adm0
Password:
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Warning: No xauth data; using fake authentication data for X11 forwarding.
To test the simultaneous connection restriction, keep opening SSH sessions (without exiting from them) until you get a Connection refused error. Once you get that error, you have discovered the number of simultaneous SSH sessions that are possible. At this point, while keeping those SSH sessions open, telnet into the switch. If you do not get a Connection refused error, the switch is still open to telnet, so the configuration above needs to be troubleshot.

Verify Database Memory Allocation is not Greater than Physical Memory Installed on Database node

Priority: Warn
Alert Level: WARN
Date: 14/11/04
Owner: <Name>
Status: Production
Scope: Exadata
Bug(s): TBD
DB Version: ALL
DB Role: ALL
Engineered System: ALL
Exadata Version: ALL
OS & Version: ALL
Validation Tool Version:

Benefit / Impact:
Database memory allocation should never be greater than the physical memory installed on a database node. Over allocating memory can cause memory swapping which will negatively impact performance.
Risk:
Database performance can be significantly impacted by over allocating memory.
Action / Repair:
Generate a collection of all of the running databases in the environment. This must be done on a per-node basis as databases may not have instances running on all nodes. In a loop, connect to each database and query gv$parameter and ensure all database instances are using USE_LARGE_PAGES = ONLY.
If any instance does not have USE_LARGE_PAGES = ONLY set, FAIL with a message similar to the following and stop processing:
It is highly recommended that you use hugepages in the Linux environment (link to BP for USE_LARGE_PAGES). We have found at least one instance without USE_LARGE_PAGES = ONLY and thus cannot with absolute accuracy calculate actual memory utilization.
If all instances PASS the previous check, calculate PGA memory allocation in use by each database instance (this includes ASM and MGMTDB instances).
  • When accessing the ASM instance, at this time PGA_AGGREGATE_LIMIT is not used, so in all cases for ASM retrieve the PGA_AGGREGATE_TARGET
    SQL> select value*3 from v$parameter where name='pga_aggregate_target';
    VALUE*3
    ----------
    1258291200
  • If the database version is 12.1.0.1 or higher, retrieve the PGA_AGGREGATE_LIMIT and add it to the PGA total. Note that in 12c, PGA_AGGREGATE_LIMIT is derived from PGA_AGGREGATE_TARGET and defaults to the greater of 2 GB or 2 times the setting of PGA_AGGREGATE_TARGET.
    SQL> select value from v$parameter where name='pga_aggregate_limit';
    VALUE
    ----------
    3221225472
  • If the database version is earlier than 12.1.0.1, retrieve PGA_AGGREGATE_TARGET * 3 and add it to the PGA total. Note that PGA can actually consume memory up to 3 times the PGA_AGGREGATE_TARGET setting.
    SQL> select value*3 from v$parameter where name='pga_aggregate_target';
    VALUE*3
    ----------
    4831838208
  • Determine the amount of memory being used by HugePages
    $ cat /proc/meminfo|grep Huge
    HugePages_Total: 256000
    HugePages_Free: 234587
    HugePages_Rsvd: 67
    HugePages_Surp: 0
    Hugepagesize: 2048 kB

    Memory being used by HugePages is HugePages_Total * Hugepagesize

    $ bc -q
    2048*1024*256000
    536870912000
    quit
  • Determine the memory available on the node for PGA
    $ cat /proc/meminfo |grep MemTotal |awk '{print $2 * 1024}'
    1083965984768

    Subtract the memory allocated for HugePages (gathered above)

    $ bc -q
    1083965984768 - 536870912000
    547095072768
    quit
If the PGA database instance memory total is greater than the memory available on the node for PGA, provide a FAILURE message similar to: "Database PGA allocation of <PGA memory total> is greater than the memory available for PGA <memory available on the node for PGA> on this node. Please change memory allocations by reducing PGA_AGGREGATE_TARGET as appropriate in one or more databases until PGA memory allocation is less than memory available for PGA."

This last item should be scripted so that we can provide it as part of the best practices page for customers to run outside of exachk.
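The node-side arithmetic described above could be scripted along the following lines. This is a minimal sketch, not the exachk implementation: the function names are hypothetical, the PGA total is assumed to have been gathered from the database instances as described above and is passed in as a number of bytes, and the meminfo path is an argument so the calculation can be run against a saved copy of /proc/meminfo.

```shell
#!/bin/bash
# Hypothetical helpers: all values are in bytes. $1 (or $2 for
# check_pga_total) is a meminfo-format file, default /proc/meminfo.

# Memory reserved for HugePages = HugePages_Total * Hugepagesize.
hugepages_bytes() {
  local meminfo="${1:-/proc/meminfo}" total size_kb
  total=$(awk '/^HugePages_Total:/ {print $2}' "$meminfo")
  size_kb=$(awk '/^Hugepagesize:/ {print $2}' "$meminfo")
  echo $(( total * size_kb * 1024 ))
}

# Memory available for PGA = MemTotal - memory reserved for HugePages.
pga_available_bytes() {
  local meminfo="${1:-/proc/meminfo}" mem_kb
  mem_kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
  echo $(( mem_kb * 1024 - $(hugepages_bytes "$meminfo") ))
}

# Compare a PGA total ($1) with the memory available for PGA.
check_pga_total() {
  local pga_total="$1" available
  available=$(pga_available_bytes "$2")
  if [ "$pga_total" -gt "$available" ]
  then
    echo "FAILURE: Database PGA allocation of $pga_total is greater than the memory available for PGA $available on this node."
    return 1
  fi
  echo "SUCCESS: Database PGA allocation of $pga_total does not exceed the memory available for PGA $available on this node."
}
```

With the sample values shown above (MemTotal of 1083965984768 bytes and 256000 HugePages of 2048 kB), the memory available for PGA works out to 547095072768 bytes.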

Verify Cluster Verification Utility(CVU) Output Directory Contents Consume < 500MB of Disk Space

Priority: Critical
Alert Level: WARN
Date: 03/20/15
Owner: <Name>
Status: Production
Scope: Exadata, SSC
Bug(s): TBD
DB Version: 12.1.0.2+
DB Role: N/A
Engineered System: X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5
Exadata Version: 11.2.3.3.1+
OS & Version: Solaris - 11, Linux x86-64 UEK5.8
Validation Tool Version: N/A

Benefit / Impact:
Beginning with Oracle version 12.1.0.2, the CVU is configured by default to run and generate an XML output file every 6 hours (360 minutes). These files, and occasionally CVU command text output files, are stored in the output directory. If not monitored, the files in the CVU output directory could eventually exhaust the available disk space. Currently, there is no effective purging of these files, but this is expected to be addressed in a future release of CVU.
The benefit of verifying that the CVU output directory contents consume < 500MB of disk space is that an outage due to depleted disk space is avoided. The impact of the verification is small; the impact of reducing disk space consumption depends upon the chosen remediation strategy.
Risk:
Not verifying that the CVU output directory contents consume < 500MB of disk space increases the risk of a cluster instance crash or other failures related to a file system running out of space.
Action / Repair:
To verify that the CVU output directory contents consume < 500MB of disk space, as the RDBMS home owner, and with the environment properly set, execute the following command set on each database server:
DEFAULT_LOCATION="/u01/app/oracle/crsdata/@global/cvu/baseline/cvures"
if [ -r $DEFAULT_LOCATION ]
then
 CVU_SPACE_USED=$(du -sm $DEFAULT_LOCATION | awk '{ print $1}')
 if [ $CVU_SPACE_USED -le "500" ]
 then echo -e "SUCCESS: Automated CVU check output consumes <= 500MB of disk space: "$CVU_SPACE_USED"MB"
 else echo -e "WARNING: Automated CVU check output consumes > 500MB of disk space: "$CVU_SPACE_USED"MB"
 fi
else
 echo -e "WARNING: There seems to be some issue with $DEFAULT_LOCATION"
fi
The expected output should be similar to:
SUCCESS: Automated CVU check output consumes <= 500MB of disk space: 224MB
If the output is "WARNING", these are the recommended corrective options:
1) Manually purge the accumulated files from all database servers on a schedule that suits your retention and space usage requirements. Do not just delete all files.

2) Lengthen the interval at which the automated CVU check executes:

As the RDBMS home owner, with the environment properly set, and with CVU enabled and running, execute the following command set on a database server:
[oracle@randomadm03 ~]$ srvctl modify cvu -checkinterval 720
[oracle@randomadm03 ~]$ srvctl config cvu
CVU is configured to run once every 720 minutes
CVU is enabled.
CVU is individually enabled on nodes: 
CVU is individually disabled on nodes: 
NOTE: the "modify" command does not return any output confirmation. Follow up with the "config" command.
NOTE: The interval change takes effect without restarting the CVU.
NOTE: The CVU process only runs on one database server, but the files accumulate on all database servers.
For additional information see: "Oracle® Real Application Clusters Administration and Deployment Guide 12c Release 1 (12.1) E48838-10"
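For the manual purge option, it can help to review which files fall outside the retention window before removing anything. The following is a minimal sketch: the function name cvu_purge_candidates is hypothetical, and the 30 day default retention is an arbitrary illustrative value to be replaced by your own retention requirement. The function only lists files and deletes nothing.

```shell
#!/bin/bash
# Hypothetical helper: list files under the CVU output directory that were
# last modified more than the given number of days ago. Nothing is deleted.
# $1 = CVU output directory
# $2 = retention in days (30 is an arbitrary illustrative default)
cvu_purge_candidates() {
  local cvu_dir="$1" keep_days="${2:-30}"
  find "$cvu_dir" -type f -mtime +"$keep_days" -print
}

# Usage, with DEFAULT_LOCATION set as in the command set above:
#   cvu_purge_candidates "$DEFAULT_LOCATION" 30
```

Reviewing the list first keeps the most recent results in place, in line with the advice not to delete all files.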

Verify active system values match those defined in configuration file "cell.conf"

N/A
WARN
03/01/15
<Name>
Production
BDA, Exadata, Exalogic, Exalytics, SSC, ZDLRA

DB Version
DB Role
Engineered System Platform
Exadata Version
OS & Version
Validation Tool Version
TBD
N/A
N/A
X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8
11.2.2.4.2+
Linux x86-64
exachk 12.1.0.2.4

Benefit / Impact:
The impact of verifying that active system values match those defined in configuration file "cell.conf" is minimal.
Changing the options defined in configuration file "cell.conf" directly in the active kernel may impact availability.
Risk:
A run time kernel configuration that does not match the values defined in the configuration file "cell.conf" may result in an outage or unexpected issues during the next boot.
Action / Repair:
Note: Modifications to the Oracle Exadata Storage Server hardware or software are not supported. Only the documented network interfaces on the Oracle Exadata Storage Server should be used for all connectivity including management and storage traffic. Additional network interfaces should not be used.

NOTE: Always follow the recommended procedures to make changes on an Exadata system, and use a reboot to verify that the changes are persistent in order to avoid unexpected issues during a reboot.

NOTE: ipconf validation restarts the cellwall service, which resets the storage server to the default configuration. If manual changes have been made, even though such changes are not permitted, they will be lost when the cellwall service is restarted.

NOTE: The "ipconf" command performs a number of cross-checks. The length of time to execute varies by Exadata version, environment complexity, and system load. Newer versions of Exadata software have longer execution times due to more cross checking, as do more complex environments. Internal testing has taken up to 60 seconds. Please make sure the command is truly stuck before terminating.
To verify that active system values match those defined in configuration file "cell.conf", as the "root" userid execute the following command set only on each storage server:
IPCONF_RAW_OUTPUT=$(/opt/oracle.cellos/ipconf -verify -semantic -at-runtime -check-consistency -verbose 2>/dev/null);
IPCONF_RESULT=$(echo "$IPCONF_RAW_OUTPUT" | egrep "Consistency check PASSED" | wc -l);
IPCONF_SUMMARY=$(echo "$IPCONF_RAW_OUTPUT" | tail -1);
if [ $IPCONF_RESULT = "1" ]
  then
    echo -e "SUCCESS: $IPCONF_SUMMARY"
  else
    echo -e "FAILURE: $IPCONF_SUMMARY\n"
    echo -e "`echo -e "$IPCONF_RAW_OUTPUT" | grep FAILED`"
fi;
The expected output is:
SUCCESS: [Info]: Consistency check PASSED
If the result is not as expected, the detailed output data will be echoed back after the "FAILURE" message. For example:
FAILURE: Info. Consistency check FAILED

ILOM timezone 00:21:28:A5:1B:BC found in /usr/share/zoneinfo                                      : FAILED
ILOM timezone America/Denver matches 00:21:28:A5:1B:BC from Exadata configuration file            : FAILED
Info. Consistency check FAILED
Review the data and take corrective action based upon the specific configuration items that did not pass.
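
When the verbose output is long, it can help to reduce it to the names of the failed items only. The following is a sketch under the assumption that every failed item line ends in ": FAILED", as in the example above. The helper name list_failed_ipconf_items is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read captured ipconf output on stdin and print only the
# description of each item whose line ends in ": FAILED".
list_failed_ipconf_items ()
{
  sed -n 's/[[:space:]]*:[[:space:]]*FAILED[[:space:]]*$//p'
}

# Example, reusing the variable from the check above:
# echo "$IPCONF_RAW_OUTPUT" | list_failed_ipconf_items
```

The summary line "Consistency check FAILED" has no colon before "FAILED", so it is not repeated in the list.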

Verify that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED"

Priority
Alert Level
Date
Owner
Status
Engineered System
Bug(s)
Critical
WARN
06/09/15
      <Name>
Production
Exadata-User Domain, Exadata-Physical, SSC, Exalogic
DB Version
DB Role
Engineered System Platform
Exadata Version
OS & Version
Validation Tool Version
TBD
12.1.0.2+
CRS
X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2
11.2+
Solaris - 11
Linux x86-64 el5uek
Linux x86-64 el6uek
exachk 12.1.0.2.4

Benefit / Impact:
Verifying that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED" avoids node eviction and potential cluster crashes due to insufficient resources, and it helps avoid a possible denial of service attack.
The impact of verifying that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED" is minimal. The impact of correcting CRS_LIMIT_NPROC should include a restart of the clusterware to ensure the setting is as expected after a restart.
Risk:
Without verifying that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED" there is a risk of node eviction and potential cluster crashes due to insufficient resources, and a possible denial of service attack avenue.
Action / Repair:
To verify that CRS_LIMIT_NPROC is greater than 65535 and not "UNLIMITED", execute the following command set as the grid owner userid with the environment properly set on each of the database servers or each user domain of a virtualized environment:
unset CONFG_CRS_LIMIT_NPROC;
export MIN_VAL=65535;
CONFG_CRS_LIMIT_NPROC=$(grep -w CRS_LIMIT_NPROC $CRS_HOME/crs/install/s_crsconfig_`hostname -s`_env.txt|grep -v ^#|cut -d= -f2);
if [ "`echo $CONFG_CRS_LIMIT_NPROC | tr '[:upper:]' '[:lower:]'`" = "unlimited" ]
then
 echo "WARNING: CRS_LIMIT_NPROC should be set to a value greater than or equal to $MIN_VAL, but not \"UNLIMITED\": $CONFG_CRS_LIMIT_NPROC.";
elif [ "$CONFG_CRS_LIMIT_NPROC" -ge $MIN_VAL ] 2>/dev/null
then
 echo "SUCCESS: CRS_LIMIT_NPROC is set to a value greater than or equal to $MIN_VAL, but not \"UNLIMITED\": $CONFG_CRS_LIMIT_NPROC.";
else
 echo "FAILURE: CRS_LIMIT_NPROC is set to a value less than $MIN_VAL: $CONFG_CRS_LIMIT_NPROC.";
fi;
The expected output should be:
SUCCESS: CRS_LIMIT_NPROC is set to a value greater than or equal to 65535, but not "UNLIMITED": 65536.
Example of a FAILURE result:
FAILURE: CRS_LIMIT_NPROC is set to a value less than 65535: 16384.
Example of a WARNING result:
WARNING: CRS_LIMIT_NPROC should be set to a value greater than or equal to 65535, but not "UNLIMITED": UnliMITed.
If the result is not "SUCCESS", determine the root cause and correct the cause.

For example, to correct the "FAILURE" example provided, as the owner userid of the grid infrastructure on the database server or user domain that produced the failure, use the "vi" editor to edit the file $CRS_HOME/crs/install/s_crsconfig_`hostname -s`_env.txt and add this line:
CRS_LIMIT_NPROC=65535
as a minimum acceptable value. The limit name is typically in upper case. If thorough testing indicates a larger value should be used, the value can be set to any value within the recommended range. After you have closed the file and verified the value, restart the clusterware.
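
The decision rules of this check can be isolated in a small function, which also makes it possible to report a missing or non-numeric value explicitly. This is a sketch only. The helper name classify_crs_limit_nproc and its short messages are hypothetical, and the threshold is the 65535 used above.

```shell
#!/bin/bash
# Hypothetical helper: classify a CRS_LIMIT_NPROC value with the same rules as
# the check above, and report an empty or non-numeric value explicitly.
classify_crs_limit_nproc ()
{
  local value="$1"
  local min_val=65535
  local lower
  lower=$(echo "$value" | tr '[:upper:]' '[:lower:]')
  if [ -z "$value" ]; then
    echo "FAILURE: not set"
  elif [ "$lower" = "unlimited" ]; then
    echo "WARNING: unlimited"
  elif ! [[ "$value" =~ ^[0-9]+$ ]]; then
    echo "FAILURE: not numeric"
  elif [ "$value" -ge "$min_val" ]; then
    echo "SUCCESS: $value"
  else
    echo "FAILURE: $value"
  fi
}

# Example: classify_crs_limit_nproc "$CONFG_CRS_LIMIT_NPROC"
```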

Verify TCP Segmentation Offload (TSO) is set to off

Priority
Alert Level
Date
Owner
Status
Engineered System
Bug(s)
Critical
FAIL
05/21/15
<Name>
Production
Exadata-Physical, Exadata-User Domain, Exalogic

DB Role
Engineered System Platform
Exadata Version
OS & Version
Validation Tool Version
TBD
N/A
N/A
X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X5-2
12.1.2.1.0, 12.1.2.1.1
Linux x86-64 el6uek
exachk 12.1.0.2.4

Benefit / Impact:
The impact of verifying that the TSO option for IB and bonded IB interfaces is set to "off" is minimal. With the chosen implementation (updating a configuration file), a reboot is required to make the setting effective.
Risk:
If the TSO option is not set to "off" cluster node evictions can occur.
NOTE: Starting with 12.1.2.1.2, the TSO function is disabled by the kernel. This check does not apply to Exadata releases other than those mentioned.
Action / Repair:
To verify that the TSO option is set to "off" in the run time configuration, execute the following command set as the "root" userid on all database servers where the Exadata image version is >= 12.1.2.1.0 and <= 12.1.2.1.1 on Exadata physical and domU deployments (not dom0):
get_ib_interfaces ()
{
 local -i ret_val=0
 local interface_list=''
 if [ ! -e /opt/oracle.cellos/ORACLE_CELL_NODE ]; then
 ActiveInterfaces=$(/sbin/ip link show up | awk '/[\t ]+bondib/ {print $2}' | sed -e 's/:$//' | grep -v eth | sort)
 ActiveInterfaces1=$(/sbin/ip link show up | awk '/[\t ]+ib/ {print $2}' | sed -e 's/:$//' | grep -v eth | sort)
 ActiveInterfaces2=$(/sbin/ip link show up | awk '/[\t ]+bond/ {print $2}' | sed -e 's/:$//' | grep -v eth | sort)
 for Interface in ${ActiveInterfaces} ${ActiveInterfaces1} ${ActiveInterfaces2}; do
 interface_list="${Interface} ${interface_list}"
 done
 fi
 interface_list=`echo $interface_list| xargs -n1 | sort -u | xargs`
 echo "$interface_list"
}
gettso ()
{
 local tso=UNDEFINED
 local -i ret_val=0
 local interfaces=$(get_ib_interfaces | tail -1)
 # An empty list means there is nothing to check on this server.
 if [ -z "$interfaces" ]; then
  echo "`date '+%F %T %z'` [INFO] No ib interfaces need this work around."
  return $ret_val
 fi
 for Interface in $interfaces; do
  tso=$(/sbin/ethtool --show-offload $Interface | awk '(/tcp-segmentation-offload:/){print $NF}')
  if [ "$tso" == 'off' ]; then
   echo -e "SUCCESS: ${Interface}: tcp-segmentation-offload: set to off"
  else
   echo -e "FAILURE: ${Interface}: tcp-segmentation-offload: not set to off"
   ret_val=1
  fi
 done
 return $ret_val
}
 
gettso
The output should be similar to:
 
SUCCESS: bondib0: tcp-segmentation-offload: set to off
SUCCESS: ib0: tcp-segmentation-offload: set to off
SUCCESS: ib1: tcp-segmentation-offload: set to off 
- OR -
2015-05-27 10:49:01 -0500 [INFO] No ib interfaces need this work around.
If the output is not as expected, add the option ETHTOOL_OPTS="-K <ibdev> tso off" to the interface configuration files. Shut down the stack, then as root execute "ifdown <ibdev>" followed by "ifup <ibdev>" (where <ibdev> is ib0, ib1 or bondib0), and then restart the stack. For the majority of two socket database servers, these files are:

  • /etc/sysconfig/network-scripts/ifcfg-bondib0
  • /etc/sysconfig/network-scripts/ifcfg-ib0
  • /etc/sysconfig/network-scripts/ifcfg-ib1

NOTE: For older compute nodes, the file is: /etc/sysconfig/network-scripts/ifcfg-bond0
NOTE: Eight socket database servers may have additional bonded interfaces in use, with additional configuration files.
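
Before restarting the interfaces, the configuration files can be checked for the option described above. The following is a sketch that assumes the option is written on a single uncommented ETHTOOL_OPTS line. The helper name ifcfg_has_tso_off is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if an ifcfg-style file has an uncommented
# ETHTOOL_OPTS line of the form "-K <ibdev> tso off".
ifcfg_has_tso_off ()
{
  local file="$1"
  grep -Eq '^[[:space:]]*ETHTOOL_OPTS=.*-K[[:space:]]+[^[:space:]]+[[:space:]]+tso[[:space:]]+off' "$file"
}

# Example:
# for f in /etc/sysconfig/network-scripts/ifcfg-ib0 /etc/sysconfig/network-scripts/ifcfg-ib1; do
#   if ifcfg_has_tso_off "$f"; then echo "OK: $f"; else echo "MISSING: $f"; fi
# done
```

This only inspects the persistent configuration. The run time state is still what the gettso check reports.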

Check alerthistory for stateful alerts not cleared

Priority: Critical
Alert Level: FAIL
Date: 06/19/19
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - Management Domain
Engineered System Platform: ALL
Bug(s): 27848031 - exachk, 26651210 - exachk, 21299782 - exachk
DB/GI Version: N/A
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 19.3.0
MAA Scorecard Section: N/A
Benefit / Impact:
There are two types of alerts maintained in the alerthistory of a storage or database server, stateful and stateless.
A stateful alert is usually associated with a transient condition, often hardware related, and it will clear itself after that transient condition is corrected. These alerts age out of the alerthistory after 7 days (default time) once they are set to clear.
The benefit of checking for stateful alerts that have not been cleared is faster problem resolution. The impact of correcting any stateful alert that has not been cleared depends upon each individual alert.
Risk:
Failure to investigate a stateful alert that has not been cleared may result in significant impact, which varies by the particular alert.
Action / Repair:
To verify there are no stateful alerts that have not been cleared, as the root userid on each storage and database server execute the following commands:
unset IMAGE_VERSION
unset NODE_TYPE
unset COMMAND_NAME
unset NAME_ARRAY
unset INDIVIDUAL_NAME
unset SID
unset SEVERITY
unset MESSAGE
unset ACTION
unset OUTPUT_ARRAY
if [ $(egrep -i node.type /opt/oracle.cellos/cell.conf | grep -i db | wc -l) -eq 1 ]
  then NODE_TYPE=db
else
  NODE_TYPE=cell
fi
IMAGE_VERSION=$(imageinfo -version |tr -d '.'|cut -c1-6)
if [ $NODE_TYPE = "cell" ]
then
  COMMAND_NAME=cellcli
else
  if [ $IMAGE_VERSION -ge 121211 ]
    then COMMAND_NAME=dbmcli
  fi;
fi;
if [ -n "$COMMAND_NAME" ]
then
  NAME_ARRAY=$($COMMAND_NAME -e "list alerthistory attributes name where alerttype=stateful and endtime=null" | sed -e 's/^[ \t]*//');
  if [ -z "$NAME_ARRAY" ]
  then
    echo -e "SUCCESS: there are no stateful alerts that have not been cleared."
  else
    for INDIVIDUAL_NAME in $NAME_ARRAY
    do
      NAME_RECORD=$($COMMAND_NAME -e "list alerthistory attributes alertsequenceid,severity,alertMessage,alertAction where name=$INDIVIDUAL_NAME" | tr -s "\t")
      SID=$(echo "$NAME_RECORD" | cut -f2 | tr -s " " | sed -e 's/^[[:space:]]*//')
      SEVERITY=$(echo "$NAME_RECORD" | cut -f3 | tr -s " " | sed -e 's/^[[:space:]]*//')
      MESSAGE=$(echo "$NAME_RECORD" | cut -f4 | tr -s " " | sed -e 's/^[[:space:]]*//')
      ACTION=$(echo "$NAME_RECORD" | cut -f5 | tr -s " " | sed -e 's/^[[:space:]]*//')
      OUTPUT_ARRAY+=$(echo -e "\n";echo -e "SID:\t\t$SID";echo -e "NAME:\t\t$INDIVIDUAL_NAME";echo -e "SEVERITY:\t$SEVERITY";echo -e "MESSAGE:\t$MESSAGE";echo -e "ACTION:\t\t$ACTION")
    done
    echo -e -n "FAILURE: there are one or more stateful alerts that have not been cleared. Details:"
    echo -e "${OUTPUT_ARRAY[@]}"
  fi
else
  echo "alerthistory is not available on database servers at image versions below 12.1.2.1.1: $NODE_TYPE $IMAGE_VERSION"
fi
The output should be similar to:
SUCCESS: there are no stateful alerts that have not been cleared.
- OR -
alerthistory is not available on database servers at image versions below 12.1.2.1.1: db 112322
Example of a FAILURE result:
FAILURE: there are one or more stateful alerts that have not been cleared. Details:

SID:            1
NAME:           1_2
SEVERITY:       critical
MESSAGE:        A IO subsystem component is suspected of causing a fault with a 100% certainty. Component Name : /SYS/MB/RISER3/PCIE3 Fault class : fault.io.intel.iio.pcie-downstream-devices Fault message : http://www.sun.com/msg/SPX86-8003-QH
ACTION:         For additional information, please refer to http://www.sun.com/msg/SPX86-8003-QH This alert occurred while the Management Server was not available and is being sent out on restart of the Management Server. Note the event time may reflect the time when the alert was detected by the Management Server, not the time when the fault occurred. Diagnostic package is attached. It is also accessible at /opt/oracle/dbserver/dbms/deploy/log/scam07adm07_2014_08_11T17_40_33_1_2.tar.bz2

SID:            2
NAME:           2_1
SEVERITY:       critical
MESSAGE:        A processor component is suspected of causing a fault with a 100% certainty. Component Name : /SYS/MB/P0 Fault class : fault.cpu.intel.thermtrip Fault message : http://www.sun.com/msg/SPX86-8003-K5
ACTION:         For additional information, please refer to http://www.sun.com/msg/SPX86-8003-K5 This alert occurred while the Management Server was not available and is being sent out on restart of the Management Server. Note the event time may reflect the time when the alert was detected by the Management Server, not the time when the fault occurred.

If the output is not as expected, examine the full details for each alert that has not been cleared and follow the recommendations.
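
The check above selects cellcli or dbmcli by comparing the first six digits of the image version. An alternative sketch compares the dotted version component by component. It assumes "sort -V" from GNU coreutils is available on the host, and the helper names version_ge and pick_alert_cli are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if dotted version $1 is >= dotted version $2.
# Relies on "sort -V" (GNU coreutils), which is an assumption about the host.
version_ge ()
{
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: print the CLI to use for a node type and image version,
# or nothing when alerthistory is not available on that server.
pick_alert_cli ()
{
  local node_type="$1"
  local image_version="$2"
  if [ "$node_type" = "cell" ]; then
    echo cellcli
  elif version_ge "$image_version" "12.1.2.1.1"; then
    echo dbmcli
  fi
}

# Example: COMMAND_NAME=$(pick_alert_cli "$NODE_TYPE" "$(imageinfo -version)")
```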

Check alerthistory for non-test open stateless alerts


Priority: Critical
Alert Level: FAIL
Date: 06/19/19
Owner: Vern Wagman
Status: Production
Engineered System: Exadata - Physical, Exadata - Management Domain
Engineered System Platform: ALL
Bug(s): 27848031 - exachk, 26651210 - exachk, 21299794 - exachk
DB/GI Version: N/A
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 19.3.0
MAA Scorecard Section: N/A
Benefit / Impact:
There are two types of alerts maintained in the alerthistory of a storage or database server, stateful and stateless.
A stateless alert is not cleared automatically. It will not age out of the alerthistory until the alert is manually investigated and the "examinedby" field is manually set to a non-null value, typically the name of the person who reviewed the stateless alert and corrected or otherwise acted upon the information provided.
The benefit of checking for non-test open stateless alerts is faster problem resolution. The impact of correcting any stateless alert that has not been cleared depends upon each individual alert.
Risk:
Failure to investigate a stateless non-test alert that has not been cleared may result in significant impact, which varies by the particular alert.
Action / Repair:
To verify there are no non-test open stateless alerts, as the root userid on each storage and database server execute the following commands:
unset IMAGE_VERSION
unset NODE_TYPE
unset COMMAND_NAME
unset NAME_ARRAY
unset INDIVIDUAL_NAME
unset SID
unset SEVERITY
unset MESSAGE
unset ACTION
unset OUTPUT_ARRAY
if [ $(egrep -i node.type /opt/oracle.cellos/cell.conf | grep -i db | wc -l) -eq 1 ]
  then NODE_TYPE=db
else
  NODE_TYPE=cell
fi
IMAGE_VERSION=$(imageinfo -version |tr -d '.'|cut -c1-6)
if [ $NODE_TYPE = "cell" ]
then
  COMMAND_NAME=cellcli
else
  if [ $IMAGE_VERSION -ge 121211 ]
    then COMMAND_NAME=dbmcli
  fi;
fi;
if [ -n "$COMMAND_NAME" ]
then
  NAME_ARRAY=$($COMMAND_NAME -e list alerthistory attributes name where alerttype=stateless and examinedby=\'\' | grep -viw test | sed -e 's/^[ \t]*//' | cut -d" " -f1);
  if [ -z "$NAME_ARRAY" ]
  then
    echo -e "SUCCESS: there are no non-test open stateless alerts."
  else
    for INDIVIDUAL_NAME in $NAME_ARRAY
    do
      NAME_RECORD=$($COMMAND_NAME -e "list alerthistory attributes alertsequenceid,severity,alertMessage,alertAction where name=$INDIVIDUAL_NAME" | tr -s "\t")
      SID=$(echo "$NAME_RECORD" | cut -f2 | tr -s " " | sed -e 's/^[[:space:]]*//')
      SEVERITY=$(echo "$NAME_RECORD" | cut -f3 | tr -s " " | sed -e 's/^[[:space:]]*//')
      MESSAGE=$(echo "$NAME_RECORD" | cut -f4 | tr -s " " | sed -e 's/^[[:space:]]*//')
      ACTION=$(echo "$NAME_RECORD" | cut -f5 | tr -s " " | sed -e 's/^[[:space:]]*//')
      OUTPUT_ARRAY+=$(echo -e "\n";echo -e "SID:\t\t$SID";echo -e "NAME:\t\t$INDIVIDUAL_NAME";echo -e "SEVERITY:\t$SEVERITY";echo -e "MESSAGE:\t$MESSAGE";echo -e "ACTION:\t\t$ACTION")
    done
    echo -e -n "FAILURE: there are one or more non-test open stateless alerts that have not been cleared. Details:"
    echo -e "${OUTPUT_ARRAY[@]}"
  fi
else
  echo "alerthistory is not available on database servers at image versions below 12.1.2.1.1: $NODE_TYPE $IMAGE_VERSION"
fi
The output should be similar to:
SUCCESS: there are no non-test open stateless alerts.
- OR -
alerthistory is not available on database servers at image versions below 12.1.2.1.1: db 112322
If the output is not as expected, examine the full details for each name that has not been cleared and follow the recommendations.
Example of a FAILURE result:
FAILURE: there are one or more non-test open stateless alerts that have not been cleared. Details:

SID:            1
NAME:           1
SEVERITY:       critical
MESSAGE:        Critical interrupt detected: . Power cycle forced.
ACTION:         Informational. Diagnostic package is attached. It is also accessible at /opt/oracle/dbserver/dbms/deploy/log/slcc32adm05_2017_10_03T07_14_53_1.tar.bz2
When the underlying issue for a given name is resolved, manually set the "examinedby" field with a command similar to the following (command name is either cellcli or dbmcli, depending upon whether a storage or database server is involved):
CellCLI> alter alerthistory 1 examinedby="jdoe"
Alert 1 successfully altered
Where jdoe is the name of the person who verified the cause of the stateless alert no longer exists, and the number is the name of the stateless alert. Note that double quotes are used around the value to be set, but not the name of the stateless alert.
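
When several stateless alerts have been reviewed, the statements can be prepared in one pass and inspected before they are entered. This sketch only prints text in the form shown above and does not call cellcli or dbmcli. The helper name build_examinedby_statements is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print one "alter alerthistory" statement per alert name,
# for review before anything is run in cellcli or dbmcli.
build_examinedby_statements ()
{
  local reviewer="$1"
  shift
  local name
  for name in "$@"; do
    printf 'alter alerthistory %s examinedby="%s"\n' "$name" "$reviewer"
  done
}

# Example: build_examinedby_statements jdoe 1 3
```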


Verify clusterware state is "Normal"

Priority
Alert Level
Date
Owner
Status
Engineered System
Bug(s)
Critical
FAIL
07/29/15
<Name>
Production
Exadata-Physical,
Exadata-User Domain,
SSC, ZDLRA

DB Version
DB Role
Engineered System Platform
Exadata Version
OS & Version
Validation Tool Version
TBD
11.2.+, 12.1.+
ASM
X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X5-2
ALL
N/A
exachk 12.1.0.2.5

Benefit / Impact:

The impact of verifying that the clusterware state is "Normal" is minimal. The impact of returning the clusterware state to normal varies depending upon the clusterware state found, and the root cause that led to the found clusterware state.

    NOTE: The clusterware state, unless an upgrade or patching exercise is in progress, should always be "Normal".

Risk:

Outside of an active upgrade or patching exercise, having cluster nodes with clusterware states other than "Normal" can lead to problems with disk rebalances, dropping griddisks, and other maintenance operations.

    NOTE: The following operations cannot be performed while the clusterware is in some form of "Rolling" state:

        User invoked disk operations (ex: add, drop, replace, online, offline, undrop, resize, expel)
        Create/Drop Diskgroup
        Rebalance
        Voting File Creation/Deletion
        Advancing compatibility
        SP file parameter add/change/remove
        Create/Drop ADVM volume

    NOTE: Outside of an active upgrade or patching exercise, having different cluster nodes report a mix of states, particularly "In Rolling Patch" and "In Rolling Upgrade" is an indication of an incomplete or incorrect upgrade or patching exercise!

Action / Repair:

To verify the clusterware state, execute the following command set as the owner of the clusterware home with the environment properly set to access the ASM instance on each database server:

unset CLUSTER_STATE;
CLUSTER_STATE=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set head off lines 80 feedback off timing off serveroutput on
SELECT SYS_CONTEXT('SYS_CLUSTER_PROPERTIES', 'CLUSTER_STATE') FROM DUAL;
exit
EOF
);
if [ `echo $CLUSTER_STATE | wc -w` = 1 ]
then
  if [ $CLUSTER_STATE = "Normal" ]
    then
      echo -e SUCCESS: the clusterware state is: $CLUSTER_STATE;
    else
      echo -e FAILURE: the clusterware state is: $CLUSTER_STATE;
  fi;
else
  echo -e FAILURE: the clusterware state is: $CLUSTER_STATE;
fi;

The expected output should be:

SUCCESS: the clusterware state is: Normal

If the output is not as expected, investigate the root cause and correct the condition.
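
Because the check runs on each database server, the per-node results can be collected and compared to spot a mix of states. The following sketch assumes the results were gathered into "hostname:state" lines. That input format and the helper name report_cluster_states are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read "hostname:state" lines on stdin (a format assumed
# for this sketch) and report every node whose state is not "Normal".
report_cluster_states ()
{
  awk -F':' '
    $2 != "Normal" { bad = bad "\n" $1 ": " $2; n++ }
    END {
      if (NR == 0) { print "FAILURE: no input"; exit 1 }
      if (n > 0) { printf "FAILURE: %d node(s) not in Normal state:%s\n", n, bad; exit 1 }
      print "SUCCESS: all nodes report Normal"
    }'
}

# Example: printf 'db01:Normal\ndb02:In Rolling Patch\n' | report_cluster_states
```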

Verify the grid Infrastructure management database (MGMTDB) does not use hugepages
Priority
Alert Level
Date
Owner
Status
Engineered System
Bug(s)
Critical
FAIL
11/02/15
<Name>
Production
Exadata - Physical,
Exadata - User Domain,
SSC

DB Version
DB Role
Engineered System Platform
Exadata Version
OS & Version
Validation Tool Version
TBD
>= 12.1
MGMTDB
X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8
11.2+
Linux x86-64 el5uek
Linux x86-64 el6uek
exachk 12.1.0.2.6-ish
Benefit / Impact:
MGMTDB can start on any node within the cluster which makes the configuration and allocation of hugepages more difficult. Verifying that MGMTDB doesn't use hugepages helps to avoid instance start failures because not enough huge pages are available.
The impact of verifying MGMTDB does not use hugepages is minimal. Configuring MGMTDB to not use hugepages requires an instance restart.
Risk:
If MGMTDB is configured to use hugepages and it starts on a database server where MGMTDB's use of hugepages has not been considered, other database instances may fail to start because not enough hugepages are available, or MGMTDB itself may not acquire hugepages when it fails over to a different database server.
Action / Repair:
To verify MGMTDB does not use hugepages, as the root userid on the database server where MGMTDB is running, execute the following command set:
#!/bin/bash
# Main
v_pmon_pid=$(ps -ef | grep pmon | grep '\-MGMTDB' | awk ' { print $2 } ')
# If we have a value continue, else exit - MGMTDB may not be running here.
if [ "${v_pmon_pid}" != '' ]
then
  # Check value we found is a number
  expr ${v_pmon_pid} + 1 > /dev/null 2>&1
  if [ $? -eq 0 ]
  then
    v_hugep_count=$(grep -a -s huge /proc/${v_pmon_pid}/numa_maps 2>/dev/null | grep -a -s dirty | wc -l)
    if [ ${v_hugep_count} -gt 0 ]
    then
      v_logger_msg="MGMTDB should not be running with hugepages"
      echo -e "\nFAILURE: ${v_logger_msg}"
    else
      v_logger_msg="MGMTDB is not running with hugepages"
      echo -e "\nSUCCESS: ${v_logger_msg}"
    fi
  else
    v_logger_msg="Unable to find pmon pid for MGMTDB unable to detect if MGMTDB runs with hugepages or not"
    echo -e "\nFAILURE: ${v_logger_msg}"
  fi
fi
The expected output will be similar to:
SUCCESS: MGMTDB is not running with hugepages
If the output is 'FAILURE', execute the following steps to deconfigure hugepages for MGMTDB as the owner of the Grid Infrastructure, with ORACLE_HOME set to the grid home and ORACLE_SID set to -MGMTDB:
[oracle@dbm01 ~]$ sqlplus / as sysdba
SQL> alter system set use_large_pages=FALSE scope=spfile;
[oracle@dbm01 ~]$ srvctl stop mgmtdb -o immediate
[oracle@dbm01 ~]$ srvctl start mgmtdb
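
The detection step in the check above counts lines in numa_maps that mention both "huge" and "dirty". The same filter can be applied to a saved copy of that file, for example when comparing before and after the restart. This is a sketch. The helper name count_dirty_huge_mappings is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count lines in a numa_maps-style file that mention both
# "huge" and "dirty", mirroring the two grep filters used in the check above.
count_dirty_huge_mappings ()
{
  local file="$1"
  grep -a -s huge "$file" | grep -a -c dirty
}

# Example: count_dirty_huge_mappings /proc/<pid>/numa_maps
```

A count of 0 corresponds to the SUCCESS branch of the check.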

Verify the "localhost" alias is pingable

Priority
Alert Level
Date
Owner
Status
Engineered System
Bug(s)
Critical
FAIL
11/02/15
<Name>
Production
Exadata - Physical,
Exadata - User Domain,
Exadata - Management Domain,
SSC

DB Version
DB Role
Engineered System Platform
Exadata Version
OS & Version
Validation Tool Version
TBD
N/A
N/A
X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8
11.2+
Linux x86-64 el5uek
Linux x86-64 el6uek
Solaris 11
exachk 12.1.0.2.6-ish
Benefit / Impact:
Many scripts and programs, including patching utilities rely on the "localhost" alias. Verifying the "localhost" alias is pingable helps avoid operational issues or incorrect patch applications.
The impact of verifying the "localhost" alias is pingable is minimal. Changing the "localhost" alias definition does not require a reboot or network restart.
Risk:
If the "localhost" alias is not pingable operational issues or incorrect patch applications may result.
Action / Repair:
To verify the "localhost" alias is pingable, as the "root" userid on each storage server, database server, and InfiniBand switch, execute the following command set (IPv4 or IPv6 compatible):
#!/bin/bash
v_cmd[0]="ping -c1 localhost"
v_index=0
v_netw_ipv6=$(grep ^NETWORKING_IPV6 /etc/sysconfig/network | awk -F "=" ' { print $2 } ')
if [ "${v_netw_ipv6}" == "yes" ]
then
  # ipv6 detected also check for ip6-localhost
  v_cmd[1]="ping6 -c1 ip6-localhost"
fi
# Main
while [ $v_index -lt ${#v_cmd[*]} ]
do
  v_localhostname=$(echo ${v_cmd[$v_index]} | awk ' { print $3 } ')
  ${v_cmd[$v_index]} > /dev/null 2>&1
  if [ $? != 0 ]
  then
    v_logger_msg="${v_localhostname} is not pingable by name"
    echo -e "\nFAILURE: ${v_logger_msg}"
  else
    v_logger_msg="${v_localhostname} is pingable by name"
    echo -e "\nSUCCESS: ${v_logger_msg}"
  fi
  v_index=$((v_index+1))
done
The expected output should be similar to:
SUCCESS: localhost is pingable by name
- OR -
SUCCESS: ip6-localhost is pingable by name
If the output is 'FAILURE' then manually edit /etc/hosts and test to make sure the "localhost" alias definition is a valid entry.
IPv4 example:
127.0.0.1 localhost.localdomain localhost
IPv6 example:
127.0.0.1 localhost.localdomain localhost
::1 ip6-localhost.localdomain ip6-localhost
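
After editing, the entries can be confirmed without relying on ping. The following sketch looks for an uncommented line that maps a given address to a given name in a hosts-format file. The helper name hosts_maps_name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a hosts-format file maps the given address to
# the given name on an uncommented line.
hosts_maps_name ()
{
  local file="$1" addr="$2" name="$3"
  awk -v addr="$addr" -v name="$name" '
    { sub(/#.*/, "") }   # drop comments before looking at the fields
    $1 == addr { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
    END { exit found ? 0 : 1 }' "$file"
}

# Example:
# hosts_maps_name /etc/hosts 127.0.0.1 localhost && echo "localhost entry present"
# hosts_maps_name /etc/hosts ::1 ip6-localhost && echo "ip6-localhost entry present"
```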
Verify bundle patch version installed matches bundle patch version registered in database
Priority
Alert Level
Date
Owner
Status
Scope
Bug(s)
Critical
FAIL
11/04/15
<Name>
Production
Exadata, Exalogic, SSC

DB Version
DB Role
Engineered System
Exadata Version
OS & Version
Validation Tool Version
TBD
>= 12.1.0.2
ALL
X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X5-2, X5-8
11.2.x +
Linux, Solaris
exachk 12.1.0.2.6
Benefit / Impact:
Crosschecking the software bundle patch version installed with the bundle patch registered in the database to make sure they match ensures software correctness and stability. If a bundle patch is being installed in a Data Guard configuration in a standby-first manner where the SQL portion of the bundle patch is not installed inside the database until the primary and all standby software homes have the same version installed, then this crosscheck is expected to fail until both the binary and SQL portion of the bundle patch application is fully installed.
Risk:
Incomplete bug fixes, software instability, and unexpected behavior
Action / Repair:
To verify that the bundle patch version installed matches bundle patch version registered in database, as the oracle home owner for the primary database, and with ORACLE_SID and ORACLE_HOME properly set, execute the following command:
opatch_bp=$($ORACLE_HOME/OPatch/opatch lspatches 2>/dev/null|grep -iwv javavm|grep -wi database|head -1|awk -F';' '{print $1}');
database_bp_status=$(echo -e "set heading off feedback off timing off \n select ACTION, STATUS from (select * from dba_registry_sqlpatch where PATCH_ID = $opatch_bp order by action_time desc) where rownum=1;"|$ORACLE_HOME/bin/sqlplus -s " / as sysdba" | sed -e '/^ *$/d');
database_bp_status=$(echo $database_bp_status);
if [ "$database_bp_status" == "APPLY SUCCESS" ];
then
echo "SUCCESS: Bundle patch installed in the database matches the software home and is installed successfully.";
else
echo "FAILURE: Bundle patch installed in the database does not match the software home, or is installed with errors.";
fi;
The output should be similar to:
SUCCESS: Bundle patch installed in the database matches the software home and is installed successfully.
If FAILURE is reported, then investigate and correct the discrepancy.
NOTE: For versions less than 12.1.0.2, please see this archived best practice: Verify bundle patch version installed matches bundle patch version registered in database
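
The first step of the check extracts a patch id from the "opatch lspatches" listing, where each line has the form "id;description". The same filters can be wrapped in a function and tried against saved output. This is a sketch. The helper name first_database_patch_id is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "opatch lspatches"-style lines ("id;description")
# on stdin and print the id of the first database patch that is not JavaVM,
# using the same filters as the check above.
first_database_patch_id ()
{
  grep -iwv javavm | grep -wi database | head -1 | awk -F';' '{print $1}'
}

# Example: $ORACLE_HOME/OPatch/opatch lspatches 2>/dev/null | first_database_patch_id
```

An empty result means no matching patch line was found. In that case the query in the check would be built with an empty PATCH_ID, so it is worth confirming the value before interpreting a FAILURE.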
Verify database is not in DST upgrade state
Priority
Alert Level
Date
Owner
Status
Engineered System
Critical
FAIL
10/19/2015
<Name>
Review
Exadata - Physical,
Exadata - User Domain,
Exadata - Management Domain,
SSC, ZDLRA
DB Version
DB Role
Engineered System Platform
Exadata Version
OS & Version
Validation Tool Version
11.2+
All
X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X5-2, X5-8
11.2.x +
Linux x86-64 el5uek
Linux x86-64 el6uek
Solaris 11
exachk 12.1.0.2.6
Benefit / Impact:
When the DB timezone is in upgrade mode or inconsistent mode, I/Os issued from DB nodes to cell nodes will not go through smart scan and hence block I/O or passthru will take place instead. This results in cell nodes shipping all blocks rather than blocks of interest (filtered) to the database for qualified scans.
Risk:
Smart scan will be disabled or do passthru and can cause potential performance issues. If the I/O size is huge it might saturate the RDS traffic and impact the RDA service times along with database performance.
Action / Repair:
To check whether database DST_UPGRADE_STATE is set to anything other than the normal value NONE, as the owner of the oracle home for a given database and with the environment set to access that database, execute the following command set:
unset DST_UPGRADE_STATE_VALUE;
DST_UPGRADE_STATE_VALUE=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set head off lines 80 feedback off timing off serveroutput on
select upper(property_value) from sys.database_properties where property_name = 'DST_UPGRADE_STATE';
exit
EOF
);
if [ $DST_UPGRADE_STATE_VALUE = "NONE" ]
then
echo -e "SUCCESS: DB is not in DST upgrade state. \"DST_UPGRADE_STATE\" column value = "$DST_UPGRADE_STATE_VALUE""
else
echo -e "FAILURE: DB is in DST upgrade state. \"DST_UPGRADE_STATE\" column value = "$DST_UPGRADE_STATE_VALUE""
fi;
The expected output should be similar to:
SUCCESS: DB is not in DST upgrade state. "DST_UPGRADE_STATE" column value =  NONE
NOTE: Oracle recommends that database should not be in DST upgrade state under normal operations. Refer to MOS Doc ID 1583297.1 for fixing or closing the DST upgrade state. If DST_UPGRADE_STATE is UPGRADE, PREPARE or DATAPUMP then possibly a prepare or upgrade window or an on-demand or datapump-job loading of a secondary time zone data file is in an active state. A failed or terminated Datapump job can also cause DST_UPGRADE_STATE value to be Datapump(1) which should be fixed. This check could fail if there is an active Datapump job loading a secondary timezone file at the same time.
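The states named in the note can be mapped to short hints so that a FAILURE result is easier to triage. This is a sketch only. The helper name describe_dst_state and the wording of the hints are hypothetical, and the states handled are the ones listed in the note.

```shell
#!/bin/bash
# Hypothetical helper: map a DST_UPGRADE_STATE value to a short status line,
# based on the states named in the note above.
describe_dst_state ()
{
  local state
  state=$(echo "$1" | tr '[:lower:]' '[:upper:]' | tr -d '[:space:]')
  case "$state" in
    NONE) echo "SUCCESS: no DST upgrade in progress" ;;
    UPGRADE|PREPARE) echo "FAILURE: a DST $state window appears to be open" ;;
    DATAPUMP*) echo "FAILURE: a Datapump job is loading, or failed while loading, a secondary time zone file" ;;
    "") echo "FAILURE: no value returned" ;;
    *) echo "FAILURE: unexpected state $state" ;;
  esac
}

# Example: describe_dst_state "$DST_UPGRADE_STATE_VALUE"
```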
Verify there are no failed diskgroup rebalance operations
Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 09/16/15 | <Name> | Production | Exadata - Physical, Exadata - User Domain, SSC |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
11.2.0.3+ | ASM | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8 | ALL | Linux x86-64 el5uek, Linux x86-64 el6uek, Solaris 11 | exachk 12.1.0.2.6 |
Benefit / Impact:
Verifying there are no failed diskgroup rebalance operations helps to ensure that all diskgroups have the chosen redundancy. The impact of correcting any failed diskgroup rebalance operations depends upon the error responsible for the failure.
Risk:
A failed diskgroup rebalance operation could leave the diskgroup without the proper redundancy, exposing the diskgroup to a loss of data if another partner disk fails.
Action / Repair:
To verify there are no failed diskgroup rebalance operations, as the owner of the grid home and with the environment set to access one ASM instance, execute the following command set:
#!/bin/bash
unset REBALANCE_ERROR;
REBALANCE_ERROR=$($ORACLE_HOME/bin/sqlplus -s "/ as sysasm" <<EOF
set head off pagesize 0 timing off serveroutput on feedback off
select group_number,error_code from gv\$asm_operation where error_code is not null and upper(state) not in ('DONE','WAIT','RUN');
exit
EOF
);
if [ -z "$(echo $REBALANCE_ERROR | tr -d ' \t\n\r\f')" ]
then
echo -e "\nSUCCESS: There were no failed rebalance operations found.\n"
else
echo -e "\nFAILURE: Failed rebalance operations were found:\n"
echo -e "REBALANCE_ERROR:\n$REBALANCE_ERROR\n"
fi;
The output should be similar to:
SUCCESS: There were no failed rebalance operations found.
If the output is not "SUCCESS...", investigate the reported errors and correct appropriately.
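NOTE: When the check reports a failure, the raw query output can be hard to scan if several instances report the same operation. The following sketch, which is not part of the original check, groups the rows by diskgroup number. It assumes each row has the two columns selected above (group_number, then error_code) separated by whitespace; the function name summarize_rebalance_errors and the sample error codes in the usage are illustrative only.

```shell
#!/bin/bash
# Hypothetical helper: read "group_number error_code" rows on stdin and
# print one summary line per diskgroup number.
summarize_rebalance_errors() {
  awk 'NF >= 2 { count[$1]++; codes[$1] = codes[$1] " " $2 }
       END {
         for (g in count)
           printf "group %s: %d failed operation(s):%s\n", g, count[g], codes[g]
       }' | sort
}

# Example with made-up rows; in practice pipe "$REBALANCE_ERROR" into it.
printf '  1 ORA-15041\n  1 ORA-15041\n  2 ORA-15032\n' | summarize_rebalance_errors
```

With the check above, it could be used as: echo "$REBALANCE_ERROR" | summarize_rebalance_errors. Empty input produces no output.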
Verify the CRS_HOME is properly locked
Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | WARN | 11/10/15 | <Name> | Production | Exadata - Physical, Exadata - User Domain, ZDLRA |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
12.1+ | ASM | X2-2(4170), X2-2, X2-8, X3-2, X3-8, EIGHTH, X4-2, X4-8, X5-2, X5-8 | 11.2.2.2.0+ | Linux x86-64 | exachk 12.1.0.2.6 |
Benefit / Impact:
The CRS_HOME should be locked properly after patching.
Risk:
The CRS_HOME not being locked properly may result in permissions being wrongly set as well as files not being instantiated.
Action / Repair:
To verify the CRS_HOME is properly locked, as the "root" userid on each database server execute the following command set:
export CRS_HOME=$(awk -F: '/^\+ASM[0-9]/{printf "%s\n", $2}' /etc/oratab)
CRS_CHECK=$(stat -c %U $CRS_HOME);
if [[ $CRS_CHECK == "root" ]];
then echo -e "SUCCESS:CRS Home is locked.";
else echo -e "WARN:CRS Home is NOT locked."
fi;
The expected output should be:
SUCCESS:CRS Home is locked.
If the output is not "SUCCESS...", open an SR and work with Oracle Support to determine the root cause and proper corrective action.
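NOTE: The command set above assumes that /etc/oratab contains exactly one +ASM entry. If it contains none, CRS_HOME is empty and the stat call fails. The following is a minimal sketch, not part of the original check, that guards against that case before comparing the owner. The function name check_dir_owner and its return codes are assumptions for illustration; it relies on GNU stat (stat -c %U), as the original does.

```shell
#!/bin/bash
# Hypothetical helper: report whether a directory is owned by the expected user.
# Returns 0 on a match, 1 on a mismatch, 2 if the directory cannot be checked.
check_dir_owner() {
  local dir=$1 expected=$2 owner
  if [ -z "$dir" ] || [ ! -d "$dir" ]; then
    echo "WARN:cannot check '$dir' (not a directory)"
    return 2
  fi
  owner=$(stat -c %U "$dir")
  if [ "$owner" = "$expected" ]; then
    echo "SUCCESS:$dir is owned by $expected"
  else
    echo "WARN:$dir is owned by $owner, expected $expected"
    return 1
  fi
}

# Usage in the context of this check:
#   check_dir_owner "$CRS_HOME" root
```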

Verify storage server data (non-system) disks have no partitions
Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 01/27/2016 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, SSC, Exalogic, Exalytics, BDA, ZDLRA |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, EIGHTH, X3-8, X4-2, X5-2 | 11.2.3.2.0+ | Linux x86-64 | exachk 12.1.0.2.6 |
Benefit / Impact:
Verifying that storage server data (non-system) disks have no partitions helps avoid an outage or data loss.
The impact of verifying that storage server data (non-system) disks have no partitions is minimal. The impact of correcting storage server data (non-system) disks that have partitions varies according to the reason for the partitions and the state of the device, and cannot be estimated here.
Risk:
During a storage server reboot, partitions on storage server data (non-system) disks may no longer be visible to the operating system, and therefore become unusable.
Action / Repair:
To verify that storage server data (non-system) disks have no partitions, as the "root" userid, execute the following command set on each storage server:
unset report_command  
OSS_SCRIPTS_HOME=/opt/oracle/cell/cellsrv/deploy/scripts      
DISK_DEV="$OSS_SCRIPTS_HOME/unix/hwadapter/diskadp/get_disk_devices.pl"     
SYS_DISKS=`cellcli -x -e "list lun attributes deviceName where isSystemLun = TRUE"`     
SYS_DISK0=`echo $SYS_DISKS|cut -f1 -d' '`     
SYS_DISK1=`echo $SYS_DISKS|cut -f2 -d' '`     
     
if [ -z "$OSS_SCRIPTS_HOME" ]; then     
   report_command=$(echo "$report_command\nEnvironment variable OSS_SCRIPTS_HOME is not defined")     
   status=1     
fi     
     
if [ ! -f $DISK_DEV ]; then     
   report_command=$(echo "$report_command\nFile $DISK_DEV does not exist")
   status=1     
else     
    DATA_DISKS=`$DISK_DEV 2 |grep -v $SYS_DISK0 |grep -v $SYS_DISK1`     
     
    failDiskCount=0     
    for disk in $DATA_DISKS; do     
       size=${#disk}     
       if [ $size -eq 9 ]; then     
          disk="${disk%?}"     
       fi     
     
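       # Discard both stdout and stderr; only the exit status of parted is used.
       parted -s $disk print >/dev/null 2>&1

       if [ $? -eq 0 ]; then
          # Index by the running count ($i was never set in the original).
          disks[$failDiskCount]=$disk
          failDiskCount=`expr $failDiskCount + 1`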
       fi     
    done     
     
    if [ $failDiskCount -eq 0 ]; then     
      report_command=$(echo "$report_command\nAll data disks have no partitions")     
      status=0     
    else     
      report_command=$(echo "$report_command\nThe following disks have partitions:")     
      report_command=$(echo "$report_command\n ${disks[@]}")     
      report_command=$(echo "$report_command\nAssociated griddisks need to be removed from diskgroups")
      report_command=$(echo "$report_command\nRebalance should complete before replacing/reformatting this device.")     
      status=1     
    fi     
fi     
echo -e "$report_command"
The expected output should be:
All data disks have no partitions

If data disks with partitions are discovered, they will be echoed back. If the output is not as expected, investigate for root cause and take appropriate corrective action.
NOTE: For additional information, please see: Exadata: Problems introduced when replacing a physical disk having a foreign partition table (Doc ID 1965314.1).
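NOTE: The script above trims the last character from any 9-character device name before calling parted. This appears intended to turn a partition name such as /dev/sda3 into its parent device /dev/sda; that reading is an assumption. A length-based rule would also trim a whole-disk name that happens to be 9 characters long, such as /dev/sdaa. The following sketch shows a suffix-based alternative. It assumes /dev/sdX style names only (NVMe-style names contain digits and are not handled), and the function name strip_partition_suffix is illustrative.

```shell
#!/bin/bash
# Hypothetical helper: remove a trailing partition number from an sd-style
# device name, leaving whole-disk names untouched.
strip_partition_suffix() {
  printf '%s\n' "$1" | sed 's/[0-9][0-9]*$//'
}

strip_partition_suffix /dev/sda3   # prints: /dev/sda
strip_partition_suffix /dev/sdaa   # prints: /dev/sdaa
```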

Verify db_unique_name is used in I/O Resource Management (IORM) interdatabase plans
Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | WARN | 02/24/2016 | <Name> | Production | Exadata - Physical, Exadata - User Domain, SSC, Exalogic, Exalytics, BDA, ZDLRA |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
11g, 12c | Primary, Physical Standby | X2-2(4170), X2-2, X2-8, X3-2, EIGHTH, X3-8, X4-2, X5-2 | 11.2.3.2.0+ | Solaris - 11, Linux x86-64 | exachk 12.1.0.2.6 |
Benefit / Impact:
Starting with Oracle Exadata Storage Server software version 12.1.2.3.0, IORM will no longer support using "db_name" in the inter-database IORM plan directive if the directive does not contain the "role" attribute. Existing customers who may be using "db_name" need to be alerted to this change.
NOTE: even though the effective version level for this change is 12.1.2.3.0, this check should be performed on versions prior to that, and the situation resolved, to avoid any issues immediately after the upgrade.
Risk:
If the inter-database IORM plan is not updated to use "db_unique_name", IORM may not manage that database as defined in the plan since the mapping will not be correct. DB, PDB and CG metrics for that database will also be impacted.
Action / Repair:
To determine if an existing IORM interdatabase plan requires modification, repeat the following process for all databases:
As the "root" userid on one storage server accessed by the target database, check if an interdatabase plan has been configured. If the count is non-zero, an interdatabase plan has been configured.
cellcli -e "list iormplan attributes dbplan detail" | grep "name=" | wc -l
NOTE: If no IORM interdatabase plan is configured, no further checking is required.
As the database home owner userid, execute the following to determine if "db_name" is distinct from "db_unique_name":
$ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set head off lines 80 feedback off timing off serveroutput on
select VALUE from v\$parameter where name = 'db_name' and VALUE != (select VALUE from v\$parameter where name = 'db_unique_name');
exit
EOF

NOTE: if no rows are returned, "db_name" is not distinct and no further checking is required.
If an IORM interdatabase plan is configured and the "db_name" is distinct, as the "root" userid on one storage server accessed by the target database, execute the following (correctly substituting the target database db_name value) to query the IORM plan and check if it contains any directive using the "db_name" without the "role" attribute:
cellcli -e "list iormplan attributes dbplan detail" | grep -i "name=<target database db_name value>" | grep -v "role=" | wc -l
If the number of lines returned is non-zero, the interdatabase IORM plan directive needs to be updated to use the target database "db_unique_name" value.
NOTE: Also review "Ensure db_unique_name is unique across the enterprise".
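NOTE: The grep pipeline above also counts directives for any database whose name merely starts with the target db_name (for example, "sales" also matches "sales_dr"). The following is a minimal sketch, not part of the original check, that matches the name as a whole word. It reads dbplan text on standard input; the function name count_dbname_only_directives and the sample plan lines are assumptions for illustration and may not match the exact layout printed by cellcli.

```shell
#!/bin/bash
# Hypothetical helper: count dbplan directives that name the given database
# (whole-word, case-insensitive match) and have no "role=" attribute.
count_dbname_only_directives() {
  local db_name=$1
  grep -Fiw -- "name=${db_name}" | grep -vic -- "role="
}

# Made-up sample plan text for illustration only.
count_dbname_only_directives sales <<'EOF'
dbPlan: name=sales,share=8
        name=sales,share=4,role=standby
        name=sales_dr,share=2
        name=other,share=1
EOF
```

On a storage server the input could come from: cellcli -e "list iormplan attributes dbplan detail" | count_dbname_only_directives <target database db_name value>. As in the original, a non-zero count means a directive needs to be updated to use "db_unique_name".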

Verify Datafiles are Placed on Diskgroups consisting of griddisks with cachingPolicy = DEFAULT
Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | WARNING | 08/04/2015 | <Name> | Production | Exadata, AVM |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
11.2.x+ | N/A | V2, X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8 | 11.2.3.2+ | Linux x86-64 UEK5.8 | |
Benefit / Impact:
Datafiles should be placed in diskgroups consisting of griddisks with their cachingPolicy set to DEFAULT. The cachingPolicy attribute determines if flashcache is used for blocks stored on the griddisk. When cachingPolicy is set to DEFAULT, flashcache is used; when cachingPolicy is set to NONE, flashcache will not be used for any blocks stored on the griddisk. Per Oracle best practices, Exadata is configured with cachingPolicy set to NONE for griddisks in the RECO diskgroup and set to DEFAULT (to use flashcache) for the DATA diskgroup. Oracle does not recommend storing datafiles in the RECO diskgroup or any other diskgroup that has its cachingPolicy set to NONE.
Risk:
You will not get the benefit of flashcache and may see greater I/O and related waits and/or higher hard disk utilization than is expected.
Action / Repair:
First, determine if you have placed datafiles onto a diskgroup that has its cachingPolicy set to NONE. Do this by creating a small script as follows and set its execute permission; in this example the script is called "check_cp.sh" :
#!/bin/bash
#
# $1 = cell to check
#
CELL=$1
CELL_CP=$( ssh root@$CELL cellcli -e list griddisk attributes name, cachingpolicy,asmDiskGroupName where cachingpolicy=NONE  | awk '{print $3}'  | sort -u )
 
if [ -n "$CELL_CP" ]; then
 
   for i in $( echo -e $CELL_CP )
   do
      file_part=$( echo -e "+$i%" )
 
      RETVAL1=`sqlplus -silent / as sysdba <<EOF
 
      set linesize 250 pagesize 10000 feedback off heading off echo off show off verify off
      set serveroutput on
 
      var vDG varchar2(2000)
      begin
         :vDG := '$file_part';
      end;
      /
 
      select count(1) from v\\\$datafile where name like :vDG;
 
      exit;
EOF`
      RETVAL="$(echo $RETVAL1 |tr '\n' ' ')"
 
      if [ "$RETVAL" -gt "0" ]; then
         echo "FAIL : There are $RETVAL datafiles stored on griddisks in the $i diskgroup with cachingPolicy=none"
         exit 1
      else
         echo "SUCCESS : There are NO datafiles stored on griddisks with cachingPolicy=none "
      fi
   done
else
   echo "SUCCESS : There are NO griddisks with cachingPolicy=none "
fi
Set your shell environment to the ORACLE_HOME, ORACLE_SID, etc to allow sqlplus to log on and then run the script against a single cell by calling it like this:
$ ./check_cp.sh exacel01
The expected output should be similar to:
SUCCESS : There are NO datafiles stored on griddisks with cachingPolicy=none
If any cell has an output of "FAIL ...", the corrective action is to review which files are on the diskgroup reported by the script and ensure their placement in that diskgroup was intentional. The following query will show the specific datafiles:
select name from v$datafile where name like '<DISKGROUP LISTED IN THE COMMAND OUTPUT>%';
For example, if the command returned the following:
FAIL : There are 2  datafiles stored on griddisks in the RECOC1 diskgroup with cachingPolicy=none
the diskgroup to use in the query is +RECOC1, and the query would be:
select name from v$datafile where name like '+RECOC1%';
The script should be executed across all cells and repeated for each database instance you're interested in checking. If you have a list of cells stored in a file such as /home/oracle/cell_group, you can check all of the cells like this:
 for c in $( cat /home/oracle/cell_group );
do
     echo "Now checking cell $c ...";
     ./check_cp.sh $c;
done
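NOTE: The script above relies on the cell-side "where" clause and on the diskgroup name being the third column. The following is a minimal sketch, not part of the original check, that performs the same filtering locally on text that has already been collected. It assumes three whitespace-separated columns in the order name, cachingPolicy, asmDiskGroupName; the function name diskgroups_with_policy and the sample rows are illustrative.

```shell
#!/bin/bash
# Hypothetical helper: read "name cachingPolicy asmDiskGroupName" rows on stdin
# and print the distinct diskgroups whose griddisks have the given policy.
# Rows with fewer than three fields (griddisks not in a diskgroup) are skipped.
diskgroups_with_policy() {
  local policy=$1
  awk -v want="$policy" 'NF >= 3 && tolower($2) == tolower(want) { print $3 }' | sort -u
}

# Made-up sample rows for illustration only.
diskgroups_with_policy NONE <<'EOF'
DATAC1_CD_00_exacel01 default DATAC1
RECOC1_CD_00_exacel01 none    RECOC1
RECOC1_CD_01_exacel01 none    RECOC1
EOF
```

For the sample rows this prints RECOC1 once. The comparison is case-insensitive, so NONE and none are treated alike.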


Verify all datafiles are placed on griddisks that are cached on flash disks

Priority | Alert Level | Date | Owner | Status | Engineered System
Critical | WARNING | 02/18/2016 | <Name> | Production | Exadata - Physical, Exadata - User Domain, ZDLRA

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version
11.2.x+ | N/A | V2, X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2 | 11.2.3.2+ | Linux x86-64 | exachk 12.1.0.2.6
Benefit / Impact:
Datafiles should be placed in diskgroups consisting of griddisks whose cachedBy attribute is set to a list of flash disks. The cachedBy attribute determines if flashcache is used for blocks stored on the griddisk. When cachedBy is set to a list of flash disks, flashcache is used; when cachedBy is not set, flashcache will not be used for any blocks stored on the griddisk. Per Oracle best practices, Exadata is configured with cachedBy set to NULL for griddisks in the RECO diskgroup and set to the list of flash disks (to use flashcache) for the DATA diskgroup. Oracle does not recommend storing datafiles in the RECO diskgroup or any other diskgroup that has one or more of its griddisks with cachedBy unset.
Risk:
You will not get the benefit of flashcache and may see greater I/O and related waits and/or higher hard disk utilization than is expected.
Action / Repair:
First, determine if you have placed datafiles onto a diskgroup that has cachedBy unset. Do this by creating a small script as follows and set its execute permission; in this example the script is called "check_cby.sh" :
#!/bin/bash
#
# $1 = cell to check
#
CELL=$1
FLASH_MODE=$( ssh root@$CELL cellcli -e  'list cell attributes flashCacheMode' | grep -i -c writeback )
CELL_CBY=$( ssh root@$CELL cellcli -e 'list griddisk attributes name,cachedby,asmDiskGroupName where cachedby\=\"\" ' | awk '{print $2}'  | sort -u )

if [ -n "$CELL_CBY" ] && [ "$FLASH_MODE" -eq "1" ]; then

   for i in $( echo -e $CELL_CBY )
   do
      echo "Diskgroup ${i} has griddisks with unset CachedBy attributes....checking if any datafiles are present... "

      file_part=$( echo -e "+$i%" )

      RETVAL1=`sqlplus -silent / as sysdba <<EOF

      set linesize 250 pagesize 10000 feedback off heading off echo off show off verify off
      set serveroutput on

      var vDG varchar2(2000)
      begin
         :vDG := '$file_part';
      end;
      /

      select count(1) from v\\\$datafile where name like :vDG;

      exit;
EOF`
      RETVAL="$(echo $RETVAL1 |tr '\n' ' ')"


      if [ "$RETVAL" -gt "0" ]; then
         echo "FAIL : There are $RETVAL datafiles stored on griddisks in the ${i} diskgroup that are not cached by flash (have cachedBy attribute unset for at least one griddisk)"
      else
         echo "SUCCESS : There are NO datafiles stored on griddisks with cachedBy unset in the ${i} diskgroup "
      fi

   done
else
   if  [ "$FLASH_MODE" -eq "1" ]; then
       echo "SUCCESS :  There are NO datafiles stored on griddisks with cachedBy unset "
   else
       echo "SUCCESS :  Cell is in WRITETHROUGH flashcache mode - test does not apply."
   fi
fi

Set your shell environment to the ORACLE_HOME, ORACLE_SID, etc to allow sqlplus to log on and then run the script against a single cell by calling it like this:

$ ./check_cby.sh exacel01

The expected output when a cell is in WriteBack flashcache mode should be:

SUCCESS :  There are NO datafiles stored on griddisks with cachedBy unset 

The expected output when a cell is in WriteThrough flashcache mode should be:

SUCCESS :  Cell is in WRITETHROUGH flashcache mode - test does not apply.

If any cell has an output of "FAIL ...", the corrective action is to review which files are on the diskgroup reported by the script and ensure their placement in that diskgroup was intentional. The following query will show the specific datafiles:

select name from v$datafile where name like '<DISKGROUP LISTED IN THE COMMAND OUTPUT>%';

For example, if command returned the following:

FAIL : There are 3  datafiles stored on griddisks in the RECOC1 diskgroup that are not cached by flash (have cachedBy attribute unset for at least one griddisk)

the diskgroup to use in the query is +RECOC1, and the query would be:

select name from v$datafile where name like '+RECOC1%';

The script should be executed across all cells and repeated for each database instance you're interested in checking. If you have a list of cells stored in a file such as /home/oracle/cell_group, you can check all of the cells like this:

 for c in $( cat /home/oracle/cell_group ); do echo "Now checking cell $c ..."; ./check_cby.sh $c; done 
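NOTE: The one-line loop above prints every result but leaves it to the reader to spot a FAIL line. The following is a minimal sketch, not part of the original check, of a wrapper that runs a per-cell checker for every cell in a file and counts the cells that reported FAIL. The function name check_all_cells and its output format are assumptions for illustration. The list is read on file descriptor 3 because a checker that calls ssh may otherwise consume the rest of the list from standard input.

```shell
#!/bin/bash
# Hypothetical wrapper: check_all_cells <cell_list_file> <checker> [checker args...]
# Runs "<checker> [args...] <cell>" for each non-empty line of the file and
# returns non-zero if any run printed a line starting with FAIL.
check_all_cells() {
  local cell_file=$1; shift
  local failures=0 cell
  # Read the list on fd 3 so the checker cannot swallow the remaining names.
  while IFS= read -r cell <&3; do
    [ -z "$cell" ] && continue
    if "$@" "$cell" | grep -q '^FAIL'; then
      echo "FAIL reported for cell $cell"
      failures=$((failures + 1))
    fi
  done 3< "$cell_file"
  echo "cells_with_failures=$failures"
  [ "$failures" -eq 0 ]
}

# Usage in the context of this check:
#   check_all_cells /home/oracle/cell_group ./check_cby.sh
```

The same wrapper could be used with check_cp.sh, since both scripts take the cell name as their only argument and prefix problem lines with FAIL.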

Validate key sysctl.conf parameters on database servers
Priority | Alert Level | Date | Owner | Status | Engineered System
Critical | FAIL | 2/10/16 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, Exadata - User Domain

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8 | All | Linux |
Benefit / Impact:
Kernel parameter settings in /etc/sysctl.conf are applied to the kernel automatically at boot time and manually via the sysctl utility at runtime. The semantics of each kernel parameter are known only to the kernel, so the sysctl utility passes all values directly to the kernel with minimal processing and validation. Invalid values can be misinterpreted by the kernel, leading to unexpected results. For certain key parameters, such invalid values can have an immediate and critical impact on the system. Invalid values stored in /etc/sysctl.conf at boot time can prevent the system from booting, making it difficult to identify and correct the problem. Validating the format of some key parameters periodically or after changes to sysctl.conf can prevent unexpected outages due to human error.
Risk:
Applying improperly formatted or incorrect value settings to kernel parameters can render a system unusable.
Action / Repair:
Key sysctl.conf parameters on database servers vary by Exadata software version level, hardware type, and whether or not virtualization is used. exachk runs the appropriate checks based upon the discovered environment configuration. To validate key sysctl.conf parameters on database servers, run exachk and review the provided report.
The expected output in the exachk report should be as follows:

In the "Findings Passed" summary section of the report, the overall result should be "PASS":
PASS   OS Check   sysctl.conf parameters on database servers are configured as recommended   All Database Servers   View
In the "View" detail section of the report for each individual database server:
Status on randomadm01:
PASS => sysctl.conf parameters on database servers are configured as recommended 
DATA FROM RANDOMADM01 FOR VALIDATE KEY SYSCTL.CONF PARAMETERS ON DATABASE SERVERS  
All sysctl.conf formatting checks succeeded
If there are issues discovered, the overall result will be "FAIL" and more information will be listed in the "View" detail section. Investigate the reported issues for root cause and take appropriate corrective action.
NOTE: If after corrective actions are completed, you wish to run just this review manually without a full exachk run, as the "root" userid in the directory in which exachk was installed, execute the following:
./exachk -check 018D274D1212689AE05313C0E50AB893
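NOTE: exachk remains the supported way to validate these parameters. As a quick pre-check after hand-editing the file, the following minimal sketch flags lines that are not blank, not comments, and not of the form "key = value". It is an illustration only and does not reproduce the exachk checks: it looks at line format, not at whether a value is appropriate. The function name lint_sysctl_conf and the exact pattern are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: list lines of a sysctl.conf style file that are neither
# blank, nor comments (starting with # or ;), nor "key = value" with a value.
lint_sysctl_conf() {
  local file=$1 bad
  bad=$(grep -nEv \
    -e '^[[:space:]]*([#;].*)?$' \
    -e '^[[:space:]]*[A-Za-z0-9_./-]+[[:space:]]*=[[:space:]]*[^[:space:]]' \
    "$file")
  if [ -n "$bad" ]; then
    echo "FAILURE: malformed lines in $file:"
    echo "$bad"
    return 1
  fi
  echo "SUCCESS: all lines in $file look like key = value"
}

# Usage: lint_sysctl_conf /etc/sysctl.conf
```

Each reported line is prefixed with its line number, as printed by grep -n.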

Detect duplicate files in /etc/*init* directories

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | Warning | 04/06/16 | <Name> | Production | Exadata - Physical, Exadata - User Domain, Exadata - Management Domain |

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
n/a | n/a | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8 | All | Linux x86-64 | exachk 12.1.0.2.7 |
Benefit / Impact:
Administrators sometimes back up the contents of /etc/init before updating a database node.
Directories with names such as /etc/init122_old can then be created containing duplicate startup files, that is, files that already exist in /etc/init.
Making sure no duplicate startup files exist helps prevent boot failures.
The impact of verifying the /etc/*init* contents is minimal. The impact of correcting the duplicate contents is zero.
Risk:
At boot time the operating system traverses all directories in /etc whose names start with "init" to execute startup scripts. Duplicate files can cause startup scripts to be executed multiple times, which causes the boot process to fail.
Action / Repair:
Execute the following command as the "root" userid on all database servers:
v_dupe_cnt=$(find  /etc/*init* -type f -exec basename {} \;  | sort | uniq -c | grep -v  "^[ \t]*1 " | wc -l);
if [ $v_dupe_cnt -gt 0 ]
then 
  echo -e "FAILURE:  Duplicate content found in /etc/init* directories";  
else
  echo -e  "SUCCESS:  No duplicate content found in /etc/init* directories"; 
fi;
The expected output should be:
SUCCESS:  No duplicate content found in /etc/init* directories 
A "FAILURE" message would be as follows:
FAILURE: Duplicate content found in /etc/init* directories 
If output is a "FAILURE" message, run the following command to identify the duplicate files. Remove (or move) the duplicate files found in the /etc/*init* directories to another location (out of /etc):
find /etc/*init* -type f -exec basename {} \;  | sort | uniq -c | grep -v "^[ \t]*1 " 
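NOTE: The command above prints a count next to each duplicated name. The following is a minimal sketch, not part of the original check, that prints only the duplicated names and then the full paths for a given name, which can help when deciding which copy to move. The function names list_duplicate_basenames and locate_basename are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print each file name that occurs more than once
# under the given directories (one line per duplicated name).
list_duplicate_basenames() {
  find "$@" -type f -exec basename {} \; | sort | uniq -d
}

# Hypothetical helper: print every path under the given directories whose
# file name equals the first argument.
locate_basename() {
  local name=$1; shift
  find "$@" -type f -name "$name" | sort
}

# Usage in the context of this check:
#   list_duplicate_basenames /etc/*init*
#   locate_basename <duplicated file name> /etc/*init*
```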

Verify Database Server Quorum Disks configuration

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 05/29/19 | <Name> | Production | Exadata - Physical, Exadata - User Domain, SSC | ALL | 28496580 - exachk, 27274882 - exachk, 25306232 - exachk, 23065735 - exachk, 27067655 - OEDA

DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
12.1.0.2.160119 and UP | ASM | N/A | N/A | ALL | Linux, Sparc | exachk 19.3.0 | N/A
Benefit / Impact:
Configuring Quorum Disks for any High Redundancy diskgroup that uses fewer than five failgroups provides the following benefits:
  • When storing voting disks, it protects the Grid Infrastructure in the event of a double partner storage failure, or of an Exadata storage server being offline for planned maintenance followed by a partner storage failure.
  • Diskgroups that are expanded to use more failgroups and later shrunk to fewer than five failgroups avoid being dismounted during planned or unplanned maintenance. This is due to changes introduced in bug 26199003.
Risk:
  • Without this feature, voting files are stored in a normal redundancy diskgroup on Exadata racks with fewer than 5 storage servers, which makes the Grid Infrastructure vulnerable to a cluster outage if multiple voting disks are inaccessible.
  • Diskgroups used in a flex configuration (expanding/shrinking) are exposed to being dismounted during planned or unplanned maintenance.
Action / Repair:

NOTE: This check will only pass if the following are all true:
1) /opt/oracle.SupportTools/quorumdiskmgr exists on the db nodes
2) The GI BP version is above 12.1.0.2.160119
3) At least one HIGH redundancy diskgroup exists
4) Quorum disks on DB nodes are implemented when there are less than 5 storage cells in the high redundancy disk group.
5) All HIGH redundancy diskgroups contain quorum disks
6) If the number of cells is greater than or equal to 5, all the voting files are in the cells
NOTE WELL: For a complete picture, please also reference: Verify all voting disks are online
To verify the database server quorum disks configuration, run exachk and review the provided report.
The expected output in the exachk report should be as follows:
The overall result will be one of "PASS", "WARNING" or "FAIL".
In the "View" detail section of the report for this check the expected output should be similar to:

Voting File redundancy check Passed

##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   11ccca4125424fb1bfec2180a22e24cb (/dev/exadata_quorum/QD_DATAC1_SCAQAE05ADM01VM01) [DATAC1]
 2. ONLINE   5da7f33dc5f64f64bfb2b756787a6b48 (o/192.168.221.137;192.168.221.138/DATAC1_FD_05_scaqae05celadm03) [DATAC1]
 3. ONLINE   1eefa3ec1ebc4fd3bf8933ca0c587e13 (o/192.168.221.133;192.168.221.134/DATAC1_FD_04_scaqae05celadm01) [DATAC1]
 4. ONLINE   6d65ea6de3eb4fcebf3e7984d62d51b9 (/dev/exadata_quorum/QD_DATAC1_SCAQAE05ADM02VM01) [DATAC1]
 5. ONLINE   de0d94da4fc94f57bf2a12dbc46a3603 (o/192.168.221.135;192.168.221.136/DATAC1_FD_04_scaqae05celadm02) [DATAC1]
Located 5 voting disk(s).
In the "View" detail section of the report for this check a "WARNING" example will be similar to:

A database server quorum disk configuration is not applicable to this system because no high redundancy diskgroups were found.
High redundancy is a MAA best practice.  
For details, see http://www.oracle.com/technetwork/database/features/availability/exadata-maa-131903.pdf

##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   5da7f33dc5f64f64bfb43434787a6b48 (o/192.168.221.137;192.168.221.138/RECOC1_FD_05_scaqae05celadm03) [RECOC1]
 2. ONLINE   1eefa3ec1dffe4d3bf8933ca0c587e13 (o/192.168.221.133;192.168.221.134/RECOC1_FD_04_scaqae05celadm01) [RECOC1]
 3. ONLINE   de0d94da4fc94f52332dr2dbc46a3603 (o/192.168.221.135;192.168.221.136/RECOC1_FD_04_scaqae05celadm02) [RECOC1]
Located 3 voting disk(s).
In the "View" detail section of the report for this check a "FAILURE" example will be similar to:

A database server quorum disk configuration is applicable to this system.
But an optimal Quorum disk setup is not found as seen below.
An optimal quorum disk setup should include 2 quorum disks along with 5 voting files, with 2 of the voting files placed on the 2 quorum disks and the 3 remaining voting files on 3 different cells.

##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   5da7f33dc5f64f64bfb43434787a6b48 (o/192.168.221.137;192.168.221.138/RECOC1_FD_05_scaqae05celadm03) [RECOC1]
 2. ONLINE   1eefa3ec1dffe4d3bf8933ca0c587e13 (o/192.168.221.133;192.168.221.134/RECOC1_FD_04_scaqae05celadm01) [RECOC1]
 3. ONLINE   de0d94da4fc94f52332dr2dbc46a3603 (o/192.168.221.135;192.168.221.136/RECOC1_FD_04_scaqae05celadm02) [RECOC1]
Located 3 voting disk(s).
If the result is a "FAILURE..." message, follow the steps to add database server quorum disks in the "Adding Quorum Disks to Database Servers" section of the "Oracle® Exadata Database Machine Maintenance Guide".
NOTE: If after corrective actions are completed, you wish to run this one check without a full exachk run execute the following command as the "root" userid in the directory in which exachk was installed:
./exachk -check 339FE456FBDC3549E0530D98EB0AD21F
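NOTE: The voting disk listings shown above can also be summarized by hand. The following is a minimal sketch, not the logic used by exachk, that counts online voting files on quorum disks and the number of distinct storage cells holding online voting files, and compares the result with the optimal layout described above (2 quorum disks plus 3 voting files on 3 different cells). It assumes the listing format shown in this section, where quorum disk paths begin with /dev/exadata_quorum/ and cell paths begin with o/<cell addresses>/. The function name summarize_votedisks is illustrative, and the OPTIMAL label only applies to systems with fewer than 5 storage cells.

```shell
#!/bin/bash
# Hypothetical helper: read a voting disk listing on stdin and summarize it.
summarize_votedisks() {
  awk '
    # Online voting file on a database server quorum disk.
    $2 == "ONLINE" && index($0, "(/dev/exadata_quorum/") { quorum++ }
    # Online voting file on a storage cell; key on the "(o/<addresses>/" prefix.
    $2 == "ONLINE" && match($0, "\\(o/[^/]+/") { cells[substr($0, RSTART, RLENGTH)] = 1 }
    END {
      for (c in cells) ncells++
      printf "online_quorum=%d online_cells=%d\n", quorum, ncells
      print ((quorum == 2 && ncells == 3) ? "OPTIMAL" : "NOT_OPTIMAL")
    }'
}

# Usage: paste or pipe a voting disk listing into the function, for example
#   summarize_votedisks < votedisk_listing.txt
```

For the "Voting File redundancy check Passed" listing shown above, this prints online_quorum=2 online_cells=3 followed by OPTIMAL.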
Verify Oracle Clusterware files are placed appropriately
Priority | Alert Level | Date | Owner | Status | Engineered System
Critical | Fail | 05/25/16 | <Name> | Development | Exadata - Physical, Exadata - User Domain

DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version
Any supported version | GRID | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8 | Any supported version | Linux x86-64 | exachk 12.1.0.2.7

Benefit / Impact:
Oracle Clusterware files should always be placed in a high redundancy diskgroup, with the exception of voting files in the following cases.
i) For environments with fewer than 5 storage cells running any Exadata software release prior to 12.1.2.3.0, the voting files need to be placed in a normal redundancy diskgroup.
ii) For environments with fewer than 5 storage cells, running Exadata software release 12.1.2.3.0 or above and any Oracle Grid Infrastructure version prior to 12.1.0.2.160119, the voting files need to be placed in a normal redundancy diskgroup.
Risk:
Oracle Clusterware files placed on a normal redundancy diskgroup are exposed to the risk of being lost in the event of a diskgroup failure due to a double partner storage failure. Having the Clusterware files on a high redundancy diskgroup mitigates this risk. The voting files are the only Clusterware files that are mandated to be stored in a normal redundancy diskgroup under the two conditions mentioned above. However, even if the voting files are lost due to a double partner storage failure under those two conditions, they can be easily recreated, unlike all other Clusterware files, which require a restore from backups.
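NOTE: The two exceptions above can be expressed as a single rule: with fewer than 5 storage cells, voting files belong in a normal redundancy diskgroup if either the Exadata software release is older than 12.1.2.3.0 or the Grid Infrastructure version is older than 12.1.0.2.160119. The following is a minimal sketch of that rule, not part of the script provided in this section. The function names version_lt and expected_votefile_redundancy are assumptions for illustration, and the version comparison relies on GNU sort -V.

```shell
#!/bin/bash
# Hypothetical helper: succeed if dotted version $1 is strictly lower than $2.
version_lt() {
  [ "$1" != "$2" ] && \
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical helper: print the redundancy expected for the voting files.
# Arguments: <number of storage cells> <Exadata version> <GI version>
expected_votefile_redundancy() {
  local cells=$1 exadata_ver=$2 gi_ver=$3
  if [ "$cells" -lt 5 ] && {
       version_lt "$exadata_ver" 12.1.2.3.0 ||
       version_lt "$gi_ver" 12.1.0.2.160119
     }; then
    echo NORMAL
  else
    echo HIGH
  fi
}

expected_votefile_redundancy 3 12.1.2.2.1 12.1.0.2.160119   # case i
expected_votefile_redundancy 3 12.1.2.3.0 12.1.0.2.4        # case ii
expected_votefile_redundancy 3 12.1.2.3.0 12.1.0.2.160119   # neither case
```

The first two calls print NORMAL and the third prints HIGH.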
Action / Repair:
Execute the script provided below as the Grid Infrastructure owner to check if the Clusterware files are placed appropriately.
#!/bin/bash
#################################################################
# #
# Purpose: Check the placement of Oracle Clusterware Files #
# #
#################################################################
## Function declarations
export GRID_HOME=$(grep ^"+ASM" /etc/oratab|awk -F ":" '{print $2}')
export ORACLE_HOME=$GRID_HOME
export ORACLE_SID=$(grep ^"+ASM" /etc/oratab|awk -F ":" '{print $1}')
usage()
{
echo "Usage: CheckGIFiles.sh [-o check|report] [-h]";
}
checkDGRedundancy()
{
HighRedExists=$($GRID_HOME/bin/asmcmd lsdg --suppressheader|awk '{print $2}'|grep -q HIGH && echo "1")
if [ "$HighRedExists"x == "x" ]
then
HighRedExists=0
else
HighRedExists=1
fi
}
checkOCR()
{
OCRdgName=$($GRID_HOME/bin/ocrcheck|grep "Device/File Name"|awk -F":" '{print $2}'|awk -F"+" '{print $2}')
OCRDGRedundancy=$($GRID_HOME/bin/asmcmd lsdg --suppressheader $OCRdgName|awk '{print $2}'|grep -q HIGH && echo "1")
if [ "$OCRDGRedundancy"x == "x" ]
then
OCRHighRedundancy=0
OCRRec="Please relocate the OCR to a high redundancy diskgroup using $GRID_HOME/bin/ocrconfig as described in the link below\n"
OCRRecLink="http://docs.oracle.com/database/121/CWADD/votocr.htm#BABEIEJI\n"
else
OCRHighRedundancy=1
fi
}
checkASMspfile()
{
ASMspfiledgName=$($GRID_HOME/bin/asmcmd spget|awk -F"/" '{print $1}'|awk -F"+" '{print $2}')
ASMspfileDGRed=$($GRID_HOME/bin/asmcmd lsdg --suppressheader $ASMspfiledgName|awk '{print $2}'|grep -q HIGH && echo "1")
if [ "$ASMspfileDGRed"x == "x" ]
then
ASMspfileHighRed=0
ASMspRec="Please relocate the ASM spfile to a high redundancy diskgroup using '$GRID_HOME/bin/asmcmd spcopy -u' as described in the link below.\nAfter relocating the spfile, if possible restart the Grid Infrastructure in a rolling manner.\nIf a rolling grid infrastructure restart is not permitted, repeat the steps for relocating the spfile to the high redundancy diskgroup every time an initialization parameter modification to the ASM spfile is required until the Grid Infrastructure is restarted in a rolling manner.\n"
ASMspRecLinks="http://docs.oracle.com/database/121/OSTMG/GUID-528363BF-F4C8-4F05-BB61-DF7A6863E5B8.htm#OSTMG94420\n"
else
ASMspfileHighRed=1
fi
}
checkASMpwfile()
{
ASMpwfiledgName=$($GRID_HOME/bin/srvctl config asm|grep "Password"|awk -F":" '{print $2}'|awk -F"/" '{print $1}'|awk -F"+" '{print $2}')
ASMpwfileDGRed=$($GRID_HOME/bin/asmcmd lsdg --suppressheader $ASMpwfiledgName|awk '{print $2}'|grep -q HIGH && echo "1")
if [ "$ASMpwfileDGRed"x == "x" ]
then
ASMpwfileHighRed=0
ASMpwRec="Please relocate the ASM passwordfile to a high redundancy diskgroup using '$GRID_HOME/bin/asmcmd pwmove' as described in the link below.\n"
ASMpwRecLink="http://docs.oracle.com/database/121/OSTMG/GUID-6DFC9F42-A949-412F-B9F3-D947C1A620B8.htm#OSTMG95378\n"
else
ASMpwfileHighRed=1
fi
}
check_main()
{
checkDGRedundancy
checkOCR
checkASMspfile
checkASMpwfile
Dgsfound=$($GRID_HOME/bin/asmcmd lsdg |awk -F"/" '{print $1}'|awk '{print $13,$2}')
if [[ $HighRedExists -eq 1 ]]
then
if [[ $OCRHighRedundancy -eq 0 ]] || [[ $ASMspfileHighRed -eq 0 ]] || [[ $ASMpwfileHighRed -eq 0 ]]
then
repText="\nClusterware files placement check failed. \nThe clusterware files are not all placed in a high redundancy diskgroup.\n"
exit_code=1
repCmdOutput0="The Diskgroups found are \n=========================\n $Dgsfound\n"
repCmdOutput1="$(echo "OCR is stored in :" $OCRdgName)\n"
repCmdOutput2="$(echo "ASM spfile is stored in :" $ASMspfiledgName)\n"
repCmdOutput3="$(echo "ASM password file is stored in :" $ASMpwfiledgName)\n"
ALVL=1
else
repText="\nClusterware files placement check passed\n"
repCmdOutput0="The Diskgroups found are \n============================\n $Dgsfound\n"
repCmdOutput1="$(echo "OCR is stored in :" $OCRdgName)\n"
repCmdOutput2="$(echo "ASM spfile is stored in :" $ASMspfiledgName)\n"
repCmdOutput3="$(echo "ASM password file is stored in :" $ASMpwfiledgName)\n"
exit_code=0
fi
else
repText="\nClusterware files placement check passed\n"
repCmdOutput0="The Diskgroups found are \n============================\n $Dgsfound\n"
repCmdOutput1="$(echo "OCR is stored in :" $OCRdgName)\n"
repCmdOutput2="$(echo "ASM spfile is stored in :" $ASMspfiledgName)\n"
repCmdOutput3="$(echo "ASM password file is stored in :" $ASMpwfiledgName)\n"
exit_code=0
fi
}
print_result()
{
echo $exit_code
}
print_report()
{
echo -e $repText
echo -e "$repCmdOutput0"
echo -e "$repCmdOutput1"
echo -e "$repCmdOutput2"
echo -e "$repCmdOutput3"
if [ $exit_code -ne 0 ]
then
[ -z "$OCRRec" ] || echo -e "$OCRRec\n$OCRRecLink"
[ -z "$ASMspRec" ] || echo -e "$ASMspRec\n$ASMspRecLinks"
[ -z "$ASMpwRec" ] || echo -e "$ASMpwRec\n$ASMpwRecLink"
fi
}
NumArgs=$#
if [ $NumArgs -lt 1 ]
then
echo "Invalid or missing command line arguments..."
usage;
exit 1
fi
while getopts "o:h" opt;
do
case "${opt}" in
h) usage;
exit 0
;;
o)
swch=${OPTARG};
;;
*) echo "Invalid or missing command line arguments..."
usage;
exit 1
;;
esac
done
if [ "$swch" == "check" ]
then
check_main;
print_result;
elif [ "$swch" == "report" ]
then
check_main;
print_report;
else
echo "Invalid or missing command line arguments..."
usage;
exit 1
fi
The expected output is:
SUCCESS: Clusterware files placement check passed
- OR -
WARNING: Clusterware files placement check failed. The clusterware files are not all placed in a high redundancy diskgroup.
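NOTE: The redundancy lookups in the script above all follow the same pattern: list the diskgroups and read the Type column for one diskgroup. The following is a minimal sketch of that lookup as a reusable function, not part of the original script. It assumes the Type is the second field and the diskgroup name is the last field with a trailing "/"; the function name dg_redundancy and the sample rows are hypothetical.

```shell
#!/bin/bash
# Sketch only: print the redundancy type of one diskgroup from
# "asmcmd lsdg --suppressheader" style text supplied on stdin.
# Assumption: field 2 is the Type, the last field is "<NAME>/".
dg_redundancy() {
  awk -v dg="$1" '{ name = $NF; sub(/\/$/, "", name); if (name == dg) print $2 }'
}

# Hypothetical sample rows, for illustration only.
sample='MOUNTED  HIGH    N  512  4096  4194304  1000  900  100  266  0  Y  DATAC1/
MOUNTED  NORMAL  N  512  4096  4194304  1000  900  100  400  0  N  RECOC1/'

echo "$sample" | dg_redundancy DATAC1
echo "$sample" | dg_redundancy RECOC1
```

On a live system the same function could be fed from the command the script already uses, for example: $GRID_HOME/bin/asmcmd lsdg --suppressheader | dg_redundancy "$OCRdgName"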

Verify "_reconnect_to_cell_attempts=9" on database servers which access X6 storage servers
Priority: Critical
Alert Level: FAIL
Date: 06/29/16
Owner: <Name>
Status: Draft
Engineered System: Exadata - User Domain, Exadata - Physical, SSC
Bugs: <23713702> - exachk; 22672595, 23749547
DB Version: < 12.1.0.2 OCT BP -or- < 12.2.0.1
DB Role: ALL
Engineered System: X2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8
Exadata Version: 11.2+
OS & Version: Linux x86-64, Solaris
Validation Tool Version: EXAchk 12.2.0.1.1
Benefit / Impact:
For optimal high availability, the cellinit.ora parameter file on database servers which access X6 storage servers must contain "_reconnect_to_cell_attempts=9".
The impact of verifying this setting is minimal. The impact of adding the parameter to the cellinit.ora file on the database servers is minimal, but after the parameter is added on the database side, the cell server process (CELLSRV) on each X6 storage server must be restarted to activate the change.
Risk:
If the cellinit.ora parameter file on database servers which access X6 storage servers does not contain "_reconnect_to_cell_attempts=9" brownout duration may be lengthened.
Action / Repair:
EXAchk runs the appropriate validation based upon the discovered environment configuration. Run EXAchk and review the provided report.
The expected output in the EXAchk report should be as follows:
In the "Findings Passed" summary section of the report, the overall result should be "PASS":
PASS OS Check _reconnect_to_cell_attempts parameter in cellinit.ora is set to recommended value All Database Servers View
In the "View" detail section of the report for each individual database server:
Status on randomadm01:
PASS => _reconnect_to_cell_attempts parameter in cellinit.ora is set to recommended value
DATA FROM RANDOMADM01 - VERIFY "_RECONNECT_TO_CELL_ATTEMPTS=9" ON DATABASE SERVERS WHICH ACCESS X6 STORAGE SERVERS
ipaddress4=192.172.23.4/26
ipaddress3=192.172.23.3/26
ipaddress2=192.172.23.2/26
ipaddress1=192.172.23.1/26
_reconnect_to_cell_attempts=9
If the parameter is not set as expected, the overall result will be "FAIL" and more information will be listed in the "View" detail section.
To correct a "FAIL" result, do:
1) As the "root" userid on each database server that requires correction, edit the cellinit.ora file with vi and add "_reconnect_to_cell_attempts=9".
2) As the "root" userid on each storage server that communicates with the database servers in 1), restart the cell server process.
NOTE: If after corrective actions are completed, you wish to run just this verification without a full EXAchk run, as the "root" userid in the directory in which EXAchk was installed, execute the following:
./exachk -check 39E9CC7370B42BF6E0530E98EB0AC7A5
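NOTE: For a quick manual look between EXAchk runs, the presence of the parameter can be checked with grep. The following is a minimal sketch, not the EXAchk implementation. The function name check_reconnect_attempts is hypothetical, and the path to cellinit.ora is passed as an argument because this section does not state it.

```shell
#!/bin/bash
# Sketch only: report whether a cellinit.ora style file contains
# the line _reconnect_to_cell_attempts=9 (surrounding whitespace tolerated).
check_reconnect_attempts() {
  local file="$1"
  if grep -Eq '^[[:space:]]*_reconnect_to_cell_attempts[[:space:]]*=[[:space:]]*9[[:space:]]*$' "$file" 2>/dev/null
  then
    echo "SUCCESS: _reconnect_to_cell_attempts=9 found in $file"
  else
    echo "FAILURE: _reconnect_to_cell_attempts=9 not found in $file"
    return 1
  fi
}

# Demonstration against a throwaway file with sample content.
demo=$(mktemp)
printf 'ipaddress1=192.172.23.1/26\n_reconnect_to_cell_attempts=9\n' > "$demo"
check_reconnect_attempts "$demo"
rm -f "$demo"
```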

Verify passwordless SSH connectivity for Enterprise Manager (EM) agent owner userid to target component userids
Priority: Critical
Alert Level: FAIL
Date: 08/24/16
Owner: <Name>
Status: Development
Engineered System: Exadata - Physical, Exadata - Management Domain, Exadata - User Domain
Bug(s):
DB Version: N/A
DB Role: N/A
Engineered System Platform: X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8
Exadata Version: 11.2.2.2.0+
OS & Version: Linux x86-64
Validation Tool Version: exachk 12.2.0.1.1
Benefit / Impact:
EM agent monitoring requires passwordless SSH connectivity between the userid running the EM agent on each database server where an EM agent is running and specific userids for each target component that particular EM agent is monitoring. Component replacement or other maintenance work may destroy the passwordless SSH configuration and cause monitoring to fail.
Risk:
Users would not be notified if there are issues on the EM target components.
Action / Repair:
To verify that the necessary passwordless SSH exists, do the following on each database server where an EM agent is running:
1. Determine which database servers have EM agents installed using the EM console.
2. For each EM agent, determine the components for which it is responsible to monitor in the agent home page of the EM console.
3. Login to each database server where an EM agent is running as the operating system userid that launched the EM agent and execute the following for each monitored component determined in 1) and 2):

For a database server EM target:
ssh -o 'PreferredAuthentications=publickey' <AGENT OS USERID>@<Database_Server_Name> "echo Success"
For a storage server EM target:
ssh -o 'PreferredAuthentications=publickey' cellmonitor@<Storage_Server_Name> "echo Success"
For an InfiniBand switch EM target:
ssh -o 'PreferredAuthentications=publickey' nm2user@<IB_Switch_Name> "echo Success"
For a Cisco switch EM target:
ssh -o 'PreferredAuthentications=publickey' admin@<Cisco_Switch_Name> "echo Success"
For each component, the expected output should be:
Success
If "Permission denied (publickey,gssapi-with-mic,password)" is returned then the ssh configuration is not correct.
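NOTE: When an agent monitors many components, the ssh tests above can be wrapped in a loop. The following is a minimal sketch under stated assumptions: the function names classify_ssh_result and check_target are hypothetical, and the host names in the usage comment are placeholders. It classifies the three outcomes this section describes ("Success", "Permission denied...", anything else).

```shell
#!/bin/bash
# Sketch only: map the text returned by the ssh test to one of three outcomes.
classify_ssh_result() {
  case "$1" in
    Success)                echo "OK" ;;
    *"Permission denied"*)  echo "SSH_NOT_CONFIGURED" ;;
    *)                      echo "INVESTIGATE" ;;
  esac
}

# Run the documented ssh test against one user@host target and classify it.
check_target() {
  local out
  out=$(ssh -o 'PreferredAuthentications=publickey' "$1" "echo Success" 2>&1)
  echo "$1: $(classify_ssh_result "$out")"
}

# Usage on an agent host (placeholder target names):
#   for t in cellmonitor@<Storage_Server_Name> nm2user@<IB_Switch_Name>; do check_target "$t"; done

classify_ssh_result "Success"
classify_ssh_result "Permission denied (publickey,gssapi-with-mic,password)."
```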

CORRECTIVE ACTIONS:
For a database server:
4. To correct the "Permission denied..." case:
a. Check to see if /home/oracle/.ssh/id_dsa and id_dsa.pub files exist on the affected agent host. If either file does not exist follow the steps in: Enterprise Manager Oracle Exadata Database Machine Getting Started Guide, Chapter 8: Troubleshooting the Exadata Plug-in, Section: Establish SSH Connectivity
b. If both files exist, append the contents of /home/oracle/.ssh/id_dsa.pub on the agent (compute node) host to /home/oracle/.ssh/authorized_keys on the affected database server(s).
c. Ensure the permission on /home/oracle/.ssh/authorized_keys is set to 600 and owned by the oracle user

For a storage server:
4. To correct the "Permission denied..." case:
a. Check to see if /home/oracle/.ssh/id_dsa and id_dsa.pub files exist on the affected agent host. If either file does not exist follow the steps in: Enterprise Manager Oracle Exadata Database Machine Getting Started Guide, Chapter 8: Troubleshooting the Exadata Plug-in, Section: Establish SSH Connectivity
b. If both files exist, append the contents of /home/oracle/.ssh/id_dsa.pub on the agent host to /home/cellmonitor/.ssh/authorized_keys on the affected storage server(s).
c. Ensure the permission on /home/cellmonitor/.ssh/authorized_keys is set to 600 and owned by the cellmonitor user

For an InfiniBand switch:
4. To correct the "Permission denied..." case:
a. Check to see if /home/oracle/.ssh/id_dsa and id_dsa.pub files exist on the affected agent host. If either file does not exist follow the steps in: Enterprise Manager Oracle Exadata Database Machine Getting Started Guide, Chapter 8: Troubleshooting the Exadata Plug-in, Section: Establish SSH Connectivity
b. If both files exist, append the contents of /home/oracle/.ssh/id_dsa.pub on the agent host to /home/nm2user/.ssh/authorized_keys on the affected IB switch(es).
c. Ensure the permission on /home/nm2user/.ssh/authorized_keys is set to 600 and owned by the nm2user user

For the Cisco switch:
4. To correct the "Permission denied..." case:
a. Check to see if /home/oracle/.ssh/id_dsa and id_dsa.pub files exist on the affected agent host. If either file does not exist follow the steps in: Enterprise Manager Oracle Exadata Database Machine Getting Started Guide, Chapter 8: Troubleshooting the Exadata Plug-in, Section: Establish SSH Connectivity

Login to the switch as admin and issue the commands below to add keys

Switch hostname>enable
Switch hostname#configure terminal
Switch hostname(config)#ip ssh pubkey-chain
Switch hostname(conf-ssh-pubkey)#username admin
Switch hostname(conf-ssh-pubkey-user)#key-string
Switch hostname(conf-ssh-pubkey-data)#< Enter your keyfile contents here >
Switch hostname(conf-ssh-pubkey-data)#< Enter your keyfile contents here >

** The key may need to be entered on multiple lines as the maximum line length is 254 characters.

Now exit the switch
Switch hostname(conf-ssh-pubkey-data)#exit
Switch hostname(conf-ssh-pubkey-user)#exit
Switch hostname(conf-ssh-pubkey)#exit
Switch hostname(config)#exit
Switch hostname#exit
5. Repeat step 3 and verify connectivity
If some message other than "Success" or "Permission denied...." is returned, investigate for root cause based on the message keywords and take corrective action.
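NOTE: For the database server, storage server, and InfiniBand switch cases, the "append the key" and "set permission to 600" corrective steps can be sketched as a single function. This is a minimal sketch, not an Oracle-supplied procedure. The function name install_pubkey is hypothetical, it must be run where the authorized_keys file lives, and it only appends a key that is not already present. The demonstration uses a temporary directory and a dummy key.

```shell
#!/bin/bash
# Sketch only: append a public key to an authorized_keys file if it is
# not already there, then restrict the file permission to 600.
install_pubkey() {
  local pubkey_file="$1" auth_keys="$2"
  mkdir -p "$(dirname "$auth_keys")"
  touch "$auth_keys"
  # -x whole line, -F fixed string, -f read the pattern from the key file
  grep -qxF -f "$pubkey_file" "$auth_keys" || cat "$pubkey_file" >> "$auth_keys"
  chmod 600 "$auth_keys"
}

# Demonstration with a dummy key in a scratch directory.
scratch=$(mktemp -d)
echo "ssh-dss AAAAexamplekey agent@host" > "$scratch/id_dsa.pub"
install_pubkey "$scratch/id_dsa.pub" "$scratch/.ssh/authorized_keys"
install_pubkey "$scratch/id_dsa.pub" "$scratch/.ssh/authorized_keys"   # second call adds nothing
wc -l < "$scratch/.ssh/authorized_keys"
rm -rf "$scratch"
```

Ownership of the file (oracle, cellmonitor, or nm2user) still has to be confirmed separately, as described in step c for each component.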
Check /EXAVMIMAGES on dom0s for possible over allocation by sparse files
Priority: Critical
Alert Level: WARN
Date: 03/15/17
Owner: <Name>
Status: Production
Engineered System: Exadata - Management Domain
Engineered System Platform: ALL
Bug(s): Bug 25688952 - Exachk; Bug 25520385 - Exachk
DB Version: N/A
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: Linux
OS & Version: Linux
Validation Tool Version: exachk 12.2.0.1.3
MAA Scorecard Section: N/A
Benefit:
To use dom0 disk space efficiently, two space saving techniques are used for disk image files in /EXAVMIMAGES: sparse files and reflinks. Sparse files do not allocate blocks on disk for empty space. OCFS2 reflinks allow disk image copies to share blocks on disk until one of the copies changes, at which time a new block on disk is allocated. As a result of these space saving features, the amount of disk space consumed is less than the apparent size of the user domain disk image files reported by the "du -sS --apparent-size" command. However, as a user domain is used and files are changed, created, and removed, the disk space consumed from the /EXAVMIMAGES file system will continually grow while the actual space used by disk image files could remain the same. This check warns when the total apparent size of all files in /EXAVMIMAGES exceeds the size of the file system.
Impact: The impact of this check is minimal
Risk:
A failure does not occur when the apparent size exceeds the size of the /EXAVMIMAGES file system. This may be normal in environments that benefit heavily from sparse files and reflinks. However, over time as changes are made to user domain disks (e.g. by applying Exadata, Grid Infrastructure, or Database patches), allocated space in the /EXAVMIMAGES file system increases. If the allocated space reaches the /EXAVMIMAGES file system size in dom0, then an out of space error will occur within the user domain, even though df output within the user domain shows there is available space. This can cause unpredictable behavior, such as unbootable user domains, or corrupted files that were being changed at the time the out of space error occurred.
Action/Repair: Execute the script as root on a dom0.
To validate /EXAVMIMAGES on dom0s for possible over allocation by sparse files, run exachk and review the provided report.
The expected output in the exachk report should be as follows:

In the "Findings Passed" summary section of the report, the overall result should be "PASS":
PASS   OS Check   /EXAVMIMAGES on dom0s has enough free space   All Database Servers   View
In the "View" detail section of the report for each individual database server:
Status on randomadm01:
PASS => /EXAVMIMAGES on dom0s has enough free space

DATA FROM RANDOMADM01 FOR CHECK /EXAVMIMAGES ON DOM0S FOR POSSIBLE OVER ALLOCATION BY SPARSE FILES 

/EXAVMIMAGES space has not been over allocated and the space usage is under the threshold.
If there are issues discovered, the overall result will be "FAIL" and more information will be listed in the "View" detail section. Investigate the reported issues for root cause and take appropriate corrective action.
NOTE: If after corrective actions are completed, you wish to run this one check without a full exachk run execute the following command as the "root" userid in the directory in which exachk was installed:
./exachk -check 3F15EA417EBB5C15E0530A98EB0A8124
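NOTE: The comparison this check describes (total apparent size of files versus file system size) can be approximated by hand. The following is a minimal sketch, not the exachk implementation. The function names compare_sizes and check_dir are hypothetical, and the use of "du -s --apparent-size -B1" and "df -P -B1" assumes GNU coreutils.

```shell
#!/bin/bash
# Sketch only: compare an apparent size with a file system size (both in bytes).
compare_sizes() {
  local apparent="$1" fs_size="$2"
  if [ "$apparent" -gt "$fs_size" ]; then
    echo "WARNING: apparent size $apparent bytes exceeds file system size $fs_size bytes"
  else
    echo "SUCCESS: apparent size $apparent bytes is within file system size $fs_size bytes"
  fi
}

# Gather both numbers for a directory and compare them.
check_dir() {
  local dir="$1" apparent fs_size
  apparent=$(du -s --apparent-size -B1 "$dir" | awk '{print $1}')
  fs_size=$(df -P -B1 "$dir" | awk 'NR==2 {print $2}')
  compare_sizes "$apparent" "$fs_size"
}

# On a dom0, as root:  check_dir /EXAVMIMAGES
compare_sizes 1500 1000
compare_sizes 500 1000
```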

  Verify active kernel version matches expected version for installed Exadata Image

Priority: Critical
Alert Level: FAIL
Date: 11/28/18
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - Management Domain, Exadata - User Domain
Engineered System Platform: ALL
Bug(s): 28826182 - exachk; 26337714 - exachk
DB/GI Version: N/A
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: 12.1.2.3.0 or higher
OS & Version: Linux
Validation Tool Version: exachk 18.1.4
MAA Scorecard Section: N/A
Benefit / Impact:
Beginning with Exadata version 12.1.2.3.0, the "imageinfo" command includes data on the active kernel version and the expected kernel version for the installed version of the Exadata image. The active and expected kernel versions should match.
Risk:
Having an active kernel version that does not match the expected version could adversely impact upgrade operations.
Action / Repair:
To verify active kernel version matches expected version for the installed Exadata image, as the "root" userid on each database server, execute the following command set:
RAW_DATA=$(imageinfo | egrep "Kernel|kernel")
ACTIVE_KERNEL_VERSION=$(echo "$RAW_DATA" | egrep "Kernel" | cut -d":" -f2 | cut -d"#" -f1 | tr -d '[[:space:]]')
EXPECTED_KERNEL_VERSION=$(echo "$RAW_DATA" | egrep "kernel" | cut -d":" -f2 | tr -d '[[:space:]]')
AKV_OFFSET=$(echo "$ACTIVE_KERNEL_VERSION" | egrep -b -o "\.el" | cut -d":" -f1)
EKV_OFFSET=$(echo "$EXPECTED_KERNEL_VERSION" | egrep -b -o "\.el" | cut -d":" -f1)
ACTIVE_KERNEL_VERSION_SHORT=$(echo "$ACTIVE_KERNEL_VERSION" | cut -c 1-$AKV_OFFSET)
EXPECTED_KERNEL_VERSION_SHORT=$(echo "$EXPECTED_KERNEL_VERSION" | cut -c 1-$EKV_OFFSET)
if [ "$ACTIVE_KERNEL_VERSION_SHORT" = "$EXPECTED_KERNEL_VERSION_SHORT" ]
then 
     echo -e "SUCCESS: The kernel versions match:\n"
     echo -e "Active kernel version:\t\t$ACTIVE_KERNEL_VERSION_SHORT"
     echo -e "Expected kernel version:\t$EXPECTED_KERNEL_VERSION_SHORT"
else 
     echo -e "FAILURE: The kernel versions should match:\n"
     echo -e "Active kernel version:\t\t$ACTIVE_KERNEL_VERSION_SHORT"
     echo -e "Expected kernel version:\t$EXPECTED_KERNEL_VERSION_SHORT" 
fi;
The expected output should be similar to:
SUCCESS: The kernel versions match:

Active kernel version:          2.6.39-400.284.1
Expected kernel version:        2.6.39-400.284.1
Example of a "FAILURE" message:
FAILURE: The kernel versions should match:

Active kernel version:          2.6.39-400.284.1
Expected kernel version:        2.6.39-500.284.1
If a "FAILURE: ..." message appears, corrective actions will depend upon the kernel versions and the reasons for which the mismatch was introduced. Please open an SR for diagnostic and corrective assistance.
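NOTE: The command set above shortens each kernel version by locating the ".el" marker and cutting the string before it. The following is a minimal sketch of an alternative way to do the same trimming with bash parameter expansion. It is not the exachk code, the function name short_kernel_version is hypothetical, and the sample strings are illustrative.

```shell
#!/bin/bash
# Sketch only: drop everything from the first ".el" onward,
# e.g. "2.6.39-400.284.1.el6uek" becomes "2.6.39-400.284.1".
short_kernel_version() {
  echo "${1%%.el*}"
}

active=$(short_kernel_version "2.6.39-400.284.1.el6uek.x86_64")
expected=$(short_kernel_version "2.6.39-400.284.1.el6uek")

if [ "$active" = "$expected" ]; then
  echo "SUCCESS: The kernel versions match: $active"
else
  echo "FAILURE: The kernel versions should match: $active vs $expected"
fi
```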


Verify Storage Server user "CELLDIAG" exists

Priority: Critical
Alert Level: FAIL
Date: 10/26/16
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - Management Domain
Bug(s): 25520477 - exachk; 24958292 - exachk; Reference: 23039723
DB Version: N/A
DB Role: N/A
Engineered System Platform: X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8
Exadata Version: 12.1.2.2.0+
OS & Version: Linux x86-64
Validation Tool Version: exachk TBD
Benefit / Impact:
Beginning with Exadata Storage Server Software version 12.1.2.2.0, the storage server user "CELLDIAG" is created during deployment which allows access to diagnostics without using a more privileged user. The benefit of creating and using the "CELLDIAG" user is improved security. The impact of verifying that the "CELLDIAG" user is created is minimal, as is the impact of creating the user if it does not exist.
Risk:
Not creating and using the storage server user "CELLDIAG" fails to utilize a security improvement.
Action / Repair:
To verify the storage server user "CELLDIAG" exists, as the "root" userid on each storage server, execute the following command set:
#!/bin/bash
USER=`cellcli -e list user where name = 'CELLDIAG'`
RET=$?
if [ $RET -eq 0 -a -n "$USER" ]; 
then 
  echo "SUCCESS: CELLDIAG user exists"
else
   echo "FAILURE: CELLDIAG user does not exist"
fi
The expected output should be similar to:
SUCCESS: CELLDIAG user exists
Example of a "FAILURE" message (there is no output from the command--the absence of the CELLDIAG output is the failure condition):
FAILURE: CELLDIAG user does not exist
If a "FAILURE: ..." message appears, create the user and role on each cell in cellcli using commands like these:
create user CELLDIAG password="SomeGood42Password";  
create role celldiagrole;  
grant privilege create on diagpack to role celldiagrole;  
grant privilege list on diagpack to role celldiagrole;  
grant privilege download on diagpack to role celldiagrole;  
grant role celldiagrole to user CELLDIAG;
NOTE: The "CELLDIAG" user is created during the Exadata Storage Server Software version 12.1.2.2.0 or higher deployment process. It is not created during an upgrade from an older release.

NOTE: the user detail for a properly configured "CELLDIAG" userid should look like:
CellCLI> list user CELLDIAG detail
         name:                   CELLDIAG
         roles:                  role=celldiagrole
                                         Privileges:
                                 object=diagpack, verb=create, attributes=all attributes, options=all options
                                 object=diagpack, verb=download, attributes=all attributes, options=all options
                                 object=diagpack, verb=list, attributes=all attributes, options=all options

NOTE: Creation of the "CELLDIAG" storage server user is not mandatory. The automatic diagnostic gathering process continues to function without it and the packaged diagnostics are accessed using one of the other storage server users.
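NOTE: The existence check above does not confirm that the user holds the three diagpack privileges shown in the detail listing. The following is a minimal sketch that scans "list user CELLDIAG detail" style text for them. It is not part of exachk, the function name celldiag_privileges_ok is hypothetical, and it assumes the output format matches the sample shown in this section.

```shell
#!/bin/bash
# Sketch only: succeed when text on stdin lists the create, download,
# and list verbs on the diagpack object.
celldiag_privileges_ok() {
  local detail verb
  detail=$(cat)
  for verb in create download list; do
    echo "$detail" | grep -q "object=diagpack, verb=$verb," || return 1
  done
}

# Demonstration using the sample listing from this section.
sample='name:                   CELLDIAG
roles:                  role=celldiagrole
Privileges:
object=diagpack, verb=create, attributes=all attributes, options=all options
object=diagpack, verb=download, attributes=all attributes, options=all options
object=diagpack, verb=list, attributes=all attributes, options=all options'

if echo "$sample" | celldiag_privileges_ok; then
  echo "SUCCESS: CELLDIAG has the expected diagpack privileges"
else
  echo "FAILURE: CELLDIAG is missing one or more diagpack privileges"
fi
```

On a storage server the function could be fed from the command shown in the listing above: cellcli -e list user CELLDIAG detail | celldiag_privileges_ok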

  Verify installed rpm(s) kernel type match the active kernel version

Priority: Critical
Alert Level: WARN
Date: 11/28/18
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain
Engineered System Platform: ALL
Bug(s): 28740049 - exachk; 26396389 - exachk
DB/GI Version: N/A
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 18.4.0
MAA Scorecard Section: N/A
Benefit / Impact:
Verifying installed rpm(s) kernel type match the active kernel version helps avoid update failures due to dependency conflicts between older rpm versions and newer versions being installed. The impact of verifying that installed rpm(s) kernel type match the active kernel version is minimal. The impact of correction depends upon why the mismatched rpm(s) was/were installed and cannot be estimated here.
Risk:
If installed rpm(s) kernel type do not match the active kernel, there may be update interruptions caused by dependency conflicts between older rpm versions and newer versions being installed.
Action/Repair:
To verify the installed rpm(s) kernel type match the active kernel version, execute the following code as the "root" userid on each database server:
unset ERROR_MESSAGE
UNAME_DATA=$(uname -r)
START=$(echo "$UNAME_DATA" | awk 'END{print index($0,"el")}')
END=$(expr $START + 2)
KERNEL_TYPE=$(echo "$UNAME_DATA" | cut -c$START-$END)
case "$KERNEL_TYPE" in
  el7)
    MISMATCHED_RPMS=$(rpm -aq | egrep "\.el5|\.el6")
    ;;
  el6)
    MISMATCHED_RPMS=$(rpm -aq | egrep "\.el5|\.el7")
    ;;
  el5)
    MISMATCHED_RPMS=$(rpm -aq | egrep "\.el6|\.el7")
    ;;
  *)
    ERROR_MESSAGE=$(echo "Unrecognized kernel type:  $KERNEL_TYPE")
    ;;
esac
if [ -n "$ERROR_MESSAGE" ]
then
  echo -e "\nFAILURE:  $ERROR_MESSAGE"
else
  if [ -n "$MISMATCHED_RPMS" ]
  then
    MISMATCH_COUNT=$(echo "$MISMATCHED_RPMS" | wc -l)
  else
    MISMATCH_COUNT=0
  fi
  if [ -z "$MISMATCHED_RPMS" ]
  then
    echo -e "\nSUCCESS:  There were no mismatched rpms found.\n\nKernel type:\t\t$KERNEL_TYPE\nMismatch count:\t\t$MISMATCH_COUNT"
  else
    echo -e "\nFAILURE:  One or more mismatched rpms were found.\n\nKernel type:\t\t$KERNEL_TYPE\nMismatch count:\t\t$MISMATCH_COUNT\nMismatched rpms:\n$MISMATCHED_RPMS"
  fi
fi

The expected output should be similar to:

SUCCESS:  There were no mismatched rpms found.

Kernel type:            el6
Mismatch count:         0

Examples of "FAILURE" results:

FAILURE:  One or more mismatched rpms were found.

Kernel type:            el5
Mismatch count:         37   
Mismatched rpms:
gdb-7.2-83.el6.x86_64
basesystem-10.0-4.0.1.el6.noarch
strace-4.8-10.el6.x86_64
<output truncated>

FAILURE:  Unrecognized kernel type:  25.el

If the output is not "SUCCESS", investigate for root cause and take corrective action based on root cause findings. 
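NOTE: The kernel type detection and rpm filtering above can also be expressed with a regular expression, which avoids one case branch per release. The following is a minimal sketch of that alternative, not the exachk code. The function names kernel_type and mismatched_rpms are hypothetical, and the rpm names in the demonstration are illustrative.

```shell
#!/bin/bash
# Sketch only: extract "elN" from a kernel release string such as the output of uname -r.
kernel_type() {
  if [[ "$1" =~ \.(el[0-9]+) ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Read rpm names on stdin and print those tagged with a different elN than $1.
mismatched_rpms() {
  grep -E '\.el[0-9]+' | grep -Ev "\.$1([^0-9]|\$)" || true
}

ktype=$(kernel_type "4.1.12-94.8.4.el6uek.x86_64")
echo "Kernel type: $ktype"

# Live usage would be:  rpm -aq | mismatched_rpms "$ktype"
printf '%s\n' gdb-7.2-83.el6.x86_64 bash-4.2.46-31.el7.x86_64 | mismatched_rpms "$ktype"
```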

Verify Flex ASM Cardinality is set to "ALL"

Priority: Critical
Alert Level: FAIL
Date: 11/23/16
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain
Bug(s): - exachk
DB Version: 12.2.0.1+
DB Role: ASM
Engineered System Platform: X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8
Exadata Version: 11.2.2.2.0+
OS & Version: Linux x86-64
Validation Tool Version: exachk 12.2.0.1.2
Benefit / Impact:
By default, Flex ASM cardinality is set to 3. The impact of verifying that Flex ASM Cardinality is set to "ALL" is minimal. The impact of setting the Flex ASM cardinality to "ALL" from a lower value is minimal and can be done online; ASM will bring up the additional instances required to fulfill the cardinality setting.
Risk:
Not having Flex ASM cardinality set to "ALL" could result in a higher number of client (DB) connections on some ASM instances and may result in longer client reconnection times should an ASM instance crash.
Action / Repair:
To verify Flex ASM Cardinality is set to "ALL", as the Oracle home owner userid with the environment properly set, execute the following command set on one database server in the cluster where an ASM instance is executing:
RAW_DATA=$($ORACLE_HOME/bin/srvctl config asm -detail)
FLEX_MODE=$($ORACLE_HOME/bin/asmcmd showclustermode | cut -d" " -f6)
if [ "$FLEX_MODE" = "disabled" ]
then
echo -e "INFO: ASM is not in Flex mode: $FLEX_MODE, check not executed."
else
CARDINALITY=$(echo "$RAW_DATA" | grep count | cut -d" " -f4)
if [ "$CARDINALITY" = "ALL" ];
then
echo -e "SUCCESS: Flex ASM cardinality is set to: $CARDINALITY."
else
echo -e "FAILURE: Flex ASM cardinality is set to: $CARDINALITY.\n\n$RAW_DATA"
fi
fi
The expected output should be:
SUCCESS: Flex ASM cardinality is set to: ALL.
-- OR --
INFO:  ASM is not in Flex mode: disabled, check not executed. 
Example of a "FAILURE" message:
FAILURE: Flex ASM cardinality is set to: 3.

ASM home: <CRS home>
Password file: +DBFS_DG/orapwASM
Backup of Password file:
ASM listener: LISTENER
ASM is enabled.
ASM is individually enabled on nodes:
ASM is individually disabled on nodes:
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
If a "FAILURE: ..." message appears, adjust the Flex ASM cardinality to "ALL" using the following command:
srvctl modify asm -count ALL
After making the change to ASM cardinality, verify that each node has an ASM instance running using the following command:
$ srvctl status asm -detail | grep "is running"
ASM is running on exadb06,exadb05,exadb08,exadb07,exadb02,exadb01,exadb04,exadb03
ASM instance +ASM2 is running on node exadb02  
ASM instance +ASM1 is running on node exadb01  
ASM instance +ASM4 is running on node exadb04  
ASM instance +ASM3 is running on node exadb03  
ASM instance +ASM5 is running on node exadb05  
ASM instance +ASM6 is running on node exadb06  
ASM instance +ASM7 is running on node exadb07  
ASM instance +ASM8 is running on node exadb08
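NOTE: The cardinality value is read from the "ASM instance count:" line of the "srvctl config asm -detail" output. The following is a minimal sketch that extracts it by label rather than by field position. It is not the exachk code, the function name asm_cardinality is hypothetical, and the sample text is taken from the example output in this section.

```shell
#!/bin/bash
# Sketch only: print the value of the "ASM instance count:" line from text on stdin.
asm_cardinality() {
  awk -F': *' '/^ASM instance count:/ { v = $2; gsub(/[[:space:]]+$/, "", v); print v }'
}

sample='ASM home: <CRS home>
ASM listener: LISTENER
ASM is enabled.
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM'

echo "Flex ASM cardinality is set to: $(echo "$sample" | asm_cardinality)"
```

Live usage would be: $ORACLE_HOME/bin/srvctl config asm -detail | asm_cardinality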

Verify "downdelay" is correctly set for bonded client interfaces

Priority: Critical
Alert Level: FAIL
Date: 02/08/17
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - Management Domain
Bug(s): Bug 25520669 - exachk; Bug 25144261 - exachk
DB Version: N/A
DB Role: N/A
Engineered System Platform: X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8
Exadata Version: 12.1.2.2.0+
OS & Version: Linux x86-64
Validation Tool Version: exachk TBD
Benefit / Impact:
When using the default "downdelay" settings, an undesired VIP failover or brownout may be seen depending upon the timing of a single client network interface failure. To avoid this possibility, the "downdelay" parameter of the client network interface should be set to 2000 when using active-backup mode bonding and to 200 when using LACP mode bonding.
The impact of verifying "downdelay" attributes for bonded client interfaces is minimal. The recommended corrective action includes a reboot.
Risk:
Not verifying "downdelay" attributes for bonded client interfaces increases the risk of unwanted VIP failover or brownouts in the event of a single client network interface failure.
Action/Repair:
To verify "downdelay" attributes for bonded client interfaces, as the root userid execute the script below on each database server:

#!/bin/bash
 
#############################################################################################
#                                                                            #
#  Purpose: Check downdelay is set appropriately for the bonded interfaces          #
#                                                                            #
#############################################################################################
 
## Variable declarations
exit_code=0
## Function Definitions
usage()
{
  echo "Usage: check_downdelay.sh [-o check|report] [-h]";
}
 
check_downdelay()
{
downDelayActiveBackup=2000
downDelayLACP=200
while read bonintf
do
  bondingType=$(grep "^Bonding Mode:" $bonintf|awk -F ":" '{print $2}')
  downdelaySet=$(grep "^Down Delay (ms):" $bonintf|awk '{print $4}')
  if [ "${bondingType}" == " fault-tolerance (active-backup)" ]
  then
    if [ $downdelaySet -ne $downDelayActiveBackup ]
    then
      exit_code=1
      downdelayFailMsgTmp="Down delay not set to 2000 for the active-backup bonded interface $(echo $bonintf|awk -F"/" '{print $NF}')"
      downdelayFailMsg=$(printf "$downdelayFailMsgTmp\n$downdelayFailMsg")
    fi
  elif [ "${bondingType}" == " IEEE 802.3ad Dynamic link aggregation" ]
  then
    if [ $downdelaySet -ne $downDelayLACP ]
    then
      exit_code=1
      downdelayFailMsgTmp="Down delay not set to 200 for the LACP bonded interface $(echo $bonintf|awk -F"/" '{print $NF}')"
      downdelayFailMsg=$(printf "$downdelayFailMsgTmp\n$downdelayFailMsg")
    fi
  fi
  if [ $exit_code -eq 0 ]
  then
    downdelayPassMsg="Down delay correctly set to correct value(s) for all bonded interfaces"
  fi
done << EOF
$(ls -1 /proc/net/bonding/bondeth*)
EOF
}
 
check_main()
{
    check_downdelay
}
 
print_result()
{
  echo $exit_code
}
 
print_report()
{
  if [ $exit_code -eq 0 ]
  then
    printf "\n$downdelayPassMsg\n"
  else
      printf "\n$downdelayFailMsg\n"
  fi
}
 
NumArgs=$#
 
if [ $NumArgs -lt 1 ]
then
  echo "Invalid or missing command line arguments..."
  usage;
  exit 1
fi
 
while getopts "o:h" opt;
do
  case "${opt}" in
    h) usage;
       exit 0
       ;;
    o)
       swch=${OPTARG};
       ;;
    *) echo "Invalid or missing command line arguments..."
       usage;
       exit 1
       ;;
   esac
done
 
if [ "$swch" = "check" ]
then
  check_main;
  print_result;
elif [ "$swch" == "report" ]
then
  check_main;
  print_report;
else
  echo "Invalid or missing command line arguments..."
  usage;
  exit 1
fi
The expected output should be:
Down delay correctly set to correct value(s) for all bonded interfaces
Example of a failure:
Down delay not set to 2000 for the active-backup bonded interface bondeth0
If failures are reported, as the root userid on the database server which has the failure, execute the following command followed by a reboot:
For active-backup mode - sed -i 's/downdelay=<existing value>/downdelay=2000/' /etc/sysconfig/network-scripts/ifcfg-<client network interface name>
For LACP mode - sed -i 's/downdelay=<existing value>/downdelay=200/' /etc/sysconfig/network-scripts/ifcfg-<client network interface name>
NOTE: It is possible to temporarily set the value in the active kernel as the root userid using this command:
echo 2000 > /sys/class/net/<client network interface name>/bonding/downdelay - For active-backup bonding
echo 200 > /sys/class/net/<client network interface name>/bonding/downdelay - For LACP bonding
However, this will not survive a reboot. The "sed" command followed by a reboot is the preferred method.
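NOTE: The "sed" commands above require the existing value to be typed in. The following is a minimal sketch that matches any numeric downdelay value instead. It is not an Oracle-supplied script, the function name set_downdelay is hypothetical, and the BONDING_OPTS line in the demonstration is illustrative sample content written to a temporary file, not a real ifcfg file.

```shell
#!/bin/bash
# Sketch only: replace whatever numeric downdelay value a file contains with $2.
# Uses GNU sed in-place editing, as the commands in this section do.
set_downdelay() {
  local file="$1" value="$2"
  sed -i "s/downdelay=[0-9][0-9]*/downdelay=${value}/" "$file"
}

# Demonstration on a throwaway file.
demo=$(mktemp)
echo 'BONDING_OPTS="mode=active-backup miimon=100 downdelay=5000 updelay=5000 num_grat_arp=100"' > "$demo"
set_downdelay "$demo" 2000
cat "$demo"
rm -f "$demo"
```

Take a backup copy of the ifcfg file before editing the real one; a reboot is still required afterwards, as described above.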
 
Verify ExaWatcher is executing

Priority: Critical
Alert Level: FAIL
Date: 02/15/17
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain, SSC
Bug(s): Bug 25543623 - exachk
DB Version: 11.2.0.2+
DB Role: N/A
Engineered System Platform: X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, SL6
Exadata Version: 11.2.3.3.0+
OS & Version: Linux x86-64, Sparc Linux
Validation Tool Version: exachk 12.2.0.1.3
Benefit / Impact:
ExaWatcher collects data on key metrics for both database and storage servers, which can be used for both troubleshooting and performance analysis. There is minimal impact to verify that ExaWatcher is executing, or from starting ExaWatcher if it is not executing.
Risk:
If ExaWatcher is not executing, valuable data for analysis is not collected.
Action / Repair:
To verify that ExaWatcher is executing, as the "root" userid execute the following command set on each database and storage server in the cluster:
NUM_OF_EXAWATCHERS=$(ps -ef | grep -i exawatcher | grep -v grep | wc -l)
if [[ $NUM_OF_EXAWATCHERS -gt 0 ]]
then
  echo -e "SUCCESS: ExaWatcher is executing.  Number of processes: $NUM_OF_EXAWATCHERS"
else
  echo -e "FAILURE: ExaWatcher is not executing.  Number of processes: $NUM_OF_EXAWATCHERS"
fi
The output should be similar to:
SUCCESS: ExaWatcher is executing.  Number of processes: 15
NOTE: The number of processes may vary depending upon the site-specific configuration.
If ExaWatcher is not executing, please refer to the "System Diagnostics Data Gathering with sosreports and Oracle ExaWatcher" section of the "Oracle® Exadata Storage Server Software User's Guide" that is for your specific installed version of Oracle Exadata Storage Server software.
Verify non-Default services are created for all Pluggable Databases
Priority: Critical
Alert Level: WARN
Date: 02/15/17
Owner: Frank Kobylanski
Status: Production
Engineered System: Exadata - Physical, Exadata - Management Domain, Exadata - User Domain
Engineered System Platform: ALL
Bug(s): Bug 25520385 - exachk
DB Version: 12.1.0.1+
DB Type: CDB
DB Role: Primary
DB Mode: Open
Exadata Version: ALL
OS & Version: ALL
Validation Tool Version: exachk 12.2.0.1.3
MAA Scorecard Section: N/A
Benefit / Impact:
Oracle recommends that non-default services should be created for application and end user access to pluggable databases (PDBs). This provides access control along with automated opening of the PDB as part of container database (CDB) startup.
Risk:
PDBs may not open automatically at instance startup and applications and users may have access to PDBs through default services at inappropriate times.
Action / Repair:
Note that only PDBs that are open and not in MIGRATE/UPGRADE mode will be checked. Since a PDB may not be open on all instances the following script should be executed on each instance of each CDB.
To verify that all PDBs in a CDB have at least one non-default service created for them, as the CDB ownerid on each database server:
1. Set your environment for a CDB
2. Run the script below
Repeat steps 1 and 2 for each CDB running on the database server, then move onto the next database server.
unset PDB_SERVICES;
PDB_SERVICES=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set head off lines 80 feedback off timing off serveroutput on
select name from v\$pdbs p
where p.name not in ('PDB\$SEED','CDB\$ROOT')
 and p.open_mode not in ('MOUNTED','MIGRATE')
 and p.name not in (select s.pdb from containers(service\$) s
                    where bitand(s.flags,128) != 128
                     and deletion_date is null
                     and s.name != ('SYS.SCHEDULER\$_EVENT_QUEUE')
                     and s.name not like ('SYS\$%'));
exit
EOF
);
if [ `echo $PDB_SERVICES| grep ORA- | wc -w` = 0 ]
  then
  if [ `echo $PDB_SERVICES| wc -w` = 0 ]
    then
      echo -e SUCCESS: all open PDBs have non-default services defined or there are no open PDBs;
    else
      echo -e WARNING: the following open PDBs do not have non-default services defined: $PDB_SERVICES;
  fi;
else
  echo -e WARNING: Issues were detected while trying to access the database: $PDB_SERVICES;
fi;
If all PDBs that can be checked have non-default services defined, the following will be returned:
SUCCESS: all open PDBs have non-default services defined or there are no open PDBs
If there are PDBs found that do not have non-default services defined for them, a message similar to the following will be returned.
WARNING: the following open PDBs do not have non-default services defined: TESTPDB4 TESTPDB2 TESTPDB3 TESTPDB5 TESTPDB1
To resolve the warning, create services for these PDBs using either:
1. srvctl in Grid Infrastructure or Oracle Restart based environments
2. The DBMS_SERVICE.CREATE_SERVICE procedure in environments where srvctl is not available.
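NOTE: The following is a minimal sketch of how the list of PDBs from the "WARNING" message could be turned into candidate srvctl commands for review. It only prints the commands and does not execute them. The database unique name "mycdb", the function name "print_add_service_cmds", and the "<pdb>_svc" service naming pattern are illustrative assumptions. Additional options (for example "-preferred" in an administrator-managed cluster database) may be required; consult "srvctl add service -help" for the installed version.

```shell
# Print (do not execute) one "srvctl add service" command per PDB.
# $1 is the database unique name; remaining arguments are PDB names.
print_add_service_cmds() {
  db_unique_name=$1
  shift
  for pdb in "$@"; do
    # Hypothetical naming pattern: lower-case PDB name plus "_svc".
    svc=$(echo "${pdb}_svc" | tr 'A-Z' 'a-z')
    echo "srvctl add service -db $db_unique_name -service $svc -pdb $pdb"
  done
}

print_add_service_cmds mycdb TESTPDB1 TESTPDB2
```

Review each generated command before running it as the database home owner.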
 
Verify Automatic Storage Management Cluster File System (ACFS) file systems do not contain critical database files

Priority: Critical
Alert Level: FAIL
Date: 08/14/19
Owner: Irfan Alvi
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain
Engineered System Platform: ALL
Bug(s): 29411526 - exachk, 26268345 - exachk, 26143661 - OEDA
DB/GI Version: 12.1.0.2.0 or higher
DB Type: ASM
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux X86-64
Validation Tool Version: exachk 19.3.0
MAA Scorecard Section: N/A
Benefit / Impact:
ACFS disk groups created on Exadata should not contain any critical database files to isolate operational maintenance and configuration changes.
The impact of verifying (ACFS) file systems do not contain critical database files is minimal and can be done online.
The impact of moving critical database files out of ACFS disk groups varies by the type of file involved, and cannot be estimated here.

NOTE: For more information on ACFS use cases and recommended disk group attributes on Exadata, please see: Oracle ACFS Support on Oracle Exadata Database Machine (Linux only) (Doc ID 1929629.1)
Risk:
Any ACFS maintenance or configuration change could potentially impact the availability of database files residing on the same disk group as ACFS.
Action / Repair:
To verify (ACFS) file systems do not contain critical database files, run exachk and review the provided report.
The expected output in the exachk report should be as follows:
 
In the "View" detail section of the report for this check the expected output should be similar to:
 
Example of a "FAILURE" message: Output in the exachk report
 
In the "View" detail section of the report for this check a "FAILURE" example will be similar to:
 
If a "FAILURE: ..." message appears, either relocate ACFS to a new dedicated disk group following How to Relocate an ACFS Filesystem to Another Diskgroup in Exadata (Doc ID 2133396.1) MOS note or move the database files out of the ACFS disk group.
NOTE: If after corrective actions are completed, you wish to run just this check manually without a full exachk run, as the "root" userid in the directory where exachk was installed, execute the following:
 
 Verify the ownership and permissions of the "oradism" file
Priority: Critical
Alert Level: FAIL
Date: 07/12/17
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain
Engineered System Platform: ALL
DB/GI Version: 11.2+
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 12.2.0.1.4
Benefit / Impact:
Maintaining the correct ownership and permissions of the "oradism" file is essential for the proper operation of Direct NFS and achieving the highest possible throughput. The file should be owned by the "root" userid and have the setuid bit enabled in the permissions mask. The impact of validating file ownership and permissions is minimal. Changing the file ownership and permissions requires a restart of the Oracle stack running out of the adjusted $ORACLE_HOME.
Risk:
If the ownership and permissions of the "oradism" file are not correct, the performance of Direct NFS will be severely impacted.
Action / Repair:
To verify the ownership and permissions of the "oradism" file, as the appropriate oracle home owner userid on each database server, execute the following command set on each $ORACLE_HOME:
OWNER_USERID=$(ls -l $ORACLE_HOME/bin/oradism |awk '{print $3}')
SETUID_BIT=$(ls -l $ORACLE_HOME/bin/oradism | cut -c4)
DETAIL=$(echo -e "owner userid:\t$OWNER_USERID\nsetuid bit:\t$SETUID_BIT")
if [[ $OWNER_USERID = "root" && $SETUID_BIT = "s" ]]
then
  echo -e "SUCCESS: \"oradism\" file is correctly configured:\n$DETAIL"
else
  echo -e "FAILURE: \"oradism\" file is not correctly configured:\n$DETAIL"
fi
The output should be similar to:
SUCCESS: "oradism" file is correctly configured:
owner userid:   root
setuid bit:     s
Examples of "FAILURE" results:
FAILURE: "oradism" file is not correctly configured:
owner userid:   root
setuid bit:     x

FAILURE: "oradism" file is not correctly configured:
owner userid:   oracle
setuid bit:     x
If the output is a "FAILURE" result, investigate and take corrective action.
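NOTE: As an alternative sketch, the same two conditions (owner is "root", setuid bit is set) can be evaluated from the numeric permission mask reported by "stat" instead of parsing "ls -l" output by character position. The function name "oradism_ok" is an illustrative assumption, and "stat -c" assumes GNU coreutils as shipped on Linux.

```shell
# Return success only if the owner is root and the setuid bit (octal 4000)
# is present in the octal permission mask, for example 4750.
oradism_ok() {
  chk_owner=$1
  chk_mode=$2
  [ "$chk_owner" = "root" ] && [ $(( 0$chk_mode & 04000 )) -ne 0 ]
}

ORADISM=${ORACLE_HOME:-}/bin/oradism
if [ -e "$ORADISM" ]; then
  file_owner=$(stat -c '%U' "$ORADISM")   # owning userid
  file_mode=$(stat -c '%a' "$ORADISM")    # octal mask, e.g. 4750
  if oradism_ok "$file_owner" "$file_mode"; then
    echo "SUCCESS: \"oradism\" owner=$file_owner mode=$file_mode"
  else
    echo "FAILURE: \"oradism\" owner=$file_owner mode=$file_mode"
  fi
fi
```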

Verify the SYSTEM, SYSAUX, USERS and TEMP tablespaces are of type bigfile

Priority: Critical
Alert Level: FAIL
Date: 07/12/17
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain
Engineered System Platform: ALL
DB/GI Version: 11.2+
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 12.2.0.1.4
Benefit / Impact:
Configuring the SYSTEM, SYSAUX, USERS, and TEMP tablespaces to be of type bigfile simplifies maintenance and operations which involve these tablespaces. The impact of verifying the SYSTEM, SYSAUX, USERS, and TEMP tablespaces are of type bigfile is minimal.
Risk:
If the SYSTEM, SYSAUX, USERS, and TEMP tablespaces are not of type bigfile, maintenance operations are more complicated and a tablespace is more likely to run out of free space.
Action / Repair:
To verify the SYSTEM, SYSAUX, USERS, and TEMP tablespaces are of type bigfile, as the ORACLE_HOME owner userid on one database server in the cluster, execute the following command set once for each database running out of a given ORACLE_HOME, with the environment properly configured to access each given database:
BIGFILE_DATA=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set newpage none lines 80 feedback off timing off serveroutput on
SELECT tablespace_name, bigfile FROM dba_tablespaces
WHERE tablespace_name in ('SYSTEM', 'SYSAUX', 'USERS', 'TEMP');
exit
EOF
)
if [ `echo "$BIGFILE_DATA" | grep -ic "NO"` -gt 0 ]
then 
     echo -e "FAILURE: One or more of SYSTEM, SYSAUX, USERS, TEMP tablespaces are not of type bigfile:\n\n$BIGFILE_DATA"
else 
     echo -e "SUCCESS: SYSTEM, SYSAUX, USERS, TEMP tablespaces are of type bigfile:\n\n$BIGFILE_DATA" 
fi
The output should be similar to:
SUCCESS: SYSTEM, SYSAUX, USERS, TEMP tablespaces are of type bigfile:

TABLESPACE_NAME                BIG
------------------------------ ---
SYSTEM                         YES
SYSAUX                         YES
TEMP                           YES
USERS                          YES
Example of a "FAILURE" result:
FAILURE: One or more of SYSTEM, SYSAUX, USERS, TEMP tablespaces are not of type bigfile:

TABLESPACE_NAME                BIG
------------------------------ ---
SYSTEM                         NO
SYSAUX                         NO
TEMP                           NO
USERS                          NO
If the output is a "FAILURE" result, investigate and take corrective action.

Verify the storage servers in use configuration matches across the cluster

Priority: Critical
Alert Level: FAIL
Date: 12/19/18
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain
Engineered System Platform: ALL
Bug(s): Bug 29061438 - exachk, Bug 27541151 - exachk, Bug 26365216 - exachk
DB/GI Version: N/A
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 18.5.0
MAA Scorecard Section: N/A
Benefit / Impact:
Verifying the storage servers in use configuration matches across the cluster can prevent potential issues ranging from impaired performance to a node eviction.
The impact of verifying the storage servers in use configuration matches across the cluster is minimal. The impact of making corrections varies depending upon the root cause of the difference.
Risk:
If the storage servers in use configuration does not match across the cluster, there is risk of impaired performance, node eviction, and perhaps data loss with multiple hardware failures over time.
Action / Repair:
NOTE: This check will only pass if the following are both true:
1) For each database server, the md5sum for the cellip.ora file matches the md5sum from the list of storage servers accessed by kfod.
2) The md5sum from 1) matches across the cluster.
To verify the storage servers in use configuration matches across the cluster, run exachk and review the provided report.
The expected output in the exachk report should be as follows:
In the "Cluster Wide" section of the report, the overall result should be "PASS":
PASS   Cluster Wide Check   The storage servers in use configuration matches across the cluster   Cluster Wide   View
In the "View" detail section of the report for this check the expected output should be similar to:
SUCCESS: The storage servers in use configuration matches:
-
DBSRVR:                 <Host Name>
DBSRVR_CELLIP_MD5SUM:   d2144e88f4249a5d267691b85ed2ae49
DBSRVR_KFOD_MD5SUM:     d2144e88f4249a5d267691b85ed2ae49
DBSRVR_BASE_MD5SUM:     d2144e88f4249a5d267691b85ed2ae49
-
DBSRVR:                 <Host Name>
DBSRVR_CELLIP_MD5SUM:   d2144e88f4249a5d267691b85ed2ae49
DBSRVR_KFOD_MD5SUM:     d2144e88f4249a5d267691b85ed2ae49
DBSRVR_BASE_MD5SUM:     d2144e88f4249a5d267691b85ed2ae49
A "FAILURE" example:
In the "Cluster Wide" section of the report, the overall result will be "FAIL":
FAIL    Cluster Wide Check   The storage servers in use configuration should match across the cluster   Cluster Wide   View
In the "View" detail section of the report for this check a "FAILURE" example will be similar to:
FAILURE: The storage servers in use configuration does not match:
-
DBSRVR:                 randomadm01vm01
DBSRVR_CELLIP_MD5SUM:   acd6ad6d153ea1ec1ecf9a5aa19cf4a7
DBSRVR_KFOD_MD5SUM:     d41d8cd98f00b204e9800998ecf8427e
DBSRVR_BASE_MD5SUM:     acd6ad6d153ea1ec1ecf9a5aa19cf4a7
-
DBSRVR:                 randomadm02vm01
DBSRVR_CELLIP_MD5SUM:   acd6ad6d153ea1ec1ecf9a5aa19cf4a7
DBSRVR_KFOD_MD5SUM:     d41d8cd98f00b204e9800998ecf8427e
DBSRVR_BASE_MD5SUM:     acd6ad6d153ea1ec1ecf9a5aa19cf4a7
NOTE: In the "FAILURE:" example, the md5sum for the results reported from kfod on the running system does not match the cellip.ora md5sum.
If the result is not as expected, investigate for root cause and take appropriate corrective action.
NOTE: If after corrective actions are completed, you wish to run this one check without a full exachk run execute the following command as the "root" userid in the directory in which exachk was installed:
./exachk -check 5D6AC87BF4669BF2E053D498EB0AFC19,5D691B1A8146F67CE053D398EB0A8822
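NOTE: In the "FAILURE" example above, the DBSRVR_KFOD_MD5SUM value d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of zero bytes of input, which suggests that kfod returned an empty storage server list on those database servers. The following minimal sketch illustrates that property and the idea of comparing digests of two lists. The function name "list_md5", the sort-based normalization, and the IP addresses are illustrative assumptions; this is not the exachk implementation.

```shell
# Print the md5sum of a sorted, de-duplicated list read from standard input.
list_md5() {
  sort -u | md5sum | awk '{print $1}'
}

# An empty list produces the MD5 digest of zero bytes.
printf '' | list_md5

# The same (hypothetical) storage server addresses in a different order
# produce the same digest once the list is sorted.
printf '192.168.10.5\n192.168.10.6\n' | list_md5
printf '192.168.10.6\n192.168.10.5\n' | list_md5
```

The first command prints d41d8cd98f00b204e9800998ecf8427e; the second and third print identical values.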

Verify "asm_power_limit" is greater than zero

Priority: Critical
Alert Level: CRITICAL
Date: 07/26/17
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain
Engineered System Platform: ALL
DB/GI Version: 11.2+
DB Type: ASM
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 12.2.0.1.4
Benefit / Impact:
Setting "asm_power_limit=0" disables rebalance operations. Verifying that "asm_power_limit" is greater than zero confirms that rebalance operations are enabled. The impact of verifying that "asm_power_limit" is greater than zero is minimal, as is the impact of setting it to a value greater than zero.
NOTE: Changing the default value via the initialization parameter "asm_power_limit" is not the same as changing the power for an actively running rebalance operation.
Risk:
"asm_power_limit=0" disables rebalance operations, which can lead to data loss in the event of multiple hardware failures over time.
Action / Repair:
To verify "asm_power_limit" is greater than zero, as the grid home owner userid, execute the following command set once for each ASM instance with the environment properly configured to access that given instance:
NOTE: This code will not execute properly if executed on a database server in a flex ASM environment where an ASM instance is not running.
ASMPL_PARAM_DATA=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set newpage none heading off lines 80 feedback off timing off serveroutput on
select value from v\$parameter where name = 'asm_power_limit';
exit
EOF
)
ASMPL_QUEUE_DATA=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set newpage none heading off lines 80 feedback off timing off serveroutput on
select count(*) from gv\$asm_operation where power=0 or actual=0;
exit
EOF
)
if [[ $ASMPL_PARAM_DATA -gt 0 && $ASMPL_QUEUE_DATA -eq 0 ]]
then 
  echo -e "SUCCESS: \"asm_power_limit\" is set to $ASMPL_PARAM_DATA and there are no rebalance operations in gv\$asm_operation with the attribute POWER or ACTUAL = 0"
else 
  echo -e "FAILURE:"
  if [ $ASMPL_PARAM_DATA -eq 0 ]
  then
    echo -e "The initialization parameter \"asm_power_limit\" is set to zero"
  fi
  if [ $ASMPL_QUEUE_DATA -gt 0 ]
  then
    echo -e "There are rebalance operation(s) in gv\$asm_operation with the attribute POWER or ACTUAL = 0"
  fi
fi;
The output should be similar to:
SUCCESS: "asm_power_limit" is set to 32 and there are no rebalance operations in gv$asm_operation with the attribute POWER or ACTUAL = 0
Examples of "FAILURE" results:

FAILURE:
The initialization parameter "asm_power_limit" is set to zero
There are rebalance operation(s) in gv$asm_operation with the attribute POWER or ACTUAL = 0

FAILURE:
The initialization parameter "asm_power_limit" is set to zero

FAILURE:
There are rebalance operation(s) in gv$asm_operation with the attribute POWER or ACTUAL = 0

If the output is a "FAILURE" result, investigate and take corrective action.
Verify the recommended patches for Adaptive features are installed
Priority: Critical
Alert Level: INFO
Date: 06/05/19
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain
Engineered System Platform: Exadata
Bug(s): 29849595 - exachk, 26681554 - exachk
DB/GI Version: 12.1.0.2 only
DB Type: Normal, CDB, PDB
DB Role: Primary, Physical Standby
DB Mode: Open
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 19.3.0
MAA Scorecard Section: N/A
Benefit / Impact:
Adaptive features are a set of capabilities that enable the optimizer to make run-time adjustments to execution plans and to adjust plans for future executions based on the results of previous executions. For Oracle version 12.1.0.2 only, to maximize performance and reliability it is recommended that the default configuration for 12.2.x be used. Installing patches 22652097 and 21171382 configures those defaults.
Risk:
Without patches 22652097 and 21171382 Oracle version 12.1.0.2 may experience poor performance and potential instability.
Action / Repair:
To verify the recommended patches for Adaptive features are installed, as the owner userid of a given Oracle home, and with the environment set to access that Oracle home on each database server, execute the following code set:
opatch_return_code=$($ORACLE_HOME/OPatch/opatch lsinventory -oh $ORACLE_HOME -local >/dev/null 2>&1;echo $?)
if [ $opatch_return_code -eq 0 ]
then
  RAW_LSPATCHES=$($ORACLE_HOME/OPatch/opatch lsinventory -oh $ORACLE_HOME -local -bugs_fixed 2>&1)
else
  RAW_LSPATCHES=$(cat $ORACLE_HOME/inventory/ContentsXML/comps.xml);
fi 
IS_22652097_PRESENT=$(echo "$RAW_LSPATCHES" | grep -wc 22652097)
IS_21171382_PRESENT=$(echo "$RAW_LSPATCHES" | grep -wc 21171382)
if [[ $IS_22652097_PRESENT -eq 1 && $IS_21171382_PRESENT -eq 1 ]]
then
  echo -e "SUCCESS: patches 22652097 and 21171382 are installed in $ORACLE_HOME"
else
  echo -e "INFO: patches 22652097 and 21171382 are not installed in $ORACLE_HOME"
fi
The expected output should be:
SUCCESS: patches 22652097 and 21171382 are installed in /u01/app/oracle/product/12.1.0.2/dbhome_1
Example of an "INFO" result:
INFO: patches 22652097 and 21171382 are not installed in /u01/app/oracle/product/12.1.0.2/dbhome_1
If the output is not as expected, install the recommended patches.

Verify initialization parameter cluster_database_instances is at the default value
Priority: Critical
Alert Level: FAIL
Date: 11/08/2017
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain
Engineered System Platform: ALL
Bug(s): Bug 27055638 - exachk, Bug 26844705 - base
GI/DB Version: < 19.1
DB Type: ALL
DB Role: ALL
DB Mode: OPEN
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 12.2.0.1.4
MAA Scorecard Section: N/A
Benefit / Impact:
cluster_database_instances should not be changed from the default value for performance and stability. The impact of verifying initialization parameter cluster_database_instances is at the default value is minimal. The impact of removing a set value should include a database restart to make sure the change survives database shutdown and startup.
Risk:
If cluster_database_instances is modified from the default, dynamic remastering can be impacted potentially causing poor performance or stability.
Action / Repair:
To verify cluster_database_instances is at the default value, as the owner of the oracle home for a given database and with the environment set to access that database, execute the following command set:
unset ISDEFAULT_VALUE
ISDEFAULT_VALUE=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set head off lines 80 feedback off timing off serveroutput on
select upper(isdefault) from v\$parameter where name ='cluster_database_instances';
exit
EOF
)
if [ $ISDEFAULT_VALUE = "TRUE" ]
then 
     echo -e "SUCCESS: cluster_database_instances is at the default value"
else 
     echo -e "FAILURE: cluster_database_instances should be at the default value: \"isdefault\" column value = "$ISDEFAULT_VALUE"" 
fi;

The expected output should be:
SUCCESS: cluster_database_instances is at the default value
Example of a "FAILURE" result:
FAILURE: cluster_database_instances should be at the default value: "isdefault" column value = FALSE

To correct a failure condition, with the environment properly set to access the target database, reset the cluster_database_instances initialization parameter:
SQL> alter system reset cluster_database_instances scope=spfile sid='*';
Restart the instance and verify that the change survives shutdown and startup.

Verify the database server NVME device configuration

Priority: Critical
Alert Level: FAIL
Date: 11/29/2017
Owner: <Name>
Status: Production
Engineered System: X7-8
Engineered System Platform: Exadata
Bug(s): Bug 27123748 - exachk
DB/GI Version: N/A
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 12.2.0.1.4
MAA Scorecard Section: N/A

Benefit / Impact:
Proper configuration of NVME devices is necessary for reliable and efficient operation of a database server. The impact of verifying the database server NVME device configuration is minimal. The impact of making any required corrections or adjustment varies depending upon the root issue, and cannot be estimated here.
Risk:
An improper NVME device configuration could lead to unreliable operation, poor performance, or impact upgrade operations.
Action / Repair:
NOTE: This check will pass on a database server only if the following are both true:
1) There are four NVME devices discovered.
2) Every device has a status of "normal".
To verify the database server NVME device configuration, as the "root" userid, execute the following code set on each database server:

RAW_OUTPUT=$(dbmcli -e "list physicaldisk attributes name,status")
# Is count correct?
if [ $(echo "$RAW_OUTPUT" | wc -l) -eq 4 ]
then
  COUNT_CORRECT=1
else
  COUNT_CORRECT=0
fi
# Is the status normal?
if [ $(echo "$RAW_OUTPUT" | awk '{print $2}' | grep -icv normal) -eq 0 ]
then
  STATUS_NORMAL=1
else
  STATUS_NORMAL=0
fi
# Analyze:
if [[ $(echo $COUNT_CORRECT) -eq 1 && $(echo $STATUS_NORMAL) -eq 1 ]]
then
  echo "SUCCESS: The NVME device configuration is correct."
else
  echo -e "FAILURE: The NVME device configuration is not correct.\nDetails:\n$RAW_OUTPUT"
fi
The expected output should be:
SUCCESS: The NVME device configuration is correct.
Example of a "FAILURE:" result:
FAILURE: The NVME device configuration is not correct.
Details:
         FLASH_15_1      failed - dropped for replacement
         FLASH_15_2      failed - dropped for replacement
         FLASH_1_1       normal
         FLASH_1_2       normal
NOTE: The "FAILURE:" example is such because two devices have failed and been dropped.
If the output is not as expected, determine the root cause and take appropriate corrective action.
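NOTE: To help isolate which devices need attention, the following minimal sketch lists only the devices whose status is not "normal". The function name "non_normal_devices" is an illustrative assumption, and the sample data is taken from the "FAILURE" example above; on a database server the input would come from the same dbmcli command used in the check.

```shell
# Read "name status" lines on standard input and print the names of
# devices whose status (second field) is not "normal".
non_normal_devices() {
  awk 'NF >= 2 && tolower($2) != "normal" { print $1 }'
}

# Sample data copied from the "FAILURE" example in this section.
SAMPLE='         FLASH_15_1      failed - dropped for replacement
         FLASH_15_2      failed - dropped for replacement
         FLASH_1_1       normal
         FLASH_1_2       normal'

echo "$SAMPLE" | non_normal_devices
```

For the sample data this prints FLASH_15_1 and FLASH_15_2, one per line.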

Verify that Automatic Storage Management Cluster File System (ACFS) uses 4K metadata block size

Priority: Critical
Alert Level: FAIL
Date: 01/17/18
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - User Domain
Engineered System Platform: ALL
Bug(s): Bug 27298631 - exachk, Bug 27403057 - OEDA
DB/GI Version: 12.2.0.1 or higher
DB Type: ASM
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 18.2.0
MAA Scorecard Section: N/A
Benefit / Impact:
Starting with Grid Infrastructure 12.2.0.1, Oracle ACFS supports I/O requests in multiples of 4K logical sector sizes as well as continued support for 512-byte logical sector size I/O requests. The size of the metadata blocks is not set directly, but derived from the logical sector size. Using a 4k metadata block size helps improve performance and stability.
Risk:
On ACFS file systems where the metadata block size is not 4K, applications that frequently access large numbers of files stored on the ACFS file system can experience severely degraded performance, and possibly a storage server outage.
Action / Repair:
To verify that the Automatic Storage Management Cluster File System (ACFS) uses 4K metadata block size, on one database server in the cluster as the owner userid of the Grid home, and with the environment set to access the ASM instance on that database server, execute the following code:

#!/bin/bash
# acfs check metadata block size
# ORACLE_HOME should be the Grid Infrastructure ORACLE_HOME
CRS_HOME=$ORACLE_HOME
NO4KMETABLK=0
ACFSNO4K=()
isacfsused=$(asmcmd volinfo --all|sed -e 's/ //g'|head -n 1)

if [ $isacfsused = 'novolumesfound' ] ; then
   echo -e "ACFS is not used"
   exit 1
fi

version=$(acfsutil info fs|grep 'ACFS Version'|sort -u|awk -F: '{print $2}'|awk -F. '{print $1$2}')

if [ $version -lt 122 ] ; then
   echo -e "WARNING: This check only is valid when GI version is 12.2 or higher"
   exit 1
fi

for  vol in $(acfsutil info fs|egrep 'metadata block size|primary volume'|awk -F: '{print $1"="$2}'|sed -e 's/ //g')
do

   attr=$(echo $vol |awk -F= '{print $1}')
   attrval=$(echo $vol |awk -F= '{print $2}')

   if  [ $attr = 'metadatablocksize' ] ; then
     if  [ $attrval -eq 512 ] ; then
          NO4KMETABLK=1
     else NO4KMETABLK=0
     fi
   elif  [ $attr = 'primaryvolume' ]  &&  [ $NO4KMETABLK -eq 1 ]  ;  then
         ACFSNO4K=(${ACFSNO4K[@]} $attrval)
         NO4KMETABLK=0
   fi
done

if [  ${#ACFSNO4K[@]} -eq 0 ] ; then
   printf "%s \n"  "SUCCESS: ALL the ACFS filesystem are using metadata block size 4096"
else
   printf "%s \n"  "WARNING: There are ACFS filesystem  NOT USING  metadata block size 4096"
   printf "\t %s \n" "The list of the primary volume is: "
   printf "\t %s \n" "${ACFSNO4K[@]}"
   printf "\t %s \n" "To get the complete details of each filesystem, please execute command acfsutil info fs"
fi
The expected output should be:
SUCCESS: ALL the ACFS filesystem are using metadata block size 4096
Example of a "WARNING" result:
WARNING: There are ACFS filesystem  NOT USING  metadata block size 4096 
    The list of the primary volume is:  
    /dev/asm/volume1-399 
    /dev/asm/volume2-399 
    /dev/asm/volume3-399 
    To get the complete details of each filesystem, please execute command acfsutil info fs 
An ACFS file system created using Grid Infrastructure 12.2.0.1 or higher uses a 4K metadata block size by default.
An ACFS file system created using a Grid Infrastructure version before 12.2.0.1 requires reformatting the ACFS volume, using the following steps:
  • Create a backup of the filesystem
  • Deregister (if required) the file system using acfsutil registry -d command
  • Dismount the filesystem
  • Remove the file system using acfsutil rmfs command
  • Reformat the volume using mkfs -t acfs -i 4096 <dev path> command
  • Mount the file system
  • Restore the files
  • Optionally register the file system using acfsutil registry command.
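NOTE: The steps above can be sketched as a command plan for one volume. The following only prints the commands for review and executes nothing, because the sequence destroys the existing file system. The function name "print_acfs_reformat_plan" and the mount point "/acfs01" are illustrative assumptions, the backup and restore methods are site-specific, and file systems managed as Clusterware resources may need to be stopped and started through Clusterware rather than with "umount" and "mount".

```shell
# Print the reformat sequence for one ACFS volume; nothing is executed.
# $1 is the volume device path, $2 is the mount point.
print_acfs_reformat_plan() {
  device=$1
  mountpoint=$2
  cat <<EOF
# Back up the files under $mountpoint (site-specific method)
acfsutil registry -d $mountpoint
umount $mountpoint
acfsutil rmfs $device
mkfs -t acfs -i 4096 $device
mount -t acfs $device $mountpoint
# Restore the files into $mountpoint (site-specific method)
acfsutil registry -a $device $mountpoint
EOF
}

# Device path taken from the example output above; mount point is hypothetical.
print_acfs_reformat_plan /dev/asm/volume1-399 /acfs01
```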

Evaluate Automated Maintenance Tasks configuration

Priority: Critical
Alert Level: WARN
Date: 01/31/18
Owner: <Name>
Status: Development
Engineered System: SSC, Exadata - Physical, Exadata - User Domain
Engineered System Platform: ALL
Bug(s): Bug 27471238 - exachk
DB Version: 11.2 or higher
DB Type: ALL
DB Role: ALL
DB Mode: ALL
Exadata Version: ALL
OS & Version: Linux, Solaris
Validation Tool Version: exachk 18.2.0
MAA Scorecard Section: N/A

Benefit / Impact:
Some automated maintenance tasks are enabled by default with default settings at database creation time. It is recommended that these automated tasks be allowed to run, but that they are reviewed and adjusted if necessary to provide the most benefit for a given environment's workload. Benefits are provided by improving the overall efficiency of an environment, and also from not having the automated maintenance tasks themselves negatively impact the environment's specific workload.
Risk:
Leaving automated maintenance tasks at their default values, or disabling them completely may significantly impact a given environment's specific workload performance.
Action / Repair:
To see basic information on automated maintenance tasks, as the owner of the oracle home for a given database and with the environment set to access that database, execute the following command set:
FORMATTED_OUTPUT=$($ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF
set newpage none head off lines 80 feedback off timing off serveroutput on
select client_name,status from DBA_AUTOTASK_CLIENT;
exit
EOF
)
LINE_COUNT=$(echo "$FORMATTED_OUTPUT" | wc -l)
ENABLED_COUNT=$(echo "$FORMATTED_OUTPUT" | egrep -ic enabled)
if [ $LINE_COUNT -eq $ENABLED_COUNT ]
then 
  echo -e "INFO: all automated maintenance tasks are enabled."
  echo -e "Please review configuration appropriateness for this environment."
else 
  echo -e "WARNING: one or more automated maintenance tasks are not enabled."
  echo -e "Please enable all and review configuration appropriateness for this environment.\nDetails:\n$FORMATTED_OUTPUT" 
fi;
The expected output should be similar to: 
INFO: all automated maintenance tasks are enabled.
Please review configuration appropriateness for this environment.
Example of a "WARNING" result: 
WARNING: one or more automated maintenance tasks are not enabled.
Please enable all and review configuration appropriateness for this environment.
Details:
sql tuning advisor                                               ENABLED
auto optimizer stats collection                                  ENABLED
auto space advisor                                               DISABLED
NOTE:
Oracle recommends that Oracle-supplied automated maintenance tasks be utilized and tuned for each individual database and its associated workload.
For more information, please see:
Database Administrator's Guide, 11g Release 2, Managing Automated Database Maintenance Tasks
Database Administrator's Guide, 12c Release 1, Managing Automated Database Maintenance Tasks
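NOTE: When the check returns a "WARNING", the following minimal sketch extracts the names of the tasks that are not enabled from the same "client_name, status" listing. The function name "disabled_autotasks" is an illustrative assumption, and the sample data is taken from the "WARNING" example above.

```shell
# Read "client_name status" lines on standard input and print the client
# names whose status (last field) is not ENABLED. Client names may
# contain spaces, so everything before the last field is kept.
disabled_autotasks() {
  awk 'NF >= 2 && toupper($NF) != "ENABLED" { $NF = ""; sub(/[ \t]+$/, ""); print }'
}

# Sample data copied from the "WARNING" example in this section.
SAMPLE='sql tuning advisor                                               ENABLED
auto optimizer stats collection                                  ENABLED
auto space advisor                                               DISABLED'

echo "$SAMPLE" | disabled_autotasks
```

For the sample data this prints "auto space advisor". Each name returned can then be reviewed and, if appropriate for the environment, enabled using the DBMS_AUTO_TASK_ADMIN.ENABLE procedure described in the guides referenced above.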

Verify proper ACFS drivers are installed for Spectre v2 mitigation

Priority: Critical
Alert Level: FAIL
Date: 05/08/2018
Owner: <Name>
Status: Production
Engineered System: Exadata - Physical, Exadata - Management Domain
Engineered System Platform: ALL
Bug(s): Bug 27989056 - exachk
DB Version: N/A
DB Type: N/A
DB Role: N/A
DB Mode: N/A
Exadata Version: ALL
OS & Version: Linux
Validation Tool Version: exachk 18.2.0
MAA Scorecard Section: N/A
Benefit / Impact:
On Exadata database servers that have an Exadata version installed that provides mitigation for Spectre v2 vulnerability, proper ACFS drivers or other customer-installed kernel drivers must be installed in order for the proper Spectre v2 mitigation to be used.
The impact of verification is minimal. Installing proper ACFS drivers requires Clusterware restart. The impact of installing proper customer-installed kernel drivers cannot be estimated here.
Risk:
Not using the proper ACFS drivers or other customer-installed kernel drivers can prevent the desired Spectre v2 mitigation, which can lead to reduced performance.
Action / Repair:
To verify proper ACFS drivers are installed for Spectre v2 mitigation, execute the following command set as the "root" userid on all database servers:
#!/bin/bash
# CPU model numbers (/proc/cpuinfo)
# V2:26 X2-2:44 X2-8:46 X2-8M2:47 X3:45 X4:62 X5:63 X6:79 X7:85
modelsUseRetpoline='26|44|46|47|45|62|63|79'
thisModel=$(egrep "^model[[:space:]]*:" /proc/cpuinfo | sort -u | awk '{print $NF}')
# kernels without spectrev2 mitigation will not have this file
if [[ ! -e /sys/devices/system/cpu/vulnerabilities/spectre_v2 ]]; then
 echo "WARNING: System is not capable of Spectre v2 mitigation. See minimum version requirements in MOS document 2356385.1."
else
 v2mitigation=$(</sys/devices/system/cpu/vulnerabilities/spectre_v2)
 # dom0 should use retpoline for all hardware
 # X6 and older should use retpoline
 wantRetpoline=no
 if ( [[ -e /proc/xen/capabilities ]] && grep -q 'control_d' /proc/xen/capabilities ) || \
   echo "$thisModel" | egrep -q "$modelsUseRetpoline"; then
  wantRetpoline=yes
 fi
 if [[ $wantRetpoline == yes ]]; then
  if ! echo $v2mitigation | grep -qi retpoline; then
   echo "FAIL: Spectre v2 mitigation is expected to be retpoline, but is not."
   if dmesg | grep -q 'Disabling Spectre v2 mitigation retpoline'; then
    echo "Spectre v2 mitigation retpoline was disabled after system boot."
    # look for modules not compiled with retpoline
    badmodules=$(dmesg | grep 'loading module not compiled with retpoline compiler' | awk -F '[]:]' '{print $2}' | tr '\012' ' ')
    echo "Modules loaded not compiled with retpoline compiler: $badmodules. These modules must be updated."
    if [[ $badmodules =~ oracleoks ]]; then
     echo "oracleoks module will be updated by installing updated ACFS drivers. See MOS document 2356385.1."
    fi
   fi
  else
   echo "SUCCESS: Spectre v2 mitigation is using $v2mitigation"
  fi
 else
  echo "SUCCESS: Spectre v2 mitigation is using $v2mitigation"
 fi
fi
The expected output is:
SUCCESS: Spectre v2 mitigation is using Mitigation: Full generic retpoline, IBRS_FW, IBPB
-OR-
SUCCESS: Spectre v2 mitigation is using Mitigation: IBRS, IBRS_FW, IBPB
Example of a "WARNING" result:
WARNING: System is not capable of Spectre v2 mitigation.  See minimum version requirements in MOS document 2356385.1.
In the above "WARNING" example, the system should be upgraded per the MOS note.
Example of a "FAIL" result:
FAIL: Spectre v2 mitigation is expected to be retpoline, but is not.
Spectre v2 mitigation retpoline was disabled after system boot.
Modules loaded not compiled with retpoline compiler: oracleoks. These modules must be updated.
oracleoks module will be updated by installing updated ACFS drivers. See MOS document 2356385.1.
In the above "FAIL" example, the system was expected to be using retpoline mitigation for Spectre v2, but was not. The system initially booted with retpoline mitigation, which was then disabled when a kernel module not compiled with a retpoline compiler was loaded.
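The script above tests the CPU model number with a regular expression. A minimal sketch of the same decision using an exact match follows. The function name wants_retpoline is illustrative and not part of the Oracle-supplied script; the model list is copied from modelsUseRetpoline. The sketch covers only the CPU model test, not the dom0 test.

```shell
#!/bin/bash
# Sketch only: exact-match version of the CPU model test.
# wants_retpoline is an illustrative name, not part of the Oracle script.
wants_retpoline() {
  case "$1" in
    26|44|46|47|45|62|63|79) echo yes ;;  # V2 through X6, per the model list above
    *) echo no ;;
  esac
}

# Report the decision for this host when /proc/cpuinfo is readable.
if [[ -r /proc/cpuinfo ]]; then
  model=$(awk -F: '/^model[[:space:]]*:/ { gsub(/[[:space:]]/, "", $2); print $2; exit }' /proc/cpuinfo)
  echo "CPU model: ${model:-unknown}, retpoline expected: $(wants_retpoline "$model")"
fi
```

An exact match avoids treating a longer model number that merely contains one of the listed values as a match.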

Verify Exafusion Memory Lock Configuration

Priority | Alert Level | Date | Owner | Status | Engineered System | Bug(s)
Critical | FAIL | 06/27/18 | <Name> | Production | Exadata - ALL | Bug 23253697 - exachk
DB Version | DB Role | Engineered System Platform | Exadata Version | OS & Version | Validation Tool Version | TBD
ALL | N/A | ALL | ALL | Linux X86-64 | |

Benefit / Impact:
Having memlock set correctly is required for a successful upgrade to releases 12.2 and higher, and also to prevent ORA- errors associated with IPC context initialization. The impact of verifying the Exafusion memory lock configuration is minimal. Following any modifications to the limits.conf settings, a logout/login is required for the OS user to ensure the changes take effect.

NOTE: The memlock settings should be correct according to script recommendations regardless of whether Exafusion is actually being used or not (it is enabled by default in 12.2).
Risk:
Instance startup will fail, and/or clients will fail to connect if memlock settings are insufficient.
Action / Repair:
To verify Exafusion memory lock configuration, on each database server, as the owner userid of each unique RDBMS home, place the following code into a script and execute it.
#!/bin/sh
 
#    DESCRIPTION
#      Parse limits settings under /etc/security and produce a FAILURE if the
#      required memlock settings for Exadata are missing.
#      If non-standard settings are found, produce a FAILURE if the configured
#      limits are below the minimum requirement, else produce a WARNING.
#
#    MODIFIED   (MM/DD/YY)
#    amorimur    04/09/18 - Creation
 
LIMITSFILE=/etc/security/limits.conf
LIMITSDDIR=/etc/security/limits.d
RDBMS_OWNER=$(whoami)
MINLIMIT=32768
TMPFILE=$(mktemp)
SUCCESS=1
DEBUG=0

#
# Parse the given memlock setting string and see if it is satisfactory
#
check_memlock () {
  local L=$*

  # Error if we don't see the correct format
  if [ $(echo "$L" | wc -w) -ne 5 ] ; then
    SUCCESS=0
    echo "FAILURE: Invalid entry found ($L)"
  else

    # The oracle user must have an unlimited limit
    local LUSR=$(echo "$L" | sed 's/\*/all_users/g' | awk '{print $1}')
    if [ $LUSR = $RDBMS_OWNER ] ; then
      local LVAL=$(echo "$L" | awk '{print $4}')
      if [ $LVAL != 'unlimited' ] ; then
        SUCCESS=0
        echo "FAILURE: $RDBMS_OWNER must have an unlimited setting ($L)"
      fi

    # All others must have the minimum limit
    # Even if the limit settings are satisfactory, print a warning for all of these non-standard entries
    else
      local LVAL=$(echo "$L" | awk '{print $4}' | sed "s/unlimited/$MINLIMIT/g")

      if [ $LVAL -lt $MINLIMIT ] ; then
        SUCCESS=0
        echo "FAILURE: Found the following entry with memlock limit less than $MINLIMIT ($L)"
      else
        SUCCESS=0
        echo "WARNING: Found a non-standard memlock limit entry ($L)"
      fi
    fi
  fi
}

# Check the limits.conf file
# See if the file exists & is readable
if [ -r $LIMITSFILE ] ; then
 
  # Generate a reference file
  REFFILE_BASE=$(mktemp)
  cat <<! >> $REFFILE_BASE
* soft memlock $MINLIMIT
* hard memlock $MINLIMIT
$RDBMS_OWNER soft memlock unlimited
$RDBMS_OWNER hard memlock unlimited
!

  # Sort the contents
  REFFILE=$(mktemp)
  sort $REFFILE_BASE > $REFFILE
 
  # Extract the limits.conf settings on this system, exclude comments, and sort (duplicates are ok)
  MYFILE=$(mktemp)
  grep memlock $LIMITSFILE | egrep 'soft|hard' | awk '{print $1, $2, $3, $4}' | grep -v ^# | sort | uniq > $MYFILE
 
  # Find settings missing on this system; missing settings will produce a FAILURE
  comm -23 $REFFILE $MYFILE > $TMPFILE
  if [ -s $TMPFILE ] ; then
    SUCCESS=0
    echo "FAILURE: the following required memlock settings are missing in $LIMITSFILE"
    echo "------"
    cat $TMPFILE
    echo "------"
  fi
 
  # Find non-standard settings on this system
  # A FAILURE is raised when the memlock setting is below the minimum requirement, otherwise a WARNING is raised
  comm -13 $REFFILE $MYFILE > $TMPFILE
  if [ -s $TMPFILE ] ; then
 
    # Parse results one by one
    while read L ; do
      check_memlock "$L file:$LIMITSFILE"
    done < $TMPFILE
  fi
 
  # Debug
  if [ $DEBUG -eq 1 -a $SUCCESS -ne 1 ] ; then
    echo "-----"
    echo "Debug: reference file"
    cat $REFFILE
    echo "-----"
    echo "Debug: local file"
    cat $MYFILE
    echo "-----"
  fi

else
  SUCCESS=0
  echo "FAILURE: Unable to open $LIMITSFILE for reading"
fi
 
#
# Check for memlock settings under limits.d
#
for F in $(grep -rl memlock $LIMITSDDIR/*) ; do
  grep memlock $F | egrep 'soft|hard' | awk '{print $1, $2, $3, $4}' | grep -v ^# > $TMPFILE

  if [ -s $TMPFILE ] ; then

    # Parse results one by one
    while read L ; do
      # Quote the argument so a "*" user field is not expanded by the shell
      check_memlock "$L file:$F"
    done < $TMPFILE
  fi
done
 
# Clean up
rm -rf $MYFILE $REFFILE $TMPFILE $REFFILE_BASE

# Success
if [ $SUCCESS -eq 1 ] ; then
  echo "SUCCESS: Memlock settings meet the Oracle best practices"
fi
The expected output is:
SUCCESS: Memlock settings meet the Oracle best practices
Example of a "FAILURE" result:
FAILURE: the following required memlock settings are missing in /etc/security/limits.conf
------
* hard memlock 32768
oracle hard memlock unlimited
oracle soft memlock unlimited
* soft memlock 32768
------
WARNING: Found a non-standard memlock limit entry (grid hard memlock 237778560 file:/etc/security/limits.conf)
WARNING: Found a non-standard memlock limit entry (grid soft memlock 237778560 file:/etc/security/limits.conf)
FAILURE: oracle must have an unlimited setting (oracle hard memlock 237778560 file:/etc/security/limits.conf)
FAILURE: oracle must have an unlimited setting (oracle soft memlock 237778560 file:/etc/security/limits.conf)
If a "FAILURE" or "WARNING" message appears, make the necessary edits to "/etc/security/limits.conf" and files under "/etc/security/limits.d/" as directed.
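After editing the limits files and logging in again, the limits in effect for the current session can be compared against the same minimum with the shell's ulimit builtin. The following is a sketch only; the function name memlock_status is illustrative, and the default of 32768 mirrors MINLIMIT in the script above.

```shell
#!/bin/bash
# Sketch only: classify a memlock limit as reported by "ulimit -l" (in KB).
# memlock_status is an illustrative name, not part of the Oracle script.
memlock_status() {
  local value=$1 min=${2:-32768}
  if [ "$value" = "unlimited" ]; then
    echo "unlimited"
  elif [ "$value" -ge "$min" ] 2>/dev/null; then
    echo "ok"
  else
    echo "low"
  fi
}

# Show the soft and hard limits for the user running this script.
echo "soft memlock: $(ulimit -S -l) ($(memlock_status "$(ulimit -S -l)"))"
echo "hard memlock: $(ulimit -H -l) ($(memlock_status "$(ulimit -H -l)"))"
```

For the RDBMS home owner, both lines are expected to report "unlimited"; for other users, "ok" indicates the limit is at or above the minimum.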

Verify there are no unhealthy InfiniBand switch sensors

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 08/08/18 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, RA | ALL | Bug 28279223 - exachk
DB Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 18.4.0 | N/A
Benefit / Impact:
For maximum functionality and alert notifications, all InfiniBand switch sensors should be functioning properly. The impact of verifying there are no unhealthy InfiniBand switch sensors is minimal. The impact of correcting failed sensors varies by failed component.
Risk:
InfiniBand switch functionality may be reduced depending upon which components have failed.
Action / Repair:
To verify there are no unhealthy InfiniBand switch sensors, as the "root" userid on each InfiniBand switch execute the following code set:
RAW_OUTPUT=$(/usr/local/bin/showunhealthy)
if [ $(echo "$RAW_OUTPUT" | egrep -ic "WARNING|FAILURE") -eq 0 ]
then
  echo -e "SUCCESS: there are no unhealthy InfiniBand switch sensors"
else
  echo -e "FAILURE: there are one or more unhealthy InfiniBand switch sensors.  Details:\n\n$RAW_OUTPUT"
fi
The expected output is the following:
SUCCESS: there are no unhealthy InfiniBand switch sensors
Example of a FAIL result:
FAILURE: there are one or more unhealthy InfiniBand switch sensors.  Details:

WARNING PSU 1 present AC Loss
FAILURE - 1 sensors NOT OK
Corrective actions vary depending upon the failed component. Refer to the appropriate switch documentation, and if necessary open an SR for assistance.

Refer to MOS 1682501.1 if non-Exadata components are in use on the InfiniBand fabric

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | WARN | 09/05/18 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, RA | ALL | Bug 28108851 - exachk
DB Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 18.4.0 | N/A
Benefit / Impact:
If non-Exadata components are in use on the same InfiniBand fabric as an Exadata environment, then there are additional configuration considerations between the components. Verifying these additional considerations helps to ensure the InfiniBand fabric is stable and performs well.
Risk:
Not referring to MOS 1682501.1 can result in potential InfiniBand fabric instability and poor performance which may cause components in the Exadata environment to crash. Problems during patching can also occur.
Action / Repair:
To determine if non-Exadata components are discovered on the InfiniBand fabric execute the following code set as the "root" userid on one database server in the Exadata environment:
unset NONEXADATA_OUTPUT
VT_OUTPUT=$(/opt/oracle.SupportTools/ibdiagtools/verify-topology)
DETECTED_LINE_NUMBER=$(echo "$VT_OUTPUT" | egrep -ni "detected and ignored" | cut -d":" -f1)
ENDING_LINE_NUMBER=$(echo "$VT_OUTPUT" | wc -l)
SPAN=$(expr $ENDING_LINE_NUMBER - $DETECTED_LINE_NUMBER)
NONEXADATA_OUTPUT=$(echo "$VT_OUTPUT" | egrep -i "detected and ignored" -A $SPAN | grep -v "^Detected")
if [ -z "$NONEXADATA_OUTPUT" ]
then
  echo -e "SUCCESS: There were no non-Exadata InfiniBand components discovered."
else
  echo -e "WARNING: One or more non-Exadata InfiniBand components were discovered:\n\n$NONEXADATA_OUTPUT"
fi

The expected output is the following:

SUCCESS: There were no non-Exadata InfiniBand components discovered.
Example of a "WARNING" result:

WARNING: One or more non-Exadata InfiniBand components were discovered:

Ca      : 0x0010e0605308c000 ports 2 "SUN IB QDR GW switch <host>-sw-ib2  Bridge 0"
Ca      : 0x0010e0605308c040 ports 2 "SUN IB QDR GW switch <host>-sw-ib2  Bridge 1"
Ca      : 0x0010e00001757140 ports 2 "<host>-bda10-adm BDA xx.xx.xx.200 HCA-1"
Ca      : 0x0010e0000178e640 ports 2 "<host>-bda09-adm BDA xx.xx.xx.199 HCA-1"
Ca      : 0x0010e0000187b6e8 ports 2 "<host>-bda12 BDA 192.168.43.12 HCA-1"
Ca      : 0x0010e00001757ad0 ports 2 "<host>-bda11-adm BDA xx.xx.xx.201 HCA-1"
Ca      : 0x0010e00001878808 ports 2 "<host>-bda13 BDA 192.168.43.13 HCA-1"
Ca      : 0x0010e000017723d0 ports 2 "<host>-bda14 BDA 192.168.43.14 HCA-1"
Ca      : 0x0010e0000187a638 ports 2 "<host>-bda15 BDA 192.168.43.15 HCA-1"
Ca      : 0x0010e00001757050 ports 2 "<host>-bda16 BDA 192.168.43.16 HCA-1"
Ca      : 0x0010e00001757090 ports 2 "<host>-bda17 BDA 192.168.43.17 HCA-1"
Ca      : 0x0010e0000178e5f0 ports 2 "<host>-bda18-adm BDA xx.xx.xx.152 HCA-1"
Ca      : 0x0010e0000178e600 ports 2 "<host>-bda08-adm BDA xx.xx.xx.198 HCA-1"
Ca      : 0x0010e00001756fa0 ports 2 "<host>-bda07-adm BDA 192.168.43.7 HCA-1"
Ca      : 0x0010e00001757070 ports 2 "<host>-bda05-adm BDA xx.xx.xx.195 HCA-1"
Ca      : 0x0010e000017573d0 ports 2 "<host>-bda06-adm BDA xx.xx.xx.196 HCA-1"
Ca      : 0x0010e000017572a0 ports 2 "<host>-bda03-adm BDA xx.xx.xx.193 HCA-1"
Ca      : 0x0010e00001756fb0 ports 2 "<host>-bda04-adm BDA xx.xx.xx.194 HCA-1"
Ca      : 0x0010e0000174f0e0 ports 2 "<host>-bda01-adm BDA xx.xx.xx.191 HCA-1"
Ca      : 0x0010e0000174e170 ports 2 "<host>-bda02-adm BDA xx.xx.xx.192 HCA-1"
Ca      : 0x0010e0602e08c000 ports 2 "SUN IB QDR GW switch <host>-sw-ib3  Bridge 0"
Ca      : 0x0010e0602e08c040 ports 2 "SUN IB QDR GW switch <host>-sw-ib3  Bridge 1"
If a "WARNING" result is returned, please refer to: Setting up the Subnet Manager in a multi-rack cabling configuration containing Exalogic/Big Data Appliance and Exadata/SuperCluster (Doc ID 1682501.1)
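The extraction step in the code set above relies on line-number arithmetic, which produces shell errors on stderr when the "detected and ignored" line is absent. A sketch of an equivalent filter using awk follows. The function name non_exadata_lines is illustrative; the only assumption about the verify-topology output is the one the original code set already makes, namely that the non-Exadata entries follow a line containing "detected and ignored".

```shell
#!/bin/bash
# Sketch only: print the lines that follow the "detected and ignored" marker,
# skipping blank lines and lines starting with "Detected".
# non_exadata_lines is an illustrative name, not part of the Oracle code set.
non_exadata_lines() {
  awk '
    found && NF && !/^Detected/ { print }
    tolower($0) ~ /detected and ignored/ { found = 1 }
  '
}

# Usage (as root on one database server):
#   /opt/oracle.SupportTools/ibdiagtools/verify-topology | non_exadata_lines
```

Empty output corresponds to the "SUCCESS" case; any output corresponds to the "WARNING" case.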


Verify the ib_sdp module is not loaded into the kernel

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 02/20/19 | <Name> | Production | Exadata - Physical, Exadata - Management Domain, RA | ALL | Bug 29157366 - exachk
DB Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 19.1.0 | N/A
 
Benefit / Impact:
 
The Sockets Direct Protocol (SDP), developed by the OpenFabrics Enterprise Distribution (OFED) group and Mellanox, is no longer supported. There are open issues with SDP and operating system stability that will not be resolved.
 
For performance and stability, the ib_sdp module should not be loaded into the kernel. The impact of verifying the ib_sdp module is not loaded into the kernel is minimal. Modifying a system to not load the ib_sdp module requires a reboot.
 

 
NOTE: for Exadata versions 12.2.0.0.0 or greater, the ib_sdp module should not be loaded into the kernel.
 
NOTE: for Exadata versions 12.1.x.x.x or lower, it is recommended the ib_sdp module not be loaded into the kernel. However, if the ib_sdp module is loaded against this recommendation, then the option "sdp_apm_enable" must be set to "0". While the original Automatic Path Migration (APM) issue was reported when Exalogic application servers were accessing an Oracle Exadata Database Machine using SDP, ANY client requesting a connection using SDP with APM enabled to an Oracle Exadata Database Machine will eventually cause the connection to hang on the database server.
 
Risk:
 
System instability, poor performance, and potential node evictions are likely if the ib_sdp module is loaded into the kernel.
 

 
Action / Repair:
 
To verify the ib_sdp module is not loaded, as the "root" userid on each database server execute the following code set:
 
EXADATA_VERSION=$(imageinfo -version | cut -d"." -f1-5 | tr -d .)
LSMOD_DATA=$(/sbin/lsmod | egrep -i ^ib_sdp)
if [[ $EXADATA_VERSION -ge 122000 ]]
then
  if [ -z "$LSMOD_DATA" ] 
  then
    echo -e "SUCCESS: The ib_sdp module is not loaded into the kernel"
  else
    echo -e "FAILURE: The ib_sdp module is loaded into the kernel.  Details:\n$LSMOD_DATA"
  fi
else
  if [ -z "$LSMOD_DATA" ] 
  then
    echo "SUCCESS: The ib_sdp module is not loaded into the kernel"
  else
    CODE_LINE=$(echo $EXADATA_VERSION | cut -c1-2)
    KERNEL_TYPE=$(uname -r | cut -d"." -f6)
    if [ $KERNEL_TYPE = "el5uek" ]
    then
      IB_SDP_FILE="/etc/modprobe.conf"
    elif [ $KERNEL_TYPE = "el6uek" ]
    then
      IB_SDP_FILE="/etc/modprobe.d/ib_sdp.conf"
    else
      echo -e "ERROR: unable to determine IB_SDP_FILE: $KERNEL_TYPE"
    fi
    IB_SDP_FILE_OUTPUT=$(egrep "ib_sdp" $IB_SDP_FILE)
    if [ -s /sys/module/ib_sdp/parameters/sdp_apm_enable ]
    then 
      IB_SDP_KERNEL_OUTPUT_RSLT=$(cat /sys/module/ib_sdp/parameters/sdp_apm_enable)
    else
      IB_SDP_KERNEL_OUTPUT_RSLT="/sys/module/ib_sdp/parameters/sdp_apm_enable not found"
    fi
    if [[ $CODE_LINE -eq 11 && $EXADATA_VERSION -lt 112331 || $CODE_LINE -eq 12 && $EXADATA_VERSION -lt 121111 ]]
    then
      if [ $(echo "$IB_SDP_FILE_OUTPUT" | egrep "sdp_apm_enable*.=0" | wc -l) -eq 1 ]    
        then
          IB_SDP_FILE_OUTPUT_RSLT=0   
        fi
      if [[ "$IB_SDP_FILE_OUTPUT_RSLT" = 0 && "$IB_SDP_KERNEL_OUTPUT_RSLT" = 0 ]]
      then 
        echo -e "SUCCESS: ib_sdp is loaded and sdp_apm_enable is set to 0 in $IB_SDP_FILE and running kernel."
        echo -e "$IB_SDP_FILE:  $IB_SDP_FILE_OUTPUT"
        echo -e "Running Kernel:  $IB_SDP_KERNEL_OUTPUT_RSLT"
      else 
        echo -e "FAILURE: ib_sdp is loaded and sdp_apm_enable should be set to 0 in $IB_SDP_FILE and running kernel."
        echo -e "$IB_SDP_FILE: $IB_SDP_FILE_OUTPUT"
        echo -e "Running Kernel:  $IB_SDP_KERNEL_OUTPUT_RSLT"
      fi
    else
      if [ $(echo "$IB_SDP_FILE_OUTPUT" | egrep "sdp_apm_enable*.=0" | wc -l) -eq 0 ]    
        then
          IB_SDP_FILE_OUTPUT_RSLT=0  
      fi
      if [[ "$IB_SDP_FILE_OUTPUT_RSLT" = 0 && "$IB_SDP_KERNEL_OUTPUT_RSLT" = 0 ]]
      then 
        echo -e "SUCCESS: ib_sdp is loaded and sdp_apm_enable is not set in $IB_SDP_FILE and is set to "0" in the running kernel."
        echo -e "$IB_SDP_FILE:  $IB_SDP_FILE_OUTPUT"
        echo -e "Running Kernel:  $IB_SDP_KERNEL_OUTPUT_RSLT"
      else 
        echo -e "FAILURE: ib_sdp is loaded and sdp_apm_enable should not be set in $IB_SDP_FILE and should be "0" in the running kernel."
        echo -e "$IB_SDP_FILE: $IB_SDP_FILE_OUTPUT"
        echo -e "Running Kernel:  $IB_SDP_KERNEL_OUTPUT_RSLT"
      fi
    fi
  fi
fi
The expected output is the following:
SUCCESS: The ib_sdp module is not loaded into the kernel
Example of a FAIL result:
FAILURE: ib_sdp is loaded and sdp_apm_enable should be set to 0 in /etc/modprobe.conf and running kernel.
/etc/modprobe.conf: 
Running Kernel:  0
 
NOTE: To correct a "FAILURE" result, place the text "SDP_LOAD=no" into the file "/etc/rdma/rdma.conf" and reboot the database server. 
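The correction described in the NOTE above can be sketched as a small function that replaces any existing SDP_LOAD line rather than appending a second one. The function name set_sdp_load_no is illustrative only; the file path and the setting are taken from the NOTE. Review the file and take a backup before changing it on a production server.

```shell
#!/bin/bash
# Sketch only: ensure the given rdma.conf contains exactly one "SDP_LOAD=no" line.
# set_sdp_load_no is an illustrative name, not an Oracle-supplied utility.
set_sdp_load_no() {
  local conf=$1 tmp
  [ -f "$conf" ] || { echo "ERROR: $conf not found" >&2; return 1; }
  tmp=$(mktemp) || return 1
  # Keep every line except an existing SDP_LOAD setting, then add the new one.
  grep -v '^SDP_LOAD=' "$conf" > "$tmp"
  echo 'SDP_LOAD=no' >> "$tmp"
  # Write back in place so the file keeps its ownership and permissions.
  cat "$tmp" > "$conf"
  rm -f "$tmp"
}

# Usage (as root, followed by a reboot of the database server):
#   cp -p /etc/rdma/rdma.conf /etc/rdma/rdma.conf.bak
#   set_sdp_load_no /etc/rdma/rdma.conf
```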
Verify all voting disks are online
Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 05/29/19 | Vern Wagman | Production | Exadata - Physical, Exadata - User Domain | ALL | 29779386 - exachk
DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
11.2.0.4 or higher | ASM | N/A | N/A | N/A | Linux | exachk 19.3.0 | N/A
Benefit / Impact:
Voting disks help ensure a stable cluster. The impact of verifying all voting disks are online is minimal. The impact of bringing a given voting disk back online depends upon the reason why it went offline, and cannot be estimated here.
Risk:
Not having all expected voting disks online increases the risk of node eviction or cluster crash.
Action / Repair:
To verify all voting disks are online, as the grid home owner userid, and with CRS_HOME and SID set to access the ASM instance, execute the following code on one database server in the cluster:
VOTEDISK_OUTPUT=$($CRS_HOME/bin/crsctl query css votedisk)
LOCATED_COUNT=$(echo "$VOTEDISK_OUTPUT" | egrep "^Located" | cut -d" " -f2)
ONLINE_COUNT=$(echo "$VOTEDISK_OUTPUT" | egrep -c ONLINE)
if [ "$LOCATED_COUNT" -eq "$ONLINE_COUNT" ]
then
  echo -e "SUCCESS: all voting disks are online."
else
  echo -e "FAILURE: not all voting disks are online.\nDETAILS:\n$VOTEDISK_OUTPUT"
fi
The expected output should be: 
SUCCESS: all voting disks are online.
Example of a "FAILURE" case: 
FAILURE: not all voting disks are online.
DETAILS:
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   a07c741f08194f71bf7f4d14c7d67a15 (/dev/exadata_quorum/QD_DATAC1_RANDOM05ADM05) [DATAC1]
 2. ONLINE   d1327820402f4f2fbffca97cbdef72d7 (/dev/exadata_quorum/QD_DATAC1_RANDOM05ADM06) [DATAC1]
 3. ONLINE   748b53cfb1a64f6cbff0f71de2de89b3 (o/192.168.22.171;192.168.22.172/DATAC1_FD_05_random05celadm07) [DATAC1]
 4. ONLINE   5fbc672724094f82bfcd4ea220ab824a (o/192.168.22.173;192.168.22.174/DATAC1_FD_05_random05celadm08) [DATAC1]
 5. OFFLINE   e9efd3be40ad4f64bfd034233f3e37d3 (o/192.168.22.175;192.168.22.176/DATAC1_FD_05_random05celadm09) [DATAC1]
If a "FAILURE" result is returned, investigate to determine root cause and take appropriate corrective action.
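When investigating a "FAILURE" result, it can help to list only the voting disks that are not online. The following sketch filters the "crsctl query css votedisk" output for that purpose. The function name offline_votedisks is illustrative, and the column layout is assumed to match the example output above.

```shell
#!/bin/bash
# Sketch only: print state, file name, and disk group for each voting disk
# whose STATE column is not ONLINE.
# offline_votedisks is an illustrative name, not an Oracle-supplied utility.
offline_votedisks() {
  awk '$1 ~ /^[0-9]+\.$/ && $2 != "ONLINE" { print $2, $4, $5 }'
}

# Usage (as the grid home owner, with CRS_HOME set):
#   $CRS_HOME/bin/crsctl query css votedisk | offline_votedisks
```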
Verify available ksplice fixes are installed
Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 08/14/19 | Doug Utzig | Production | Exadata - Physical, Exadata - Management Domain, Exadata - User Domain, RA | ALL | 30185190 - exachk
DB Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
ALL | ALL | ALL | ALL | >=12.2.1.1.4, >=18.1.2.0.0 | Linux | exachk 19.3.0 | N/A
Benefit / Impact:
On Exadata systems some Oracle Linux operating system updates are delivered via ksplice. All available ksplice updates should be installed to ensure issues fixed in the installed Exadata release are not encountered.
Risk:
Not having all available ksplice updates installed can lead to unexpected behavior caused by encountering issues that are expected to be fixed in the installed Exadata release. The risk of checking that all available ksplice updates are installed is minimal.
Action / Repair:
To verify all available ksplice updates are installed run the following command set as the root user on each storage and database server in the cluster:
The expected output is the following:
-- OR --
Example of a FAIL result:
If there are available ksplice updates not installed then run uptrack-install as the root user, as follows:
 

Benefit/Impact:
The Flash 20 card supports ESM lifetime to enable proactive replacement before failure.
The impact of verifying that the ESM lifetime is within specification is minimal. Replacing an ESM requires a storage server outage. The database and application may remain available if the appropriate grid disks are properly inactivated before and activated after the storage server outage. Refer to MOS Note 1188080.1 and "Shutting Down Exadata Storage Server" in Chapter 7 of "Oracle® Exadata Database Machine Owner's Guide 11g Release 2 (11.2) E13874-14" for additional details.
Risk:
Failure of the ESM will put the Flash 20 card in WriteThrough mode which has a high impact on performance.
Action/Repair:
To verify the ESM lifetime value, use the following command on the storage servers:
for RISER in RISER1/PCIE1 RISER1/PCIE4 RISER2/PCIE2 RISER2/PCIE5; do ipmitool sunoem cli "show /SYS/MB/$RISER/F20CARD/UPTIME"; done | grep value -A4

The output will be similar to:
 value = 3382.350 Hours
 upper_nonrecov_threshold = 17500.000 Hours
 upper_critical_threshold = 17200.000 Hours
 upper_noncritical_threshold = 16800.000 Hours
 lower_noncritical_threshold = N/A
 -- <output truncated>

If the "value" reported exceeds the "upper_noncritical_threshold" reported, schedule a replacement of the relevant ESM.
NOTE: There is a bug in ILOM firmware version 3.0.9.19.a which may report "Invalid target..." for "RISER1/PCIE4". If that happens, consult your site maintenance records to verify the age of the ESM module.

NOTE: For Aura II (F20 M2) cards, the CPLD reports the End of Life indication on the F20 M2 cards, so the thresholds for UPTIME sensor are not needed. The threshold values are replaced with "N/A". The ILOM will fault the system when it's time to replace the F20 M2's ESM. Beginning with 2.1.3, exachk does not execute this check on F20 M2 cards. Beginning with 2.1.5, exachk posts a message in the html report detail that the card is an F20M2 model and the check is not applicable.
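The comparison of "value" against "upper_noncritical_threshold" described above can be sketched as a filter over the command output. The function name esm_check is illustrative, and the field layout is assumed to match the sample output shown above. A threshold of "N/A", as reported for F20 M2 cards, is treated as not applicable.

```shell
#!/bin/bash
# Sketch only: compare each reported ESM "value" with its
# upper_noncritical_threshold and print one result line per card.
# esm_check is an illustrative name, not an Oracle-supplied utility.
esm_check() {
  awk -F'=' '
    /^[[:space:]]*value[[:space:]]*=/ { v = $2 + 0 }
    /upper_noncritical_threshold/ {
      t = $2
      gsub(/^[[:space:]]+|[[:space:]]+$/, "", t)
      if (t == "N/A")
        print "N/A: no threshold reported"
      else if (v > (t + 0))
        print "REPLACE: " v " hours exceeds threshold " (t + 0)
      else
        print "OK: " v " hours is within threshold " (t + 0)
    }'
}

# Usage (on a storage server, piping the command shown above):
#   for RISER in RISER1/PCIE1 RISER1/PCIE4 RISER2/PCIE2 RISER2/PCIE5; do
#     ipmitool sunoem cli "show /SYS/MB/$RISER/F20CARD/UPTIME"
#   done | grep value -A4 | esm_check
```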


Verify Database Server Disk Controller Configuration (ARCHIVE)


Archive Date: 10/01/12
Archive Reason: Beginning with 11.2.3.2.0 the configuration of the database server disk drives was changed to have all available disk drives in a RAID-5 configuration with no hot spare.

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8 | Linux | 11.2.x + | 11.2.x +

Benefit / Impact:
For X2-2, there are 4 disk drives in a database server controlled by an LSI MegaRAID SAS 9261-8i disk controller. The disks are configured RAID-5 with 3 disks in the RAID set and 1 disk as a hot spare. There is 1 virtual drive created across the RAID set. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
For X2-8, there are 8 disk drives in a database server controlled by an LSI MegaRAID SAS 9261-8i disk controller. The disks are configured RAID-5 with 7 disks in the RAID set and 1 disk as a hot spare. There is 1 virtual drive created across the RAID set. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
The impact of validating the RAID devices is minimal. The impact of corrective actions will vary depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.
Risk:
Not verifying the RAID devices increases the chance of a performance degradation or an outage.
Action / Repair:
To verify the database server disk controller configuration, use the following command:
/opt/MegaRAID/MegaCli/MegaCli64 AdpAllInfo -aALL | grep "Device Present" -A 8 

For X2-2, the output will be similar to:

 Device Present
 ================
 Virtual Drives : 1
 Degraded : 0
 Offline : 0
 Physical Devices : 5
 Disks : 4
 Critical Disks : 0
 Failed Disks : 0 

The expected output is 1 virtual drive, none degraded or offline, 5 physical devices (controller + 4 disks), 4 disks, and no critical or failed disks.
For X2-8, the output will be similar to:
 Device Present
 ================
 Virtual Drives : 1
 Degraded : 0
 Offline : 0
 Physical Devices : 11
 Disks : 8
 Critical Disks : 0
 Failed Disks : 0 

The expected output is 1 virtual drive, none degraded or offline, 11 physical devices (1 controller + 8 disks + 2 SAS2 expansion ports), 8 disks, and no critical or failed disks.
On X2-8, there is a SAS2 expander on each NEM, which takes in the 8 ports from the Niwot REM and expands them out to both the 8 physical drive slots through the midplane and the 2 external SAS2 expansion ports on each NEM. See output below from the MegaRAID FW event log.
If the reported output differs, investigate and correct the condition.
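The "none degraded or offline, and no critical or failed disks" part of the expected output can be checked with a small filter. This is a sketch only: the function name raid_summary_ok is illustrative, and it assumes the "Device Present" layout shown above. It does not check the virtual drive, physical device, or disk counts, which differ between X2-2 and X2-8.

```shell
#!/bin/bash
# Sketch only: fail if any Degraded, Offline, Critical Disks, or Failed Disks
# counter in the "Device Present" summary is non-zero.
# raid_summary_ok is an illustrative name, not an Oracle-supplied utility.
raid_summary_ok() {
  awk -F':' '
    /Degraded|Offline|Critical Disks|Failed Disks/ {
      if ($2 + 0 != 0) bad = bad $0 "\n"
    }
    END {
      if (bad) { printf "FAILURE:\n%s", bad; exit 1 }
      print "SUCCESS: no degraded, offline, critical, or failed devices"
    }'
}

# Usage:
#   /opt/MegaRAID/MegaCli/MegaCli64 AdpAllInfo -aALL | grep "Device Present" -A 8 | raid_summary_ok
```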

Verify Database Server Virtual Drive Configuration (ARCHIVE)


Archive Date: 10/01/12
Archive Reason: Beginning with 11.2.3.2.0 the configuration of the database server disk drives was changed to have all available disk drives in a RAID-5 configuration with no hot spare.

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8 | Linux | 11.2.x + | 11.2.x +

Benefit / Impact:
For X2-2, there are 4 disk drives in a database server controlled by an LSI MegaRAID SAS 9261-8i disk controller. The disks are configured RAID-5 with 3 disks in the RAID set and 1 disk as a hot spare. There is 1 virtual drive created across the RAID set. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
For X2-8, there are 8 disk drives in a database server controlled by an LSI MegaRAID SAS 9261-8i disk controller. The disks are configured RAID-5 with 7 disks in the RAID set and 1 disk as a hot spare. There is 1 virtual drive created across the RAID set. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
The impact of validating the virtual drives is minimal. The impact of corrective actions will vary depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.
Risk:
Not verifying the virtual drives increases the chance of a performance degradation or an outage.
Action / Repair:
To verify the database server virtual drive configuration, use the following command:
/opt/MegaRAID/MegaCli/MegaCli64 CfgDsply -aALL | grep "Virtual Drive:";/opt/MegaRAID/MegaCli/MegaCli64 CfgDsply -aALL | grep "Number Of Drives";/opt/MegaRAID/MegaCli/MegaCli64 CfgDsply -aALL | grep "^State" 

For X2-2 the output should be similar to:

Virtual Drive: 0 (Target Id: 0)
Number Of Drives : 3
State : Optimal

The expected result is that the virtual device has 3 drives and a state of optimal.
For X2-8, the output should be similar to:
Virtual Drive: 0 (Target Id: 0) 
Number Of Drives : 7 
State : Optimal 

The expected result is that the virtual device has 7 drives and a state of optimal.
If the reported output differs, investigate and correct the condition.
NOTE: The virtual device number reported may vary depending upon configuration and version levels.
NOTE: If a bare metal restore procedure is performed on a database server without using the "dualboot=no" configuration, that database server may be left with three virtual devices for X2-2 and 7 for X2-8. Please see My Oracle Support note 1323309.1 for additional information and correction instructions.

Verify Database Server Physical Drive Configuration (ARCHIVE)


Archive Date: 10/01/12
Archive Reason: Beginning with 11.2.3.2.0 the configuration of the database server disk drives was changed to have all available disk drives in a RAID-5 configuration with no hot spare.

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8 | Linux | 11.2.x + | 11.2.x +

Benefit / Impact:
For X2-2, there are 4 disk drives in a database server controlled by an LSI MegaRAID SAS 9261-8i disk controller. The disks are configured RAID-5 with 3 disks in the RAID set and 1 disk as a hot spare. There is 1 virtual drive created across the RAID set. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
For X2-8, there are 8 disk drives in a database server controlled by an LSI MegaRAID SAS 9261-8i disk controller. The disks are configured RAID-5 with 7 disks in the RAID set and 1 disk as a hot spare. There is 1 virtual drive created across the RAID set. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
The impact of validating the physical drives is minimal. The impact of corrective actions will vary depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.
Risk:
Not verifying the physical drives increases the chance of a performance degradation or an outage.
Action / Repair:
To verify the database server physical drive configuration, use the following command:
/opt/MegaRAID/MegaCli/MegaCli64 PDList -aALL | grep "Firmware state"

The output for X2-2 will be similar to:
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Hotspare, Spun down

There should be three lines of output showing a state of "Online, Spun Up", and one line showing a state of "Hotspare, Spun down". The ordering of the output lines is not significant and may vary based upon a given database server's physical drive replacement history.
The output for X2-8 will be similar to:
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Hotspare, Spun down

There should be seven lines of output showing a state of "Online, Spun Up", and one line showing a state of "Hotspare, Spun down". The ordering of the output lines is not significant and may vary based upon a given database server's physical drive replacement history.
If the reported output differs, investigate and correct the condition.
NOTE: Modified 03/21/12
Occasionally in normal operation, the "Hotspare" physical drive may be brought to a state of "Online, Spun Up". Thirty minutes (default) after the operation that brought the drive to "Online, Spun Up" has completed, the drive should spin down due to the power-saving feature. It is harmless for the drive to be "Online, Spun Up" if no other errors are reported in the disk drive configuration checks.

For additional information, please reference My Oracle Support note "Exadata: Hot Spares Not Spinning Down (Doc ID 1403613.1)"
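The line counts described in this check can be compared mechanically. The following is a minimal sketch, not part of the original note: the function name "check_pd_states" and its PASS/REVIEW wording are hypothetical, and it reads the "Firmware state" lines on standard input. A hot spare that is temporarily reported as "Online, Spun Up" (the case covered by the NOTE above) is flagged for manual review rather than treated as a failure.

```shell
# Hypothetical helper (not from the original note).
# $1 = expected number of "Online, Spun Up" drives
# $2 = expected number of hot spares
# stdin = output of: MegaCli64 PDList -aALL | grep "Firmware state"
check_pd_states() {
    local expected_online="$1" expected_spare="$2"
    local input online spare
    input=$(cat)
    # Count drives in each state; the spin state of the hot spare is ignored.
    online=$(printf '%s\n' "$input" | grep -c 'Firmware state: Online, Spun Up')
    spare=$(printf '%s\n' "$input" | grep -c 'Firmware state: Hotspare')
    if [ "$online" -eq "$expected_online" ] && [ "$spare" -eq "$expected_spare" ]; then
        echo "PASS: $online online, $spare hot spare"
    else
        echo "REVIEW: $online online, $spare hot spare (expected $expected_online and $expected_spare)"
    fi
}
```

On an X2-2 database server the command output would be piped into "check_pd_states 3 1", and on an X2-8 into "check_pd_states 7 1".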

Verify Peripheral Component Interconnect (PCI) Bridges are Configured for Generation II on Storage Servers (ARCHIVE)


Archive Date: 10/24/12
Archive Reason: Beginning with the X4270 M3 storage servers shipped with the X3-2 and X3-8 database machines, there is a different PCI architecture and this issue is not relevant to the new hardware.

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 09/13/11 | X2-2(4170), X2-2, X2-8 | Linux, Solaris | 11.2.x + | 11.2.x +


Benefit / Impact:
The storage server PCI bridges (19:0.0 and 27:0.0) should be configured for generation II for maximum performance.
There is minimal impact to verify the PCI Bridges configuration.
Risk:
If the PCI bridges are not configured for generation II, performance will be sub-optimal.
Action / Repair:
To verify the current PCI bridges configuration, execute the following command as the root userid on all storage servers:
for BUS_NUM in 19:0.0 27:0.0; do echo $BUS_NUM `lspci -xxx -s $BUS_NUM | grep ^50 | cut -d" " -f4`; done

The output should be similar to:
19:0.0 82
27:0.0 82

If any of the storage server PCI bridges do not return "82", there are three possible corrective actions:
If the value returned is "81", you may upgrade to Exadata storage server software version 11.2.2.4.0 or greater, or refer to MOS note 1351559.1.
If neither the value "81" nor "82" is returned, contact Oracle Support for further assistance.
NOTE: PCI Bridge generation I will return the value "81".
[NOTE: INTERNAL ONLY - manual instructions are also listed in exachk bug 12756149.]
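The mapping from the reported value to the corrective action can be sketched as a small helper. This is an illustration only: the function names "classify_pci_bridge" and "report_pci_bridges" and the result labels are hypothetical. It assumes input in the "BUS VALUE" form printed by the loop in this section.

```shell
# Hypothetical helper: maps the value reported for one bridge to the
# action described in this check.
classify_pci_bridge() {
    case "$1" in
        82) echo "OK generation II" ;;
        81) echo "UPGRADE generation I (see MOS note 1351559.1)" ;;
        *)  echo "CONTACT_SUPPORT unexpected value" ;;
    esac
}

# Reads "BUS VALUE" pairs, one per line, and prints the action per bridge.
report_pci_bridges() {
    while read -r bus value; do
        echo "$bus $(classify_pci_bridge "$value")"
    done
}
```

The output of the verification loop would be piped into "report_pci_bridges". An empty value (for example, when lspci prints nothing for a bus) falls into the last case.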

Verify Database Server Disk Controller Configuration (ARCHIVE)


Archive Date: 03/06/13
Archive Reason: Beginning with the Exadata software version 11.2.3.2.1, the reclamation of the hotspare device mandated in 11.2.3.2.0 was made optional for those customers upgrading from a version below 11.2.3.2.0 directly to 11.2.3.2.1 or higher.

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 10/1/2012 | X2-2(4170), X2-2, X2-8, X3-2, X3-8 | Linux | 11.2.3.2.0 + | 11.2.x +


Benefit / Impact:
An X3-2 or X2-2 database server contains 4 disk drives in a RAID-5 configuration. An X3-8 or X2-8 database server contains 8 disk drives in a RAID-5 configuration. There is 1 virtual drive created across the RAID set. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
The impact of validating the RAID devices is minimal. The impact of corrective actions will vary depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.
Risk:
Not verifying the RAID devices increases the chance of a performance degradation or an outage.
Action / Repair:
To verify the database server disk controller configuration, use the following command:
/opt/MegaRAID/MegaCli/MegaCli64 AdpAllInfo -aALL | grep "Device Present" -A 8 

For an X3-2 or X2-2 database server, the output will be similar to:

 Device Present
 ================
 Virtual Drives : 1
 Degraded : 0
 Offline : 0
 Physical Devices : 5
 Disks : 4
 Critical Disks : 0
 Failed Disks : 0 

The expected output is 1 virtual drive, none degraded or offline, 5 physical devices (controller + 4 disks), 4 disks, and no critical or failed disks.
For an X3-8 or X2-8 database server, the output will be similar to:
 Device Present
 ================
 Virtual Drives : 1
 Degraded : 0
 Offline : 0
 Physical Devices :11
 Disks : 8
 Critical Disks : 0
 Failed Disks : 0 

The expected output is 1 virtual drive, none degraded or offline, 11 physical devices (1 controller + 2 SAS2 expansion ports + 8 disks), 8 disks, and no critical or failed disks.
If the reported output differs, investigate and correct the condition.

NOTE: If additional virtual drives or a "hot spare" is present, it may be that the procedure to reclaimdisks was not executed at deployment time or that a bare metal restore procedure was performed without using the "dualboot=no" qualifier. Please refer to the "Reclaiming Disks for the Linux Operating System" section of "Oracle® Exadata Database Machine Owner's Guide, 11g Release 2 (11.2)".
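The expected values in the "Device Present" block can be checked field by field. The sketch below is illustrative only: the function names "device_present_value" and "check_device_present" are hypothetical, and it assumes the command output has first been saved to a file.

```shell
# Hypothetical helper: prints the value of one field of the
# "Device Present" block. $1 = field label, $2 = file with saved output.
device_present_value() {
    awk -F: -v label="$1" '
        { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key) }
        key == label { val = $2; gsub(/[ \t]/, "", val); print val; exit }
    ' "$2"
}

# $1 = file, $2 = expected physical devices, $3 = expected disks.
# Prints PASS only if every field has the value described in this check.
check_device_present() {
    local file="$1" want_devices="$2" want_disks="$3" status=PASS label
    [ "$(device_present_value 'Virtual Drives' "$file")" = 1 ] || status=FAIL
    [ "$(device_present_value 'Physical Devices' "$file")" = "$want_devices" ] || status=FAIL
    [ "$(device_present_value 'Disks' "$file")" = "$want_disks" ] || status=FAIL
    for label in 'Degraded' 'Offline' 'Critical Disks' 'Failed Disks'; do
        [ "$(device_present_value "$label" "$file")" = 0 ] || status=FAIL
    done
    echo "$status"
}
```

For an X3-2 or X2-2 database server the call would be "check_device_present FILE 5 4", and for an X3-8 or X2-8 "check_device_present FILE 11 8", where FILE holds the saved AdpAllInfo output.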

Verify Database Server Virtual Drive Configuration (ARCHIVE)


Archive Date: 03/06/13
Archive Reason: Beginning with the Exadata software version 11.2.3.2.1, the reclamation of the hotspare device mandated in 11.2.3.2.0 was made optional for those customers upgrading from a version below 11.2.3.2.0 directly to 11.2.3.2.1 or higher.

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 10/1/2012 | X2-2(4170), X2-2, X2-8, X3-2, X3-8 | Linux | 11.2.3.2.0 + | 11.2.x +


Benefit / Impact:
An X3-2 or X2-2 database server contains 4 disk drives in a RAID-5 configuration. An X3-8 or X2-8 database server contains 8 disk drives in a RAID-5 configuration. There is 1 virtual drive created across the RAID set. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
The impact of validating the virtual drives is minimal. The impact of corrective actions will vary depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.
Risk:
Not verifying the virtual drives increases the chance of a performance degradation or an outage.
Action / Repair:
To verify the database server virtual drive configuration, use the following command:
/opt/MegaRAID/MegaCli/MegaCli64 CfgDsply -aALL | grep "Virtual Drive:";/opt/MegaRAID/MegaCli/MegaCli64 CfgDsply -aALL | grep "Number Of Drives";/opt/MegaRAID/MegaCli/MegaCli64 CfgDsply -aALL | grep "^State" 

For an X3-2 or X2-2 database server, the output will be similar to:
Virtual Drive: 0 (Target Id: 0)
Number Of Drives : 4
State : Optimal

The expected result is that the virtual device has 4 drives and a state of optimal.
For an X3-8 or X2-8 database server, the output will be similar to:
Virtual Drive: 0 (Target Id: 0) 
Number Of Drives : 8 
State : Optimal 

The expected result is that the virtual device has 8 drives and a state of optimal.
If the reported output differs, investigate and correct the condition.
NOTE: The virtual device number reported may vary depending upon configuration and version levels.

NOTE: If additional virtual drives or a "hot spare" is present, it may be that the procedure to reclaimdisks was not executed at deployment time or that a bare metal restore procedure was performed without using the "dualboot=no" qualifier. Please refer to the "Reclaiming Disks for the Linux Operating System" section of "Oracle® Exadata Database Machine Owner's Guide, 11g Release 2 (11.2)".

NOTE: If the database server was upgraded to 11.2.3.2.0 or higher, this check may fail because the reported number of drives is "3" or "7". Please see the "Known Issues" #5 "Hotspare removed for compute nodes" in My Oracle Support note 1468877.1 for corrective action.
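The expected drive count depends on the machine type, and a count that is one short matches the upgrade case in the NOTE above. A minimal sketch of that decision follows. The function name "check_virtual_drive" and its messages are hypothetical, and it reads the three output lines of the CfgDsply command on standard input.

```shell
# Hypothetical helper. $1 = machine type (X2-2, X3-2, X2-8 or X3-8),
# stdin = the "Virtual Drive", "Number Of Drives" and "State" lines.
check_virtual_drive() {
    local model="$1" expected input drives state
    case "$model" in
        X2-2|X3-2) expected=4 ;;
        X2-8|X3-8) expected=8 ;;
        *) echo "UNKNOWN model: $model"; return 2 ;;
    esac
    input=$(cat)
    drives=$(printf '%s\n' "$input" | awk -F: '/^Number Of Drives/ { gsub(/[ \t]/, "", $2); print $2; exit }')
    state=$(printf '%s\n' "$input" | awk -F: '/^State/ { gsub(/[ \t]/, "", $2); print $2; exit }')
    if [ "$drives" = "$expected" ] && [ "$state" = "Optimal" ]; then
        echo "PASS"
    elif [ "$drives" = "$((expected - 1))" ]; then
        # 3 or 7 drives: the hot spare may not have been reclaimed.
        echo "FAIL: $drives drives; see Known Issues #5 in MOS note 1468877.1"
    else
        echo "FAIL: drives=$drives state=$state"
    fi
}
```

The output of the verification command would be piped into, for example, "check_virtual_drive X3-2".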


Verify Database Server Physical Drive Configuration (ARCHIVE)


Archive Date: 03/06/13
Archive Reason: Beginning with the Exadata software version 11.2.3.2.1, the reclamation of the hotspare device mandated in 11.2.3.2.0 was made optional for those customers upgrading from a version below 11.2.3.2.0 directly to 11.2.3.2.1 or higher.

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 10/1/2012 | X2-2(4170), X2-2, X2-8, X3-2, X3-8 | Linux | 11.2.3.2.0 + | 11.2.x +


Benefit / Impact:
An X3-2 or X2-2 database server contains 4 disk drives in a RAID-5 configuration. An X3-8 or X2-8 database server contains 8 disk drives in a RAID-5 configuration. There is 1 virtual drive created across the RAID set. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
The impact of validating the physical drives is minimal. The impact of corrective actions will vary depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.
Risk:
Not verifying the physical drives increases the chance of a performance degradation or an outage.
Action / Repair:
To verify the database server physical drive configuration, use the following command:
/opt/MegaRAID/MegaCli/MegaCli64 PDList -aALL | grep "Firmware state"

For an X3-2 or X2-2 database server, the output will be similar to:
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up

There should be 4 lines of output showing a state of "Online, Spun Up".
For an X3-8 or X2-8 database server, the output will be similar to:
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up 
Firmware state: Online, Spun Up

There should be 8 lines of output showing a state of "Online, Spun Up".
If the reported output differs, investigate and correct the condition.
NOTE: If additional virtual drives or a "hot spare" is present, it may be that the procedure to reclaimdisks was not executed at deployment time or that a bare metal restore procedure was performed without using the "dualboot=no" qualifier. Please refer to the "Reclaiming Disks for the Linux Operating System" section of "Oracle® Exadata Database Machine Owner's Guide, 11g Release 2 (11.2)".

NOTE: If the database server was upgraded to 11.2.3.2.0 or higher, this check may fail because one of the devices shows a state of: "Unconfigured(good), Spun Up". Please see the "Known Issues" #5 "Hotspare removed for compute nodes" in My Oracle Support note 1468877.1 for corrective action.

Verify processor.max_cstate=1 on database servers


Archive Date: 03/13/13
Archive Reason: Beginning with the Exadata software version fresh install 11.2.2.2.0 or upgrade to 11.2.2.4.0, the ILOM version went to 3.0.16.10 and this issue was resolved. This also does not apply to the current X3 series hardware.

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 04/17/12 | Dan Norris | Production | Exadata | 14153949 - exachk
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2 | 11.2.x+ (ILOM < 3.0.16.10) | Solaris - 11, Linux x86-64 UEK5.8 | exachk 2.2.2 |


Benefit / Impact:
The benefit of these settings is avoiding uncorrectable memory errors related to the deep C state features on Nehalem processors.
NOTE: Fresh images 11.2.2.2.0 or higher automatically include these fixes. Systems upgraded from older original images should be manually upgraded by following the 11.2.2.2.0 upgrade notes.

Risk:
Without the proper configuration settings, memory errors may be reported.
Action / Repair:
If the database server has been upgraded to version 11.2.2.4.0 or higher, it should be running ILOM version 3.0.16.10, which includes the fix for CR 7036024. Once that fix is installed, the kernel parameter is no longer required because the ILOM/BIOS incorporates the fix directly. Rather than checking for an image version, the proper check is against the ILOM version directly.
To verify that processor.max_cstate=1 if required, as the "root" userid execute the following code on each database server:
##### begin script 
#!/bin/bash 
UNAME_S=`/bin/uname -s` 
DMIDECODE=`/usr/sbin/dmidecode -s system-product-name` 
### this fixes weirdness with the way dmidecode returns its data 
DMIDECODE=`echo $DMIDECODE` 
TARGET_ILOM_VER_X4170=03001610 
### check basic requirements 
if [ "$UNAME_S" = "Linux" -a "$DMIDECODE" = "SUN FIRE X4170 M2 SERVER" ]; then
    ### verify the ILOM version - if 3.0.16.10 or newer, can exit
    ILOM_VER=`ipmitool sunoem cli version | grep firmware | egrep -v 'build number|date:' | awk '{print $3}'`
    ILOM_VER1=`echo $ILOM_VER | awk -F. '{print $1}'`
    ILOM_VER2=`echo $ILOM_VER | awk -F. '{print $2}'`
    ILOM_VER3=`echo $ILOM_VER | awk -F. '{print $3}'`
    ILOM_VER4=`echo $ILOM_VER | awk -F. '{print $4}'`
    if [ "$ILOM_VER1" -le 9 ]; then ILOM_VER1="0$ILOM_VER1"; fi
    if [ "$ILOM_VER2" -le 9 ]; then ILOM_VER2="0$ILOM_VER2"; fi
    if [ "$ILOM_VER3" -le 9 ]; then ILOM_VER3="0$ILOM_VER3"; fi
    if [ "$ILOM_VER4" -le 9 ]; then ILOM_VER4="0$ILOM_VER4"; fi
    LOCALVER="${ILOM_VER1}${ILOM_VER2}${ILOM_VER3}${ILOM_VER4}"
    if [ $TARGET_ILOM_VER_X4170 -gt $LOCALVER ]; then
        ### now we need to check for the parameter in /proc/cmdline
        PARAM_PRESENT=`grep processor.max_cstate=1 /proc/cmdline | wc -l `
        if [ $PARAM_PRESENT -eq 1 ]; then
            ### don't have fix via ILOM version, but have cmdline param
            echo "PASSED due to cmdline param"
        else
            ### don't have fix via ILOM, don't have fix via kernel cmdline param, failed check
            echo "FAILED"
        fi
    else
        ### already have the minimum ILOM version, so passed the check
        echo "PASSED due to minimum ILOM version"
    fi
else
    echo "This check is only for Linux-based X4170 M2 database servers, exiting"
fi
#### end script 

The expected output is not "FAILED".
To correct a "FAILED" condition:
1) Upgrade to newer versions of Exadata Software not impacted by this issue.
2) If an upgrade is not possible, to configure the proper settings, the kernel boot option "processor.max_cstate=1" should be added to the /boot/grub/grub.conf file on the "kernel" line so that it looks like this:
kernel /vmlinuz-2.6.18-274.18.1.0.1.el5 root=LABEL=DBSYS ro bootarea=dbsys loglevel=7 panic=60 debug rhgb numa=off console=ttyS0,115200n8 console=tty1 crashkernel=128M@16M audit=1 processor.max_cstate=1 nomce

After this change, a system reboot is required to pick up the new setting.
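The script in this check pads each of the four ILOM version components to two digits so that versions can be compared as plain integers (3.0.16.10 becomes 03001610). The same idea can be sketched more compactly. The function names "pad_ilom_version" and "ilom_has_cstate_fix" are hypothetical, and the sketch assumes a four-part dotted version with each component below 100.

```shell
# Hypothetical helper: 3.0.16.10 -> 03001610
pad_ilom_version() {
    echo "$1" | awk -F. '{ printf "%02d%02d%02d%02d\n", $1, $2, $3, $4 }'
}

# Succeeds when the given ILOM version is 3.0.16.10 or newer, which is
# the level the note associates with the fix for CR 7036024.
ilom_has_cstate_fix() {
    [ "$(pad_ilom_version "$1")" -ge "$(pad_ilom_version 3.0.16.10)" ]
}
```

A caller would pass the version string reported by the ILOM, for example "ilom_has_cstate_fix 3.0.9.27 || echo kernel parameter still required".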

Verify Software on Storage Servers (CheckSWProfile.sh) (ARCHIVE)


Archive Date: 06/26/13
Archive Reason: Beginning with the Exadata software version fresh install 11.2.3.3.0 or upgrade to 11.2.3.3.0, CheckSWProfile.sh has been desupported by Exadata development.

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8 | Linux, Solaris | 11.2.x + | 11.2.x +


Benefit / Impact:
Verifying the software configuration after initial deployment, upgrades, or patching and before the Oracle Exadata Database Machine is placed into or returned to production status can avoid problems related to the software modifications.
The overhead for these verification steps is minimal.
Risk:
If the software is not validated, inconsistencies can lead to problems and outages.
Action / Repair:
To verify the storage server software configuration execute the following command as the root userid:
/opt/oracle.SupportTools/CheckSWProfile.sh -c 

The output will be similar to:
[INFO] SUCCESS: Meets requirements of operating platform and installed software for 
[INFO] below listed releases and patches of Exadata and of corresponding Database. 
[INFO] Check does NOT verify correctness of configuration for installed software.

[The_ExadataAndDatabaseReleases] 
Exadata: 11.2.2.1.0 OracleDatabase: 11.2.0.2+Patches 

If any result other than "SUCCESS" is returned, investigate and correct the condition.
Review:
ravindra.dani: This is not correct for database hosts all the time. SW checker is only useful on fresh imaged db nodes. Also this check is going to be retired by 11.2.3.1.0. This check should not be run on the cells and though not folded in cellcli, say at validate config, it should be.

Verify Software on InfiniBand Switches (CheckSWProfile.sh)(ARCHIVE)


Archive Date: 06/26/13
Archive Reason: Beginning with the Exadata software version fresh install 11.2.3.3.0 or upgrade to 11.2.3.3.0, CheckSWProfile.sh has been desupported by Exadata development.

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | N/A | X2-2(4170), X2-2, X2-8 | Linux, Solaris | 11.2.x + | 11.2.x +

Benefit / Impact:
Verifying the software configuration after initial deployment, upgrades, or patching and before the Oracle Exadata Database Machine is placed into or returned to production status can avoid problems related to the software modifications.
The overhead for these verification steps is minimal.
Risk:
If the software is not validated, problems may occur when the machine is utilized.
Action / Repair:
The commands required to verify the InfiniBand switches software configuration vary slightly by the physical configuration of the Oracle Exadata Database Machine. The key difference is whether or not the physical configuration includes a designated spine switch.
To verify the InfiniBand switches software configuration for a X2-8, a full rack Oracle Exadata Database Machine X2-2 or a late production model half rack Oracle Exadata Database Machine X2-2, with a designated spine switch properly configured per the "Oracle Exadata Database Machine Owner's Guide 11g Release 2 (11.2) E13874-15" with "sm_priority=8", and the name "RanDomsw-ib1", execute the following command as the "root" userid on one of the database servers:
/opt/oracle.SupportTools/CheckSWProfile.sh -I IS_SPINERanDomsw-ib1,RanDomsw-ib3,RanDomsw-ib2 

Where "RanDomsw-ib1, RanDomsw-ib3, and RanDomsw-ib2" are the switch names returned by the "ibswitches" command.
NOTE: There is no space between the "IS_SPINE" qualifier and the name of the designated spine switch.

The output will be similar to:
Checking if switch RanDomsw-ib1 is pingable...
Checking if switch RanDomsw-ib3 is pingable...
Checking if switch RanDomsw-ib2 is pingable...
Use the default password for all switches? (y/n) [n]: y
[INFO] SUCCESS Switch RanDomsw-ib1 has correct software and firmware version:
 SWVer: 1.3.3-2
[INFO] SUCCESS Switch RanDomsw-ib1 has correct opensm configuration:
 controlled_handover=TRUE polling_retry_number=5 routing_engine=ftree sminfo_polling_timeout=1000 sm_priority=8 


[INFO] SUCCESS Switch RanDomsw-ib3 has correct software and firmware version:
 SWVer: 1.3.3-2
[INFO] SUCCESS Switch RanDomsw-ib3 has correct opensm configuration:
 controlled_handover=TRUE polling_retry_number=5 routing_engine=ftree sminfo_polling_timeout=1000 sm_priority=5 


[INFO] SUCCESS Switch RanDomsw-ib2 has correct software and firmware version:
 SWVer: 1.3.3-2
[INFO] SUCCESS Switch RanDomsw-ib2 has correct opensm configuration:
 controlled_handover=TRUE polling_retry_number=5 routing_engine=ftree sminfo_polling_timeout=1000 sm_priority=5 


[INFO] SUCCESS All switches have correct software and firmware version:
 SWVer: 1.3.3-2
[INFO] SUCCESS All switches have correct opensm configuration:
 controlled_handover=TRUE polling_retry_number=5 routing_engine=ftree sminfo_polling_timeout=1000 sm_priority=5 for non spine and 8 for spine switch5 

To verify the InfiniBand switches software configuration for an early production model half rack Oracle Exadata Database Machine X2-2 (may not have shipped with a designated spine switch), or a quarter rack Oracle Exadata Database Machine X2-2 properly configured per the "Oracle Exadata Database Machine Owner's Guide 11g Release 2 (11.2) E13874-15", execute the following command as the "root" userid on one of the database servers:
/opt/oracle.SupportTools/CheckSWProfile.sh -I RanDomsw-ib3,RanDomsw-ib2 

Where "RanDomsw-ib3 and RanDomsw-ib2" are the switch names returned by the "ibswitches" command.
The output will be similar to the output for the first command, but there will be no references to a spine switch and all switches will have "sm_priority" of 5.
In either command case, the expected output is to return "SUCCESS". If anything else is returned, investigate and correct the condition.

Verify storage server network configuration with ipconf (ARCHIVE)


Archive Date: 05/13/15
Archive Reason: This storage server only check was replaced by "Verify active system values match those defined in configuration file cell.conf", which executes on both storage and database servers with broader scope.

Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 05-Mar-2013 | Doug Utzig | Production | Exadata, SSC |
DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | All | n/a | |

Benefit / Impact:
Exadata Storage Server network configuration is maintained in both operating system level configuration files and in Exadata-specific configuration files. The configuration defined in the two sets of files must match. To ensure proper configuration and consistency, network configuration changes to an Exadata Storage Server must be performed with the ipconf utility, as documented in the Oracle Exadata Storage Server Software User's Guide.
The impact of verifying that storage server network configuration is correct and consistent is minimal.
Risk:
If operating system level configuration files and Exadata-specific configuration files are inconsistent, then maintenance activities like software patching may fail, or previous configuration may be restored without warning.
Action / Repair:
To verify operating system level configuration files and Exadata-specific configuration files are consistent, run the following ipconf command on storage servers:
# /usr/local/bin/ipconf -verify -semantic 

The output should be similar to:
Verifying of Exadata configuration file /opt/oracle.cellos/cell.conf Done. Configuration file /opt/oracle.cellos/cell.conf passed all verification checks 

If the output reports FAILED for any check, investigate to find the root cause, and then use only the ipconf utility to make the necessary corrections to the storage server network configuration. Refer to the Oracle Exadata Storage Server Software User's Guide for details of the ipconf utility.

11.2.0.2 ASM Instance Initialization Parameters (ARCHIVE)


Archive Date: 05/08/15
Archive Reason: 11.2.0.2 is fully desupported. Please see: "Release Schedule of Current Database Releases (Doc ID 742060.1)"
Priority: Critical
Benefit / Impact: Experience and testing have shown that certain ASM initialization parameters should be set at specific values. These are the best practice values set at deployment time. By setting these ASM initialization parameters as recommended, known problems may be avoided and performance maximized. The parameters are specific to the ASM instances. Unless otherwise specified, the value is for both 2 socket and 8 socket Database Machines. The impact of setting these parameters is minimal.
Risk: If the ASM initialization parameters are not set as recommended, a variety of issues may be encountered, depending upon which initialization parameter is not set as recommended, and the actual set value.
Action / Repair: To verify the ASM initialization parameters, compare the values in your environment against the table below (* = default value):

Parameter | Recommended Value | Priority | Notes
cluster_interconnects | Bondib0 IP address for 2 socket servers; colon delimited Bondib* IP addresses for 8 socket servers | 1 | This is used to avoid the Clusterware HAIP address as its use is not supported on Exadata (the only exception being with RAC One Node)
asm_power_limit | 4 | 1 | This is Exadata default to mitigate application performance impact during ASM rebalance. Please evaluate application performance impact before using a higher ASM_POWER_LIMIT.
memory_target | 1040M | 1 | This avoids issues with 11.2.0.1 to 11.2.0.2 upgrade. This is the default setting for Exadata.
processes | For < 10 instances per node: 50 * (DB instances per node + 1). For >= 10 instances per node: {50 * MIN(db_instances_per_node + 1, 11)} + {10 * MAX(db_instances_per_node - 10, 0)} | 1 | This avoids issues observed when ASM hits max # of processes. NOTE: "instances" means "non-ASM" instances. [Internal] Note that bug 11842806 can cause excessive connections that even a properly configured processes parameter can't handle so the fix should be applied

Correct any Priority 1 parameter that is not set as recommended. Evaluate and correct any Priority 2 parameter that is not set as recommended.
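The recommended value for "processes" can be computed from the number of non-ASM instances per node. The sketch below is illustrative only, and the function name "asm_processes" is hypothetical. It uses the second form of the formula for every input, because below ten instances the MIN term equals (instances + 1) and the MAX term is zero, which reduces to the first form.

```shell
# Hypothetical helper: $1 = number of non-ASM database instances per node.
# Prints 50 * MIN(n + 1, 11) + 10 * MAX(n - 10, 0).
asm_processes() {
    local n="$1" capped extra
    capped=$(( n + 1 ))
    [ "$capped" -gt 11 ] && capped=11
    extra=$(( n - 10 ))
    [ "$extra" -lt 0 ] && extra=0
    echo $(( 50 * capped + 10 * extra ))
}
```

For example, 4 instances per node gives 250, 10 instances gives 550, and 12 instances gives 570.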

Verify Common Instance Database Initialization Parameters (ARCHIVE)


Archive Date: 08/22/12
Archive Reason: This section was created to account for database initialization parameters that become deprecated at various release levels.
Critical, 08/02/11
Benefit / Impact: Experience and testing have shown that certain database initialization parameters should be set at specific values. These are the best practice values set at deployment time. By setting these database initialization parameters as recommended, known problems may be avoided and performance maximized. The parameters are common to all database instances. The impact of setting these parameters is minimal. The performance related settings provide guidance to maintain highest stability without sacrificing performance. Changing the default performance settings can be done after careful performance evaluation and clear understanding of the performance impact.
Risk: If the database initialization parameters are not set as recommended, a variety of issues may be encountered, depending upon which initialization parameter is not set as recommended, and the actual set value.
Action / Repair: To verify the database initialization parameters, compare the values in your environment against the table below (* = default value):

Parameter | Recommended Value | Priority | Notes
_lm_rcvr_hang_allow_time | 140 | 1 | This parameter protects from corner case timeouts lower in the stack and prevents instance evictions. Archive Reason: Deprecated with 11.2.0.3 or higher boundary. exachk bug 14526144
_kill_diagnostics_timeout | 140 | 1 | This parameter protects from corner case timeouts lower in the stack and prevents instance evictions. Archive Reason: Deprecated with 11.2.0.3 or higher boundary. exachk bug 14526155

Verify RAID Controller Battery Condition (ARCHIVE)
Archive Date: 04/06/16
Archive Reason: This check became obsolete with the release of X5 series hardware
Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 03/02/11 | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2, X4-8 | Linux, Solaris | 11.2.x + | 11.2.x +
[Bug(s): 11828407 (Storage Server), 11832924 (EM Storage Server Plugin), 11832981 (EM Agent)]
Benefit/Impact:
The RAID controller battery loses its ability to support cache over time. Verifying the battery charge and condition allows proactive battery replacement.
The impact of verifying the RAID controller battery condition is minimal.
Risk:
A failed RAID controller battery will put the RAID controller into WriteThrough mode which significantly impacts write I/O performance.
Action/Repair:
Execute the following command as the "root" userid on all servers:
if [ -x /opt/MegaRAID/MegaCli/MegaCli64 ]
then
#Linux
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0 | egrep "Full Charge|Max Error|BatteryType" | sort | head -3
else
#Solaris
/opt/MegaRAID/MegaCli -AdpBbuCmd -a0 | egrep "Full Charge|Max Error|BatteryType" | sort | head -3
fi;
The output will be similar to:
BatteryType: iBBU08
Full Charge Capacity: 1272 mAh
Max Error: 0 %
Proactive battery replacement should be performed within 60 days for any batteries that meet the following criteria:
1) "Full Charge Capacity" less than or equal to 800 mAh and "Max Error" less than 10%.
Immediately replace any batteries that meet either of the following criteria:
1) "Max Error" is 10% or greater (battery deemed unreliable regardless of "Full Charge Capacity" reading)
2) "Full Charge Capacity" less than 674 mAh regardless of "Max Error" reading
[NOTE: The complete reference guide for LSI disk controller batteries used in Exadata can be found in MOS 1329989.1 (INTERNAL ONLY)]
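The replacement criteria above can be expressed as a small decision function. This is a sketch only: the function name "classify_bbu" and the result labels are hypothetical. It takes the "Full Charge Capacity" (mAh) and "Max Error" (%) values as integers, and the immediate-replacement criteria are tested first so that they take precedence.

```shell
# Hypothetical helper.
# $1 = "Full Charge Capacity" in mAh, $2 = "Max Error" in percent.
classify_bbu() {
    local capacity="$1" max_error="$2"
    if [ "$max_error" -ge 10 ] || [ "$capacity" -lt 674 ]; then
        # Battery deemed unreliable, or capacity below 674 mAh.
        echo "REPLACE_IMMEDIATELY"
    elif [ "$capacity" -le 800 ]; then
        # Capacity at or below 800 mAh with Max Error under 10%.
        echo "REPLACE_WITHIN_60_DAYS"
    else
        echo "OK"
    fi
}
```

With the sample output shown in this check, "classify_bbu 1272 0" reports OK.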

Verify all "BIGFILE" tablespaces have non-default "MAXBYTES" values set (ARCHIVE)

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 11-Nov-2011 | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux, [WIP:VW] Solaris | 11.2.x + | 11.2.x +

Benefit / Impact
"MAXBYTES" is the SQL attribute that expresses the "MAXSIZE" value that is used in the DDL command to set "AUTOEXTEND" to "ON". By default, for a bigfile tablespace, the value is "3.5184E+13", or "35184372064256". The benefit of having "MAXBYTES" set at a non-default value for "BIGFILE" tablespaces is that a runaway operation or heavy simultaneous use (e.g., temp tablespace) cannot take up all the space in a diskgroup.

The impact of verifying that "MAXBYTES" is set to a non-default value is minimal. The impact of setting the "MAXSIZE" attribute to a non-default value varies depending upon whether it is done during database creation, file addition to a tablespace, or added to an existing file.

Risk

The risk of running out of space in a diskgroup varies by application and cannot be quantified here. A diskgroup running out of space may impact the entire database as well as ASM operations (e.g., rebalance operations).

Action / Repair

To obtain a list of file numbers and bigfile tablespaces that have the "MAXBYTES" attribute at the default value, enter the following sqlplus command logged into the database as sysdba:
select file_id, a.tablespace_name, autoextensible, maxbytes
from (select file_id, tablespace_name, autoextensible, maxbytes from dba_data_files where autoextensible='YES' and maxbytes = 35184372064256) a, 
(select tablespace_name from dba_tablespaces where bigfile='YES') b
where a.tablespace_name = b.tablespace_name
union
select file_id,a.tablespace_name, autoextensible, maxbytes
from (select file_id, tablespace_name, autoextensible, maxbytes from dba_temp_files where autoextensible='YES' and maxbytes = 35184372064256) a, 
(select tablespace_name from dba_tablespaces where bigfile='YES') b
where a.tablespace_name = b.tablespace_name;
The output should be:
no rows selected
If you see output similar to:
 FILE_ID TABLESPACE_NAME AUT MAXBYTES
---------- ------------------------------ --- ----------
1 TEMP YES 3.5184E+13
3 UNDOTBS1 YES 3.5184E+13
4 UNDOTBS2 YES 3.5184E+13
Investigate and correct the condition.

Ensure Temporary Tablespace is correctly defined (ARCHIVE)

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
 | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux | 11.2.x + | 11.2.x +
The temporary tablespace should be
  1. A BigFile Tablespace
  2. Located in DATA or RECO, whichever one is not HIGH redundancy
  3. Sized 32GB Initially
  4. Configured with AutoExtend on at 4GB
  5. Configured with a Max Size defined to limit out of control growth.

 
Verify "diagsnap.pl" is not executing

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | WARN | 04/26/17 | Dib Chatterjee, Jaime Figueroa | Production | RA, Exadata - Physical, Exadata - User Domain | ALL | bug 27376516 - Exachk; bug 25960055 - OEDA; bug 25955127 - Exachk
DB/GI Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
12.2.0.1+ | ASM | N/A | N/A | ALL | Linux | exachk 18.2.0 | N/A

 

Benefit / Impact:

Starting with version 12.2.0.1, by default the Cluster Health Monitor (CHM) framework continuously executes the file "/u01/app/12.2.0.1/grid/bin/diagsnap.pl". Under certain conditions, this script executes the "pstack" command against key grid infrastructure processes. The output of "pstack" can be useful for diagnosing grid infrastructure issues, but the "pstack" command execution and locking can cause these key grid infrastructure processes (especially ocssd) to hang, which can trigger node reboots. It is recommended that "diagsnap.pl" not execute continuously, and that the "pstack" command be used only when other diagnostics indicate a benefit.
The impact of verifying that "diagsnap.pl" is not executing is minimal, as is the impact of stopping its execution.
Risk:
Continuously executing "diagsnap.pl" may lead to node reboots that might have otherwise been avoided.
Action / Repair:
To verify that "diagsnap.pl" is not executing, as the owner userid of the grid home, and with the environment properly set to access the grid home, execute the following code set on each database server:

CRS_HOME=$ORACLE_HOME
unset DIAGSNAP_OUTPUT
unset DIAGSNAP_EXECUTING

function chkdiagsnap
{
    if [ $DIAGSNAP_EXECUTING -gt 0 ]
    then
        echo -e "WARNING: \"diagsnap.pl\" is executing on this database server.  Recommendation is to stop the process:\nDetails: $DIAGSNAP_OUTPUT"
        repair
    else
        echo -e "SUCCESS: \"diagsnap.pl\" is not executing on this database server.\n"
    fi
    return 0
}

function repair
{
  $CRS_HOME/bin/oclumon manage -disable diagsnap
}

DIAGSNAP_OUTPUT=$(ps -ef | grep $CRS_HOME | grep diagsnap | grep -v grep)
DIAGSNAP_EXECUTING=$(echo "$DIAGSNAP_OUTPUT" | grep -c diagsnap)


chkdiagsnap
 
The expected output is:
SUCCESS: "diagsnap.pl" is not executing on this database server.
Example of a "WARNING:" result:
WARNING: "diagsnap.pl" is executing on this database server:
 
Details: root     386456 378366  0 Apr03 ?        00:30:17 /u01/app/12.2.0.1/grid/perl/bin/perl /u01/app/12.2.0.1/grid/bin/diagsnap.pl start
NOTE: If a "WARNING:" result is returned, to stop the file "diagsnap.pl" from executing, as the owner userid of the grid home, and with the environment variables properly set, execute the following command:
$CRS_HOME/bin/oclumon manage -disable diagsnap


   
Verify memlock is 90% of phys ram when huge pages are enabled
Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 8/29/14 | Rene Kundersma | Production | Exadata |

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | 11.2.2.2.0+ | Linux x86-64 | |

Benefit / Impact:
Oracle recommends that the maximum locked memory be at least 90 percent of the installed physical memory when huge pages are enabled. Refer to the operating system documentation or issue the command "man limits.conf" for details. The impact of verifying this value is minimal. See also http://docs.oracle.com/database/121/LADBI/usr_grps.htm#LADBI7674
Risk:
Incorrect resource settings can cause instability and performance problems.
Action / Repair:
Obtain the hard and soft values for memlock from /etc/security/limits.conf and verify that each is at least 90% of physical memory. When hugepages are configured (which should be the case) and either value is less than 90% of physical memory, treat the result as a warning and update the values.
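
The source gives no command set for this check, so the following is a minimal sketch of how the comparison could be done, not the logic used by exachk or orachk. It assumes that the limits are set for a single software owner (passed as an argument, for example "oracle") or for "*" in one limits.conf-style file, that memlock is expressed in KB as limits.conf defines it, and that physical memory is read from the MemTotal line of /proc/meminfo. The function names are hypothetical.

```shell
#!/bin/bash
# Print 90% of MemTotal (in KB) read from a /proc/meminfo-style file.
min_memlock_kb() {
    local meminfo=${1:-/proc/meminfo} total
    total=$(awk '/^MemTotal:/ { print $2; exit }' "$meminfo")
    echo $(( total * 9 / 10 ))
}

# Print the memlock value of the given type (soft|hard) for a user.
# The last matching uncommented entry wins; prints nothing if none exists.
memlock_limit() {
    local type=$1 user=$2 limits=${3:-/etc/security/limits.conf}
    awk -v t="$type" -v u="$user" \
        '$1 !~ /^#/ && ($1 == u || $1 == "*") && $2 == t && $3 == "memlock" { v = $4 }
         END { if (v != "") print v }' "$limits"
}

# Report a WARNING when a limit is missing or below the 90% threshold.
check_memlock() {
    local user=$1 meminfo=${2:-/proc/meminfo} limits=${3:-/etc/security/limits.conf}
    local min type val rc=0
    min=$(min_memlock_kb "$meminfo")
    for type in soft hard; do
        val=$(memlock_limit "$type" "$user" "$limits")
        if [ "$val" = "unlimited" ]; then
            echo "SUCCESS: $type memlock is unlimited"
        elif [ -n "$val" ] && [ "$val" -ge "$min" ] 2>/dev/null; then
            echo "SUCCESS: $type memlock $val KB is at least $min KB"
        else
            echo "WARNING: $type memlock '${val:-unset}' is below $min KB"
            rc=1
        fi
    done
    return $rc
}
```

A typical invocation would be "check_memlock oracle" as the "root" userid on each database server. Entries under /etc/security/limits.d are not examined by this sketch.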

Verify RAID Controller Battery Temperature


Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 03/02/11 | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux, Solaris | 11.2.x + | 11.2.x +

Benefit/Impact:
Maintaining proper temperature ranges maximizes RAID controller battery life.
The impact of verifying RAID controller battery temperature is minimal.
Risk:
A reported temperature of 60C or higher causes the battery to suspend charging until the temperature drops, and shortens the service life of the battery. A battery that fails prematurely puts the RAID controller into WriteThrough mode, which significantly impacts write I/O performance.
Action/Repair:
To verify the RAID controller battery temperature, execute the following command as the "root" userid on all servers:
if [ -x /opt/MegaRAID/MegaCli/MegaCli64 ]
then
#Linux
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0 -nolog| grep BatteryType;
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0 -nolog | grep -i temper;
else
#Solaris
/opt/MegaRAID/MegaCli -AdpBbuCmd -a0 -nolog| grep BatteryType;
/opt/MegaRAID/MegaCli -AdpBbuCmd -a0 -nolog| grep -i temper;
fi;
The output will be similar to:
BatteryType: iBBU08
Temperature: 38 C
Temperature : OK
Over Temperature : No
If the battery temperature is equal to or greater than 55C, investigate and correct the environmental conditions.
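
The comparison against the 55C threshold can be scripted. The following is a minimal sketch, assuming the MegaCli output contains a line of the form "Temperature: 38 C" as in the sample above; the function names are hypothetical, and the input is read from standard input so the same logic applies to the Linux and Solaris command paths.

```shell
#!/bin/bash
# Read "-AdpBbuCmd" output on stdin and print the battery temperature in C.
# Status lines such as "Temperature : OK" are skipped because they carry no number.
bbu_temperature() {
    awk -F': *' '/^ *Temperature *: *[0-9]+ *C/ { sub(/ *C.*/, "", $2); print $2; exit }'
}

# Compare the reported temperature with a limit (default 55, per the text above).
check_bbu_temperature() {
    local limit=${1:-55} temp
    temp=$(bbu_temperature)
    if [ -z "$temp" ]; then
        echo "WARNING: no battery temperature found in the input"
        return 2
    elif [ "$temp" -ge "$limit" ]; then
        echo "FAILURE: battery temperature ${temp}C is at or above ${limit}C"
        return 1
    fi
    echo "SUCCESS: battery temperature ${temp}C is below ${limit}C"
}
```

On Linux, for example: /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0 -nolog | check_bbu_temperature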

NOTE: Replace Battery Module after 3 Year service life assuming the battery temperature has not exceeded 55C. If the temperature has exceeded 55C (battery temp shall not exceed 60C), replace the battery every 2 years.
[NOTE: The complete reference guide for LSI disk controller batteries used in Exadata can be found in MOS Unpublished Note 1329989.1 (INTERNAL ONLY)]

Verify Database Server Disk Controller Configuration

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 03/17/18 | Dib Chatterjee | Production | Exadata - Physical, Exadata - Management Domain | X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, X7-2 | Bug 27525145 - exachk; Bug 26775963 - exachk; Bug 24533088 - exachk; Bug 20557656 - exachk

DB Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 18.2.0 | N/A
Benefit / Impact:
The recommended configuration for a newly deployed (or upgraded from 11.2.3.2.0) database server varies according to the hardware type and Exadata software version. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
The impact of verifying the database server disk controller configuration is minimal. The impact of corrective actions will vary depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.
Risk:
Not verifying the database server disk controller configuration increases the chance of a performance degradation or an outage.
Action / Repair:
exachk contains all the logic necessary to identify the various correct configurations. To verify the database server disk controller configuration, run exachk and evaluate the results.
To manually verify the database server disk controller configuration, execute the following command set as the "root" userid on each database server or the management domain of a virtualized environment:
NOTE: This check is not applicable to X7-8 Oracle Exadata Database Servers as they contain no conventional disk drives!
if [[ -d /proc/xen && ! -f /proc/xen/capabilities ]]
then
  echo -e "\nThis check will not run in a user domain of a virtualized environment.  Execute this check in the management domain.\n"
else
  if [ -x /opt/MegaRAID/storcli/storcli64 ]
  then
    export CMD=/opt/MegaRAID/storcli/storcli64
  else
    export CMD=/opt/MegaRAID/MegaCli/MegaCli64
  fi
  RAW_OUTPUT=$($CMD AdpAllInfo -aALL -nolog | grep "Device Present" -A 8);
  echo -e "The database server disk controller configuration found is:\n\n$RAW_OUTPUT";
fi;
The output will be similar to:
                Device Present
                ================
  Virtual Drives    : 1
    Degraded        : 0
    Offline         : 0
  Physical Devices  : 5
    Disks           : 4
    Critical Disks  : 0
    Failed Disks    : 0  
The output should match one of the combinations of entries in this table:

Database Server Disk Controller Configurations
Engineered System | Virtual Drives | Degraded | Offline | Physical Devices | Disks | Critical Disks | Failed Disks | Exadata Version
X2-2(4170), X2-2 < 11.2.3.2.0 
X2-8 11 < 11.2.3.2.0 
  
X2-2(4170), X2-2, X3-2, X4-2, X5-2, X6-2, X7-2 >= 11.2.3.2.0 
X5-2, X6-2, X7-2 (Disk Expansion Kit) >= 11.2.3.2.0 
X2-8, X3-8 11 >= 11.2.3.2.0 
X4-8 >= 11.2.3.2.0 
X5-8, X6-8 >= 11.2.3.2.0 
NOTE: The Disk Expansion Kit is only applicable to X5-2, X6-2, and X7-2 database servers.
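
Independent of the expected drive counts, the fault counters in the "Device Present" block can be screened automatically. The following is a minimal sketch, assuming (as the sample output above suggests) that a healthy controller reports 0 for "Degraded", "Offline", "Critical Disks", and "Failed Disks"; the function names are hypothetical, and the sketch does not validate the "Virtual Drives", "Physical Devices", or "Disks" counts against the table.

```shell
#!/bin/bash
# Read the "Device Present" block on stdin and print every fault counter
# whose value is not 0, one "name=value" pair per line.
nonzero_fault_counters() {
    awk -F':' '
        /Degraded|Offline|Critical Disks|Failed Disks/ {
            name = $1; count = $2
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim the label
            gsub(/[ \t]/, "", count)            # trim the value
            if (count + 0 != 0) print name "=" count
        }'
}

# Summarize the result in the SUCCESS/FAILURE style used elsewhere in this note.
check_device_present() {
    local bad
    bad=$(nonzero_fault_counters)
    if [ -z "$bad" ]; then
        echo "SUCCESS: no degraded, offline, critical, or failed devices reported"
    else
        echo "FAILURE: non-zero fault counters found:"
        echo "$bad"
        return 1
    fi
}
```

The function can be fed from the variable set by the command set above, for example: echo "$RAW_OUTPUT" | check_device_present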

Verify Database Server Virtual Drive Configuration


Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 03/07/18 | Dib Chatterjee | Production | Exadata - Management Domain, Exadata - Physical | X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, X7-2 | Bug 27533289 - exachk; Bug 26775963 - exachk; Bug 24533222 - exachk; Bug 20557656 - exachk

DB Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 18.2.0 | N/A
Benefit / Impact:
The recommended configuration for a newly deployed (or upgraded from 11.2.3.2.0) database server varies according to the hardware type and Exadata software version. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
The impact of verifying the database server virtual drive configuration is minimal. The impact of corrective actions will vary depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.
Risk:
Not verifying the virtual drives increases the chance of a performance degradation or an outage.
Action / Repair:
exachk contains all the logic necessary to identify the various correct configurations. To verify the database server virtual drive configuration, run exachk and evaluate the results.
To manually verify the database server virtual drive configuration, execute the following command set as the "root" userid on each database server or the management domain of a virtualized environment:
NOTE: This check is not applicable to X7-8 Oracle Exadata Database Servers as they contain no conventional disk drives!
if [[ -d /proc/xen && ! -f /proc/xen/capabilities ]]
then
  echo -e "\nThis check will not run in a user domain of a virtualized environment.  Execute this check in the management domain.\n"
else
  if [ -x /opt/MegaRAID/storcli/storcli64 ]
  then
    export CMD=/opt/MegaRAID/storcli/storcli64
  else
    export CMD=/opt/MegaRAID/MegaCli/MegaCli64
  fi
  RAW_OUTPUT=$($CMD CfgDsply -aALL -nolog | egrep "Virtual Drive:|Number Of Drives|^State");
  echo -e "The database server virtual drive configuration found is:\n\n$RAW_OUTPUT";
fi;
The output will be similar to:
Virtual Drive: 0 (Target Id: 0)
Number Of Drives    : 4
State               : Optimal
The output should match one of the combinations of entries in this table:

Database Server Virtual Drive Configurations
Engineered System | Number of Virtual Drives | State | Number of Physical Drives | Exadata Version
X2-2(4170), X2-2 Optimal < 11.2.3.2.0 
X2-8 Optimal < 11.2.3.2.0 
  
X2-2(4170), X2-2, X3-2, X4-2, X5-2, X6-2, X7-2 Optimal >= 11.2.3.2.0 
X5-2, X6-2, X7-2 (Disk Expansion Kit) Optimal >= 11.2.3.2.0 
X2-8, X3-8 Optimal >= 11.2.3.2.0 
X4-8 Optimal >= 11.2.3.2.0 
X5-8, X6-8 Optimal >= 11.2.3.2.0 
NOTE: The virtual device number reported may vary depending upon configuration and version levels.
NOTE: The Disk Expansion Kit is only applicable to X5-2, X6-2, and X7-2 database servers.
NOTE: If additional virtual drives are present, it may be that the reclaimdisks procedure was not executed at deployment time or that a bare metal restore procedure was performed without using the "dualboot=no" qualifier. Please refer to the "Reclaiming Disks for the Linux Operating System" section of "Oracle® Exadata Database Machine Owner's Guide, 11g Release 2 (11.2)". See also "Verify Database Server Physical Drive Configuration".

NOTE: If the database server was upgraded to 11.2.3.2.0, this check may fail because the reported number of drives is "3" or "7". Please see the "Known Issues" #5 "Hotspare removed for compute nodes" in My Oracle Support note 1468877.1 for corrective action.
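
Every row of the table expects a state of "Optimal", so that part of the comparison is easy to automate. The following is a minimal sketch, assuming the filtered output has the three-line shape shown above ("Virtual Drive:", "Number Of Drives", "State"); the function names are hypothetical, and the number of drives still has to be compared with the table by hand.

```shell
#!/bin/bash
# Read the filtered CfgDsply output on stdin and print one line for each
# virtual drive whose State is anything other than "Optimal".
non_optimal_virtual_drives() {
    awk '
        /^Virtual Drive:/ { vd = $3 }          # remember the current drive number
        /^State/ {
            state = $0
            sub(/^State[ \t]*:[ \t]*/, "", state)
            sub(/[ \t]+$/, "", state)
            if (state != "Optimal") print "Virtual Drive " vd ": " state
        }'
}

# Summarize the result in the SUCCESS/FAILURE style used elsewhere in this note.
check_virtual_drives() {
    local bad
    bad=$(non_optimal_virtual_drives)
    if [ -z "$bad" ]; then
        echo "SUCCESS: all virtual drives report State Optimal"
    else
        echo "FAILURE: virtual drives not in State Optimal:"
        echo "$bad"
        return 1
    fi
}
```

For example, after running the command set above: echo "$RAW_OUTPUT" | check_virtual_drives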

Verify Database Server Physical Drive Configuration

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 03/07/2018 | Dib Chatterjee | Production | Exadata - Physical, Exadata - Management Domain | X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, X7-2 | Bug 27533421 - exachk; Bug 26775963 - exachk; Bug 24533293 - exachk; Bug 20557656 - exachk

DB Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 18.2.0 | N/A
Benefit / Impact:
The recommended configuration for a newly deployed (or upgraded from 11.2.3.2.0) database server varies according to the hardware type and Exadata software version. Verifying the status of the database server RAID devices helps to avoid a possible performance impact, or an outage.
The impact of verifying the database server physical drive configuration is minimal. The impact of corrective actions will vary depending on the specific issue uncovered, and may range from simple reconfiguration to an outage.
Risk:
Not verifying the physical drives increases the chance of a performance degradation or an outage.
Action / Repair:
exachk contains all the logic necessary to identify the various correct configurations. To verify the database server physical drive configuration, run exachk and evaluate the results.
To manually verify the database server physical drive configuration, execute the following command set as the "root" userid on each database server or the management domain of a virtualized environment:
NOTE: This check is not applicable to X7-8 Oracle Exadata Database Servers as they contain no conventional disk drives!
if [[ -d /proc/xen && ! -f /proc/xen/capabilities ]]
then
  echo -e "\nThis check will not run in a user domain of a virtualized environment.  Execute this check in the management domain.\n"
else
  if [ -x /opt/MegaRAID/storcli/storcli64 ]
  then
    export CMD=/opt/MegaRAID/storcli/storcli64
  else
    export CMD=/opt/MegaRAID/MegaCli/MegaCli64
  fi
  RAW_OUTPUT=$($CMD PDList -aALL -nolog | grep "Firmware state");
  echo -e "The database server physical drive configuration found is:\n\n$RAW_OUTPUT";
fi;
The output will be similar to:

Recommended Configuration 

The database server physical drive configuration found is:

Firmware state: Online, Spun Up 
<output truncated for brevity> 
Firmware state: Online, Spun Up
The output should match one of the combinations of entries in this table:

Database Server Physical Drive Configurations
Engineered System | Online | Spun Up | Hotspare | Spun Down | Exadata Version
X2-2(4170), X2-2 < 11.2.3.2.0 
X2-8 < 11.2.3.2.0 
  
X2-2(4170), X2-2, X3-2, X4-2, X5-2, X6-2, X7-2 >= 11.2.3.2.0 
X5-2, X6-2, X7-2 (Disk Expansion Kit) >= 11.2.3.2.0 
X2-8, X3-8 >= 11.2.3.2.0 
X4-8 >= 11.2.3.2.0 
X5-8, X6-8 >= 11.2.3.2.0 
If the reported output differs, investigate and correct the condition.
NOTE: The Disk Expansion Kit is only applicable to X5-2, X6-2, and X7-2 database servers.
NOTE: If the database server was upgraded to 11.2.3.2.0, this check may fail because one of the devices shows a state of: "Unconfigured(good), Spun Up". Please see the "Known Issues" #5 "Hotspare removed for compute nodes" in My Oracle Support note 1468877.1 for corrective action.

 Alternate Configuration

For an X2-2(4170), X2-2, or X2-8 database server which is running an Exadata software version lower than 11.2.3.2.0 that is being upgraded to an Exadata software version of 11.2.3.2.1 or higher, an alternate configuration is permitted. The alternate configuration for an X2-2(4170) or X2-2 uses 3 disks in the RAID set with 1 disk as a hot spare. The alternate configuration for an X2-8 uses 7 disks in the RAID set with 1 disk as a hot spare.
The output should be similar to:
Firmware state: Online, Spun Up 
<output truncated for brevity>
Firmware state: Hotspare, Spun down
For an X2-2(4170) or X2-2, the expected output should contain three lines of output showing a state of "Online, Spun Up", and one line showing a state of "Hotspare, Spun down". For an X2-8, the expected output should contain seven lines of output showing a state of "Online, Spun Up", and one line showing a state of "Hotspare, Spun down". In either case, the ordering of the output lines is not significant and may vary based upon a given database server's physical drive replacement history.
If the reported output differs, investigate and correct the condition.
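
The line counts described for the alternate configuration can be checked with a short script. The following is a minimal sketch using hypothetical function names; it takes the expected number of "Online, Spun Up" drives as an argument (3 for an X2-2(4170) or X2-2, 7 for an X2-8, per the text above) and expects exactly one "Hotspare, Spun down" line. Matching is case-insensitive and line order is ignored.

```shell
#!/bin/bash
# Count lines on stdin that report the given firmware state.
count_firmware_state() {
    # grep -c exits non-zero when the count is 0, so the status is masked.
    grep -ci "Firmware state: *$1" || true
}

# $1 = expected number of "Online, Spun Up" drives in the alternate configuration.
check_alternate_config() {
    local expected_online=$1 states online hotspare
    states=$(cat)
    online=$(printf '%s\n' "$states" | count_firmware_state "Online, Spun Up")
    hotspare=$(printf '%s\n' "$states" | count_firmware_state "Hotspare, Spun down")
    if [ "$online" -eq "$expected_online" ] && [ "$hotspare" -eq 1 ]; then
        echo "SUCCESS: $online online, $hotspare hot spare"
    else
        echo "FAILURE: $online online, $hotspare hot spare (expected $expected_online online, 1 hot spare)"
        return 1
    fi
}
```

For example, on an X2-2 after running the command set above: echo "$RAW_OUTPUT" | check_alternate_config 3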
NOTE (modified 03/21/12): Occasionally in normal operation, the "Hotspare" physical drive may be brought to a state of "Online, Spun Up". Thirty minutes (default) after the operation that brought the drive to "Online, Spun Up" has completed, the drive should spin down due to the powersaving feature. There is no harm in the drive being "Online, Spun Up" if no other errors are reported in the disk drive configuration checks.

For additional information, please reference My Oracle Support note "Exadata: Hot Spares Not Spinning Down (Doc ID 1403613.1)"

Verify database server disk controllers use writeback cache

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 03/07/18 | | Production | Exadata - Physical, Exadata - Management Domain | X2-2, X2-8, X3-2, X3-8, X4-2, X4-8, X5-2, X5-8, X6-2, X6-8, X7-2 | Bug 27523948 - exachk

DB Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 18.2.0 | N/A
Benefit / Impact:
Database servers use an internal RAID controller with a battery-backed cache to host local filesystems. For maximum performance when writing I/O to local disks, the battery-backed cache should be in "WriteBack" mode.
The impact of configuring the battery-backed cache in "WriteBack" mode is minimal.
Risk:
Not configuring the battery-backed cache in "WriteBack" mode will result in degraded performance when writing I/O to the local database server disks.
Action / Repair:
To verify that the disk controller battery-backed cache is in "WriteBack" mode, run the following set of commands as the "root" userid on all database servers:
NOTE: This check is not applicable to X7-8 Oracle Exadata Database Servers as they contain no conventional disk drives!
unset NON_WRITEBACK
if [ -x /opt/MegaRAID/storcli/storcli64 ]
then
  export CMD=/opt/MegaRAID/storcli/storcli64
else
  export CMD=/opt/MegaRAID/MegaCli/MegaCli64
fi
RAW_OUTPUT=$($CMD -CfgDsply -a0 -nolog | egrep -i "Virtual Drive:|Current Cache Policy:" | grep -v Number | sed 'N;s/\n/ /')
NON_WRITEBACK=$(echo -n "$RAW_OUTPUT" | grep -vi writeback)
if [ -z "$NON_WRITEBACK" ]
then
  echo -e "SUCCESS: All virtual drives have \"Current Cache Policy\" set to \"WriteBack\"."
else
  echo -e "FAILURE: One or more virtual drives do not have \"Current Cache Policy\" set to \"WriteBack\".  Details:\n\n$NON_WRITEBACK"
fi
The output should be:
SUCCESS: All virtual drives have "Current Cache Policy" set to "WriteBack".
Example of a "FAILURE:" result:
FAILURE: One or more virtual drives do not have "Current Cache Policy" set to "WriteBack".  Details:

Virtual Drive: 0 (Target Id: 1) Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
If the battery-backed cache is not in "WriteBack" mode, run these commands as the "root" userid on the affected database server to place the battery-backed cache into "WriteBack" mode:
if [ -x /opt/MegaRAID/storcli/storcli64 ]
then
  export CMD=/opt/MegaRAID/storcli/storcli64
else
  export CMD=/opt/MegaRAID/MegaCli/MegaCli64
fi
$CMD -LDSetProp WB  -Lall  -a0 -nolog
$CMD -LDSetProp NoCachedBadBBU -Lall  -a0 -nolog
$CMD -LDSetProp NORA -Lall  -a0 -nolog
$CMD -LDSetProp Direct -Lall  -a0 -nolog
NOTE: No settings should be modified on Exadata storage cells. The mode described above applies only to database servers in an Exadata database machine.

Verify that "Disk Cache Policy" is set to "Disabled"

Priority | Added | Machine Type | OS Type | Exadata Version | Oracle Version
Critical | 06/13/11 | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | Linux | 11.2.x + | 11.2.x +

Benefit / Impact:
"Disk Cache Policy" is set to "Disabled" by default at imaging time and should not be changed, because the cache created by setting "Disk Cache Policy" to "Enabled" is not battery backed. A replacement drive may have the disk cache policy enabled, so it is a good idea to check this setting after replacing a drive.
The impact of verifying that "Disk Cache Policy" is set to "Disabled" is minimal. The impact of suddenly losing power with "Disk Cache Policy" set to anything other than "Disabled" will vary according to each specific case, and cannot be estimated here.
Risk:
If the "Disk Cache Policy" is not "Disabled", there is a risk of data loss in the event of a sudden power loss because the cache created by "Disk Cache Policy" is not backed up by a battery.
Action / Repair:
To verify that "Disk Cache Policy" is set to "Disabled" on all servers, use the following command as the "root" userid on the first database server in the cluster:
unset TMP_RSLT;
TMP_RSLT=$(dcli -g /opt/oracle.SupportTools/onecommand/all_group -l root "if [ -x /opt/MegaRAID/MegaCli/MegaCli64 ]; then /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aALL -nolog; else /opt/MegaRAID/MegaCli -LdPdInfo -aALL -nolog; fi;" | grep -i 'Disk Cache Policy' | grep -v Disabled | wc -l)
if [ "$TMP_RSLT" -eq 0 ]
then
echo -e "\nSUCCESS\n"
else
echo -e "\nFAILURE:";
dcli -g /opt/oracle.SupportTools/onecommand/all_group -l root "if [ -x /opt/MegaRAID/MegaCli/MegaCli64 ]; then /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aALL -nolog; else /opt/MegaRAID/MegaCli -LdPdInfo -aALL -nolog; fi;" | grep -i 'Disk Cache Policy' | grep -v Disabled;
echo -e "\n";
fi;
The output should be:
SUCCESS
If anything other than "SUCCESS" is returned, identify the LUN(s) in question and reset the "Disk Cache Policy" to "Disabled" using the following commands as the "root" userid on the server that reported the issue (where Lx= the lun in question, for example: L2):
if [ -x /opt/MegaRAID/MegaCli/MegaCli64 ]
then
#Linux
export TMP_CMD=/opt/MegaRAID/MegaCli/MegaCli64
else
#Solaris
export TMP_CMD=/opt/MegaRAID/MegaCli
fi;
$TMP_CMD -LDSetProp -DisDskCache -Lx -a0 -nolog
Note: The "Disk Cache Policy" is completely separate from the disk controller caching mode of "WriteBack". Do not confuse the two. The cache created by "WriteBack" cache mode is battery-backed; the cache created by "Disk Cache Policy" is not!

Verify service exachkcfg autostart status
Priority | Alert Level | Date | Owner | Status | Scope | Bug(s)
Critical | FAIL | 05/14/2014 | <Name> | Production | Exadata, SSC, Exalogic | 18735585 - exachk

DB Version | DB Role | Engineered System | Exadata Version | OS & Version | Validation Tool Version | TBD
N/A | N/A | X2-2(4170), X2-2, X2-8, X3-2, X3-8, X4-2 | 11.2.2.2.0+ | Linux x86-64 | exachk 2.2.5 |
Benefit / Impact:
Verifying the exachkcfg service autostart status helps to avoid an unexpected modification attempt and a possibly lengthened boot sequence. The impact of verifying the exachkcfg service autostart status is minimal.
Risk:
On either a database or storage server, a required maintenance operation or an incorrect configuration change might be missed.
Action / Repair:
To verify the exachkcfg service autostart status, execute the following command as the "root" userid on all storage and database servers:
chkconfig --list exachkcfg;
The output should be similar to:
exachkcfg 0:off 1:off 2:off 3:on 4:off 5:off 6:off
For either a database or storage server, run level 3 should be "on" (3:on).
It should be rare to find this not set as expected. Should a correction be required, as the root userid, use the "chkconfig --level" command. For example, to set the run level "3" for exachkcfg to "on" for a database server with exadata image version >= 11.2.3.3.0:
[root@randomdb03 ~]# chkconfig --level 3 exachkcfg on
For another example, to set the run level "3" for exachkcfg to "off" for a database server with exadata image version < 11.2.3.3.0:
[root@randomdb03 ~]# chkconfig --level 3 exachkcfg off

NOTE: At exadata image versions below 11.2.3.3.0, on a database server all run levels should be set to "off", and on a storage server, at least one run level should be set to "on" (number varies by exadata software version).
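
The run level 3 setting can be extracted from the "chkconfig --list" output and compared with the expected value. The following is a minimal sketch with hypothetical function names; the expected value is passed as an argument because, per the NOTE above, it depends on the server type and Exadata image version.

```shell
#!/bin/bash
# Read one line of "chkconfig --list <service>" output on stdin and print
# the on/off setting for the requested run level.
runlevel_state() {
    local level=$1
    tr -s ' \t' '\n' | awk -F: -v lvl="$level" '$1 == lvl { print $2; exit }'
}

# $1 = expected setting for run level 3 ("on" by default).
check_exachkcfg_autostart() {
    local expected=${1:-on} actual
    actual=$(runlevel_state 3)
    if [ "$actual" = "$expected" ]; then
        echo "SUCCESS: exachkcfg run level 3 is $actual"
    else
        echo "FAILURE: exachkcfg run level 3 is '${actual:-unknown}', expected $expected"
        return 1
    fi
}
```

For example: chkconfig --list exachkcfg | check_exachkcfg_autostart on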

Check alerthistory for test open stateless alerts

Priority | Alert Level | Date | Owner | Status | Engineered System | Engineered System Platform | Bug(s)
Critical | FAIL | 11/01/2017 | <Name> | Production | Exadata - Physical, Exadata - Management Domain | ALL | 26651210 - exachk; 21299854 - exachk

GI/DB Version | DB Type | DB Role | DB Mode | Exadata Version | OS & Version | Validation Tool Version | MAA Scorecard Section
N/A | N/A | N/A | N/A | ALL | Linux | exachk 12.2.0.1.4 | N/A

Benefit / Impact:
There are two types of alerts maintained in the alerthistory of a storage or database server: stateful and stateless.
Stateless alerts are not cleared automatically. They do not age out of the alerthistory until the alert is manually investigated and the "examinedby" field is manually set to a non-null value, typically the name of the person who reviewed the stateless alert and corrected or otherwise acted upon the information provided.
The benefit of checking for test open stateless alerts is a less cluttered alerthistory. The impact of acknowledging any test open stateless alert is minimal.
Risk:
Unnecessary test alerts maintained in the alerthistory.
Action / Repair:
To verify there are no test open stateless alerts, as the root userid on each storage and database server execute the following commands:
unset IMAGE_VERSION
unset NODE_TYPE
unset COMMAND_NAME
unset NAME_ARRAY
unset INDIVIDUAL_NAME
unset SID
unset SEVERITY
unset MESSAGE
unset ACTION
unset OUTPUT_ARRAY
if [ `egrep -i node.type /opt/oracle.cellos/cell.conf | grep -i db | wc -l` -eq 1 ]
  then NODE_TYPE=db
else
  NODE_TYPE=cell
fi
IMAGE_VERSION=$(imageinfo -version |tr -d '.'|cut -c1-6)
if [ $NODE_TYPE = "cell" ]
then
  COMMAND_NAME=cellcli
else
  if [ $IMAGE_VERSION -ge 121211 ]
    then COMMAND_NAME=dbmcli
  fi;
fi;
if [ -n "$COMMAND_NAME" ]
then
  NAME_ARRAY=$($COMMAND_NAME -e list alerthistory attributes name,alertmessage where alerttype=stateless and examinedby=\'\' | grep -iw test | sed -e 's/^[ \t]*//' | cut -d" " -f1);
  if [ -z "$NAME_ARRAY" ]
  then
    echo -e "SUCCESS: there are no test open stateless alerts."
  else
    for INDIVIDUAL_NAME in $NAME_ARRAY
    do
      NAME_RECORD=$($COMMAND_NAME -e "list alerthistory attributes alertsequenceid,severity,alertMessage,alertAction where name=$INDIVIDUAL_NAME" | awk '{$2=$2};1')
      SID=$(echo "$NAME_RECORD" | cut -d" " -f1)
      SEVERITY=$(echo "$NAME_RECORD" | cut -d" " -f2)
      MESSAGE=$(echo "$NAME_RECORD" | cut -d'"' -f2)
      ACTION=$(echo "$NAME_RECORD" | cut -d'"' -f4)
      OUTPUT_ARRAY+=$(echo -e "\n";echo -e "SID:\t\t$SID";echo -e "NAME:\t\t$INDIVIDUAL_NAME";echo -e "SEVERITY:\t$SEVERITY";echo -e "MESSAGE:\t$MESSAGE";echo -e "ACTION:\t\t$ACTION")
    done
    echo -e -n "FAILURE: there are one or more test open stateless alerts that have not been cleared. Details:"
    echo -e "${OUTPUT_ARRAY[@]}"
  fi
else
  echo "alerthistory is not available on database servers at image versions below 12.1.2.1.1: $NODE_TYPE $IMAGE_VERSION"
fi
The output should be similar to:
SUCCESS: there are no test open stateless alerts.
- OR -
alerthistory is not available on database servers at image versions below 12.1.2.1.1: db 112322
If the output is not as expected, examine the full details for each name that has not been cleared and follow the recommendations.
Example of a FAILURE result:
FAILURE: there are one or more test open stateless alerts that have not been cleared. Details:
SID:            2
NAME:           2
SEVERITY:       info
MESSAGE:        "This is a test trap"
ACTION:         
To acknowledge a test open stateless alert, manually set the "examinedby" field with a command similar to the following (command name is either cellcli or dbmcli, depending upon whether a storage or database server is involved):
CellCLI> alter alerthistory 2 examinedby="jdoe"
Alert 2 successfully altered
Where jdoe is the name of the person who verified the test open stateless alert, and the number is the name of the stateless alert. Note that double quotes are used around the value to be set, but not the name of the stateless alert.

 Revision History


Date
Change
Nov 23 2016
Hidden Parameters Table MAA Nov 23 2016
Nov 16 2016
MAA Nov 16 2016 Verify There Are No Memory (ECC) Errors
Oct 5 2016
Check /EXAVMIMAGES on dom0s for possible over allocation by sparse files
Verify InfiniBand Address Resolution Protocol (ARP) Configuration on Database Servers
Aug 22 2016
Verify "_reconnect_to_cell_attempts=9" on database servers which access X6 storage servers
April 6 2016
Detect duplicate files in /etc/*init* directories
Verify Initialization parameters and diskgroup attributes
Verify RAID disk controller CacheVault capacitor condition
Verify Exadata Smart Flash Cache is created
March 23 2016
Verify Ambient Air Temperature – improved existing section
March 16 2016
Verify database server file systems have "Maximum mount count" = "-1"
Verify database server file systems have "Check interval" = "0"
February 17 2016
Verify Datafiles are Placed on Diskgroups consisting of griddisks with unset cachedBy attribute – updated to only check when flashcache in WriteBack mode
February 16 2016
Adding Validate key sysctl.conf parameters on database servers
February 5 2016
Adding:
Verify storage server data (non-system) disks have no partitions
Verify db_unique_name is used in I/O Resource Management (IORM) interdatabase plans
Verify Datafiles are Placed on Diskgroups consisting of griddisks with cachingPolicy = DEFAULT
Verify Datafiles are Placed on Diskgroups consisting of griddisks with unset cachedBy attribute
January 26 2016
Adding X4-8/X5-2 to the list of supported platforms
Feb 10 2017
Consolidation Parameters Reference Table – updates to parallel parameters row, and removal of unneeded Exadata platform specific resource references
Mar 01 2017
Verify "downdelay" is correctly set for bonded client interfaces – improved with more checks
Verify Storage Server user "CELLDIAG" exists – improved with prompt for password
Mar 14 2017
(1) Verify RDS Protocol over InfiniBand Network is used – existing section improved
(2) Verify all Database and Storage Servers are synchronized with the same NTP server – existing section improved
Mar 27 2017
(1) Check /EXAVMIMAGES on dom0s for possible over allocation by sparse files – converted to new style using exachk -check
Apr 4 2017
(1) Verify ExaWatcher is executing
(2) Verify non-Default services are created for all Pluggable Databases
Apr 27 2017
(1) Verify "diagsnap.pl" is not executing
Jun 7 2017
(1) Verify Hidden Initialization Parameter Usage – updated version for _parallel_adaptive_max_users to include 12.2
(2) Verify IP routing configuration on database servers
(3) Verify Grid Infrastructure Management Database (MGMTDB) configuration
Jun 29 2017
(1) Verify Automatic Storage Management Cluster File System (ACFS) is on a separate Disk Group
July 12 2017
(1) Ensure Temporary Tablespace is correctly defined (ARCHIVE) – archived; confirmed deployment template has key attributes that are still valid today
(2) Verify ASM Diskgroup Attributes for 12.2.0.x – new
(3) Verify the SYSTEM, SYSAUX, USERS and TEMP tablespaces are of type bigfile – new
(4) Verify ASM Diskgroup Attributes for 12.1.0.x – updated to have “>=” for repair timers
(5) Verify the ownership and permissions of the "oradism" file – updated to execute as software owner instead of root
(6) Verify all "BIGFILE" tablespaces have non-default "MAXBYTES" values set (ARCHIVE) – archived; relying on other tools like EM to handle the problem this was originally created to solve
July 19 2017 
July 26 2017 
Sep 9 2017
(1) Verify Hidden Initialization Parameter Usage – added _asm_max_connected_clients as acceptable in 12.2.0.1
Oct 10 2017
(1) Verify the recommended patches for Adaptive features are installed
(2) Verify that griddisks are distributed as expected across celldisks
(3) Verify Exadata Smart Flash Cache is Created
(4) Verify Database Server Disk Controller Configuration
(5) Verify Database Server Virtual Drive Configuration
Oct 28 2017 
(1) Verify Database Server Physical Drive Configuration
(2) Verify Grid Infrastructure Management Database (MGMTDB) configuration (ARCHIVE) – archived
Nov 22 2017  
(1) Verify that griddisks are distributed as expected across celldisks – update; added exception for griddisk RA prefix “CATALOG”
(2) Check alerthistory for non-test open stateless alerts & Check alerthistory for test open stateless alerts – update; improved formatting
Dec 1 2017
(1) Verify that griddisks are distributed as expected across celldisks – update; added exception for griddisk RA prefix “CATALOG”
(2) Check alerthistory for non-test open stateless alerts & Check alerthistory for test open stateless alerts – update; improved formatting
(3) Verify initialization parameter cluster_database_instances is at the default value
(4) Verify the database server NVME device configuration
(5) Verify celldisk configuration on flash memory devices
Jan 25 2018
(1) Verify "diagsnap.pl" is not executing - update; added repair operation
(2) Verify all Database and Storage Servers are synchronized with the same NTP server – update; retrofitted for exachk
(3) Verify that Automatic Storage Management Cluster File System (ACFS) uses 4K metadata block size
(4) Verify database server quorum disks configuration
Mar 08 2018
(1) Verify RAID disk controller CacheVault capacitor condition
(2) Modified - Verify the storage servers in use configuration matches across the cluster
(3) Modified - Verify database server disk controllers use writeback cache
(4) Verify Database Server Virtual Drive Configuration
(5) Verify Database Server Physical Drive Configuration
(6) Verify active system values match those defined in configuration file "cell.conf"
Mar 21 2018 
(1) Check cell BIOS state for restore pending status (ARCHIVE) – archived
Apr 21 2018
(1) Evaluate Automated Maintenance Tasks configuration – new best practice added
May 15 2018
(1) Verify proper ACFS drivers are installed for Spectre v2 mitigation
Jun 7 2018 
(1) Verify "diagsnap.pl" is not executing (ARCHIVE) – archived; we have coverage in critical issue DB41
(2) Verify memlock is 90% of phys ram when huge pages are enabled (ARCHIVE) – archived; orachk will retain memlock check for hugepages
Jun 28 2018
(1) Verify Exafusion Memory Lock Configuration
Jul 13 2018
(1) included release 18c for _asm_max_connected_clients
Aug 14 2018
 
(1) Verify Hidden Initialization Parameter Usage - update; consolidated all recommendations around hidden parameters into one section
(2) Verify there are no unhealthy InfiniBand switch sensors
(3) Verify RAID disk controller CacheVault capacitor condition
(4) Verify RAID Disk Controller Battery Condition
(5) Verify RAID Controller Battery Temperature (ARCHIVE)
Sep 26 2018
(1) Verify the InfiniBand Fabric Topology (verify-topology)
(2) Refer to MOS 1682501.1 if non-Exadata components are in use on the InfiniBand fabric
(3) Verify Database Server Disk Controller Configuration (ARCHIVE) - will not run in 18.1 and higher
(4) Verify Database Server Virtual Drive Configuration (ARCHIVE) - will not run in 18.1 and higher
(5) Verify Database Server Physical Drive Configuration (ARCHIVE) - will not run in 18.1 and higher
(6) Verify Common Instance Database Initialization Parameters for 12.1.0.x & Verify Common Instance Database Initialization Parameters for 12.2.0.1 – expand existing audit_trail and control_files checks
Sep 27 2018
(1) Verify database server disk controllers use "WriteBack" cache (ARCHIVE) – no longer needed in Exadata 18.1 and higher
(2) Verify that "Disk Cache Policy" is set to "Disabled" (ARCHIVE) – no longer needed in Exadata 18.1 and higher
(3) Verify service exachkcfg autostart status (ARCHIVE) – no longer needed in Exadata 19.1 and higher
Oct 3 2018
(1) Verify Hidden Initialization Parameter Usage – update; adjusted _backup_disk_bufcnt, _backup_disk_bufsz, _backup_file_bufcnt, _backup_file_bufsz to only be checked with database version 12.1 and lower
Dec 18 2018
(1) Verify active kernel version matches expected version for installed Exadata Image -- OL7 support added
(2) Verify installed rpm(s) kernel type match the active kernel version -- OL7 support added
(3) Verify the Master Subnet Manager is running on an InfiniBand switch -- OL7 support added
(4) Verify the Subnet Manager is properly disabled -- OL7 support disabled.
Feb 13 2018
(1) Verify the storage servers in use configuration matches across the cluster
Apr 20 2019
(1) Verify the ib_sdp module is not loaded into the kernel
May 03 2019
(1) Verify Hidden Initialization Parameter Usage - update; improved wording for _enable_numa_support
(2) Verify the vm.min_free_kbytes configuration - update; improved logic making it numa aware and increasing value accordingly
(3) Verify all database and storage servers time server configuration - update to cover mixed ntp/chrony case
Jul 11 2019
(1) Verify all voting disks are online & Verify database server quorum disks configuration - improved existing sections
(2) Verify all database and storage servers time server configuration - update to cover mixed ntp/chrony case
(3) Verify Automatic Storage Management Cluster File System (ACFS) file systems do not contain critical database files - improved existing section
(4) Verify the recommended patches for Adaptive features are installed - improved existing section
(5) Check alerthistory for stateful alerts not cleared - improved existing section & Check alerthistory for non-test open stateless alerts - improved existing section
(6) Check alerthistory for test open stateless alerts (ARCHIVE)
Sep 18 2019
(1) Verify available ksplice fixes are installed
(2) Verify Automatic Storage Management Cluster File System (ACFS) file systems do not contain critical database files - Improved the existing section 

 

REFERENCES


NOTE:1351559.1 - IDT switch on the PCI riser has a problem resulting in occasional loss of connectivity to pair of flash cards on the cells
NOTE:401749.1 - Oracle Linux: Shell Script to Calculate Values Recommended Linux HugePages / HugeTLB Configuration
NOTE:1284070.1 - Updating key software components on database hosts to match those on the cells
NOTE:1298957.1 - Manage Audit File Directory Growth with cron
NOTE:1286796.1 - rp_filter for multiple private interconnects and Linux Kernel 2.6.32+
NOTE:359515.1 - Mount Options for Oracle files for RAC databases and Clusterware when used with NFS on NAS devices
NOTE:1351036.1 - How to Validate and Fix Proper ASM Failure Group Configuration on Oracle Exadata Database Machine
NOTE:1188080.1 - Steps to shut down or reboot an Exadata storage cell without affecting ASM