by Jesse Gordon, Yasmin Hazrat, Masud Khandker, and Michelle Szucs

Detailed descriptions of Java and Oracle WebLogic Server tunings that can improve the performance of Oracle WebLogic Server on Oracle MiniCluster


Introduction


Oracle MiniCluster S7-2 is a high-performance engineered system for running both database and applications. It has a refined set of software tools for automating a variety of management tasks, including deploying and securing virtual machines. With impressive out-of-the-box Java performance and a streamlined administration tool, Oracle MiniCluster is ideal for running Oracle WebLogic Server. This article details a series of Java and Oracle WebLogic Server tunings that can further improve the performance of Oracle WebLogic Server on Oracle MiniCluster, providing a helpful guide to which tunings are most likely to be high-impact for certain types of applications.

Many of these guidelines are equally applicable to other SPARC systems that support the execution of a large number of concurrent application threads. Each tuning option is explained in detail to provide insight into when it might be helpful.

Performance Tuning Overview


Performance tuning recommendations are difficult to boil down into one-size-fits-all suggestions. There are many variables at play, from the application being tuned to the hardware configuration and the network configuration of the nodes. Additionally, goals will vary from application to application. In some cases, throughput is the most important metric, whereas other applications are focused on low latency. Determining the correct set of tunings for your application will require planning and hands-on testing. This article offers a helpful starting point for your tuning exercise on Oracle MiniCluster by providing insight into which tunings might or might not prove to be high-impact for your environment. For a more detailed description of tuning Oracle WebLogic Server, refer to the Oracle WebLogic Server performance and tuning documentation.

The first step in a tuning exercise is to identify performance objectives, which include the goals that you need to meet and the characteristics of your environment. Important information to collect includes the load level, the type of data that will be exchanged, and the frequency and severity of peak traffic. Together, this information provides a picture of whether to focus on optimizing for latency or throughput, which in turn determines the target CPU utilization. Applications that are sensitive to latency need a lower target CPU utilization (to accommodate peak loads) than those that are primarily concerned with throughput.

Once goals have been established, initial performance measurements can be taken by using a large load to try to achieve your target CPU utilization. If the target cannot be reached, examine which components of the system are being fully utilized; those are the bottlenecks that need to be addressed. The information in this article is useful for addressing bottlenecks in an application server. Therefore, the database, database disks, application disks, and network should first be ruled out as potential bottlenecks.

Apply tunings to the system, preferably one at a time, and then re-evaluate the performance until the desired performance goal is achieved.

Oracle MiniCluster S7-2 Performance Testing Architecture and Workload


To investigate the impact of selected performance options, a test environment was set up to run an Oracle WebLogic Server workload and measure the impacts of various tunings.

Testing was performed on an Oracle MiniCluster S7-2 running Oracle Solaris 11.3. Zone configuration and database installation was greatly simplified by using the virtual assistant provided with Oracle MiniCluster. The virtual assistant provides a browser user interface (BUI) that automates the deployment of virtual machines and the installation of Oracle Database, including Oracle Real Application Clusters (Oracle RAC), which is a clustered version of Oracle Database. As virtual machines are deployed, comprehensive security settings are applied.

As shown in Figure 1, two database virtual machines were created—one on each compute node—using the virtual assistant. From the BUI, Oracle Database was installed with Oracle RAC configured across the two nodes. Two secure application virtual machines were deployed, one on each compute node, through the BUI. Oracle WebLogic Server 12.2 was installed on both nodes, using the Oracle WebLogic Cluster feature with session replication enabled. JDK 8 was used.

Figure 1. Test architecture

As shown in Figure 2, a three-tier, client-server workload was used. This workload was intended to characterize the performance of Java EE servers, with requests including HTTP, HTTPS, and web services. The Oracle WebLogic Server components that were exercised included Enterprise JavaBeans (EJB), Servlet, Java Message Service (JMS), and JDBC connected to Oracle RAC–enabled back-end database servers.

Figure 2. Diagram of the three-tier client server workload

Workload Goals


Performance goals were focused on achieving high throughput while meeting response-time targets; specifically, the 90th percentile of all response times (for both HTTP and web service requests) had to stay under two seconds. High utilization of system resources was acceptable for this workload. However, applications with stringent latency requirements might require lower average utilization so that peak loads can be handled without a significant drop in response time.

Oracle MiniCluster S7-2 contains two of Oracle's SPARC S7-2 servers, each of which was configured with two 8-core SPARC S7 processors—exposing a total of 128 virtual CPUs (vCPUs). Tunings were geared to leverage this high degree of parallelism in the hardware layer. A single Oracle WebLogic Server instance scaled well for this application; however, certain applications might benefit from using multiple smaller Oracle WebLogic Server instances. Minimizing garbage collection overhead was a primary goal, because this exercise was more focused on performance than on saving memory resources.

Basic Oracle MiniCluster Configuration


With no tuning beyond application-specific settings, such as the heap size, the performance of Oracle WebLogic Server on Oracle MiniCluster was already very reasonable. No major changes were made to the system that were atypical of standard Oracle WebLogic use cases. In particular, no major Oracle Solaris or network tuning changes were required to support the workload.

One minor operating system tuning made for this particular workload was increasing the number of anonymous TCP ports. This change should be made if TCP connections cannot be established or if there is a shortage of ports.

ipadm set-prop -p smallest_anon_port=<N> tcp
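For example, the current range can be inspected before it is changed; lowering the smallest anonymous port widens the available range. A sketch (the port value is illustrative, not a recommendation):

```shell
# Show the current anonymous port range for TCP (Oracle Solaris).
ipadm show-prop -p smallest_anon_port,largest_anon_port tcp

# Lower the smallest anonymous port to widen the range;
# 8192 is an illustrative value, not a recommendation.
ipadm set-prop -p smallest_anon_port=8192 tcp
```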

Ports that are automatically disabled when the secure zones are created by the virtual assistant might need to be manually enabled. Additionally, to improve performance, ipfilter was disabled throughout the system by running the following command in both the global and non-global zones:

svcadm disable network/ipfilter

Tuning Options in Detail


Performance Tuning of Oracle WebLogic Server


1. Set the thread pool size:

-Dweblogic.threadpool.MinPoolSize=<N>
-Dweblogic.threadpool.MaxPoolSize=<N>

Oracle WebLogic Server uses a single thread pool to service requests, prioritizing requests based both on rules set by the administrator and on runtime metrics. By default, Oracle WebLogic Server 12.2 will dynamically size this pool between 1 and 65,534 threads. Dynamic resizing is based on historic throughput analysis, and can save administrative resources and time dedicated to testing, tuning, and monitoring the system.

However, in environments with predictable loads that are sensitive to performance, administrators might want to set the thread pool size manually. In some cases, dynamic resizing of the thread pool can happen faster (or slower) than is desirable; setting the minimum and maximum thread pool size to the same value will prevent dynamic resizing, leading to more predictable application performance. Disabling dynamic resizing also removes the overhead that would otherwise be dedicated to monitoring throughput and adjusting the thread count.
Setting these two values to be equal and preventing dynamic resizing might be especially beneficial when a large number of hardware strands or vCPUs are available.

In the Oracle WebLogic administration console, these options are Self Tuning Thread Minimum Pool Size and Self Tuning Thread Maximum Pool Size.
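As a sketch, a fixed-size pool can be configured by appending both flags with the same value to the JAVA_OPTIONS variable used by the WebLogic start scripts (the value 128 is purely illustrative and should be derived from testing on your system):

```shell
# Illustrative only: pin the self-tuning thread pool to a fixed size
# by setting the minimum and maximum to the same value.
JAVA_OPTIONS="${JAVA_OPTIONS} -Dweblogic.threadpool.MinPoolSize=128 -Dweblogic.threadpool.MaxPoolSize=128"
```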

2. Set the number of socket readers:

-Dweblogic.SocketReaders=<N>

A portion of the threads in the thread pool are devoted to reading incoming requests to the server, and the value of this option determines how many threads are designated as socket reader threads. Increasing this value will help when requests are not being processed quickly enough; if many socket reader threads are routinely idle, decreasing it will free up additional threads to serve as generic execution threads.

3. Select the muxer class:

-Dweblogic.MuxerClass=weblogic.socket.NIOSocketMuxer

This option selects the default NIOSocketMuxer as the software component that reads requests the server receives, delegates their processing appropriately, and ensures the response is routed correctly. This particular muxer uses a non-blocking I/O implementation, which allows for greater scalability of applications. Java muxers will block until there is data at the socket to be read, which can create performance problems for servers with clients that send requests infrequently. Environments with a high number of clients that send requests infrequently will likely benefit from using NIOSocketMuxer.

This setting is the default in Oracle WebLogic Server 12.2. In the administration console, this option is Muxer Class.

4. Enable production mode:

-Dweblogic.ProductionModeEnabled=true

This option is not enabled by default, but it should be turned on in production environments for increased security. Additionally, production mode disables page checks, saving resources from monitoring whether JavaServer Pages (JSP) files need to be recompiled. Enabling this option is appropriate for environments where application code is deployed and stable.

Security improvements provided in production mode include warnings if the default certificates and keystores are used, and disabling automatic deployment and updating of applications in the autodeploy directory. Deployment will require valid login credentials. Log files are also more persistent.

In the administration console, this option can be enabled by selecting the Production Mode checkbox.

5. Set the maximum number of EJB session beans in the cache:

max-beans-in-cache

If session beans and the stateful EJB cache are used, this value should be set equal to the approximate number of concurrent users so that the cache is big enough to hold the working set of sessions in use. A value too low will result in some EJB session beans being passivated, or moved to secondary storage, resulting in slower access times when the client becomes active again. If the number of concurrent users far exceeds the allowed EJB cache, the constant churn between the cache and secondary storage will diminish the value of keeping these objects in cache. On the other hand, setting max-beans-in-cache to too high a value will result in excessive memory consumption.

The session EJB Cached Beans Current Count can be monitored to determine whether the set quantity is appropriate.
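In the weblogic-ejb-jar.xml deployment descriptor, this element is nested under the stateful session cache settings. A sketch, with an illustrative bean name and value:

```xml
<weblogic-ejb-jar xmlns="http://xmlns.oracle.com/weblogic/weblogic-ejb-jar">
  <weblogic-enterprise-bean>
    <!-- Illustrative bean name; use the name from your ejb-jar.xml. -->
    <ejb-name>OrderSessionBean</ejb-name>
    <stateful-session-descriptor>
      <stateful-session-cache>
        <!-- Illustrative value: roughly the number of concurrent users. -->
        <max-beans-in-cache>1000</max-beans-in-cache>
      </stateful-session-cache>
    </stateful-session-descriptor>
  </weblogic-enterprise-bean>
</weblogic-ejb-jar>
```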

6. Pin data source connections to threads:

Pinned-to-Thread

Enabling this option can eliminate the locking contention that is usually caused when two Oracle WebLogic execution threads simultaneously attempt to connect to the database. This option also reduces reservation time when a thread tries to secure a database connection.

When a new execution thread initiates a connection with the data source, the connection will stay pinned to that thread even after the connection is closed. Rather than the connection being made available to the data source again, the connection stays available for the next connection attempt by the execution thread.

7. Set the data source statement cache size:

Statement Cache Size

This setting can improve database response times. The statement cache size determines how many prepared and callable statements each database connection in the JDBC connection pool can store in the cache. It defaults to 10. Caching statements reduces overhead for communication between the application and database server, and reduces load on the database server, freeing CPU cycles.

Note that the DBMS might initiate a cursor for every open statement, and increasing the cache size might cause the cursors to exceed the memory space they have been allocated.

Prepared statements stored in cache are likely to fail when changes are made to the data definition language, or when a table's columns are changed.

8. Enable Logging Last Resource for the data source:

Logging Last Resource (LLR)

Transactional database applications might benefit from enabling the Logging Last Resource transaction optimization, which eliminates the use of XA drivers.

Global transactions that involve both a JDBC database operation and a message queuing task become two-phase commits, requiring XA drivers because they span multiple resources. These drivers tend to be slower than non-XA drivers for JDBC. LLR is a transactionally safe method of avoiding the use of XA drivers. The LLR option will be set to refer to an LLR table in the database.

Java Virtual Machine (JVM) Tunings


Some of these JVM settings will be enabled by default, though initial settings vary based on the infrastructure executing the JVM. Explicitly setting all JVM options might help clarify the exact state of the application environment, and it also maintains consistency between JDK versions, which might change defaults.

1. Run the JVM in server mode:

-server

Running Java applications with the server VM is preferable when performance is the primary goal; the client VM offers faster startup times and a smaller memory footprint, but is often not appropriate for enterprise applications.

On 64-bit-capable JDKs, the client VM is not supported, so use of the server VM is implicit.

2. Use the 64-bit JVM:

-d64

Run the server in a 64-bit environment. The Java HotSpot Server VM that is enabled with the -server option will implicitly use -d64.

3. Explicitly set the initial and maximum heap size:

-Xms<N>[g|m|k]
-Xmx<N>[g|m|k]


The Java heap is divided into a young generation and an old generation. These options specify their combined size.

Setting the heap size appropriately is one of the most important aspects of tuning an Oracle WebLogic environment. If the minimum (-Xms) and maximum (-Xmx) heap sizes are set to different values, the JVM will dynamically scale the heap size. As with setting the thread pool size for Oracle WebLogic Server, evaluating the needs of the application and setting these two heap size options to the same value will prevent dynamic scaling. Preventing dynamic scaling leads to more predictable performance and prevents resources from being devoted to evaluating and adjusting the heap size.

One of the best metrics for monitoring whether the heap size is set appropriately is the percentage of time spent on garbage collection. The heap should be large enough that garbage collection consumes a small fraction of processing time. While the appropriate value will vary from system to system, an appropriate starting target would be under 5 percent.

While tuning suggestions for earlier JDK versions often advised allowing the heap to dynamically scale, current best practice (when performance is the highest priority) is to set the two heap size values to be equal.
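For example (the 16 GB size is illustrative; choose a value based on garbage collection monitoring), setting equal initial and maximum sizes prevents dynamic resizing:

```shell
# Illustrative only: fix the heap at 16 GB so it is never resized.
JAVA_OPTIONS="${JAVA_OPTIONS} -Xms16g -Xmx16g"
```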

4. Explicitly size the young generation space:

-Xmn<N>[g|m|k]

This single option can be used to set the initial, minimum, and maximum size of the young generation portion of the heap. (Individual commands can be used to set the minimum and maximum size to different values, allowing for dynamic resizing.) As with sizing the Oracle WebLogic thread pool and the total heap size, disallowing dynamic resizing can result in improved performance if the choice is made wisely.

This value was not found to be particularly impactful on the workload tested on Oracle MiniCluster. However, certain applications might benefit from more rigorous tuning of this value. Garbage collections are run more frequently in the young generation heap, so a larger young generation space will reduce the frequency of these small garbage collections. However, an overly generous size will steal space from the old generation heap, and might result in more frequent full collections.

5. Explicitly size the survivor space and Eden:

-XX:InitialSurvivorRatio=<N>
-XX:TargetSurvivorRatio=<N>
-XX:SurvivorRatio=<N>
-XX:-UseAdaptiveSizePolicy


The young generation heap is divided into three components: an area for brand-new objects (called Eden) and two equally sized "survivor spaces" for objects that have lived through a few Eden garbage collections. These options allow manipulation of the size of these spaces. Note that a higher survivor ratio translates to smaller survivor spaces and a larger Eden. The adaptive size policy must be disabled if the survivor ratio is being specified.

Changing these values from their defaults was not found to be particularly impactful on the Oracle MiniCluster workload. Undersized survivor spaces can cause certain garbage collection steps to overflow into the old generation space, while bloated survivor spaces can waste valuable resources.
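A sketch of fixed young-generation and survivor sizing (all values are illustrative; adaptive sizing must be off for the ratio to be honored):

```shell
# Illustrative only: fix the young generation at 4 GB, set the survivor
# ratio explicitly, and disable adaptive sizing so the ratio is honored.
JAVA_OPTIONS="${JAVA_OPTIONS} -Xmn4g -XX:SurvivorRatio=6 -XX:-UseAdaptiveSizePolicy"
```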

6. Enable compressed pointers:

-XX:+UseCompressedOops

When using a 64-bit JVM with heap sizes less than 32 GB, CompressedOops is enabled by default and should remain enabled to compress object pointers. Rather than taking 64 bits to express 32-bit (or smaller) addresses, the pointers will be represented as 32-bit offsets, thereby helping to make more efficient use of the CPU cache.

7. Enable the use of large memory pages:

-XX:+UseLargePages
-XX:LargePageSizeInBytes=<N>[g|m|k]


Using large pages is enabled by default in JDK 8 on Oracle Solaris, and large pages should be used in most cases when the operating system supports large pages. Additionally, the large page size should be set to the largest page size supported by the operating system.

These arguments help reduce Translation-Lookaside Buffer (TLB) misses. The TLB caches virtual-to-physical memory address translations and significantly speeds up memory accesses. Using large pages, and setting the page size to its maximum allowable value, increases the amount of physical memory that can be represented in the TLB. These settings should improve the performance of memory-intensive applications.

In some cases, using large pages might reduce performance. If the application is consuming so much memory that other processes are starved, overall system performance might be degraded. Additionally, the JVM might automatically disable large pages on systems that are experiencing fragmentation, such as those that have been running for a very long time.
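On Oracle Solaris, the supported page sizes can be listed with the pagesize(1) command. A sketch (the 2 GB page size is illustrative and depends on what the hardware reports):

```shell
# List the page sizes supported by the platform (Oracle Solaris).
pagesize -a

# Illustrative only: request large pages with an explicit page size,
# chosen from the sizes reported above.
JAVA_OPTIONS="${JAVA_OPTIONS} -XX:+UseLargePages -XX:LargePageSizeInBytes=2g"
```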

8. Pre-touch memory pages:

-XX:+AlwaysPreTouch

This parameter is important to enable when a large heap size and/or large memory pages are used. While it might increase the startup time of the JVM, loading the pages before general application execution will result in better runtime performance.

9. Make Java NUMA-aware on multi-socket systems:

-XX:+UseNUMA

This Java option can improve performance on systems that utilize non-uniform memory access (NUMA), where memory access time is dependent on the distance between the microprocessor and the memory location. Multi-socket SPARC systems, such as the two-socket SPARC S7-2 servers in the Oracle MiniCluster S7-2 engineered system, are NUMA architectures. Each processor can access its local memory or memory managed by another processor.

When made NUMA-aware, the JVM attempts to ensure that lower-latency local memory is used when possible. When a thread allocates a new object, the JVM assumes that the same thread will also be accessing the object later. Therefore, the JVM actually tracks multiple Eden spaces (where new objects are created and stored) across the various processors. The JVM will prioritize allocating the new object in an Eden space that is local to the requesting thread. The garbage collector also avoids significant transfer of objects between local memory and memory managed by another processor.

Using NUMA-aware settings requires that the throughput (parallel) garbage collector be used, at least for young collections.
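A sketch combining the NUMA flag with the parallel collector it depends on:

```shell
# Illustrative only: NUMA-aware allocation; the parallel (throughput)
# collector must be in use for -XX:+UseNUMA to take effect.
JAVA_OPTIONS="${JAVA_OPTIONS} -XX:+UseParallelOldGC -XX:+UseNUMA"
```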

10. Enable detailed garbage collection output:

-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps


An important component of any performance tuning exercise is evaluating and monitoring the performance impacts of the various tuning changes. The garbage collector should be configured to provide information that is necessary to evaluate whether a tuning was beneficial.

These Java options will allow a better understanding of the frequency and duration of garbage collections, including both young garbage collections that run in the young generation space and full garbage collections.

Information provided will include why the garbage collection was triggered, whether a young or full collection ran, the memory utilization in the relevant portion of the heap before and after the collection, the total heap size (relevant if the application is dynamically resizing the heap), and the time spent on garbage collection. Young garbage collections will also include information about the size of objects that were promoted to the old generation heap.

Monitoring garbage collection performance is far more important than any negligible performance savings that could be attained by disabling the output.
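A sketch of a start line combining the logging flags and redirecting the output to a file for later analysis (the log path is illustrative):

```shell
# Illustrative only: enable detailed GC logging and write it to a file
# so collection frequency and duration can be analyzed offline.
JAVA_OPTIONS="${JAVA_OPTIONS} -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/tmp/wls_gc.log"
```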

11. Disable the use of biased locking:

-XX:-UseBiasedLocking

This option disables the use of biased locking; by default, biased locking is enabled in JDK 8. However, disabling this locking strategy often leads to performance improvements when thread pools are used.

Biased locking aims to reduce the time required for a thread to acquire a lock on an object by making the assumption that a thread that accesses an object once is likely to do so again. When the thread relinquishes its lock on an object, the object will stay assigned to that thread unless it is requested by another thread. If the base assumption holds—that objects tend to be accessed by one thread more than others—this locking strategy can improve performance by eliminating lock overhead.

However, on highly multithreaded systems utilizing thread pools, that assumption might not be appropriate. Many systems that use thread pools will have objects that are accessed by many threads over the course of their lifetimes. In these cases, biased locking actually introduces overhead because new threads must wait for the lock to be released before they can acquire the lock.

When code is written to acquire locks that are rarely contended, uncontended synchronization might be common. In such cases, where threads rarely compete for the same objects, leaving biased locking enabled can improve performance. In the Oracle MiniCluster workload, where multiple threads were vying for locks on the same objects, disabling biased locking improved the performance of the system.

12. Enable point performance improvements:

-XX:+AggressiveOpts

This option enables a number of point performance compiler optimizations. These changes are minor alterations to the JVM and tend to be experimental improvements that are often incorporated into later releases of the JDK. The effects of this option should be measured any time the JDK is updated, because the improvements can vary from build to build and might not improve the performance of all applications.

13. Use the throughput garbage collector:

-XX:+UseParallelOldGC
-XX:ParallelGCThreads=<N>


These options enable the throughput garbage collector for both young and full garbage collections and specify the number of garbage collector threads. Using multiple threads for garbage collection should improve performance, and the throughput garbage collector is enabled by default in JDK 8.

Garbage collection is executed in dedicated threads that compete with the application threads for CPU resources. On systems with many virtual CPUs, the number of garbage collector threads might be too high by default. Reducing this value is especially common when multiple JVMs are running, such as when multiple Oracle WebLogic Server instances are running on the same server.

To determine an appropriate value for the number of threads, the boolean UseDynamicNumberOfGCThreads can be temporarily set to true to allow the thread count to be sized dynamically. After seeing where the thread count stabilizes, disable the dynamic sizing and set the number of threads manually to avoid the performance overhead of dynamic sizing.
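The two-step workflow described above might look like the following (the variable names and the final thread count of 16 are illustrative; use the value observed to stabilize during the dynamic run):

```shell
# Step 1 (illustrative): let the JVM size the GC thread count dynamically,
# then observe in the GC logs where the count stabilizes.
TUNING_RUN_OPTS="-XX:+UseParallelOldGC -XX:+UseDynamicNumberOfGCThreads"

# Step 2 (illustrative): for production runs, fix the count to the
# observed stable value (16 here is a placeholder).
PRODUCTION_OPTS="-XX:+UseParallelOldGC -XX:ParallelGCThreads=16"
```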

The garbage collectors included with each JDK version might vary, and the choice of garbage collector should be evaluated as new JDK versions are released and garbage collection options are updated.

14. Increase code inlining:

-XX:InlineSmallCode=<N>
-XX:MaxInlineSize=<N>


These values should be adjusted only with careful consideration of the consequences. They did not have substantial performance impacts on the Oracle MiniCluster workload.

Inlining code removes the preparation and cleanup needed for method calls by injecting the code into the main body at compile time. Inlining small methods (that are not called very frequently) can improve application performance without leading to significant increases in program size. Inlining code also allows for the compiler to make additional optimizations over a larger main body of code.

Increasing these two settings allows more functions to be inlined.

15. Adjust compiler settings:

-XX:InitialCodeCacheSize=<N>
-XX:ReservedCodeCacheSize=<N>
-XX:CICompilerCount=<N>


These values should be adjusted only with careful consideration of the consequences. They did not have substantial performance impacts on the Oracle MiniCluster workload.

If the JVM warns, "CodeCache is full. Compiler has been disabled," the code cache settings might need to be increased. These default values are usually adequate, but adjustments might be needed if compiler failures are observed. Decreasing the code cache will result in lower memory usage by the JVM, but can risk compiler failures.

On some large systems, the default compiler count, which scales with the number of cores, might need to be adjusted downward.

Conclusion


While the "right" set of tunings will depend highly on the system, the application, environment requirements, and the time that can be devoted to tuning exercises, many of the tunings listed in this article will serve as reliable starting points for improving the performance of Oracle WebLogic Server on Oracle MiniCluster and on other SPARC systems. Careful consideration and testing of tuning options to tailor the system to the application will leave more resources available for other workloads, improving the already strong Java performance of SPARC systems.

About the Authors


Jesse Gordon is a principal software engineer and an experienced performance analyst. He has improved the performance of projects and products ranging from low-level protocols to complicated multitier and cloud-based middleware configurations.

Yasmin Hazrat is a software engineer in the Cloud Platforms, Applications, and Developers (CPAD) Group at Oracle, where he focuses on performance optimizations and scalability of Oracle Java Cloud technologies.

Masud Khandker has been a principal software engineer at Oracle since 2011. He is a member of a performance engineering team under the Systems division and is currently leading a team to optimize Oracle Java Cloud on Oracle's engineered systems.

Michelle Szucs is a product manager on the Oracle Optimized Solutions team in the Systems division of Oracle. She is the solution manager for Oracle Optimized Solution for Secure Oracle WebLogic Server.
