Mutex: What do we know ?
Source: Mutex: What do we know ?
Mutex: What do we know ?
Source: Mutex: What do we know ?
2016-11-29 EDIT: Oracle has announced that in 12.2 the features that drove this overuse of the result cache (dynamic stats triggered by SPDs) will default to off, and provided patches to apply to 1…
We have come across an issue with Oracle RDBMS 220.127.116.11 Standard Edition (SE) that impacts the performance all requests to external network resource from the database, e.g. calls to webservices. The issue appears to be due to a dependency in the Oracle code on the Result Cache (a feature that is unavailable in SE).
There is a simple test case that highlights the problem i.e.
set timing on declare http_request utl_http.req; http_response utl_http.resp; text varchar2(32767); i number; begin i := 0; while i < 1000 loop http_request := utl_http.begin_request('http://localhost:8087/index.html'); http_response := utl_http.get_response(http_request); utl_http.read_text(http_response, text, 32766); utl_http.end_response(http_response); i := i + 1; end loop; end; /
You will need to use a URL that is accessible from your database, and create an appropriate ACL to give you access to it.
If you run that code in a 18.104.22.168 SE database you will see an elapsed time of several minutes e.g.
SQL> @test-case PL/SQL procedure successfully completed. Elapsed: 00:03:26.81
On further investigation it should be clear that it is this internal Oracle routine that’s at fault
SQL HASH_VALUE EXECUTIONS Seconds Sec/Exec -------------------------------------------------------------------------------- ---------- ---------- ------------ --------------- begin :1 := dbms_network_acl_admin.get_host_aclids(:2,:3); end; 2185038870 2000 201.29 0.10064
The same test case executed in either 22.214.171.124 Enterprise Edition or 126.96.36.199 SE2 will run much faster e.g.
SQL> @test-case PL/SQL procedure successfully completed. Elapsed: 00:00:03.23
The CPU time spent on the internal Oracle routine will also be much lower i.e.
SQL HASH_VALUE EXECUTIONS Seconds Sec/Exec -------------------------------------------------------------------------------- ---------- ---------- ------------ --------------- begin :1 := dbms_network_acl_admin.get_host_aclids(:2,:3); end; 2185038870 2000 0.71 0.00036
We believe this is because the code behind dbms_network_acl_admin.get_host_aclids in 188.8.131.52 SE is dependent on the Result Cache.
The Result Cache is a feature Oracle have disabled in SE and hence the performance of the internal Oracle routine is impacted.
An SR has been logged with Oracle who have agreed it’s due to the Result Cache but have yet to offer a solution.
Providing the Oracle optimizer with accurate statistics on the Apex schema tables can be challenging.
Many of tables tracking Apex session state information see a huge turnover in data such that statistics gathered overnight can quickly become stale resulting in poor choice of execution plan by the optimizer and seriously degraded performance.
There are several ways to tackle the issue e.g. locking the stats on some of the tables to values that are known to perform reliably.
However I believe it’s better to give the optimizer the most accurate information you can, so it can make more informed decisions based on the actual data in the tables.
The problem is that even if you refresh the optimizer statistics often, you need to ensure that an execution plans that are stored in the shared_pool are also cleared or you risk still using a plan that does not reflect those new statistics.
Fortunately you can achieve all this with one relatively straight forward job i.e.
conn / as sysdba variable v_jobno number; begin dbms_job.submit( :v_jobno, 'dbms_stats.gather_schema_stats(ownname=>''APEX_040200'',options=>''GATHER STALE'',cascade=>true,no_invalidate=>false);', sysdate+(5/(24*60)), 'sysdate+(5/(24*60))', TRUE); commit; end; /
This will create a job that runs every five minutes and refresh statistics on any Apex tables that are marked as stale, but also thanks to the no_invalidate option, it will clear any plans from the shared_pool that would be affected by the new statistics gathered.
You should make sure the schema reflects your Apex version, but with this job in place you should have no worries about stale statistics on the Apex session state tables impacting your application performance.
Adding a job to take hourly statspack snapshots has always been simple e.g.
Purging snapshots automatically used to be a little more difficult, fortunately with more recent versions of statspack you can now do e.g.
variable v_jobno number; begin dbms_job.submit(:v_jobno, 'statspack.snap(i_snap_level=>7);', trunc(sysdate+1/24,'HH'), 'trunc(sysdate+1/24,''HH'')', TRUE); commit; end; /
This creates a dbms_job that runs every day at midnight purging any snapshots older than 14 days.
variable v_jobno number; begin dbms_job.submit(:v_jobno, 'statspack.purge(i_num_days=>;14,i_extended_purge=>;TRUE);', trunc(sysdate+1), 'trunc(sysdate+1)', TRUE); commit; end; /
The Oracle 11gAS Reports Server is a fragile thing, but recently one particular error was cropping up regularly with (usually) large reports.
The error message reported was REP-50125 with the usual meaningless gibberish i.e.
[2013-12-11T02:51:33.442+00:00] [reports] [INCIDENT_ERROR] [REP-50125] [oracle.reports.server] [tid: 16] REP-50125 : org.omg.CORBA.COMM_FAILURE: vmcid: SUN minor code: 208
Which has lots of unhelpful hits on the Oracle Support Site – suggesting it’s just a common generic error message.
As it was large reports that were mostly the problem a resource issue was the likely culprit, hence the obvious first guess is to increase the maximum Java heap size by setting the JVM Options for the reports server to e.g.
This had no effect unfortunately so we turned to the maximum Java stack size – which on 64bit Java 6 defaults to 1024K.
Increasing this by setting JVM Options to e.g.
did the trick – the reports now completed successfully.
We’re using Oracle Weblogic 10.3.5.0 and Oracle Forms & Reports 11g R2 184.108.40.206 and have noticed that the cpu usage of the opmn process increases the more concurrent Oracle Forms users you have, to the point where it’s constantly hogging a single cpu with more than a few hundred users connected.
Running a truss on the opmn process indicates that it’s interrogating the /proc filesystem for every Forms Runtime process e.g.
4934/16: open("/proc/23356/xmap", O_RDONLY) = 14 4934/16: fcntl(14, F_SETFD, 0x00000001) = 0 4934/16: fstat(14, 0xFFFFFFFF77BFAE58) = 0 4934/16: pread(14, "\0\0\001\0\0\0\0\0\0\0\0".., 28944, 0) = 28944 4934/16: pread(14, "\0\0\001\0\0\0\0\0\0\0\0".., 43416, 0) = 43416 4934/16: pread(14, "\0\0\001\0\0\0\0\0\0\0\0".., 57888, 0) = 57888 4934/16: pread(14, "\0\0\001\0\0\0\0\0\0\0\0".., 72360, 0) = 72360 4934/16: pread(14, "\0\0\001\0\0\0\0\0\0\0\0".., 86832, 0) = 85968 4934/16: close(14) = 0 4934/16: open("/proc/23356/psinfo", O_RDONLY) = 14 4934/16: fcntl(14, F_SETFD, 0x00000001) = 0 4934/16: read(14, "02\0\0\0\0\0\001\0\0 [ <".., O_RDONLY) = 14 4934/16: fcntl(14, F_SETFD, 0x00000001) = 0 4934/16: read(14, "\0\0\0\0\0\0\001\0\0\0\0".., 504) = 504 4934/16: close(14) = 0 4934/16: open("/proc/23356/status", O_RDONLY) = 14 4934/16: fcntl(14, F_SETFD, 0x00000001) = 0 4934/16: read(14, "\t\0 @ 0\0\0\001\0\0 [ <".., 1776) = 1776 4934/16: close(14) = 0
Similarly running an
includes every single Forms Runtime process in its status list, and can be very slow to return.
That doesn’t sound like it’s going to be particularly scalable and was certainly causing us problems – but there’s very little in the way of documentation about it.
What we found was that removing the section
<ias-component id="forms" type="FormsComponent"> <process-type id="FormsRuntime" status="phantom"> </process-type> </ias-component>
from the opmn.xml file in
and then forcing a reload with
stopped the opmn process from monitoring the /proc fileystem for Forms Runtime processes, and the status command to just list the other components.
The CPU usage of opmn the process has now dropped dramatically, and the 11gAS web console is much more responsive than it was.
Not sure what other impact this change might have, although at this stage everything seems to be working as normal, so if you decide to use the change yourself make sure you test it out thoroughly first.
Suspect this is not exclusive to T4, servers with lots of RAM may be susceptible.
Customer moved to a new T4 based server and started having major problems once more than a few dozen users were connected to the system.
Database was largely ok, some excessive redo which we addressed, but after that statspack reports showing top wait event was 95%+ CPU Time (normally a good sign).
Unfortunately performance of the whole server was sluggish and the application bordering on the unusable, and whilst the server had 96GB RAM and 96 cores it was struggling badly when still 60% idle.
Only clue was that the CPU usage reported by sar and mpstat was SYS rather than user – so something was going wrong at the OS level.
mpstat suggested that something was spinning waiting on a mutex (high counts on smtx were correlated with sys cpu usage).
Using DTrace to understand mpstat output [ID 1278725.1]
lockstat indicated that the caller functions where it was spending most time were page_trylock_contig_pages & page_trylock.
A Primer On Lockstat [ID 1005868.1]
That led us to
E-PERF: Poor Performance on Solaris Server Possibly Associated with PSRUNRMT [ID 663487.1]
it appears that page_coalescing, caused by the Sol10 LPOOB (Large Page Out Of Box) feature, may be contributing to the issue
So let’s get that turned off for a start – it’s a kernel setting so needs to go in /etc/system but you can also change on the fly with
echo 'pg_contig_disable/W 1' | mdb -kw
Things improved for half an hour or so and then started to deteriorate in the same way again, but this time lockstat was telling us the main culprit was page_freelist_lock.
Which took us to
page_freelist_coalesce() could cause system performance issues [ID 1486130.1]
which suggested that we should also disable
echo 'mpss_coalesce_disable/W 1' | mdb -kw
with those two changes made on the fly (and recorded in /etc/system for the next reboot)
* Oracle memory issues set mpss_coalesce_disable=1 set pg_contig_disable=1
the run queue has dropped from over 30 to around 3, server is 95% idle, sys cpu usage has all but disappeared and performance is now as good with 400 users as it was with 40.
In my experience the early Ultra SPARC T Series chips are not great when it comes to supporting Apex applications, the single thread performance is poor and reflects in significantly increased average page times when compared to more conventional CPUs.
However recently encountered and unusual problem after a site upgraded a database running on a T5240 from Oracle RDBMS v10.2.0.5 to v220.127.116.11.
Average page times had gone from 1s to around 4s since the upgrade, and users were complaining loudly about performance.
Statspack was telling us that there was an usually high number of cursor: pin S wait on X waits – but why ?
Top 5 Timed Events Avg %Total wait Call Event Waits Time (s) (ms) Time --------------------------- ------------ ----------- ------ ------ CPU time 24,208 85.6 cursor: pin S wait on X 24,177 1,361 56 4.8 db file sequential read 659,410 943 1 3.3 External Procedure call 206 861 4177 3.0 log file parallel write 93,792 251 3 .9
Well apparently its all down to mutexes – and once again it was Tanel Poder that set me on the right path
The above link describes a problem he had encountered at 10g – the bit that set me thinking was
When a latch getter can’t get the latch after spinning, it will go to sleep to release the CPU. Even if there are many latch getters in the CPU runqueue before the latch holder, they all spin quickly and end up sleeping again. But when a mutex getter doesn’t get the mutex after spinning, it will not go to sleep!!! It will yield() the CPU instead, which means that it will go to the end of runqueue and try to get back onto CPU as soon as possible. So, mutex getters in 10.2 are much less graceful, they can burn a lot of CPU when the mutex they want is held by someone else for long time.
On a T series there’s nearly always a CPU thread available (particularly if like this customer you don’t have that many users accessing the system) so even if a mutex getter yields then it will (nearly always) get a CPU thread back straightaway – so will pretty much be constantly on CPU trying to get hold of the mutex (which explained the high cpu usage and large number of waits – and presumably the performance problems experienced by the customer).
Tanel suggested this was fixed at 11g (mutex getters should sleep instead of just yielding), but it turns out that’s not quite the whole story – the default behaviour is still to always yield, but there are some hidden parameters that provide some control
Bug 10411618 – Enhancement to add different “Mutex” wait schemes
In order to encourage the mutex getters to sleep a bit instead of just yielding all the time I got the customer to add the following parameters
alter system set "_mutex_wait_scheme" = 1 scope = spfile; alter system set "_mutex_wait_time" = 8 scope = spfile;
which means they should yield the first time but will sleep for 8ms after each subsequent attempt to acquire the mutex before trying again.
The customer confirmed that since applying the parameters performance has improved to around the same level as prior to the upgrade to 11g.
Since you may not always have access to the Apex administration pages to monitor the activity and performance of your Apex application, here are some scripts that you can use to extract the headline figures directly from the log tables.
apex-by-day.sql – the Page Views by Day report
SQL> @apex-by-day Date Avg Time Max Time Page Views Users ----------- -------- -------- ---------- ----- 04-APR-2011 0.2315 30.2 21582 111 03-APR-2011 0.3673 11.4 9080 11 02-APR-2011 0.4219 6.4 549 2 01-APR-2011 0.8711 4.1 45 2
apex-by-hour.sql – the Hourly Usage report
SQL> @apex-by-hour Enter value for which_date: sysdate Hour Avg Time Max Time Page Views Users -------------------- -------- -------- ---------- ----- 04-APR-2011 14:00:00 0.1983 30.2 3074 46 04-APR-2011 13:00:00 0.3098 30.2 2237 48 04-APR-2011 12:00:00 0.2637 30.2 3580 60 04-APR-2011 11:00:00 0.2398 30.2 5611 71 04-APR-2011 10:00:00 0.2148 30.1 2942 54 04-APR-2011 09:00:00 0.1968 30.1 2683 62
apex-by-weight.sql – the Weighted Page Performance report
These are the first reports that I check when looking at the current performance levels of an Apex application.