Is 11.2.0.2 ready for production? Judge yourself …

By Martin | April 18th, 2011 | Category: 11gR2, Bugs, Data Guard, MetaLink, Oracle Database | 5 comments

Murphy said: if problems arise, they all come at the same time. After several weeks of stability, this week brought hell of a lot of Oracle problems with 11.2.0.2 and RAC. To list the worst:

Downtime because of ASM process limit reached. Normally, the formula to calculate depends on the number of concurrent datafile extensions. In our case 100 ASM processes were not enough. Increased to 200 and found Note 1287496.1 which describes the issue. Merge Patch for bug is available for 11.2.0.2 GI Bundle 1 but not yet for GI Bundle 2. Ups.

CPU Starvation because of Adaptive Cursor Sharing: One server process was taking up 100% of cpu time for the last several hours. It turned out to be an OEM monitoring query having several thousands of child cursors.

PID 	USER 	PR NI VIRT 	RES SHR S %CPU %MEM TIME+ COMMAND
8667 	ora11 25 0 	8423m 33m 28m R 99.7 	0.1 358:27.78 ora_pz98_MDDB

I then checked different databases and found that almost all of our 11.2.0.2 databases have several hundreds or even thousands of child cursors.

SELECT sql_id, is_shareable, MIN(child_number), MAX(child_number), COUNT(*)
FROM gv$sql
GROUP BY sql_id, is_shareable HAVING COUNT(*) &gt;100 ORDER BY 6 DESC
 
SQL_ID        IS_SHAREABLE MIN(CHILD_NUMBER) MAX(CHILD_NUMBER)  COUNT(*)
------------- ------------ ----------------- ------------------ ---------
c7kt3njhnmtkm Y            0                 5097               1397
c7kt3njhnmtkm N            3                 3544               1836
1vnhgmpc17vv0 Y            0                 3022               2697
1vnhgmpc17vv0 N            6                 2185               444
93qh89pxuxzuw Y            0                 1949               1522
93qh89pxuxzuw N            2                 1625               428
5fk0v8km2f811 Y            0                 1281               1763
4f3ufvfcgfqsg Y            0                 792                794
cjbwk0ww7j5rv Y            0                 627                1251
dyqdzgxcv4d3t Y            0                 626                1252
5fk0v8km2f811 N            3                 543                260
f0jxh8d6b5af2 Y            0                 494                564
f0jxh8d6b5af2 N            0                 290                130
dbvkky621gqtr Y            0                 266                267
32rqmpqpfv0gm Y            0                 255                257
g9uwxs7pr8tjm Y            0                 254                257
40k6jjt90n4fa Y            3                 199                129

I suspect this to be Bug 10182051 Extended cursor sharing generates many shareable child cursors and there is a workaround:

alter system set “_optimizer_extended_cursor_sharing_rel”=none;

Limited database availability because of failing queries on gv$ tables:

SQL> select count(*) from gv$session
2 ;
select count(*) from  gv$session
*
ERROR at line 1:
ORA-12850: Could not allocate slaves on  all specified instances: 2 needed, 1
allocated
ORA-12801: error signaled  in parallel query server P001, instance 3599

Currently no other known workaround than bouncing all the RAC instances.

DataGuard ASYNC Redo Transport not reliable: We have a RAC primary / single instance physical standby setup and use async redo transport. During times of heavy ETL on the primary, the standby databases stops at recovery of one archivelog with “(in transit)”. Primary is showing this error in alert log:

ARC7: Creating remote archive destination LOG_ARCHIVE_DEST_2: 'HAMDB' (thread 2 SEQUENCE 4044) (MUCDB2)
ARC7: Archive log rejected (thread 2 SEQUENCE 4044) at host 'HAMDB'
FAL[server, ARC7]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance MUCDB2 - Archival Error. Archiver continuing.

On standby side, it says:
Media Recovery Waiting for thread 2 sequence 4044 (in transit)

The standby database never recovers from this problem, except when standby database is bounced. The problem appears with and without broker configuration. Currently there is no known workaround.

Maybe some of this issues will be addressed in upcoming PSU April, which will be released this week.

5 comments
Leave a comment »

coskan gundogar April 18th, 2011 11:49 :
Martin,

thanks for this very helpful sum up,

For the ASM processes, did you have this on multiple DB RAC or single DB RAC?

We have 4 DB on our cluster so I wonder if we expect yo hit the issue

thanks in advance
admin April 18th, 2011 14:30 :
Coskan,

thank you for visiting….

Indeed, we have 3 RAC databases on this 2 node cluster. At this point, little is known about the problem. More often than not, we identify “suspicious” MOS Bugs ourselves, but it is only Oracle Support who has enough information to confirm or deny. So far, we have neither one.

Regards,
Martin
Christina March 20th, 2013 21:19 :
Hi Martin, Are you still have issue with “DataGuard ASYNC Redo Transport”? I’m having same problem, but our environment is not RAC. Just one primary and one physical (active) standby and we are in 11.2.0.3 on Solaris 11. Thanks.
admin March 20th, 2013 22:52 :
Hello Christina,

it turned out that part of the problem was that there were millions of small audit trace files in the filesystem on primary and standby and this caused the filesystem to respond poorly.

After cleaning up the audit trace files regularly with a housekeeping job, the situation improved.
Christina March 21st, 2013 18:30 :
Thanks!