Is 11.2.0.2 ready for production? Judge yourself …
By Martin | April 18th, 2011 | Category: 11gR2, Bugs, Data Guard, MetaLink, Oracle Database | 5 commentsMurphy said: if problems arise, they all come at the same time. After several weeks of stability, this week brought hell of a lot of Oracle problems with 11.2.0.2 and RAC. To list the worst:
- Downtime because of ASM process limit reached. Normally, the formula to calculate depends on the number of concurrent datafile extensions. In our case 100 ASM processes were not enough. Increased to 200 and found Note 1287496.1 which describes the issue. Merge Patch for bug is available for 11.2.0.2 GI Bundle 1 but not yet for GI Bundle 2. Ups.
- CPU Starvation because of Adaptive Cursor Sharing: One server process was taking up 100% of cpu time for the last several hours. It turned out to be an OEM monitoring query having several thousands of child cursors.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 8667 ora11 25 0 8423m 33m 28m R 99.7 0.1 358:27.78 ora_pz98_MDDB
I then checked different databases and found that almost all of our 11.2.0.2 databases have several hundreds or even thousands of child cursors.
SELECT sql_id, is_shareable, MIN(child_number), MAX(child_number), COUNT(*) FROM gv$sql GROUP BY sql_id, is_shareable HAVING COUNT(*) >100 ORDER BY 6 DESC SQL_ID IS_SHAREABLE MIN(CHILD_NUMBER) MAX(CHILD_NUMBER) COUNT(*) ------------- ------------ ----------------- ------------------ --------- c7kt3njhnmtkm Y 0 5097 1397 c7kt3njhnmtkm N 3 3544 1836 1vnhgmpc17vv0 Y 0 3022 2697 1vnhgmpc17vv0 N 6 2185 444 93qh89pxuxzuw Y 0 1949 1522 93qh89pxuxzuw N 2 1625 428 5fk0v8km2f811 Y 0 1281 1763 4f3ufvfcgfqsg Y 0 792 794 cjbwk0ww7j5rv Y 0 627 1251 dyqdzgxcv4d3t Y 0 626 1252 5fk0v8km2f811 N 3 543 260 f0jxh8d6b5af2 Y 0 494 564 f0jxh8d6b5af2 N 0 290 130 dbvkky621gqtr Y 0 266 267 32rqmpqpfv0gm Y 0 255 257 g9uwxs7pr8tjm Y 0 254 257 40k6jjt90n4fa Y 3 199 129
I suspect this to be Bug 10182051 Extended cursor sharing generates many shareable child cursors and there is a workaround:
alter system set “_optimizer_extended_cursor_sharing_rel”=none;
- Limited database availability because of failing queries on gv$ tables:
SQL> select count(*) from gv$session 2 ; select count(*) from gv$session * ERROR at line 1: ORA-12850: Could not allocate slaves on all specified instances: 2 needed, 1 allocated ORA-12801: error signaled in parallel query server P001, instance 3599
Currently no other known workaround than bouncing all the RAC instances.
- DataGuard ASYNC Redo Transport not reliable: We have a RAC primary / single instance physical standby setup and use async redo transport. During times of heavy ETL on the primary, the standby databases stops at recovery of one archivelog with “(in transit)”. Primary is showing this error in alert log:
ARC7: Creating remote archive destination LOG_ARCHIVE_DEST_2: 'HAMDB' (thread 2 SEQUENCE 4044) (MUCDB2) ARC7: Archive log rejected (thread 2 SEQUENCE 4044) at host 'HAMDB' FAL[server, ARC7]: FAL archive failed, see trace file. ARCH: FAL archive failed. Archiver continuing ORACLE Instance MUCDB2 - Archival Error. Archiver continuing.
On standby side, it says:
Media Recovery Waiting for thread 2 sequence 4044 (in transit)The standby database never recovers from this problem, except when standby database is bounced. The problem appears with and without broker configuration. Currently there is no known workaround.
Maybe some of this issues will be addressed in upcoming PSU April, which will be released this week.
Martin,
thanks for this very helpful sum up,
For the ASM processes, did you have this on multiple DB RAC or single DB RAC?
We have 4 DB on our cluster so I wonder if we expect yo hit the issue
thanks in advance
Coskan,
thank you for visiting….
Indeed, we have 3 RAC databases on this 2 node cluster. At this point, little is known about the problem. More often than not, we identify “suspicious” MOS Bugs ourselves, but it is only Oracle Support who has enough information to confirm or deny. So far, we have neither one.
Regards,
Martin
Hi Martin, Are you still have issue with “DataGuard ASYNC Redo Transport”? I’m having same problem, but our environment is not RAC. Just one primary and one physical (active) standby and we are in 11.2.0.3 on Solaris 11. Thanks.
Hello Christina,
it turned out that part of the problem was that there were millions of small audit trace files in the filesystem on primary and standby and this caused the filesystem to respond poorly.
After cleaning up the audit trace files regularly with a housekeeping job, the situation improved.
Thanks!