Is ready for production? Judge yourself …

Murphy said: if problems arise, they all come at the same time. After several weeks of stability, this week brought hell of a lot of Oracle problems with and RAC. To list the worst:

  • Downtime because of ASM process limit reached. Normally, the formula to calculate depends on the number of concurrent datafile extensions. In our case 100 ASM processes were not enough. Increased to 200 and found  Note 1287496.1 which describes the issue. Merge Patch for bug is available for GI Bundle 1 but not yet for GI Bundle 2. Ups.
  • CPU Starvation because of Adaptive Cursor Sharing: One server process was taking up 100% of cpu time for the last several hours. It turned out to be an OEM monitoring query having several thousands of child cursors.
    8667 	ora11 25 0 	8423m 33m 28m R 99.7 	0.1 358:27.78 ora_pz98_MDDB

    I then checked different databases and found that almost all of our databases have several hundreds or even thousands of child cursors.

    SELECT sql_id, is_shareable, MIN(child_number), MAX(child_number), COUNT(*)
    FROM gv$sql
    GROUP BY sql_id, is_shareable HAVING COUNT(*) >100 ORDER BY 6 DESC
    ------------- ------------ ----------------- ------------------ ---------
    c7kt3njhnmtkm Y            0                 5097               1397
    c7kt3njhnmtkm N            3                 3544               1836
    1vnhgmpc17vv0 Y            0                 3022               2697
    1vnhgmpc17vv0 N            6                 2185               444
    93qh89pxuxzuw Y            0                 1949               1522
    93qh89pxuxzuw N            2                 1625               428
    5fk0v8km2f811 Y            0                 1281               1763
    4f3ufvfcgfqsg Y            0                 792                794
    cjbwk0ww7j5rv Y            0                 627                1251
    dyqdzgxcv4d3t Y            0                 626                1252
    5fk0v8km2f811 N            3                 543                260
    f0jxh8d6b5af2 Y            0                 494                564
    f0jxh8d6b5af2 N            0                 290                130
    dbvkky621gqtr Y            0                 266                267
    32rqmpqpfv0gm Y            0                 255                257
    g9uwxs7pr8tjm Y            0                 254                257
    40k6jjt90n4fa Y            3                 199                129

    I suspect this to be Bug 10182051 Extended cursor sharing generates many shareable child cursors and there is a workaround:

    alter system set “_optimizer_extended_cursor_sharing_rel”=none;

  • Limited database availability because of failing queries on gv$ tables:
    SQL> select count(*) from gv$session
    2 ;
    select count(*) from  gv$session
    ERROR at line 1:
    ORA-12850: Could not allocate slaves on  all specified instances: 2 needed, 1
    ORA-12801: error signaled  in parallel query server P001, instance 3599

    Currently no other known workaround than bouncing all the RAC instances.

  • DataGuard ASYNC Redo Transport not reliable: We have a RAC primary / single instance physical standby setup and use async redo transport. During times of heavy ETL on the primary, the standby databases stops at recovery of one archivelog with “(in transit)”. Primary is showing this error in alert log:
    ARC7: Creating remote archive destination LOG_ARCHIVE_DEST_2: 'HAMDB' (thread 2 SEQUENCE 4044) (MUCDB2)
    ARC7: Archive log rejected (thread 2 SEQUENCE 4044) at host 'HAMDB'
    FAL[server, ARC7]: FAL archive failed, see trace file.
    ARCH: FAL archive failed. Archiver continuing
    ORACLE Instance MUCDB2 - Archival Error. Archiver continuing.

    On standby side, it says:
    Media Recovery Waiting for thread 2 sequence 4044 (in transit)

    The standby database never recovers from this problem, except when standby database is bounced. The problem appears with and without broker configuration. Currently there is no known workaround.

Maybe some of this issues will be addressed in upcoming PSU April, which will be released this week.

Leave a comment »

  1. Martin,

    thanks for this very helpful sum up,

    For the ASM processes, did you have this on multiple DB RAC or single DB RAC?

    We have 4 DB on our cluster so I wonder if we expect yo hit the issue

    thanks in advance

  2. Coskan,

    thank you for visiting….

    Indeed, we have 3 RAC databases on this 2 node cluster. At this point, little is known about the problem. More often than not, we identify “suspicious” MOS Bugs ourselves, but it is only Oracle Support who has enough information to confirm or deny. So far, we have neither one.


  3. Hi Martin, Are you still have issue with “DataGuard ASYNC Redo Transport”? I’m having same problem, but our environment is not RAC. Just one primary and one physical (active) standby and we are in on Solaris 11. Thanks.

  4. Hello Christina,

    it turned out that part of the problem was that there were millions of small audit trace files in the filesystem on primary and standby and this caused the filesystem to respond poorly.

    After cleaning up the audit trace files regularly with a housekeeping job, the situation improved.

  5. Thanks!

Leave Comment