Data Guard

Data Guard 11.2.0.2 – update

In March, Oracle MAA published a white paper on implementing Data Guard on Exadata platforms (http://www.oracle.com/technetwork/database/features/availability/maa-wp-dr-dbm-130065.pdf).

Although the paper focuses on Exadata, some of the information can be applied to non-Exadata systems as well:

  • Corruption Protection (a parameter sketch follows this list):
    • Primary:
      • DB_BLOCK_CHECKSUM=FULL,
      • DB_BLOCK_CHECKING=FULL,
      • DB_LOST_WRITE_PROTECT=TYPICAL
    • (Physical) Standby:
      • DB_BLOCK_CHECKSUM=FULL,
      • DB_BLOCK_CHECKING=OFF,
      • DB_LOST_WRITE_PROTECT=TYPICAL
  • Network Tuning:
    TCP send/receive buffers: 3 x bandwidth-delay product (BDP) or 10 MB, whichever is greater. For example, a 1 Gbit/s link with 5 ms RTT has a BDP of about 0.6 MB, so 3 x BDP is roughly 1.9 MB and the 10 MB floor applies.
  • Redo Transport Modes:
    • SYNC is recommended if the round-trip time is less than 5 ms. The impact of SYNC mode on primary performance has been reduced because the local online redo log write and the redo shipping to the standby are no longer done sequentially but in parallel.
    • ASYNC: Transport lag is reduced because redo is read from the log buffer whenever possible instead of from the online redo log on disk. The init.ora parameter log_buffer might need to be increased for this to yield the maximum benefit.
  • LOGGING/NOLOGGING: Normally, in Data Guard you set database-wide “ALTER DATABASE FORCE LOGGING”. In data warehouse environments, it might be advisable not to force logging database-wide and instead decide on a per-tablespace basis whether a tablespace should allow or disallow NOLOGGING operations. E.g. a tablespace containing only transient, recreatable or non-critical data might benefit from NOLOGGING for certain bulk operations (CTAS, direct path inserts, index rebuilds, etc.) – see the sketch after this list.
  • STANDBY-FIRST Patches: Some patches (PSU, CPU, PSE) are flagged in the patch README as “Standby-First”. This means they can be applied on a physical standby before being applied on the primary. Previously, this was not supported for regular patches.
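
For reference, here is a minimal sketch of the corruption-protection settings from the list above. All three parameters are dynamic; the values are those from the paper, but the performance impact of FULL checking should be tested before enabling it in production:

    -- Primary database
    ALTER SYSTEM SET db_block_checksum     = FULL    SCOPE=BOTH;
    ALTER SYSTEM SET db_block_checking     = FULL    SCOPE=BOTH;
    ALTER SYSTEM SET db_lost_write_protect = TYPICAL SCOPE=BOTH;

    -- Physical standby database
    ALTER SYSTEM SET db_block_checksum     = FULL    SCOPE=BOTH;
    ALTER SYSTEM SET db_block_checking     = OFF     SCOPE=BOTH;
    ALTER SYSTEM SET db_lost_write_protect = TYPICAL SCOPE=BOTH;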
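
And a minimal sketch of the per-tablespace logging approach mentioned under LOGGING/NOLOGGING; the tablespace names app_data and etl_stage are made up for illustration:

    -- Do not force logging database-wide ...
    ALTER DATABASE NO FORCE LOGGING;

    -- ... but force it for tablespaces holding critical data
    ALTER TABLESPACE app_data FORCE LOGGING;

    -- Transient, recreatable staging data may use NOLOGGING bulk operations
    ALTER TABLESPACE etl_stage NOLOGGING;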


Is 11.2.0.2 ready for production? Judge for yourself …

Murphy said: if problems arise, they all come at the same time. After several weeks of stability, this week brought a hell of a lot of Oracle problems with 11.2.0.2 and RAC. To list the worst:

  • Downtime because the ASM process limit was reached. Normally, the formula for calculating the required number of ASM processes depends on the number of concurrent datafile extensions. In our case, 100 ASM processes were not enough. We increased the limit to 200 and found Note 1287496.1, which describes the issue (a sketch of the parameter change follows this list). A merge patch for the bug is available for 11.2.0.2 GI Bundle 1 but not yet for GI Bundle 2. Oops.
  • CPU starvation because of Adaptive Cursor Sharing: One server process had been consuming 100% of one CPU for several hours. It turned out to be an OEM monitoring query with several thousand child cursors.
    PID   USER   PR  NI  VIRT   RES  SHR  S  %CPU  %MEM   TIME+     COMMAND
    8667  ora11  25   0  8423m  33m  28m  R  99.7   0.1  358:27.78  ora_pz98_MDDB

    I then checked other databases and found that almost all of our 11.2.0.2 databases have several hundred or even thousands of child cursors for some statements.

    SELECT sql_id, is_shareable, MIN(child_number), MAX(child_number), COUNT(*)
    FROM   gv$sql
    GROUP  BY sql_id, is_shareable
    HAVING COUNT(*) > 100
    ORDER  BY 4 DESC;
     
    SQL_ID        IS_SHAREABLE MIN(CHILD_NUMBER) MAX(CHILD_NUMBER)  COUNT(*)
    ------------- ------------ ----------------- ------------------ ---------
    c7kt3njhnmtkm Y            0                 5097               1397
    c7kt3njhnmtkm N            3                 3544               1836
    1vnhgmpc17vv0 Y            0                 3022               2697
    1vnhgmpc17vv0 N            6                 2185               444
    93qh89pxuxzuw Y            0                 1949               1522
    93qh89pxuxzuw N            2                 1625               428
    5fk0v8km2f811 Y            0                 1281               1763
    4f3ufvfcgfqsg Y            0                 792                794
    cjbwk0ww7j5rv Y            0                 627                1251
    dyqdzgxcv4d3t Y            0                 626                1252
    5fk0v8km2f811 N            3                 543                260
    f0jxh8d6b5af2 Y            0                 494                564
    f0jxh8d6b5af2 N            0                 290                130
    dbvkky621gqtr Y            0                 266                267
    32rqmpqpfv0gm Y            0                 255                257
    g9uwxs7pr8tjm Y            0                 254                257
    40k6jjt90n4fa Y            3                 199                129

    I suspect this to be Bug 10182051 “Extended cursor sharing generates many shareable child cursors”. A diagnostic sketch for checking why the child cursors are not shared follows this list; the workaround is:

    alter system set "_optimizer_extended_cursor_sharing_rel"=none;

  • Limited database availability because of failing queries on gv$ views:
    SQL> select count(*) from gv$session
    2 ;
    select count(*) from  gv$session
    *
    ERROR at line 1:
    ORA-12850: Could not allocate slaves on all specified instances: 2 needed, 1
    allocated
    ORA-12801: error signaled in parallel query server P001, instance 3599

    Currently there is no known workaround other than bouncing all the RAC instances.

  • Data Guard ASYNC redo transport not reliable: We have a RAC primary / single-instance physical standby setup and use ASYNC redo transport. During times of heavy ETL on the primary, the standby database stops recovery at one archive log with “(in transit)”. The primary shows this error in its alert log:
    ARC7: Creating remote archive destination LOG_ARCHIVE_DEST_2: 'HAMDB' (thread 2 SEQUENCE 4044) (MUCDB2)
    ARC7: Archive log rejected (thread 2 SEQUENCE 4044) at host 'HAMDB'
    FAL[server, ARC7]: FAL archive failed, see trace file.
    ARCH: FAL archive failed. Archiver continuing
    ORACLE Instance MUCDB2 - Archival Error. Archiver continuing.

    On the standby side, the alert log says:
    Media Recovery Waiting for thread 2 sequence 4044 (in transit)

    The standby database never recovers from this problem unless it is bounced. The problem appears with and without a broker configuration. Currently there is no known workaround.
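
For reference, a minimal sketch of the ASM processes change mentioned in the first item above; processes is a static parameter, so the ASM instances have to be restarted (in a rolling fashion) afterwards:

    -- On the ASM instance(s): raise the process limit in the spfile
    ALTER SYSTEM SET processes=200 SCOPE=SPFILE SID='*';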
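
And a rough diagnostic sketch for the child-cursor problem, counting per instance how many child cursors of one of the statements above were not shared because of a bind equivalence failure (the sql_id is taken from the output above; BIND_EQUIV_FAILURE is the column tied to adaptive cursor sharing in 11g):

    -- Count child cursors per instance and how many were rejected for
    -- bind equivalence failures (typical for adaptive cursor sharing)
    SELECT inst_id,
           COUNT(*)                                   AS child_cursors,
           SUM(DECODE(bind_equiv_failure, 'Y', 1, 0)) AS bind_equiv_failures
    FROM   gv$sql_shared_cursor
    WHERE  sql_id = 'c7kt3njhnmtkm'
    GROUP  BY inst_id;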

Maybe some of these issues will be addressed in the upcoming April PSU, which will be released this week.



DOAG Presentation on Oracle Data Guard 11g (R1/R2) “What’s New?” (German)

I have uploaded the material of the German DOAG presentation on Oracle Data Guard 11gR2 “What’s New” to the “presentation” section. It also contains six recorded demos using Enterprise Manager Grid Control.



DataGuard Broker Setup fails because of Cisco ASA Firewall Feature “fixup/inspect”

I ran into this problem a couple of years ago, but I still think it is worth blogging about. Back then, I tried to set up a Data Guard broker environment, but when I tried to enable the configuration on the dgmgrl command line, the Data Guard configuration files could not be synced to the standby host. After a couple of days of troubleshooting, it turned out that the Cisco ASA firewall had the SQL*Net inspection feature enabled. Because the Data Guard broker packets do not exactly conform to SQL*Net, the firewall corrupted them, and the broker was not able to create the Data Guard configuration. After disabling SQL*Net inspection, everything worked without any further problems.



ORA-00322, ORA-00312 at DataGuard Standby

From time to time I have encountered the following errors on a physical Data Guard standby database while it is in recovery mode:

Errors in file /oracle/STBDB1/oratrace/bdump/stbdb1_mrp0_26719.trc:
ORA-00322: log 5 of thread 1 is not current copy
ORA-00312: online log 5 thread 1: '/oracle/STBDB1/origlogB/standby_g5_m1.log'
Sun Oct 16 13:04:08 2008
Errors in file /oracle/STBDB1/oratrace/bdump/stbdb1_mrp0_26719.trc:
ORA-00322: log 5 of thread 1 is not current copy
ORA-00312: online log 5 thread 1: '/oracle/STBDB1/origlogB/standby_g5_m1.log'

The problem corrects itself after a couple of minutes, so there is no real harm, but if you have alert log monitoring active, it will trigger a call for investigation. Oracle tracks this error as Bug 5238386 – ORA-322 possible reading standby redo log header. A one-off patch is available for 10.2.0.3, and the fix is included in patchset 10.2.0.4.

There is a workaround that clears the referenced redo log (shown below for reference), but I don’t see much point in doing that, because the error can occur again as long as the patch is not installed.
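
For completeness, the workaround would look roughly like this on the standby (the group number 5 is taken from the ORA-00312 message above; managed recovery has to be stopped first and restarted afterwards):

    -- Stop managed recovery, clear the affected standby redo log group, resume
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
    ALTER DATABASE CLEAR LOGFILE GROUP 5;
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;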