Your experience with RAC Dynamic Remastering (DRM) in 10gR2?

One of my customers is experiencing severe RAC performance issues, which have appeared about a dozen times so far. Each time, the performance impact lasted around 10 minutes and effectively hung the application. ASH investigation revealed that the time frame of the performance issues exactly matches a DRM operation on the biggest segment of the database. During the problematic period, there are 20-50 active sessions instead of the usual 5-10, and they are mostly waiting on gc-related events: "gc buffer busy", "gc cr block busy", "gc cr block 2-way", "gc current block 2-way", "gc current request", "gc current grant busy", etc.

In addition, there is a single session waiting on "kjbdrmcvtq lmon drm quiesce: ping completion" (on instance 1) and 1-3 sessions waiting on "gc remaster" (on instance 2). The cur_obj# of the session waiting on "gc remaster" points to the segment being remastered.

Does anybody have any experience with DRM problems with 10.2.0.4 on Linux Itanium?

I know that it is possible to deactivate DRM, but usually it should be beneficial to have it enabled. I could not find any reports of performance impact during DRM operations on MetaLink. Support is involved but clueless so far.

Regards,
Martin

http://forums.oracle.com/forums/message.jspa?messageID=3447436#3447436
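For reference, deactivating DRM in 10gR2 is usually done via hidden parameters. The following is only a sketch using the parameter names commonly cited for 10.2; they should only be set after confirmation by Oracle Support and require an instance restart:

-- commonly cited 10.2 hidden parameters to disable dynamic remastering (sketch)
ALTER SYSTEM SET "_gc_affinity_time"=0 SCOPE=SPFILE SID='*';
ALTER SYSTEM SET "_gc_undo_affinity"=FALSE SCOPE=SPFILE SID='*';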

RAC Deadlock Detection not working on 10.2.0.4 / Linux Itanium

We recently experienced a major issue where the production database hung. It turned out that the database had deadlocked, but the GES did not detect the deadlock, so all the sessions kept waiting on "enq: TX - row lock contention".

We were able to provide a reproducible testcase, and it turned out to be bug 7014855. The bug is specific to the Linux Itanium port, and a patch is available.
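A minimal sketch of the kind of testcase involved (table and column names are made up for illustration): two sessions each lock one row and then try to update the row held by the other session. Normally one of them receives ORA-00060 from deadlock detection; with the bug, both simply keep waiting.

-- Session 1
UPDATE t SET val = 1 WHERE id = 1;   -- locks row 1
-- Session 2
UPDATE t SET val = 2 WHERE id = 2;   -- locks row 2
-- Session 1
UPDATE t SET val = 1 WHERE id = 2;   -- waits on "enq: TX - row lock contention"
-- Session 2
UPDATE t SET val = 2 WHERE id = 1;   -- closes the cycle; without the bug, one
                                     -- session would receive ORA-00060 here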

Oracle Enterprise Manager – Grid Control 10gR5 (10.2.0.5)

A while ago, I attended a one-day seminar of DOAG, the German Oracle user group, which covered Oracle Enterprise Manager Grid Control 10gR5.

There were presentations about:

  • Bare-Metal Provisioning: Provisioning of Linux on top of new hardware
  • Provisioning of RAC Nodes (Clusterware, ASM, RAC)
  • Provisioning of RAC Databases
  • Command Line Interface EMCLI
  • Managing Oracle VM with Oracle Enterprise Manager
  • OEM Grid Control 10gR5 – New Features

The presentation on the new features is available (although in German) here. The other presentations are available to DOAG members on the DOAG site. Moreover, one of the presenters will set up a website for OEM topics at insight-oem.com.

Monitoring Index Usage not reliable in 10.2.0.4

I recently had to analyze a system with respect to index usage. The application has dozens of indexes, and we have to find out which ones are really needed and which are candidates for removal.

There is a nice feature called Index Monitoring, which sets a flag when the index is used.

ALTER INDEX schema.index MONITORING USAGE;

Internally, a recursive SQL statement is executed whenever a SQL statement using the index is parsed:

UPDATE object_usage SET flags=flags+1 WHERE obj#=:1 AND bitand(flags, 1)=0;

To check whether the index was used, you can query the view v$object_usage. Unfortunately, the view definition contains a WHERE clause that lists only the indexes of the current schema.

   WHERE    io.owner# = userenv('SCHEMAID')

Therefore, you can create a new view that simply omits this restriction from the WHERE clause:

 CREATE OR REPLACE VIEW V$ALL_OBJECT_USAGE 
         (OWNER,
         INDEX_NAME, 
         TABLE_NAME, 
         MONITORING, 
         USED, 
         START_MONITORING, 
         END_MONITORING) 
         AS 
           SELECT u.name, io.name, t.name, 
           decode(bitand(i.flags, 65536), 0, 'NO', 'YES'), 
           decode(bitand(ou.flags, 1), 0, 'NO', 'YES'), 
           ou.start_monitoring, 
           ou.end_monitoring 
         FROM sys.user$ u, sys.obj$ io, sys.obj$ t, sys.ind$ i, sys.object_usage ou 
         WHERE i.obj# = ou.obj# 
         AND io.obj# = ou.obj# 
         AND t.obj# = i.bo#
         AND u.user# = io.owner#;

However, what is less well known is that DBMS_STATS can in some cases manipulate these flags. The issue is tracked with "Bug 6798910 – DBMS_STATS or EXPLAIN PLAN trigger index usage monitoring". If you are using 10.2.0.4 and gather stats manually with CASCADE=>TRUE, the monitoring flag is set to "YES". Obviously, in this case you cannot identify candidate indexes for removal.
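A sketch of how the side effect can be observed; MDECKER, T1 and T1_I1 are hypothetical owner, table and index names, and the v$object_usage queries have to be run as the index owner:

ALTER INDEX mdecker.t1_i1 MONITORING USAGE;

-- no application SQL has touched the index yet, so USED should be NO
SELECT index_name, used FROM v$object_usage WHERE index_name = 'T1_I1';

-- on 10.2.0.4, gathering stats with CASCADE=>TRUE hits Bug 6798910 ...
BEGIN
  dbms_stats.gather_table_stats(ownname => 'MDECKER',
                                tabname => 'T1',
                                cascade => TRUE);
END;
/

-- ... and USED may now show YES although the application never used the index
SELECT index_name, used FROM v$object_usage WHERE index_name = 'T1_I1';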

Diagnosing “cursor: pin S wait on X”

I came across a MetaLink note (Note 786507.1) which describes a very quick way to identify the blocker of a session waiting on the mutex wait "cursor: pin S wait on X". Previously, I thought this could only be analyzed with system state dumps, but the note explains that parameter P2RAW of the wait event contains the blocking session id and the refcount. In 32-bit environments, the field contains 4 bytes (2 bytes for the session id, 2 bytes for the refcount); in 64-bit environments, it contains 8 bytes (4 bytes each).

select p2raw from v$session where event = 'cursor: pin S wait on X';

The first 2 or 4 bytes of P2RAW, converted to decimal, give the blocking session id (SID).
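A sketch of the conversion on a 64-bit platform, where the leading 4 bytes (8 hex digits) of P2RAW hold the blocker's SID; on 32-bit, take only the first 4 hex digits and shorten the format mask accordingly:

SELECT sid,
       p2raw,
       to_number(substr(rawtohex(p2raw), 1, 8), 'XXXXXXXX') AS blocking_sid
  FROM v$session
 WHERE event = 'cursor: pin S wait on X';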

Oracle RAC 10.2.0.4 bug when adding/dropping redo log groups

On the oracle-l list, there was a post about a nasty bug in Oracle RAC 10.2.0.4 which can cause LGWR errors, instance failures and redo log corruptions.

The bug occurs if you add or drop redo log groups on a cluster database while some of its instances are shut down.
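For illustration only, the kind of operation that triggers the bug (group number and size are made up; the syntax assumes OMF, i.e. db_create_file_dest is set):

-- executed on instance 1 while instance 2 is shut down; dropping a group
-- in the same situation is equally affected
ALTER DATABASE ADD LOGFILE THREAD 2 GROUP 5 SIZE 512M;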

More information about this bug, which is tracked with Bug ID 6786022, can be found on MetaLink. A patch is available for 10.2.0.4 on Linux x86-64.

ASSM Problem with too low PCTFREE

During the Hotsos Symposium 2009 Training Day, Jonathan Lewis presented a problem which appears even on current 10g/11g databases. What is especially interesting is how this issue can be diagnosed. I reproduced the problem on 11.1.0.7 and will provide the steps you can use to verify it. The problem is best demonstrated by comparing the response time of an update statement against an 8k blocksize and a 16k blocksize table.

  • 8k Blocksize

    SQL> DROP TABLE t1_8k purge;
     
    SQL> CREATE TABLESPACE DEMO8K datafile SIZE 128M blocksize 8192;
     
    TABLESPACE created.
     
    Elapsed: 00:00:07.35
     
    SQL> CREATE TABLE mdecker.t1_8k
    (n1 NUMBER,
     n2 NUMBER)
     TABLESPACE DEMO8K;
     
    TABLE created.
     
    Elapsed: 00:00:00.02
     
    SQL> INSERT INTO t1_8k
    SELECT TRUNC(dbms_random.VALUE(10000000,100000000)) n1,
               TO_NUMBER(NULL) AS n2
        FROM dual
    CONNECT BY LEVEL <= 500000
    /
     
     
    500000 ROWS created.
     
    Elapsed: 00:00:06.89
     
    SQL> BEGIN dbms_stats.gather_table_stats(
            ownname => 'MDECKER',
            tabname => 'T1_8K',
            estimate_percent => 100);
    END;
    /
     
    PL/SQL PROCEDURE successfully completed.
     
    Elapsed: 00:00:02.13
     
    SQL> SELECT num_rows,blocks FROM dba_tables WHERE table_name = 'T1_8K';
     
      NUM_ROWS     BLOCKS
    ---------- ----------
        500000        874
     
    Elapsed: 00:00:00.15
    SQL> UPDATE t1_8k SET n2 = n1;
     
    500000 ROWS updated.
     
    Elapsed: 00:01:09.04
    SQL> COMMIT;

  • 16k Blocksize

    SQL> CREATE TABLESPACE DEMO16K datafile SIZE 128M blocksize 16384;
     
    TABLESPACE created.
     
    Elapsed: 00:00:14.75
     
    SQL> CREATE TABLE mdecker.t1_16k
    (n1 NUMBER,
     n2 NUMBER)
     TABLESPACE DEMO16K;
     
    TABLE created.
     
    Elapsed: 00:00:00.03
     
    SQL> INSERT INTO t1_16k
    SELECT TRUNC(dbms_random.VALUE(10000000,100000000)) n1,
               TO_NUMBER(NULL) AS n2
        FROM dual
    CONNECT BY LEVEL <= 500000
    /  
     
     
    500000 ROWS created.
     
    Elapsed: 00:00:05.51
    SQL> BEGIN dbms_stats.gather_table_stats(
            ownname => 'MDECKER',
            tabname => 'T1_16K',
            estimate_percent => 100);
    END;
    /  
     
    PL/SQL PROCEDURE successfully completed.
     
    Elapsed: 00:00:01.78
    SQL> SELECT num_rows,blocks FROM dba_tables WHERE table_name = 'T1_16K';
     
      NUM_ROWS     BLOCKS
    ---------- ----------
        500000        436
     
    Elapsed: 00:00:00.07
     
    SQL>  UPDATE t1_16k SET n2 = n1;
     
    500000 ROWS updated.
     
    Elapsed: 00:20:20.89

    As you can see, the update statement on the 8k blocksize table took around 69 seconds, whereas the same update on the table in the 16k tablespace took more than 20 minutes. When executing oradebug short_stack repeatedly, you can see that for the 16k update the stack trace looks much the same every time: a lot of time is spent in the kernel functions ktspfsrch() and ktspscan_bmb().


    SQL> oradebug setospid 23668
    Oracle pid: 18, Unix process pid: 23668, image: oracle@ora-vm1.intra (TNS V1-V3)
    SQL> oradebug short_stack
    .....ktspfsrch()+559<-ktspscan_bmb()+315 .....
    SQL> oradebug short_stack
    .....ktspfsrch()+559<-ktspscan_bmb()+315 .....
    SQL> oradebug short_stack
    .....ktspfsrch()+559<-ktspscan_bmb()+315 .....

    It is important to understand that the problem is not necessarily related to the block size but to the PCTFREE value: the rows grow when n2 is populated by the update, and with a PCTFREE that is too low the update spends its time in ASSM's search for blocks with sufficient free space (see the sketch after the links below). More information about this topic can be found here:

    http://structureddata.org/files/jl_test_case.html
    http://structureddata.org/2008/09/08/understanding-performance/
    http://www.oraclealchemist.com/oracle/hey-guys-does-size-matter/
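    As a sketch (not part of the original testcase): reserving more free space per block, so that the rows still fit after n2 is populated, avoids the expensive space search. PCTFREE 40 is an assumed value, not a tested recommendation:

    SQL> CREATE TABLE mdecker.t1_16k_fix
    (n1 NUMBER,
     n2 NUMBER)
    PCTFREE 40
    TABLESPACE DEMO16K;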

Is data in DBA_HIST_SQLTEXT aged out?

This question came up during the excellent presentation by Dave Abercrombie, A Tour of the AWR Tables, at the Hotsos Symposium 2009.

Nobody knew the answer, but curious as I am, I wanted to know:

Where is the data stored?

SQL> select owner, object_name, object_type from dba_objects where object_name = 'DBA_HIST_SQLTEXT';

OWNER      OBJECT_NAME                    OBJECT_TYPE
---------- ------------------------------ -------------------
SYS        DBA_HIST_SQLTEXT               VIEW
PUBLIC     DBA_HIST_SQLTEXT               SYNONYM

SQL> select text from dba_views where view_name = 'DBA_HIST_SQLTEXT';

TEXT
--------------------------------------------------------------------------------
select dbid, sql_id, sql_text, command_type
from WRH$_SQLTEXT

SQL> select owner, object_name, object_type, object_id, data_object_id from dba_objects where object_name = 'WRH$_SQLTEXT';

OWNER      OBJECT_NAME                    OBJECT_TYPE          OBJECT_ID DATA_OBJECT_ID
---------- ------------------------------ ------------------- ---------- --------------
SYS        WRH$_SQLTEXT                   TABLE                     8996           8996
=> Not partitioned.

SQL> select min(snap_id), max(snap_id) from wrh$_SQLTEXT;

MIN(SNAP_ID) MAX(SNAP_ID)
------------ ------------
3040         3274

We could guess that snap_ids before 3040 were removed, but I want to know for sure (BAAG). I could monitor the table over a period of time and check min(snap_id), but I can also query the backed-up object statistics, specifically the minimum value for column snap_id:

select to_char(savtime,'DD.MM.YYYY'), minimum, maximum, distcnt, sample_size
from WRI$_OPTSTAT_HISTHEAD_HISTORY where obj# = 8996 and intcol# = 1;
TO_CHAR(SA    MINIMUM    MAXIMUM    DISTCNT SAMPLE_SIZE
---------- ---------- ---------- ---------- -----------
16.02.2009       2229       2461        114         455
17.02.2009       2296       2535        122         471
18.02.2009       2326       2559        114         469
19.02.2009       2346       2584        119         476
20.02.2009       2370       2607        108         448
21.02.2009       2394       2632        112         437
23.02.2009       2417       2640        104         412
24.02.2009       2471       2703        113         400
25.02.2009       2495       2728         94         394
26.02.2009       2516       2750        101         396
27.02.2009       2536       2776        103         322
28.02.2009       2560       2799         93         310
02.03.2009       2585       2807         88         303
03.03.2009       2637       2871         92         289
04.03.2009       2693       2895         85         287
05.03.2009       2693       2919         91         298
06.03.2009       2704       2943         86         294
09.03.2009       2730       2967         92         303
10.03.2009       2801       3039        104         303
11.03.2009       2836       3063         94         304
12.03.2009       2856       3087         99         303
13.03.2009       2872       3111        101         301
14.03.2009       2896       3135         99         299
16.03.2009       2921       3144         95         292
17.03.2009       2976       3206         94         292
18.03.2009       3003       3232         94         305

26 rows selected.

We can see that at every daily stats gathering over the last 26 days, the minimum value increased. So the answer is: yes, the data is aged out.
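As a follow-up sketch: the aging of WRH$_ data, including the SQL text, generally follows the AWR snapshot retention, which can be checked and, if needed, changed; the 30-day value below is just an example:

-- current snapshot interval and retention
SELECT snap_interval, retention FROM dba_hist_wr_control;

-- increase retention to 30 days (the value is given in minutes)
BEGIN
  dbms_workload_repository.modify_snapshot_settings(retention => 43200);
END;
/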


Autostart Oracle Services at Clusterware startup

I recently recommended the use of Oracle Services to a new client. I knew that you have to start a service manually after restarting the cluster, and I told them I would investigate how to make the services start automatically. I knew that there is an AUTO_START property of the Clusterware service resource, set to "restore", which means that after an instance crash, Clusterware restores the services to the status they had before the crash. On a manual shutdown/startup, however, the services are not autostarted.

[root@ora-vm1 tmp]# crs_stat -p ora.MDDB.S_TEST.MDDB1.srv
NAME=ora.MDDB.S_TEST.MDDB1.srv
TYPE=application
AUTO_START=restore

I did some research and learned that the easiest way to configure autostart is with user callouts. A quick web search led me to the OTN page "Real Application Clusters Sample Scripts", which contains a FAN callout script to autostart services.

The steps to install the scripts are:

  1. Copy them to $ORA_CRS_HOME/racg/usrco on both nodes
  2. Set permissions to 710 and ownership to oracle:oinstall or whatever your installation is using.
  3. Modify the Perl scripts to contain your ORA_CRS_HOME and ORACLE_HOME. Moreover, if you have used short hostnames for your CRS installation, you have to change "/bin/hostname" to "/bin/hostname -s".

I just realized that Dan Norris also blogged about this topic.

DataGuard Broker Setup fails because of Cisco Firewall ASA Feature “fixup/inspect”

I ran into this problem a couple of years ago, but I still think it is worth blogging about. Back then, I tried to set up a Data Guard broker environment, but when trying to enable the configuration on the dgmgrl command line, the Data Guard configuration files could not be synced to the standby host. After a couple of days of troubleshooting, it turned out that the Cisco ASA firewall had the SQL*Net inspection feature enabled. Because the Data Guard broker packets do not exactly conform to SQL*Net, the firewall corrupted them, and the Data Guard configuration could not be created. After disabling the SQL*Net inspection, everything worked without any further problems.