10g

Online Redefinition on Mission Critical Siebel database

I have documented a successful online redefinition operation on a mission-critical Siebel database. It was performed after a big portion of the rows have been deleted. Additionally, row chaining was removed.

The paper can be found in the “papers” section or downloaded here: Online Reorganisation with DBMS_REDEFINITION



ORA-600 – be careful when dynamically resizing SGA structures

If you have Automatic Memory Management turned on (sga_target > 0) and you dynamically resize db_cache_size, you might – like me – encounter an instance crash with ORA-600 [Kmgs_pre_process_request_2]. Small change – big impact! Oracle tracks this problem with Bug 6597948 and Bug 5942310. There is a patch for HP-UX Itanium 10.2.0.3. It will be fixed in 10.2.0.5 and 11g. MetaLink Note 737458.1 gives more details.

An even more obscure problem occurs when sga_target is set to an exact multiple of 4 GB: Ora-600 [Kmgs_Pre_Process_Request_6] Terminates Instance When Resizing Caches. See MetaLink 373802.1.



Wait Event “cursor: pin S wait on X” suspected to be related with Automatic Memory Management

Quite a while ago, I experienced severe performance problems while processing peak workload with multiple sessions, all waiting on "pin S wait on X". Somehow I suspected that it might have to do with frequent Automatic Memory Management Resize operations. Therefore, we have disabled it with sga_target=0. After that, we did not experience these issues anymore.

As a responsible DBA I have opened a service request with Oracle and asked:

When looking for the wait event CURSOR: PIN S WAIT ON X on metalink, there are quite a few relations to the automatic shared memory management. Could you investigate, whether disabling ASMM would benefit regarding to these mutex waits?

Oracle replied:

Based on the uploaded systemstate and stacks it is no connection between the mutex waits and the ASMM.

Today, I have read a metalink note: 742599.1 – FREQUENT RESIZE OF SGA, which confirms my suspicion. The bug is tracked as Bug 6528336 – LARGE NUMBER OF SESSIONS WAITING ON CURSOR: PIN S WAIT ON X. There is a patch available for HP-UX Itanium on 10.2.0.3 but I would rather disable Automatic Memory Management and wait for it to be fixed with a later – and better tested – patchset.

Support however recommended to get systemstate level 266 dumps as soon as the problem occurs again and then look for the holder of the mutex. An example is given on MetaLink Note: 423153.1.

The correlation is done after th IDN as bellow:

To find more details use the idn=XXXXXX to search down in the systemstate (ie idn=535d1a6c)

KGX Atomic Operation Log 7000002e5b9d160
Mutex 7000002b8e92268(3094, 0) idn 535d1a6c oper GET_SHRD SID 3094 holds it
Cursor Pin uid 2489 efd 0 whr 5 slp 58733 SID 2489 wants it in Shared (GET_SHRD)
opr=2 pso=70000028c47def0 flg=0
pcs=7000002b8e92268 nxt=0 flg=34 cld=3 hd=70000030d6c6eb0 par=7000002eefe64d0
ct=31 hsh=0 unp=0 unn=0 hvl=b825a4d0 nhv=1 ses=700000309b42600
hep=7000002b8e922e8 flg=80 ld=1 ob=7000002de49f8a0 ptr=70000022cf39db8 fex=70000022cf390c8

To find the HOLDER, search for idn XXXXXXX oper until you find one which is held (ie not GET_XXX)(
ie idn 535d1a6c oper):-

KGX Atomic Operation Log 7000002cd934270
Mutex 7000002b8e92268(3094, 0) idn 535d1a6c oper EXCL SID 3094 holds in Exclusive (EXCL)
Cursor Pin uid 3094 efd 0 whr 7 slp 0
opr=3 pso=7000002a71c4180 flg=0
pcs=7000002b8e92268 nxt=0 flg=34 cld=3 hd=70000030d6c6eb0 par=7000002eefe64d0
ct=31 hsh=0 unp=0 unn=0 hvl=b825a4d0 nhv=1 ses=700000309b42600
hep=7000002b8e922e8 flg=80 ld=1 ob=7000002de49f8a0 ptr=70000022cf39db8 fex=70000022cf390c8

The wait event "cursor: pin S wait on X" is also related to two different bugs:

  • Bug 5485914 – MUTEX REPORTED SELF DEADLOCK AFTER
    DBMS_MONITOR.SESSION_TRACE_ENABLE
  • Bug 5907779 – Self deadlock hang on “cursor: pin S wait on X” (typically from DBMS_STATS). An excellent way to diagnose if you suffer from this bug is given here.


ORA-00322, ORA-00312 at DataGuard Standby

From time to time I have encountered the following errors on the physical dataguard standby database while in recovery mode:

Errors in file /oracle/STBDB1/oratrace/bdump/stbdb1_mrp0_26719.trc:
ORA-00322: log 5 of thread 1 is not current copy
ORA-00312: online log 5 thread 1: '/oracle/STBDB1/origlogB/standby_g5_m1.log'
Sun Oct 16 13:04:08 2008
Errors in file /oracle/STBDB1/oratrace/bdump/stbdb1_mrp0_26719.trc:
ORA-00322: log 5 of thread 1 is not current copy
ORA-00312: online log 5 thread 1: '/oracle/STBDB1/origlogB/standby_g5_m1.log'

The problem corrects itself after a couple of minutes, so there is no real problem, but if you have alert log monitoring active, this will trigger a call for investigation. Oracle is tracking this error with Bug 5238386 – ORA-322 possible reading standby redo log header. There is a one-off patch available for 10.2.0.3 and the patch is included in patchset 10.2.0.4.

There is a workaround to clear the referenced redo log but I don´t see any point in doing that, because the error can occurr again as long as the patch is not installed.



Renaming SQL Profiles generated with OEM

If you created an SQL Profile in OEM by executing the SQL Tuning Advisor Job and implementing the profile, then the profile has a system-generated name like "SYS_SQLPROF_0145a2a3145a2a31". You can rename this name to a more descriptive name with DBMS_SQLTUNE.ALTER_SQL_PROFILE.

BEGIN dbms_sqltune.alter_sql_profile ( NAME => ‘SYS_SQLPROF_0145a2a3145a2a31’, attribute_name => ‘NAME’, VALUE => ‘ORDER_QUERY_AVOID_ISS’ );
END;
/



Notifications of DBMS_SCHEDULER Jobs

Oracle is offering source code for notification of DBMS Jobs. The package can be downloaded from http://www.oracle.com/technology/products/database/scheduler/job_notification.zip. Although it is filed under 11g, it should work fine with 10gR2 as well.

With the package, you can get notifications if e.g. your GATHER_STATS_JOB did not finish within the maintenance windows and was aborted.



Session terminates when querying v$sql_plan

If your Oracle session  terminates when you query v$sql_plan, you are potentially hitting Bug 5933643, which is fixed in 10.2.0.4. A workaround is to disable “_cursor_plan_unparse_enabled”. A MetaLink Note is available for further information: 361342.1 – Dump In msqsub() When Querying V$SQL_PLAN



PGA Management in Oracle 10gR2

When researching PGA utilization on 10gR2, I collected a couple of important facts for automatic pga management:

Parameters (partially hidden)

  • pga_aggregate_target: total physical memory for all workareas of all processes system-wide
  • _smm_max_size: limit for single work area
  • _pga_max_size: limit for single process
  • _smm_px_max_size: limit for all parallel slaves of one workarea

pga_aggregate_target:

  • if p_a_t  <= 500 MB, then _smm_max_size = 20% of p_a_t
  • if p_a_t  between 500 and 1000 MB, then _smm_max_size = 100 MB
  • if p_a_t >= 1000 MB, then _smm_max_size = 10% of p_a_t

_pga_max_size:

  • _pga_max_size = 2 *  _smm_max_size

_smm_px_max_size:

  • _smm_px_max_size = 50% PAT
  • degree of parallelism (dop) <= 5, then _smm_max_size is used for parallel slaves instead of smm_px_max_size
  • degree of parallelism (dop) > 5, then _smm_px_max_size/DOP is used

References:

Richmond Shee: If your memory serves you right

The Pythian Group: Working with Automatic PGA



AWR Analysis for IO Reporting

Oracle gathers a wealth of information in the Automatic Workload Repository (AWR). We can query this data to get very important IO access time information.

  • Create a working table because depending of the size of the AWR, the queries could take a long time and cause load on the database.

drop table mdecker.filestatistics;
create table mdecker.filestatistics as select t.snap_id, t.begin_interval_time, filename, file#, tsname, phyrds, singleblkrds, readtim, singleblkrdtim, phyblkrd from
dba_hist_filestatxs f ,
dba_hist_snapshot t
where
t.snap_id = f.snap_id and (
begin_interval_time between to_date('05.11.2007 09:00','DD.MM.YYYY HH24:MI') and
to_date('09.11.2007 17:00','DD.MM.YYYY HH24:MI')
OR
begin_interval_time between to_date('12.11.2007 09:00','DD.MM.YYYY HH24:MI') and
to_date('16.11.2007 17:00','DD.MM.YYYY HH24:MI') );

  • Create a summary table

create table mdecker.filestatistics_summary as
select snap_id, begin_interval_time, sum(phyrds) phyrds, sum(singleblkrds) singleblkrds from mdecker.filestatistics
group by snap_id, begin_interval_time
order by snap_id

  • Query the summary table

select snap_id,
y.begin_interval_time,
phyrds as prev_phyrds,
lead(phyrds) over(order by snap_id) current_physrds,
lead(phyrds) over(order by snap_id) - phyrds,
begin_interval_time as prev_interval_time,
lead(begin_interval_time) over(order by snap_id) current_time,
lead(begin_interval_time) over(order by snap_id) - begin_interval_time,
extract(second from lead(begin_interval_time) over(order by snap_id) - begin_interval_time) as sec,
extract(minute from lead(begin_interval_time) over(order by snap_id) - begin_interval_time) as min,
(extract(minute from lead(begin_interval_time) over(order by snap_id) - begin_interval_time)*60)+extract(second from lead(begin_interval_time) over(order by snap_id) - begin_interval_time),
(lead(phyrds) over(order by snap_id) - phyrds)/
((extract(minute from lead(begin_interval_time) over(order by snap_id) - begin_interval_time)*60)+extract(second from lead(begin_interval_time) over(order by snap_id) - begin_interval_time))
from filestatistics_summary y



Is your RAC 10.2.0.4 a ticking bomb?

I have come a across a very nasty bug on Oracle RAC on HP-UX Itanium after upgrading to 10.2.0.4. The problem might also occur on Solaris and Linux x86-64. On one of the RAC instances of a 2 node Cluster, the number of open file descriptors of the oracle racgimon process is increasing by 1 every 60 seconds. This means that if your ulimit of open files for a process is set high and the HP-UX Kernel parameter nfiles is also set high, it might take weeks to months until the racgimon process finally hits the limit. If that happens, it can cause instability of the node because no more filedescriptors can be opened system-wide.

How to check, whether my installation suffers from this bug?

It is very easy: do an lsof -p and look for dozens of open filedescriptors of file hc_SID.dat.

– or –

Check logfile $ORACLE_HOME/log//racg/imon_.log:

2008-10-07 13:05:50.879: [ RACG][82] [29827][82][ora.DBNAME.DBNAME1.inst]: GIMH: GIM-00104: Health
check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13

Is there a patch?
Good news is, there is. The bug is tracked via BugID 6931689 and there is patch #7298531 available to fix the problem with metalink note 739557.1.