Your experience with RAC Dynamic Remastering (DRM) in 10gR2?

By Martin | May 12th, 2009 | Category: 10g, Bugs, Linux, Linux Itanium, Oracle Database, Performance Tuning, Real Application Clusters, Unix | 5 comments

One of my customers is having severe RAC performance issues, which have occurred a dozen times so far. Each time, the performance impact lasted around 10 minutes and basically caused the application to hang. ASH investigation revealed that the time frame of the performance issues exactly matches a DRM operation on the biggest segment of the database. During the problematic time period, there are 20-50 active sessions instead of the usual 5-10, and they are mostly waiting on gc-related events: “gc buffer busy”, “gc cr block busy”, “gc cr block 2-way”, “gc current block 2-way”, “gc current request”, “gc current grant busy”, etc.

In addition, there is a single session with the wait event “kjbdrmcvtq lmon drm quiesce: ping completion” (on instance 1) and 1-3 sessions with the wait event “gc remaster” (on instance 2). The cur_obj# of the sessions waiting on “gc remaster” points to the segment being remastered.
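The correlation described above can be sketched offline, once ASH samples have been exported from the database. This is only a minimal illustration: the tuple layout `(sample_time, inst_id, session_id, event, obj_no)` and the function name `summarize_samples` are assumptions made for this sketch, not something taken from the original investigation.

```python
from collections import defaultdict

# Wait events named in the post as accompanying a remastering operation.
DRM_EVENTS = {
    "gc remaster",
    "kjbdrmcvtq lmon drm quiesce: ping completion",
}


def summarize_samples(samples):
    """Group ASH-style samples by sample time.

    `samples` is an iterable of
    (sample_time, inst_id, session_id, event, obj_no) tuples.
    This layout is hypothetical; adapt it to however the ASH data
    was exported. `event` is None for sessions that are on CPU.
    """
    summary = defaultdict(
        lambda: {"active": 0, "gc": 0, "drm": False, "remastered_objects": set()}
    )
    for sample_time, inst_id, session_id, event, obj_no in samples:
        bucket = summary[sample_time]
        bucket["active"] += 1                      # every sampled session is active
        if event and event.startswith("gc "):
            bucket["gc"] += 1                      # global cache wait
        if event in DRM_EVENTS:
            bucket["drm"] = True                   # a DRM-related wait was seen
        if event == "gc remaster":
            # Per the post, the object number of these sessions points to
            # the segment being remastered.
            bucket["remastered_objects"].add(obj_no)
    return dict(summary)
```

Sample times where `drm` is set and `active` is far above the usual level are the candidates for a DRM-induced stall, and `remastered_objects` shows which segments were involved.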

Does anybody have any experience with DRM problems with 10.2.0.4 on Linux Itanium?

I know that it is possible to deactivate DRM, but it should usually be beneficial to have it enabled. I could not find any reports of performance impact during DRM operations on Metalink. Support is involved but clueless so far.

Regards,
Martin

http://forums.oracle.com/forums/message.jspa?messageID=3447436#3447436

5 comments

Martin June 13th, 2009 18:42 :
Oracle Support has requested stack traces of the lms processes during the period of performance degradation. We decided to enable OSWatcher to get system-wide Linux data and procwatcher to get lms process stack traces. We created a Grid Control User Defined Metric (UDM) to check whether the symptoms of a DRM performance problem are present. We then triggered the lms stack trace collection with a Grid Control Response Action script attached to the UDM.
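The kind of symptom check such a metric performs could look roughly like the sketch below. It is not the metric we actually deployed: the function name `drm_symptoms_present`, the input format, and the threshold of 20 active sessions (taken from the 20-50 active sessions seen during the incidents) are all assumptions for illustration.

```python
def drm_symptoms_present(active_events, min_active_sessions=20):
    """Return True when the wait picture matches the symptoms in the post.

    `active_events` holds the current wait event of each active session
    (None for sessions on CPU). Both the input format and the default
    threshold are hypothetical choices for this sketch.
    """
    active = len(active_events)
    waits = [e for e in active_events if e]
    has_remaster = "gc remaster" in waits
    gc_waiters = sum(1 for e in waits if e.startswith("gc "))
    # Symptoms: a session in "gc remaster", unusually many active sessions,
    # and most of them waiting on global cache events.
    return (
        has_remaster
        and active >= min_active_sessions
        and gc_waiters > active / 2
    )
```

Requiring all three conditions keeps the check from firing on ordinary load peaks or on a short remaster that does not hurt the application.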

Oracle Support has also requested global hanganalyze and system state dumps but we decided not to collect system state dumps because of the big additional performance impact.

The OSWatcher data showed that during the DRM period, the lms processes had very high CPU utilization.

In the meantime, Oracle Support has confirmed that we are hitting bug 6960699. We have received patch 8516675, which includes the bugfix, and have installed it. Now we are waiting to see whether this indeed fixes the issue.
Martin June 27th, 2009 18:21 :
Unfortunately, patch 8516675 resulted in instance crashes, so we had to deinstall it again. Oracle has now provided two new patches, which we are currently testing.
Benjamin September 5th, 2009 11:22 :
Hi Martin,

I have a 2-node Solaris 10/Oracle 10.2.0.4 RAC setup and have run into the same issues, with my 1 instance restarting after a shutdown abort about twice within an hour. I have also applied patch 8516675, which does not appear to have fixed the issue. Most of the references to the error are listed against previous versions 9.2.0.x and 10.1.

Would you share some feedback on the progress you have made with the 2 new patches you’ve been provided with?

I am currently opening an SR for this, but a head start would be good.

Regards,

Benjamin
admin September 6th, 2009 18:43 :
Oracle Support provided two interim patches (8541032 and 8625153). They replaced the problematic patch 8516675 and fixed the DRM problems as well.

Best regards,
Martin
lefterhs June 10th, 2010 14:05 :
We had the same issue on a 2-node 10.2.0.4 RAC (HP-UX).
We had 2 error 481 instance crashes and GRD freezes due to DRM operations.
The longest freeze recorded lasted 48 minutes.
In our case, the object being remastered was not always the same, and the problem was not related to its size.
For instance, the 48-minute freeze was caused by a DRM on an 800 MB table, while the database contains tables on the scale of terabytes (12 TB total db size).
Since the local Oracle branch and Metalink had not suggested any patches, we decided to disable DRM, and these issues have not reappeared since.