Oracle RAC Rolling Patching without Downtime

As part of a customer engagement, I have documented the approach of patching a mission-critical Oracle Real Application Cluster (RAC) Database from 19.8.0 to 19.10.0 WITHOUT DOWNTIME. The critical part is the Rolling Patching Oracle Java VM (OJVM).

The paper can be downloaded here: Oracle RAC Rolling Patching without Downtime

A list of recommeded patches can be found here:
https://www.ora-solutions.net/web/resources/patch-recommendations/
If you need assistance or an updated list of recommended patches, please contact me at: martin.decker@ora-solutions.net

DOAG Conference 2019 Presentation: “LINUX Know-How for DBAs (german)”

I have uploaded my presentation plus the recorded demos to the website.

Oracle Database Proactive Bundle 12.1.0.2 April 2018 changes DataPump File Format Number

After applying Oracle Database Proactive Bundle Patch (DBBP) 12.1.0.2 April 2018 (Patch 27486326), the datapump dump file internal format is changed from version 4.1 to version 4.2. The effect is, that a dump file created with DataPump from Version 12.1.0.2 DBBP 180417 can no longer be imported in any lower database except when the datapump export is performed with VERSION=12.1 clause.

MOS 462488.1 gives a script to extract Dump File Version from file:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
SQL> exec show_dumpfile_info('DATA_PUMP_DIR','test.dmp'); 
---------------------------------------------------------------------------- 
Purpose..: Obtain details about export dumpfile. Version: 18-DEC-2013 
Required.: RDBMS version: 10.2.0.1.0 or higher 
. Export dumpfile version: 7.3.4.0.0 or higher 
. Export Data Pump dumpfile version: 10.1.0.1.0 or higher 
Usage....: execute show_dumfile_info('DIRECTORY', 'DUMPFILE'); 
Example..: exec show_dumfile_info('MY_DIR', 'expdp_s.dmp') 
---------------------------------------------------------------------------- 
Filename.: test.dmp 
Directory: DATA_PUMP_DIR 
Disk Path: ...
Filetype.: 1 (Export Data Pump dumpfile) 
---------------------------------------------------------------------------- 
...Database Job Version..........: 12.01.00.02.00 
...Internal Dump File Version....: 4.2 
...Creation Date.................: Tue Jun 05 16:23:31 2018 
...File Number (in dump file set): 1 
...Master Present in dump file...: 1 (Yes) 
...Master in how many dump files.: 1 
...Master Piece Number in file...: 1 
...Operating System of source db.: x86_64/Linux 2.4.xx 
...Instance Name of source db....: myhost:MYINST 
...Characterset ID of source db..: 178 (WE8MSWIN1252) 
...Language Name of characterset.: WE8MSWIN1252 
...Job Name......................: "MYUSER"."SYS_EXPORT_FULL_01" 
...GUID (unique job identifier)..: 6DE6CAA3E6377A30E053A505120A8FB1 
...Block size dump file (bytes)..: 4096 
...Metadata Compressed...........: 1 (Yes) 
...Data Compressed...............: 0 (No) 
...Compression Algorithm.........: 3 (Basic) 
...Metadata Encrypted............: 0 (No) 
...Table Data Encrypted..........: 0 (No) 
...Column Data Encrypted.........: 0 (No) 
...Encryption Mode...............: 2 (None) 
...Internal Flag Values..........: 514 
...Max Items Code (Info Items)...: 23 
---------------------------------------------------------------------------- 
 
PL/SQL procedure successfully completed.

When trying to import dump in lower version database, this error is raised:

1
2
3
ORA-39001: invalid argument value 
ORA-39000: bad dump file specification 
ORA-39142: incompatible version number 4.2 in dump file "test.dmp"

Only workaround is to use option “VERSION=12.1” when performing the datapump export.

Update 14.06.2018: Oracle Support has published a MOS Note for this issue today: 2409523.1

Update 16.07.2018: A month after we had experienced the issue, oracle has published an “Alert” in MOS this MOS Note: Alert – Regression in DataPump After Applying 12.1.0.2.180417DBBP or 12.1.0.2.180717DBBP (Doc ID 2422236.1)
In addition, a patch is now available: 21480031

Beware of loopback MTU size with RAC on Oracle Linux 7

I recently had to troubleshoot a hung instance in a 2 node RAC system. 4 months earlier, the system was reinstalled in a rolling fashion due to the requirement of Linux Upgrade from Oracle Linux 5 to Oracle Linux 7. This was required because of lack of certification for a storage migration to an AllFlash Storage. The system has been stable when running with Oracle Linux 5 for several years. Around 4 months after the reinstallation, one node got hung with and traces showed these error messages:

1
2
3
4
5
6
7
8
9
Mon Feb 26 08:53:37 2018
skgxpvfynet: mtype: 61 process 15801 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /u01/app/oracle/diag_p/diag/rdbms/prod/PROD2/trace/PROD2_ora_15801.trc (incident=480004):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /u01/app/oracle/diag_p/diag/rdbms/prod/PROD2/incident/incdir_480004/PROD2_ora_15801_i480004.trc

This was strange because OS /proc/meminfo was showing huge amounts of free memory for this physical host with 512GB of RAM.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
[root@node17 ~]# cat /proc/meminfo 
zzz ***Mon Feb 26 08:53:09 CET 2018
MemTotal:       528028424 kB
MemFree:        14593828 kB
MemAvailable:   78305772 kB
Buffers:        28009752 kB
Cached:         46896496 kB
SwapCached:            0 kB
Active:         22627436 kB
Inactive:       66945168 kB
Active(anon):   14315300 kB
Inactive(anon):  2105748 kB
Active(file):    8312136 kB
Inactive(file): 64839420 kB
Unevictable:      363996 kB
Mlocked:          364020 kB
SwapTotal:      33554428 kB
SwapFree:       33554428 kB
Dirty:               404 kB
Writeback:             0 kB
AnonPages:      15768480 kB
Mapped:           709896 kB
Shmem:            945384 kB
Slab:            2462644 kB
SReclaimable:    2092232 kB
SUnreclaim:       370412 kB
KernelStack:       28336 kB
PageTables:       395568 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    87853440 kB
Committed_AS:   22087384 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     1106260 kB
VmallocChunk:   34085810172 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
CmaTotal:          16384 kB
CmaFree:               0 kB
HugePages_Total:   204800
HugePages_Free:    20214
HugePages_Rsvd:     6266
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:    36827756 kB
DirectMap2M:    78444544 kB
DirectMap1G:    423624704 kB

Oracle Support then referred to this MOS Note:

ORA-27301: OS Failure Message: No Buffer Space Available / ORA-27302: failure occurred at: sskgxpsnd2 ( Doc ID 2322410.1 )

It turned out that on systems with a lot of physical memory and on Oracle Linux 7, the MTU size of loopback adapter lo0 has to be reduced from the default value of 64k to 16436 bytes to avoid memory fragmentation in RAC. The note also mentioned that the parameter vm.min_free_kbytes should be set to physmem * 0,004 * (here 2). This requirement was already considered during installation of the system, so the only thing missing was the MTU size.

I was very surprised that neither Cluster Verification Utility (CVU), nor orachk in most recent version did catch this problem at the point of installation. In my opinion, if default value of MTU size of loopback interface on Oracle Linux 7 has the potential to cause an outage, then this must be pre-checked by CVU at installation time or at least integrated into orachk. Unfortunately, this is not the case. It seems that in July we will know if the on-prem release 18.3.0 eventually will catch and enforce this configuration requirement during installation time.

Patch Recommendations for Oracle Database 12cR1 and Cloud Control 13cR2

During my consulting engagements I see a lot of systems and many bugs. Most of the time, there is already a patch available to avoid the bug. I have collected all the recommended patches for the Oracle Database 12.1.0.2 (SE2 and EE) and Oracle Enterprise Manager 13cR2. This should help to avoid most critical known issues. Versions 12.2.0.1 and 18c will be added later this year.

Patch Recommendations

Update: 18.05.2018: I have added a list of recommended patches for release 12.2.0.1.
Update: 17.07.2018: Updated EM 13.2 Patches

BAAS: Battle Against Archiver Stuck – ALTERNATE Archive Location

If you are managing volatile systems, you might have encountered Archiver-Stuck in the past.

1
2
0RA-00257:archiver error, connect internal only until freed
ORA-16014:log 2 sequence# 1789 not archived, no available destinations

If this is the case, you might be interested in a rarely used feature called “ALTERNATE Archive Locations”. This is a second independent archive log destination, that is only used in case the primary destination is failing (full). In that case, the dest_2 is automatically enabled, preventing stuck database. As soon as the archivelog backup has cleaned up the full DEST_1 destination, there is an AUTOMATIC FALLBACK to the DEST_1 destination.

1
2
3
4
ALTER SYSTEM SET LOG_ARCHIVE_DEST_1='LOCATION=/disk1 MANDATORY MAX_FAILURE=1 ALTERNATE=LOG_ARCHIVE_DEST_2';
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_1=ENABLE;
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2=ALTERNATE;
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='LOCATION=/disk2 MANDATORY ALTERNATE=LOG_ARCHIVE_DEST_1';

Testing:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
Thread 1 advanced to log sequence 174 (LGWR switch)
  Current log# 3 seq# 174 mem# 0: /u01/app/oracle/oradata/CDB122/redo03.log
2018-04-27T17:18:47.685736+02:00
ARC3: Encountered disk I/O error 19502
2018-04-27T17:18:47.685800+02:00
ARC3: Closing local archive destination LOG_ARCHIVE_DEST_1 '/disk1/1_173_971869447.dbf' (error 19502) (CDB122)
2018-04-27T17:18:47.687074+02:00
Errors in file /u01/app/oracle/diag/rdbms/cdb122/CDB122/trace/CDB122_arc3_31260.trc:
ORA-27072: File I/O error
Additional information: 4
Additional information: 65536
Additional information: 315392
ORA-19502: write error on file "/disk1/1_173_971869447.dbf", block number 65536 (block size=512)
2018-04-27T17:18:47.816938+02:00
Errors in file /u01/app/oracle/diag/rdbms/cdb122/CDB122/trace/CDB122_arc3_31260.trc:
ORA-19502: write error on file "/disk1/1_173_971869447.dbf", block number 65536 (block size=512)
ORA-27072: File I/O error
Additional information: 4
Additional information: 65536
Additional information: 315392
ORA-19502: write error on file "/disk1/1_173_971869447.dbf", block number 65536 (block size=512)
ARC3: I/O error 19502 archiving log 2 to '/disk1/1_173_971869447.dbf'

After the error, archivelogs are placed in /dir2 directory and as soon as space is available again, archive destination is switched back to /dir1 automatically.

Archiver Stuck never again, right?

Modifications to hidden parameters between 12.1.0.2 and 12.2.0.1

I recently troubleshooted an issue with RMAN catalog database using seperate schemas for each target database and more than 300 schemas present. Catalog Database was on 12.2.0.1 and suffering from severe mutex contention because of huge version counts for RMAN catalog SQL statements. I then realized that the default number of max child cursors was increased from 1024 to 8192 in 12.2.0.1. This made me check what other parameters were changed. In my opinion, especially interesting are parameters, that were OFF / FALSE in 12.1.0.2 and ON / TRUE in 12.2.0.1 as this probably activates new code.

This SQL Statement shows all parameters, including hidden parameters:

1
2
3
4
5
6
7
8
9
-- file pd2.sql
SET LINES 3000 pages 0 feedback off trimspool ON
SELECT
 n.ksppinm ||'|'||
  c.ksppstvl ||'|'||
  n.ksppdesc
FROM sys.x$ksppi n, sys.x$ksppcv c
WHERE n.indx=c.indx
;

Then, I spooled Pipe-seperated output to text files for both releases:

1
2
3
4
5
6
7
8
9
10
11
. ~/121.env
sqlplus / as sysdba <<EOF
spool /tmp/121.txt
@/tmp/pd2.sql
EOF
 
. ~/122.env
sqlplus / as sysdba <<EOF
spool /tmp/122.txt
@/tmp/pd2.sql
EOF

Finally, I set up External Tables for accessing these files with SQL:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
sqlplus / as sysdba
 
drop table db122 purge;
drop table db121 purge;
 
CREATE TABLE db122 (
  param      VARCHAR2(2000),
  param_val      VARCHAR2(2000),
  param_desc     VARCHAR2(2000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY tmp_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '|'
    MISSING FIELD VALUES ARE NULL
    (
      param CHAR(2000)   , 
      param_val  CHAR(2000),    
      param_desc CHAR(2000)
    )
  )
  LOCATION ('122.txt')
)
PARALLEL 1
REJECT LIMIT UNLIMITED;
 
CREATE TABLE db121 (
  param      VARCHAR2(2000),
  param_val      VARCHAR2(2000),
  param_desc     VARCHAR2(2000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY tmp_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '|'
    MISSING FIELD VALUES ARE NULL
    (
      param   CHAR(2000)  , 
      param_val  CHAR(2000),    
      param_desc CHAR(2000)
    )
  )
  LOCATION ('121.txt')
)
PARALLEL 1
REJECT LIMIT UNLIMITED;

The final query now compares hidden parameters:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
SET LINES 500 pages 2000 tab off
col param format a56
col db121_value format  a20
col db122_value format  a20
col param_desc heading DESCRIPTION format a100 word_wrap
SELECT db121.param, db121.param_val AS db121_value, db122.param_val AS db122_value, db122.param_desc FROM 
(SELECT param, param_val, param_desc FROM db121) db121,
(SELECT param, param_val, param_desc FROM db122) db122
WHERE db121.param = db122.param AND db121.param_val != db122.param_val  AND db121.param LIKE '\_%' escape '\'
and db121.param not IN (
 '__db_cache_size',
 '__pga_aggregate_target',
 '__pga_aggregate_target',
 '__shared_pool_size',
 '__streams_pool_size',
 '_diag_adr_trace_dest',
 '__large_pool_size')
order by db121.param; 
 
 
PARAM                                                    DB121_VALUE          DB122_VALUE          DESCRIPTION
-------------------------------------------------------- -------------------- -------------------- ----------------------------------------------------------------------------------------------------
_adaptive_scalable_log_writer_disable_worker_threshold   90                   50                   Percentage of overlap across multiple outstanding writes
_advanced_index_compression_options_value                20                   0                    advanced index compression options2
_aq_latency_absolute_threshold                           100                  300                  Absolute threshold greater than average latency
_aq_latency_relative_threshold                           30                   100                  Relative threshold of average latency
_aq_lb_stats_collect_cycle                               120                  45                   Time(seconds) between consecutive AQ load statistics collection
_aq_lookback_size                                        30                   60                   AQ PT Look Back Size
_aq_pt_shrink_frequency                                  30                   1450                 PT shrink window Size
_aq_pt_statistics_window                                 30                   60                   PT statistics sample window Size
_asm_compatibility                                       10.1                 11.2.0.2             default ASM compatibility level
_asm_max_cod_strides                                     5                    10                   maximum number of COD strides
_asm_scrub_unmatched_dba                                 3                    1024                 Scrub maximum number of blocks with unmatched DBA
_asm_xrov_nvios                                          4                    8                    Specify number of VIO processes
_asm_xrov_single                                         TRUE                 FALSE                Enable single issues of IOs
_bigdata_external_table                                  FALSE                TRUE                 enables use of ORACLE_HIVE and ORACLE_HDFS access drivers
_cache_fusion_pipelined_updates                          FALSE                TRUE                 Block ping without wait for log force
_catalog_foreign_restore                                 TRUE                 FALSE                catalog foreign file restore
_column_tracking_level                                   1                    21                   column usage tracking
_compression_compatibility                               12.1.0.2.0           12.2.0               Compression compatibility
_controlfile_enqueue_holding_time_tracking_size          10                   0                    control file enqueue holding time tracking size
_cursor_db_buffers_pinned                                1387                 1375                 additional number of buffers a cursor can pin at once
_cursor_obsolete_threshold                               1024                 8192                 Number of cursors per parent before obsoletion.
_db_block_buffers                                        444542               440775               Number of database blocks cached in memory: hidden parameter
_db_block_cache_protect                                  FALSE                false                protect database blocks (for debugging only)
_dbg_scan                                                0                    2048                 generic scan debug
_dbrm_dynamic_threshold                                  33490396             989922280            DBRM dynamic threshold setting
_dbrm_runchk                                             0                    32769000             Resource Manager Diagnostic Running Thread Check
_dbwr_stall_write_detection_interval                     0                    900                  dbwriter stall write detection interval
_enable_pdb_close_abort                                  FALSE                TRUE                 Enable PDB shutdown abort (close abort)
_gc_async_send                                           FALSE                TRUE                 if TRUE, send blocks asynchronously
_gc_element_percent                                      120                  105                  global cache element percent
_gc_latches                                              8                    32                   number of latches per LMS process
_gc_max_downcvt                                          256                  1024                 maximum downconverts to process at one time
_gc_msgq_buffers                                         128                  0                    set number of MSGQ buffers
_gc_policy_time                                          10                   20                   how often to make object policy decisions in minutes
_gc_trace_freelist_empty                                 TRUE                 FALSE                if TRUE, dump a trace when we run out of lock elements
_gc_transfer_ratio                                       2                    75                   dynamic object read-mostly transfer ratio
_gcs_latches                                             0                    128                  number of gcs resource hash latches to be allocated per LMS process
_hang_base_file_space_limit                              20000000             100000000            File space limit for current normal base trace file
_hang_int_spare2                                         FALSE                0                    Hang Management int 2
_hang_lws_file_space_limit                               20000000             100000000            File space limit for current long waiting session trace file
_hang_resolution_scope                                   PROCESS              OFF                  Hang Management hang resolution scope
_hang_terminate_session_replay_enabled                   FALSE                TRUE                 Hang Management terminates sessions allowing replay
_ilmstat_memlimit                                        10                   5                    Percentage of shared pool for use by ILM Statistics
_inmemory_fs_blk_inv_blk_percent                         20                   50                   in-memory faststart CU invalidation threshold(blocks)
_inmemory_fs_enable                                      FALSE                TRUE                 in-memory faststart enable
_inmemory_imcu_target_rows                               1048576              0                    IMCU target number of rows
_inmemory_private_journal_maxexts                        32000                5000                 Max number of extents per PJ
_inmemory_private_journal_sharedpool_quota               64                   20                   quota for transaction in-memory objects
_inmemory_repopulate_invalidate_rate_percent             100                  0                    In-memory repopulate invalidate rate percent
_inmemory_repopulate_threshold_scans                     2048                 0                    In-memory repopulate threshold number of scans
_inmemory_strdlxid_timeout                               327680               0                    max time to determine straddling transactions
_ipddb_enable                                            TRUE                 FALSE                Enable IPD/DB data collection
_kebm_suspension_time                                    82800                104400               kebm auto suspension time in seconds
_kgh_restricted_subheaps                                 60                   180                  number of subheaps in heap restricted mode
_kgl_message_locks                                       64                   256                  RAC message lock count
_kqr_optimistic_reads                                    FALSE                TRUE                 optimistic reading of row cache objects
_kse_snap_ring_suppress                                  608 1403 6510        942 1403             List of error numbers to suppress in the snap error history
_ksi_clientlocks_enabled                                 TRUE                 FALSE                if TRUE, DLM-clients can provide the lock memory
_ksxp_ipclw_enabled                                      0                    1                    enable ipclw for KSXP
_ktb_debug_flags                                         0                    8                    ktb-layer debug flags
_lm_comm_channel                                         ksxp                 msgq                 GES communication channel type
_lm_comm_tkts_calc_period_length                         5                    1000                 Weighted average calculation interval length (us)
_lm_drm_min_interval                                     300                  600                  minimum interval in secs between two consecutive drms
_lm_drmopt12                                             26                   56                   enable drm scan optimizations in 12
_lm_enable_aff_benefit_stats                             TRUE                 FALSE                enables affinity benefit computations if TRUE
_lm_procs                                                320                  1088                 number of client processes configured for cluster database
_lm_xids                                                 352                  1196                 number of transaction IDs configured for cluster database
_lock_next_constraint_count                              120                  3                    max number of attempts to lock _NEXT_CONSTRAINT
_lthread_cleanup_intv_secs                               5                    1                    interval for cleaning lightweight threads in secs
_max_fsu_segments                                        4096                 1024                 Maximum segments to track for fast space usage
_max_services                                            1024                 8200                 maximum number of database services
_multi_instance_pmr                                      FALSE                TRUE                 enable multi instance redo apply
_multiple_char_set_cdb                                   FALSE                TRUE                 Multiple character sets enabled in CDB
_number_cached_group_memberships                         32                   2048                 maximum number of cached group memberships
_number_group_memberships_per_cache_line                 3                    6                    maximum number of group memberships per cache line
_optimizer_ads_use_partial_results                       FALSE                TRUE                 Use partial results of ADS queries
_optimizer_dsdir_usage_control                           126                  0                    controls optimizer usage of dynamic sampling directives
_optimizer_undo_cost_change                              12.1.0.2             12.2.0.1             optimizer undo cost change
_parallel_server_sleep_time                              10                   1                    sleep time between dequeue timeouts (in 1/100ths)
_pmon_dead_blkrs_max_blkrs                               200                  50                   max blockers to check during cleanup
_posix_spawn_enabled                                     FALSE                TRUE                 posix_spawn enabled
_rcfg_parallel_verify                                    TRUE                 FALSE                if TRUE enables parallel verify at reconfiguration
_reliable_block_sends                                    TRUE                 FALSE                if TRUE, no side channel on reliable interconnect
_rowsets_max_rows                                        200                  256                  maximum number of rows in a rowset
_side_channel_batch_size                                 200                  240                  number of messages to batch in a side channel message (DFS)
_skip_trstamp_check                                      FALSE                TRUE                 Skip terminal recovery stamp check
_small_table_threshold                                   8890                 8815                 lower threshold level of table size for direct reads
_sql_plan_directive_mgmt_control                         3                    67                   controls internal SQL Plan Directive management activities
_step_down_limit_in_pct                                  1                    20                   step down limit in percentage
_target_log_write_size_timeout                           1                    0                    How long LGWR will wait for redo to accumulate (msecs)
 
90 rows selected.

Blocking Session not visible in v$ Views

I had to troubleshoot an issue on a company ERP systems database, where several sessions were waiting for ‘enq: TX row lock contention’. It was a table containing print server nodes and this caused no more printouts. Usually troubleshooting locking issues is quite straightforward in recent versions of oracle database, but this time, the column “v$session.blocking_session” was not populated and there was no entry in v$lock for the blocker. This was strange, so I involved Oracle Support. They requested a system state dump.

SQL> oradebug setmypid
SQL> oradebug unlimit;
SQL> select * from dual;
SQL> oradebug dump systemstate 258
SQL> oradebug dump systemstate 258

In the system state dump, there was a reference for the hanging session:

Waiter:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
  ----------------------------------------
  SO: 0x000000075C68F8B8, type: 2, owner: 0x0000000000000000, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x000000075C68F8B8, name=process, file=ksu.h LINE:12740, pg=0
  (process) Oracle pid:161, ser:247, calls cur/top: 0x000000075BC6A370/0x000000075BC6A370
            flags : (0x0) -
            flags2: (0x0),  flags3: (0x10) 
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
    (latch info) wait_event=0 bits=0x0
    Process Group: DEFAULT, pseudo proc: 0x000000076068CE68
    O/S info: user: SYSTEM, term: HOSTNAME, ospid: 4600 
    OSD pid info: Windows thread id: 4600, image: ORACLE.EXE (SHAD)
    Short stack dump: 
ksedsts()+585<-ksdxfstk()+44<-ksdxcb()+2402<-ssthreadsrgruncallback()+629<-OracleOradebugThreadStart()+829
<-0000000076D559CD<-0000000076E8A561<-0000000076EABD7A<-000007FEFCE910AC<-skgpwwait()+182<-ksliwat()+2459
<-kslwaitctx()+184<-ksqcmi()+7848<-ksqgtlctx()+6814<-ksqgelctx()+829<-ktuGetTxForXid()+190<-ktcwit1()+526
<-kdddgb()+9636<-kddlkr()+284<-qerfuProcessFetchedRow()+1019<-__PGOSF308_qerfuLockingRowProcedure()+89
<-qertbFetchByRowID()+2708<-qerfuStart()+590<-selexe0()+3495<-opiexe()+22377<-kpoal8()+2624<-opiodr()+1646
<-ttcpip()+1481<-opitsk()+2166<-opiino()+1246<-opiodr()+1646<-opidrv()+862<-sou2o()+98<-opimai_real()+158
<-opimai()+191<-OracleThreadStart()+724<-0000000076D559CD<-0000000076E8A561
01-04-2018 21:16:06.636
    ----------------------------------------
    SO: 0x00000007606D9B88, type: 4, owner: 0x000000075C68F8B8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x000000075C68F8B8, name=session, file=ksu.h LINE:12748, pg=0
    (session) sid: 68 ser: 20263 trans: 0x0000000755DB9838, creator: 0x000000075C68F8B8
              flags: (0x41) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x40009) -/-/INC
              DID: , short-term DID: 
              txn branch: 0x0000000000000000
              edition#: 100              oct: 3, prv: 0, sql: 0x00000007708AA2E8, psql: 0x000000076BDCDED8, user: 0/SYS
    ksuxds FALSE at location: 0
    service name: SYS$USERS
    client details:
      O/S info: user: username, term: hostname, ospid: 5968:4960
      machine: hostname program: sqlplus.exe
      application name: sqlplus.exe, hash value=254292535
    Current Wait Stack:
     0: waiting for 'enq: TX - row lock contention'
        name|mode=0x54580006, usn<<16 | slot=0x130009, sequence=0xb394f
        wait_id=10 seq_num=11 snap_id=1
        wait times: snap=8 min 15 sec, exc=8 min 15 sec, total=8 min 15 sec
        wait times: max=infinite, heur=8 min 15 sec
        wait counts: calls=166 os=166
        in_wait=1 iflags=0x15a0
    Wait State:
      fixed_waits=0 flags=0x22 boundary=0x0000000000000000/-1
...
      ----------------------------------------
      SO: 0x0000000778A7FCA8, type: 8, owner: 0x000000075BC6A370, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
       proc=0x000000075C68F8B8, name=enqueue, file=ksq1.h LINE:380, pg=0
      (enqueue) TX-00130009-000B394F	DID: 0001-00A1-0003F59C
      lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x6
      req: X, lock_flag: 0x10, lock: 0x0000000778A7FD00, res: 0x0000000760A2BD80
      own: 0x00000007606D9B88, sess: 0x00000007606D9B88, proc: 0x000000075C68F8B8, prv: 0x0000000760A2BDA0
    ----------------------------------------

In line 60 you can see that the session is requesting an eXclusive lock for resource res: 0x0000000760A2BD80. When searching for the next appearance of res: 0x0000000760A2BD80, we find a reference to the exclusive enqueue after the last process in a section PSEUDO PROCESS / DETACHED BRANCHES.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
PSEUDO PROCESS for group DEFAULT:
  ----------------------------------------
  SO: 0x000000076068CE68, type: 2, owner: 0x0000000000000000, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x000000076068CE68, name=process, file=ksu.h LINE:12740, pg=0
  (process) Oracle pid:0, ser:0, calls cur/top: 0x0000000000000000/0x0000000000000000
            flags : (0x20) PSEUDO
            flags2: (0x0),  flags3: (0x0) 
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
    (latch info) wait_event=0 bits=0x0
    Process Group: DEFAULT, pseudo proc: 0x000000076068CE68
    O/S info: user: , term: , ospid:  (DEAD)
    OSD pid info: Windows thread id: 0, image: PSEUDO
  PSO child state object changes :
Dump of memory from 0x00000007605E5698 to 0x00000007605E58A0
7605E5690                   00000000 00000000          [........]
7605E56A0 00000000 00000000 00000000 00000000  [................]
  Repeat 31 times
BEGIN DUMP RESERVED PROCESSES
END DUMP RESERVED PROCESSES
DETACHED BRANCHES:
----------------------------------------
SO: 0x0000000755E7C0F0, type: 58, owner: 0x0000000000000000, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x0000000000000000, name=branch, file=ktccts.h LINE:420, pg=0
(branch) trn = 0x0000000000000000, flg = 0x40, state = 0x00 bno=0 ser=0 evt=0
..
  SO: 0x0000000755E2B378, type: 58, owner: 0x0000000755E7C0F0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x000000077860EB80, name=branch, file=ktccts.h LINE:420, pg=0
  (branch) trn = 0x0000000755E01068, flg = 0x41090, state = 0x00 bno=1 ser=10 evt=10
   creator = 0x00000007607940E8 uid = 0x42 serial# = 7929 ttl = 0x15180 dtm = 0x5a4db9f5
   2PCrole = (nch = 0x0 flg = 0x0) DTP svc = 0x773ffc978
...
    SO: 0x0000000755E01068, type: 56, owner: 0x0000000755E2B378, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x000000077860EB80, name=transaction, file=ktccts.h LINE:410, pg=0
    (trans) flg = 0x00401e03, flg2 = 0x04080000, flg3 = 0x00000000, prx = 0x0000000000000000, ros = 2147483647, crtses=0x00000007607940E8
...
    (enqueue) TX-00130009-000B394F	DID: 0001-00E6-000010CD
    lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x6
    mode: X, lock_flag: 0x0, lock: 0x0000000755E010E0, res: 0x0000000760A2BD80
    own: 0x0000000755E01068 (TXN), sess: 0x0000000000000000, prv: 0x0000000760A2BD90

The SEV1 support engineer could not be convinced that the entry DETACHED BRANCHES has nothing to do with the last session listed in the system state dump. I delivered multiple system state dumps and he insisted that the lock was always held by the very last session of the system state dump – what a coincidence. He just scrolled up from the line 46 until the first occurence of “(session)” and was convinced to have found the blocker, not realizing that he overjumped the sections “DETACHED BRANCHES”/”PSEUDO PROCESS for group DEFAULT”.

After some research I found this note:

Solving locking problems in a XA environment ( Doc ID 1248848.1 )

This note explains that with XA (distributed transactions), e.g. DB Links, there can be situations where blocker is not visible. Unfortunately, the symptoms disappeared from the production system before the diagnostic steps from the MOS Note could be tried. Frequently, when there is an issue that remains unsolved, it comes back at a later point and so it was in this case. This time, I was armed with knowledge of the MOS note 1248848.1 and could use the XA rollback script provided there to free up the “session-less” blocking lock.

No entry in DBA_2PC_PENDING but the Query from MOS Note shows Active XA Trans:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
select --+ ORDERED
'----------------------------------------'||'
Curent Time : '|| substr(to_char(sysdate,' HH24.MI.SS'),1,22) ||'
'||'TX start_time: '||t.KTCXBSTM||'
'||'FORMATID: '||g.K2GTIFMT ||'
'||'GTXID: '||g.K2GTITID_EXT ||'
'||'Branch: '||g.K2GTIBID ||'
Local_Tran_Id ='||substr(t.KXIDUSN||'.'||t.kXIDSLT||'.'||t.kXIDSQN,1,15)||'
'||'KTUXESTA='|| x.KTUXESTA ||'
'||'KTUXEDFL='|| x.KTUXECFL ||'
Lock_Info: ID1: ' || ((t.kXIDUSN*64*1024)+ t.kXIDSLT)
||' ID2: '|| t.kXIDSQN
XA_transaction_INFO
from x$k2gte g, x$ktcxb t, x$ktuxe x
where g.K2GTDXCB =t.ktcxbxba and
x.KTUXEUSN = t.KXIDUSN(+) and
x.KTUXESLT = t.kXIDSLT(+) and
x.KTUXESQN =t.kXIDSQN(+);
 
"----------------------------------------
Curent Time : 11.35.27
TX start_time: 02/05/18 15:35:17
FORMATID: 131075
GTXID: 312D613331303063613A653165623A35613732336438373A31326335323962
Branch: 613331303063613A653165623A35613732336438373A31326335326166
Local_Tran_Id =25.18.61640
KTUXESTA=ACTIVE
KTUXEDFL=NONE
Lock_Info: ID1: 1638418 ID2: 61640"
"----------------------------------------
Curent Time : 11.35.27
TX start_time: 02/05/18 15:35:17
FORMATID: 131075
GTXID: 312D613331303063613A653165623A35613732336438373A31326335326135
Branch: 613331303063613A653165623A35613732336438373A31326335326165
Local_Tran_Id =16.26.2292874
KTUXESTA=ACTIVE
KTUXEDFL=NONE
Lock_Info: ID1: 1048602 ID2: 2292874"
"----------------------------------------
Curent Time : 11.35.27
TX start_time: 02/05/18 15:35:17
FORMATID: 131075
GTXID: 312D613331303063613A653165623A35613732336438373A31326335326161
Branch: 613331303063613A653165623A35613732336438373A31326335326232
Local_Tran_Id =3.30.3767984
KTUXESTA=ACTIVE
KTUXEDFL=NONE
Lock_Info: ID1: 196638 ID2: 3767984"

This output proved that it was CASE 2: No Entry in dba_2pc_pending und Status ACTIVE. Then I could successfully avoid an instance restart and rollback the global transaction to free up the locks.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
set JAVA_HOME=c:\oracle\product\11.2.0.4\dbhome_1\jdk\jre
set CLASSPATH=c:\dba\scripts\;c:\oracle\product\11.2.0.4\dbhome_1\jdbc\lib\ojdbc5.jar:%CLASSPATH%
C:\DBA\scripts>c:\oracle\product\11.2.0.4\dbhome_1\jdk\jre\bin\java XA_rb jdbc:oracle:thin:@hostname:1521:PROD -ROLLBACK 25 18 61640
++ URL: jdbc:oracle:thin:@hostname:1521:PROD
++ Rollback of Local_tran_ID : 25.18.61640
++ got XA resource handle
++ got Connection handle
++ SQL : select g.K2GTIFMT, g.K2GTITID_EXT, g.K2GTIBID, rawtohex( g.K2GTITID_EXT), rawtohex(g.K2GTIBID) from sys.vw_x$k2gte g, sys.vw_x$ktcxb t, sys.vw_x$ktuxe x where g.K2GTDXCB =t.KTCXBXBA and x.KTUXEUSN
SLT(+) and x.KTUXESQN =t.KXIDSQN(+) and t.KXIDUSN=25 and t.kXIDSLT=18 and t.kXIDSQN= 61640
++ Getting XID for Local_Tran_ID: 25.18.61640
+ Found global XID: -->
++ Format Id: 131075
++ Group Id: 312D613331303063613A653165623A35613732336438373A31326335323962
++ Branch Id: 613331303063613A653165623A35613732336438373A31326335326166
 
-> Would your really like to continue (yes or no):
yes
... doing Rollback
 
... Rollback done !

With Redo Log Mining for the transaction ID, we could get some minor details about the session which initially opened this global transaction.

So whenever you find the strange situation that sessions are blocked on ‘enq: TX – row lock contention’ but there is no blocker in v$session blocking_session column or no entry of a session holding the lock in v$lock, then this MOS Note might be helpful in cleaning up the lock. Sadly, another case where even SEV1 support was no help at all. I have to mention however, that i recently had a SEV1 issue with an excellent and ambitious engineer. So there is still hope…

Super-Sizing PGA Workareas in Oracle 12c

After a customer asked me for possibilities of super-sizing PGA workareas in version 12c, I took the chance to revisit the topic and perform some tests. Great material has already been posted by late Alex Fatkulin (Hotsos Symposium 2014) and Norbert Debes (Secrets of the Oracle Database, Apress), but I wanted to verify, how the current release is behaving and I found some interesting things.

Super-Sizing PGA Workareas in 12c

Cloud Control – Privilege Delegation – so you don´t have the oracle / root password?

Quite frequently in database environments, security policies dictate that only personalized logons to Unix / Linux are allowed and that from there, one has to “sudo” to change to the oracle account. While this adds an additional layer of security, it makes administration a little more complicated.

Oracle Enterprise Manager – Cloud Control has a feature, which allows to cope with such a sudo environment. The feature is called “Privilege Delegation”. This post describes how to set it up and for what it can be used.

  1. Setup of “sudo” by the root account
  2. In order to use privilege delegation, specific sudo rules have to be defined. This rule normally already exists:

    1
    
    mdecker ALL=(root) NOPASSWD:/bin/su - oracle

    In addition to this one, two additional rules are required. If only sudo to “oracle” is required, then only the first line is needed.

    1
    2
    
    mdecker ALL=(oracle) SETENV:/u01/app/oracle/cloud/agent/sbin/nmosudo *
    mdecker ALL=(root) SETENV:/u01/app/oracle/cloud/agent/sbin/nmosudo *
  3. Setup of Privilege Delegation in Cloud Control
  4. Go to Setup -> Security -> Privilege Delegation.

    Then you can either set it globally via template or individually for each host. This depends mainly on the path to “sudo” binary in the operating system. Then you choose “sudo” and provide the path to the sudo binary on your operating system (bash$ which sudo). The required parameters are then appended: /usr/bin/sudo -E -u %RUNAS% %COMMAND%

  5. The next step is to configure a Named Credential.
  6. If you are in a team of administrators, then each administrator should have his own account to log on to Cloud Contol and avoid using “sysman” user. For obvious reasons, each administrator has to create his own “named credential” (Setup->Security->Named Credential), because it contains his personalized username and password.

    Here you provide your personalized credentials (username/password) and specify that sudo should be used to change to “oracle”.

  7. Lastly, verify that it works as desired.
  8. Go to “Targets” -> “Host” and click “Run Host Command”. Then give the command to run, e.g. “id -a”, and then add a named credential as well as a specific host and click “Run”.

    If all is well, then you will get this output showing you the id of user oracle.

How does this work behind the scenes? The agent java process spawns a process “nmo”. This process was granted SETUID root Privileges by executing $ORACLE_HOME/root.sh at the time of agent deployment. This executable is calling “sudo” to run command “nmosudo” as user oracle with the “payload” command, which the user wanted to execute.

1
2
3
root   14595  3149  0 14:23 pts/0    00:00:00 /u01/app/oracle/cloud/agent/sbin/nmo
root   14598 14595  0 14:23 pts/1    00:00:00   /usr/bin/sudo -p ###AGENT-PDP-PASSWORD-PROMPT### -E -u oracle /u01/app/oracle/cloud/agent/sbin/nmosudo DEFAULT_PLUGIN DEFAULT_FUNCTIONALITY DEFAULT_SUBACTION DEFAULT_ACTION /bin/sh -c id -a
oracle 14602 14598  0 14:23 pts/1    00:00:00     sleep 300

The linux logfile /var/log/secure will contain this messages. It can be seen that the personal user “mdecker” was running the command “nmosudo” with the payload command “id -a” as attribute.

1
Sep 13 22:03:59 xxx sudo:  mdecker : TTY=pts/2 ; PWD=/u01/app/oracle/cloud/agent/agent_inst/sysman/emd ; USER=oracle ; COMMAND=/u01/app/oracle/cloud/agent/sbin/nmosudo DEFAULT_PLUGIN DEFAULT_FUNCTIONALITY DEFAULT_SUBACTION DEFAULT_ACTION /bin/sh -c id -a