Unix

Beware of loopback MTU size with RAC on Oracle Linux 7

I recently had to troubleshoot a hung instance in a 2-node RAC system. Four months earlier, the system had been reinstalled in a rolling fashion to upgrade from Oracle Linux 5 to Oracle Linux 7; the upgrade was required because a planned migration to an all-flash storage system was not certified for Oracle Linux 5. The system had been running stably on Oracle Linux 5 for several years. About four months after the reinstallation, one node hung and the traces showed these error messages:

Mon Feb 26 08:53:37 2018
skgxpvfynet: mtype: 61 process 15801 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /u01/app/oracle/diag_p/diag/rdbms/prod/PROD2/trace/PROD2_ora_15801.trc (incident=480004):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /u01/app/oracle/diag_p/diag/rdbms/prod/PROD2/incident/incdir_480004/PROD2_ora_15801_i480004.trc

This was strange because /proc/meminfo showed huge amounts of free memory on this physical host with 512 GB of RAM.

[root@node17 ~]# cat /proc/meminfo 
zzz ***Mon Feb 26 08:53:09 CET 2018
MemTotal:       528028424 kB
MemFree:        14593828 kB
MemAvailable:   78305772 kB
Buffers:        28009752 kB
Cached:         46896496 kB
SwapCached:            0 kB
Active:         22627436 kB
Inactive:       66945168 kB
Active(anon):   14315300 kB
Inactive(anon):  2105748 kB
Active(file):    8312136 kB
Inactive(file): 64839420 kB
Unevictable:      363996 kB
Mlocked:          364020 kB
SwapTotal:      33554428 kB
SwapFree:       33554428 kB
Dirty:               404 kB
Writeback:             0 kB
AnonPages:      15768480 kB
Mapped:           709896 kB
Shmem:            945384 kB
Slab:            2462644 kB
SReclaimable:    2092232 kB
SUnreclaim:       370412 kB
KernelStack:       28336 kB
PageTables:       395568 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    87853440 kB
Committed_AS:   22087384 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     1106260 kB
VmallocChunk:   34085810172 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
CmaTotal:          16384 kB
CmaFree:               0 kB
HugePages_Total:   204800
HugePages_Free:    20214
HugePages_Rsvd:     6266
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:    36827756 kB
DirectMap2M:    78444544 kB
DirectMap1G:    423624704 kB

Oracle Support then referred to this MOS Note:

ORA-27301: OS Failure Message: No Buffer Space Available / ORA-27302: failure occurred at: sskgxpsnd2 ( Doc ID 2322410.1 )

It turned out that on Oracle Linux 7 systems with a lot of physical memory, the MTU size of the loopback interface (lo) has to be reduced from the default value of 64k to 16436 bytes to avoid memory fragmentation in RAC. The note also mentions that the parameter vm.min_free_kbytes should be set to physmem * 0.004 (times 2 in this case). This requirement had already been taken care of during installation of the system, so the only thing missing was the MTU size.
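For reference, a minimal sketch of how the loopback MTU could be checked and adjusted on Oracle Linux 7 (the persistent ifcfg-lo setting is an assumption based on the standard RHEL/OL network-script layout; verify the exact values against the MOS note for your environment):

# Check the current MTU of the loopback interface
ip link show lo

# Change it online (takes effect immediately, but is lost at reboot)
ip link set dev lo mtu 16436

# Persist the setting via the network script (assumed standard location on OL7)
echo "MTU=16436" >> /etc/sysconfig/network-scripts/ifcfg-lo

# Also verify the vm.min_free_kbytes recommendation from the MOS note
sysctl vm.min_free_kbytes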

I was very surprised that neither the Cluster Verification Utility (CVU) nor orachk in its most recent version caught this problem at installation time. In my opinion, if the default MTU size of the loopback interface on Oracle Linux 7 can cause an outage, then this must be pre-checked by CVU at installation time, or at least integrated into orachk. Unfortunately, this is not the case. It seems we will know in July whether the on-prem release 18.3.0 will eventually catch and enforce this configuration requirement at installation time.



Oracle Linux Update from 7.2 to 7.3 crashes with uncorrectable CPU error

During the upgrade from Oracle Linux 7.2 to 7.3 with “yum update”, the system crashed while updating the microcode_ctl package, and the firmware reported an uncorrectable CPU error after the reboot.

It turned out that the update of this package in combination with specific Intel CPUs causes the issue. In the meantime, a Red Hat bug and a solution are available.
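Until the fixed package is in place, one pragmatic way to avoid the problem is to skip microcode_ctl during the update; a hedged sketch (check the referenced bug and MOS note for the proper fix for your CPU model):

# Skip the problematic package for this update run
yum update --exclude=microcode_ctl

# Or exclude it permanently (append to /etc/yum.conf; remove the line once a fixed package is available)
echo "exclude=microcode_ctl*" >> /etc/yum.conf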

 

References:

  • https://community.oracle.com/thread/3997160
  • https://bugzilla.redhat.com/show_bug.cgi?id=1398698
  • MOS: The system could not start since suddenly rebooted during upgrade to microcode_ctl package (Doc ID 2217876.1)


Problems with big SGAs (>200G) on Linux

I recently had an issue where a database with a 240 GB SGA (one huge shared memory segment), configured with hugepages on a system with 512 GB of RAM, suddenly became unstable and new logons were denied with this error message:


ORA-01034: ORACLE not available
ORA-27123: unable to attach to shared memory segment
Linux-x86_64 Error: 22: Invalid argument
Additional information: 2667
Additional information: 1736718
Additional information: 215016800256

This was strange because ipcs -a showed all shared memory segments and all processes existed. It turned out that there is a known issue with the Linux “prelink” feature and very big SGAs (>200 GB). This MOS note gives some details:

Connected to an idle instance, while database instance is running (Process J000 died, see its trace file kkjcre1p: unable to spawn jobq slave process ) (Doc ID 1578491.1)

Evidently, Exadata customers also suffered from this issue, which is why prelink is disabled beginning with Exadata version 11.2.3.3. Red Hat reports the same issue in combination with Informix databases in knowledge base article https://access.redhat.com/solutions/1186953
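If you hit this, disabling prelink and undoing the prelinking that has already been applied is the usual remedy; a minimal sketch (paths per standard RHEL/OL conventions, not taken from the MOS note):

# Turn off prelinking going forward
sed -i 's/^PRELINKING=yes/PRELINKING=no/' /etc/sysconfig/prelink

# Undo prelinking already applied to binaries and libraries
prelink -ua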



Bug 13397104: INSTANCE CRASH WITH ORA-600 [KJBLPKEYDRMQSCCHK:PKEY]

I would like to draw your attention to a bug in RAC 11.2.0.3. The bug leads to instance crashes during dynamic remastering. As of June 1st, there is no patch publicly available. The workaround seems to be to set:

_gc_read_mostly_locking=false
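A minimal sketch of how such an underscore parameter could be set (not from the bug note itself; underscore parameters should only be set on advice of Oracle Support, and a rolling restart of the instances is needed afterwards):

# Set the workaround parameter in the spfile of all instances
sqlplus -s / as sysdba <<'EOF'
alter system set "_gc_read_mostly_locking"=false scope=spfile sid='*';
exit
EOF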

So, if you are thinking about upgrading your RAC to 11.2.0.3, maybe you should wait just a little bit longer… 😉

More details are in MOS: Bug 13397104: INSTANCE CRASH WITH ORA-600 [KJBLPKEYDRMQSCCHK:PKEY]

Update 12.06.2012: Patch 13397104 is now available in MOS for 11.2.0.3.



Oracle certifies OEL6/RHEL6 for Oracle Database 11gR2 and Fusion Middleware

Oracle yesterday announced the certification of RHEL6 / OEL6 with Oracle Database 11gR2 and Oracle Fusion Middleware 11.1.

Moreover, Oracle will from now on provide errata packages for free. Until now, they could only be obtained with a valid ULN support contract. I think this will be a big plus for Oracle Linux in competition with Red Hat Enterprise Linux.

I wonder how long it will take before they ship Exadata and Oracle Database Appliance (ODA) with Unbreakable Kernel.

Reference: http://www.oracle.com/us/corporate/press/1563775



ACFS Filesystem Monitoring and Group Ownership

When you create an ACFS filesystem in Grid Infrastructure 11.2.0.1 or 11.2.0.2, the group ownership of the filesystem root directory is set to the ASM admin group (SS_ASM_GRP), e.g. asmadm.

[grid@md1 ~]$ cd /opt/oracle/gridbase/acfsmounts
[grid@md1 acfsmounts]$ ls -al
total 12
drwxr-xr-x  3 grid oinstall 4096 Jan 10 09:44 .
drwxr-xr-x 10 grid oinstall 4096 Jan 10 09:43 ..
drwxrwx---  4 root   asmadm 4096 Jan 10 09:44 data_testvol
 
SQL> select * from v$asm_filesystem where fs_name = '/opt/oracle/gridbase/acfsmounts/data_testvol'
 
FS_NAME                                        AVAILABLE BLOCK_SIZE STATE         CORRU    NUM_VOL TOTAL_SIZE TOTAL_FREE TOTAL_SNAP_SPACE_USAGE
---------------------------------------------- --------- ---------- ------------- ----- ---------- ---------- ---------- ----------------------
/opt/oracle/gridbase/acfsmounts/data_testvol   10-JAN-11          4 AVAILABLE     FALSE          1        256 119.769531                      0

If – for whatever reason – you change the group ownership from asmadm to a different group, ASM cannot populate the views v$asm_filesystem and v$asm_acfsvolumes, which in turn means that you cannot monitor the filesystem with Oracle Enterprise Manager Grid Control, because it uses these two views for monitoring.

[root@md1 data_testvol]# chgrp myapp .
[root@md1 data_testvol]# ls -la
total 80
drwxrwx--- 4 root   myapp     4096 Jan 10 09:45 .
drwxr-xr-x 3 grid   oinstall  4096 Jan 10 09:44 ..
drwxr-xr-x 5 root   root      4096 Jan 10 09:44 .ACFS
-rw-r--r-- 1 root   asmadm     610 Jan 10 09:45 .fslimit
drwx------ 2 root   root     65536 Jan 10 09:44 lost+found
 
 
SQL> select * from v$asm_filesystem where fs_name = '/opt/oracle/gridbase/acfsmounts/data_testvol'
  2  ;
 
no rows selected

From my point of view, this is a severe limitation. ACFS filesystems should, like any other filesystem, allow arbitrary user/group ownership and still be monitorable. However, I could not convince my Oracle Support engineer to see it the same way…
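Until that changes, the only way I know to keep the views populated is to leave the root directory group ownership at the ASM admin group; a minimal sketch of reverting the change from above (group name asmadm as in this installation):

# Restore the expected group ownership on the ACFS mount point so ASM can
# populate v$asm_filesystem / v$asm_acfsvolumes again
chgrp asmadm /opt/oracle/gridbase/acfsmounts/data_testvol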



Grid Control 11g: Agent Metric Swap Utilization on HP-UX with Pseudo-Swap

If you are running Grid Control on HP-UX and you are using pseudo-swap, then you have to add the property:

NMUPM_USE_PSEUDO_MEM=TRUE

to emd.properties. Otherwise, swap utilization is calculated incorrectly (higher than actual) and might trigger false alarms.
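A minimal sketch of how to set this (the emd.properties path and the reload step reflect the standard 11g agent layout; adjust for your install):

# Append the property to the agent configuration
echo "NMUPM_USE_PSEUDO_MEM=TRUE" >> $AGENT_HOME/sysman/config/emd.properties

# Reload the agent so the new property is picked up
$AGENT_HOME/bin/emctl reload agent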

How is the Grid Control Agent calculating Swap Utilization?

In file $AGENT_HOME/sysman/admin/metadata/host.xml, at line 310, you find some answers:

<ColumnDescriptor NAME="swapUtil" TYPE="NUMBER" IS_KEY="FALSE" 
COMPUTE_EXPR="(100.0 * usedSwap / (usedSwap + freeSwap))">

In other parts, you find the definitions of “usedSwap” and “freeSwap”.

 <ColumnDescriptor NAME="usedSwap" TYPE="NUMBER" IS_KEY="FALSE" 
COMPUTE_EXPR="(usedSwapRaw / 1024.0)" HELP="NO_HELP">
 <ColumnDescriptor NAME="freeSwap" TYPE="NUMBER" IS_KEY="FALSE" 
COMPUTE_EXPR="(freeSwapRaw / 1024.0)" HELP="NO_HELP">

Now we have to find “usedSwapRaw” and “freeSwapRaw”. In a different part of this file (line 387), you see this section:

<GetView NAME="LoadInternalView" FROM_TABLE="_LoadInternal">
                        <Column NAME="cpuLoad_1min" COLUMN_NAME="cpuLoad_1min"/>
                        <Column NAME="cpuLoad" COLUMN_NAME="cpuLoad"/>
                        <Column NAME="cpuLoad_15min" COLUMN_NAME="cpuLoad_15min"/>
                        <Column NAME="pgScan" COLUMN_NAME="pgScan"/>
                        <Column NAME="noOfProcs" COLUMN_NAME="noOfProcs"/>
                        <Column NAME="transfers" COLUMN_NAME="transfers"/>
                        <Column NAME="pageSize" COLUMN_NAME="pageSize"/>
                        <Column NAME="realMem" COLUMN_NAME="realMem"/>
                        <Column NAME="freeMemRaw" COLUMN_NAME="freeMemRaw"/>
                        <Column NAME="usedSwapRaw" COLUMN_NAME="usedSwapRaw"/>
                        <Column NAME="freeSwapRaw" COLUMN_NAME="freeSwapRaw"/>
        </GetView>

In the “Metric” XML Tag for Metric NAME=”_LoadInternal”, you find the QueryDescriptor:

 <QueryDescriptor FETCHLET_ID="OSLineToken">
      <Property NAME="NMUPM_USE_PSEUDO_MEM" SCOPE="INSTANCE" OPTIONAL="TRUE">NMUPM_USE_PSEUDO_MEM</Property>
      <Property NAME="ENVNMUPM_USE_PSEUDO_MEM" SCOPE="GLOBAL">%NMUPM_USE_PSEUDO_MEM%</Property>
      <Property NAME="emdRoot" SCOPE="SYSTEMGLOBAL">emdRoot</Property>
      <Property NAME="command" SCOPE="GLOBAL"> %emdRoot%/bin/nmupm </Property>
      <Property NAME="args" SCOPE="GLOBAL">osLoad</Property>
      <Property NAME="startsWith" SCOPE="GLOBAL">em_result=</Property>
      <Property NAME="delimiter" SCOPE="GLOBAL">|</Property>
      <Property NAME="ENVNMUPM_TIMEOUT" OPTIONAL="TRUE" SCOPE="SYSTEMGLOBAL">NMUPM_TIMEOUT</Property>
    </QueryDescriptor>

So now we know that the agent executes “nmupm osLoad”. Let's try it.

$ nmupm osLoad
ncpus=8
em_result=4.10|2.43|4.46|500038168.000000|987|3|0.000000|4096.000000|40775544.000000|10464849920.000000|41094647808.000000|21731692544.000000

The result is pipe-separated and consists of the columns listed above; the last two are “usedSwapRaw” and “freeSwapRaw”.

Please note that you will get different results depending on the value of the environment variable “NMUPM_USE_PSEUDO_MEM”.

bash$ NMUPM_USE_PSEUDO_MEM=FALSE nmupm osLoad
ncpus=8
em_result=0.86|1.84|3.89|501596312.000000|986|3|0.000000|4096.000000|40775544.000000|10436485120.000000|21024821248.000000|47362048.000000
bash$ NMUPM_USE_PSEUDO_MEM=TRUE nmupm osLoad
ncpus=8
em_result=0.90|1.83|3.88|501596312.000000|989|3|0.000000|4096.000000|40775544.000000|10432708608.000000|41164206080.000000|21662134272.000000
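For illustration, the last two fields of em_result can be plugged into the swapUtil formula from host.xml; a small awk one-liner (not part of the agent, just a sketch):

nmupm osLoad | awk -F'|' '/^em_result=/ {
  used = $(NF-1); free = $NF;        # usedSwapRaw and freeSwapRaw
  printf "swapUtil = %.2f%%\n", 100.0 * used / (used + free)
}'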

Summary

If you are getting warning/critical alerts about swap utilization, check with glance or “swapinfo -t” whether pseudo-swap is in use. If it is, set the property and the metric will be calculated correctly.



Grid Control 11g: Diagnosing OMS High CPU Utilization

Beginning with 11g, Oracle Enterprise Manager Grid Control uses WebLogic instead of Oracle Application Server. I am currently experiencing very sluggish Grid Control console performance and high CPU utilization of the EMGC_OMS1 WebLogic server. This article shows how you can diagnose the problem and “assist” Oracle Support in solving the issue.

The first step to find out what's going on is to use “top”.

top - 08:29:12 up 4 days, 20:56,  3 users,  load average: 1.47, 1.60, 1.61
Tasks: 532 total,   1 running, 531 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.3%us,  0.7%sy,  0.0%ni, 97.8%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132033652k total, 20255244k used, 111778408k free,   728860k buffers
Swap: 20971512k total,        0k used, 20971512k free, 15391704k cached
 
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30181 gc11g    15   0 1778m 1.1g  49m S 99.7  0.9 626:54.30 java

PID 30181 is using 100% CPU (and has been for a very long time).

It's the EMGC_OMS1 server:

gc11g   30181 30129  9 Nov06 ?        10:27:20 /oracle/jdk/jdk1.6.0_21/bin/java -client -Xms256m -Xmx512m -XX:CompileThreshold=8000 -XX:PermSize=128m -XX:MaxPermSize=512m -Dweblogic.Name=EMGC_OMS1 -Djava.security.policy=/oracle/grid/middleware/wlserver_10.3/server/lib/weblogic.policy -Dweblogic.system.BootIdentityFile=/oracle/grid/gc_inst/user_projects/domains/GCDomain/servers/EMGC_OMS1/data/nodemanager/boot.properties -Dweblogic.nodemanager.ServiceEnabled=true -Dweblogic.security.SSL.ignoreHostnameVerification=true -Dweblogic.ReverseDNSAllowed=false -DINSTANCE_HOME=/oracle/grid/gc_inst/em/EMGC_OMS1 -DORACLE_HOME=/oracle/grid/middleware/oms11g -Ddomain.home=/oracle/grid/gc_inst/user_projects/domains/GCDomain -Djava.awt.headless=true -Ddomain.name=GCDomain -Docm.repeater.home=/oracle/grid/middleware/oms11g -Djava.security.egd=file:///dev/urandom -Xverify:none -da -Dplatform.home=/oracle/grid/middleware/wlserver_10.3 -Dwls.home=/oracle/grid/middleware/wlserver_10.3/server -Dweblogic.home=/oracle/grid/middleware/wlserver_10.3/server -Ddomain.home=/oracle/grid/gc_inst/user_projects/domains/GCDomain -Dcommon.components.home=/oracle/grid/middleware/oracle_common -Djrf.version=11.1.1 -Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.Jdk14Logger -Djrockit.optfile=/oracle/grid/middleware/oracle_common/modules/oracle.jrf_11.1.1/jrocket_optfile.txt -Doracle.domain.config.dir=/oracle/grid/gc_inst/user_projects/domains/GCDomain/config/fmwconfig -Doracle.server.config.dir=/oracle/grid/gc_inst/user_projects/domains/GCDomain/config/fmwconfig/servers/EMGC_OMS1 -Doracle.security.jps.config=/oracle/grid/gc_inst/user_projects/domains/GCDomain/config/fmwconfig/jps-config.xml -Djava.protocol.handler.pkgs=oracle.mds.net.protocol -Digf.arisidbeans.carmlloc=/oracle/grid/gc_inst/user_projects/domains/GCDomain/config/fmwconfig/carml -Digf.arisidstack.home=/oracle/grid/gc_inst/user_projects/domains/GCDomain/config/fmwconfig/arisidprovider -Dweblogic.alternateTypesDirectory=/oracle/grid/middleware/oracle_common/modules/oracle.ossoiap_11.1.1,/oracle/grid/middleware/oracle_common/modules/oracle.oamprovider_11.1.1 -Dweblogic.jdbc.remoteEnabled=false -Dweblogic.management.discover=false -Dweblogic.management.server=https://t44grid.wwk-group.com:7101 -Dwlw.iterativeDev= -Dwlw.testConsole= -Dwlw.logErrorsToConsole= -Dweblogic.ext.dirs=/oracle/grid/middleware/patch_wls1032/profiles/default/sysext_manifest_classpath weblogic.Server
oel5n1:/home/gc11g >

The next step is to find out which thread is consuming 100% CPU.

oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30181   5
30181 gc11g   1821640 1167200 S 15 0 00:00:03  0.0 java            30182   8
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30183   0
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30184   8
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30185  11
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30186  10
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30187   3
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30188  15
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30189   9
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30190   2
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30191  13
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30192   1
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30193   4
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30194  14
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            30195   5
30181 gc11g   1821640 1167200 R 25 0 08:30:49  7.6 java            30196   2       <<<<<<
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30197  10
30181 gc11g   1821640 1167200 S 15 0 00:00:04  0.0 java            30198   4
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30199   5
30181 gc11g   1821640 1167200 S 15 0 00:00:58  0.0 java            30200  15
30181 gc11g   1821640 1167200 S 15 0 00:00:57  0.0 java            30201   2
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30202   5
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30203  13
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30204  10
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30496   4
30181 gc11g   1821640 1167200 S 15 0 00:18:50  0.2 java            30497  11
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30498  15
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30499   0
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30504   8
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30505  10
30181 gc11g   1821640 1167200 S 15 0 00:00:11  0.0 java            30510  15
30181 gc11g   1821640 1167200 S 15 0 00:00:38  0.0 java            30513  11
30181 gc11g   1821640 1167200 S 15 0 00:00:38  0.0 java            30514  13
30181 gc11g   1821640 1167200 S 15 0 00:00:37  0.0 java            30517  10
30181 gc11g   1821640 1167200 S 15 0 00:00:39  0.0 java            30518  14
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30529   0
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30533   8
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30588   8
30181 gc11g   1821640 1167200 S 15 0 00:00:01  0.0 java            30589   5
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30590  14
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30591   4
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30592   3
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30899   8
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30901   9
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30904   3
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30907  15
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            30910  12
30181 gc11g   1821640 1167200 S 15 0 00:00:02  0.0 java            31205   2
30181 gc11g   1821640 1167200 S 15 0 00:29:45  0.4 java            31526   4
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            31568  15
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            31593   5
30181 gc11g   1821640 1167200 S 15 0 00:12:02  0.1 java            31600  15
30181 gc11g   1821640 1167200 S 15 0 00:08:27  0.1 java            31601  13
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32157   4
30181 gc11g   1821640 1167200 S 15 0 00:00:01  0.0 java            32160   2
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32161  12
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32163   2
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32166  11
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32168   0
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32169  13
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32170  10
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32171  10
30181 gc11g   1821640 1167200 S 16 0 00:00:00  0.0 java            32172   3
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32173  10
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32174   8
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32175  13
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32176  12
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32177   0
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32178   9
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32179  13
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32180   9
30181 gc11g   1821640 1167200 S 15 0 00:00:01  0.0 java            32181   6
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32182   0
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32183   8
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32264  15
30181 gc11g   1821640 1167200 S 15 0 00:01:02  0.0 java            32265   8
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32269   8
30181 gc11g   1821640 1167200 S 15 0 00:00:12  0.0 java            32270   1
30181 gc11g   1821640 1167200 S 15 0 00:00:14  0.0 java            32271   6
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32275  10
30181 gc11g   1821640 1167200 S 15 0 00:00:15  0.0 java            32280  15
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32283  12
30181 gc11g   1821640 1167200 S 15 0 00:10:45  0.1 java            32286  13
30181 gc11g   1821640 1167200 S 15 0 00:00:47  0.0 java            32289  12
30181 gc11g   1821640 1167200 S 15 0 00:02:57  0.0 java            32295   0
30181 gc11g   1821640 1167200 S 15 0 00:00:03  0.0 java            32297  12
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32301   9
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32302  12
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32305   5
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32306  12
30181 gc11g   1821640 1167200 S 15 0 00:01:09  0.0 java            32307   7
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32308   2
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32309   0
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32310   4
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32311   8
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32326  13
30181 gc11g   1821640 1167200 S 15 0 00:00:14  0.0 java            32327  11
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32336  13
30181 gc11g   1821640 1167200 S 15 0 00:00:01  0.0 java            32337   8
30181 gc11g   1821640 1167200 S 15 0 00:00:29  0.0 java            32338   3
30181 gc11g   1821640 1167200 S 15 0 00:00:05  0.0 java            32339   9
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32340  10
30181 gc11g   1821640 1167200 S 15 0 00:00:18  0.0 java            32341  12
30181 gc11g   1821640 1167200 S 15 0 00:00:02  0.0 java            32342   7
30181 gc11g   1821640 1167200 S 15 0 00:00:29  0.0 java            32343   7
30181 gc11g   1821640 1167200 S 15 0 00:00:10  0.0 java            32344   0
30181 gc11g   1821640 1167200 S 15 0 00:00:05  0.0 java            32351   0
30181 gc11g   1821640 1167200 S 15 0 00:00:02  0.0 java             1943   9
30181 gc11g   1821640 1167200 S 15 0 00:01:19  0.0 java             1944   6
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java             1945  15
30181 gc11g   1821640 1167200 S 15 0 00:00:06  0.0 java             1946  11
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java             2890  15
30181 gc11g   1821640 1167200 S 15 0 00:02:56  0.0 java             7839  14
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            10208   0
30181 gc11g   1821640 1167200 S 15 0 00:00:02  0.0 java            16818  15
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            18310   2
30181 gc11g   1821640 1167200 S 15 0 00:00:09  0.0 java              642  14
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java              690   7
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            27517   8
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            27518   8
30181 gc11g   1821640 1167200 S 15 0 00:00:01  0.0 java            29228   3
30181 gc11g   1821640 1167200 S 15 0 00:02:46  0.1 java            12075  12
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            12160   1
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            12829  13
30181 gc11g   1821640 1167200 S 15 0 00:00:02  0.0 java             2773   7
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            14748  14
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            14749   6
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            14750   1
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            14751   0
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            14752   5
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            14753   5
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            14860  12
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            16706   8
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            17312   1
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            17313  12
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            17425  12
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            17426  15
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            19993  12
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            19994  15
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            19995  12
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java             3575   9
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            16593  14
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            18488   8
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            18489   5
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            21901   7
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java             5572   7
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java             7993   9
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            27543  13
30181 gc11g   1821640 1167200 S 15 0 00:00:00  0.0 java            32075  15

So, it is thread 30196 of PID 30181. Let's filter out the sleeping threads with grep.

oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 25 0 08:31:37  7.7 java            30196   6
oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 25 0 08:31:39  7.7 java            30196   2
oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 25 0 08:31:40  7.7 java            30196   2
oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 18 0 08:31:40  7.7 java            30196   2
oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 18 0 08:31:41  7.7 java            30196  15
oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 19 0 08:31:41  7.7 java            30196  15
oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 20 0 08:31:42  7.7 java            30196  13
oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 20 0 08:31:42  7.7 java            30196   9
oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 21 0 08:31:42  7.7 java            30196   9
oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 21 0 08:31:43  7.7 java            30196   9
oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 23 0 08:31:43  7.7 java            30196   9
oel5n1:/home/gc11g >  ps -Leo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr | grep 30181 | grep -v " S "
30181 gc11g   1821640 1167200 R 24 0 08:31:43  7.7 java            30196   9

It is always the same thread, consuming 7.7% (~1/16 of the cores) of the total CPU on the system.

Let's take a pstack:

oel5n1:/home/gc11g > pstack 30181

Let's grep for the LWP we know:

oel5n1:/home/gc11g > pstack 30181 | grep -A 20 30196
 
Thread 135 (Thread 0x4024f940 (LWP 30196)):                           <<<<<<<<<<<<<<<<<<
#0  0x00002b7471e17d3f in instanceKlass::oop_follow_contents ()
#1  0x00002b747203e6c3 in objArrayKlass::oop_follow_contents ()
#2  0x00002b7471fee5e3 in MarkSweep::follow_stack ()
#3  0x00002b7472094964 in PSMarkSweep::mark_sweep_phase1 ()
#4  0x00002b7472093859 in PSMarkSweep::invoke_no_policy ()
#5  0x00002b74720a0737 in PSScavenge::invoke ()
#6  0x00002b747206814e in ParallelScavengeHeap::failed_mem_allocate ()
#7  0x00002b747218fbc9 in VM_ParallelGCFailedAllocation::doit ()
#8  0x00002b747219ccfa in VM_Operation::evaluate ()
#9  0x00002b747219c312 in VMThread::evaluate_operation ()
#10 0x00002b747219c583 in VMThread::loop ()
#11 0x00002b747219c08e in VMThread::run ()
#12 0x00002b74720560df in java_start ()
#13 0x0000003130e064a7 in start_thread () from /lib64/libpthread.so.0
#14 0x00000031302d3c2d in clone () from /lib64/libc.so.6
 
 
oel5n1:/home/gc11g > pstack 30181 | grep -A 20 30196
Thread 135 (Thread 0x4024f940 (LWP 30196)):
#0  0x00002b7471e17d3f in instanceKlass::oop_follow_contents ()
#1  0x00002b747203e6c3 in objArrayKlass::oop_follow_contents ()
#2  0x00002b7471fee5e3 in MarkSweep::follow_stack ()
#3  0x00002b7472094964 in PSMarkSweep::mark_sweep_phase1 ()
#4  0x00002b7472093859 in PSMarkSweep::invoke_no_policy ()
#5  0x00002b74720a0737 in PSScavenge::invoke ()
#6  0x00002b747206814e in ParallelScavengeHeap::failed_mem_allocate ()
#7  0x00002b747218fbc9 in VM_ParallelGCFailedAllocation::doit ()
#8  0x00002b747219ccfa in VM_Operation::evaluate ()
#9  0x00002b747219c312 in VMThread::evaluate_operation ()
#10 0x00002b747219c583 in VMThread::loop ()
#11 0x00002b747219c08e in VMThread::run ()
#12 0x00002b74720560df in java_start ()
#13 0x0000003130e064a7 in start_thread () from /lib64/libpthread.so.0
#14 0x00000031302d3c2d in clone () from /lib64/libc.so.6
 
 
 
oel5n1:/home/gc11g > pstack 30181 | grep -A 20 30196
Thread 135 (Thread 0x4024f940 (LWP 30196)):
#0  0x00002b7472046d27 in objArrayKlass::oop_adjust_pointers ()
#1  0x00002b747209549e in PSMarkSweepDecorator::adjust_pointers ()
#2  0x00002b74720967f0 in PSOldGen::adjust_pointers ()
#3  0x00002b7472094c3a in PSMarkSweep::mark_sweep_phase3 ()
#4  0x00002b74720938f6 in PSMarkSweep::invoke_no_policy ()
#5  0x00002b74720a0737 in PSScavenge::invoke ()
#6  0x00002b747206814e in ParallelScavengeHeap::failed_mem_allocate ()
#7  0x00002b747218fbc9 in VM_ParallelGCFailedAllocation::doit ()
#8  0x00002b747219ccfa in VM_Operation::evaluate ()
#9  0x00002b747219c312 in VMThread::evaluate_operation ()
#10 0x00002b747219c583 in VMThread::loop ()
#11 0x00002b747219c08e in VMThread::run ()
#12 0x00002b74720560df in java_start ()
#13 0x0000003130e064a7 in start_thread () from /lib64/libpthread.so.0
#14 0x00000031302d3c2d in clone () from /lib64/libc.so.6
 
 
oel5n1:/home/gc11g > pstack 30181 | grep -A 20 30196
Thread 135 (Thread 0x4024f940 (LWP 30196)):
#0  0x00002b7471fee9d3 in MarkSweep::IsAliveClosure::do_object_b ()
#1  0x00002b7471dfad18 in Hashtable::unlink ()
#2  0x00002b74720949ef in PSMarkSweep::mark_sweep_phase1 ()
#3  0x00002b7472093859 in PSMarkSweep::invoke_no_policy ()
#4  0x00002b74720a0737 in PSScavenge::invoke ()
#5  0x00002b747206814e in ParallelScavengeHeap::failed_mem_allocate ()
#6  0x00002b747218fbc9 in VM_ParallelGCFailedAllocation::doit ()
#7  0x00002b747219ccfa in VM_Operation::evaluate ()
#8  0x00002b747219c312 in VMThread::evaluate_operation ()
#9  0x00002b747219c583 in VMThread::loop ()
#10 0x00002b747219c08e in VMThread::run ()
#11 0x00002b74720560df in java_start ()
#12 0x0000003130e064a7 in start_thread () from /lib64/libpthread.so.0
#13 0x00000031302d3c2d in clone () from /lib64/libc.so.6
 
 
oel5n1:/home/gc11g > pstack 30181 | grep -A 20 30196
Thread 135 (Thread 0x4024f940 (LWP 30196)):
#0  0x00002b7471fee706 in MarkSweep::AdjustPointerClosure::do_oop ()
#1  0x00002b7471dfadb1 in Hashtable::oops_do ()
#2  0x00002b7472094bf9 in PSMarkSweep::mark_sweep_phase3 ()
#3  0x00002b74720938f6 in PSMarkSweep::invoke_no_policy ()
#4  0x00002b74720a0737 in PSScavenge::invoke ()
#5  0x00002b747206814e in ParallelScavengeHeap::failed_mem_allocate ()
#6  0x00002b747218fbc9 in VM_ParallelGCFailedAllocation::doit ()
#7  0x00002b747219ccfa in VM_Operation::evaluate ()
#8  0x00002b747219c312 in VMThread::evaluate_operation ()
#9  0x00002b747219c583 in VMThread::loop ()
#10 0x00002b747219c08e in VMThread::run ()
#11 0x00002b74720560df in java_start ()
#12 0x0000003130e064a7 in start_thread () from /lib64/libpthread.so.0
#13 0x00000031302d3c2d in clone () from /lib64/libc.so.6
 
 
oel5n1:/home/gc11g > pstack 30181 | grep -A 20 30196
Thread 135 (Thread 0x4024f940 (LWP 30196)):
#0  0x00002b7471e179e5 in instanceKlass::adjust_static_fields ()
#1  0x00002b7471e234ad in instanceKlassKlass::oop_adjust_pointers ()
#2  0x00002b747209549e in PSMarkSweepDecorator::adjust_pointers ()
#3  0x00002b74720967f0 in PSOldGen::adjust_pointers ()
#4  0x00002b7472094c3a in PSMarkSweep::mark_sweep_phase3 ()
#5  0x00002b74720938f6 in PSMarkSweep::invoke_no_policy ()
#6  0x00002b74720a0737 in PSScavenge::invoke ()
#7  0x00002b747206814e in ParallelScavengeHeap::failed_mem_allocate ()
#8  0x00002b747218fbc9 in VM_ParallelGCFailedAllocation::doit ()
#9  0x00002b747219ccfa in VM_Operation::evaluate ()
#10 0x00002b747219c312 in VMThread::evaluate_operation ()
#11 0x00002b747219c583 in VMThread::loop ()
#12 0x00002b747219c08e in VMThread::run ()
#13 0x00002b74720560df in java_start ()
#14 0x0000003130e064a7 in start_thread () from /lib64/libpthread.so.0
#15 0x00000031302d3c2d in clone () from /lib64/libc.so.6
 
 
oel5n1:/home/gc11g > pstack 30181 | grep -A 20 30196
Thread 135 (Thread 0x4024f940 (LWP 30196)):
#0  0x00002b7471e17e54 in instanceKlass::oop_follow_contents ()
#1  0x00002b747203e6c3 in objArrayKlass::oop_follow_contents ()
#2  0x00002b7471fee5e3 in MarkSweep::follow_stack ()
#3  0x00002b7472094964 in PSMarkSweep::mark_sweep_phase1 ()
#4  0x00002b7472093859 in PSMarkSweep::invoke_no_policy ()
#5  0x00002b74720a0737 in PSScavenge::invoke ()
#6  0x00002b747206814e in ParallelScavengeHeap::failed_mem_allocate ()
#7  0x00002b747218fbc9 in VM_ParallelGCFailedAllocation::doit ()
#8  0x00002b747219ccfa in VM_Operation::evaluate ()
#9  0x00002b747219c312 in VMThread::evaluate_operation ()
#10 0x00002b747219c583 in VMThread::loop ()
#11 0x00002b747219c08e in VMThread::run ()
#12 0x00002b74720560df in java_start ()
#13 0x0000003130e064a7 in start_thread () from /lib64/libpthread.so.0
#14 0x00000031302d3c2d in clone () from /lib64/libc.so.6
 
 
oel5n1:/home/gc11g > pstack 30181 | grep -A 20 30196
Thread 135 (Thread 0x4024f940 (LWP 30196)):
#0  0x00002b747203e640 in objArrayKlass::oop_follow_contents ()
#1  0x00002b7471fee5e3 in MarkSweep::follow_stack ()
#2  0x00002b7472094964 in PSMarkSweep::mark_sweep_phase1 ()
#3  0x00002b7472093859 in PSMarkSweep::invoke_no_policy ()
#4  0x00002b74720a0737 in PSScavenge::invoke ()
#5  0x00002b747206814e in ParallelScavengeHeap::failed_mem_allocate ()
#6  0x00002b747218fbc9 in VM_ParallelGCFailedAllocation::doit ()
#7  0x00002b747219ccfa in VM_Operation::evaluate ()
#8  0x00002b747219c312 in VMThread::evaluate_operation ()
#9  0x00002b747219c583 in VMThread::loop ()
#10 0x00002b747219c08e in VMThread::run ()
#11 0x00002b74720560df in java_start ()
#12 0x0000003130e064a7 in start_thread () from /lib64/libpthread.so.0
#13 0x00000031302d3c2d in clone () from /lib64/libc.so.6
 
 
 
oel5n1:/home/gc11g > pstack 30181 | grep -A 20 30196
Thread 135 (Thread 0x4024f940 (LWP 30196)):
#0  0x00002b7471d50445 in ConstantPoolCacheEntry::follow_contents ()
#1  0x00002b7471d4f4f6 in constantPoolCacheKlass::oop_follow_contents ()
#2  0x00002b7471fee5e3 in MarkSweep::follow_stack ()
#3  0x00002b7472094964 in PSMarkSweep::mark_sweep_phase1 ()
#4  0x00002b7472093859 in PSMarkSweep::invoke_no_policy ()
#5  0x00002b74720a0737 in PSScavenge::invoke ()
#6  0x00002b747206814e in ParallelScavengeHeap::failed_mem_allocate ()
#7  0x00002b747218fbc9 in VM_ParallelGCFailedAllocation::doit ()
#8  0x00002b747219ccfa in VM_Operation::evaluate ()
#9  0x00002b747219c312 in VMThread::evaluate_operation ()
#10 0x00002b747219c583 in VMThread::loop ()
#11 0x00002b747219c08e in VMThread::run ()
#12 0x00002b74720560df in java_start ()
#13 0x0000003130e064a7 in start_thread () from /lib64/libpthread.so.0
#14 0x00000031302d3c2d in clone () from /lib64/libc.so.6
 
oel5n1:/home/gc11g > pstack 30181 | grep -A 20 30196
Thread 135 (Thread 0x4024f940 (LWP 30196)):
#0  0x00002b7471f8af28 in klassKlass::oop_follow_contents ()
#1  0x00002b7471e226cc in instanceKlassKlass::oop_follow_contents ()
#2  0x00002b7471fee5e3 in MarkSweep::follow_stack ()
#3  0x00002b7472094964 in PSMarkSweep::mark_sweep_phase1 ()
#4  0x00002b7472093859 in PSMarkSweep::invoke_no_policy ()
#5  0x00002b74720a0737 in PSScavenge::invoke ()
#6  0x00002b747206814e in ParallelScavengeHeap::failed_mem_allocate ()
#7  0x00002b747218fbc9 in VM_ParallelGCFailedAllocation::doit ()
#8  0x00002b747219ccfa in VM_Operation::evaluate ()
#9  0x00002b747219c312 in VMThread::evaluate_operation ()
#10 0x00002b747219c583 in VMThread::loop ()
#11 0x00002b747219c08e in VMThread::run ()
#12 0x00002b74720560df in java_start ()
#13 0x0000003130e064a7 in start_thread () from /lib64/libpthread.so.0
#14 0x00000031302d3c2d in clone () from /lib64/libc.so.6

I'm not much of a Java guy, but this tells me that the thread is doing parallel garbage collection. So the issue might be related to a heap size (Xmx) that is configured too small. I will talk with support to decide whether they agree to increase Xmx to 1G or 2G. In case the CPU consumption is not caused by garbage collection but by some other OEM code, there is a command “emctl dump omsthread” which can be used:

oel5n1:/home/gc11g > emctl dump omsthread
Oracle Enterprise Manager 11g Release 1 Grid Control
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Thread dumped successfully for process '30181' to log file 
/oracle/grid/gc_inst/user_projects/domains/GCDomain/servers/EMGC_OMS1/logs/EMGC_OMS1.out

If we convert LWP 30196 from decimal to hex, we get 75f5; this is the nid in the OMS thread dump.

"Reference Handler" daemon prio=10 tid=0x000000004fdff000 nid=0x75f5 waiting for monitor entry [0x0000000040725000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:108)
        - waiting to lock <0x00002aaacf4967a0> (a java.lang.ref.Reference$Lock)
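For reference, the decimal-to-hex conversion can be done directly on the shell; a minimal sketch (LWP_ID is just a placeholder for the thread id taken from the ps output):

# Convert a decimal LWP id (from ps -L) to the hex "nid" format used in Java thread dumps
printf 'nid=0x%x\n' "$LWP_ID"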

Update 26.11.2010:

Oracle Support has recommended increasing Xms/Xmx to 1024m and checking whether that is sufficient. This has to be changed in the file gc_inst/user_projects/domains/GCDomain/bin/setDomainEnv.sh, in these variables for Linux 64-bit:

XMS_SUN_64BIT="1024"
XMX_SUN_64BIT="1024"
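A hedged sketch of applying the change and restarting the OMS so the managed server picks up the new heap size (emctl stop/start oms are the standard OMS control commands):

# Edit gc_inst/user_projects/domains/GCDomain/bin/setDomainEnv.sh and set:
#   XMS_SUN_64BIT="1024"
#   XMX_SUN_64BIT="1024"

# Then bounce the OMS
emctl stop oms
emctl start oms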


Upgrade/Install of Grid Infrastructure to 11.2.0.2 is failing

I have come across a quite difficult issue when trying to install/upgrade to Grid Infrastructure 11.2.0.2 on Linux. We have performed the following trials:

  • Installation of 11.1.0.6 to keep the cluster files (OCR/voting disks) outside of ASM, then upgrade to 11.2.0.1. After that, we installed all the latest patches (PSU 11.2.0.1.2). Initially, we considered running RDBMS 11.1.0.7 in an 11.2 Grid Infrastructure cluster. However, there are quite a few issues with this mixed setup; for most of them, patches or workarounds exist (see Note 948456.1). During failover testing, though, we realized that the remote database instance does not survive a failure of the public LAN on the local node. This is fixed in 11.2.0.2, which motivated us to upgrade.
  • We tried to upgrade our GI 11.2.0.1.2 installation to 11.2.0.2. Everything went fine until rootupgrade.sh failed on the last (second) node, which could not join the cluster. It is important to know that 11.2.0.2 introduces the new feature HAIP (Highly Available IP), which allows multiple private interconnect interfaces (each with a different private interconnect network IP) to be used by Grid Infrastructure for failover and load balancing. For this feature, Oracle uses multicast on network 230.0.1.0. This is stated in the updated version of the README for Oracle Database 11.2: http://download.oracle.com/docs/cd/E11882_01/readmes.112/e17129/toc.htm#CHDIEHCH
ACFS-9309: ADVM/ACFS installation correctness verified.
Failed to start Oracle Clusterware stack
Failed to start Cluster Synchorinisation Service in clustered mode at
/opt/oracle/grid11/11.2.0.2/grid/crs/install/crsconfig_lib.pm line 1016.
/opt/oracle/grid11/11.2.0.2/grid/perl/bin/perl -I/opt/oracle/grid11/11.2.0.2/grid/perl/lib -
I/opt/oracle/grid11/11.2.0.2/grid/crs/install
/opt/oracle/grid11/11.2.0.2/grid/crs/install/rootcrs.pl execution failed
  • The next try was to remove everything and directly install 11.2.0.1 with the cluster files stored inside ASM. This worked fine, and we then proceeded to patch with the latest PSU 11.2.0.1.2. Then we gave the upgrade to 11.2.0.2 another try. The same problem occurred again at rootupgrade.sh on the second node.
  • Then we decided to remove everything again and try a direct installation of 11.2.0.2 without any upgrade. This also failed at root.sh on last node.

With hints from several colleagues on the OTN forum (http://forums.oracle.com/forums/messageview.jspa?messageID=5393319&stqc=true), we realized that it might be related to the multicast setup. We verified that multicast works using the script from MetaLink Note 413783.1. However, this script does not take into account that we want multicast to run over the private interconnect interface instead of the public LAN.

I realized that it might be necessary to add a network route on our system to direct multicast traffic to the private interconnect network:

e.g.: /sbin/route add -net 224.0.0.0 netmask 240.0.0.0 dev bond1
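A route added this way is lost at reboot; on Oracle Linux the usual way to persist it is an interface route file (a sketch, assuming bond1 is the private interconnect interface):

# /etc/sysconfig/network-scripts/route-bond1
224.0.0.0/4 dev bond1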

After doing that, the multicast test script from MOS succeeded.

starting receiver:

[grid11@Node2 ]$ java MultiCastTestReceive

starting sender:

[grid11@Node1 ]$ java MultiCastTestSend
Sent 10 bytes to /230.0.1.0

message received by receiver:

Received data from: /10.128.128.1:13139 with length: 10

Please note that in this case, 10.128.128.1 is the private interconnect IP. Although it is currently just a strong suspicion, I am quite positive that this was the problem. We will consider retrying the upgrade if time allows and will report the results.

Unfortunately, Oracle support was of no help at all with this problem.



Oracle Clusterware / ASM 11.1.0.7: ASM Instance crash

During an 11gR1 Clusterware installation for a single-instance failover cluster at a customer site, I experienced an interesting behaviour caused by an Oracle bug.

The environment was:

  • 2 Node Oracle Clusterware 11.1.0.7 Cluster on Linux x86_64 using latest Recommended Patches as of October 19th. (pre PSU 11.1.0.7.1)
  • Clusterware installed as unix user crs
  • ASM installed as unix user oracle

The ASM instances could be started with SQL*Plus without any problems, but when they were started through Clusterware using srvctl (whether as root, crs, or oracle), the ASM instances crashed at diskgroup mount with:

ORA-07445: exception encountered: core dump  [sskgds_find_rtn_hdr()+1171]
[SIGBUS] [ADDR:0x2AACD701342C] [PC:0x25799DF]
[Non-existent physical address] []

Oracle Support identified this behaviour as Bug 6952915, for which patches are available for Linux x86, x86_64, and Solaris SPARC64.