OpenManage Does Not Start Because Semaphore Count Reached (DRAFT)

SR Information: 1608620

Product / Software Version: Found on a DXi8500 with 2.2.1 software.

Problem Description: OpenManage does not start because semaphore count reached (OMSA won't start).

Related PTRs: Bug 35994 - Disk timeout causes semaphore leakage

Overview

This topic describes how to fix an issue where OpenManage cannot restart because the system semaphore limit has been reached.

Although the issue may occur on other DXi platforms, this resolution applies to the DXi8500. For other platforms, consult bug 35994, where confirmation of the workaround for other platforms was requested, or escalate the issue to your backline support.

What is a semaphore?

A semaphore is a counter used to control access to shared resources. Under an OS, multiple processes request access to the shared resources. The semaphore acts as a locking mechanism that prevents processes from accessing a shared resource beyond the limit defined by the counters.

When a process tries to access a resource that has reached the limit defined by the semaphore counters, a message is written to the OS logs (the messages log) indicating that a semaphore limit (set, group, etc.) was exceeded.
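To make the limit concrete: on Linux, the "maximum number of semaphore sets" is the fourth value in /proc/sys/kernel/sem (SEMMNI), and the sets currently allocated are the data rows printed by ipcs -s. The following is a minimal sketch for comparing the two. The function names are illustrative and not part of any DXi or OpenManage tooling.

```shell
# Count semaphore sets from `ipcs -s` output read on stdin.
# Data rows start with a hex key (0x...); headers and blank lines do not.
count_sem_sets() {
    grep -c '^0x'
}

# Report whether the number of sets in use has reached the limit.
# $1: sets in use, $2: SEMMNI limit
sem_sets_report() {
    used=$1
    limit=$2
    if [ "$used" -ge "$limit" ]; then
        echo "LIMIT REACHED: $used of $limit semaphore sets in use"
    else
        echo "OK: $used of $limit semaphore sets in use"
    fi
}

# Example on a live system:
#   sem_sets_report "$(ipcs -s | count_sem_sets)" "$(awk '{print $4}' /proc/sys/kernel/sem)"
```

If the report shows the limit reached, any process that asks for a new semaphore set is refused until sets are released or the limit is raised.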

Symptom (How to Identify the Problem)

OpenManage (omsa) is down
System GUI shows Storage Arrays as *unreachable* and array hardware as *unknown*.
DSET Log does not show any details of Dell controllers or attached HW.
The messages log registers the following event:

Sep 9 04:05:16 NR1PQ8500VTLM01 Server Administrator (Shared Library): Data Engine EventID: 0 A semaphore set has to be created but the system limit for the maximum number of semaphore sets has been exceeded.

A manual start of OpenManage (omsa) fails.
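A quick way to check for the log symptom is to count the matching events. This is a minimal sketch; the function name and the default path /var/log/messages are assumptions, so pass the path of the messages file you are analyzing.

```shell
# Count the "semaphore sets has been exceeded" events in a syslog-style file.
# $1: log file (the default path is an assumption)
count_sem_limit_events() {
    logfile="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines (0 when none match).
    grep -c 'maximum number of semaphore sets has been exceeded' "$logfile"
}

# Example:
#   count_sem_limit_events messages
```

A non-zero count together with the other symptoms listed above suggests the condition described in this topic.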

Root Cause and Additional Symptoms

A system can exceed the semaphore count for several reasons. For SR 1608620, it was found that a flood of disk command timeout events caused the semaphore limit to be exceeded. In that case, you will find the following symptoms in addition to those described above.

First, you'll see the dsm_sa* services shutting down periodically (expected). In the following example, however, shutdown failures begin on Sep 4:

Aug 29 04:02:12 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Aug 30 04:02:06 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Aug 30 04:02:13 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Aug 31 04:02:07 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Aug 31 04:02:12 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Sep 1 04:02:07 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 1 04:02:13 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Sep 2 04:02:07 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 2 04:02:14 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Sep 3 04:02:07 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 3 04:02:14 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Sep 4 04:02:07 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 4 04:03:08 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown failed

Sep 4 04:03:20 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Sep 5 04:02:07 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 5 04:03:08 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown failed

Sep 5 04:03:29 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Sep 6 04:02:07 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 6 04:03:08 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown failed

Sep 6 04:03:21 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Sep 7 04:02:08 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 7 04:03:08 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown failed

Sep 7 04:03:09 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown failed

Sep 8 04:02:07 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 8 04:03:07 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Sep 9 04:02:10 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 9 04:03:10 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown failed

Sep 9 04:03:20 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Sep 10 04:02:07 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 10 04:03:08 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown failed

Sep 10 04:03:09 NR1PQ8500VTLM01 dataeng: dsm_sa_datamgrd shutdown succeeded

Sep 10 15:27:53 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 11 04:02:13 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

Sep 11 10:16:36 NR1PQ8500VTLM01 dataeng: dsm_sa_eventmgrd shutdown succeeded

(END)
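In a long log, the failed shutdowns can be pulled out on their own. The following is a minimal sketch that assumes the raw syslog line layout shown above (month, day, time, host, "dataeng:", service name). The function name and default path are illustrative.

```shell
# Print date, time, and service name for every failed dsm_sa* shutdown.
# $1: log file (the default path is an assumption)
list_shutdown_failures() {
    logfile="${1:-/var/log/messages}"
    # Fields: $1 month, $2 day, $3 time, $4 host, $5 "dataeng:", $6 service
    awk '/dataeng: dsm_sa_[a-z]+ shutdown failed/ { print $1, $2, $3, $6 }' "$logfile"
}

# Example:
#   list_shutdown_failures messages
```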

Note that the semaphore issues start on Sep 9:

$ grep semaphore messages | less -NI

1 Sep 9 04:05:16 NR1PQ8500VTLM01 Server Administrator (Shared Library): Data Engine EventID: 0 A semaphore set has to be created but the system limit for the maximum number of semaphore sets has been exceeded

2 Sep 9 04:15:04 NR1PQ8500VTLM01 Server Administrator (Shared Library): Data Engine EventID: 0 A semaphore set has to be created but the system limit for the maximum number of semaphore sets has been exceeded

3 Sep 9 04:24:57 NR1PQ8500VTLM01 Server Administrator (Shared Library): Data Engine EventID: 0 A semaphore set has to be created but the system limit for the maximum number of semaphore sets has been exceeded

4 Sep 9 04:34:49 NR1PQ8500VTLM01 Server Administrator (Shared Library): Data Engine EventID: 0 A semaphore set has to be created but the system limit for the maximum number of semaphore sets has been exceeded

5 Sep 9 04:44:54 NR1PQ8500VTLM01 Server Administrator (Shared Library): Data Engine EventID: 0 A semaphore set has to be created but the system limit for the maximum number of semaphore sets has been exceeded

>>> events continue until Sep 11, when the workaround was applied

220 Sep 11 10:07:02 NR1PQ8500VTLM01 Server Administrator (Shared Library): Data Engine EventID: 0 A semaphore set has to be created but the system limit for the maximum number of semaphore sets has been exceeded

221 Sep 11 10:07:02 NR1PQ8500VTLM01 Server Administrator (Shared Library): Data Engine EventID: 0 A semaphore set has to be created but the system limit for the maximum number of semaphore sets has been exceeded

222 Sep 11 10:17:17 NR1PQ8500VTLM01 Server Administrator (Shared Library): Data Engine EventID: 0 A semaphore set has to be created but the system limit for the maximum number of semaphore sets has been exceeded

223 Sep 11 10:18:08 NR1PQ8500VTLM01 Server Administrator (Shared Library): Data Engine EventID: 0 A semaphore set has to be created but the system limit for the maximum number of semaphore sets has been exceeded

224 Sep 11 10:24:34 NR1PQ8500VTLM01 Server Administrator (Shared Library): Data Engine EventID: 0 A semaphore set has to be created but the system limit for the maximum number of semaphore sets has been exceeded

Starting on Sep 4, disk 16 begins to report repeated command timeouts, and by Sep 9 the messages log is filled with those events:

$ grep 'Storage Service' messages | less -NI

635 Sep 4 02:14:37 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

636 Sep 4 02:15:58 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2095 Unexpected sense. SCSI sense data: Sense key: 6 Sense code: 29 Sense qualifier: 2: Physical Disk 0:0:16 Controller 1, Connector 0

637 Sep 4 02:15:58 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

638 Sep 4 02:16:40 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

639 Sep 4 02:16:41 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2346 Error occurred: Error on PD 30(e0x35/s16) (Error f0).: Physical Disk 0:0:16 Controller 1, Connector 0

640 Sep 4 02:16:41 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2048 Device failed: Physical Disk 0:0:16 Controller 1, Connector 0

641 Sep 4 02:16:41 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2123 Redundancy lost: Virtual Disk 7 (BPMD_8) Controller 1 (PERC H800 Adapter)

642 Sep 4 02:16:41 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2057 Virtual disk degraded: Virtual Disk 7 (BPMD_8) Controller 1 (PERC H800 Adapter)

643 Sep 4 02:16:42 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

644 Sep 4 02:16:43 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2065 Physical disk Rebuild started: Physical Disk 0:0:0 Controller 1, Connector 0

645 Sep 4 02:16:43 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

646 Sep 4 02:29:31 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

647 Sep 4 02:29:31 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

648 Sep 4 02:29:32 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2095 Unexpected sense. SCSI sense data: Sense key: 5 Sense code: 24 Sense qualifier: 0: Physical Disk 0:0:8 Controller 2, Connector 0

649 Sep 4 02:29:32 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

650 Sep 4 03:01:43 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

651 Sep 4 04:01:55 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

652 Sep 4 04:04:59 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2048 Device failed: Physical Disk 0:0:16 Controller 1, Connector 0

653 Sep 4 04:05:00 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2048 Device failed: Physical Disk 0:0:16 Controller 1, Connector 0

654 Sep 4 04:05:00 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2057 Virtual disk degraded: Virtual Disk 7 (BPMD_8) Controller 1 (PERC H800 Adapter)

655 Sep 4 04:05:00 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: Command timeout on PD 30(e0x35/s16) Path 500000e117037dc2, CDB: 12 00 00 00 60 00: Controller 1 (PERC H800 Adapter)

656 Sep 4 04:05:00 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: PD 30(e0x35/s16) Path 500000e117037dc2 reset (Type 03): Controller 1 (PERC H800 Adapter)

657 Sep 4 04:05:00 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: Command timeout on PD 30(e0x35/s16) Path 500000e117037dc2, CDB: 12 00 00 00 60 00: Controller 1 (PERC H800 Adapter)

658 Sep 4 04:05:01 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: PD 30(e0x35/s16) Path 500000e117037dc2 reset (Type 03): Controller 1 (PERC H800 Adapter)

. . . >>> message repeats

681 Sep 4 04:05:06 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: Command timeout on PD 30(e0x35/s16) Path 500000e117037dc2, CDB: 12 00 00 00 60 00: Controller 1 (PERC H800 Adapter)

682 Sep 4 04:05:07 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: PD 30(e0x35/s16) Path 500000e117037dc2 reset (Type 03): Controller 1 (PERC H800 Adapter)

683 Sep 4 04:05:07 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: Command timeout on PD 30(e0x35/s16) Path 500000e117037dc2, CDB: 12 00 00 00 60 00: Controller 1 (PERC H800 Adapter)

684 Sep 4 04:05:07 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: PD 30(e0x35/s16) Path 500000e117037dc2 reset (Type 03): Controller 1 (PERC H800 Adapter)

685 Sep 4 04:05:36 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

686 Sep 4 04:06:18 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

687 Sep 4 05:00:12 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

688 Sep 4 05:01:36 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

689 Sep 4 05:22:36 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

690 Sep 4 06:01:48 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

691 Sep 4 07:02:00 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

692 Sep 4 07:13:12 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

693 Sep 4 08:01:30 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

694 Sep 4 09:01:42 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

695 Sep 4 09:32:30 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

696 Sep 4 09:38:07 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

697 Sep 4 10:01:55 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

. . .

1041 Sep 8 23:35:03 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1042 Sep 9 00:01:39 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1043 Sep 9 01:01:53 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1044 Sep 9 01:27:07 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2181 The controller battery Learn cycle will start in 24 hours.: Battery 0 Controller 1

1045 Sep 9 01:27:07 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1046 Sep 9 02:00:01 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1047 Sep 9 02:01:25 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1048 Sep 9 02:02:07 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1049 Sep 9 02:30:10 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2181 The controller battery Learn cycle will start in 24 hours.: Battery 0 Controller 2

1050 Sep 9 02:30:10 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1051 Sep 9 02:30:51 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1052 Sep 9 03:01:41 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1053 Sep 9 04:01:56 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1054 Sep 9 04:05:06 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2048 Device failed: Physical Disk 0:0:16 Controller 1, Connector 0

1055 Sep 9 04:05:06 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2048 Device failed: Physical Disk 0:0:16 Controller 1, Connector 0

1056 Sep 9 04:05:07 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: Command timeout on PD 30(e0x35/s16) Path 500000e117037dc2, CDB: 12 00 00 00 60 00: Controller 1 (PERC H800 Adapter)

1057 Sep 9 04:05:07 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: PD 30(e0x35/s16) Path 500000e117037dc2 reset (Type 03): Controller 1 (PERC H800 Adapter)

1058 Sep 9 04:05:07 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: Command timeout on PD 30(e0x35/s16) Path 500000e117037dc2, CDB: 12 00 00 00 60 00: Controller 1 (PERC H800 Adapter)

1059 Sep 9 04:05:07 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: PD 30(e0x35/s16) Path 500000e117037dc2 reset (Type 03): Controller 1 (PERC H800 Adapter)

1060 Sep 9 04:05:08 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: Command timeout on PD 30(e0x35/s16) Path 500000e117037dc2, CDB: 12 00 00 00 60 00: Controller 1 (PERC H800 Adapter)

1061 Sep 9 04:05:08 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: PD 30(e0x35/s16) Path 500000e117037dc2 reset (Type 03): Controller 1 (PERC H800 Adapter)

1062 Sep 9 04:05:08 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: Command timeout on PD 30(e0x35/s16) Path 500000e117037dc2, CDB: 12 01 dc 01 1d 00: Controller 1 (PERC H800 Adapter)

1063 Sep 9 04:05:08 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: PD 30(e0x35/s16) Path 500000e117037dc2 reset (Type 03): Controller 1 (PERC H800 Adapter)

. . .

1084 Sep 9 04:05:14 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: Command timeout on PD 30(e0x35/s16) Path 500000e117037dc2, CDB: 12 00 00 00 60 00: Controller 1 (PERC H800 Adapter)

1085 Sep 9 04:05:14 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2335 Controller event log: PD 30(e0x35/s16) Path 500000e117037dc2 reset (Type 03): Controller 1 (PERC H800 Adapter)

1086 Sep 9 04:05:44 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1087 Sep 9 04:06:26 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1088 Sep 9 04:15:32 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

. . . >>> message repeats

1132 Sep 9 09:00:31 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1133 Sep 9 09:01:55 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1134 Sep 9 09:10:19 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1135 Sep 9 09:11:01 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1136 Sep 9 09:30:37 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1137 Sep 9 09:32:01 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0

1138 Sep 9 09:32:01 NR1PQ8500VTLM01 Server Administrator: Storage Service EventID: 2405 Command timeout on physical disk: Physical Disk 0:0:16 Controller 1, Connector 0
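To see how the command timeouts build up, the EventID 2405 entries can be counted per day and per disk. This is a minimal sketch under the assumption that it runs against the raw messages file (without the line numbers that less -N adds in the listing above). The function name and default path are illustrative.

```shell
# Count "Command timeout" events (EventID 2405) per day and physical disk.
# $1: log file (the default path is an assumption)
summarize_command_timeouts() {
    logfile="${1:-/var/log/messages}"
    awk '
    /Storage Service EventID: 2405 / {
        disk = "unknown disk"
        # Extract e.g. "Physical Disk 0:0:16 Controller 1" from the line.
        if (match($0, /Physical Disk [0-9:]+ Controller [0-9]+/))
            disk = substr($0, RSTART, RLENGTH)
        key = $1 " " $2 ": " disk
        # Remember first-seen order so output follows the log chronology.
        if (!(key in count)) order[++n] = key
        count[key]++
    }
    END { for (i = 1; i <= n; i++) print count[order[i]], order[i] }
    ' "$logfile"
}

# Example:
#   summarize_command_timeouts messages
```

A single disk dominating the counts over several days, as disk 0:0:16 does in this SR, points to the disk as the source of the event flood.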

Resolution (Workaround)

Note: Because this article describes one specific root cause for the semaphore count issue, make sure to only apply the following resolution steps to the specific issues that are covered in this topic.

Before attempting to start OpenManage manually, make sure to solve the semaphore issue first. In this case you have two options:

Option 1: Reboot the DXi

This action will release all the shared resources and zero the counter.

Note that this option works in most scenarios where the semaphore count has been exceeded, unless the root cause must be addressed before the reboot.

SR 1608620 is a good example: a reboot may work initially, but while the bad disk remains in the DXi and keeps generating command timeouts, the customer may face the issue again. In this situation, use the resolution steps outlined in Option 2 below.

Option 2: Increase the semaphore counter

Note that this resolution was applied to a DXi8500. For other platforms, consult PTR 35994, where a request for semaphore count settings for other platforms was filed. If the PTR has not been updated for your platform, escalate to your backline team.

First, save the current counter values. You can collect them using 'cat' as shown below:

# cat /proc/sys/kernel/sem

250 32000 32 128

To raise the limits, execute the following command:

echo 500 32000 256 256 > /proc/sys/kernel/sem

Note: These changes are temporary (rebooting the machine restores the old settings).
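The four numbers in /proc/sys/kernel/sem are, in order, SEMMSL, SEMMNS, SEMOPM, and SEMMNI (as documented in the Linux proc man page). The fourth value is the maximum number of semaphore sets, the limit named in the log message, and the command above raises it from 128 to 256. The following is a minimal sketch that labels the values and saves the current ones to a file before writing new ones. The function names and the backup path /root/sem.before are assumptions.

```shell
# Label the four values of /proc/sys/kernel/sem.
# $1: the four numbers as one string (defaults to the live kernel values)
describe_sem_limits() {
    values="${1:-$(cat /proc/sys/kernel/sem)}"
    set -- $values   # word-split into $1..$4 (local to this function)
    printf 'SEMMSL (max semaphores per set):        %s\n' "$1"
    printf 'SEMMNS (max semaphores system-wide):    %s\n' "$2"
    printf 'SEMOPM (max operations per semop call): %s\n' "$3"
    printf 'SEMMNI (max number of semaphore sets):  %s\n' "$4"
}

# Save the current values to a backup file, then write the new ones.
# $1: new values, $2: sem file, $3: backup file (hypothetical default path)
set_sem_limits() {
    new_values=$1
    sem_file="${2:-/proc/sys/kernel/sem}"
    backup="${3:-/root/sem.before}"
    cat "$sem_file" > "$backup" || return 1
    echo "$new_values" > "$sem_file"
}

# Example (as root):
#   describe_sem_limits
#   set_sem_limits "500 32000 256 256"
```

Keeping the backup file makes it possible to restore the previous values by hand without a reboot.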

Try to bring the OpenManage service back:

Make sure all OpenManage modules are stopped by executing a stop:

sh /opt/dell/srvadmin/sbin/srvadmin-services.sh stop

Now start OpenManage:

sh /opt/dell/srvadmin/sbin/srvadmin-services.sh start

With Option 2:

You can resolve the issue for the customer without rebooting the machine
GUI and DSET will be available so you can identify the physical location of the bad disk for replacement

Additional Information

OpenManage is a Dell product. If you encounter an issue different from the one described in this topic, open a case with Dell. If you require assistance from your backline team, please make sure you gather additional data before you escalate:

Check whether there are any core files related to OpenManage (if so, collect them).

Note: There are some bugs reporting core cases with OpenManage and semaphore issues. Check to see if your case matches any existing bugs.

Collect the following data:

The current semaphore limits set on the DXi. Use the cat command as shown below:

# cat /proc/sys/kernel/sem

Status of the omsa:

# sh /opt/dell/srvadmin/sbin/srvadmin-services.sh status

The semaphores currently allocated in the system and the PID/service information associated with each semaphore. The procedure is described below; save the output of each command.

List the allocated semaphores with ipcs -a. Then, to find the PID associated with each semaphore, take the semaphore ID from that output and run ipcs -s -i with it, followed by ps on the reported PID. The following example identifies which process is using semaphore ID 1651343360:

# ipcs -a

------ Shared Memory Segments --------

key shmid owner perms bytes nattch status

0x00005056 0 root 666 1504 1

0x00005059 32769 root 666 6170192 2

0x79067eb8 65538 root 666 808 81

0x07021999 98307 root 644 1792 2

------ Semaphore Arrays --------

key semid owner perms nsems

0x00000000 1651343360 root 600 1

0x00000000 1651965953 root 600 1

0x00000000 1651998722 root 600 1

0x00000000 1652260867 root 600 1

0x00000000 1652228100 root 600 1

0x00000000 1652097029 root 600 1

0x00000000 1652129798 root 600 1

0x00000000 1652162567 root 600 1

0x00000000 1652195336 root 600 1

0x00000000 1652293641 root 600 1

0x00000000 1652326410 root 600 1

0x00000000 1652359179 root 600 1

0x00000000 1652391948 root 600 1

0x00000000 1652424717 root 600 1

0x00000000 1652457486 root 600 1

0x00000000 1652490255 root 600 1

0x00000000 1652523024 root 600 1

0x00000000 1652686865 root 600 1

0x00000000 1652588562 root 600 1

0x00000000 1652621331 root 600 1

0x00000000 1652654100 root 666 1

0x00000000 1652719637 root 600 1

0x79067eb8 28409878 root 666 1

0x00000000 47677463 nobody 600 1

0x00000000 47710232 nobody 600 1

0x00000000 47743001 nobody 600 1

0x00000000 47775770 nobody 600 1

0x00000000 47808539 nobody 600 1

0x00000000 47841308 nobody 600 1

0x00000000 47874077 nobody 600 1

------ Message Queues --------

key msqid owner perms used-bytes messages

# ipcs -s -i 1651343360

Semaphore Array semid=1651343360

uid=0 gid=0 cuid=0 cgid=0

mode=0600, access_perms=0600

nsems = 1

otime = Not set

ctime = Thu Sep 19 04:02:18 2013

semnum value ncount zcount pid

0 0 1 0 5374

# ps -ef | grep 5374

root 5374 1 0 04:02 ? 00:00:21 /opt/dell/srvadmin/sbin/dsm_sa_datamgrd

root 25699 9234 0 23:15 pts/2 00:00:00 grep 5374

Alternatively, you can use the following command instead of ps (convenient if you want to write a quick script):

# more /proc/5374/cmdline

/opt/dell/srvadmin/sbin/dsm_sa_datamgrd
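The manual steps above can be combined into a quick script. The following is a minimal sketch that assumes the ipcs output layout shown in this topic. The function names are illustrative. Note that the pid column of ipcs -s -i reports the last process that operated on the semaphore, so treat the result as a strong hint rather than proof of which process created the set.

```shell
# Print the semid column from `ipcs -s` output read on stdin.
list_semids() {
    awk '/^0x/ { print $2 }'
}

# Print the pid column from `ipcs -s -i <semid>` output read on stdin.
# Only rows after the "semnum value ncount zcount pid" header are used.
sem_pids() {
    awk 'seen && NF >= 5 { print $5 } /^semnum/ { seen = 1 }'
}

# For every semaphore set, print: semid, pid, command line.
report_sem_owners() {
    ipcs -s | list_semids | while read -r semid; do
        ipcs -s -i "$semid" | sem_pids | sort -u | while read -r pid; do
            if [ -r "/proc/$pid/cmdline" ]; then
                # cmdline is NUL-separated; turn the NULs into spaces.
                cmd=$(tr '\0' ' ' < "/proc/$pid/cmdline")
            else
                cmd='(process not found)'
            fi
            echo "$semid $pid $cmd"
        done
    done
}

# Example (as root), saving the output for the escalation:
#   report_sem_owners > sem_owners.txt
```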