PVE4.0 Cluster iDRAC Watchdog

churnd

I'm trying to set up our cluster of 3 Dell R730s and I'm following the wiki here: https://pve.proxmox.com/wiki/High_A...x#Dell_IDrac_.28module_.22ipmi_watchdog.22.29

I do have OMSA installed, so I edited dcwddy64.ini:
Code:
cat /etc/opt/dell/srvadmin/srvadmin-isvc/ini/dcwddy64.ini
;--------------------------------------------------------------------
;
;          Dell Inc. PROPRIETARY INFORMATION
; This software is supplied under the terms of a license agreement or
; nondisclosure agreement with Dell Inc. and may not
; be copied or disclosed except in accordance with the terms of that
; agreement.
;
; Copyright (c) 1995-2011 Dell Inc.
; All Rights Reserved.
;
; Module Name:
;
; DCWDDY64.INI
;
; Abstract/Purpose:
;
; Instrumentation Service Watchdog ("Dynamic" Data) INI file
;
;--------------------------------------------------------------------


;[HWC Configuration]
;watchDogObj.settings=0
;watchDogObj.expiryTime=480

[HWC Configuration]
watchDogObj.settings=0
watchDogObj.expiryTime=480

One thing I noticed is that the location is different from the wiki: there was no file at /opt/dell/srvadmin/etc/srvadmin-isvc/ini/dcwddy64.ini. As you can see, I commented out the watchdog section and rebooted the server, but it seems OMSA added the settings back. How should I correct this?

I also noticed "idracadm7 getsysinfo -w" returns no recovery action for me:
Code:
# idracadm7 getsysinfo -w

Watchdog Information:

Recovery Action         = None
Present countdown value = 474 seconds
Initial countdown value = 480 seconds
 
/opt/dell/srvadmin/etc is a symlink to /etc/opt/dell/srvadmin,
so it's the same file.
Strange that you don't have /opt/dell/srvadmin/etc/srvadmin-isvc/ini/dcwddy64.ini.
Are you using OpenManage 7.4?

I'm using 7.4, and it doesn't overwrite the config file.
 
I didn't install everything, just the basic CLI stuff to get OMSA working. Basically just did this:

apt-get install srvadmin-idrac7 srvadmin-storageservices srvadmin-base ipmitool srvadmin-omcommon

Yes, using 7.4.0-1. My config file was overwritten on all 3 servers.
 
The only difference for me is that I installed all the packages:
"apt-get install srvadmin-all"

That installs the "srvadmin-isvc" package, which seems to be the one with the config file.

Is this package installed for you? (Maybe as a dependency of other packages you have installed.)

I can confirm that a reboot doesn't overwrite the config file for me.
 
All these got installed as dependencies:

Code:
# dpkg -l | grep srvadmin
ii  srvadmin-base                  7.4.0                          amd64        Meta package for installing the Server Agent
ii  srvadmin-deng                  7.4.0-1                        amd64        Dell OpenManage Data Engine
ii  srvadmin-deng-snmp             7.4.0-1                        amd64        Dell OpenManage Data Engine SNMP
ii  srvadmin-hapi                  7.4.0-1                        amd64        Dell OpenManage Hardware Abstraction Programming Interface
ii  srvadmin-idrac-snmp            7.4.0-1                        amd64        iDRAC SNMP components
ii  srvadmin-idrac-vmcli           7.4.0-1                        amd64        CLI utils from the management station to the iDRAC
ii  srvadmin-idrac7                7.4.0-1                        amd64        Meta package for iDRAC
ii  srvadmin-idracadm7             7.4.0-1                        amd64        The command line user interface to the Remote Access Controller (RAC).
ii  srvadmin-isvc                  7.4.0-1                        amd64        Dell OpenManage Instrumentation Services
ii  srvadmin-isvc-snmp             7.4.0-1                        amd64        Dell OpenManage Instrumentation Services SNMP
ii  srvadmin-nvme                  7.4.0-1                        amd64        Libraries to manage NVMe devices
ii  srvadmin-omacore               7.4.0-1                        amd64        Server Administrator CLI
ii  srvadmin-omacs                 7.4.0-2                        amd64        Dell OpenManage Server Administrator OMACS
ii  srvadmin-omcommon              7.4.0-2                        amd64        Dell OpenManage Server Administrator Common Framework
ii  srvadmin-omilcore              7.4.0-1                        all          Dell OpenManage Server Administrator Install Core
ii  srvadmin-ominst                7.4.0-1                        amd64        OMINST
ii  srvadmin-rac-components        7.4.0-1                        amd64        Remote Access Controller SNMP components for Server Administrator.
ii  srvadmin-racadm4               7.4.0-1                        amd64        The command line user interface to the Remote Access Controller (RAC).
ii  srvadmin-racdrsc               7.4.0-1                        amd64        Remote Access CLI and Web Plugin to Server Administrator
ii  srvadmin-realssd               7.4.0-1                        amd64        RealSSD package for storage management
ii  srvadmin-rnasoap               7.4.0-1                        amd64        Fluid Cache Management
ii  srvadmin-smcommon              7.4.0-1                        amd64        Storage Management common files for GUI and CLI
ii  srvadmin-storage               7.4.0-1                        amd64        Storage Management accessors package
ii  srvadmin-storage-cli           7.4.0-1                        amd64        Storage Management cli component
ii  srvadmin-storage-snmp          7.4.0-1                        amd64        Storage Management SNMP component
ii  srvadmin-storageservices       7.4.0                          amd64        Meta package for installing the Server Administrator Storage Services feature
ii  srvadmin-storageservices-cli   7.4.0                          amd64        Meta package for storageservices-cli
ii  srvadmin-storageservices-snmp  7.4.0                          amd64        Meta package for storageservices-snmp
ii  srvadmin-storelib              7.4.0-1                        amd64        StoreLib package for storage management
ii  srvadmin-storelib-sysfs        7.4.0-1                        amd64        Metapackage for libsysfs2
ii  srvadmin-xmlsup                7.4.0-1                        amd64        Dell OpenManage XML Support SDK

Maybe it has something to do with the iDRAC settings? I did disable the Automated System Recovery Agent per the wiki as well. I haven't changed much of anything else in iDRAC except general identifying info like the hostname.
 
AFAIK, the only option in iDRAC is ASR.
The watchdog timer is in OpenManage.

Maybe you can try to force it to 60s:
# omconfig system recovery timer=60

(omconfig doesn't accept a lower value.)

Then try to comment it out again in the config file.
 
Hi,
I was able to reproduce your problem.

I have updated the wiki:
https://pve.proxmox.com/wiki/High_Availability_Cluster_4.x

Instead of commenting out the settings in the config file, set the value to 20 (the minimum allowed).
Then OMSA doesn't seem to override the Proxmox value after a reboot or a dataeng service restart.
(I have tested it on 3 different Dell servers, R710, R630 and R815, with the latest Proxmox 4.1 and the ipmi module loaded.)

Can you test it?
 
Edit: I spoke too soon. After some time, the counters still get reset by OpenManage ...

I don't know why it's working on one of my test servers (an old PowerEdge 2950).

I'll ask the Dell PowerEdge mailing list.
 
This script:
/opt/dell/srvadmin/lib/srvadmin-isvc/unregister-isvc.sh

disables a lot of DLLs (mainly ones that retrieve info), but it fixes the problem.

I need to look inside it to find which DLL we actually need to remove.
 
OK, I think I have found a clean way to disable only the module that is needed:

/opt/dell/srvadmin/sbin/dcecfg command=removepopalias aliasname=dcifru


I'll keep it running like this all day, because with all modules loaded the timer is sometimes rearmed after 15-20 minutes, and I don't know why.
 
I'm trying this now. Any update from your end?

To clarify, all I've tried is running the command:

Code:
/opt/dell/srvadmin/sbin/dcecfg command=removepopalias aliasname=dcifru

then rebooting.
 
I just tried it on one node and, after a reboot, I see this:

Code:
# /opt/dell/srvadmin/bin/idracadm7 getsysinfo -w

Watchdog Information:
Recovery Action         = None
Present countdown value = 475 seconds
Initial countdown value = 480 seconds
 
As usual, I missed one critical step in the wiki, setting /etc/default/pve-ha-manager:

Code:
# select watchdog module (default is softdog)
WATCHDOG_MODULE=ipmi_watchdog
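
For anyone scripting this step across several nodes, here is a minimal sketch of the edit above. `set_watchdog_module` is a hypothetical helper name, not something Proxmox ships, and it assumes GNU `sed` (as on Debian). It only rewrites or appends the `WATCHDOG_MODULE=` line in the file you pass it; a reboot is still needed afterwards, as described in this post.

```shell
#!/bin/sh
# Sketch only: set_watchdog_module is a hypothetical helper, not a Proxmox tool.
# Usage (as root): set_watchdog_module /etc/default/pve-ha-manager ipmi_watchdog
set_watchdog_module() {
    file="$1"      # defaults file to edit
    module="$2"    # watchdog module name to select
    if grep -q '^[#[:space:]]*WATCHDOG_MODULE=' "$file" 2>/dev/null; then
        # An assignment exists (possibly commented out): rewrite it in place.
        sed -i "s|^[#[:space:]]*WATCHDOG_MODULE=.*|WATCHDOG_MODULE=${module}|" "$file"
    else
        # No assignment yet: append one at the end of the file.
        printf 'WATCHDOG_MODULE=%s\n' "$module" >> "$file"
    fi
}
```

Other lines in the file, including the "# select watchdog module" comment, are left untouched.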

I set that & rebooted, now I see:

Code:
root@node3:~# /opt/dell/srvadmin/bin/idracadm7 getsysinfo -w

Watchdog Information:
Recovery Action         = Reboot
Present countdown value = 8 seconds
Initial countdown value = 10 seconds

root@node3:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      9 sec

However, I see my recovery action is to Reboot instead of Power Cycle. How can I change that?
 
https://pve.proxmox.com/wiki/High_Availability_Cluster_4.x

Edit /etc/modprobe.d/ipmi_watchdog.conf (simply create the file if it doesn't exist):
options ipmi_watchdog action=power_cycle
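
If you want to roll that file out with a script, something like the following could work. It is only a sketch: `write_ipmi_watchdog_action` is a made-up helper name, and the list of accepted action values (reset, none, power_cycle, power_off) is what I understand the ipmi_watchdog module to support, so treat it as an assumption and check the kernel documentation for your version. Options in /etc/modprobe.d only take effect the next time the module is loaded.

```shell
#!/bin/sh
# Sketch only: write_ipmi_watchdog_action is a hypothetical helper.
# Usage (as root):
#   write_ipmi_watchdog_action /etc/modprobe.d/ipmi_watchdog.conf power_cycle
write_ipmi_watchdog_action() {
    conf="$1"      # modprobe config file to (over)write
    action="$2"    # desired watchdog action
    case "$action" in
        # Assumed set of valid values for the module's "action" parameter.
        reset|none|power_cycle|power_off) ;;
        *) echo "unsupported action: $action" >&2; return 1 ;;
    esac
    # Overwrites the file with a single options line.
    printf 'options ipmi_watchdog action=%s\n' "$action" > "$conf"
}
```

The helper refuses unknown values before touching the file, so a typo does not leave a broken options line behind.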

OK I didn't see THAT part in the wiki. Added that file & now the output looks as it should:

Code:
root@node3:~# /opt/dell/srvadmin/bin/idracadm7 getsysinfo -w

Watchdog Information:
Recovery Action         = Power Cycle
Present countdown value = 9 seconds
Initial countdown value = 10 seconds


root@node3:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      10 sec
Present Countdown:      9 sec
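
Since OMSA can apparently reset these values behind your back, it may be worth checking the output automatically, for example after each reboot. The sketch below parses the `ipmitool mc watchdog get` output shown above. `watchdog_field` and `check_watchdog` are hypothetical helper names, and the parsing assumes the "Name: value" layout from this post, which may differ in other ipmitool versions.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and read ipmitool output on stdin.

# Print the value of one "Name: value" field, e.g. 'Initial Countdown'.
watchdog_field() {
    awk -F': *' -v key="$1" '$1 == key { print $2; exit }'
}

# Succeed only if the timer is running and its action starts with $1.
# Usage: ipmitool mc watchdog get | check_watchdog 'Power Cycle'
check_watchdog() {
    out=$(cat)
    state=$(printf '%s\n' "$out" | watchdog_field 'Watchdog Timer Is')
    action=$(printf '%s\n' "$out" | watchdog_field 'Watchdog Timer Actions')
    case "$state" in
        Started/Running*) ;;
        *) echo "watchdog not running: $state" >&2; return 1 ;;
    esac
    case "$action" in
        "$1"*) ;;
        *) echo "unexpected watchdog action: $action" >&2; return 1 ;;
    esac
}
```

A non-zero exit status from `check_watchdog` means the timer is stopped or the action is not the one you asked for, which is what you would see if the settings were lost.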
 
Resurrecting an old thread. Has there been any testing done with OMSA 8? It's changed a bit... for example, the idracadm command is no longer there (intentionally).
 
I haven't done extensive testing on this because my cluster is in production. I have a 3-node cluster. One node is using OMSA 8.4, and the other two are still on 7.4. I haven't upgraded the other two due to some issues with 8.4 (commands showing weird output that shouldn't be there, iDRAC commands missing, etc.). Twice now, I've seen the node that's on 8.4 lose its watchdog timer settings after a reboot. The other two nodes don't have this problem.
 
