Hi
Unfortunately we are experiencing random reboots in our 3-node cluster (about once a day). As far as I can tell, they are initiated by the IPMI watchdog. I know that disabling the watchdog, and therefore fencing, is a bad idea while running HA, so I would prefer to disable HA as well. But how do I do that?
I removed all HA resources and groups, but I could not find out how to disable HA completely.
And what would be the best way to disable the watchdog?
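In case it matters, this is roughly what I was planning to run on each node. It is just a sketch based on my understanding of the HA stack, so please correct me if it is incomplete or wrong:

# stop and disable the HA services on every node (LRM first, then CRM);
# my understanding is that watchdog-mux then releases the watchdog
systemctl disable --now pve-ha-lrm.service
systemctl disable --now pve-ha-crm.service

# if the IPMI watchdog was enabled via WATCHDOG_MODULE=ipmi_watchdog in
# /etc/default/pve-ha-manager, revert that and keep the module from loading again
echo "blacklist ipmi_watchdog" > /etc/modprobe.d/ipmi_watchdog-blacklist.conf
update-initramfs -u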
I'd like to keep the cluster functionality, like migrating a VM from one host to another or managing all three hosts through the same web interface, but I can live with having to move the VMs by hand in the rare case of a hardware failure if that stops the random reboots.
The reason I think the reboot is initiated by the IPMI watchdog is these entries in /var/log/auth.log:
Feb 2 16:03:32 pm3 systemd-logind[1552]: Power key pressed.
Feb 2 16:03:32 pm3 systemd-logind[1552]: Powering Off...
Feb 2 16:03:32 pm3 systemd-logind[1552]: System is powering down.
Feb 2 16:03:36 pm3 sshd[1933]: Received signal 15; terminating.
And in the IPMI event log I see related watchdog entries at the same time. Here is the syslog from the affected node (pm3) around that time:
Feb 2 16:01:00 pm3 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:01:00 pm3 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:02:00 pm3 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:02:00 pm3 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:03:00 pm3 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:03:00 pm3 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target Graphical Interface.
Feb 2 16:03:32 pm3 systemd[1]: Closed Load/Save RF Kill Switch Status /dev/rfkill Watch.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target Timers.
Feb 2 16:03:32 pm3 systemd[1]: Stopped Daily PVE download activities.
Feb 2 16:03:32 pm3 systemd[1]: Stopped Proxmox VE replication runner.
Feb 2 16:03:32 pm3 systemd[1]: Stopped Daily Cleanup of Temporary Directories.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target RPC Port Mapper.
Feb 2 16:03:32 pm3 systemd[1]: Unmounting RPC Pipe File System...
Feb 2 16:03:32 pm3 systemd[1]: Removed slice system-ceph\x2ddisk.slice.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target Multi-User System.
Feb 2 16:03:32 pm3 systemd[1]: Stopping Regular background program processing daemon...
Feb 2 16:03:32 pm3 systemd[1]: Stopped target ZFS startup target.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target ZFS pool import target.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target ceph target allowing to start/stop all ceph*@.service instances at once.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-fuse@.service instances at once.
Feb 2 16:03:32 pm3 systemd[1]: Stopping Login Service...
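In case it helps, this is roughly how I look at those watchdog entries from the node itself (assuming ipmitool is installed and can talk to the local BMC; the same entries are visible in the BMC web interface):

# list the BMC system event log, where the watchdog resets show up
ipmitool sel elist

# show the current configuration/state of the BMC watchdog timer
ipmitool mc watchdog get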
Is there a way to see whether the watchdog-mux daemon stopped resetting the watchdog timer on purpose (e.g. because it no longer had a connection to the other nodes) or whether this was a bug?
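So far I have only looked at the journal around the time of the reset, roughly like this (not sure this is the right place to look):

# check whether watchdog-mux, the HA services or corosync logged anything shortly before the reset
journalctl -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm -u corosync --since "2019-02-02 15:50" --until "2019-02-02 16:05"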
EDIT:
I just found out that there are interesting syslog entries at the same time on another node (pm1). Does anybody know what exactly this means?
Feb 2 16:03:00 pm1 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:03:00 pm1 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:03:33 pm1 pmxcfs[2079]: [status] notice: received log
Feb 2 16:03:33 pm1 pmxcfs[2079]: [status] notice: received log
Feb 2 16:03:38 pm1 corosync[2242]: notice [TOTEM ] A processor failed, forming new configuration.
Feb 2 16:03:38 pm1 corosync[2242]: [TOTEM ] A processor failed, forming new configuration.
Feb 2 16:03:40 pm1 corosync[2242]: notice [TOTEM ] A new membership (10.1.0.2:644) was formed. Members left: 3
Feb 2 16:03:40 pm1 corosync[2242]: notice [TOTEM ] Failed to receive the leave message. failed: 3
Feb 2 16:03:40 pm1 corosync[2242]: [TOTEM ] A new membership (10.1.0.2:644) was formed. Members left: 3
Feb 2 16:03:40 pm1 corosync[2242]: [TOTEM ] Failed to receive the leave message. failed: 3
Feb 2 16:03:40 pm1 corosync[2242]: warning [CPG ] downlist left_list: 1 received
Feb 2 16:03:40 pm1 corosync[2242]: warning [CPG ] downlist left_list: 1 received
Feb 2 16:03:40 pm1 corosync[2242]: [CPG ] downlist left_list: 1 received
Feb 2 16:03:40 pm1 corosync[2242]: notice [QUORUM] Members[2]: 1 2
Feb 2 16:03:40 pm1 corosync[2242]: notice [MAIN ] Completed service synchronization, ready to provide service.
Feb 2 16:03:40 pm1 corosync[2242]: [CPG ] downlist left_list: 1 received
Feb 2 16:03:40 pm1 pmxcfs[2079]: [dcdb] notice: members: 1/2079, 2/2244
Feb 2 16:03:40 pm1 pmxcfs[2079]: [dcdb] notice: starting data syncronisation
Feb 2 16:03:40 pm1 corosync[2242]: [QUORUM] Members[2]: 1 2
Feb 2 16:03:40 pm1 corosync[2242]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: cpg_send_message retried 1 times
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: members: 1/2079, 2/2244
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: starting data syncronisation
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: received sync request (epoch 1/2079/00000006)
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: received sync request (epoch 1/2079/00000006)
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: received all states
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: leader is 1/2079
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: synced members: 1/2079, 2/2244
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: start sending inode updates
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: sent all (0) updates
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: all data is up to date
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: dfsm_deliver_queue: queue length 5
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: received all states
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: all data is up to date
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: dfsm_deliver_queue: queue length 11
Feb 2 16:03:51 pm1 pvestatd[2585]: got timeout
Feb 2 16:03:52 pm1 pvestatd[2585]: status update time (6.052 seconds)
Feb 2 16:03:55 pm1 ceph-osd[2714]: 2019-02-02 16:03:55.578187 7efc21d7f700 -1 osd.5 1179 heartbeat_check: no reply from 10.0.0.3:6807 osd.0 since back 2019-02-02 16:03:35.060656 front 2019-02-02 16:03:35.060656 (cutoff 2019-02-02 16:03:35.578180)
Feb 2 16:03:55 pm1 ceph-osd[2714]: 2019-02-02 16:03:55.578205 7efc21d7f700 -1 osd.5 1179 heartbeat_check: no reply from 10.0.0.3:6803 osd.1 since back 2019-02-02 16:03:35.060656 front 2019-02-02 16:03:35.060656 (cutoff 2019-02-02 16:03:35.578180)
Feb 2 16:03:55 pm1 ceph-osd[2872]: 2019-02-02 16:03:55.796551 7fa1df31f700 -1 osd.4 1179 heartbeat_check: no reply from 10.0.0.3:6807 osd.0 since back 2019-02-02 16:03:35.584432 front 2019-02-02 16:03:35.584432 (cutoff 2019-02-02 16:03:35.796539)
Feb 2 16:03:55 pm1 ceph-osd[2872]: 2019-02-02 16:03:55.796601 7fa1df31f700 -1 osd.4 1179 heartbeat_check: no reply from 10.0.0.3:6803 osd.1 since back 2019-02-02 16:03:35.584432 front 2019-02-02 16:03:35.584432 (cutoff 2019-02-02 16:03:35.796539)
Feb 2 16:03:56 pm1 ceph-osd[2714]: 2019-02-02 16:03:56.594112 7efc21d7f700 -1 osd.5 1179 heartbeat_check: no reply from 10.0.0.3:6807 osd.0 since back 2019-02-02 16:03:35.060656 front 2019-02-02 16:03:35.060656 (cutoff 2019-02-02 16:03:36.594110)
Feb 2 16:03:56 pm1 ceph-osd[2714]: 2019-02-02 16:03:56.594143 7efc21d7f700 -1 osd.5 1179 heartbeat_check: no reply from 10.0.0.3:6803 osd.1 since back 2019-02-02 16:03:35.060656 front 2019-02-02 16:03:35.060656 (cutoff 2019-02-02 16:03:36.594110)
Feb 2 16:03:57 pm1 kernel: [79008.502749] libceph: osd0 down
Feb 2 16:03:57 pm1 kernel: [79008.502750] libceph: osd1 down
Feb 2 16:04:00 pm1 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:04:00 pm1 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:04:13 pm1 kernel: [79024.260638] libceph: mon2 10.0.0.3:6789 session lost, hunting for new mon
Feb 2 16:04:13 pm1 kernel: [79024.261715] libceph: mon1 10.0.0.2:6789 session established
Feb 2 16:05:00 pm1 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:05:00 pm1 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:05:33 pm1 pve-ha-crm[2900]: successfully acquired lock 'ha_manager_lock'
Feb 2 16:05:33 pm1 pve-ha-crm[2900]: watchdog active
Feb 2 16:05:33 pm1 pve-ha-crm[2900]: status change slave => master
Feb 2 16:05:33 pm1 pve-ha-crm[2900]: node 'pm3': state changed from 'online' => 'unknown'
Feb 2 16:06:00 pm1 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:06:00 pm1 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:06:33 pm1 pve-ha-crm[2900]: service 'ct:101': state changed from 'started' to 'fence'
Feb 2 16:06:33 pm1 pve-ha-crm[2900]: service 'ct:105': state changed from 'started' to 'fence'
Feb 2 16:06:33 pm1 pve-ha-crm[2900]: service 'ct:116': state changed from 'started' to 'fence'
Feb 2 16:06:33 pm1 pve-ha-crm[2900]: node 'pm3': state changed from 'unknown' => 'fence'
Feb 2 16:06:33 pm1 postfix/pickup[1974582]: 288844C963: uid=0 from=<root>
Feb 2 16:06:33 pm1 postfix/cleanup[2001636]: 288844C963: message-id=<20190202150633.288844C963@pm1>
Feb 2 16:06:33 pm1 postfix/qmgr[2222]: 288844C963: from=<root@pm1> size=3159, nrcpt=1 (queue active)
Feb 2 16:06:33 pm1 pvemailforward[2001643]: forward mail to <sysadmin@XXX>
Feb 2 16:06:33 pm1 postfix/pickup[1974582]: 631A14C964: uid=65534 from=<root>
Feb 2 16:06:33 pm1 postfix/cleanup[2001636]: 631A14C964: message-id=<20190202150633.288844C963@pm1>
Feb 2 16:06:33 pm1 postfix/qmgr[2222]: 631A14C964: from=<root@pm1>, size=3320, nrcpt=1 (queue active)
Feb 2 16:06:33 pm1 postfix/local[2001642]: 288844C963: to=<root@pm1>, orig_to=<root>, relay=local, delay=0.25, delays=0.01/0.02/0/0.22, dsn=2.0.0, status=sent (delivered to command: /usr/bin/pvemailforward)
Feb 2 16:06:33 pm1 postfix/qmgr[2222]: 288844C963: removed
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: successfully acquired lock 'ha_agent_pm3_lock'
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: fencing: acknowledged - got agent lock for node 'pm3'
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: node 'pm3': state changed from 'fence' => 'unknown'
Feb 2 16:06:43 pm1 postfix/pickup[1974582]: 2BCF34CA44: uid=0 from=<root>
Feb 2 16:06:43 pm1 postfix/cleanup[2001636]: 2BCF34CA44: message-id=<20190202150643.2BCF34CA44@pm1>
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: recover service 'ct:101' from fenced node 'pm3' to node 'pm2'
Feb 2 16:06:43 pm1 postfix/qmgr[2222]: 2BCF34CA44: from=<root@pm1>, size=3241, nrcpt=1 (queue active)
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: service 'ct:101': state changed from 'fence' to 'started' (node = pm2)
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: recover service 'ct:105' from fenced node 'pm3' to node 'pm2'
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: service 'ct:105': state changed from 'fence' to 'started' (node = pm2)
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: recover service 'ct:116' from fenced node 'pm3' to node 'pm2'
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: service 'ct:116': state changed from 'fence' to 'started' (node = pm2)
Feb 2 16:06:43 pm1 pvemailforward[2001934]: forward mail to <sysadmin@XXX>
Feb 2 16:06:43 pm1 postfix/pickup[1974582]: 605A94C9B4: uid=65534 from=<root>
Feb 2 16:06:43 pm1 postfix/cleanup[2001636]: 605A94C9B4: message-id=<20190202150643.2BCF34CA44@pm1>
Feb 2 16:06:43 pm1 postfix/qmgr[2222]: 605A94C9B4: from=<root@pm1>, size=3402, nrcpt=1 (queue active)
Feb 2 16:06:43 pm1 postfix/local[2001642]: 2BCF34CA44: to=<root@pm1>, orig_to=<root>, relay=local, delay=0.22, delays=0/0/0/0.22, dsn=2.0.0, status=sent (delivered to command: /usr/bin/pvemailforward)
Feb 2 16:06:43 pm1 postfix/qmgr[2222]: 2BCF34CA44: removed
Feb 2 16:06:51 pm1 pmxcfs[2079]: [status] notice: received log
Feb 2 16:06:51 pm1 pmxcfs[2079]: [status] notice: received log
Feb 2 16:06:51 pm1 pmxcfs[2079]: [status] notice: received log
Feb 2 16:07:00 pm1 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:07:00 pm1 systemd[1]: Started Proxmox VE replication runner.
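To double-check that nothing is left over in the HA configuration after I removed the resources and groups, I plan to look at the following (my assumption is that these are the right places):

# current view of the HA manager
ha-manager status

# HA resource and group definitions in the cluster filesystem
cat /etc/pve/ha/resources.cfg
cat /etc/pve/ha/groups.cfg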
btw. I'm using:
pve-manager/5.3-8/2929af8e (running kernel: 4.15.18-10-pve)
Thanks
Raffael