Hi
Unfortunately we are experiencing random reboots in our 3-node cluster (about once a day). As far as I can tell, they are initiated by the IPMI watchdog. I know that disabling the watchdog, and therefore fencing, is a bad idea while running HA, so I would prefer to disable HA as well. But how do I do that?
I removed all HA resources and groups, but I could not find out how to disable HA completely.
And what would be the best way to disable the watchdog?
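In case it matters, this is roughly what I was planning to run on each node. It is just a sketch based on my understanding of the HA stack, so please correct me if it is incomplete or wrong:

# stop and disable the HA services on every node (LRM first, then CRM);
# my understanding is that watchdog-mux then releases the watchdog
systemctl disable --now pve-ha-lrm.service
systemctl disable --now pve-ha-crm.service

# if the IPMI watchdog was enabled via WATCHDOG_MODULE=ipmi_watchdog in
# /etc/default/pve-ha-manager, revert that and keep the module from loading again
echo "blacklist ipmi_watchdog" > /etc/modprobe.d/ipmi_watchdog-blacklist.conf
update-initramfs -u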
I'd like to keep the cluster functionality, like migrating a VM from one host to another or managing all three hosts through the same web interface, but I can live with having to move the VMs by hand in the rare case of a hardware failure if that stops the random reboots.
The reason I think the reboot is initiated by the IPMI watchdog is these entries in /var/log/auth.log:
Feb 2 16:03:32 pm3 systemd-logind[1552]: Power key pressed.
Feb 2 16:03:32 pm3 systemd-logind[1552]: Powering Off...
Feb 2 16:03:32 pm3 systemd-logind[1552]: System is powering down.
Feb 2 16:03:36 pm3 sshd[1933]: Received signal 15; terminating.
And in the IPMI event log I see related watchdog entries at the same time. Here is the syslog from the affected node (pm3) around that time:
Feb 2 16:01:00 pm3 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:01:00 pm3 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:02:00 pm3 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:02:00 pm3 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:03:00 pm3 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:03:00 pm3 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target Graphical Interface.
Feb 2 16:03:32 pm3 systemd[1]: Closed Load/Save RF Kill Switch Status /dev/rfkill Watch.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target Timers.
Feb 2 16:03:32 pm3 systemd[1]: Stopped Daily PVE download activities.
Feb 2 16:03:32 pm3 systemd[1]: Stopped Proxmox VE replication runner.
Feb 2 16:03:32 pm3 systemd[1]: Stopped Daily Cleanup of Temporary Directories.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target RPC Port Mapper.
Feb 2 16:03:32 pm3 systemd[1]: Unmounting RPC Pipe File System...
Feb 2 16:03:32 pm3 systemd[1]: Removed slice system-ceph\x2ddisk.slice.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target Multi-User System.
Feb 2 16:03:32 pm3 systemd[1]: Stopping Regular background program processing daemon...
Feb 2 16:03:32 pm3 systemd[1]: Stopped target ZFS startup target.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target ZFS pool import target.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target ceph target allowing to start/stop all ceph*@.service instances at once.
Feb 2 16:03:32 pm3 systemd[1]: Stopped target ceph target allowing to start/stop all ceph-fuse@.service instances at once.
Feb 2 16:03:32 pm3 systemd[1]: Stopping Login Service...
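In case it helps, this is roughly how I look at those watchdog entries from the node itself (assuming ipmitool is installed and can talk to the local BMC; the same entries are visible in the BMC web interface):

# list the BMC system event log, where the watchdog resets show up
ipmitool sel elist

# show the current configuration/state of the BMC watchdog timer
ipmitool mc watchdog get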
Is there a way to see whether the watchdog-mux daemon stopped resetting the watchdog timer on purpose (e.g. because it no longer had a connection to the other nodes) or whether this was a bug?
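So far I have only looked at the journal around the time of the reset, roughly like this (not sure this is the right place to look):

# check whether watchdog-mux, the HA services or corosync logged anything shortly before the reset
journalctl -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm -u corosync --since "2019-02-02 15:50" --until "2019-02-02 16:05"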
EDIT:
I just found out that there are interesting syslog entries at the same time on another node (pm1). Does anybody know what exactly this means?
Feb 2 16:03:00 pm1 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:03:00 pm1 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:03:33 pm1 pmxcfs[2079]: [status] notice: received log
Feb 2 16:03:33 pm1 pmxcfs[2079]: [status] notice: received log
Feb 2 16:03:38 pm1 corosync[2242]: notice [TOTEM ] A processor failed, forming new configuration.
Feb 2 16:03:38 pm1 corosync[2242]: [TOTEM ] A processor failed, forming new configuration.
Feb 2 16:03:40 pm1 corosync[2242]: notice [TOTEM ] A new membership (10.1.0.2:644) was formed. Members left: 3
Feb 2 16:03:40 pm1 corosync[2242]: notice [TOTEM ] Failed to receive the leave message. failed: 3
Feb 2 16:03:40 pm1 corosync[2242]: [TOTEM ] A new membership (10.1.0.2:644) was formed. Members left: 3
Feb 2 16:03:40 pm1 corosync[2242]: [TOTEM ] Failed to receive the leave message. failed: 3
Feb 2 16:03:40 pm1 corosync[2242]: warning [CPG ] downlist left_list: 1 received
Feb 2 16:03:40 pm1 corosync[2242]: warning [CPG ] downlist left_list: 1 received
Feb 2 16:03:40 pm1 corosync[2242]: [CPG ] downlist left_list: 1 received
Feb 2 16:03:40 pm1 corosync[2242]: notice [QUORUM] Members[2]: 1 2
Feb 2 16:03:40 pm1 corosync[2242]: notice [MAIN ] Completed service synchronization, ready to provide service.
Feb 2 16:03:40 pm1 corosync[2242]: [CPG ] downlist left_list: 1 received
Feb 2 16:03:40 pm1 pmxcfs[2079]: [dcdb] notice: members: 1/2079, 2/2244
Feb 2 16:03:40 pm1 pmxcfs[2079]: [dcdb] notice: starting data syncronisation
Feb 2 16:03:40 pm1 corosync[2242]: [QUORUM] Members[2]: 1 2
Feb 2 16:03:40 pm1 corosync[2242]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: cpg_send_message retried 1 times
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: members: 1/2079, 2/2244
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: starting data syncronisation
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: received sync request (epoch 1/2079/00000006)
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: received sync request (epoch 1/2079/00000006)
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: received all states
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: leader is 1/2079
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: synced members: 1/2079, 2/2244
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: start sending inode updates
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: sent all (0) updates
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: all data is up to date
Feb 2 16:03:41 pm1 pmxcfs[2079]: [dcdb] notice: dfsm_deliver_queue: queue length 5
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: received all states
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: all data is up to date
Feb 2 16:03:41 pm1 pmxcfs[2079]: [status] notice: dfsm_deliver_queue: queue length 11
Feb 2 16:03:51 pm1 pvestatd[2585]: got timeout
Feb 2 16:03:52 pm1 pvestatd[2585]: status update time (6.052 seconds)
Feb 2 16:03:55 pm1 ceph-osd[2714]: 2019-02-02 16:03:55.578187 7efc21d7f700 -1 osd.5 1179 heartbeat_check: no reply from 10.0.0.3:6807 osd.0 since back 2019-02-02 16:03:35.060656 front 2019-02-02 16:03:35.060656 (cutoff 2019-02-02 16:03:35.578180)
Feb 2 16:03:55 pm1 ceph-osd[2714]: 2019-02-02 16:03:55.578205 7efc21d7f700 -1 osd.5 1179 heartbeat_check: no reply from 10.0.0.3:6803 osd.1 since back 2019-02-02 16:03:35.060656 front 2019-02-02 16:03:35.060656 (cutoff 2019-02-02 16:03:35.578180)
Feb 2 16:03:55 pm1 ceph-osd[2872]: 2019-02-02 16:03:55.796551 7fa1df31f700 -1 osd.4 1179 heartbeat_check: no reply from 10.0.0.3:6807 osd.0 since back 2019-02-02 16:03:35.584432 front 2019-02-02 16:03:35.584432 (cutoff 2019-02-02 16:03:35.796539)
Feb 2 16:03:55 pm1 ceph-osd[2872]: 2019-02-02 16:03:55.796601 7fa1df31f700 -1 osd.4 1179 heartbeat_check: no reply from 10.0.0.3:6803 osd.1 since back 2019-02-02 16:03:35.584432 front 2019-02-02 16:03:35.584432 (cutoff 2019-02-02 16:03:35.796539)
Feb 2 16:03:56 pm1 ceph-osd[2714]: 2019-02-02 16:03:56.594112 7efc21d7f700 -1 osd.5 1179 heartbeat_check: no reply from 10.0.0.3:6807 osd.0 since back 2019-02-02 16:03:35.060656 front 2019-02-02 16:03:35.060656 (cutoff 2019-02-02 16:03:36.594110)
Feb 2 16:03:56 pm1 ceph-osd[2714]: 2019-02-02 16:03:56.594143 7efc21d7f700 -1 osd.5 1179 heartbeat_check: no reply from 10.0.0.3:6803 osd.1 since back 2019-02-02 16:03:35.060656 front 2019-02-02 16:03:35.060656 (cutoff 2019-02-02 16:03:36.594110)
Feb 2 16:03:57 pm1 kernel: [79008.502749] libceph: osd0 down
Feb 2 16:03:57 pm1 kernel: [79008.502750] libceph: osd1 down
Feb 2 16:04:00 pm1 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:04:00 pm1 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:04:13 pm1 kernel: [79024.260638] libceph: mon2 10.0.0.3:6789 session lost, hunting for new mon
Feb 2 16:04:13 pm1 kernel: [79024.261715] libceph: mon1 10.0.0.2:6789 session established
Feb 2 16:05:00 pm1 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:05:00 pm1 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:05:33 pm1 pve-ha-crm[2900]: successfully acquired lock 'ha_manager_lock'
Feb 2 16:05:33 pm1 pve-ha-crm[2900]: watchdog active
Feb 2 16:05:33 pm1 pve-ha-crm[2900]: status change slave => master
Feb 2 16:05:33 pm1 pve-ha-crm[2900]: node 'pm3': state changed from 'online' => 'unknown'
Feb 2 16:06:00 pm1 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:06:00 pm1 systemd[1]: Started Proxmox VE replication runner.
Feb 2 16:06:33 pm1 pve-ha-crm[2900]: service 'ct:101': state changed from 'started' to 'fence'
Feb 2 16:06:33 pm1 pve-ha-crm[2900]: service 'ct:105': state changed from 'started' to 'fence'
Feb 2 16:06:33 pm1 pve-ha-crm[2900]: service 'ct:116': state changed from 'started' to 'fence'
Feb 2 16:06:33 pm1 pve-ha-crm[2900]: node 'pm3': state changed from 'unknown' => 'fence'
Feb 2 16:06:33 pm1 postfix/pickup[1974582]: 288844C963: uid=0 from=<root>
Feb 2 16:06:33 pm1 postfix/cleanup[2001636]: 288844C963: message-id=<20190202150633.288844C963@pm1>
Feb 2 16:06:33 pm1 postfix/qmgr[2222]: 288844C963: from=<root@pm1> size=3159, nrcpt=1 (queue active)
Feb 2 16:06:33 pm1 pvemailforward[2001643]: forward mail to <sysadmin@XXX>
Feb 2 16:06:33 pm1 postfix/pickup[1974582]: 631A14C964: uid=65534 from=<root>
Feb 2 16:06:33 pm1 postfix/cleanup[2001636]: 631A14C964: message-id=<20190202150633.288844C963@pm1>
Feb 2 16:06:33 pm1 postfix/qmgr[2222]: 631A14C964: from=<root@pm1>, size=3320, nrcpt=1 (queue active)
Feb 2 16:06:33 pm1 postfix/local[2001642]: 288844C963: to=<root@pm1>, orig_to=<root>, relay=local, delay=0.25, delays=0.01/0.02/0/0.22, dsn=2.0.0, status=sent (delivered to command: /usr/bin/pvemailforward)
Feb 2 16:06:33 pm1 postfix/qmgr[2222]: 288844C963: removed
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: successfully acquired lock 'ha_agent_pm3_lock'
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: fencing: acknowledged - got agent lock for node 'pm3'
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: node 'pm3': state changed from 'fence' => 'unknown'
Feb 2 16:06:43 pm1 postfix/pickup[1974582]: 2BCF34CA44: uid=0 from=<root>
Feb 2 16:06:43 pm1 postfix/cleanup[2001636]: 2BCF34CA44: message-id=<20190202150643.2BCF34CA44@pm1>
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: recover service 'ct:101' from fenced node 'pm3' to node 'pm2'
Feb 2 16:06:43 pm1 postfix/qmgr[2222]: 2BCF34CA44: from=<root@pm1>, size=3241, nrcpt=1 (queue active)
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: service 'ct:101': state changed from 'fence' to 'started' (node = pm2)
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: recover service 'ct:105' from fenced node 'pm3' to node 'pm2'
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: service 'ct:105': state changed from 'fence' to 'started' (node = pm2)
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: recover service 'ct:116' from fenced node 'pm3' to node 'pm2'
Feb 2 16:06:43 pm1 pve-ha-crm[2900]: service 'ct:116': state changed from 'fence' to 'started' (node = pm2)
Feb 2 16:06:43 pm1 pvemailforward[2001934]: forward mail to <sysadmin@XXX>
Feb 2 16:06:43 pm1 postfix/pickup[1974582]: 605A94C9B4: uid=65534 from=<root>
Feb 2 16:06:43 pm1 postfix/cleanup[2001636]: 605A94C9B4: message-id=<20190202150643.2BCF34CA44@pm1>
Feb 2 16:06:43 pm1 postfix/qmgr[2222]: 605A94C9B4: from=<root@pm1>, size=3402, nrcpt=1 (queue active)
Feb 2 16:06:43 pm1 postfix/local[2001642]: 2BCF34CA44: to=<root@pm1>, orig_to=<root>, relay=local, delay=0.22, delays=0/0/0/0.22, dsn=2.0.0, status=sent (delivered to command: /usr/bin/pvemailforward)
Feb 2 16:06:43 pm1 postfix/qmgr[2222]: 2BCF34CA44: removed
Feb 2 16:06:51 pm1 pmxcfs[2079]: [status] notice: received log
Feb 2 16:06:51 pm1 pmxcfs[2079]: [status] notice: received log
Feb 2 16:06:51 pm1 pmxcfs[2079]: [status] notice: received log
Feb 2 16:07:00 pm1 systemd[1]: Starting Proxmox VE replication runner...
Feb 2 16:07:00 pm1 systemd[1]: Started Proxmox VE replication runner.
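To double-check that nothing is left over in the HA configuration after I removed the resources and groups, I plan to look at the following (my assumption is that these are the right places):

# current view of the HA manager
ha-manager status

# HA resource and group definitions in the cluster filesystem
cat /etc/pve/ha/resources.cfg
cat /etc/pve/ha/groups.cfg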
btw. I'm using:
pve-manager/5.3-8/2929af8e (running kernel: 4.15.18-10-pve)
Thanks
Raffael