Hi everybody,
Yesterday we started installing updates on our 7-node cluster (PVE 7). We installed the updates on one node after the other. After the third node had finished installing, all 7 nodes restarted unexpectedly and without a clean shutdown. I suspect (but have not found evidence) that the watchdog and fencing did something unexpected here. I also noticed that new corosync packages were included in the list of updates.
When all servers were back online, the filesystems of every single virtual machine (about 100) were broken beyond repair. We had to restore all of them (Proxmox Backup Server was a huge help there). All the disk images are stored on our external Ceph cluster with the caching mode set to writeback.
Questions are:
1. Any ideas on why all cluster nodes were killed and restarted at once, or any hints for tracking this issue down? And even more important: how can this be prevented?
2. Should we stop pve-ha-lrm and pve-ha-crm (to close the watchdog) before upgrading corosync?
3. Should we disable writeback caching on the virtual disk images?
Thanks in advance for your help.
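For question 2, this is roughly the procedure I had in mind (an assumption on my part, not something I have confirmed is sufficient to avoid fencing):

```shell
# Sketch for question 2: stop the HA services so the watchdog is closed
# before corosync is touched (assumption: stopping both is enough to
# prevent the node from being fenced during the upgrade).
# Run on every node before installing the updates:
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

# ...install the updates / upgrade corosync...

# Afterwards, bring the HA stack back up:
systemctl start pve-ha-crm
systemctl start pve-ha-lrm
```

And for question 3, I assume changing a single disk would be something like `qm set 100 --scsi0 ceph:vm-100-disk-0,cache=none` (VMID, storage, and disk names here are just placeholders for illustration).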
Version info:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-6
pve-kernel-helper: 7.0-6
pve-kernel-5.4: 6.4-5
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
ceph-fuse: 15.2.14-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1