Hello,
Last week, I began rolling out a dist-upgrade to our cluster, after testing it successfully on a few machines. The upgrade was intended to resolve a kernel bug in 4.4 that caused machines to reboot spontaneously. When I was about a third of the way through, however, something terrible happened: nearly the entire cluster, 28 of 29 machines, rebooted at the same time.
Looking at the logs, this is all we see (newest entries first):
Code:
May 12 23:03:46 AF002268 corosync[2404]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
May 12 23:03:44 AF002268 corosync[2404]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
May 12 23:03:25 AF002268 CRON[46124]: pam_unix(cron:session): session closed for user root
May 12 23:03:19 AF002268 corosync[2404]: [TOTEM ] Retransmit List: 29b30 29938 29999 29a30 29a32 29a36 29a2e 29a2f 29a31 29a35 29b2f
May 12 23:03:19 AF002268 corosync[2404]: [TOTEM ] Retransmit List: 29b2f 29a2e 29a2f 29a31 29a35 29b30 29938 29999 29a30 29a32 29a36
May 12 23:03:19 AF002268 corosync[2404]: [TOTEM ] Retransmit List: 29a36 29938 29999 29a30 29a32 29b2f 29a2e 29a2f 29a31 29a35 29b30
corosync began spewing those error messages to the log about 30 seconds prior, with the 'Retransmit List' getting a new ID appended to it every few seconds. After about a minute of this happening, 28 of our 29 machines all instantly rebooted at the same time.
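In case it helps anyone compare logs across nodes, here is a rough Python sketch of how the 'Retransmit List' lines could be summarized to see which IDs are new on each line. The regex is only a guess based on the excerpt above, and the function names `retransmit_summary` and `new_ids_per_line` are made up for illustration; this is not a corosync tool.

```python
import re

# Pattern assumed from the syslog excerpt above; other log formats may differ.
RETRANSMIT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+corosync\[\d+\]:\s+"
    r"\[TOTEM\s*\]\s+Retransmit List:\s+(?P<ids>.*)$"
)


def retransmit_summary(lines):
    """Return (timestamp, host, sorted hex IDs) for each Retransmit List line."""
    out = []
    for line in lines:
        m = RETRANSMIT_RE.match(line.strip())
        if m:
            # IDs are hexadecimal sequence numbers; sort them numerically.
            ids = sorted(set(m.group("ids").split()), key=lambda s: int(s, 16))
            out.append((m.group("ts"), m.group("host"), ids))
    return out


def new_ids_per_line(summary):
    """For entries given in chronological order, list IDs not seen earlier."""
    seen = set()
    result = []
    for ts, _host, ids in summary:
        fresh = [i for i in ids if i not in seen]
        seen.update(ids)
        result.append((ts, fresh))
    return result
```

The lines need to be passed in chronological order, so an excerpt pasted newest-first like the one above would have to be reversed first, e.g. `new_ids_per_line(retransmit_summary(reversed(lines)))`.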
Needless to say, it was a shock to us, and we really don't know what to think. I am reaching out here to ask whether anyone has clues or suggestions about what I can look for to figure out what triggered this. We initially suspected a bridge loop, but have since ruled that out.
pveversion -v output from one of the machines not yet upgraded:
Code:
root@AF002268:~# pveversion -v
proxmox-ve: 4.4-76 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-1 (running version: 4.4-1/eb2d6f1e)
pve-kernel-4.4.35-1-pve: 4.4.35-76
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-101
pve-firmware: 1.1-10
libpve-common-perl: 4.0-83
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-88
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 0.94.10-1~bpo80+1
This thread, concerning the reboots in 4.4, includes a diagram of our network topology:
https://forum.proxmox.com/threads/4-4-many-hosts-rebooting-under-load.34549/#post-169450