We have one cluster in particular where nodes lose quorum and are subsequently fenced. This behaviour is definitely new since upgrading to PVE 6, but may be related to the igb drivers.
We upgraded and restarted corosync yesterday, and this morning all nodes again lost quorum simultaneously. These are the commands we ran:
Code:
apt-get update;
apt-get -y dist-upgrade;
apt-get autoremove;
apt-get autoclean;
# On each node before continuing:
systemctl stop pve-ha-lrm;
systemctl stop pve-ha-crm;
systemctl restart corosync;
# Confirm all links connected before continuing:
corosync-cfgtool -s;
systemctl restart pve-cluster.service;
# Confirm you have all nodes in quorum before continuing:
pvecm status;
systemctl start pve-ha-lrm;
systemctl start pve-ha-crm;
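The "confirm you have all nodes in quorum" step could be scripted instead of checked by eye. What follows is only a minimal sketch: it assumes `pvecm status` prints a line of the form `Quorate: Yes` when the node is quorate, and the function names `is_quorate` and `wait_for_quorum`, the 30 attempts and the 2 second sleep are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `pvecm status` output on stdin and succeeds
# only if it contains a "Quorate: Yes" line (assumed output format).
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Illustrative wait loop: poll every 2 seconds, give up after 30 attempts.
wait_for_quorum() {
    tries=0
    while [ "$tries" -lt 30 ]; do
        if pvecm status 2>/dev/null | is_quorate; then
            return 0
        fi
        tries=$((tries + 1))
        sleep 2
    done
    return 1
}
```

With that in place, the last two commands could be gated as `wait_for_quorum && systemctl start pve-ha-lrm && systemctl start pve-ha-crm`, so the HA services are only started once the node reports quorum.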
Things we've attempted:
- Disabled Intel microcode updates
- Disabled secauth
- Increased the token timeout to 10,000 ms
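For anyone comparing settings: `secauth` and `token` both live in the `totem` section of corosync.conf, and on PVE the cluster-wide copy is /etc/pve/corosync.conf. A rough sketch for printing what a given file actually sets is shown here; the function name `show_totem_settings` is hypothetical and the parsing assumes one simple `key: value` pair per line.

```shell
# Hypothetical helper: print the secauth and token settings found in a
# corosync.conf-style file. Assumes one "key: value" pair per line.
show_totem_settings() {
    conf="${1:-/etc/pve/corosync.conf}"
    awk -F':[ \t]*' '
        { gsub(/^[ \t]+/, "", $1) }    # strip leading indentation from the key
        $1 == "secauth" || $1 == "token" { print $1 "=" $2 }
    ' "$conf"
}
```

Running `show_totem_settings` on each node makes it easy to spot a node whose file still carries the old values.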
The situation has improved considerably, but nodes are still restarting regularly:
Code:
[root@kvm1 ~]# last | grep -i boot
reboot system boot 5.0.21-1-pve Fri Aug 30 05:35
reboot system boot 5.0.18-1-pve Sun Aug 25 03:02
reboot system boot 5.0.15-1-pve Tue Aug 20 09:48
reboot system boot 5.0.15-1-pve Mon Aug 19 12:51
reboot system boot 5.0.15-1-pve Sun Aug 18 12:08
reboot system boot 5.0.15-1-pve Sat Aug 17 07:46
reboot system boot 5.0.15-1-pve Fri Aug 9 09:16
reboot system boot 5.0.15-1-pve Thu Aug 8 08:49
reboot system boot 5.0.15-1-pve Wed Aug 7 09:48
reboot system boot 5.0.15-1-pve Tue Aug 6 07:46
reboot system boot 5.0.15-1-pve Tue Aug 6 02:48
reboot system boot 5.0.15-1-pve Mon Aug 5 20:21
reboot system boot 5.0.15-1-pve Mon Aug 5 16:37
reboot system boot 5.0.15-1-pve Mon Aug 5 16:12
reboot system boot 5.0.15-1-pve Mon Aug 5 15:53
reboot system boot 5.0.15-1-pve Sun Aug 4 22:29
reboot system boot 5.0.15-1-pve Sat Aug 3 21:44
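To see at a glance how the restarts are spread across kernel versions, the listing above can be summarised per kernel. This is just a sketch: `reboots_per_kernel` is a made-up name, and it assumes the kernel release is the 4th whitespace-separated field of each `reboot` record, as it is in the output above.

```shell
# Hypothetical helper: count "reboot" records per kernel version.
# Assumes the kernel release is the 4th field of each record.
reboots_per_kernel() {
    awk '$1 == "reboot" { count[$4]++ }
         END { for (k in count) print count[k], k }' | sort -rn
}
```

Used as `last | reboots_per_kernel`; for the listing above this gives 15 restarts on 5.0.15-1-pve and one each on 5.0.18-1-pve and 5.0.21-1-pve.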
3-node cluster:
kvm1:
Code:
Aug 30 05:29:00 kvm1 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:29:01 kvm1 systemd[1]: pvesr.service: Succeeded.
Aug 30 05:29:01 kvm1 systemd[1]: Started Proxmox VE replication runner.
Aug 30 05:30:00 kvm1 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:30:01 kvm1 systemd[1]: pvesr.service: Succeeded.
Aug 30 05:30:01 kvm1 systemd[1]: Started Proxmox VE replication runner.
Aug 30 05:30:01 kvm1 CRON[133902]: (root) CMD (if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ] && [ -d "$(grep '^[[:space:]]*[^#]*[[:space:]]*WorkDir' /etc/mrtg.cfg | awk '{ print $NF }')" ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi)
Aug 30 05:30:03 kvm1 postfix/pickup[128408]: D59741D42: uid=0 from=<root>
Aug 30 05:30:03 kvm1 postfix/cleanup[134039]: D59741D42: message-id=<20190830033003.D59741D42@kvm1.fqdn>
Aug 30 05:30:03 kvm1 postfix/qmgr[2183]: D59741D42: from=<root@kvm1.fqdn>, size=738, nrcpt=1 (queue active)
Aug 30 05:30:04 kvm1 pvemailforward[134042]: mail forward failed: user 'root@pam' does not have a email address
Aug 30 05:30:04 kvm1 postfix/local[134041]: D59741D42: to=<root@kvm1.fqdn>, orig_to=<root>, relay=local, delay=0.52, delays=0.02/0.01/0/0.49, dsn=2.0.0, status=sent (delivered to command: /usr/bin/pvemailforward)
Aug 30 05:30:04 kvm1 postfix/qmgr[2183]: D59741D42: removed
Aug 30 05:30:41 kvm1 corosync[3707970]: [TOTEM ] Token has not been received in 382 ms
Aug 30 05:30:44 kvm1 corosync[3707970]: [TOTEM ] A processor failed, forming new configuration.
Aug 30 05:30:52 kvm1 corosync[3707970]: [TOTEM ] Token has not been received in 8498 ms
Aug 30 05:31:00 kvm1 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:31:02 kvm1 corosync[3707970]: [TOTEM ] Token has not been received in 19153 ms
Aug 30 05:31:05 kvm1 corosync[3707970]: [TOTEM ] A new membership (1:15064) was formed. Members left: 2
Aug 30 05:31:05 kvm1 corosync[3707970]: [TOTEM ] Failed to receive the leave message. failed: 2
Aug 30 05:31:13 kvm1 corosync[3707970]: [TOTEM ] Token has not been received in 8041 ms
Aug 30 05:31:15 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 10
Aug 30 05:31:16 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 20
Aug 30 05:31:17 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 30
Aug 30 05:31:18 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 40
Aug 30 05:31:19 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 50
Aug 30 05:31:20 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 60
Aug 30 05:31:21 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 70
Aug 30 05:31:22 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 80
Aug 30 05:31:23 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 90
Aug 30 05:31:24 kvm1 corosync[3707970]: [TOTEM ] Token has not been received in 18695 ms
Aug 30 05:31:24 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 100
Aug 30 05:31:24 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retried 100 times
Aug 30 05:31:24 kvm1 pmxcfs[3708534]: [status] crit: cpg_send_message failed: 6
Aug 30 05:31:24 kvm1 pve-firewall[2520]: firewall update time (8.043 seconds)
Aug 30 05:31:25 kvm1 watchdog-mux[1127]: client watchdog expired - disable watchdog updates
Aug 30 05:31:25 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 10
Aug 30 05:31:26 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 20
Aug 30 05:31:27 kvm1 corosync[3707970]: [TOTEM ] A new membership (1:15076) was formed. Members
Aug 30 05:31:27 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 30
Aug 30 05:31:28 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 40
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
Aug 30 05:35:50 kvm1 systemd-modules-load[603]: Inserted module 'bonding'
Aug 30 05:35:50 kvm1 kernel: [ 0.000000] microcode: microcode updated early to revision 0x1f, date = 2018-05-08
Aug 30 05:35:50 kvm1 systemd-modules-load[603]: Inserted module 'iscsi_tcp'
Aug 30 05:35:50 kvm1 kernel: [ 0.000000] Linux version 5.0.21-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.0.21-1 (Tue, 20 Aug 2019 17:16:32 +0200) ()
Aug 30 05:35:50 kvm1 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.0.21-1-pve root=/dev/md0 ro quiet