[SOLVED] - PVE 5.4-11 + Corosync 3.x: major issues | Page 5

David Herselman

Renowned Member

Proxmox Subscriber

Aug 30, 2019

#81

We have one cluster in particular where nodes lose quorum and are subsequently fenced. This behaviour is definitely new since upgrading to PVE 6, but may be related to the igb drivers.

We upgraded and restarted corosync yesterday, and this morning all nodes again lost quorum simultaneously. Herewith the commands we ran:

Code:

apt-get update;
apt-get -y dist-upgrade;
apt-get autoremove;
apt-get autoclean;

# On each node before continuing:
  systemctl stop pve-ha-lrm;

systemctl stop pve-ha-crm;
systemctl restart corosync;

# Confirm all links connected before continuing:
  corosync-cfgtool -s;

systemctl restart pve-cluster.service;

# Confirm you have all nodes in quorum before continuing:
  pvecm status;

systemctl start pve-ha-lrm;
systemctl start pve-ha-crm;

Things we've attempted:

Disabled Intel microcode updates
Disabled secauth
Increased the token timeout to 10,000 ms

The situation has improved considerably, but nodes are still restarting regularly:

Code:

[root@kvm1 ~]# last | grep -i boot
reboot   system boot  5.0.21-1-pve     Fri Aug 30 05:35
reboot   system boot  5.0.18-1-pve     Sun Aug 25 03:02
reboot   system boot  5.0.15-1-pve     Tue Aug 20 09:48
reboot   system boot  5.0.15-1-pve     Mon Aug 19 12:51
reboot   system boot  5.0.15-1-pve     Sun Aug 18 12:08
reboot   system boot  5.0.15-1-pve     Sat Aug 17 07:46
reboot   system boot  5.0.15-1-pve     Fri Aug  9 09:16
reboot   system boot  5.0.15-1-pve     Thu Aug  8 08:49
reboot   system boot  5.0.15-1-pve     Wed Aug  7 09:48
reboot   system boot  5.0.15-1-pve     Tue Aug  6 07:46
reboot   system boot  5.0.15-1-pve     Tue Aug  6 02:48
reboot   system boot  5.0.15-1-pve     Mon Aug  5 20:21
reboot   system boot  5.0.15-1-pve     Mon Aug  5 16:37
reboot   system boot  5.0.15-1-pve     Mon Aug  5 16:12
reboot   system boot  5.0.15-1-pve     Mon Aug  5 15:53
reboot   system boot  5.0.15-1-pve     Sun Aug  4 22:29
reboot   system boot  5.0.15-1-pve     Sat Aug  3 21:44
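
To put a number on "restarting regularly", the `last` output above can be tallied per calendar day. The following is only a minimal sketch: the function name `reboots_per_day` is made up for illustration, and it assumes the column layout shown in the listing (record type, kernel, weekday, month, day, time).

```python
from collections import Counter


def reboots_per_day(last_output):
    """Count 'reboot' records per calendar day in `last` output.

    Hypothetical helper; assumes the column layout seen in this thread:
    reboot system boot <kernel> <weekday> <month> <day> <hh:mm> ...
    """
    counts = Counter()
    for line in last_output.splitlines():
        fields = line.split()
        if len(fields) >= 8 and fields[:3] == ["reboot", "system", "boot"]:
            month, day = fields[5], fields[6]
            counts[f"{month} {day}"] += 1
    return counts


# A few lines copied from the listing above.
sample = """\
reboot   system boot  5.0.15-1-pve     Tue Aug  6 07:46
reboot   system boot  5.0.15-1-pve     Tue Aug  6 02:48
reboot   system boot  5.0.15-1-pve     Mon Aug  5 20:21
reboot   system boot  5.0.15-1-pve     Mon Aug  5 16:37
reboot   system boot  5.0.15-1-pve     Mon Aug  5 16:12
reboot   system boot  5.0.15-1-pve     Mon Aug  5 15:53
"""

for day, count in reboots_per_day(sample).items():
    print(day, count)
```

Feeding it the full output of `last | grep -i boot` would show which days had multiple fencing events.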

3-node cluster:

kvm1:

Code:

Aug 30 05:29:00 kvm1 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:29:01 kvm1 systemd[1]: pvesr.service: Succeeded.
Aug 30 05:29:01 kvm1 systemd[1]: Started Proxmox VE replication runner.
Aug 30 05:30:00 kvm1 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:30:01 kvm1 systemd[1]: pvesr.service: Succeeded.
Aug 30 05:30:01 kvm1 systemd[1]: Started Proxmox VE replication runner.
Aug 30 05:30:01 kvm1 CRON[133902]: (root) CMD (if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ] && [ -d "$(grep '^[[:space:]]*[^#]*[[:space:]]*WorkDir' /etc/mrtg.cfg | awk '{ print $NF }')" ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi)
Aug 30 05:30:03 kvm1 postfix/pickup[128408]: D59741D42: uid=0 from=<root>
Aug 30 05:30:03 kvm1 postfix/cleanup[134039]: D59741D42: message-id=<20190830033003.D59741D42@kvm1.fqdn>
Aug 30 05:30:03 kvm1 postfix/qmgr[2183]: D59741D42: from=<root@kvm1.fqdn>, size=738, nrcpt=1 (queue active)
Aug 30 05:30:04 kvm1 pvemailforward[134042]: mail forward failed: user 'root@pam' does not have a email address
Aug 30 05:30:04 kvm1 postfix/local[134041]: D59741D42: to=<root@kvm1.fqdn>, orig_to=<root>, relay=local, delay=0.52, delays=0.02/0.01/0/0.49, dsn=2.0.0, status=sent (delivered to command: /usr/bin/pvemailforward)
Aug 30 05:30:04 kvm1 postfix/qmgr[2183]: D59741D42: removed
Aug 30 05:30:41 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 382 ms
Aug 30 05:30:44 kvm1 corosync[3707970]:   [TOTEM ] A processor failed, forming new configuration.
Aug 30 05:30:52 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 8498 ms
Aug 30 05:31:00 kvm1 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:31:02 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 19153 ms
Aug 30 05:31:05 kvm1 corosync[3707970]:   [TOTEM ] A new membership (1:15064) was formed. Members left: 2
Aug 30 05:31:05 kvm1 corosync[3707970]:   [TOTEM ] Failed to receive the leave message. failed: 2
Aug 30 05:31:13 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 8041 ms
Aug 30 05:31:15 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 10
Aug 30 05:31:16 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 20
Aug 30 05:31:17 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 30
Aug 30 05:31:18 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 40
Aug 30 05:31:19 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 50
Aug 30 05:31:20 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 60
Aug 30 05:31:21 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 70
Aug 30 05:31:22 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 80
Aug 30 05:31:23 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 90
Aug 30 05:31:24 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 18695 ms
Aug 30 05:31:24 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 100
Aug 30 05:31:24 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retried 100 times
Aug 30 05:31:24 kvm1 pmxcfs[3708534]: [status] crit: cpg_send_message failed: 6
Aug 30 05:31:24 kvm1 pve-firewall[2520]: firewall update time (8.043 seconds)
Aug 30 05:31:25 kvm1 watchdog-mux[1127]: client watchdog expired - disable watchdog updates
Aug 30 05:31:25 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 10
Aug 30 05:31:26 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 20
Aug 30 05:31:27 kvm1 corosync[3707970]:   [TOTEM ] A new membership (1:15076) was formed. Members
Aug 30 05:31:27 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 30
Aug 30 05:31:28 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 40
<run of NUL bytes (^@) in the log>
Aug 30 05:35:50 kvm1 systemd-modules-load[603]: Inserted module 'bonding'
Aug 30 05:35:50 kvm1 kernel: [    0.000000] microcode: microcode updated early to revision 0x1f, date = 2018-05-08
Aug 30 05:35:50 kvm1 systemd-modules-load[603]: Inserted module 'iscsi_tcp'
Aug 30 05:35:50 kvm1 kernel: [    0.000000] Linux version 5.0.21-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.0.21-1 (Tue, 20 Aug 2019 17:16:32 +0200) ()
Aug 30 05:35:50 kvm1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.0.21-1-pve root=/dev/md0 ro quiet
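
The "Token has not been received in N ms" lines in the log above can be pulled out and compared against a chosen threshold. This is a rough sketch only: `token_gaps_ms` and `THRESHOLD_MS` are illustrative names, and the 10,000 ms threshold is simply the token timeout mentioned earlier in this post, not a value read from corosync.

```python
import re

# Matches the corosync TOTEM warning as it appears in the syslog excerpt.
TOKEN_RE = re.compile(r"\[TOTEM \] Token has not been received in (\d+) ms")

# Assumption: threshold taken from the "token timeout 10,000" mentioned in the post.
THRESHOLD_MS = 10000


def token_gaps_ms(log_text):
    """Return every reported token gap (in ms) found in the log text."""
    return [int(m.group(1)) for m in TOKEN_RE.finditer(log_text)]


# Lines copied from the kvm1 log above.
sample_log = """\
Aug 30 05:30:41 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 382 ms
Aug 30 05:30:52 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 8498 ms
Aug 30 05:31:02 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 19153 ms
"""

gaps = token_gaps_ms(sample_log)
over_threshold = [g for g in gaps if g > THRESHOLD_MS]
print("all gaps:", gaps)
print("above threshold:", over_threshold)
```

Running it over the syslog of each node would make it easier to compare how long each node went without a token.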

David Herselman

Renowned Member

Proxmox Subscriber

Aug 30, 2019

#82

kvm2:

Code:

Aug 30 05:29:00 kvm1 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:29:01 kvm1 systemd[1]: pvesr.service: Succeeded.
Aug 30 05:29:01 kvm1 systemd[1]: Started Proxmox VE replication runner.
Aug 30 05:30:00 kvm1 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:30:01 kvm1 systemd[1]: pvesr.service: Succeeded.
Aug 30 05:30:01 kvm1 systemd[1]: Started Proxmox VE replication runner.
Aug 30 05:30:01 kvm1 CRON[133902]: (root) CMD (if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ] && [ -d "$(grep '^[[:space:]]*[^#]*[[:space:]]*WorkDir' /etc/mrtg.cfg | awk '{ print $NF }')" ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi)
Aug 30 05:30:03 kvm1 postfix/pickup[128408]: D59741D42: uid=0 from=<root>
Aug 30 05:30:03 kvm1 postfix/cleanup[134039]: D59741D42: message-id=<20190830033003.D59741D42@kvm1.fqdn>
Aug 30 05:30:03 kvm1 postfix/qmgr[2183]: D59741D42: from=<root@kvm1.fqdn>, size=738, nrcpt=1 (queue active)
Aug 30 05:30:04 kvm1 pvemailforward[134042]: mail forward failed: user 'root@pam' does not have a email address
Aug 30 05:30:04 kvm1 postfix/local[134041]: D59741D42: to=<root@kvm1.fqdn>, orig_to=<root>, relay=local, delay=0.52, delays=0.02/0.01/0/0.49, dsn=2.0.0, status=sent (delivered to command: /usr/bin/pvemailforward)
Aug 30 05:30:04 kvm1 postfix/qmgr[2183]: D59741D42: removed
Aug 30 05:30:41 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 382 ms
Aug 30 05:30:44 kvm1 corosync[3707970]:   [TOTEM ] A processor failed, forming new configuration.
Aug 30 05:30:52 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 8498 ms
Aug 30 05:31:00 kvm1 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:31:02 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 19153 ms
Aug 30 05:31:05 kvm1 corosync[3707970]:   [TOTEM ] A new membership (1:15064) was formed. Members left: 2
Aug 30 05:31:05 kvm1 corosync[3707970]:   [TOTEM ] Failed to receive the leave message. failed: 2
Aug 30 05:31:13 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 8041 ms
Aug 30 05:31:15 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 10
Aug 30 05:31:16 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 20
Aug 30 05:31:17 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 30
Aug 30 05:31:18 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 40
Aug 30 05:31:19 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 50
Aug 30 05:31:20 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 60
Aug 30 05:31:21 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 70
Aug 30 05:31:22 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 80
Aug 30 05:31:23 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 90
Aug 30 05:31:24 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 18695 ms
Aug 30 05:31:24 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 100
Aug 30 05:31:24 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retried 100 times
Aug 30 05:31:24 kvm1 pmxcfs[3708534]: [status] crit: cpg_send_message failed: 6
Aug 30 05:31:24 kvm1 pve-firewall[2520]: firewall update time (8.043 seconds)
Aug 30 05:31:25 kvm1 watchdog-mux[1127]: client watchdog expired - disable watchdog updates
Aug 30 05:31:25 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 10
Aug 30 05:31:26 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 20
Aug 30 05:31:27 kvm1 corosync[3707970]:   [TOTEM ] A new membership (1:15076) was formed. Members
Aug 30 05:31:27 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 30
Aug 30 05:31:28 kvm1 pmxcfs[3708534]: [status] notice: cpg_send_message retry 40
<run of NUL bytes (^@) in the log>
Aug 30 05:35:50 kvm1 systemd-modules-load[603]: Inserted module 'bonding'
Aug 30 05:35:50 kvm1 kernel: [    0.000000] microcode: microcode updated early to revision 0x1f, date = 2018-05-08
Aug 30 05:35:50 kvm1 systemd-modules-load[603]: Inserted module 'iscsi_tcp'
Aug 30 05:35:50 kvm1 kernel: [    0.000000] Linux version 5.0.21-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.0.21-1 (Tue, 20 Aug 2019 17:16:32 +0200) ()

kvm3:

Code:

Aug 30 05:28:01 kvm3 systemd[1]: Started Proxmox VE replication runner.
Aug 30 05:28:35 kvm3 corosync[2421278]:   [TOTEM ] Retransmit List: b26
Aug 30 05:29:00 kvm3 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:29:01 kvm3 systemd[1]: pvesr.service: Succeeded.
Aug 30 05:29:01 kvm3 systemd[1]: Started Proxmox VE replication runner.
Aug 30 05:30:00 kvm3 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:30:01 kvm3 systemd[1]: pvesr.service: Succeeded.
Aug 30 05:30:01 kvm3 systemd[1]: Started Proxmox VE replication runner.
Aug 30 05:30:41 kvm3 corosync[2421278]:   [TOTEM ] Token has not been received in 7987 ms
Aug 30 05:30:44 kvm3 corosync[2421278]:   [TOTEM ] A processor failed, forming new configuration.
Aug 30 05:31:00 kvm3 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 05:31:05 kvm3 corosync[2421278]:   [TOTEM ] A new membership (1:15064) was formed. Members left: 2
Aug 30 05:31:05 kvm3 corosync[2421278]:   [TOTEM ] Failed to receive the leave message. failed: 2
Aug 30 05:31:08 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 10
Aug 30 05:31:09 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 20
Aug 30 05:31:10 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 30
Aug 30 05:31:11 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 40
Aug 30 05:31:12 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 50
Aug 30 05:31:13 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 60
Aug 30 05:31:14 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 70
Aug 30 05:31:15 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 80
Aug 30 05:31:16 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 90
Aug 30 05:31:17 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 100
Aug 30 05:31:17 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retried 100 times
Aug 30 05:31:17 kvm3 pmxcfs[2421527]: [status] crit: cpg_send_message failed: 6
Aug 30 05:31:18 kvm3 pve-firewall[1928]: firewall update time (7.485 seconds)
Aug 30 05:31:18 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 10
Aug 30 05:31:19 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 20
Aug 30 05:31:20 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 30
Aug 30 05:31:21 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 40
Aug 30 05:31:22 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 50
Aug 30 05:31:23 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 60
Aug 30 05:31:24 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 70
Aug 30 05:31:25 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 80
Aug 30 05:31:26 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 90
Aug 30 05:31:27 kvm3 corosync[2421278]:   [TOTEM ] A new membership (1:15076) was formed. Members
Aug 30 05:31:27 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 100
Aug 30 05:31:27 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retried 100 times
Aug 30 05:31:27 kvm3 pmxcfs[2421527]: [status] crit: cpg_send_message failed: 6
Aug 30 05:31:27 kvm3 pve-firewall[1928]: firewall update time (6.777 seconds)
Aug 30 05:31:28 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 10
Aug 30 05:31:29 kvm3 pmxcfs[2421527]: [status] notice: cpg_send_message retry 20
Aug 30 05:31:30 kvm3 watchdog-mux[851]: client watchdog expired - disable watchdog updates
Aug 30 05:31:30 kvm3 watchdog-mux[851]: client watchdog expired - disable watchdog updates
<run of NUL bytes (^@) in the log>
Aug 30 05:35:04 kvm3 lvm[476]:   1 logical volume(s) in volume group "pve" monitored
Aug 30 05:35:04 kvm3 systemd[1]: Starting Flush Journal to Persistent Storage...
Aug 30 05:35:04 kvm3 kernel: [    0.000000] microcode: microcode updated early to revision 0x1d, date = 2018-05-11
Aug 30 05:35:04 kvm3 systemd[1]: Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
Aug 30 05:35:04 kvm3 kernel: [    0.000000] Linux version 5.0.21-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.0.21-1 (Tue, 20 Aug 2019 17:16:32 +0200) ()

Fusel

Member

Proxmox Subscriber

Sep 1, 2019

#83

I got a coredump from a signal 11 (SEGV). I'm not allowed to upload lz4 files, so I'm putting the output here:

Code:

           PID: 27355 (corosync)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Sun 2019-09-01 04:12:24 CEST (11h ago)
  Command Line: /usr/sbin/corosync -f
    Executable: /usr/sbin/corosync
 Control Group: /system.slice/corosync.service
          Unit: corosync.service
         Slice: system.slice
       Boot ID: 7253b7f158e34181a5a91e6cb0661f82
    Machine ID: eec280213051474d8bfe7e089a86744a
      Hostname: node-27
       Storage: /var/lib/systemd/coredump/core.corosync.0.7253b7f158e34181a5a91e6cb0661f82.27355.1567303944000000.lz4
       Message: Process 27355 (corosync) of user 0 dumped core.
                
                Stack trace of thread 27355:
                #0  0x00007f8eaeeff0f1 n/a (libc.so.6)
                #1  0x000055ec374a1b64 n/a (corosync)
                #2  0x000055ec374995e6 n/a (corosync)
                #3  0x000055ec3749a0e4 n/a (corosync)
                #4  0x000055ec374a4459 n/a (corosync)
                #5  0x00007f8eaed420af n/a (libqb.so.0)
                #6  0x00007f8eaed41c8d qb_loop_run (libqb.so.0)
                #7  0x000055ec3746e0f5 n/a (corosync)
                #8  0x00007f8eaedc409b __libc_start_main (libc.so.6)
                #9  0x000055ec3746e7ba n/a (corosync)
                
                Stack trace of thread 27357:
                #0  0x00007f8eaee997ef epoll_wait (libc.so.6)
                #1  0x00007f8eaefa18d0 n/a (libknet.so.1)
                #2  0x00007f8eaef68fa3 start_thread (libpthread.so.0)
                #3  0x00007f8eaee994cf __clone (libc.so.6)
                
                Stack trace of thread 27361:
                #0  0x00007f8eaee997ef epoll_wait (libc.so.6)
                #1  0x00007f8eaef9f15f n/a (libknet.so.1)
                #2  0x00007f8eaef68fa3 start_thread (libpthread.so.0)
                #3  0x00007f8eaee994cf __clone (libc.so.6)
                
                Stack trace of thread 27362:
                #0  0x00007f8eaee997ef epoll_wait (libc.so.6)
                #1  0x00007f8eaef9c4f3 n/a (libknet.so.1)
                #2  0x00007f8eaef68fa3 start_thread (libpthread.so.0)
                #3  0x00007f8eaee994cf __clone (libc.so.6)
                
                Stack trace of thread 27359:
                #0  0x00007f8eaee66720 __nanosleep (libc.so.6)
                #1  0x00007f8eaee91874 usleep (libc.so.6)
                #2  0x00007f8eaef9b35a n/a (libknet.so.1)
                #3  0x00007f8eaef68fa3 start_thread (libpthread.so.0)
                #4  0x00007f8eaee994cf __clone (libc.so.6)
                
                Stack trace of thread 27358:
                #0  0x00007f8eaee997ef epoll_wait (libc.so.6)
                #1  0x00007f8eaefa2270 n/a (libknet.so.1)
                #2  0x00007f8eaef68fa3 start_thread (libpthread.so.0)
                #3  0x00007f8eaee994cf __clone (libc.so.6)
                
                Stack trace of thread 27360:
                #0  0x00007f8eaee997ef epoll_wait (libc.so.6)
                #1  0x00007f8eaef9aad0 n/a (libknet.so.1)
                #2  0x00007f8eaef68fa3 start_thread (libpthread.so.0)
                #3  0x00007f8eaee994cf __clone (libc.so.6)
                
                Stack trace of thread 27363:
                #0  0x00007f8eaee66720 __nanosleep (libc.so.6)
                #1  0x00007f8eaee91874 usleep (libc.so.6)
                #2  0x00007f8eaef9b14f n/a (libknet.so.1)
                #3  0x00007f8eaef68fa3 start_thread (libpthread.so.0)
                #4  0x00007f8eaee994cf __clone (libc.so.6)

David Herselman

Renowned Member

Proxmox Subscriber

Sep 2, 2019

#84

I'm not getting core dumps but things are less stable with the latest updates:

Code:

[root@kvm1 log]# last | grep boot
reboot   system boot  5.0.21-1-pve     Sun Sep  1 20:50   still running
reboot   system boot  5.0.21-1-pve     Sat Aug 31 16:06   still running
reboot   system boot  5.0.21-1-pve     Fri Aug 30 05:35   still running
reboot   system boot  5.0.18-1-pve     Sun Aug 25 03:02   still running
reboot   system boot  5.0.15-1-pve     Tue Aug 20 09:48   still running
reboot   system boot  5.0.15-1-pve     Mon Aug 19 12:51   still running
reboot   system boot  5.0.15-1-pve     Sun Aug 18 12:08   still running
reboot   system boot  5.0.15-1-pve     Sat Aug 17 07:46   still running
reboot   system boot  5.0.15-1-pve     Fri Aug  9 09:16   still running
reboot   system boot  5.0.15-1-pve     Thu Aug  8 08:49   still running
reboot   system boot  5.0.15-1-pve     Wed Aug  7 09:48   still running
reboot   system boot  5.0.15-1-pve     Tue Aug  6 07:46   still running
reboot   system boot  5.0.15-1-pve     Tue Aug  6 02:48   still running
reboot   system boot  5.0.15-1-pve     Mon Aug  5 20:21   still running
reboot   system boot  5.0.15-1-pve     Mon Aug  5 16:37   still running
reboot   system boot  5.0.15-1-pve     Mon Aug  5 16:12   still running
reboot   system boot  5.0.15-1-pve     Mon Aug  5 15:53   still running
reboot   system boot  5.0.15-1-pve     Sun Aug  4 22:29   still running
reboot   system boot  5.0.15-1-pve     Sat Aug  3 21:44   still running

Herewith /var/log/syslog entries from one of the 3 nodes from yesterday:

Code:

Sep  1 20:43:31 kvm1 corosync[2278]:   [TOTEM ] Retransmit List: 15b47
Sep  1 20:44:25 kvm1 corosync[2278]:   [TOTEM ] FAILED TO RECEIVE
Sep  1 20:44:51 kvm1 corosync[2278]:   [TOTEM ] A new membership (1:16624) was formed. Members left: 2 3
Sep  1 20:44:51 kvm1 corosync[2278]:   [TOTEM ] Failed to receive the leave message. failed: 2 3
Sep  1 20:44:51 kvm1 corosync[2278]:   [CPG   ] downlist left_list: 2 received
Sep  1 20:44:51 kvm1 corosync[2278]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep  1 20:44:51 kvm1 corosync[2278]:   [QUORUM] Members[1]: 1
Sep  1 20:44:51 kvm1 pmxcfs[2155]: [dcdb] notice: members: 1/2155
Sep  1 20:44:51 kvm1 corosync[2278]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  1 20:44:51 kvm1 pmxcfs[2155]: [status] notice: members: 1/2155
Sep  1 20:44:51 kvm1 pmxcfs[2155]: [status] notice: node lost quorum
Sep  1 20:44:51 kvm1 pve-ha-crm[2882]: status change slave => wait_for_quorum
Sep  1 20:45:01 kvm1 pve-ha-lrm[2891]: loop take too long (37 seconds)
Sep  1 20:45:01 kvm1 pve-ha-lrm[2891]: lost lock 'ha_agent_kvm1_lock - cfs lock update failed - Permission denied
Sep  1 20:45:01 kvm1 pvesr[1431032]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep  1 20:45:02 kvm1 pvesr[1431032]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep  1 20:45:03 kvm1 pvesr[1431032]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep  1 20:45:04 kvm1 corosync[2278]:   [TOTEM ] A new membership (1:16628) was formed. Members
Sep  1 20:45:04 kvm1 corosync[2278]:   [CPG   ] downlist left_list: 0 received
Sep  1 20:45:04 kvm1 corosync[2278]:   [QUORUM] Members[1]: 1
Sep  1 20:45:04 kvm1 corosync[2278]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  1 20:45:04 kvm1 pvesr[1431032]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep  1 20:45:05 kvm1 pvesr[1431032]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep  1 20:45:06 kvm1 pve-ha-lrm[2891]: status change active => lost_agent_lock
Sep  1 20:45:06 kvm1 pvesr[1431032]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep  1 20:45:07 kvm1 pvesr[1431032]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep  1 20:45:08 kvm1 pvesr[1431032]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep  1 20:45:09 kvm1 pvesr[1431032]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep  1 20:45:10 kvm1 pvesr[1431032]: error with cfs lock 'file-replication_cfg': no quorum!
Sep  1 20:45:10 kvm1 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Sep  1 20:45:10 kvm1 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Sep  1 20:45:10 kvm1 systemd[1]: Failed to start Proxmox VE replication runner.
Sep  1 20:45:17 kvm1 corosync[2278]:   [TOTEM ] A new membership (1:16632) was formed. Members
Sep  1 20:45:17 kvm1 corosync[2278]:   [CPG   ] downlist left_list: 0 received
Sep  1 20:45:17 kvm1 corosync[2278]:   [QUORUM] Members[1]: 1
Sep  1 20:45:17 kvm1 corosync[2278]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  1 20:45:29 kvm1 corosync[2278]:   [TOTEM ] A new membership (1:16636) was formed. Members
Sep  1 20:45:29 kvm1 corosync[2278]:   [CPG   ] downlist left_list: 0 received
Sep  1 20:45:29 kvm1 corosync[2278]:   [QUORUM] Members[1]: 1
Sep  1 20:45:29 kvm1 corosync[2278]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  1 20:45:42 kvm1 corosync[2278]:   [TOTEM ] A new membership (1:16640) was formed. Members
Sep  1 20:45:42 kvm1 corosync[2278]:   [CPG   ] downlist left_list: 0 received
Sep  1 20:45:42 kvm1 corosync[2278]:   [QUORUM] Members[1]: 1
Sep  1 20:45:42 kvm1 corosync[2278]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  1 20:45:52 kvm1 watchdog-mux[1073]: client watchdog expired - disable watchdog updates
Sep  1 20:45:55 kvm1 corosync[2278]:   [TOTEM ] A new membership (1:16644) was formed. Members
Sep  1 20:45:55 kvm1 corosync[2278]:   [CPG   ] downlist left_list: 0 received
Sep  1 20:45:55 kvm1 corosync[2278]:   [QUORUM] Members[1]: 1
Sep  1 20:45:55 kvm1 corosync[2278]:   [MAIN  ] Completed service synchronization, ready to provide service.
<IPMI watchdog reset system>
Sep  1 20:50:20 kvm1 systemd-modules-load[596]: Inserted module 'bonding'
Sep  1 20:50:20 kvm1 lvm[584]:   1 logical volume(s) in volume group "pve" monitored
Sep  1 20:50:20 kvm1 systemd[1]: Starting Flush Journal to Persistent Storage...
Sep  1 20:50:20 kvm1 systemd[1]: Started Create System Users.
Sep  1 20:50:20 kvm1 systemd[1]: Starting Create Static Device Nodes in /dev...
Sep  1 20:50:20 kvm1 systemd[1]: Started Flush Journal to Persistent Storage.
Sep  1 20:50:20 kvm1 systemd-modules-load[596]: Inserted module 'iscsi_tcp'
Sep  1 20:50:20 kvm1 systemd[1]: Started Create Static Device Nodes in /dev.
Sep  1 20:50:20 kvm1 systemd[1]: Starting udev Kernel Device Manager...
Sep  1 20:50:20 kvm1 systemd-modules-load[596]: Inserted module 'ib_iser'
Sep  1 20:50:20 kvm1 systemd-modules-load[596]: Inserted module 'vhost_net'
Sep  1 20:50:20 kvm1 systemd[1]: Started udev Kernel Device Manager.
Sep  1 20:50:20 kvm1 kernel: [    0.000000] microcode: microcode updated early to revision 0x1f, date = 2018-05-08
Sep  1 20:50:20 kvm1 keyboard-setup.sh[572]: /usr/bin/ckbcomp: Can not find file "symbols/en-us" in any known directory
Sep  1 20:50:20 kvm1 systemd[1]: Started Set the console keyboard layout.
Sep  1 20:50:20 kvm1 kernel: [    0.000000] Linux version 5.0.21-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.0.21-1 (Tue, 20 Aug 2019 17:16:32 +0200) ()
Sep  1 20:50:20 kvm1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.0.21-1-pve root=/dev/md0 ro quiet

astnwt

Renowned Member

Sep 2, 2019

#85

@spirit

I installed systemd-coredump, so I expect to have dumps soon.

elmacus

Renowned Member

Sep 3, 2019

#86

Today all servers went nuts; I had to restart almost all VMs and physical servers.
Lots of: pmxcfs[1814]: [dcdb] crit: cpg_send_message failed: 9
This command fixed some servers:
systemctl restart pve-cluster.service

For several days now the problems have started at around 08:00. What could be running at that time?
Nothing obvious in cron or systemd timers.

I wish I had never upgraded to PVE 6, at least not until this mess is fixed.
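
One way to check whether the failures really cluster around a particular time of day is to bucket the relevant syslog lines by hour. The sketch below is only an illustration: `failures_by_hour` and the `PATTERNS` tuple are made-up names, the patterns are the messages quoted in this thread, and it assumes the traditional syslog timestamp prefix ("Sep  1 20:44:25 host ...").

```python
import re
from collections import Counter

# Traditional syslog prefix, e.g. "Sep  1 20:44:25 kvm1 ..."; captures the hour.
STAMP_RE = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}\s")

# Messages seen in this thread; extend as needed.
PATTERNS = (
    "cpg_send_message failed",
    "Token has not been received",
    "FAILED TO RECEIVE",
)


def failures_by_hour(log_text):
    """Count matching log lines per hour of day (0-23)."""
    hours = Counter()
    for line in log_text.splitlines():
        match = STAMP_RE.match(line)
        if match and any(pattern in line for pattern in PATTERNS):
            hours[int(match.group(1))] += 1
    return hours


# Lines copied from logs posted earlier in the thread.
sample_log = """\
Aug 30 05:30:41 kvm1 corosync[3707970]:   [TOTEM ] Token has not been received in 382 ms
Aug 30 05:31:24 kvm1 pmxcfs[3708534]: [status] crit: cpg_send_message failed: 6
Aug 30 05:31:24 kvm1 pve-firewall[2520]: firewall update time (8.043 seconds)
Sep  1 20:44:25 kvm1 corosync[2278]:   [TOTEM ] FAILED TO RECEIVE
"""

for hour, count in sorted(failures_by_hour(sample_log).items()):
    print(f"{hour:02d}:00  {count}")
```

If one hour stands out across several days, that narrows down which scheduled job or traffic pattern to look at.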

M-SK

Member

Sep 3, 2019

#87

Just to pipe in... we had a severe fallout on Saturday, about one week after upgrading.

The corosync messaging failed (crit: cpg_send_message failed etc.), after which fencing commenced and most nodes rebooted. Even after this, quorum was not re-established: the cluster stayed fragmented, with each fragment too small to form a quorum.
This was around 2 AM.

Around 6 AM I tried to restore quorum by restarting corosync/HA, but it didn't work. After restarting all of the hardware nodes, cluster communications proceeded to work without a hitch.

I have inspected the switch logs: no outages were recorded, traffic was minimal during the night, and I cannot attribute the outage to anything outside corosync itself. We are also running a Hyper-V cluster for Windows VMs on the same switching stack, and no hitch of any kind was recorded there.

Currently we are running the cluster with HA services disabled.

elmacus

Renowned Member

Sep 3, 2019

#88

>Currently we are running the cluster with HA services disabled.

Absolutely, that was the first thing I had to disable as well. I also wish Proxmox would warn users NOT to upgrade yet; I feel sorry for all the people who find this thread after upgrading.
But I guess that's life when running no-subscription.

Does this problem also exist in the enterprise repo? There are many subscribers in this thread.

M-SK

Member

Sep 3, 2019

#89

elmacus said:
>Currently we are running the cluster with HA services disabled.

Absolutely, that was the first thing I had to disable as well. I also wish Proxmox would warn users NOT to upgrade yet; I feel sorry for all the people who find this thread after upgrading.
But I guess that's life when running no-subscription.

Does this problem also exist in the enterprise repo? There are many subscribers in this thread.

Well, we jumped the gun because of that annoying "node hangs with 100% CPU on LXC reboot" issue in the 5.x versions, which was giving us serious trouble, especially now that we're piloting Proxmox via Ansible.

Apollon77

Well-Known Member

Sep 3, 2019

#90

@Dominic Maybe also add that to the 5->6 upgrade info page as a known issue, and warn that clusters should perhaps not be upgraded for now?

Asano

Well-Known Member

Proxmox Subscriber

Sep 3, 2019

#91

I fear I have to join the line of affected users... From skimming through this thread I don't think I can add anything useful; my case looks very similar.

However, is there an easy way to get an email alert when corosync gets killed? One of my clusters was just in a degraded state for two days. The affected node, however, was still able to send me an email about package updates, so basically email sending should still work...

Other than that I really hope this gets fixed fast since this is grave, but I guess that goes without saying.
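
For the email alert question, one possible approach is a small script run periodically (for example from cron) that asks systemd whether corosync is active and mails a notice if it is not. This is only a sketch under assumptions: it presumes a local MTA listening on localhost (the nodes in this thread appear to run postfix), the recipient `root` is a placeholder, and the function names `unit_state`, `build_alert` and `check_and_alert` are made up for illustration.

```python
import smtplib
import socket
import subprocess
from email.message import EmailMessage


def unit_state(unit="corosync.service"):
    """Return the state printed by `systemctl is-active` (e.g. active, failed)."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # non-zero exit just means "not active"
    )
    return result.stdout.strip()


def build_alert(host, unit, state, recipient="root"):
    """Build the alert mail; the recipient is an assumption, adjust as needed."""
    msg = EmailMessage()
    msg["Subject"] = f"{unit} is {state} on {host}"
    msg["From"] = f"root@{host}"
    msg["To"] = recipient
    msg.set_content(f"systemctl reports {unit} as '{state}' on {host}.")
    return msg


def check_and_alert(unit="corosync.service"):
    """Send a mail via the local MTA if the unit is not active."""
    state = unit_state(unit)
    if state != "active":
        with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA on port 25
            smtp.send_message(build_alert(socket.gethostname(), unit, state))
    return state
```

Calling `check_and_alert()` every minute or so would give a notification shortly after corosync dies. It does not replace fixing the crash itself, and it will stay silent if the local mail system is down too.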

elmacus

Renowned Member

Sep 3, 2019

#92

Proxmox team.
You need to communicate more with us users. Are your engineers working on this problem? What is happening?

If I buy a subscription and use the Enterprise repo, will these bugs still affect me?

tom

Proxmox Staff Member

Staff member

Sep 3, 2019

#93

elmacus said:
Proxmox team,
You need to communicate more with us users. Are your engineers working on this problem? What is happening?

If I buy a subscription and use the Enterprise repo, will these bugs still affect me?

Just follow the pve-devel mailing list for the latest development status.

https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

ahovda

Active Member

Sep 3, 2019

#94

Same deal here. To mitigate for now, I've added

Code:

[Service]
Restart=on-failure

to /etc/systemd/system/corosync.service.d/override.conf and ran systemctl daemon-reload through ansible on our 16-node cluster.
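For anyone who wants to script this, here is a minimal sketch of writing that drop-in. The function name `install_corosync_restart_override` is a hypothetical helper made up for illustration, and its argument is the systemd unit directory, so it can be pointed at a scratch directory first. Note that this only restarts corosync after a crash; it does not address the crash itself.

```shell
# Sketch: write a systemd drop-in that makes corosync restart after a crash.
# install_corosync_restart_override is a hypothetical helper name; its argument
# is the systemd unit directory (normally /etc/systemd/system).
install_corosync_restart_override() {
    unit_dir="$1"
    dropin_dir="$unit_dir/corosync.service.d"

    # Create the drop-in directory if it does not exist yet.
    mkdir -p "$dropin_dir" || return 1

    # Same two lines as the override quoted above.
    printf '[Service]\nRestart=on-failure\n' > "$dropin_dir/override.conf"
}
```

On a real node this would be called as root with `install_corosync_restart_override /etc/systemd/system`, followed by `systemctl daemon-reload`. Afterwards, `systemctl show corosync.service -p Restart` should report the new setting.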

I have also collected a few coredumps, if that helps, but they seem too big to attach here.

Code:

           PID: 2577 (corosync)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Tue 2019-09-03 11:19:09 CEST (2h 57min ago)
  Command Line: /usr/sbin/corosync -f
    Executable: /usr/sbin/corosync
 Control Group: /system.slice/corosync.service
          Unit: corosync.service
         Slice: system.slice
       Boot ID: d57436690b7d4f909dc01e5554ac69a4
    Machine ID: 6e925d11b497446e8e7f2ff38e7cf891
      Hostname: osl103pve
       Storage: /var/lib/systemd/coredump/core.corosync.0.d57436690b7d4f909dc01e5554ac69a4.2577.1567502349000000.lz4
       Message: Process 2577 (corosync) of user 0 dumped core.

                Stack trace of thread 2577:
                #0  0x00007f4221b89533 n/a (libc.so.6)
                #1  0x000055c59f5eeb64 n/a (corosync)
                #2  0x000055c59f5e65e6 n/a (corosync)
                #3  0x000055c59f5e70e4 n/a (corosync)
                #4  0x000055c59f5f1459 n/a (corosync)
                #5  0x00007f42219cd0af n/a (libqb.so.0)
                #6  0x00007f42219ccc8d qb_loop_run (libqb.so.0)
                #7  0x000055c59f5bb0f5 n/a (corosync)
                #8  0x00007f4221a4f09b __libc_start_main (libc.so.6)
                #9  0x000055c59f5bb7ba n/a (corosync)

                Stack trace of thread 2587:
                #0  0x00007f4221af1720 __nanosleep (libc.so.6)
                #1  0x00007f4221b1c874 usleep (libc.so.6)
                #2  0x00007f4221c2635a n/a (libknet.so.1)
                #3  0x00007f4221bf3fa3 start_thread (libpthread.so.0)
                #4  0x00007f4221b244cf __clone (libc.so.6)

                Stack trace of thread 2585:
                #0  0x00007f4221b247ef epoll_wait (libc.so.6)
                #1  0x00007f4221c2c8d0 n/a (libknet.so.1)
                #2  0x00007f4221bf3fa3 start_thread (libpthread.so.0)
                #3  0x00007f4221b244cf __clone (libc.so.6)

                Stack trace of thread 2586:
                #0  0x00007f4221b247ef epoll_wait (libc.so.6)
                #1  0x00007f4221c2d270 n/a (libknet.so.1)
                #2  0x00007f4221bf3fa3 start_thread (libpthread.so.0)
                #3  0x00007f4221b244cf __clone (libc.so.6)

                Stack trace of thread 2591:
                #0  0x00007f4221af1720 __nanosleep (libc.so.6)
                #1  0x00007f4221b1c874 usleep (libc.so.6)
                #2  0x00007f4221c2614f n/a (libknet.so.1)
                #3  0x00007f4221bf3fa3 start_thread (libpthread.so.0)
                #4  0x00007f4221b244cf __clone (libc.so.6)

                Stack trace of thread 2590:
                #0  0x00007f4221b247ef epoll_wait (libc.so.6)
                #1  0x00007f4221c274f3 n/a (libknet.so.1)
                #2  0x00007f4221bf3fa3 start_thread (libpthread.so.0)
                #3  0x00007f4221b244cf __clone (libc.so.6)

                Stack trace of thread 2588:
                #0  0x00007f4221b247ef epoll_wait (libc.so.6)
                #1  0x00007f4221c25ad0 n/a (libknet.so.1)
                #2  0x00007f4221bf3fa3 start_thread (libpthread.so.0)
                #3  0x00007f4221b244cf __clone (libc.so.6)

                Stack trace of thread 2589:
                #0  0x00007f4221b247ef epoll_wait (libc.so.6)
                #1  0x00007f4221c2a15f n/a (libknet.so.1)
                #2  0x00007f4221bf3fa3 start_thread (libpthread.so.0)
                #3  0x00007f4221b244cf __clone (libc.so.6)

Code:

...
Sep 03 08:40:37 osl103pve corosync[2577]:   [CPG   ] downlist left_list: 0 received
Sep 03 08:40:37 osl103pve corosync[2577]:   [CPG   ] downlist left_list: 0 received
Sep 03 08:40:37 osl103pve corosync[2577]:   [CPG   ] downlist left_list: 0 received
Sep 03 08:40:37 osl103pve corosync[2577]:   [QUORUM] Members[16]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Sep 03 08:40:37 osl103pve corosync[2577]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 03 08:47:25 osl103pve corosync[2577]:   [TOTEM ] Token has not been received in 363 ms
Sep 03 09:01:17 osl103pve corosync[2577]:   [TOTEM ] Retransmit List: abd6
Sep 03 10:41:38 osl103pve corosync[2577]:   [TOTEM ] Retransmit List: 3ee83
Sep 03 11:19:10 osl103pve systemd[1]: corosync.service: Main process exited, code=dumped, status=11/SEGV
Sep 03 11:19:10 osl103pve systemd[1]: corosync.service: Failed with result 'core-dump'.
Sep 03 11:19:10 osl103pve systemd[1]: corosync.service: Service RestartSec=100ms expired, scheduling restart.
Sep 03 11:19:10 osl103pve systemd[1]: corosync.service: Scheduled restart job, restart counter is at 1.
Sep 03 11:19:10 osl103pve systemd[1]: Stopped Corosync Cluster Engine.
Sep 03 11:19:10 osl103pve systemd[1]: Starting Corosync Cluster Engine...
Sep 03 11:19:10 osl103pve corosync[378701]:   [MAIN  ] Corosync Cluster Engine 3.0.2-dirty starting up
Sep 03 11:19:10 osl103pve corosync[378701]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Sep 03 11:19:10 osl103pve corosync[378701]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Sep 03 11:19:10 osl103pve corosync[378701]:   [MAIN  ] Please migrate config file to nodelist.
Sep 03 11:19:10 osl103pve corosync[378701]:   [TOTEM ] Initializing transport (Kronosnet).
Sep 03 11:19:11 osl103pve corosync[378701]:   [TOTEM ] kronosnet crypto initialized: aes256/sha256
Sep 03 11:19:11 osl103pve corosync[378701]:   [TOTEM ] totemknet initialized
Sep 03 11:19:11 osl103pve corosync[378701]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Sep 03 11:19:11 osl103pve corosync[378701]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Sep 03 11:19:11 osl103pve corosync[378701]:   [QB    ] server name: cmap
Sep 03 11:19:11 osl103pve corosync[378701]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Sep 03 11:19:11 osl103pve corosync[378701]:   [QB    ] server name: cfg
Sep 03 11:19:11 osl103pve corosync[378701]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 03 11:19:11 osl103pve corosync[378701]:   [QB    ] server name: cpg
Sep 03 11:19:11 osl103pve corosync[378701]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep 03 11:19:11 osl103pve corosync[378701]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Sep 03 11:19:11 osl103pve corosync[378701]:   [WD    ] Watchdog not enabled by configuration
Sep 03 11:19:11 osl103pve corosync[378701]:   [WD    ] resource load_15min missing a recovery key.
Sep 03 11:19:11 osl103pve corosync[378701]:   [WD    ] resource memory_used missing a recovery key.
Sep 03 11:19:11 osl103pve corosync[378701]:   [WD    ] no resources configured.
Sep 03 11:19:11 osl103pve corosync[378701]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Sep 03 11:19:11 osl103pve corosync[378701]:   [QUORUM] Using quorum provider corosync_votequorum
Sep 03 11:19:11 osl103pve corosync[378701]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 03 11:19:11 osl103pve corosync[378701]:   [QB    ] server name: votequorum
Sep 03 11:19:11 osl103pve corosync[378701]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 03 11:19:11 osl103pve corosync[378701]:   [QB    ] server name: quorum
Sep 03 11:19:11 osl103pve corosync[378701]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 0)
Sep 03 11:19:11 osl103pve corosync[378701]:   [KNET  ] host: host: 7 has no active links
...

Code:

root@osl103pve:~# pveversion --verbose
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-5.0.21-1-pve: 5.0.21-1
pve-kernel-5.0.18-1-pve: 5.0.18-3
ceph: 14.2.2-pve1
ceph-fuse: 14.2.2-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

Reactions: Asano

elmacus

Renowned Member

Sep 3, 2019

#95

tom said:
Just follow the pve-devel mailing list for the latest development status.

https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

Well, I only meant here in public.
Some info on whether someone is on the case should be enough.
Email list does not help visitors here.

Found this report: https://bugzilla.proxmox.com/show_bug.cgi?id=2326

@ahovda I did that (almost the same, see earlier in this thread), but it did not help my cluster. Please report back after a week on whether it helps you.

tom

Proxmox Staff Member

Staff member

Sep 3, 2019

#96

elmacus said:
Email list does not help visitors here.

Why not? There is also an archive of all messages, see https://pve.proxmox.com/pipermail/pve-devel/

Depends what you really need to know. If you want an in-depth explanation for your own situation and direct, private contact with our devs, consider a support subscription.

elmacus

Renowned Member

Sep 3, 2019

#97

tom said:
Depends what you really need to know.

Is anyone working on this bug?

tom

Proxmox Staff Member

Staff member

Sep 3, 2019

#98

elmacus said:
Is anyone working on this bug?

Yes, and please read the details at the links I posted.

Reactions: elmacus

Asano

Well-Known Member

Proxmox Subscriber

Sep 3, 2019

#99

tom said:
Why not? There is also an archive of all messages, see https://pve.proxmox.com/pipermail/pve-devel/

@tom Over the past year or two, as I started using Proxmox more often, I have really come to like it (including how you as a team handle most things, as well as your support and pricing strategy), but this link is just like throwing a giant ball of cluttered information overload at a poor, single, specific question.

And now to add something more constructive: Maybe linking forum threads to a git-style issue tracking system (or to the specific mailing list thread, if that is what you are using), similar to how threads are marked [solved], would improve things (not just for this thread and issue).

Reactions: elmacus and robhost

robhost

Active Member

Proxmox Subscriber

Sep 3, 2019

#100

Again: If the problem only exists within knet, why don't you use udpu instead?
