[SOLVED] Cluster Fails after one Day - PVE 6.0.4

I've got the same problem. Same setup: one cluster node was upgraded from 5.4 to 6, the other node is new. They are NOT in the same network, but with knet this shouldn't be a problem.

You still need a stable, low-latency link even with knet, so it is entirely possible that your network is at fault.
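
A quick way to verify this is to measure the latency between the nodes yourself (a rough sketch - substitute the other node's ring0 address, e.g. 192.168.50.223 from your config):

Code:
# run on each node against the other node's corosync address;
# knet needs LAN-like latency, ideally just a few milliseconds or less
ping -c 100 -i 0.2 192.168.50.223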

Please provide the following from each host:

"pveversion -v" output
"/etc/corosync/corosync.conf" content
"/etc/pve/corosync.conf" content
"journalctl -u corosync -u pve-cluster -b"
and
full journal of the system from a few minutes before the issue until a few minutes after it has stabilized (either in the error state, or back to normal)
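
For the last item, something like this works (a sketch - adjust the time window to your incident):

Code:
# dump everything from shortly before the issue until after it has stabilized
journalctl --since "2019-07-26 16:00" --until "2019-07-26 16:30" > fulljournal.txt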
 
Here you go :)

Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.12-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-4
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.50.221
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.50.223
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: sysops-rz
  config_version: 6
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.50.221
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.50.223
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: sysops-rz
  config_version: 6
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}
 
please use journalctl - it has all the relevant info.
 
Code:
root@p2:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-4
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1

Code:
root@p2:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: p2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 94.XXX.16.XXX
  }
  node {
    name: proxmox
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 37.XXX.94.XXX
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: ProxCluster
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}
Code:
root@p2:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: p2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 94.XXX.16.XXX
  }
  node {
    name: proxmox
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 37.XXX.94.XXX
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: ProxCluster
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}
 
Looks like I have the same problem.
6.0.4, 2 nodes clustered with a 1 Gbps link.
Fixed by restarting corosync on one of the nodes.
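
For reference, assuming the standard systemd units, the workaround was:

Code:
systemctl restart corosync
pvecm status    # check that quorum is re-established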


Here it's exactly the same thing. A cluster of 7 nodes was upgraded from 5.4 to 6.0.2, and every morning after the backup the pvecm status is no longer correct.

Code:
pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-4
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pvenode-20
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.131.20
  }
  node {
    name: pvenode-21
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.131.21
  }
  node {
    name: pvenode-22
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.131.22
  }
  node {
    name: pvenode-23
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.131.23
  }
  node {
    name: pvenode-27
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.131.27
  }
  node {
    name: pvenode-28
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.131.28
  }
  node {
    name: pvenode-29
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.131.29
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: nsd-s1-dev
  config_version: 13
  interface {
    bindnetaddr: 192.168.131.20
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}


Code:
cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pvenode-20
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.131.20
  }
  node {
    name: pvenode-21
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.131.21
  }
  node {
    name: pvenode-22
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.131.22
  }
  node {
    name: pvenode-23
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.131.23
  }
  node {
    name: pvenode-27
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.131.27
  }
  node {
    name: pvenode-28
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.131.28
  }
  node {
    name: pvenode-29
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.131.29
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: nsd-s1-dev
  config_version: 13
  interface {
    bindnetaddr: 192.168.131.20
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Code:
 journalctl -u corosync -u pve-cluster -b
-- Logs begin at Fri 2019-07-26 08:05:26 CEST, end at Mon 2019-07-29 07:23:01 CEST. --
Jul 26 08:05:31 smithers-21 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jul 26 08:05:32 smithers-21 pmxcfs[4175]: [quorum] crit: quorum_initialize failed: 2
Jul 26 08:05:32 smithers-21 pmxcfs[4175]: [quorum] crit: can't initialize service
Jul 26 08:05:32 smithers-21 pmxcfs[4175]: [confdb] crit: cmap_initialize failed: 2
Jul 26 08:05:32 smithers-21 pmxcfs[4175]: [confdb] crit: can't initialize service
Jul 26 08:05:32 smithers-21 pmxcfs[4175]: [dcdb] crit: cpg_initialize failed: 2
Jul 26 08:05:32 smithers-21 pmxcfs[4175]: [dcdb] crit: can't initialize service
Jul 26 08:05:32 smithers-21 pmxcfs[4175]: [status] crit: cpg_initialize failed: 2
Jul 26 08:05:32 smithers-21 pmxcfs[4175]: [status] crit: can't initialize service
Jul 26 08:05:34 smithers-21 systemd[1]: Started The Proxmox VE cluster filesystem.
Jul 26 08:05:34 smithers-21 systemd[1]: Starting Corosync Cluster Engine...
Jul 26 08:05:34 smithers-21 corosync[4252]:   [MAIN  ] Corosync Cluster Engine 3.0.2-dirty starting up
Jul 26 08:05:34 smithers-21 corosync[4252]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jul 26 08:05:34 smithers-21 corosync[4252]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Jul 26 08:05:34 smithers-21 corosync[4252]:   [MAIN  ] Please migrate config file to nodelist.
Jul 26 08:05:34 smithers-21 corosync[4252]:   [TOTEM ] Initializing transport (Kronosnet).
Jul 26 08:05:34 smithers-21 corosync[4252]:   [TOTEM ] kronosnet crypto initialized: aes256/sha256
Jul 26 08:05:34 smithers-21 corosync[4252]:   [TOTEM ] totemknet initialized
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jul 26 08:05:34 smithers-21 corosync[4252]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Jul 26 08:05:34 smithers-21 corosync[4252]:   [QB    ] server name: cmap
Jul 26 08:05:34 smithers-21 corosync[4252]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Jul 26 08:05:34 smithers-21 corosync[4252]:   [QB    ] server name: cfg
Jul 26 08:05:34 smithers-21 corosync[4252]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jul 26 08:05:34 smithers-21 corosync[4252]:   [QB    ] server name: cpg
Jul 26 08:05:34 smithers-21 corosync[4252]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Jul 26 08:05:34 smithers-21 corosync[4252]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jul 26 08:05:34 smithers-21 corosync[4252]:   [WD    ] Watchdog not enabled by configuration
Jul 26 08:05:34 smithers-21 corosync[4252]:   [WD    ] resource load_15min missing a recovery key.
Jul 26 08:05:34 smithers-21 corosync[4252]:   [WD    ] resource memory_used missing a recovery key.
Jul 26 08:05:34 smithers-21 corosync[4252]:   [WD    ] no resources configured.
Jul 26 08:05:34 smithers-21 corosync[4252]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Jul 26 08:05:34 smithers-21 corosync[4252]:   [QUORUM] Using quorum provider corosync_votequorum
Jul 26 08:05:34 smithers-21 corosync[4252]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jul 26 08:05:34 smithers-21 corosync[4252]:   [QB    ] server name: votequorum
Jul 26 08:05:34 smithers-21 corosync[4252]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jul 26 08:05:34 smithers-21 corosync[4252]:   [QB    ] server name: quorum
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 1 has no active links
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 1 has no active links
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 1 has no active links
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 3 has no active links
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 3 has no active links
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 3 has no active links
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Jul 26 08:05:34 smithers-21 corosync[4252]:   [KNET  ] host: host: 7 has no active links
 
@Zumpel @Vladislav Solovei @Fusel please provide the logs/info requested.
I provided everything I know how to get. Please give exact instructions on how to obtain the needed information.

I don't understand why there are these problems with Proxmox clusters... I just want to configure and move VMs via the web interface... that's the only part where I really miss VMware.
 
In addition to the config posted above, here are the logs from one node.
 

Attachments

  • coroclusterp2.txt (504.7 KB)
  • fulljournal2.txt (348.7 KB)
And here are the logs from the other node. It's easy to identify where the magic happens...
 

Attachments

  • fulljournal1.txt (71.4 KB)
  • corocluster1.txt (103.8 KB)
It looks like your network is not stable/low-latency enough. At 16:08 the communication gets so bad that node 2 leaves the CPG altogether, at which point you need to restart pve-cluster to get it to attempt to re-join.
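
The manual recovery on the affected node is then something like this (a sketch):

Code:
# on the node that left the CPG:
systemctl restart pve-cluster
pvecm status    # should show the node in the quorate partition again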

Could you provide the output of 'corosync-cmapctl -m stats' on both nodes as well?
 
Also, even if the network is too slow/unstable, the node should come back automatically once the network is stable enough again.

Output for node 1:
Code:
corosync-cmapctl -m stats
stats.ipcs.global.active (u64) = 5
stats.ipcs.global.closed (u64) = 0
stats.ipcs.service0.1407.0x557be49144c0.dispatched (u64) = 0
stats.ipcs.service0.1407.0x557be49144c0.flow_control (u32) = 0
stats.ipcs.service0.1407.0x557be49144c0.flow_control_count (u64) = 224
stats.ipcs.service0.1407.0x557be49144c0.invalid_request (u64) = 0
stats.ipcs.service0.1407.0x557be49144c0.overload (u64) = 0
stats.ipcs.service0.1407.0x557be49144c0.procname (str) = pmxcfs
stats.ipcs.service0.1407.0x557be49144c0.queued (u32) = 0
stats.ipcs.service0.1407.0x557be49144c0.queueing (i32) = 0
stats.ipcs.service0.1407.0x557be49144c0.recv_retries (u64) = 0
stats.ipcs.service0.1407.0x557be49144c0.requests (u64) = 24
stats.ipcs.service0.1407.0x557be49144c0.responses (u64) = 24
stats.ipcs.service0.1407.0x557be49144c0.send_retries (u64) = 0
stats.ipcs.service0.1407.0x557be49144c0.sent (u32) = 0
stats.ipcs.service0.233341.0x557be491d6b0.dispatched (u64) = 0
stats.ipcs.service0.233341.0x557be491d6b0.flow_control (u32) = 0
stats.ipcs.service0.233341.0x557be491d6b0.flow_control_count (u64) = 0
stats.ipcs.service0.233341.0x557be491d6b0.invalid_request (u64) = 0
stats.ipcs.service0.233341.0x557be491d6b0.overload (u64) = 0
stats.ipcs.service0.233341.0x557be491d6b0.procname (str) = corosync-cmapct
stats.ipcs.service0.233341.0x557be491d6b0.queued (u32) = 0
stats.ipcs.service0.233341.0x557be491d6b0.queueing (i32) = 0
stats.ipcs.service0.233341.0x557be491d6b0.recv_retries (u64) = 0
stats.ipcs.service0.233341.0x557be491d6b0.requests (u64) = 54
stats.ipcs.service0.233341.0x557be491d6b0.responses (u64) = 55
stats.ipcs.service0.233341.0x557be491d6b0.send_retries (u64) = 0
stats.ipcs.service0.233341.0x557be491d6b0.sent (u32) = 0
stats.ipcs.service2.1407.0x557be4910160.dispatched (u64) = 20553
stats.ipcs.service2.1407.0x557be4910160.flow_control (u32) = 0
stats.ipcs.service2.1407.0x557be4910160.flow_control_count (u64) = 224
stats.ipcs.service2.1407.0x557be4910160.invalid_request (u64) = 0
stats.ipcs.service2.1407.0x557be4910160.overload (u64) = 0
stats.ipcs.service2.1407.0x557be4910160.procname (str) = pmxcfs
stats.ipcs.service2.1407.0x557be4910160.queued (u32) = 0
stats.ipcs.service2.1407.0x557be4910160.queueing (i32) = 0
stats.ipcs.service2.1407.0x557be4910160.recv_retries (u64) = 0
stats.ipcs.service2.1407.0x557be4910160.requests (u64) = 10257
stats.ipcs.service2.1407.0x557be4910160.responses (u64) = 2
stats.ipcs.service2.1407.0x557be4910160.send_retries (u64) = 0
stats.ipcs.service2.1407.0x557be4910160.sent (u32) = 20553
stats.ipcs.service2.1407.0x557be4911be0.dispatched (u64) = 77623
stats.ipcs.service2.1407.0x557be4911be0.flow_control (u32) = 0
stats.ipcs.service2.1407.0x557be4911be0.flow_control_count (u64) = 224
stats.ipcs.service2.1407.0x557be4911be0.invalid_request (u64) = 0
stats.ipcs.service2.1407.0x557be4911be0.overload (u64) = 0
stats.ipcs.service2.1407.0x557be4911be0.procname (str) = pmxcfs
stats.ipcs.service2.1407.0x557be4911be0.queued (u32) = 0
stats.ipcs.service2.1407.0x557be4911be0.queueing (i32) = 0
stats.ipcs.service2.1407.0x557be4911be0.recv_retries (u64) = 0
stats.ipcs.service2.1407.0x557be4911be0.requests (u64) = 72643
stats.ipcs.service2.1407.0x557be4911be0.responses (u64) = 2
stats.ipcs.service2.1407.0x557be4911be0.send_retries (u64) = 0
stats.ipcs.service2.1407.0x557be4911be0.sent (u32) = 77623
stats.ipcs.service3.1407.0x557be4912a90.dispatched (u64) = 113
stats.ipcs.service3.1407.0x557be4912a90.flow_control (u32) = 0
stats.ipcs.service3.1407.0x557be4912a90.flow_control_count (u64) = 224
stats.ipcs.service3.1407.0x557be4912a90.invalid_request (u64) = 0
stats.ipcs.service3.1407.0x557be4912a90.overload (u64) = 0
stats.ipcs.service3.1407.0x557be4912a90.procname (str) = pmxcfs
stats.ipcs.service3.1407.0x557be4912a90.queued (u32) = 0
stats.ipcs.service3.1407.0x557be4912a90.queueing (i32) = 0
stats.ipcs.service3.1407.0x557be4912a90.recv_retries (u64) = 0
stats.ipcs.service3.1407.0x557be4912a90.requests (u64) = 2
stats.ipcs.service3.1407.0x557be4912a90.responses (u64) = 2
stats.ipcs.service3.1407.0x557be4912a90.send_retries (u64) = 0
stats.ipcs.service3.1407.0x557be4912a90.sent (u32) = 113
stats.knet.handle.rx_compress_time_ave (u64) = 0
stats.knet.handle.rx_compress_time_max (u64) = 0
stats.knet.handle.rx_compress_time_min (u64) = 18446744073709551615
stats.knet.handle.rx_compressed_original_bytes (u64) = 0
stats.knet.handle.rx_compressed_packets (u64) = 0
stats.knet.handle.rx_compressed_size_bytes (u64) = 0
stats.knet.handle.rx_crypt_packets (u64) = 485521
stats.knet.handle.rx_crypt_time_ave (u64) = 5478
stats.knet.handle.rx_crypt_time_max (u64) = 327377
stats.knet.handle.rx_crypt_time_min (u64) = 4846
stats.knet.handle.tx_compress_time_ave (u64) = 0
stats.knet.handle.tx_compress_time_max (u64) = 0
stats.knet.handle.tx_compress_time_min (u64) = 18446744073709551615
stats.knet.handle.tx_compressed_original_bytes (u64) = 0
stats.knet.handle.tx_compressed_packets (u64) = 0
stats.knet.handle.tx_compressed_size_bytes (u64) = 0
stats.knet.handle.tx_crypt_byte_overhead (u64) = 32429373
stats.knet.handle.tx_crypt_packets (u64) = 669932
stats.knet.handle.tx_crypt_time_ave (u64) = 6204
stats.knet.handle.tx_crypt_time_max (u64) = 279124
stats.knet.handle.tx_crypt_time_min (u64) = 5176
stats.knet.handle.tx_uncompressed_packets (u64) = 0
stats.knet.node1.link0.connected (u8) = 1
stats.knet.node1.link0.down_count (u32) = 0
stats.knet.node1.link0.enabled (u8) = 1
stats.knet.node1.link0.latency_ave (u32) = 0
stats.knet.node1.link0.latency_max (u32) = 0
stats.knet.node1.link0.latency_min (u32) = 4294967295
stats.knet.node1.link0.latency_samples (u32) = 0
stats.knet.node1.link0.mtu (u32) = 65535
stats.knet.node1.link0.rx_data_bytes (u64) = 0
stats.knet.node1.link0.rx_data_packets (u64) = 0
stats.knet.node1.link0.rx_ping_bytes (u64) = 0
stats.knet.node1.link0.rx_ping_packets (u64) = 0
stats.knet.node1.link0.rx_pmtu_bytes (u64) = 0
stats.knet.node1.link0.rx_pmtu_packets (u64) = 0
stats.knet.node1.link0.rx_pong_bytes (u64) = 0
stats.knet.node1.link0.rx_pong_packets (u64) = 0
stats.knet.node1.link0.rx_total_bytes (u64) = 0
stats.knet.node1.link0.rx_total_packets (u64) = 0
stats.knet.node1.link0.rx_total_retries (u64) = 0
stats.knet.node1.link0.tx_data_bytes (u64) = 190340412
stats.knet.node1.link0.tx_data_errors (u32) = 0
stats.knet.node1.link0.tx_data_packets (u64) = 2000794
stats.knet.node1.link0.tx_data_retries (u32) = 0
stats.knet.node1.link0.tx_ping_bytes (u64) = 0
stats.knet.node1.link0.tx_ping_errors (u32) = 0
stats.knet.node1.link0.tx_ping_packets (u64) = 0
stats.knet.node1.link0.tx_ping_retries (u32) = 0
stats.knet.node1.link0.tx_pmtu_bytes (u64) = 0
stats.knet.node1.link0.tx_pmtu_errors (u32) = 0
stats.knet.node1.link0.tx_pmtu_packets (u64) = 0
stats.knet.node1.link0.tx_pmtu_retries (u32) = 0
stats.knet.node1.link0.tx_pong_bytes (u64) = 0
stats.knet.node1.link0.tx_pong_errors (u32) = 0
stats.knet.node1.link0.tx_pong_packets (u64) = 0
stats.knet.node1.link0.tx_pong_retries (u32) = 0
stats.knet.node1.link0.tx_total_bytes (u64) = 190340412
stats.knet.node1.link0.tx_total_errors (u64) = 0
stats.knet.node1.link0.tx_total_packets (u64) = 2000794
stats.knet.node1.link0.up_count (u32) = 1
stats.knet.node2.link0.connected (u8) = 0
stats.knet.node2.link0.down_count (u32) = 73
stats.knet.node2.link0.enabled (u8) = 1
stats.knet.node2.link0.latency_ave (u32) = 1697
stats.knet.node2.link0.latency_max (u32) = 3749
stats.knet.node2.link0.latency_min (u32) = 69
stats.knet.node2.link0.latency_samples (u32) = 54914
stats.knet.node2.link0.mtu (u32) = 1366
stats.knet.node2.link0.rx_data_bytes (u64) = 1564416926
stats.knet.node2.link0.rx_data_packets (u64) = 1564417065
stats.knet.node2.link0.rx_ping_bytes (u64) = 1564418849
stats.knet.node2.link0.rx_ping_packets (u64) = 1564417039
stats.knet.node2.link0.rx_pmtu_bytes (u64) = 808046
stats.knet.node2.link0.rx_pmtu_packets (u64) = 1564419654
stats.knet.node2.link0.rx_pong_bytes (u64) = 1564657449
stats.knet.node2.link0.rx_pong_packets (u64) = 1564428396
stats.knet.node2.link0.rx_total_bytes (u64) = 4694301270
stats.knet.node2.link0.rx_total_packets (u64) = 6257682154
stats.knet.node2.link0.rx_total_retries (u64) = 0
stats.knet.node2.link0.tx_data_bytes (u64) = 1572426599
stats.knet.node2.link0.tx_data_errors (u32) = 0
stats.knet.node2.link0.tx_data_packets (u64) = 1564448728
stats.knet.node2.link0.tx_data_retries (u32) = 0
stats.knet.node2.link0.tx_ping_bytes (u64) = 1565291654
stats.knet.node2.link0.tx_ping_errors (u32) = 0
stats.knet.node2.link0.tx_ping_packets (u64) = 1564430108
stats.knet.node2.link0.tx_ping_retries (u32) = 0
stats.knet.node2.link0.tx_pmtu_bytes (u64) = 1128096
stats.knet.node2.link0.tx_pmtu_errors (u32) = 0
stats.knet.node2.link0.tx_pmtu_packets (u64) = 1564419665
stats.knet.node2.link0.tx_pmtu_retries (u32) = 0
stats.knet.node2.link0.tx_pong_bytes (u64) = 1564421895
stats.knet.node2.link0.tx_pong_errors (u32) = 0
stats.knet.node2.link0.tx_pong_packets (u64) = 1564418591
stats.knet.node2.link0.tx_pong_retries (u32) = 0
stats.knet.node2.link0.tx_total_bytes (u64) = 4703268244
stats.knet.node2.link0.tx_total_errors (u64) = 0
stats.knet.node2.link0.tx_total_packets (u64) = 6257717092
stats.knet.node2.link0.up_count (u32) = 72
stats.pg.msg_queue_avail (u32) = 0
stats.pg.msg_reserved (u32) = 2
stats.srp.avg_backlog_calc (u32) = 0
stats.srp.avg_token_workload (u32) = 0
stats.srp.commit_entered (u64) = 114
stats.srp.commit_token_lost (u64) = 0
stats.srp.consensus_timeouts (u64) = 1
stats.srp.continuous_gather (u32) = 0
stats.srp.continuous_sendmsg_failures (u32) = 0
stats.srp.firewall_enabled_or_nic_failure (u8) = 0
stats.srp.gather_entered (u64) = 115
stats.srp.gather_token_lost (u64) = 0
stats.srp.mcast_retx (u64) = 0
stats.srp.mcast_rx (u64) = 83662
stats.srp.mcast_tx (u64) = 69783
stats.srp.memb_commit_token_rx (u64) = 228
stats.srp.memb_commit_token_tx (u64) = 228
stats.srp.memb_join_rx (u64) = 623
stats.srp.memb_join_tx (u64) = 310
stats.srp.memb_merge_detect_rx (u64) = 225042
stats.srp.memb_merge_detect_tx (u64) = 225042
stats.srp.mtt_rx_token (u32) = 40
stats.srp.operational_entered (u64) = 114
stats.srp.operational_token_lost (u64) = 100
stats.srp.orf_token_rx (u64) = 2128694
stats.srp.orf_token_tx (u64) = 114
stats.srp.recovery_entered (u64) = 114
stats.srp.recovery_token_lost (u64) = 0
stats.srp.rx_msg_dropped (u64) = 0
stats.srp.time_since_token_last_received (u64) = 112
stats.srp.token_hold_cancel_rx (u64) = 48066
stats.srp.token_hold_cancel_tx (u64) = 43768

And for node 2:
Code:
corosync-cmapctl -m stats
stats.ipcs.global.active (u64) = 5
stats.ipcs.global.closed (u64) = 5
stats.ipcs.service0.4192.0x55ecfeb927f0.dispatched (u64) = 0
stats.ipcs.service0.4192.0x55ecfeb927f0.flow_control (u32) = 0
stats.ipcs.service0.4192.0x55ecfeb927f0.flow_control_count (u64) = 0
stats.ipcs.service0.4192.0x55ecfeb927f0.invalid_request (u64) = 0
stats.ipcs.service0.4192.0x55ecfeb927f0.overload (u64) = 0
stats.ipcs.service0.4192.0x55ecfeb927f0.procname (str) = corosync-cmapct
stats.ipcs.service0.4192.0x55ecfeb927f0.queued (u32) = 0
stats.ipcs.service0.4192.0x55ecfeb927f0.queueing (i32) = 0
stats.ipcs.service0.4192.0x55ecfeb927f0.recv_retries (u64) = 0
stats.ipcs.service0.4192.0x55ecfeb927f0.requests (u64) = 27
stats.ipcs.service0.4192.0x55ecfeb927f0.responses (u64) = 28
stats.ipcs.service0.4192.0x55ecfeb927f0.send_retries (u64) = 0
stats.ipcs.service0.4192.0x55ecfeb927f0.sent (u32) = 0
stats.ipcs.service0.942.0x55ecfeb8c7d0.dispatched (u64) = 0
stats.ipcs.service0.942.0x55ecfeb8c7d0.flow_control (u32) = 0
stats.ipcs.service0.942.0x55ecfeb8c7d0.flow_control_count (u64) = 8332
stats.ipcs.service0.942.0x55ecfeb8c7d0.invalid_request (u64) = 0
stats.ipcs.service0.942.0x55ecfeb8c7d0.overload (u64) = 0
stats.ipcs.service0.942.0x55ecfeb8c7d0.procname (str) = pmxcfs
stats.ipcs.service0.942.0x55ecfeb8c7d0.queued (u32) = 0
stats.ipcs.service0.942.0x55ecfeb8c7d0.queueing (i32) = 0
stats.ipcs.service0.942.0x55ecfeb8c7d0.recv_retries (u64) = 0
stats.ipcs.service0.942.0x55ecfeb8c7d0.requests (u64) = 24
stats.ipcs.service0.942.0x55ecfeb8c7d0.responses (u64) = 24
stats.ipcs.service0.942.0x55ecfeb8c7d0.send_retries (u64) = 0
stats.ipcs.service0.942.0x55ecfeb8c7d0.sent (u32) = 0
stats.ipcs.service2.942.0x55ecfeb88490.dispatched (u64) = 4081
stats.ipcs.service2.942.0x55ecfeb88490.flow_control (u32) = 0
stats.ipcs.service2.942.0x55ecfeb88490.flow_control_count (u64) = 8088
stats.ipcs.service2.942.0x55ecfeb88490.invalid_request (u64) = 0
stats.ipcs.service2.942.0x55ecfeb88490.overload (u64) = 0
stats.ipcs.service2.942.0x55ecfeb88490.procname (str) = pmxcfs
stats.ipcs.service2.942.0x55ecfeb88490.queued (u32) = 0
stats.ipcs.service2.942.0x55ecfeb88490.queueing (i32) = 0
stats.ipcs.service2.942.0x55ecfeb88490.recv_retries (u64) = 0
stats.ipcs.service2.942.0x55ecfeb88490.requests (u64) = 38
stats.ipcs.service2.942.0x55ecfeb88490.responses (u64) = 2
stats.ipcs.service2.942.0x55ecfeb88490.send_retries (u64) = 0
stats.ipcs.service2.942.0x55ecfeb88490.sent (u32) = 4081
stats.ipcs.service2.942.0x55ecfeb89ec0.dispatched (u64) = 44965
stats.ipcs.service2.942.0x55ecfeb89ec0.flow_control (u32) = 0
stats.ipcs.service2.942.0x55ecfeb89ec0.flow_control_count (u64) = 8332
stats.ipcs.service2.942.0x55ecfeb89ec0.invalid_request (u64) = 0
stats.ipcs.service2.942.0x55ecfeb89ec0.overload (u64) = 0
stats.ipcs.service2.942.0x55ecfeb89ec0.procname (str) = pmxcfs
stats.ipcs.service2.942.0x55ecfeb89ec0.queued (u32) = 0
stats.ipcs.service2.942.0x55ecfeb89ec0.queueing (i32) = 0
stats.ipcs.service2.942.0x55ecfeb89ec0.recv_retries (u64) = 0
stats.ipcs.service2.942.0x55ecfeb89ec0.requests (u64) = 25820
stats.ipcs.service2.942.0x55ecfeb89ec0.responses (u64) = 2
stats.ipcs.service2.942.0x55ecfeb89ec0.send_retries (u64) = 0
stats.ipcs.service2.942.0x55ecfeb89ec0.sent (u32) = 44965
stats.ipcs.service3.942.0x55ecfeb8adc0.dispatched (u64) = 4167
stats.ipcs.service3.942.0x55ecfeb8adc0.flow_control (u32) = 0
stats.ipcs.service3.942.0x55ecfeb8adc0.flow_control_count (u64) = 8332
stats.ipcs.service3.942.0x55ecfeb8adc0.invalid_request (u64) = 0
stats.ipcs.service3.942.0x55ecfeb8adc0.overload (u64) = 0
stats.ipcs.service3.942.0x55ecfeb8adc0.procname (str) = pmxcfs
stats.ipcs.service3.942.0x55ecfeb8adc0.queued (u32) = 0
stats.ipcs.service3.942.0x55ecfeb8adc0.queueing (i32) = 0
stats.ipcs.service3.942.0x55ecfeb8adc0.recv_retries (u64) = 0
stats.ipcs.service3.942.0x55ecfeb8adc0.requests (u64) = 2
stats.ipcs.service3.942.0x55ecfeb8adc0.responses (u64) = 2
stats.ipcs.service3.942.0x55ecfeb8adc0.send_retries (u64) = 0
stats.ipcs.service3.942.0x55ecfeb8adc0.sent (u32) = 4167
stats.knet.handle.rx_compress_time_ave (u64) = 0
stats.knet.handle.rx_compress_time_max (u64) = 0
stats.knet.handle.rx_compress_time_min (u64) = 18446744073709551615
stats.knet.handle.rx_compressed_original_bytes (u64) = 0
stats.knet.handle.rx_compressed_packets (u64) = 0
stats.knet.handle.rx_compressed_size_bytes (u64) = 0
stats.knet.handle.rx_crypt_packets (u64) = 577194
stats.knet.handle.rx_crypt_time_ave (u64) = 7836
stats.knet.handle.rx_crypt_time_max (u64) = 341994
stats.knet.handle.rx_crypt_time_min (u64) = 6694
stats.knet.handle.tx_compress_time_ave (u64) = 0
stats.knet.handle.tx_compress_time_max (u64) = 0
stats.knet.handle.tx_compress_time_min (u64) = 18446744073709551615
stats.knet.handle.tx_compressed_original_bytes (u64) = 0
stats.knet.handle.tx_compressed_packets (u64) = 0
stats.knet.handle.tx_compressed_size_bytes (u64) = 0
stats.knet.handle.tx_crypt_byte_overhead (u64) = 28739176
stats.knet.handle.tx_crypt_packets (u64) = 615633
stats.knet.handle.tx_crypt_time_ave (u64) = 10247
stats.knet.handle.tx_crypt_time_max (u64) = 274873
stats.knet.handle.tx_crypt_time_min (u64) = 8122
stats.knet.handle.tx_uncompressed_packets (u64) = 0
stats.knet.node1.link0.connected (u8) = 0
stats.knet.node1.link0.down_count (u32) = 1564405645
stats.knet.node1.link0.enabled (u8) = 1
stats.knet.node1.link0.latency_ave (u32) = 1918
stats.knet.node1.link0.latency_max (u32) = 2248
stats.knet.node1.link0.latency_min (u32) = 1918
stats.knet.node1.link0.latency_samples (u32) = 28672
stats.knet.node1.link0.mtu (u32) = 1366
stats.knet.node1.link0.rx_data_bytes (u64) = 8346231
stats.knet.node1.link0.rx_data_packets (u64) = 73964
stats.knet.node1.link0.rx_ping_bytes (u64) = 1458730
stats.knet.node1.link0.rx_ping_packets (u64) = 56105
stats.knet.node1.link0.rx_pmtu_bytes (u64) = 1110106
stats.knet.node1.link0.rx_pmtu_packets (u64) = 1372
stats.knet.node1.link0.rx_pong_bytes (u64) = 1078142
stats.knet.node1.link0.rx_pong_packets (u64) = 41467
stats.knet.node1.link0.rx_total_bytes (u64) = 11993209
stats.knet.node1.link0.rx_total_packets (u64) = 172908
stats.knet.node1.link0.rx_total_retries (u64) = 0
stats.knet.node1.link0.tx_data_bytes (u64) = 76026528
stats.knet.node1.link0.tx_data_errors (u32) = 1564404545
stats.knet.node1.link0.tx_data_packets (u64) = 505373
stats.knet.node1.link0.tx_data_retries (u32) = 0
stats.knet.node1.link0.tx_ping_bytes (u64) = 4175760
stats.knet.node1.link0.tx_ping_errors (u32) = 0
stats.knet.node1.link0.tx_ping_packets (u64) = 52197
stats.knet.node1.link0.tx_ping_retries (u32) = 0
stats.knet.node1.link0.tx_pmtu_bytes (u64) = 862592
stats.knet.node1.link0.tx_pmtu_errors (u32) = 0
stats.knet.node1.link0.tx_pmtu_packets (u64) = 586
stats.knet.node1.link0.tx_pmtu_retries (u32) = 0
stats.knet.node1.link0.tx_pong_bytes (u64) = 4488400
stats.knet.node1.link0.tx_pong_errors (u32) = 1564404411
stats.knet.node1.link0.tx_pong_packets (u64) = 56105
stats.knet.node1.link0.tx_pong_retries (u32) = 0
stats.knet.node1.link0.tx_total_bytes (u64) = 85553280
stats.knet.node1.link0.tx_total_errors (u64) = 3128808956
stats.knet.node1.link0.tx_total_packets (u64) = 614261
stats.knet.node1.link0.up_count (u32) = 45
stats.knet.node2.link0.connected (u8) = 1
stats.knet.node2.link0.down_count (u32) = 0
stats.knet.node2.link0.enabled (u8) = 1
stats.knet.node2.link0.latency_ave (u32) = 0
stats.knet.node2.link0.latency_max (u32) = 0
stats.knet.node2.link0.latency_min (u32) = 4294967295
stats.knet.node2.link0.latency_samples (u32) = 0
stats.knet.node2.link0.mtu (u32) = 65535
stats.knet.node2.link0.rx_data_bytes (u64) = 0
stats.knet.node2.link0.rx_data_packets (u64) = 0
stats.knet.node2.link0.rx_ping_bytes (u64) = 0
stats.knet.node2.link0.rx_ping_packets (u64) = 0
stats.knet.node2.link0.rx_pmtu_bytes (u64) = 0
stats.knet.node2.link0.rx_pmtu_packets (u64) = 0
stats.knet.node2.link0.rx_pong_bytes (u64) = 0
stats.knet.node2.link0.rx_pong_packets (u64) = 0
stats.knet.node2.link0.rx_total_bytes (u64) = 0
stats.knet.node2.link0.rx_total_packets (u64) = 0
stats.knet.node2.link0.rx_total_retries (u64) = 0
stats.knet.node2.link0.tx_data_bytes (u64) = 137001096
stats.knet.node2.link0.tx_data_errors (u32) = 0
stats.knet.node2.link0.tx_data_packets (u64) = 1512342
stats.knet.node2.link0.tx_data_retries (u32) = 0
stats.knet.node2.link0.tx_ping_bytes (u64) = 0
stats.knet.node2.link0.tx_ping_errors (u32) = 0
stats.knet.node2.link0.tx_ping_packets (u64) = 0
stats.knet.node2.link0.tx_ping_retries (u32) = 0
stats.knet.node2.link0.tx_pmtu_bytes (u64) = 0
stats.knet.node2.link0.tx_pmtu_errors (u32) = 0
stats.knet.node2.link0.tx_pmtu_packets (u64) = 0
stats.knet.node2.link0.tx_pmtu_retries (u32) = 0
stats.knet.node2.link0.tx_pong_bytes (u64) = 0
stats.knet.node2.link0.tx_pong_errors (u32) = 0
stats.knet.node2.link0.tx_pong_packets (u64) = 0
stats.knet.node2.link0.tx_pong_retries (u32) = 0
stats.knet.node2.link0.tx_total_bytes (u64) = 137001096
stats.knet.node2.link0.tx_total_errors (u64) = 0
stats.knet.node2.link0.tx_total_packets (u64) = 1512342
stats.knet.node2.link0.up_count (u32) = 1
stats.pg.msg_queue_avail (u32) = 0
stats.pg.msg_reserved (u32) = 2
stats.srp.avg_backlog_calc (u32) = 0
stats.srp.avg_token_workload (u32) = 0
stats.srp.commit_entered (u64) = 4168
stats.srp.commit_token_lost (u64) = 0
stats.srp.consensus_timeouts (u64) = 4050
stats.srp.continuous_gather (u32) = 0
stats.srp.continuous_sendmsg_failures (u32) = 0
stats.srp.firewall_enabled_or_nic_failure (u8) = 0
stats.srp.gather_entered (u64) = 8219
stats.srp.gather_token_lost (u64) = 0
stats.srp.mcast_retx (u64) = 10
stats.srp.mcast_rx (u64) = 74056
stats.srp.mcast_tx (u64) = 50312
stats.srp.memb_commit_token_rx (u64) = 8338
stats.srp.memb_commit_token_tx (u64) = 8336
stats.srp.memb_join_rx (u64) = 102038
stats.srp.memb_join_tx (u64) = 101775
stats.srp.memb_merge_detect_rx (u64) = 243255
stats.srp.memb_merge_detect_tx (u64) = 180196
stats.srp.mtt_rx_token (u32) = 42
stats.srp.operational_entered (u64) = 4168
stats.srp.operational_token_lost (u64) = 54
stats.srp.orf_token_rx (u64) = 1629099
stats.srp.orf_token_tx (u64) = 4053
stats.srp.recovery_entered (u64) = 4168
stats.srp.recovery_token_lost (u64) = 0
stats.srp.rx_msg_dropped (u64) = 0
stats.srp.time_since_token_last_received (u64) = 87
stats.srp.token_hold_cancel_rx (u64) = 32750
stats.srp.token_hold_cancel_tx (u64) = 24125
 
Your network latency is likely too high for knet to work reliably - you have a fairly consistent ~2ms from node 2 to node 1, and an average of 1.7ms with spikes up to 3.75ms from node 1 to node 2. You can try to increase the 'token' timeout (see man corosync.conf), but beware that you might run into other problems instead (like low pmxcfs performance, lock contention, ...). The recommendation is to put your nodes into the same local network.
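
For example, the edit would look roughly like this (a sketch of the totem section in /etc/pve/corosync.conf - config_version must be incremented so the change gets propagated to all nodes):

Code:
totem {
  cluster_name: ProxCluster
  config_version: 3     # must be higher than the current version
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  token: 3000           # default is 1000ms; higher values tolerate more latency
  version: 2
}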
 
I'll try this. But as I already mentioned: in my opinion, the cluster should come back once network conditions are good enough again. To me it's a bug as long as I need to manually restart a service for it to work again.
 
The cluster has now been healthy for about 6 hours with token: 3000 added to the totem section of the config. I will report tomorrow whether it still works, but I think that even with the defaults and a slow/unstable network, the cluster should recover as soon as the network is stable enough again.
 
