Hello
I have a Proxmox cluster of (currently) 7 nodes.
Some time ago weird things started to happen: random nodes drop out of the cluster and then rejoin. Here are the logs:
Code:
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] link: host: 7 link: 0 is down
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] sctp: Notifying connect thread that sockfd 33 received a link down event
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] host: host: 7 has no active links
Jun 28 19:29:11 prx1 corosync[1867440]: [TOTEM ] Knet host change callback. nodeid: 7 reachable: 0
Code:
Jun 28 19:29:24 prx1 corosync[1867440]: [MAIN ] Member left: r(0) ip(*.*.*.*)
Jun 28 19:29:24 prx1 corosync[1867440]: [TOTEM ] waiting_trans_ack changed to 1
Jun 28 19:29:24 prx1 corosync[1867440]: [SYNC ] call init for locally known services
Jun 28 19:29:24 prx1 corosync[1867440]: [QUORUM] Sync members[6]: 1 2 4 5 6 9
Jun 28 19:29:24 prx1 corosync[1867440]: [QUORUM] Sync left[1]: 7
Jun 28 19:29:24 prx1 corosync[1867440]: [TOTEM ] entering OPERATIONAL state.
Jun 28 19:29:24 prx1 corosync[1867440]: [TOTEM ] A new membership (1.7b34) was formed. Members left: 7
Jun 28 19:29:24 prx1 corosync[1867440]: [TOTEM ] Failed to receive the leave message. failed: 7
And then the node comes back:
Code:
Jun 28 19:34:51 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 50
Jun 28 19:34:52 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 60
Jun 28 19:34:53 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 70
Jun 28 19:34:54 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 80
Jun 28 19:34:55 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 90
Jun 28 19:34:56 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 100
Jun 28 19:34:56 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retried 100 times
Jun 28 19:34:56 prx1 pmxcfs[1867177]: [status] crit: cpg_send_message failed: 6
Jun 28 19:34:57 prx1 corosync[1867440]: [TOTEM ] got commit token
Jun 28 19:34:57 prx1 corosync[1867440]: [TOTEM ] Saving state aru 7 high seq received 7
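To put numbers on these events, here is a minimal sketch that pulls the timestamps out of syslog lines like the ones above and measures how long after a KNET "link down" the node is reported in "Members left". The function names are made up for illustration, and the year is a placeholder because syslog timestamps carry none (only the differences matter).

```python
import re
from datetime import datetime

# Matches syslog lines such as:
# "Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] link: host: 7 link: 0 is down"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<proc>[\w-]+)\[\d+\]:\s+\[(?P<subsys>[^\]]+)\]\s+(?P<msg>.*)$"
)


def parse_line(line, year=2000):
    """Return (timestamp, subsystem, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The year is an assumption: syslog omits it, so a placeholder is supplied.
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["subsys"].strip(), m["msg"]


def link_down_to_leave_gaps(lines, year=2000):
    """Map node id -> seconds between its first 'link down' and 'Members left'."""
    down_at = {}
    gaps = {}
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        ts, subsys, msg = parsed
        m = re.search(r"link: host: (\d+) link: \d+ is down", msg)
        if subsys == "KNET" and m:
            # Keep only the first link-down timestamp seen for each node.
            down_at.setdefault(int(m.group(1)), ts)
            continue
        m = re.search(r"Members left: ([\d ]+)", msg)
        if subsys == "TOTEM" and m:
            for node in map(int, m.group(1).split()):
                if node in down_at:
                    gaps[node] = (ts - down_at.pop(node)).total_seconds()
    return gaps


sample = [
    "Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] link: host: 7 link: 0 is down",
    "Jun 28 19:29:24 prx1 corosync[1867440]: [TOTEM ] A new membership (1.7b34) was formed. Members left: 7",
]
print(link_down_to_leave_gaps(sample))
```

For the two lines quoted above this gives a gap of 13 seconds for node 7. Running it over the full journal of each node might show whether the drops cluster around particular nodes or times.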
Corosync config:
Code:
logging {
  debug: on
  to_syslog: yes
}
...
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: prx
  config_version: 34
  interface {
    knet_transport: sctp
    linknumber: 0
    token: 5000
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
An important note: some of the nodes are in different datacenters, so the baseline latency is fairly high (around 1.5 ms). Additionally, there are anti-DDoS protection and firewalls in the path (but I'm told they don't block anything on port 5405 etc.).
So I'm thinking maybe I can handle this by increasing the timeouts?
As you can see, a 5000 ms token doesn't make a difference. So what else can I try?
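For context on the timeout question, here is a rough sketch of how I understand the effective token timeout from the corosync.conf(5) man page: with a nodelist of at least 3 nodes, the real timeout is token + (number_of_nodes - 2) * token_coefficient, where token_coefficient defaults to 650 ms. The function name is hypothetical, and this assumes the token value is actually picked up from my config.

```python
def effective_token_ms(token_ms, nodes, token_coefficient_ms=650):
    """Effective totem token timeout in ms, per my reading of corosync.conf(5).

    The coefficient only applies when the nodelist has at least 3 nodes.
    The 650 ms default is taken from the man page, not measured on this cluster.
    """
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


# Values from this cluster: token 5000 ms, 7 nodes.
print(effective_token_ms(5000, 7))
```

If that reading is right, the 7 nodes should already be running with about 8250 ms rather than 5000 ms. The value corosync really uses should be visible under the runtime.config.totem.token key in the output of corosync-cmapctl.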
PS: pveversion output:
Code:
# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.35-2-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.11.22-7-pve: 5.11.22-12
ceph-fuse: 14.2.21-1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1