Hey All,
I had a complete cluster reset (a watchdog-triggered reset of all nodes) in the following scenario.
I have a cluster of 7 hosts.
Corosync has 2 rings:
ring0: network 192.168.xx.n/24, using a dedicated copper switch
ring1: network 192.168.yy.n/24, using a VLAN on a 10G fiber link.
Here is part of corosync.conf:
Code:
nodelist {
  node {
    name: pve40
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.xx.40
    ring1_addr: 192.168.yy.40
  }
  node {
    name: pve52
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.xx.52
    ring1_addr: 192.168.yy.52
  }
  ... more nodes ...
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: HA-Cluster-A
  config_version: 15
  interface {
    knet_link_priority: 10
    linknumber: 0
  }
  interface {
    knet_link_priority: 20
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
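To make the link setup easier to reason about, here is a minimal Python sketch that reads the two interface blocks from the excerpt above and picks the link that should be preferred. It assumes that in passive mode the available link with the highest knet_link_priority is used, as the Proxmox cluster documentation describes. The function names link_priorities and preferred_link are made up for this illustration, and the sketch only models the intended selection, not what knet actually did during the incident.

```python
import re

# Interface blocks copied from the totem section above.
TOTEM_EXCERPT = """
totem {
  interface {
    knet_link_priority: 10
    linknumber: 0
  }
  interface {
    knet_link_priority: 20
    linknumber: 1
  }
  link_mode: passive
}
"""


def link_priorities(conf_text):
    """Return {linknumber: knet_link_priority} for every interface block."""
    priorities = {}
    for block in re.findall(r"interface\s*\{(.*?)\}", conf_text, flags=re.S):
        fields = dict(re.findall(r"(\w+):\s*(\S+)", block))
        priorities[int(fields["linknumber"])] = int(fields["knet_link_priority"])
    return priorities


def preferred_link(priorities, links_up):
    """Pick the usable link with the highest priority (assumption: higher wins)."""
    candidates = [link for link in priorities if link in links_up]
    if not candidates:
        return None
    return max(candidates, key=lambda link: priorities[link])


prios = link_priorities(TOTEM_EXCERPT)
print(prios)                          # priorities per link number
print(preferred_link(prios, {0, 1}))  # both links up
print(preferred_link(prios, {0}))     # only ring0 connected
```

Under that assumption, a node with both links up would prefer link 1 (the fiber VLAN), while a node with only ring0 plugged in has to fall back to link 0.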
Now one node (nn) was taken down to get a new fiber card. It came back online with only ring0 connected and ring1 unplugged.
As soon as node nn is reachable over ring0, none of the hosts gets any LRM heartbeat any more. As soon as the watchdog timeout (60s) is reached, each host is reset.
I cannot find the error in this setup.
The last logs I see before the reset are:
Code:
2024-03-06T10:16:28.080135+01:00 pve57 corosync[1532]: [QUORUM] Sync members[6]: 1 2 3 6 7 8
2024-03-06T10:16:28.080844+01:00 pve57 corosync[1532]: [TOTEM ] A new membership (1.1653) was formed. Members
2024-03-06T10:16:28.096054+01:00 pve57 corosync[1532]: [QUORUM] Members[6]: 1 2 3 6 7 8
2024-03-06T10:16:28.096288+01:00 pve57 corosync[1532]: [MAIN ] Completed service synchronization, ready to provide service.
2024-03-06T10:16:28.099871+01:00 pve57 watchdog-mux[1076]: exit watchdog-mux with active connections
2024-03-06T10:16:28.100344+01:00 pve57 pve-ha-crm[2729]: loop take too long (63 seconds)
2024-03-06T10:16:28.114142+01:00 pve57 systemd[1]: watchdog-mux.service: Deactivated successfully.
2024-03-06T10:16:28.114414+01:00 pve57 kernel: [ 1142.459534] watchdog: watchdog0: watchdog did not stop!
2024-03-06T10:16:28.168392+01:00 pve57 pmxcfs[1402]: [status] notice: cpg_send_message retried 99 times
2024-03-06T10:16:28.222159+01:00 pve57 pmxcfs[1402]: [status] notice: received log
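For anyone comparing logs across several nodes, here is a rough Python sketch that pulls the "loop take too long" entries out of such lines and flags those at or above the 60s watchdog timeout mentioned above. The sample lines are copied from the excerpt; the regular expression and the helper names parse_line and stalled_loops are assumptions made for this illustration, not part of any Proxmox tool.

```python
import re
from datetime import datetime

# Three lines copied from the log excerpt above.
LOG_EXCERPT = """\
2024-03-06T10:16:28.099871+01:00 pve57 watchdog-mux[1076]: exit watchdog-mux with active connections
2024-03-06T10:16:28.100344+01:00 pve57 pve-ha-crm[2729]: loop take too long (63 seconds)
2024-03-06T10:16:28.168392+01:00 pve57 pmxcfs[1402]: [status] notice: cpg_send_message retried 99 times
"""

# timestamp, host, process[pid]: message
LINE_RE = re.compile(r"^(\S+) (\S+) ([^\[\s]+)\[(\d+)\]: (.*)$")
WATCHDOG_TIMEOUT_S = 60  # timeout stated in the post


def parse_line(line):
    """Split one log line into its fields; return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    ts, host, proc, pid, msg = match.groups()
    return {
        "time": datetime.fromisoformat(ts),
        "host": host,
        "process": proc,
        "pid": int(pid),
        "message": msg,
    }


def stalled_loops(lines, limit_s=WATCHDOG_TIMEOUT_S):
    """Return (host, process, seconds) for loops that ran for at least limit_s."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        found = re.search(r"loop take too long \((\d+) seconds\)", entry["message"])
        if found and int(found.group(1)) >= limit_s:
            hits.append((entry["host"], entry["process"], int(found.group(1))))
    return hits


print(stalled_loops(LOG_EXCERPT.splitlines()))
```

On the sample lines this reports the pve-ha-crm loop on pve57 that took 63 seconds, which is longer than the 60s watchdog timeout.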
Any hints on how to fix this?
The PVE version is 8.1.4:
Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.13-1-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.11: 7.0-10
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5: 6.5.13-1
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph: 18.2.1-pve2
ceph-fuse: 18.2.1-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.1
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.5
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-4
pve-firewall: 5.0.3
pve-firmware: 3.9-2
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve2
Regards, LM