Hey all,
I observed a strange reboot of all my cluster nodes as soon as corosync is restarted on one specific host, or that host is rebooted.
I have 7 hosts in one cluster.
Corosync has 2 links configured: ring0 is on a separate network on a separate switch; ring1 is shared as a VLAN over a 10G fiber interface.
This is part of my corosync.conf:
Code:
.....
node {
    name: pve56
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.24.56
    ring1_addr: 192.168.25.56
}
....
quorum {
    provider: corosync_votequorum
}
totem {
    cluster_name: HA-Cluster-A
    config_version: 13
    interface {
        knet_link_priority: 10
        linknumber: 0
    }
    interface {
        knet_link_priority: 20
        linknumber: 1
    }
    ip_version: ipv4-6
    link_mode: passive
    secauth: on
}
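In case it helps: as far as I understand, Proxmox propagates the cluster-wide config from /etc/pve/corosync.conf to the local /etc/corosync/corosync.conf on each node, so a stale local copy could explain a nodeid mismatch. This is how I would compare the two on every host (assuming the standard paths):
Code:
# cluster-wide config (pmxcfs) vs. the copy corosync actually reads
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf
# the nodeid this host is configured with
grep -B1 -A5 "name: pve56" /etc/corosync/corosync.conf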
When I run "corosync-cfgtool -s" on my pve56 (nodeid 7) I get:
Code:
Local node ID 7, transport knet
LINK ID 0
    addr = 192.168.24.56
    status:
        nodeid: 2: connected
        nodeid: 3: connected
        nodeid: 5: connected
        nodeid: 6: localhost
        nodeid: 7: connected
        nodeid: 8: connected
        nodeid: 9: connected
LINK ID 1
    addr = 192.168.25.56
    status:
        nodeid: 2: connected
        nodeid: 3: connected
        nodeid: 5: connected
        nodeid: 6: localhost
        nodeid: 7: disconnected
        nodeid: 8: connected
        nodeid: 9: connected
This looks the same on all other nodes.
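To cross-check which nodeid the running daemon actually uses, I would dump the cmap database; a sketch (the exact key names may differ between corosync versions):
Code:
# local nodeid as the running corosync sees it
corosync-cmapctl -g runtime.votequorum.this_node_id
# the configured nodelist
corosync-cmapctl | grep ^nodelist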
A "corosync-cfgtool -n" shows all hosts as reachable:
Code:
Local node ID 7, transport knet
nodeid: 2 reachable
    LINK: 0 (192.168.24.56->192.168.24.59) enabled connected mtu: 1397
    LINK: 1 (192.168.25.56->192.168.25.59) enabled connected mtu: 1397
nodeid: 3 reachable
    LINK: 0 (192.168.24.56->192.168.24.52) enabled connected mtu: 1397
    LINK: 1 (192.168.25.56->192.168.25.52) enabled connected mtu: 1397
nodeid: 5 reachable
    LINK: 0 (192.168.24.56->192.168.24.54) enabled connected mtu: 1397
    LINK: 1 (192.168.25.56->192.168.25.54) enabled connected mtu: 1397
nodeid: 6 reachable
    LINK: 0 (192.168.24.56->192.168.24.55) enabled connected mtu: 1397
    LINK: 1 (192.168.25.56->192.168.25.55) enabled connected mtu: 1397
nodeid: 8 reachable
    LINK: 0 (192.168.24.56->192.168.24.57) enabled connected mtu: 1397
    LINK: 1 (192.168.25.56->192.168.25.57) enabled connected mtu: 1397
nodeid: 9 reachable
    LINK: 0 (192.168.24.56->192.168.24.58) enabled connected mtu: 1397
    LINK: 1 (192.168.25.56->192.168.25.58) enabled connected mtu: 1397
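For completeness, membership and votes can also be checked from the Proxmox side; the commands below just read the current state:
Code:
# quorum, vote counts and membership as pmxcfs sees them
pvecm status
pvecm nodes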
As soon as corosync on pve56 (nodeid 7) is stopped, /etc/pve becomes write-protected on all other hosts (at least it looks that way to me; a quick check is sketched below), and the watchdog reboots all hosts after 60 seconds.
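A quick way to confirm the read-only state (the test file name is arbitrary):
Code:
# run on any other node while corosync on pve56 is stopped;
# the write should fail as long as pmxcfs has no quorum
touch /etc/pve/.write-test && rm /etc/pve/.write-test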
I can stop and start corosync on all other nodes without any problem.
What I do not understand is:
1. Why does corosync-cfgtool report pve56 as node 6 (localhost), and node 7 as disconnected on ring1?
2. Why could this change during one of the last upgrades? I have to state that the cluster was running without any problems before. Unfortunately, I cannot trace which upgrade changed the cluster this way (my best idea is grepping the apt history, sketched below).
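Something along these lines should at least show when the corosync/knet packages were touched (a sketch; rotated logs included):
Code:
# when did apt last touch the corosync / knet packages?
zgrep -E "corosync|libknet" /var/log/apt/history.log*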
My hosts are all upgraded to these packages:
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.151-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-10
pve-kernel-helper: 6.4-10
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.151-1-pve: 5.4.151-1
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 15.2.15-pve1~bpo10
ceph-fuse: 15.2.15-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve1~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.6-pve1~bpo10+1
Any help identifying and fixing the problem would be welcome.
Thanks in advance
Lukas