So I upgraded from Proxmox VE 7 to 8 following the directions, and it has completely broken my cluster.
Basic setup: 4 nodes, NFS-backed storage throughout (no Ceph), and a Linux bridge with two VLANs on it. The nodes seem to start up fine, but logging into each one and checking cluster info, every node shows something different for status. Syslog is constantly throwing link up/down messages for each host, and hosts are randomly responsive and unresponsive. I have confirmed that network connectivity is good: everything pings, and even while the cluster layer is complaining about the network, I can go through each node and reach the shell of every other node in the cluster. At times I find the corosync service pegged at 99% CPU; when that happens the web GUI is unresponsive, and I am also seeing pveproxy dying a lot.
I have been troubleshooting this for more than a day.
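For anyone following along, this is roughly what I have been checking on each node while it flaps (standard systemd/corosync tools; each step is guarded so it degrades gracefully on a box where a tool is missing):

```shell
# Count corosync link up/down events in the last hour of the journal.
# "|| true" keeps the pipeline from aborting when there are no matches.
journalctl -u corosync --since "-1 hour" | grep -ci "link" || true

# Ask corosync itself how it sees each knet link (skipped if the
# tool is not installed on this box).
if command -v corosync-cfgtool >/dev/null 2>&1; then
    corosync-cfgtool -s
fi

# Quick look at whether pveproxy is up or in a restart loop.
systemctl --no-pager status pveproxy 2>/dev/null | head -n 5 || true
```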
pvecm status from each machine:
Code:
root@pve-3060-1:~# pvecm status
Cluster information
-------------------
Name: PM-MECH
Config Version: 11
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sat Aug 5 16:44:00 2023
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000001
Ring ID: 1.217c
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate Qdevice
Membership information
----------------------
Nodeid Votes Qdevice Name
0x00000001 1 NA,NV,NMW 192.168.1.69 (local)
0x00000002 1 NA,NV,NMW 192.168.1.70
0x00000003 1 NA,NV,NMW 192.168.1.71
0x00000004 1 NA,NV,NMW 192.168.1.72
0x00000000 0 Qdevice (votes 0)
root@pve-3060-2:~# pvecm status
Cluster information
-------------------
Name: PM-MECH
Config Version: 11
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sat Aug 5 16:45:13 2023
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000002
Ring ID: 1.217c
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate Qdevice
Membership information
----------------------
Nodeid Votes Qdevice Name
0x00000001 1 NA,NV,NMW 192.168.1.69
0x00000002 1 NA,NV,NMW 192.168.1.70 (local)
0x00000003 1 NA,NV,NMW 192.168.1.71
0x00000004 1 NA,NV,NMW 192.168.1.72
0x00000000 0 Qdevice (votes 0)
root@pve-3060-3:~# pvecm status
Cluster information
-------------------
Name: PM-MECH
Config Version: 11
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sat Aug 5 16:45:44 2023
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000003
Ring ID: 1.217c
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate Qdevice
Membership information
----------------------
Nodeid Votes Qdevice Name
0x00000001 1 NA,NV,NMW 192.168.1.69
0x00000002 1 NA,NV,NMW 192.168.1.70
0x00000003 1 NA,NV,NMW 192.168.1.71 (local)
0x00000004 1 NA,NV,NMW 192.168.1.72
0x00000000 0 Qdevice (votes 0)
root@pve-3060-4:~# pvecm status
Cluster information
-------------------
Name: PM-MECH
Config Version: 11
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sat Aug 5 16:46:39 2023
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000004
Ring ID: 1.217c
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate Qdevice
Membership information
----------------------
Nodeid Votes Qdevice Name
0x00000001 1 NA,NV,NMW 192.168.1.69
0x00000002 1 NA,NV,NMW 192.168.1.70
0x00000003 1 NA,NV,NMW 192.168.1.71
0x00000004 1 NA,NV,NMW 192.168.1.72 (local)
0x00000000 0 Qdevice (votes 0)
I have also verified that corosync.conf is identical on every machine, in both the corosync and pve folders.
Code:
root@pve-3060-4:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-3060-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.69
  }
  node {
    name: pve-3060-2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.70
  }
  node {
    name: pve-3060-3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.1.71
  }
  node {
    name: pve-3060-4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.1.72
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: PM-MECH
  config_version: 11
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  link_mode: passive
  secauth: on
  version: 2
}
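For the record, this is the sketch I used to do that verification: it compares the two corosync.conf copies on a node byte-for-byte and prints a checksum that can be compared across nodes by hand (standard PVE paths; adjust if yours differ):

```shell
# Compare two files byte-for-byte; prints MATCH or MISMATCH.
same_config() {
    if cmp -s "$1" "$2"; then echo MATCH; else echo MISMATCH; fi
}

# Only run the real check where both copies actually exist,
# so the script is safe to paste on any box.
if [ -e /etc/pve/corosync.conf ] && [ -e /etc/corosync/corosync.conf ]; then
    same_config /etc/pve/corosync.conf /etc/corosync/corosync.conf
    sha256sum /etc/pve/corosync.conf   # compare this hash on every node
fi
```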