Hello all,
I am running into a seemingly unique issue where my cluster will split-brain on its own. It started on the last reboot and seemed to begin because the /etc/hosts file no longer defined the correct member IPs and hostnames. I fixed that, and the cluster will work for about 3 minutes when all nodes are started at the same time, but no matter what, the cluster then fragments. I cannot migrate my VMs off the others onto one node, and I cannot seem to do anything once it loses quorum. I also now have a stuck migration that qm unlock will not clear.
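For reference, this is roughly what /etc/hosts now contains on each node after my fix (reconstructed from the corosync node list below, with our domain names left out, so treat it as a sketch rather than the exact file):
Code:
127.0.0.1 localhost
192.168.69.153 proxmox-02
199.74.35.4 pvt-pyle
199.74.35.5 proxmox-03
199.74.35.6 proxmox-04
199.74.35.7 proxmox-05
199.74.35.8 proxmox-06
199.74.35.9 proxmox-07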
Here is my corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: proxmox-02
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.69.153
  }
  node {
    name: proxmox-03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 199.74.35.5
  }
  node {
    name: proxmox-04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 199.74.35.6
  }
  node {
    name: proxmox-05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 199.74.35.7
  }
  node {
    name: proxmox-06
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 199.74.35.8
  }
  node {
    name: proxmox-07
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 199.74.35.9
  }
  node {
    name: pvt-pyle
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 199.74.35.4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: peckservers
  config_version: 16
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
It's important to note that proxmox-02 is only in the cluster for ease of management and does not serve any critical decision-making or file-serving tasks. That node is off-site, reachable over a VPN.
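To illustrate what I mean, if that node really should have no say in quorum at all, I assume its nodelist entry could carry zero votes, something like the stanza below (I have not applied this; it is just to show the role I intend for it):
Code:
node {
  name: proxmox-02
  nodeid: 1
  quorum_votes: 0
  ring0_addr: 192.168.69.153
}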
Here is my ceph.conf:
Code:
root@pvt-pyle:~# cat /etc/ceph/ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 199.74.35.0/24
fsid = 16980b31-630f-43fe-9f4e-ba6a4ea4b9b5
mon_allow_pool_delete = true
mon_host = 199.74.35.4 199.74.35.6 199.74.35.7
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 199.74.35.0/24
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring
[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
[mds.proxmox-03]
host = proxmox-03
mds_standby_for_name = pve
[mds.proxmox-04]
host = proxmox-04
mds_standby_for_name = pve
[mds.pvt-pyle]
host = pvt-pyle
mds_standby_for_name = pve
[mon.proxmox-04]
public_addr = 199.74.35.6
[mon.proxmox-05]
public_addr = 199.74.35.7
[mon.pvt-pyle]
public_addr = 199.74.35.4
[osd]
osd_max_backfills = 20
osd_mclock_scheduler_background_recovery_lim = 0.8
osd_mclock_scheduler_background_recovery_res = 0.8
osd_recovery_max_active = 20
Here is a bad and a good pvecm status (the broken state first, then after the nodes finally rejoined):
Code:
root@pvt-pyle:~# pvecm status # not working; many nodes are physically powered on but will not join the cluster
Cluster information
-------------------
Name: peckservers
Config Version: 16
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Wed Aug 28 20:52:04 2024
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 2.152eb
Quorate: No
Votequorum information
----------------------
Expected votes: 7
Highest expected: 7
Total votes: 2
Quorum: 4 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000002 1 199.74.35.4 (local)
0x00000007 1 199.74.35.9
root@pvt-pyle:~# pvecm status # after running systemctl restart corosync.service 8 million times on each node
Cluster information
-------------------
Name: peckservers
Config Version: 16
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Wed Aug 28 20:57:54 2024
Quorum provider: corosync_votequorum
Nodes: 7
Node ID: 0x00000002
Ring ID: 1.1539d
Quorate: Yes
Votequorum information
----------------------
Expected votes: 7
Highest expected: 7
Total votes: 7
Quorum: 4
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.69.153
0x00000002 1 199.74.35.4 (local)
0x00000003 1 199.74.35.5
0x00000004 1 199.74.35.6
0x00000005 1 199.74.35.7
0x00000006 1 199.74.35.8
0x00000007 1 199.74.35.9
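For what it's worth, my understanding is that with 7 expected votes the quorum threshold is floor(7/2) + 1 = 4, which matches the Quorum: 4 line above, so with only 2 nodes visible in the first output everything stays blocked.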
If anything else would be helpful please let me know, but as it stands now Proxmox is completely unusable and I feel like I've tried everything.