Proxmox cluster nodes will not stay in one piece

codym

Member
Jul 11, 2022
Hello all,

I am running into a seemingly unique issue where my cluster will split-brain on its own. This started on the last reboot and seemed to begin because the /etc/hosts file no longer defined the correct member IPs and hostnames. I fixed that, and when all nodes are started at the same time the cluster works for about three minutes, but no matter what it then fragments. I cannot migrate my VMs off of any node, and I cannot do anything once quorum is lost. I also now have a stuck migration that qm unlock will not clear.
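To catch exactly when it fragments, I have been polling pvecm status from a shell loop. This is just a sketch of what I run; it assumes the "Quorate:" line format that pvecm status prints:

```shell
#!/bin/sh
# parse_quorate: reads pvecm-status-style output on stdin and prints
# the value of the "Quorate:" line ("Yes" or "No").
parse_quorate() {
    awk -F':[ \t]*' '$1 == "Quorate" { print $2 }'
}

# watch_quorum (sketch): log a timestamp whenever the quorum state changes.
watch_quorum() {
    prev=""
    while true; do
        cur="$(pvecm status 2>/dev/null | parse_quorate)"
        if [ "$cur" != "$prev" ]; then
            echo "$(date -Is) Quorate: ${cur:-unknown}"
            prev="$cur"
        fi
        sleep 5
    done
}
```

Running watch_quorum on a couple of nodes at once is how I narrowed the fragmenting down to roughly the three-minute mark.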

Here is my corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: proxmox-02
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.69.153
  }
  node {
    name: proxmox-03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 199.74.35.5
  }
  node {
    name: proxmox-04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 199.74.35.6
  }
  node {
    name: proxmox-05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 199.74.35.7
  }
  node {
    name: proxmox-06
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 199.74.35.8
  }
  node {
    name: proxmox-07
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 199.74.35.9
  }
  node {
    name: pvt-pyle
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 199.74.35.4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: peckservers
  config_version: 16
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

It's important to note that proxmox-02 is only in the cluster for ease of management; it does not serve critical decision-making or file-serving tasks. This node is off-site, reachable over a VPN.
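Since that node is only reachable over a VPN, I have been wondering whether it should carry a vote at all. A sketch of what I mean, assuming a management-only node can be given quorum_votes: 0 (config_version would need to be bumped as well):

Code:
  node {
    name: proxmox-02
    nodeid: 1
    quorum_votes: 0
    ring0_addr: 192.168.69.153
  }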

Here is my ceph.conf:

Code:
root@pvt-pyle:~# cat /etc/ceph/ceph.conf 
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 199.74.35.0/24
        fsid = 16980b31-630f-43fe-9f4e-ba6a4ea4b9b5
        mon_allow_pool_delete = true
        mon_host = 199.74.35.4 199.74.35.6 199.74.35.7
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 199.74.35.0/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
        keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.proxmox-03]
        host = proxmox-03
        mds_standby_for_name = pve

[mds.proxmox-04]
        host = proxmox-04
        mds_standby_for_name = pve

[mds.pvt-pyle]
        host = pvt-pyle
        mds_standby_for_name = pve

[mon.proxmox-04]
        public_addr = 199.74.35.6

[mon.proxmox-05]
        public_addr = 199.74.35.7

[mon.pvt-pyle]
        public_addr = 199.74.35.4

[osd]
        osd_max_backfills = 20
        osd_mclock_scheduler_background_recovery_lim = 0.8
        osd_mclock_scheduler_background_recovery_res = 0.8
        osd_recovery_max_active = 20

Here are a bad and a good pvecm status (bad first, then after restarting corosync):
Code:
root@pvt-pyle:~# pvecm status # not working and many nodes physically on but will not join cluster
Cluster information
-------------------
Name:             peckservers
Config Version:   16
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Aug 28 20:52:04 2024
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          2.152eb
Quorate:          No

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      2
Quorum:           4 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 199.74.35.4 (local)
0x00000007          1 199.74.35.9
root@pvt-pyle:~# pvecm status # after running systemctl restart corosync.service 8 million times on each node
Cluster information
-------------------
Name:             peckservers
Config Version:   16
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Aug 28 20:57:54 2024
Quorum provider:  corosync_votequorum
Nodes:            7
Node ID:          0x00000002
Ring ID:          1.1539d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      7
Quorum:           4 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.69.153
0x00000002          1 199.74.35.4 (local)
0x00000003          1 199.74.35.5
0x00000004          1 199.74.35.6
0x00000005          1 199.74.35.7
0x00000006          1 199.74.35.8
0x00000007          1 199.74.35.9
If anything else would be helpful please let me know, but as it stands now Proxmox is completely unusable and I feel like I've tried everything.
 
I would like to update this and say that this all started when a CenturyLink tech came and ripped apart the network at my off-site location. I was not made aware of this; I only noticed that once we got our VPN tunnel back up and working, the node would not come back online. I guess at some point I forced it back, or rebooting the rest of the datacenter triggered it to come back. I have since run systemctl stop corosync.service on that node, and the rest of the cluster has returned to full functionality. I am really curious what could possibly have gone wrong to make this happen, and any help would be appreciated!
 
Does nobody here know what is happening? I have provided all the information I can find and done as much independent research as possible, but a month later it's crickets.