codym

Jul 11, 2022
Hello all,

I am running into a seemingly unique issue where my cluster goes split-brain on its own. This happened on the last reboot and seemed to start because the /etc/hosts file no longer defined the correct member IPs and hostnames. I fixed that, and the cluster now works for about 3 minutes when all nodes are started at the same time, but no matter what I do it then fragments. I cannot migrate my VMs off the other nodes onto a single one, and I cannot seem to do anything once the cluster loses quorum. I also now have a stuck migration that I cannot clear with qm unlock.

Here is my corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: proxmox-02
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.69.153
  }
  node {
    name: proxmox-03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 199.74.35.5
  }
  node {
    name: proxmox-04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 199.74.35.6
  }
  node {
    name: proxmox-05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 199.74.35.7
  }
  node {
    name: proxmox-06
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 199.74.35.8
  }
  node {
    name: proxmox-07
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 199.74.35.9
  }
  node {
    name: pvt-pyle
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 199.74.35.4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: peckservers
  config_version: 16
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

It's important to note that proxmox-02 is in the cluster only for ease of management and does not handle critical decision-making or file-serving tasks. This node is off-site and connected over a VPN.
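
To make the off-site node easier to spot in the nodelist, here is a minimal Python sketch that groups the ring0_addr entries by network. It is only an illustration: the function name nodes_by_network, the trimmed SAMPLE_CONF, and the /24 prefix are my assumptions (corosync.conf itself does not state a netmask), and the regex only handles the simple layout shown above.

```python
import ipaddress
import re
from collections import defaultdict

# Trimmed copy of the nodelist above (three of the seven nodes), for illustration.
SAMPLE_CONF = """
nodelist {
  node {
    name: proxmox-02
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.69.153
  }
  node {
    name: proxmox-03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 199.74.35.5
  }
  node {
    name: pvt-pyle
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 199.74.35.4
  }
}
"""


def nodes_by_network(conf_text, prefix=24):
    """Group node names by the network their ring0_addr falls in.

    The prefix length is an assumption (default /24); adjust it to match
    the real netmask of each site.
    """
    groups = defaultdict(list)
    # \bname: skips keys such as cluster_name; each name is paired with
    # the next ring0_addr that follows it in the same node block.
    pairs = re.findall(r"\bname:\s*(\S+).*?ring0_addr:\s*(\S+)", conf_text, re.S)
    for name, addr in pairs:
        network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        groups[str(network)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for network, names in nodes_by_network(SAMPLE_CONF).items():
        print(network, names)
```

With the full nodelist, proxmox-02 would be the only entry outside 199.74.35.0/24, which matches it being the one node reached through the VPN.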

Here is my ceph.conf:

Code:
root@pvt-pyle:~# cat /etc/ceph/ceph.conf 
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 199.74.35.0/24
        fsid = 16980b31-630f-43fe-9f4e-ba6a4ea4b9b5
        mon_allow_pool_delete = true
        mon_host = 199.74.35.4 199.74.35.6 199.74.35.7
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 199.74.35.0/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
        keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.proxmox-03]
        host = proxmox-03
        mds_standby_for_name = pve

[mds.proxmox-04]
        host = proxmox-04
        mds_standby_for_name = pve

[mds.pvt-pyle]
        host = pvt-pyle
        mds_standby_for_name = pve

[mon.proxmox-04]
        public_addr = 199.74.35.6

[mon.proxmox-05]
        public_addr = 199.74.35.7

[mon.pvt-pyle]
        public_addr = 199.74.35.4

[osd]
        osd_max_backfills = 20
        osd_mclock_scheduler_background_recovery_lim = 0.8
        osd_mclock_scheduler_background_recovery_res = 0.8
        osd_recovery_max_active = 20

Here is a bad and then a good pvecm status:
Code:
root@pvt-pyle:~# pvecm status # not working and many nodes physically on but will not join cluster
Cluster information
-------------------
Name:             peckservers
Config Version:   16
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Aug 28 20:52:04 2024
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          2.152eb
Quorate:          No

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      2
Quorum:           4 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 199.74.35.4 (local)
0x00000007          1 199.74.35.9
root@pvt-pyle:~# pvecm status # after running systemctl restart corosync.service 8 million times on each node
Cluster information
-------------------
Name:             peckservers
Config Version:   16
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Aug 28 20:57:54 2024
Quorum provider:  corosync_votequorum
Nodes:            7
Node ID:          0x00000002
Ring ID:          1.1539d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      7
Quorum:           4 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.69.153
0x00000002          1 199.74.35.4 (local)
0x00000003          1 199.74.35.5
0x00000004          1 199.74.35.6
0x00000005          1 199.74.35.7
0x00000006          1 199.74.35.8
0x00000007          1 199.74.35.9
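
For reference, the "Quorum: 4" figure in both outputs is consistent with a simple majority of the 7 expected votes. Below is a minimal Python sketch of that arithmetic, assuming plain votequorum behaviour with no special options enabled; the function names are mine and are not part of any Proxmox or corosync tool.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for a simple majority of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(total_votes: int, expected_votes: int) -> bool:
    """True when the votes currently present reach the majority threshold."""
    return total_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # Values taken from the two pvecm status outputs above.
    print(quorum_threshold(7))   # threshold for 7 expected votes
    print(is_quorate(2, 7))      # first output: 2 of 7 votes present
    print(is_quorate(7, 7))      # second output: 7 of 7 votes present
```

With 7 expected votes the threshold is 4, so the 2-node partition in the first output is blocked, while any partition holding at least 4 votes stays quorate.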
If anything else would be helpful, please let me know. As it stands now, Proxmox is completely unusable, and I feel like I've tried everything.
 
I would like to update this: it all started when a CenturyLink tech came and ripped apart the network at my off-site location. I was not made aware of this, other than noticing that once we got our VPN tunnel back up and working, the node would not come back online. I guess at some point I forced it back, or my rebooting the rest of the datacenter triggered it to come back. I have now run systemctl stop corosync.service on that node, and the rest of the cluster has returned to full functionality. I am really curious what could have gone wrong to make this happen, and any help would be appreciated!
 
Does nobody here know what is happening? I have provided all the information I can find and done as much independent research as possible, but a month later it's crickets.
 
