Proxmox 7.2 issue: cluster not joining back together

vernimmen

Hi,

I'm running a cluster with 28 nodes. After a full cluster reboot (all nodes restarted), none of the nodes rejoin the cluster. This is what each node sees:
Code:
Cluster information
-------------------
Name:             pvenl02
Config Version:   57
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Sep 12 23:40:54 2022
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000003
Ring ID:          3.5498
Quorate:          No

Votequorum information
----------------------
Expected votes:   28
Highest expected: 28
Total votes:      1
Quorum:           15 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 10.40.6.3 (local)

Rebooting individual nodes doesn't help, even though corosync seems to be ok:
Code:
# corosync-cfgtool -s
Local node ID 3, transport knet
LINK ID 0 udp
    addr    = 10.40.6.3
    status:
        nodeid:          1:    connected
        nodeid:          2:    connected
        nodeid:          3:    localhost
        nodeid:          4:    connected
        nodeid:          5:    connected
        nodeid:          6:    connected
        nodeid:          7:    connected
        nodeid:          8:    connected
        nodeid:          9:    connected
        nodeid:         10:    connected
        nodeid:         11:    connected
        nodeid:         12:    connected
        nodeid:         13:    connected
        nodeid:         14:    connected
        nodeid:         15:    connected
        nodeid:         16:    connected
        nodeid:         17:    connected
        nodeid:         18:    connected
        nodeid:         19:    connected
        nodeid:         20:    connected
        nodeid:         21:    connected
        nodeid:         22:    connected
        nodeid:         23:    connected
        nodeid:         24:    connected
        nodeid:         25:    connected
        nodeid:         26:    connected
        nodeid:         27:    connected
        nodeid:         28:    connected

Only when I bring down all the hosts (physical power off) and then bring them back online one by one do the nodes cluster back up. If I leave one or two hosts powered on, it doesn't work. They really all need to be powered down.

Has anyone seen this before? Any ideas what is preventing the nodes from joining together despite "corosync-cfgtool -s" showing they can talk?
 
Please provide the output of pveversion -v from at least one node, as well as from any nodes where it differs.
In addition, please provide the journal or syslog of two nodes, covering both a boot and a reboot where corosync failed.
 
I have attached the information. Please let me know if anything is missing or more/different information is required.
 

Attachments

  • pveversion.txt (39 KB)
  • pve-02.syslog.reboot.txt (290.1 KB)
  • pve-02.syslog.boot.txt (261.9 KB)
  • pve-03.syslog.reboot.txt (597.3 KB)
  • pve-03.syslog.boot.txt (289.6 KB)
Thank you for the logs!

The following lines should not be happening:
Code:
Sep 12 21:39:36 pve-02 corosync[2021]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
Sep 12 21:39:36 pve-02 corosync[2021]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
Is the corosync network stable?
Are the clocks in sync between the nodes?

Could you provide the Corosync config? (/etc/pve/corosync.conf)

Does a simple restart of Corosync help, or is a reboot of the whole node required? (systemctl restart corosync.service)
 
In this case one or more STP TCNs caused issues on the network, which made all hypervisors in the cluster lose network connectivity long enough to trigger fencing and self-reboots. That started this whole situation. But once those TCNs had been handled and the STP topology was stable again, there were no network issues: pings show that all nodes can reach all others without packet loss and with low latency (<0.2 ms in most cases).
During normal operation the cluster is stable and remains stable for months. But if all nodes reboot (due to a power event, or because a network issue causes them all to fence and reboot), the cluster doesn't come back without manual work.

All nodes use chrony for time synchronization instead of timesyncd because it is more accurate and can do time stepping, which timesyncd can't. There is a chance that after a full cluster reboot not all machines are immediately running with synchronized clocks. There is no evidence for this in the logs, though.
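
If it helps, next time this happens I can check the clock state on each node right after boot with chrony's own tools, something like:
Code:
# rough check of chrony's sync state on a freshly booted node
chronyc tracking       # offset and stratum of the local clock
chronyc sources -v     # which time sources are reachable and selected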

corosync.conf:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.40.6.1
  }
  node {
    name: pve-02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.40.6.2
  }
  node {
    name: pve-03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.40.6.3
  }
  node {
    name: pve-04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.40.6.4
  }
  node {
    name: pve-05
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.40.6.9
  }
  node {
    name: pve-06
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.40.6.14
  }
  node {
    name: pve-07
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.40.6.17
  }
  node {
    name: pve-10
    nodeid: 20
    quorum_votes: 1
    ring0_addr: 10.40.6.27
  }
  node {
    name: pve-14
    nodeid: 27
    quorum_votes: 1
    ring0_addr: 10.40.6.40
  }
  node {
    name: pve-p01
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.40.6.5
  }
  node {
    name: pve-p02
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.40.6.6
  }
  node {
    name: pve-p03
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.40.6.7
  }
  node {
    name: pve-p04
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.40.6.8
  }
  node {
    name: pve-p05
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.40.6.10
  }
  node {
    name: pve-p06
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.40.6.16
  }
  node {
    name: pve-p07
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 10.40.6.15
  }
  node {
    name: pvec-01
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.40.6.11
  }
  node {
    name: pvec-02
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.40.6.12
  }
  node {
    name: pvec-03
    nodeid: 17
    quorum_votes: 1
    ring0_addr: 10.40.6.18
  }
  node {
    name: pvec-04
    nodeid: 22
    quorum_votes: 1
    ring0_addr: 10.40.6.29
  }
  node {
    name: pvec-05
    nodeid: 23
    quorum_votes: 1
    ring0_addr: 10.40.6.30
  }
  node {
    name: pvec-06
    nodeid: 24
    quorum_votes: 1
    ring0_addr: 10.40.6.31
  }
  node {
    name: pveh-01
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 10.40.6.19
  }
  node {
    name: pveh-02
    nodeid: 19
    quorum_votes: 1
    ring0_addr: 10.40.6.24
  }
  node {
    name: pveh-03
    nodeid: 26
    quorum_votes: 1
    ring0_addr: 10.40.6.32
  }
  node {
    name: pveh-04
    nodeid: 25
    quorum_votes: 1
    ring0_addr: 10.40.6.33
  }
  node {
    name: pvep-10
    nodeid: 21
    quorum_votes: 1
    ring0_addr: 10.40.6.28
  }
  node {
    name: pvep-11
    nodeid: 28
    quorum_votes: 1
    ring0_addr: 10.40.6.41
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pvenl02
  config_version: 57
  interface {
    knet_link_priority: 1
    linknumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Restarting corosync does not resolve the issue. Neither does rebooting individual nodes. The only solution is to power off all hypervisors and then bring them back one by one. If even 1 hypervisor is not powered down then the cluster can't be brought back online.
 
If restarting Corosync is not enough, then maybe it has something to do with the NIC itself.
Have you tried updating the firmware of the NIC?
 
Yes, we recently completed a firmware update round to bring the firmware of all system components up to recent versions. It made no difference. Besides, communication between the nodes is working fine at the very moment the nodes can't rejoin the cluster. Since 'corosync-cfgtool -s' shows the nodes are there and connected, but 'pvecm status' doesn't show them, I suspect this is a software issue in either corosync or Proxmox.
 
None that we know of, at least. It's strange that your Corosync gets `EAGAIN` on the loopback interface.

Would it be possible to enable debug logging for Corosync? Maybe this can give some hints as to what is happening.
According to the logs the connection is there to at least some of the nodes.
 
Yes, I can enable debug logging on Corosync when I run into this issue again. I can also reproduce the issue fairly reliably, but the sad part is that it takes a lot of time to bring everything back online after such an event, so I prefer to wait until it happens again on its own rather than trigger it deliberately. That could take a few months, though.
What method do you recommend to turn on debug logging?
Are there any other tests, settings, commands or logs you'd like to see applied or captured during such an event to make finding the root cause easier?
 
Today the whole cluster went down (all nodes rebooted) right after one node was shut down. Again the cluster nodes did not rejoin automatically.
I set 'debug: on' in corosync.conf and ran corosync manually in the foreground on two nodes. The logs are attached. We will now proceed to power all machines down again and bring them up one by one.
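
For reference, what I did boils down to roughly the following on those two nodes:
Code:
# corosync.conf: turn on debug logging
logging {
  debug: on
  to_syslog: yes
}

# then stop the service and run corosync in the foreground to watch its output
systemctl stop corosync
corosync -f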
 

Attachments

  • Archive.zip (670.1 KB)
Based on the logs and what you've described, it seems that corosync perhaps couldn't handle some topology changes in your network?
Code:
Sep 15 12:31:52.299 debug   [TOTEM ] entering GATHER state from 11(merge during join).
Sep 15 12:31:52.299 debug   [TOTEM ] entering GATHER state from 11(merge during join).
Sep 15 12:31:52.299 debug   [KNET  ] rx: Source host 26 not reachable yet. Discarding packet.
Sep 15 12:31:52.299 debug   [TOTEM ] Knet socket ERROR notification called: txrx=1, error=-1, errorno=11

I'd suggest opening a new issue upstream [0], explaining the situation and including the debug logs. It's definitely strange, especially that the loopback device is temporarily unavailable.



[0] https://github.com/corosync/corosync
 
I will do that. But regardless of what triggers the full cluster reboot, corosync seems to recover fine afterwards, as shown by the output of `corosync-cfgtool -s`, yet `pvecm status` does not show those other nodes. How does the Proxmox software interact with corosync? What needs to happen for a node to show up in `pvecm nodes` after it is shown as 'connected' in the output of `corosync-cfgtool -s`?
 
corosync-cfgtool only checks the link layer, pvecm nodes/status the quorum layer. the former is a requirement for the latter to work, but in your case, based on the logs, something seems to go wrong either network-wise or inside corosync. we'll continue investigating together with upstream.
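
as a side note, the quorum layer can also be inspected with corosync's own tools, independent of the PVE tooling, e.g. something along these lines:
Code:
# quorum/membership as seen by corosync's votequorum service
corosync-quorumtool -s
# runtime membership entries from corosync's in-memory database
corosync-cmapctl | grep -i member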
 
@fabian it's been forever! Can you please point me to the source tree for Proxmox fencing?

Feel free to reach out personally.

linkedin.com/in/sdake
 
hi - there's a name I haven't seen in quite a while ;)

feel free to catch me on IRC if you want to have a chat (or write me an email at f.gruenbichler@proxmox.com, or write to pve-devel@lists.proxmox.com for the public mailing list)

the high level overview of how PVE fencing works looks like this:
- we set up Corosync using a pretty standard config (knet, auth+enc, standard timeouts)
- there is a process running on each node that uses CPG to run distributed state machines for broadcasting status information and synchronizing changes to an sqlite database, with the contents of said sqlite database exposed as a FUSE file system on /etc/pve, and both the sqlite data and the status information exposed via ipcc/libqb (pmxcfs)
-- important aspect of this: write access to /etc/pve is blocked if the node is not part of the quorate partition, or the state machine hasn't properly synced yet.
- the rest of the PVE stack uses this IPCC and fuse access to read and write cluster-wide information (mostly config files, a few state files, status queries and a distributed locking mechanism)
- the HA stack consists of two services, a node-local and a cluster-wide resource manager (pve-ha-lrm , pve-ha-crm)
- on each node, the LRM writes a timestamp and status info to /etc/pve every few seconds to signify that this node is still alive and part of the quorum
- on each node, a watchdog is armed if HA is active (watchdog-mux)
- on each node, the LRM and CRM communicate with watchdog-mux (if that node is part of the quorate partition, check via pmxcfs access) to pull up the watchdog (or, in certain circumstances, tell it to disarm the watchdog)
- if the watchdog is not pulled up, watchdog-mux attempts to sync the journal (so that messages regarding fencing make it to disk) and the watchdog expires, fencing that node
- if other nodes are still up and quorate, the CRM (which is only "active" on one node at a time) will notice that that node's LRM timestamp is not updated anymore, and wait for the fencing timeouts to have passed, afterwards any HA resources (virtual guests) that were located on the fenced node are recovered on the remaining, quorate partition of the cluster

so the most direct cause of a "full cluster fence" event is pmxcfs not working on any node (either because of a bug in pmxcfs that causes the state machines to not sync up and allow writes again, or indirectly by an issue with/bug in CPG or corosync altogether).
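
as a practical aside, the moving parts above map to the following services on a standard PVE node; the write test is just a rough way to check whether pmxcfs considers the node quorate and synced (the file name is arbitrary):
Code:
# services involved in the chain described above
systemctl status corosync pve-cluster pve-ha-lrm pve-ha-crm watchdog-mux

# /etc/pve only becomes writable once pmxcfs has synced and the node is part of the quorate partition
touch /etc/pve/writetest && rm /etc/pve/writetest && echo writable || echo "read-only (not quorate / not synced)"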

I hope the above summary is somewhat understandable - if you have questions, don't hesitate to ask :)
 
Fabian,

Such an excellent technical summary; you have provided exactly what I needed. Thank you.

Does the state machine depend on virtual synchrony of membership messages? In other words, if the state machine receives membership messages in a different order on different nodes, does it resolve the inconsistencies correctly, or is there an expectation of ordering of membership messages?

Thank you.
Steve
 
the state machine re-synchronizes upon membership changes:
https://git.proxmox.com/?p=pve-clus...637a83bcd2fe4fecc48206e8ca1c1b4;hb=HEAD#l1099

the basic process is as follows (unless indicated otherwise, each step happens on *each node*):
- config change triggers epoch bump (epoch contains a counter, a timestamp, the local pid and the local nodeid)
- any previous sync state is cleared
- the mode/state is set to start_sync
- if the node has the lowest nodeid among the members, it sends a SYNC_START message
- upon receipt of the SYNC_START message, the contained sync epoch is stored, and local state is collected and broadcast as STATE message
- upon receipt of a STATE message (all nodes broadcast their local state to all nodes), the state is stored
- once all members have sent their state, the mode/state is switched to SYNCED or UPDATE (where UPDATE means that the process group leader sends further UPDATE messages followed by an UPDATE_COMPLETE message, upon which the state switches to SYNCED as well)

I left out some edge cases (e.g., membership consisting of a single node), what happens with non-sync/update related messages while a sync or update is ongoing (there's a queue that is also synced, and any normal messages, like those broadcast for actual content updates, are queued in the meantime and processed once synced up), as well as error handling, for simplicity's sake. the code in question is not that long, but as always with this kind of thing, it's tricky ;)

the sync epoch is used to discard outdated messages during a sync/update (a mismatching epoch is only accepted for the SYNC_START message, since those obviously should have a different one).

if a node receives a STATE message from a node that it doesn't see as member, or if it receives two STATE messages from the same member during a single sync run, it will leave the group (thus restarting the sync, since after a leave + cleanup, the state machine is restarted and joins again). AFAICT the sync epoch check and this additional safeguard should handle mis-ordering of membership change messages - but I am not the original author of this code.
 