Proxmox 7.2 issue: cluster not joining back together

vernimmen

Hi,

I'm running a cluster with 28 nodes. After a full cluster reboot (all nodes restarted), none of the nodes rejoin the cluster. This is what each node sees:
Code:
Cluster information
-------------------
Name:             pvenl02
Config Version:   57
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Sep 12 23:40:54 2022
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000003
Ring ID:          3.5498
Quorate:          No

Votequorum information
----------------------
Expected votes:   28
Highest expected: 28
Total votes:      1
Quorum:           15 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 10.40.6.3 (local)

Rebooting individual nodes doesn't help, even though corosync seems to be ok:
Code:
# corosync-cfgtool -s
Local node ID 3, transport knet
LINK ID 0 udp
    addr    = 10.40.6.3
    status:
        nodeid:          1:    connected
        nodeid:          2:    connected
        nodeid:          3:    localhost
        nodeid:          4:    connected
        nodeid:          5:    connected
        nodeid:          6:    connected
        nodeid:          7:    connected
        nodeid:          8:    connected
        nodeid:          9:    connected
        nodeid:         10:    connected
        nodeid:         11:    connected
        nodeid:         12:    connected
        nodeid:         13:    connected
        nodeid:         14:    connected
        nodeid:         15:    connected
        nodeid:         16:    connected
        nodeid:         17:    connected
        nodeid:         18:    connected
        nodeid:         19:    connected
        nodeid:         20:    connected
        nodeid:         21:    connected
        nodeid:         22:    connected
        nodeid:         23:    connected
        nodeid:         24:    connected
        nodeid:         25:    connected
        nodeid:         26:    connected
        nodeid:         27:    connected
        nodeid:         28:    connected

Only when I bring down all the hosts (physical power off) and then bring them back online one by one do the nodes cluster back up. If I leave one or two hosts powered on, it doesn't work. They really all need to be powered down.

Has anyone seen this before? Any ideas what is preventing the nodes from joining together despite "corosync-cfgtool -s" showing they can talk?
 
Please provide the output of pveversion -v from at least one node, as well as from any nodes where it differs.
In addition, please provide the journal or syslog of two nodes, covering both a boot and a reboot where corosync failed.
 
I have attached the information. Please let me know if anything is missing or more/different information is required.
 

Attachments

  • pveversion.txt (39 KB)
  • pve-02.syslog.reboot.txt (290.1 KB)
  • pve-02.syslog.boot.txt (261.9 KB)
  • pve-03.syslog.reboot.txt (597.3 KB)
  • pve-03.syslog.boot.txt (289.6 KB)
Thank you for the logs!

The following lines should not be happening:
Code:
Sep 12 21:39:36 pve-02 corosync[2021]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
Sep 12 21:39:36 pve-02 corosync[2021]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
Is the corosync network stable?
Are the clocks in sync between the nodes?

Could you provide the Corosync config? (/etc/pve/corosync.conf)

Does a simple restart of Corosync help, or is a reboot of the whole node required? (systemctl restart corosync.service)
 
In this case one or more STP TCNs caused issues on the network, which made all hypervisors in the cluster lose network connectivity long enough to trigger fencing and self-reboots. That started this whole situation. But once those TCNs had been handled and the STP topology was stable again, there were no network issues: pings show that all nodes can reach all others without packet loss and with low latency (<0.2 ms in most cases).
During normal operation the cluster is stable and remains stable for months. But if all nodes reboot (due to a power event, or because a network issue causes them all to fence and reboot), the cluster doesn't come back without manual work.

All nodes use chrony for time synchronization instead of timesyncd because it is more accurate and can do time stepping, which timesyncd can't. There is a chance that after a full cluster reboot not all machines are immediately running with synchronized clocks. There is no evidence for this in the logs, though.
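
If it helps, next time this happens I can check the clock state on each node right after boot with chrony's own tools, something like:
Code:
# rough check of chrony's sync state on a freshly booted node
chronyc tracking       # offset and stratum of the local clock
chronyc sources -v     # which time sources are reachable and selected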

corosync.conf:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.40.6.1
  }
  node {
    name: pve-02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.40.6.2
  }
  node {
    name: pve-03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.40.6.3
  }
  node {
    name: pve-04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.40.6.4
  }
  node {
    name: pve-05
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.40.6.9
  }
  node {
    name: pve-06
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.40.6.14
  }
  node {
    name: pve-07
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.40.6.17
  }
  node {
    name: pve-10
    nodeid: 20
    quorum_votes: 1
    ring0_addr: 10.40.6.27
  }
  node {
    name: pve-14
    nodeid: 27
    quorum_votes: 1
    ring0_addr: 10.40.6.40
  }
  node {
    name: pve-p01
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.40.6.5
  }
  node {
    name: pve-p02
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.40.6.6
  }
  node {
    name: pve-p03
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.40.6.7
  }
  node {
    name: pve-p04
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.40.6.8
  }
  node {
    name: pve-p05
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.40.6.10
  }
  node {
    name: pve-p06
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.40.6.16
  }
  node {
    name: pve-p07
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 10.40.6.15
  }
  node {
    name: pvec-01
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.40.6.11
  }
  node {
    name: pvec-02
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.40.6.12
  }
  node {
    name: pvec-03
    nodeid: 17
    quorum_votes: 1
    ring0_addr: 10.40.6.18
  }
  node {
    name: pvec-04
    nodeid: 22
    quorum_votes: 1
    ring0_addr: 10.40.6.29
  }
  node {
    name: pvec-05
    nodeid: 23
    quorum_votes: 1
    ring0_addr: 10.40.6.30
  }
  node {
    name: pvec-06
    nodeid: 24
    quorum_votes: 1
    ring0_addr: 10.40.6.31
  }
  node {
    name: pveh-01
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 10.40.6.19
  }
  node {
    name: pveh-02
    nodeid: 19
    quorum_votes: 1
    ring0_addr: 10.40.6.24
  }
  node {
    name: pveh-03
    nodeid: 26
    quorum_votes: 1
    ring0_addr: 10.40.6.32
  }
  node {
    name: pveh-04
    nodeid: 25
    quorum_votes: 1
    ring0_addr: 10.40.6.33
  }
  node {
    name: pvep-10
    nodeid: 21
    quorum_votes: 1
    ring0_addr: 10.40.6.28
  }
  node {
    name: pvep-11
    nodeid: 28
    quorum_votes: 1
    ring0_addr: 10.40.6.41
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pvenl02
  config_version: 57
  interface {
    knet_link_priority: 1
    linknumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Restarting corosync does not resolve the issue. Neither does rebooting individual nodes. The only solution is to power off all hypervisors and then bring them back one by one. If even 1 hypervisor is not powered down then the cluster can't be brought back online.
 
If restarting Corosync is not enough, then maybe it has something to do with the NIC itself.
Have you tried updating the firmware of the NIC?
 
Yes, we recently completed a firmware update round to bring the firmware of all system components up to recent versions. It made no difference. Besides, communication between the nodes is working fine at the very moment the nodes can't rejoin the cluster. Since 'corosync-cfgtool -s' shows the nodes are there and connected, but 'pvecm status' doesn't show them, I suspect this is a software issue in either corosync or Proxmox.
 
None that we know of, at least. It's strange that your Corosync gets `EAGAIN` on the loopback interface.

Would it be possible to enable debug logging for Corosync? Maybe this can give some hints as to what is happening.
According to the logs the connection is there to at least some of the nodes.
 
Yes, I can enable debug logging on Corosync when I run into this issue again. I can also reproduce the issue fairly reliably, but the sad part is that it takes a lot of time to bring everything back online after such an event, so I prefer to wait until it happens again on its own rather than trigger it deliberately. That could take a few months, though.
What method do you recommend to turn on debug logging?
Are there any other tests, settings, commands or logs you'd like to see applied or captured during such an event to make finding the root cause easier?
 
Today the whole cluster went down (all nodes rebooted) right after one node was shut down. Again the cluster nodes did not rejoin automatically.
I set 'debug: on' in corosync.conf and ran corosync manually in the foreground on two nodes. The logs are attached. We will now proceed to power all machines down again and bring them up one by one.
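
For reference, what I did boils down to roughly the following on those two nodes:
Code:
# corosync.conf: turn on debug logging
logging {
  debug: on
  to_syslog: yes
}

# then stop the service and run corosync in the foreground to watch its output
systemctl stop corosync
corosync -f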
 

Attachments

  • Archive.zip (670.1 KB)
Based on the logs and what you've described, it seems that corosync perhaps couldn't handle some topology changes in your network?
Code:
Sep 15 12:31:52.299 debug   [TOTEM ] entering GATHER state from 11(merge during join).
Sep 15 12:31:52.299 debug   [TOTEM ] entering GATHER state from 11(merge during join).
Sep 15 12:31:52.299 debug   [KNET  ] rx: Source host 26 not reachable yet. Discarding packet.
Sep 15 12:31:52.299 debug   [TOTEM ] Knet socket ERROR notification called: txrx=1, error=-1, errorno=11

I'd suggest opening a new issue upstream [0], explaining the situation and including the debug logs. It's definitely strange, especially that the loopback device is temporarily unavailable.



[0] https://github.com/corosync/corosync
 
I will do that. But regardless of what triggers the full cluster reboot, corosync seems to recover fine afterwards, as shown by the output of `corosync-cfgtool -s`, yet `pvecm status` does not show those other nodes. How does the Proxmox software interact with corosync? What needs to happen for a node to show up in `pvecm nodes` after it is shown as 'connected' in the output of `corosync-cfgtool -s`?
 
corosync-cfgtool only checks the link layer, pvecm nodes/status the quorum layer. the former is a requirement for the latter to work, but in your case, based on the logs, something seems to go wrong either network-wise or inside corosync. we'll continue investigating together with upstream.
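
as a side note, the quorum layer can also be inspected with corosync's own tools, independent of the PVE tooling, e.g. something along these lines:
Code:
# quorum/membership as seen by corosync's votequorum service
corosync-quorumtool -s
# runtime membership entries from corosync's in-memory database
corosync-cmapctl | grep -i member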
 
@fabian it's been forever! Can you please point me to the source tree for Proxmox fencing?

Feel free to reach out personally.

linkedin.com/in/sdake
 
hi - there's a name I haven't seen in quite a while ;)

feel free to catch me on IRC if you want to have a chat (or write me an email at f.gruenbichler@proxmox.com, or write to pve-devel@lists.proxmox.com for the public mailing list)

the high level overview of how PVE fencing works looks like this:
- we set up Corosync using a pretty standard config (knet, auth+enc, standard timeouts)
- there is a process running on each node that uses CPG to run distributed state machines for broadcasting status information and synchronizing changes to an sqlite database, with the contents of said sqlite database exposed as a FUSE file system on /etc/pve, and both the sqlite data and the status information exposed via ipcc/libqb (pmxcfs)
-- important aspect of this: write access to /etc/pve is blocked if the node is not part of the quorate partition, or the state machine hasn't properly synced yet.
- the rest of the PVE stack uses this IPCC and fuse access to read and write cluster-wide information (mostly config files, a few state files, status queries and a distributed locking mechanism)
- the HA stack consists of two services, a node-local and a cluster-wide resource manager (pve-ha-lrm , pve-ha-crm)
- on each node, the LRM writes a timestamp and status info to /etc/pve every few seconds to signify that this node is still alive and part of the quorum
- on each node, a watchdog is armed if HA is active (watchdog-mux)
- on each node, the LRM and CRM communicate with watchdog-mux (if that node is part of the quorate partition, check via pmxcfs access) to pull up the watchdog (or, in certain circumstances, tell it to disarm the watchdog)
- if the watchdog is not pulled up, watchdog-mux attempts to sync the journal (so that messages regarding fencing make it to disk) and the watchdog expires, fencing that node
- if other nodes are still up and quorate, the CRM (which is only "active" on one node at a time) will notice that that node's LRM timestamp is not updated anymore, and wait for the fencing timeouts to have passed, afterwards any HA resources (virtual guests) that were located on the fenced node are recovered on the remaining, quorate partition of the cluster

so the most direct cause of a "full cluster fence" event is pmxcfs not working on any node (either because of a bug in pmxcfs that causes the state machines to not sync up and allow writes again, or indirectly by an issue with/bug in CPG or corosync altogether).
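
as a practical aside, the moving parts above map to the following services on a standard PVE node; the write test is just a rough way to check whether pmxcfs considers the node quorate and synced (the file name is arbitrary):
Code:
# services involved in the chain described above
systemctl status corosync pve-cluster pve-ha-lrm pve-ha-crm watchdog-mux

# /etc/pve only becomes writable once pmxcfs has synced and the node is part of the quorate partition
touch /etc/pve/writetest && rm /etc/pve/writetest && echo writable || echo "read-only (not quorate / not synced)"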

I hope the above summary is somewhat understandable - if you have questions, don't hesitate to ask :)
 
Fabian,

Such an excellent technical summary; you have provided exactly what I needed. Thank you.

Does the state machine depend on virtual synchrony of membership messages? In other words, if the state machine receives membership messages in a different order on different nodes, does it resolve the inconsistencies correctly, or is there an expectation of ordering of membership messages?

Thank you.
Steve
 
the state machine re-synchronizes upon membership changes:
https://git.proxmox.com/?p=pve-clus...637a83bcd2fe4fecc48206e8ca1c1b4;hb=HEAD#l1099

the basic process is as follows (unless indicated otherwise, each step happens on *each node*):
- config change triggers epoch bump (epoch contains a counter, a timestamp, the local pid and the local nodeid)
- any previous sync state is cleared
- the mode/state is set to start_sync
- if the node has the lowest nodeid among the members, it sends a SYNC_START message
- upon receipt of the SYNC_START message, the contained sync epoch is stored, and local state is collected and broadcast as STATE message
- upon receipt of a STATE message (all nodes broadcast their local state to all nodes), the state is stored
- once all members have sent their state, the mode/state is switched to SYNCED or UPDATE (where UPDATE means that the process group leader sends further UPDATE messages followed by an UPDATE_COMPLETE message, upon which the state switches to SYNCED as well)

I left out some edge cases (e.g., membership consisting of a single node), what happens with non-sync/update related messages while a sync or update is ongoing (there's a queue that is also synced, and any normal messages, like those broadcast for actual content updates, are queued in the meantime and processed once synced up), as well as error handling, for simplicity's sake. the code in question is not that long, but as always with this kind of thing, it's tricky ;)

the sync epoch is used to discard outdated messages during a sync/update (a mismatching epoch is only accepted for the SYNC_START message, since those obviously should have a different one).

if a node receives a STATE message from a node that it doesn't see as member, or if it receives two STATE messages from the same member during a single sync run, it will leave the group (thus restarting the sync, since after a leave + cleanup, the state machine is restarted and joins again). AFAICT the sync epoch check and this additional safeguard should handle mis-ordering of membership change messages - but I am not the original author of this code.
 