Hi all!
TLDR: we've got a new corosync (and libraries) package on pvetest which includes the proposed upstream fix for a bug that is behind most 'PVE process XYZ stuck for more than 120 seconds' reports. Those affected, please test corosync version 2.4.2-pve5 so we can catch any problems with this solution early!
Longer story: We had quite a few reports of PVE daemons stuck in D state (uninterruptible IO wait), resulting in systems with reduced usability. The classic and most-reported process was pve-ha-lrm, but the LRM (or the other daemons, for that matter) was not at fault itself. We suspected that since the beginning of such reports, but could only be sure once we finally reproduced it on our test systems - at first by accident on a Ceph test system, which then allowed Fabian and me to sit together and find a real reproducer (many thanks for his help on this).
Basically the problem goes back to corosync's Closed Process Groups (CPG) library, a building block of our distributed configuration filesystem (pmxcfs). An artificial reproducer can easily be set up in a virtual cluster (i.e., nested PVE instances) of size 3 (or bigger).
The following sequence is needed on a healthy cluster:
1. Pause the node with the lowest node ID
2. Create/touch/change a file in /etc/pve on another node
3. Resume the node with the lowest node ID
4. Try to redo step 2 - it will hang now. (BTW, a quick way out is to restart pve-cluster and corosync on node 1.)
Steps 1 and 3 are normally caused by high IO load; this does not need to be minutes or hours of high IO, a few badly timed seconds may be enough. Corosync, while running as a realtime process, may not get scheduled during such a spike, which effectively acts like a pause/resume.
Why the hang? We must be able to rely on CPG to tell us about configuration changes (i.e., members joining or leaving the group). In corosync and pmxcfs, lower member IDs are chosen over higher ones for the temporary (sync) master. So in the above case node 1 would need to receive a leave callback for the current process group and a join callback for the new process group created by the other members. This would allow node 1 to notice the change and sync up inode updates.
But while the remaining (unpaused) nodes act correctly - they see node 1 vanish and create a new process group with node 2 as master - node 1 gets neither the leave nor the join message, thinks 'hey, nothing happened' and still considers itself the master. This only goes well as long as nothing is changed; after a change, a now out-of-sync semaphore deadlocks every future filesystem request.
(condensed technical explanation)
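For those curious how this looks at the API level, here is a small, purely illustrative CPG client in C. It is not pmxcfs code, and the group name and the "lowest node ID wins" logic are simplified assumptions - it only shows the confchg callback through which a CPG member learns about joins/leaves, and why a node that never receives those callbacks keeps believing it is the master:

```c
/*
 * Illustrative sketch only -- NOT pmxcfs source code; group name and
 * "sync master" logic are simplified assumptions.
 *
 * Build (Debian: libcpg-dev should provide the header/library):
 *   gcc -o cpg-demo cpg-demo.c -lcpg
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <corosync/cpg.h>

static unsigned int local_nodeid;

/* Called by libcpg whenever the process group membership changes.
 * The paused node in the bug described above never receives this callback
 * for the leave/re-join it missed, so its view of the group goes stale. */
static void confchg_cb(cpg_handle_t handle,
                       const struct cpg_name *group_name,
                       const struct cpg_address *member_list, size_t member_list_entries,
                       const struct cpg_address *left_list, size_t left_list_entries,
                       const struct cpg_address *joined_list, size_t joined_list_entries)
{
    uint32_t lowest = UINT32_MAX;

    printf("confchg: %zu members, %zu left, %zu joined\n",
           member_list_entries, left_list_entries, joined_list_entries);

    for (size_t i = 0; i < member_list_entries; i++) {
        printf("  member nodeid %u (pid %u)\n",
               member_list[i].nodeid, member_list[i].pid);
        if (member_list[i].nodeid < lowest)
            lowest = member_list[i].nodeid;
    }

    /* lowest node ID wins the role of temporary (sync) master */
    if (lowest == local_nodeid)
        printf("  -> I (%u) am the sync master\n", local_nodeid);
    else
        printf("  -> node %u is the sync master\n", lowest);
}

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group_name,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
    printf("message from nodeid %u (%zu bytes)\n", nodeid, msg_len);
}

int main(void)
{
    cpg_handle_t handle;
    cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = deliver_cb,
        .cpg_confchg_fn = confchg_cb,
    };
    struct cpg_name group;

    if (cpg_initialize(&handle, &callbacks) != CS_OK) {
        fprintf(stderr, "cpg_initialize failed (is corosync running?)\n");
        return 1;
    }
    cpg_local_get(handle, &local_nodeid);

    strcpy(group.value, "test_group");   /* arbitrary demo group name */
    group.length = strlen(group.value);
    if (cpg_join(handle, &group) != CS_OK) {
        fprintf(stderr, "cpg_join failed\n");
        return 1;
    }

    /* block and run the callbacks as corosync delivers events */
    cpg_dispatch(handle, CS_DISPATCH_BLOCKING);

    cpg_finalize(handle);
    return 0;
}
```

Running a client along these lines on each node of the three-node test cluster and then pausing/resuming the lowest-ID node makes the problem directly visible: the confchg callbacks fire on the other nodes, but not on the paused one - roughly the kind of standalone test case (without pmxcfs) mentioned below.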
We could reduce this to a simpler test case without pmxcfs and reported it to corosync upstream, who confirmed that we had found "a pretty serious bug". A security issue reported around the same time was understandably deemed more important, and so we only got a proposed fix a few days ago.
This fix is packaged and available in our test repository (pvetest), with corosync (and its automatically pulled-in library dependencies) at version 2.4.2-pve5 (normally only libcpg4 would be strictly needed, but it's best to update them all).
Note: naturally, 'process xyz stuck for longer than ...' messages and D state can still happen; they were just the resulting symptom here and can have many other causes (e.g. faulty hardware, kernel bugs, ...).
This affects only daemons and tools operating on /etc/pve, which are mostly our own PVE daemons.
If we get some testing exposure, it would really help us and upstream to bless the proposed solution as OK, and to better rule out any regressions caused by it.