corosync update available to address most "PVE process XYZ stuck for ..." errors and LRM hangs

t.lamprecht

Proxmox Staff Member
Hi all!

TLDR: we've got a new corosync (and libraries) package on pvetest which includes the proposed upstream fix for the bug behind most 'PVE process XYZ stuck for more than 120 seconds' errors. If you are affected, please test corosync version 2.4.2-pve5 and report any problems you find with this solution!

Longer story: We had quite a few reports of PVE daemons stuck in D state (uninterruptible IO wait), leaving systems barely usable; the most commonly reported process was pve-ha-lrm, but the LRM (or the other daemons, for that matter) was not at fault itself. We suspected this from the beginning of such reports, but could only be sure once we finally managed to reproduce it on our test systems - at first by accident on a Ceph test system, which then allowed Fabian and me to sit together and work out a real reproducer (many thanks for his help on this).

Basically, the problem goes back to corosync's Closed Process Groups (CPG) library, a building block of our distributed configuration filesystem (pmxcfs). An artificial reproducer can easily be set up in a virtual cluster (i.e., nested PVE instances) of size 3 (or bigger).

The following sequence is needed on a healthy cluster (a rough command sketch follows below):

1. Pause the node with the lowest node ID
2. Create/touch/change a file in /etc/pve on another node
3. Resume the node with the lowest node ID
4. Try to redo step 2 - it will hang now. (By the way, a quick way out is to restart pve-cluster and corosync on node 1.)
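To make the steps above a bit more concrete, here is a rough command sketch. It assumes three nested PVE VMs on an outer host; the VM ID 101 (holding the lowest corosync node ID) and the hostnames node1/node2 are placeholders:

Code:
# on the outer host: pause the node with the lowest node id (step 1)
qm suspend 101
# on another node: change something in /etc/pve (step 2)
ssh node2 'touch /etc/pve/repro-test'
# on the outer host: resume the paused node (step 3)
qm resume 101
# repeat step 2 - this now hangs (step 4)
ssh node2 'touch /etc/pve/repro-test-2'
# quick way out: restart pve-cluster and corosync on node 1
ssh node1 'systemctl restart pve-cluster corosync'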

Steps 1 and 3 are normally caused by high IO; this does not need to be minutes or hours of high IO load, a few badly timed seconds may be enough. Corosync, while marked as a realtime process, may not get scheduled during such a spike, which effectively acts like a pause/resume.
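If you suspect such scheduling starvation on your own systems, one quick sanity check (a sketch, not a diagnosis) is whether corosync actually runs with its realtime scheduling policy:

Code:
# corosync normally switches itself to a realtime policy (SCHED_RR);
# chrt from util-linux shows the current policy and priority of the process
chrt -p "$(pidof corosync)"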

Why the hang? We rely on CPG to tell us about configuration changes (i.e., members joining or leaving the group). Normally, in corosync and pmxcfs, members with lower IDs are chosen over those with higher IDs as temporary masters. So in the above case, node 1 would need to receive a 'leave' callback for the current process group and a 'join' callback for the new process group created by the other members. This would allow node 1 to notice the change and sync up inode updates.
But while the remaining (unpaused) nodes act correctly - they see node 1 vanish and create a new process group with node 2 as master - node 1 receives neither the leave nor the join message, thinks nothing happened, and still considers itself master. This only goes well as long as nothing is changed; after that, a now out-of-sync semaphore deadlocks every future filesystem request.
(condensed technical explanation)

We could reduce this to a simpler test case without pmxcfs and reported it to corosync upstream, who confirmed that we had found "a pretty serious bug". A security issue reported around the same time was understandably deemed more important, so we only received a proposed fix a few days ago.

This fix is packaged and in our test repository (pvetest), with corosync (and its automatically pulled dependencies) in version 2.4.2-pve5 (normally only libcpg4 should be needed, but it's best to update all).
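For those who want to help testing, a minimal update sketch, assuming a PVE 5.x node on Debian Stretch:

Code:
# temporarily enable the pvetest repository
echo "deb http://download.proxmox.com/debian/pve stretch pvetest" > /etc/apt/sources.list.d/pvetest.list
apt update
# pull in corosync 2.4.2-pve5 and its dependencies (or run apt full-upgrade to update everything)
apt install corosync libcpg4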

Note: naturally, 'process xyz stuck for longer than ...' messages and D states can still happen; they were just a resulting symptom here and can have many other causes (e.g. faulty hardware, kernel bugs, ...).
This bug affects only daemons and tools operating on /etc/pve, which are mostly our own PVE daemons.

Some testing exposure would really help us and upstream to bless the proposed solution as OK and to rule out regressions caused by it.
 
This fix is now also available in our non-production package repository (pve-no-subscription); please report if, after updating, you run into any new issues that could look like a corosync/clustering problem.
 
Dear Proxmox Team,
we're also affected by this problem, which always happens after adding a new node to the existing cluster. The first 10-12 cluster joins were successful, but then every new node triggered the problem. I think we're able to reproduce it and can give you more debug information if required. The current system is running Proxmox 6.0-9/508dce00 with corosync 3.0.2-pve4 (knet_transport changed to sctp to have corosync running stably in our environment). In the end we shut down every node and only booted each node up once the previous node had joined the cluster.

Code:
[421215.262783] INFO: task pveproxy worker:4932 blocked for more than 120 seconds.
[421215.263610]       Tainted: P           O      5.0.21-3-pve #1
[421215.264348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[421215.265072] pveproxy worker D    0  4932   1048 0x00000000
[421215.265784] Call Trace:
[421215.266481]  __schedule+0x2d4/0x870
[421215.267174]  schedule+0x2c/0x70
[421215.267990]  rwsem_down_read_failed+0xe6/0x170
[421215.268644]  call_rwsem_down_read_failed+0x18/0x30
[421215.269284]  down_read+0x20/0x40
[421215.269908]  lookup_slow+0x2c/0x60
[421215.270517]  walk_component+0x1bf/0x330
[421215.271128]  ? link_path_walk.part.43+0x29e/0x540
[421215.271814]  path_lookupat.isra.47+0x6d/0x220
[421215.272385]  filename_lookup.part.61+0xa0/0x170
[421215.272943]  ? __check_object_size+0x16b/0x17c
[421215.273487]  ? strncpy_from_user+0x57/0x1c0
[421215.274017]  user_path_at_empty+0x3e/0x50
[421215.274543]  vfs_statx+0x76/0xe0
[421215.275081]  ? _cond_resched+0x19/0x30
[421215.275720]  __do_sys_newstat+0x3d/0x70
[421215.276237]  ? _cond_resched+0x19/0x30
[421215.276747]  ? task_work_run+0x46/0xc0
[421215.277248]  __x64_sys_newstat+0x16/0x20
[421215.277746]  do_syscall_64+0x5a/0x110
[421215.278231]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[421215.278719] RIP: 0033:0x7fb8d075eaa5
[421215.279292] Code: Bad RIP value.
[421215.279737] RSP: 002b:00007ffce1b1c8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[421215.280182] RAX: ffffffffffffffda RBX: 0000561207ca5260 RCX: 00007fb8d075eaa5
[421215.280617] RDX: 0000561207ca5478 RSI: 0000561207ca5478 RDI: 0000561207e3d660
[421215.281039] RBP: 000056120e7237f0 R08: 0000000000000000 R09: 000000000000000f
[421215.281449] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[421215.281845] R13: 0000561207c2c811 R14: 0000561207e3d660 R15: 0000000000000000
[421215.282231] INFO: task pvesr:16378 blocked for more than 120 seconds.
[421215.282606]       Tainted: P           O      5.0.21-3-pve #1
[421215.282986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[421215.283509] pvesr           D    0 16378      1 0x00000000
[421215.283889] Call Trace:
[421215.284259]  __schedule+0x2d4/0x870
[421215.284626]  ? path_parentat.isra.44+0x3f/0x80
[421215.284994]  schedule+0x2c/0x70
[421215.285357]  rwsem_down_write_failed+0x160/0x340
[421215.285721]  call_rwsem_down_write_failed+0x17/0x30
[421215.286089]  down_write+0x2d/0x40
[421215.286445]  filename_create+0x8e/0x180
[421215.286814]  do_mkdirat+0x59/0x110
[421215.287307]  __x64_sys_mkdir+0x1b/0x20
[421215.287641]  do_syscall_64+0x5a/0x110
[421215.287964]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[421215.288283] RIP: 0033:0x7f15368c70d7
[421215.288596] Code: Bad RIP value.
[421215.288904] RSP: 002b:00007ffc5ff62198 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[421215.289228] RAX: ffffffffffffffda RBX: 00005565d7c47260 RCX: 00007f15368c70d7
[421215.289556] RDX: 00005565d6d191f4 RSI: 00000000000001ff RDI: 00005565db67d140
[421215.289887] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000f
[421215.290318] R10: 0000000000000000 R11: 0000000000000246 R12: 00005565d9351078
[421215.290837] R13: 00005565db67d140 R14: 00005565db637878 R15: 00000000000001ff
 
knet_transport changed to sctp to have a stable running corosync in our environment

I'd suggest ensuring that your hosts all have libknet1 in version 1.13-pve1 (or later) and trying to use UDP.
That version includes relevant fixes for issues which could cause crashes after some time under higher load.
Or is there another reason why you opted for SCTP?
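A quick way to verify both points on each node (a sketch, using the clustered corosync.conf):

Code:
# installed libknet1 version (should be 1.13-pve1 or later)
dpkg -s libknet1 | grep '^Version'
# currently configured knet transport; dropping the line (or setting it to udp) falls back to the UDP default
grep knet_transport /etc/pve/corosync.conf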

Anyway, do you overcommit CPU? Is there quite a bit of CPU load on the system? What's the output of:
Code:
head /proc/pressure/*

Also, are the corosync/kronosnet links on a network with high(er) load? It is not only traffic near the network's upper limit that is bad; things start to get bad earlier, as corosync is time (latency) sensitive.
We managed to run ~25 nodes fine in a cluster with knet and UDP, where each node had some CPU resources left for corosync.
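To get a better picture than plain ICMP ping, you could also measure latency and loss between the nodes at a higher packet rate, for example with omping (hostnames are placeholders):

Code:
omping -c 10000 -i 0.001 -F -q node1 node2 node3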
 
libknet1 is already at the latest version, with 1.13-pve1 installed on all nodes in the cluster. The reason I changed the transport from udp to sctp is that we had massive problems after adding the 7th-8th node to the cluster; every time, the whole cluster communication crashed. With sctp it runs stably over our L2 network. It's a larger switched network with about 25 switches spread across a city, connected directly with fiber.
There aren't any VMs running on that cluster, or even installed; it's a completely new and clean installation of Proxmox.

Inside this network the link utilization is very low; it's only used for some control and management traffic (< 1 Mbit/s). Latency between hosts is < 1 ms (tested with ICMP ping from/to each node).
 
Quite an interesting setup. Hmm, the characteristics of the network seem OK.
Although I have to admit that we never tested this on such a network topology, with >20 switches.

Are there any log messages from corosync/knet in the journal, or just the kernel stuck-task errors?
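For example, the corosync and pmxcfs (pve-cluster) logs from around the time of the hang would be interesting - something along these lines, adjusting the time range as needed:

Code:
journalctl -u corosync -u pve-cluster --since "1 hour ago"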
 
I also tried a full re-installation of all nodes; since this cluster is not in production, we can do that any time. After installing, I updated every node to the latest 6.0-11 version, and the problem is still present on a fresh installation. No adjustments to corosync.conf have been made; we're using the out-of-the-box config. This time it happened after adding the 10th node to the cluster. The whole cluster hangs and every node gets the pvesr kernel stuck-task errors. The journal and/or syslog shows us:

Code:
Nov  6 14:57:20 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 14:57:33 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 14:57:46 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 14:57:59 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 14:58:00 pop005-sniffer systemd[1]: Starting Proxmox VE replication runner...
Nov  6 14:58:14 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 14:58:25 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 14:58:42 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 14:58:55 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 14:59:11 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 14:59:26 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 14:59:41 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 14:59:54 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 15:00:09 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 15:00:15 pop005-sniffer kernel: [  725.943886] INFO: task pvesr:2331 blocked for more than 120 seconds.
Nov  6 15:00:15 pop005-sniffer kernel: [  725.943926]       Tainted: P           O      5.0.21-4-pve #1
Nov  6 15:00:15 pop005-sniffer kernel: [  725.943938] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 15:00:15 pop005-sniffer kernel: [  725.943963] pvesr           D    0  2331      1 0x00000000
Nov  6 15:00:15 pop005-sniffer kernel: [  725.943975] Call Trace:
Nov  6 15:00:15 pop005-sniffer kernel: [  725.943987]  __schedule+0x2d4/0x870
Nov  6 15:00:15 pop005-sniffer kernel: [  725.943997]  ? path_parentat.isra.44+0x3f/0x80
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944007]  schedule+0x2c/0x70
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944015]  rwsem_down_write_failed+0x160/0x340
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944026]  call_rwsem_down_write_failed+0x17/0x30
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944037]  down_write+0x2d/0x40
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944044]  filename_create+0x8e/0x180
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944054]  do_mkdirat+0x59/0x110
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944062]  __x64_sys_mkdir+0x1b/0x20
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944071]  do_syscall_64+0x5a/0x110
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944080]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944092] RIP: 0033:0x7f5586bfc0d7
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944103] Code: Bad RIP value.
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944111] RSP: 002b:00007ffdee7869f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944126] RAX: ffffffffffffffda RBX: 00005586731d9260 RCX: 00007f5586bfc0d7
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944140] RDX: 000055867242c1f4 RSI: 00000000000001ff RDI: 0000558676c61590
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944154] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000f
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944168] R10: 0000000000000000 R11: 0000000000000246 R12: 00005586749182c8
Nov  6 15:00:15 pop005-sniffer kernel: [  725.944182] R13: 0000558676c61590 R14: 0000558676c0fba0 R15: 00000000000001ff
Nov  6 15:00:22 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 316 ms
Nov  6 15:00:37 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 15:00:50 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 15:01:05 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 15:01:16 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 15:01:31 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 15:01:42 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 15:01:57 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
Nov  6 15:02:08 pop005-sniffer corosync[1738]:   [TOTEM ] Token has not been received in 315 ms
 
