rcu_sched self-detected stall on cpu during the backup

kbechler

Hello,

We've got a 6-node cluster with a bunch of VMs and CTs on it.
All nodes are updated to PVE 8.2.2 (pve-manager/8.2.2/9355359cd7afbae4, kernel 6.5.13-5-pve) and work (almost) smoothly.

Unfortunately, there are some weird issues with "rcu_sched self-detected stall on cpu" during backups.
A backup (mode=snapshot, compression=none) of a single VM (CPU type=kvm64, VirtIO SCSI controller) makes this VM unavailable for a couple of seconds/minutes.
I see no errors on the Proxmox node itself, but there is quite a lot of information in the VM's dmesg (attached file).

Also, I was SSHed into the VM and received some messages right after the VM was "unfrozen":

Message from syslogd@fast at Jun 18 20:31:32 ...
kernel:[ 1292.104018] watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [sshd:580]
Message from syslogd@fast at Jun 18 20:31:37 ...
kernel:[ 1297.151211] watchdog: BUG: soft lockup - CPU#3 stuck for 72s! [kworker/3:2:138]
Message from syslogd@fast at Jun 18 20:31:37 ...
kernel:[ 1297.151432] Uhhuh. NMI received for unknown reason 20 on CPU 2.
Message from syslogd@fast at Jun 18 20:31:37 ...
kernel:[ 1297.151432] Do you have a strange power saving mode enabled?
Message from syslogd@fast at Jun 18 20:31:37 ...
kernel:[ 1297.151433] Dazed and confused, but trying to continue

What can I do to get rid of this issue?

Thanks,
Konrad
 

Attachments

  • px1.txt (13.7 KB)
PVE 8.2.2 (pve-manager/8.2.2/9355359cd7afbae4, kernel 6.5.13-5-pve)
I see you kept it back at kernel 6.5. Reason?

Don't know what HW/CPU you've got, but there were quite a few reports of behavior similar to yours with a similar kernel. See here.
 
I see you kept it back at kernel 6.5. Reason?

Don't know what HW/CPU you've got, but there were quite a few reports of behavior similar to yours with a similar kernel. See here.
It's because we use NFS for keeping backups and hit a memory-leak bug in the 6.8 kernel: https://bugzilla.kernel.org/show_bug.cgi?id=218671
So 6.5.13-5-pve is the newest one that works without this issue.
The thread you mentioned says the rcu issue was fixed around 6.5.11-4-pve or 6.5.11-5-pve.
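
For reference, this is roughly how one can keep a node on a specific kernel (just a sketch of our approach; the version string is the one from our setup):

Code:
# pin the known-good kernel so the node keeps booting it after updates
proxmox-boot-tool kernel pin 6.5.13-5-pve
# verify which kernels are installed and which one is pinned
proxmox-boot-tool kernel list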
 
How long have you been suffering the current issue (soft lockups)? Looking at the bug report you linked, I'm thinking there may actually be a link between the two problems. The bug report gives no evidence that kernel 6.5 doesn't have the "bug". I realize your machine isn't rebooting itself in an OOM situation, but possibly your kernel is soft-locking because of it. We must take into consideration that NFS is the common denominator here. Try backing up to a different location (non-NFS) & see if this mitigates the issue?

Another way to test would be to go back to an earlier kernel version, before 6.5. It's a shame that, although a patch exists for this bug, it won't be incorporated before at least kernel 6.9.
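
Untested sketch of what I mean (assuming an older kernel series, e.g. 6.2, is still available in the PVE 8 repos - check first):

Code:
# see which kernel packages the repos still offer
apt list 'proxmox-kernel-*' 2>/dev/null | sort -V
# then install one of the listed 6.2-series packages and pin it, e.g.:
#   apt install proxmox-kernel-6.2
#   proxmox-boot-tool kernel pin <exact 6.2 ABI shown by 'proxmox-boot-tool kernel list'>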
 
How long have you been suffering the current issue (soft lockups)? Looking at the bug report you linked, I'm thinking there may actually be a link between the two problems. The bug report gives no evidence that kernel 6.5 doesn't have the "bug". I realize your machine isn't rebooting itself in an OOM situation, but possibly your kernel is soft-locking because of it. We must take into consideration that NFS is the common denominator here. Try backing up to a different location (non-NFS) & see if this mitigates the issue?

Another way to test would be to go back to an earlier kernel version, before 6.5. It's a shame that, although a patch exists for this bug, it won't be incorporated before at least kernel 6.9.
The whole environment was migrated to PVE 8.2.2 a couple of days ago. I've been fighting with the NFS bug since then and, honestly, didn't notice the rcu one. So it's hard to tell how long it has been an issue for us.

I see no problems on the Proxmox host itself (dmesg is clear). Other VMs/CTs also work fine (I tried to test that), even when the "problematic" VM is stuck. So I'm observing problems only inside the VM that has a backup running. It freezes for a couple of seconds/minutes, throws some error messages to dmesg and works fine after that. No reboots or anything like that. The VM is running Debian with the 5.10.0-30-amd64 kernel, if it matters.

We can give it a try, but I'm not sure which kernel prior to 6.5 would be compatible with PVE 8.2.2?

Also, I will of course try to make a backup to a local disk (instead of NFS).
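
For the local-disk test I'll probably just run a one-off job from the CLI, something like this (same options as the scheduled job, only the storage changed):

Code:
vzdump 4031 --mode snapshot --compress 0 --storage local --notes-template '{{guestname}}'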
 
I see no problems on Proxmox host itself
To clarify, where are you getting the "rcu_sched self-detected stall on cpu" messages? On the PVE host or in the VM?
Who is doing the backups that cause them? PVE backups (of the VM) or backups run inside the VM?
 
To clarify, where are you getting the "rcu_sched self-detected stall on cpu" messages? On the PVE host or in the VM?
Who is doing the backups that cause them? PVE backups (of the VM) or backups run inside the VM?
The messages are from the VM, not the PVE host. The PVE host seems to be working fine.
Backups are invoked by a backup job on the PVE cluster:

Code:
INFO: starting new backup job: vzdump 4031 --notification-mode auto --node pk05 --remove 0 --compress 0 --storage backup --mode snapshot --notes-template '{{guestname}}'
INFO: Starting Backup of VM 4031 (qemu)
INFO: Backup started at 2024-06-18 20:14:28
INFO: status = running
INFO: VM Name: fast
INFO: include disk 'scsi0' 'data:vm-4031-disk-0' 32G
INFO: include disk 'scsi1' 'data:vm-4031-disk-1' 100G
INFO: include disk 'scsi2' 'data:vm-4031-disk-2' 15G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/backup/dump/vzdump-qemu-4031-2024_06_18-20_14_28.vma'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'cfb45035-bf2a-4ba5-818b-b8e7eb2fabfd'
INFO: resuming VM again
INFO:   7% (11.7 GiB of 147.0 GiB) in 3s, read: 3.9 GiB/s, write: 3.8 GiB/s

and so on...


What may be important: the "freeze" occurs _after_ the VM was resumed.
 
Hello again,

I've run some tests, and now I'm even more confused.

First backup (compress=0), remote NFS:
backup log
----------
Code:
INFO: starting new backup job: vzdump 4031 --mode snapshot --notes-template '{{guestname}}' --remove 0 --storage backup --compress 0 --notification-mode auto --node pk05
INFO: Starting Backup of VM 4031 (qemu)
INFO: Backup started at 2024-06-19 20:32:48
INFO: status = running
INFO: VM Name: fast
INFO: include disk 'scsi0' 'data:vm-4031-disk-0' 32G
INFO: include disk 'scsi1' 'data:vm-4031-disk-1' 100G
INFO: include disk 'scsi2' 'data:vm-4031-disk-2' 15G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/backup/dump/vzdump-qemu-4031-2024_06_19-20_32_48.vma'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'acc97345-0292-4129-95a5-ac79b3b58b7a'
INFO: resuming VM again
INFO:   7% (11.1 GiB of 147.0 GiB) in 3s, read: 3.7 GiB/s, write: 3.6 GiB/s
INFO:  10% (15.6 GiB of 147.0 GiB) in 7s, read: 1.1 GiB/s, write: 1.1 GiB/s
[...]
INFO:  94% (138.4 GiB of 147.0 GiB) in 16m 34s, read: 284.8 MiB/s, write: 108.5 MiB/s
INFO:  95% (139.7 GiB of 147.0 GiB) in 16m 39s, read: 264.5 MiB/s, write: 113.0 MiB/s
INFO:  96% (142.0 GiB of 147.0 GiB) in 16m 59s, read: 118.4 MiB/s, write: 112.9 MiB/s
INFO:  97% (143.1 GiB of 147.0 GiB) in 17m 10s, read: 101.5 MiB/s, write: 95.3 MiB/s            <- it froze right here!!!!!
INFO: 100% (147.0 GiB of 147.0 GiB) in 19m 26s, read: 29.6 MiB/s, write: 40.6 KiB/s
INFO: backup is sparse: 20.15 GiB (13%) total zero data
INFO: transferred 147.00 GiB in 1166 seconds (129.1 MiB/s)
INFO: archive file size: 126.87GB
INFO: adding notes to backup
INFO: Finished Backup of VM 4031 (00:19:27)
INFO: Backup finished at 2024-06-19 20:52:15
INFO: Backup job finished successfully
TASK OK


dmesg (from VM)
---------------
Code:
[śro cze 19 20:50:18 2024] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[śro cze 19 20:50:18 2024] rcu:     2-...0: (0 ticks this GP) idle=19e/1/0x4000000000000000 softirq=2396314/2396314 fqs=2586
[śro cze 19 20:50:18 2024]     (detected by 0, t=5252 jiffies, g=7290713, q=8938)
[śro cze 19 20:50:18 2024] Sending NMI from CPU 0 to CPUs 2:
[śro cze 19 20:50:28 2024] rcu: rcu_sched kthread starved for 2463 jiffies! g7290713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[śro cze 19 20:50:28 2024] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[śro cze 19 20:50:28 2024] rcu: RCU grace-period kthread stack dump:
[śro cze 19 20:50:28 2024] task:rcu_sched       state:I stack:    0 pid:   12 ppid:     2 flags:0x00004000
[śro cze 19 20:50:28 2024] Call Trace:
[śro cze 19 20:50:28 2024]  __schedule+0x282/0x870
[śro cze 19 20:50:28 2024]  schedule+0x46/0xb0
[śro cze 19 20:50:28 2024]  schedule_timeout+0x8b/0x150
[śro cze 19 20:50:28 2024]  ? __next_timer_interrupt+0x110/0x110
[śro cze 19 20:50:28 2024]  rcu_gp_kthread+0x57b/0xc30
[śro cze 19 20:50:28 2024]  ? rcu_cpu_kthread+0x190/0x190
[śro cze 19 20:50:28 2024]  kthread+0x11b/0x140
[śro cze 19 20:50:28 2024]  ? __kthread_bind_mask+0x60/0x60
[śro cze 19 20:50:28 2024]  ret_from_fork+0x22/0x30
[śro cze 19 20:51:22 2024] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[śro cze 19 20:51:22 2024] rcu:     2-...0: (0 ticks this GP) idle=19e/1/0x4000000000000000 softirq=2396314/2396314 fqs=2913
[śro cze 19 20:51:22 2024]     (detected by 3, t=21052 jiffies, g=7290713, q=8940)
[śro cze 19 20:51:22 2024] Sending NMI from CPU 3 to CPUs 2:
[śro cze 19 20:51:32 2024] rcu: rcu_sched kthread starved for 12667 jiffies! g7290713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[śro cze 19 20:51:32 2024] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[śro cze 19 20:51:32 2024] rcu: RCU grace-period kthread stack dump:
[śro cze 19 20:51:32 2024] task:rcu_sched       state:I stack:    0 pid:   12 ppid:     2 flags:0x00004000
[śro cze 19 20:51:32 2024] Call Trace:
[...]





Another attempt (also compress=0 and the same NFS dir): dmesg errors at exactly the same moment (at 97%), but this time there was no freeze and slightly different messages:
Code:
[śro cze 19 21:44:31 2024] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[śro cze 19 21:44:31 2024] rcu:     3-...0: (0 ticks this GP) idle=21a/1/0x4000000000000000 softirq=17872/17872 fqs=6394
[śro cze 19 21:44:31 2024]     (detected by 0, t=36765 jiffies, g=88329, q=70875)
[śro cze 19 21:44:31 2024] Sending NMI from CPU 0 to CPUs 3:
[śro cze 19 21:44:41 2024] rcu: rcu_sched kthread starved for 2485 jiffies! g88329 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[śro cze 19 21:44:41 2024] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[śro cze 19 21:44:41 2024] rcu: RCU grace-period kthread stack dump:
[śro cze 19 21:44:41 2024] task:rcu_sched       state:I stack:    0 pid:   12 ppid:     2 flags:0x00004000
[śro cze 19 21:44:41 2024] Call Trace:
[...]
[śro cze 19 21:44:42 2024] Uhhuh. NMI received for unknown reason 30 on CPU 3.
[śro cze 19 21:44:42 2024] Do you have a strange power saving mode enabled?
[śro cze 19 21:44:42 2024] Dazed and confused, but trying to continue
[śro cze 19 21:44:42 2024] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 157648399809 wd_nsec: 157648400850
[śro cze 19 21:44:42 2024] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
[śro cze 19 21:44:42 2024] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
[śro cze 19 21:44:42 2024] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 1.
[śro cze 19 21:44:42 2024] systemd[1]: Stopping Flush Journal to Persistent Storage...
[śro cze 19 21:44:42 2024] systemd[1]: systemd-journal-flush.service: Succeeded.
[śro cze 19 21:44:42 2024] systemd[1]: Stopped Flush Journal to Persistent Storage.
[śro cze 19 21:44:42 2024] systemd[1]: Stopped Journal Service.
[śro cze 19 21:44:42 2024] systemd[1]: Starting Journal Service...
[śro cze 19 21:44:42 2024] systemd-journald[669]: File /var/log/journal/5238e5674a574d93a15c08abdf606970/system.journal corrupted or uncleanly shut down, renaming and replacing.
[śro cze 19 21:44:42 2024] systemd[1]: Started Journal Service.
[śro cze 19 21:44:42 2024] systemd-journald[669]: Received client request to flush runtime journal.




Third one, compress=0, local NVMe disks. Two minutes instead of ~20 minutes, no issues:
Code:
INFO: starting new backup job: vzdump 4031 --compress 0 --notification-mode auto --node pk05 --remove 0 --storage local --mode snapshot --notes-template '{{guestname}}'
INFO: Starting Backup of VM 4031 (qemu)
INFO: Backup started at 2024-06-19 21:47:34
INFO: status = running
INFO: VM Name: fast
INFO: include disk 'scsi0' 'data:vm-4031-disk-0' 32G
INFO: include disk 'scsi1' 'data:vm-4031-disk-1' 100G
INFO: include disk 'scsi2' 'data:vm-4031-disk-2' 15G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/var/lib/vz/dump/vzdump-qemu-4031-2024_06_19-21_47_34.vma'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'b2228869-1f5a-47de-89af-c03b7591bf9b'
INFO: resuming VM again
INFO:   5% (7.9 GiB of 147.0 GiB) in 3s, read: 2.6 GiB/s, write: 2.6 GiB/s
[...]
INFO: 100% (147.0 GiB of 147.0 GiB) in 1m 24s, read: 4.4 GiB/s, write: 401.9 MiB/s
INFO: backup is sparse: 19.39 GiB (13%) total zero data
INFO: transferred 147.00 GiB in 84 seconds (1.8 GiB/s)
INFO: archive file size: 127.63GB
INFO: adding notes to backup
INFO: Finished Backup of VM 4031 (00:01:25)
INFO: Backup finished at 2024-06-19 21:48:59
INFO: Backup job finished successfully
TASK OK


Next one: compression=lzo, remote NFS server, no errors
And one more: compression=zstd, remote NFS server, no errors

We use compression=0 because the backup storage has a deduplication feature, and it works better when backups are not compressed.
But now: why does it work fine with compression enabled and freeze without compression? Is it a matter of CPU load (it should be lower without compression, so that doesn't make sense to me)? Or maybe the amount of data transferred via NFS (which also seems weird)? I'm out of ideas :-\
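
One more thing I can try, to test the "amount of data over NFS" idea: cap the backup bandwidth and see whether an uncompressed run still stalls (sketch; the limit value is just an example, vzdump takes it in KiB/s):

Code:
# retry the uncompressed backup to NFS, throttled to ~200 MiB/s
vzdump 4031 --mode snapshot --compress 0 --storage backup --bwlimit 204800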

Regards,
Konrad
 
backup storage has a deduplication feature
Just a theory: assuming the deduplication is inline (on the fly), maybe the backup storage "stutters" while examining the large incoming backup, whereas with a compressed one it first saves it and then "tries" to analyze its contents for deduplication (sometimes unsuccessfully).
 
Just a theory: assuming the deduplication is inline (on the fly), maybe the backup storage "stutters" while examining the large incoming backup, whereas with a compressed one it first saves it and then "tries" to analyze its contents for deduplication (sometimes unsuccessfully).
That's not it. In this case, all tests were made with plain ext4/RAID1 on two HDD disks as the NFS target. No deduplication is used now. Since a backup is a (more or less) linear write, it shouldn't matter whether the target uses HDD, SSD or NVMe drives. Also, I ran atop and iostat on the target machine (NFS server): plenty of free RAM, iostat shows %util below 20%, no hiccups.

In general, the backup job uses "copy before write" to keep the backup consistent. If there are a lot of writes, the backup process could potentially slow down the IO (because it has to copy blocks to the backup before new data is written). But I would expect slowed-down IO operations, not a problem with rcu_sched.
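
Next time it freezes I'll also try to watch the NFS mount from the PVE host side while the backup runs (sketch; nfsiostat should come with the nfs-common package):

Code:
# per-mount NFS latency/throughput on the PVE host, every 5 seconds
nfsiostat 5 /mnt/pve/backup
# plus overall block-device pressure on the host
iostat -x 5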
 
Hello.

Unfortunately, nothing new here.
The only workaround for us was to enable compression when using remote storage.
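
In case it helps anyone, the workaround can also be made the node-wide default instead of per-job (sketch of /etc/vzdump.conf, which vzdump reads on each run):

Code:
# /etc/vzdump.conf - defaults for all backup jobs on this node
compress: zstd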
 
