PVE8 freeze (stuck) with kernel > 6.2 during backup job

ErikPVE

Renowned Member
Sep 16, 2017
Proxmox VE and OMV stall (freeze) when a backup job runs.

Hi,

I have been running Openmediavault (OMV) for a long time as a VM guest on a Proxmox VE (PVE) host, an HP ProLiant Gen8 MicroServer (Xeon E3-1260L, 16 GB).
Attempts to run PVE 8 with kernels newer than 6.2 fail. The aim is to run kernel 6.14 (bpo) on PVE 8, so that migration to PVE 9 becomes possible.
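For reference, the 6.14 (bpo) kernel was installed via the opt-in kernel meta-package, roughly as below (the package name is quoted from memory rather than from shell history, so it may differ slightly):

apt update
apt install proxmox-kernel-6.14    # opt-in 6.14 kernel for PVE 8 (Bookworm backport); name from memory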

Proxmox VE (PVE):
  • PVE version '8.4.14'
  • Kernel '6.2.16-20-pve' with kernel options 'intel_iommu=off iommu=pt'
  • No ZFS and no PCI passthrough
  • VT-d disabled in the BIOS
  • Storage type NFS; the target host is the OMV VM guest on this same PVE host (storage definition sketched below the list).
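
For completeness, the OMV7 storage is defined as an NFS storage in /etc/pve/storage.cfg roughly like this (the server address and export path below are placeholders, not the real values):

nfs: OMV7
        export /export/backup
        path /mnt/pve/OMV7
        server <omv-ip>
        content backup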

Openmediavault (OMV) VM guest:
  • virtio0: /dev/disk/by-id/ata-...,backup=no,size=...
  • virtio1: /dev/disk/by-id/ata-...,backup=no,size=...
  • Data disks in OMV are set up as a RAID1 (mirror) pair
  • /dev/md0 consisting of /dev/vda and /dev/vdb (inspection commands sketched below the list)
  • OMV version '7.7.4-1 (Sandworm)'
  • Kernel 'Linux 6.8.12-4-pve', with or without the kernel options 'intel_iommu=off iommu=pt'
  • NFS server on OMV used as the backup storage.
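
Just to illustrate how the mirror is assembled inside the guest (the real disk IDs are omitted above), the array can be inspected from within OMV with the usual mdadm tooling:

cat /proc/mdstat           # should show md0 as an active raid1 of vda/vdb
mdadm --detail /dev/md0    # state, sync status and member devices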

Backup jobs run fine when the PVE host uses kernel version '6.2.16-20-pve' (or any earlier version).

However, with any kernel newer than '6.2.16-20-pve' on the PVE host, running a backup job (vzdump) of an LXC container to the NFS server on OMV makes the OMV guest and the PVE host unresponsive, and the other LXC and VM guests 'slowly die'.

No issues with kernel 6.2.16-20-pve.

With all of the following kernel versions, the PVE host stalls during a backup (see the kernel-pinning note after the list):
6.5.13-6-pve
6.8.12-3-pve
6.14.8-3-bpo12-pve
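
For completeness, switching between these kernels for testing is done with the standard proxmox-boot-tool pinning, e.g.:

proxmox-boot-tool kernel list                                  # show installed kernels and the current pin
proxmox-boot-tool kernel pin 6.2.16-20-pve                     # boot the known-good kernel by default
proxmox-boot-tool kernel pin 6.14.8-3-bpo12-pve --next-boot    # try a newer kernel for a single boot only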

Example of 'journalctl -f' output when the freeze occurs:

PVE

Dec 31 12:27:16 proxmox pvedaemon[58406]: INFO: starting new backup job: vzdump 101 --notification-mode auto --notes-template '{{guestname}}' --node proxmox --remove 0 --storage OMV7 --compress zstd --mode snapshot
Dec 31 12:27:16 proxmox pvedaemon[58406]: INFO: Starting Backup of VM 101 (lxc)
Dec 31 12:27:57 proxmox pvestatd[1391]: status update time (9.252 seconds)
Dec 31 12:28:42 proxmox pvestatd[1391]: status update time (24.992 seconds)
Dec 31 12:28:54 proxmox pvestatd[1391]: got timeout

No further messages; the PVE host becomes unresponsive.

OMV

Dec 31 12:28:30 omv7 kernel: clocksource: Long readout interval, skipping watchdog check: cs_nsec: 1723726897 wd_nsec: 1723726856
Dec 31 12:30:00 omv7 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Dec 31 12:30:00 omv7 kernel: rcu: 1-...0: (1 GPs behind) idle=4acc/1/0x4000000000000000 softirq=41847/41848 fqs=11986
Dec 31 12:30:00 omv7 kernel: rcu: hardirqs softirqs csw/system
Dec 31 12:30:00 omv7 kernel: rcu: number: 0 0 0
Dec 31 12:30:00 omv7 kernel: rcu: cputime: 0 0 0 ==> 29998(ms)
Dec 31 12:30:00 omv7 kernel: rcu: (detected by 0, t=60002 jiffies, g=58745, q=125 ncpus=2)
Dec 31 12:30:00 omv7 kernel: Sending NMI from CPU 0 to CPUs 1:
Dec 31 12:30:00 omv7 kernel: nmi_backtrace_stall_check: CPU 1: NMIs are not reaching exc_nmi() handler, last activity: 4297378640 jiffies ago.

No further messages; the OMV guest becomes unresponsive.

The output is not always exactly the same; sometimes PVE also reports "unable to activate storage 'OMV7' - directory '/mnt/pve/OMV7' does not exist" when the system stalls, even though OMV7 was online when the backup job started.
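
For reference, the storage state can be checked from the PVE host with something like the commands below (the IP is a placeholder), although once the freeze sets in the host usually no longer responds at all:

pvesm status                    # should list OMV7 as active
pvesm scan nfs <omv-ip>         # placeholder address; lists the exports offered by the OMV guest
ls -ld /mnt/pve/OMV7            # the mount point PVE complains about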

Any recommendations for a fix, or for how to further diagnose these freezes/stalls?

Thanks, Erik