NMI Watchdog BUG: soft lockup proxmox backup

OliverFdez

New Member
Jan 6, 2025
3
0
1
Hi everyone,

I'm experiencing an issue with one of my VMs when performing a backup using Proxmox Backup Server.

The backup process starts without any problems, but after approximately 10 GB of data has been copied, the VM enters a "CPU Stuck" state. This causes the VM to become unresponsive, effectively killing the server.

1737571138749.png

It's important to note that this issue only occurs when I initiate a backup for this particular VM.

Additional context:

  • I have other nodes with multiple VMs that are successfully backed up daily using Proxmox Backup Server.
  • This issue appears to be isolated to this specific node and its VMs.
Has anyone encountered a similar issue or have any suggestions for resolving this? Any help or insights would be greatly appreciated.

Thank you!
 
Welcome to the Proxmox forum, OliverFdez!

The NMI watchdog makes sure to repeatedly test each CPU core if it can correctly respond to timer interrupts. If it occurs it means that there's something occupying a CPU core for too much time and can either indicate a software or hardware bug.

Is there more information in the journal or syslog (e.g. a kernel stacktrace and surrounding errors)?

On another note, since sendmail seems like a random process which doesn't do that much, I suspect there might be something wrong with the hardware setup. If a reboot doesn't solve the problem and this happens periodically, could you check whether there are any loose or defective cables, PCI cards sit tightly and the CPU (and other hardware) is cooled properly? Do also check for any BIOS updates and CPU microcode updates as they sometimes contain fixes for such issues.

Hope this helps in getting your problem fixed!
 
  • Like
Reactions: OliverFdez
Welcome to the Proxmox forum, OliverFdez!

The NMI watchdog makes sure to repeatedly test each CPU core if it can correctly respond to timer interrupts. If it occurs it means that there's something occupying a CPU core for too much time and can either indicate a software or hardware bug.

Is there more information in the journal or syslog (e.g. a kernel stacktrace and surrounding errors)?

On another note, since sendmail seems like a random process which doesn't do that much, I suspect there might be something wrong with the hardware setup. If a reboot doesn't solve the problem and this happens periodically, could you check whether there are any loose or defective cables, PCI cards sit tightly and the CPU (and other hardware) is cooled properly? Do also check for any BIOS updates and CPU microcode updates as they sometimes contain fixes for such issues.

Hope this helps in getting your problem fixed!
Hi,

Thank you very much for the prompt reply.

This is a log fragment that appears in the proxmox ve syslog when the backup task starts.

Code:
Jan 22 11:19:19 RETHYPERSUCU315 pvedaemon[202881]: <root@pam> starting task UPID:RETHYPERSUCU315:00033159:0076256F:6790FE67:vncproxy:315:root@pam:
Jan 22 11:20:23 RETHYPERSUCU315 pvedaemon[202881]: VM 315 qmp command failed - VM 315 qmp command 'query-proxmox-support' failed - unable to connect to VM 315 qmp socket - timeout after 51 retries
Jan 22 11:20:30 RETHYPERSUCU315 pvestatd[1109]: VM 315 qmp command failed - VM 315 qmp command 'query-proxmox-support' failed - unable to connect to VM 315 qmp socket - timeout after 51 retries
Jan 22 11:20:31 RETHYPERSUCU315 pvestatd[1109]: status update time (8.471 seconds)
Jan 22 11:20:41 RETHYPERSUCU315 pvestatd[1109]: VM 315 qmp command failed - VM 315 qmp command 'query-proxmox-support' failed - unable to connect to VM 315 qmp socket - timeout after 51 retries
Jan 22 11:20:42 RETHYPERSUCU315 pvestatd[1109]: status update time (10.227 seconds)
Jan 22 11:20:48 RETHYPERSUCU315 pvedaemon[202881]: VM 315 qmp command failed - VM 315 qmp command 'query-proxmox-support' failed - unable to connect to VM 315 qmp socket - timeout after 51 retries
Jan 22 11:20:50 RETHYPERSUCU315 pvestatd[1109]: status update time (7.708 seconds)
Jan 22 11:21:00 RETHYPERSUCU315 pvestatd[1109]: VM 315 qmp command failed - VM 315 qmp command 'query-proxmox-support' failed - unable to connect to VM 315 qmp socket - timeout after 51 retries
Jan 22 11:21:00 RETHYPERSUCU315 pvestatd[1109]: status update time (8.519 seconds)

I am using Dell T140 Servers
BIOS version: 2.12.2
iDRAC firmware version: 6.10.30.20
With disks in RAID1

Could my hardware be the problem?

thanks a lot for the help
 
I have the same Issue,

after pve backup over nfs targed via rsync "watchdog BUG soft lockup"

pve kernel: [2843989.432012] watchdog: BUG: soft lockup - CPU#77 stuck for 26s! [kworker/u386:2:4167535]
2025-01-25T10:33:23.565647+01:00 pve kernel: [2843989.432733] Modules linked in: dm_snapshot tcp_diag inet_diag nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs veth ebt
able_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables 8021q garp mrp softdog bonding tls sunrpc nfnetlink_log nfnetlink binfmt_misc intel_r
apl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac skx_edac_common nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dell_pc ipmi_ssif plat
form_profile kvm crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd dax_hmem cxl_acpi rapl acpi_power_meter isst_if_m
box_pci isst_if_mmio cxl_core dell_wmi cmdlinepart dell_smbios video ipmi_si intel_th_gth spi_nor mei_me acpi_ipmi sparse_keymap ipmi_devintf intel_th_pci dcdbas mgag200 intel_cstate dell_wm
i_descriptor pcspkr einj mei wmi_bmof mtd intel_pch_thermal isst_if_common intel_th intel_vsec ipmi_msghandler mac_hid zfs(PO) spl(O)
2025-01-25T10:33:23.565668+01:00 pve kernel: [2843989.432786] vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq dm_thin_pool
dm_persistent_data dm_bio_prison dm_bufio libcrc32c xhci_pci igb xhci_pci_renesas i2c_i801 i2c_algo_bit crc32_pclmul ahci megaraid_sas dca spi_intel_pci i2c_mux xhci_hcd tg3 bnxt_en libahci
spi_intel i2c_smbus wmi

with pve 6.11.0-2-pve on
DELL PowerEdge R650
BIOS-Version 1.13.2

rsync PVE 3.2.7-1 and nfs targed the same Version.
 
Hi everyone,

After further investigation, I found that this issue only occurs on servers that are backed up over VPN.

Key Observations:​

  1. The error appears when trying to back up VMs over VPN: VM 315 qmp command failed - VM 315 qmp command 'query-proxmox-support' failed - unable to connect to VM 315 qmp socket - timeout after 51 retries
  2. Even when limiting the backup speed to 1 MiB/s or 5 MiB/s, the issue persists.
  3. In some VPN-connected servers, backups work at 1-2 MiB/s, but on this specific one, they always fail.
  4. The bandwidth at the branch offices is 10 MB/s, but traffic monitoring shows that it's nowhere near the limit when the problem occurs.
  5. Once the error appears, the VM completely drops from the network, stops responding to pings, and the connection becomes slow.
  6. The VM does not recover until the backup is canceled both on Proxmox Backup Server and Proxmox VE.
  7. Increasing the local bandwidth to 30 MiB/s allowed the backup to work fine at 15 MiB/s, but this is only a temporary solution (we cannot leave the network at 30 MiB/s permanently).
  8. Servers connected via fiber optic can perform backups without any issues, even at unlimited speeds.

Possible Causes and Questions:​

  • The VM being backed up was migrated from an ESXi environment and is using the VMware vmxnet3 network adapter.
    → Should I switch to VirtIO for better stability in Proxmox?
  • The switch at the branch offices is configured with an MTU of 1400, while Proxmox defaults to 1500 MTU.
    → Could the MTU mismatch be causing packet fragmentation and network instability?


    Any suggestions or additional troubleshooting steps would be greatly appreciated.
    Thanks again for your help!