VM freezes during snapshot/backup on NFS (Proxmox 9, NetApp AFF, no errors in logs)

joserosa

New Member
May 19, 2025
I have a Proxmox VE 9.1 production environment with 2 nodes and 1 qdevice (for quorum) using shared NFS storage.

During snapshot or backup operations, VMs (especially Windows) become completely unresponsive:

  • VM freezes completely
  • Loses network connectivity (no ping)
  • Console is frozen
  • VM recovers only after the snapshot/backup finishes

The freeze can last 3–5 minutes, which makes the affected VMs (e.g. SQL servers) unusable for that time.

My environment:

  • Proxmox VE: 9.1.5
  • Kernel: 6.17.4-2-pve
  • Disk format: qcow2
  • Storage: NFS 4.1
  • Backend: NetApp AFF C190
  • Network: 10Gb SFP+ (Cisco Nexus)
  • MTU: 9000 end-to-end
  • Dedicated VLAN for NFS traffic

The NFS storage is currently configured as follows, based on NetApp's recommendations plus a few additional parameters:

https://docs.netapp.com/us-en/netap...ox-ontap-nfs.html#storage-administrator-tasks

Bash:
nfs: TEST_DS_PROXMOX
        export /TEST_DS_PROXMOX
        path /mnt/pve/TEST_DS_PROXMOX
        server x.x.x.x
        content images
        options vers=4.1,nconnect=4,timeo=600,retrans=2,_netdev,x-systemd.automount
        prune-backups keep-all=1

NFS traffic also runs on a dedicated VLAN used only for storage communication.
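
It may be worth confirming that the options in the storage definition above are the ones the kernel actually negotiated for the mount. The following is only a minimal sketch, assuming a Linux host where `/proc/mounts` is readable; the function name `nfs_mount_options` is hypothetical and not part of any Proxmox tooling.

```python
def nfs_mount_options(mounts_text):
    """Return {mountpoint: {option: value_or_True}} for NFS entries.

    mounts_text is expected to use the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not an NFS client mount.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = {}
        for item in fields[3].split(","):
            key, sep, value = item.partition("=")
            # Flags such as "rw" have no value; store True for them.
            opts[key] = value if sep else True
        result[fields[1]] = opts
    return result


# Usage (on the Proxmox node):
#   with open("/proc/mounts") as fh:
#       for mountpoint, opts in nfs_mount_options(fh.read()).items():
#           print(mountpoint, opts.get("vers"), opts.get("nconnect"))
```

If the reported `vers` or `nconnect` differ from what is set in the storage definition, the mount was probably established before the options were changed and would need a remount to pick them up.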

Observed behavior
  • Snapshot starts → VM freezes immediately
  • No response to ping or console
  • No errors in:
    • journalctl
    • dmesg
  • Task finishes successfully
The VM appears to be completely stalled during the operation (no CPU activity, no I/O progress, no network response).
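
To put a number on the stall described above, one option is to ping the VM once per second while the snapshot runs and report the longest run of lost replies. This is a rough sketch, assuming Linux `iputils` ping on the machine running it; `ping_once`, `monitor` and `longest_outage` are hypothetical helper names, and the host address is a placeholder.

```python
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to host is answered."""
    proc = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode == 0


def monitor(host, duration_s, interval_s=1.0):
    """Probe host for duration_s seconds; return a list of (timestamp, ok)."""
    samples = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append((time.monotonic(), ping_once(host)))
        time.sleep(interval_s)
    return samples


def longest_outage(samples):
    """Longest span in seconds from a first failed probe to the next success
    (or to the last failed probe if the host never recovers)."""
    longest = 0.0
    start = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            longest = max(longest, ts - start)
        else:
            if start is not None:
                longest = max(longest, ts - start)
            start = None
    return longest


# Usage: start this just before triggering the snapshot, e.g.
#   print(longest_outage(monitor("192.0.2.10", duration_s=600)))
```

Running the same measurement against VMs on different storage, or with different disk formats, would give comparable numbers instead of an impression.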

What I tested
  • Network verified (no drops, no saturation)
  • NFS works correctly outside snapshot operations
  • Issue is consistently reproducible
  • Happens mainly on Windows VMs
Additional context

I have been researching similar issues and found multiple discussions and reports related to:
  • VM freezes during snapshot on NFS
  • NFS performance degradation under load
  • possible kernel regressions (6.14 / 6.17) affecting NFS behavior
From what I understand, this could be related to synchronous I/O (fsync) behavior during snapshot operations over NFS, but I am not sure if this is expected or indicates a problem.
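
One way to test the fsync hypothesis from the host side is to time small synchronous writes on the NFS mount, once while idle and once while a snapshot is running. The sketch below is only an approximation: it measures plain `fsync` latency from the Proxmox node and does not reproduce QEMU's own I/O path. The function name and the probe file name are made up for illustration.

```python
import os
import time


def fsync_latencies(directory, count=20, block_size=4096):
    """Write block_size bytes and fsync, count times.

    Returns the per-call fsync durations in seconds. The probe file is
    created inside `directory` and removed afterwards.
    """
    path = os.path.join(directory, ".fsync_probe.tmp")
    payload = b"\0" * block_size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the data out to the NFS server
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


# Usage (path taken from the storage definition in this post):
#   lat = fsync_latencies("/mnt/pve/TEST_DS_PROXMOX")
#   print(f"max {max(lat) * 1000:.1f} ms, avg {sum(lat) / len(lat) * 1000:.1f} ms")
```

If the maximum latency jumps from milliseconds to seconds only while a snapshot is in progress, that would support the I/O stall theory; if it stays flat, the freeze is more likely happening elsewhere.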

Some of the references I reviewed:

However, I have not found a clear root cause or confirmed solution.

The behavior I observe (VM completely unresponsive during snapshot without any errors in logs) seems more like an I/O stall during synchronous write/flush operations rather than a failure.

Questions
  • Is this expected behavior when using NFS + qcow2 + snapshots?
  • Are there recommended configurations that allow reliable snapshots without VM freeze?
  • Is NFS suitable for this type of workload in production?
  • What storage architecture is typically used in 24/7 production environments where snapshots are mandatory?
Requirements (important)

  • VM snapshots are a mandatory requirement in this environment.
Due to operational and application constraints, it is not possible to:

  • avoid snapshots
  • use stop-mode backups

The expected behavior is that snapshot operations should not cause prolonged VM unresponsiveness, especially in production workloads.

Additional information
pveversion -v

Code:
proxmox-ve: 9.1.0 (running kernel: 6.17.4-2-pve)
pve-manager: 9.1.5 (running version: 9.1.5/80cf92a64bef6889)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17.4-2-pve-signed: 6.17.4-2
proxmox-kernel-6.17: 6.17.4-2
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
ceph-fuse: 19.2.3-pve2
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.4.1-1+pve1
ifupdown2: 3.3.0-1+pmx11
intel-microcode: 3.20251111.1~deb13u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.2
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.5
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.0.7
libpve-cluster-perl: 9.0.7
libpve-common-perl: 9.1.7
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.5
libpve-rs-perl: 0.11.4
libpve-storage-perl: 9.1.0
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-4
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.1.2-1
proxmox-backup-file-restore: 4.1.2-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.5
pve-cluster: 9.0.7
pve-container: 6.1.0
pve-docs: 9.1.2
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.17-2
pve-ha-manager: 5.1.0
pve-i18n: 3.6.6
pve-qemu-kvm: 10.1.2-5
pve-xtermjs: 5.5.0-3
qemu-server: 9.1.4
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1
 
Did you test with kernel 6.14 to rule out a possible issue with kernel 6.17? There have been reports of performance issues in kernel 6.17 [3] involving TCP stack behavior with MTU 9000.
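
Since MTU 9000 comes up here, a quick check that jumbo frames really pass unfragmented between the node and the NFS server can be sketched as follows. It assumes IPv4 and Linux `iputils` ping (`-M do` forbids fragmentation); the helper names are hypothetical and the host address is a placeholder.

```python
import subprocess


def icmp_payload_for_mtu(mtu, ip_header=20, icmp_header=8):
    """Largest ICMP payload that fits in one IPv4 packet of size mtu."""
    return mtu - ip_header - icmp_header


def jumbo_ping_command(host, mtu=9000, count=3):
    """Build a ping command that fails if the path cannot carry mtu bytes."""
    size = icmp_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-c", str(count), "-s", str(size), host]


def jumbo_path_ok(host, mtu=9000):
    """Return True if full-size, non-fragmentable pings are answered."""
    proc = subprocess.run(
        jumbo_ping_command(host, mtu),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode == 0


# Usage: run on each Proxmox node against the NFS server's storage address,
#   print(jumbo_path_ok("192.0.2.20"))
```

A failure here would point at an MTU mismatch somewhere on the path rather than at the kernel, so it helps separate the two possibilities before booting an older kernel.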