VM freezes during snapshot/backup on NFS (Proxmox 9, no errors in logs)

joserosa

New Member
May 19, 2025
I have a Proxmox VE 9.1 production environment with 2 nodes and 1 qdevice (for quorum) using shared NFS storage.

During snapshot or backup operations, VMs (especially Windows) become completely unresponsive:

  • VM freezes completely
  • Loses network connectivity (no ping)
  • Console is frozen
  • VM recovers only after the snapshot/backup finishes

This can take up to 3–5 minutes, and makes it impossible to work with the VMs (e.g. SQL servers).

My environment:

  • Proxmox VE: 9.1.5
  • Kernel: 6.17.4-2-pve
  • Disk format: qcow2
  • Storage: NFS 4.1
  • Backend: NetApp AFF C190
  • Network: 10Gb SFP+ (Cisco Nexus)
  • MTU: 9000 end-to-end
  • Dedicated VLAN for NFS traffic

I currently have the NFS storage configured as follows, based on NetApp's recommendations plus a few additional parameters:

https://docs.netapp.com/us-en/netap...ox-ontap-nfs.html#storage-administrator-tasks

Bash:
nfs: TEST_DS_PROXMOX
        export /TEST_DS_PROXMOX
        path /mnt/pve/TEST_DS_PROXMOX
        server x.x.x.x
        content images
        options vers=4.1,nconnect=4,timeo=600,retrans=2,_netdev,x-systemd.automount
        prune-backups keep-all=1

NFS traffic also runs on a dedicated VLAN used only for storage communication.
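
To confirm which of these options the kernel actually applied to the mount, the live option string can be inspected on the node. The helper below is only a sketch: `mount_opt` is a hypothetical name, and it does nothing more than pull one key out of a comma-separated option string such as the one `findmnt` prints.

```shell
# Print the value of one mount option from a comma-separated
# option string read on stdin (e.g. "rw,vers=4.1,nconnect=4").
mount_opt() {
    tr ',' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}
```

For example, `findmnt -n -t nfs4 -o OPTIONS /mnt/pve/TEST_DS_PROXMOX | mount_opt nconnect` should print `4` if the setting took effect. `nfsstat -m` shows the same option string for each NFS mount.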

Observed behavior
  • Snapshot starts → VM freezes immediately
  • No response to ping or console
  • No errors in:
    • journalctl
    • dmesg
  • Task finishes successfully
The VM appears to be completely stalled during the operation (no CPU activity, no I/O progress, no network response).
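
To put a number on the stall, one option is to record timestamped pings against the VM while the snapshot runs and then look for the longest gap between replies. This is a rough sketch, not an exact measurement: `longest_gap` is a hypothetical helper, and it assumes the log was produced with `ping -D`, which prefixes each reply with a `[seconds.microseconds]` timestamp.

```shell
# Read `ping -D` output on stdin and print the longest gap,
# in seconds, between two consecutive replies.
longest_gap() {
    awk '
        /bytes from/ {
            # $1 looks like "[1700000000.123456]"; drop the brackets
            ts = substr($1, 2, length($1) - 2)
            if (seen && ts - prev > max) max = ts - prev
            prev = ts
            seen = 1
        }
        END { printf "%.1f\n", max + 0 }
    '
}
```

Usage would be something like `ping -D -i 0.2 <vm-ip> | tee ping.log` during the snapshot, followed by `longest_gap < ping.log`. The result only covers the network side of the freeze, so it is a lower bound on how long the guest was stalled.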

What I tested
  • Network verified (no drops, no saturation)
  • NFS works correctly outside snapshot operations
  • Issue is consistently reproducible
  • Happens mainly on Windows VMs
Additional context

I have been researching similar issues and found multiple discussions and reports related to:
  • VM freezes during snapshot on NFS
  • NFS performance degradation under load
  • possible kernel regressions (6.14 / 6.17) affecting NFS behavior
From what I understand, this could be related to synchronous I/O (fsync) behavior during snapshot operations over NFS, but I am not sure if this is expected or indicates a problem.

Some of the references I reviewed:

However, I have not found a clear root cause or confirmed solution.

The behavior I observe (VM completely unresponsive during snapshot without any errors in logs) seems more like an I/O stall during synchronous write/flush operations rather than a failure.

Questions
  • Is this expected behavior when using NFS + qcow2 + snapshots?
  • Are there recommended configurations that allow reliable snapshots without VM freeze?
  • Is NFS suitable for this type of workload in production?
  • What storage architecture is typically used in 24/7 production environments where snapshots are mandatory?
Requirements (important)

  • VM snapshots are a mandatory requirement in this environment.
Due to operational and application constraints, it is not possible to:

  • avoid snapshots
  • use stop-mode backups

The expected behavior is that snapshot operations should not cause prolonged VM unresponsiveness, especially in production workloads.

Additional information
pveversion -v

Code:
proxmox-ve: 9.1.0 (running kernel: 6.17.4-2-pve)
pve-manager: 9.1.5 (running version: 9.1.5/80cf92a64bef6889)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17.4-2-pve-signed: 6.17.4-2
proxmox-kernel-6.17: 6.17.4-2
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
ceph-fuse: 19.2.3-pve2
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.4.1-1+pve1
ifupdown2: 3.3.0-1+pmx11
intel-microcode: 3.20251111.1~deb13u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.2
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.5
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.0.7
libpve-cluster-perl: 9.0.7
libpve-common-perl: 9.1.7
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.5
libpve-rs-perl: 0.11.4
libpve-storage-perl: 9.1.0
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-4
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.1.2-1
proxmox-backup-file-restore: 4.1.2-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.5
pve-cluster: 9.0.7
pve-container: 6.1.0
pve-docs: 9.1.2
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.17-2
pve-ha-manager: 5.1.0
pve-i18n: 3.6.6
pve-qemu-kvm: 10.1.2-5
pve-xtermjs: 5.5.0-3
qemu-server: 9.1.4
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1
 
I have been researching similar issues and found multiple discussions and reports related to:

  • VM freezes during snapshot on NFS
  • NFS performance degradation under load
  • possible kernel regressions (6.14 / 6.17) affecting NFS behavior
From what I understand, this could be related to synchronous I/O (fsync) behavior during snapshot operations over NFS, but I am not sure if this is expected or indicates a problem.
Did you test with kernel 6.14 to rule out a possible issue with kernel 6.17? There have been reports of performance issues in kernel 6.17 [3] involving TCP stack behavior with MTU 9000.
 
Hey @YaZoal , thank you for your response.

From what I have seen, similar issues have also been reported with kernel 6.14, especially related to NFS freezes and I/O stalls.

For example:
- https://forum.proxmox.com/threads/s...-6-14-8-2-pve-when-mounting-nfs-shares.169571
- https://forum.proxmox.com/threads/bad-nfs-performance-with-proxmox-9.174881

This makes me think the issue might not be limited to kernel 6.17 specifically, but could be related more generally to NFS behavior under synchronous I/O workloads (e.g. snapshot/fsync).

I have also seen discussions (including some involving Proxmox staff on the forum and mailing lists on the lore page), but I haven’t found a clear root cause or a confirmed solution yet.

What makes this more confusing is that the issue does not seem to be related to the backup tool itself, but specifically to the moment when the snapshot is taken.

I also considered whether this could be related to VirtIO drivers or the guest agent (VSS interaction), but I have already tested with the latest stable versions without any change in behavior.

At this point, I am trying to understand whether this is:
- expected behavior under certain storage conditions
- a limitation of NFS with synchronous I/O
- or a kernel/storage interaction issue

Any insights or suggestions would be greatly appreciated.


If anyone from the Proxmox team (e.g. @fiona or @Maximiliano ) has any input or guidance on this type of issue, it would be very helpful.
 
I have been running additional tests to better understand the issue.

I deployed 10 Windows Server 2025 VMs generating workload (around 30–35k IOPS total) to simulate a stressed environment during snapshot operations.

The goal was to reproduce the issue under load and compare behavior across different storage configurations.

I tested the same scenario across 4 different datastores:

- Datastore 1: NFS 4.1 + volume chain enabled
- Datastore 2: NFS 4.1 without volume chain
- Datastore 3: NFS 3 + volume chain enabled
- Datastore 4: LVM (iSCSI LUN from NetApp with multipath) + volume chain
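
When comparing datastores like these, it helps to confirm from the node which storages really have the feature switched on. The sketch below assumes the GUI checkbox is stored in `/etc/pve/storage.cfg` as a `snapshot-as-volume-chain 1` property line; that key name is my assumption and should be checked against the storage documentation. `chain_storages` is a hypothetical helper name.

```shell
# List the storages in a storage.cfg (read on stdin) that have
# snapshot-as-volume-chain enabled. The key name is an assumption.
chain_storages() {
    awk '
        # Section header, e.g. "nfs: TEST_DS_PROXMOX"
        /^[a-z]+: / { name = $2 }
        # Indented property line inside the current section
        $1 == "snapshot-as-volume-chain" && $2 == 1 { print name }
    '
}
```

Run as `chain_storages < /etc/pve/storage.cfg`; any storage not listed behaves like datastore 2.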

---

### Results

Datastore 1 (NFS 4.1 + volume chain)
With the exact same environment (same kernel, same workload), enabling volume chain completely resolves the issue.

- Snapshots are created normally under load
- VM only loses 1 ping (expected behavior)
- No freeze observed
- Snapshot creation and deletion behave smoothly

The behavior is very similar to what I would expect from VMware.

---

Datastore 2 (NFS 4.1 without volume chain)
The issue reappears:

- VM freezes during snapshot creation and deletion
- The freeze lasts for the whole operation and is noticeably long

---

Datastore 3 (NFS 3 + volume chain)
Same result as datastore 1:

- issue solved.

---
Datastore 4 (LVM + volume chain, NetApp LUN via multipath)

I performed two different tests:

Test 1:
- Storage was initially configured without volume chain
- Disks could only be moved to it in RAW format
- Snapshot functionality was not available
- Enabling volume chain afterwards did not change this behavior

Test 2:
- I moved the VM disks to another datastore
- Removed datastore 4
- Recreated it with volume chain enabled from the beginning

Result:
- VM disks could be migrated in qcow2 format
- Snapshots were possible
- The issue did not occur (no VM freeze)

---

### Conclusion / Current situation

These tests leave me in a difficult position, as I cannot clearly identify the root cause:

- Volume chain resolves the issue → but it is still in preview and not recommended for production (and there are known issues, e.g. TPM-related bugs)
- With the same kernel and environment, behavior changes completely depending on whether volume chain is enabled
- This allows me to reasonably rule out:
  - VSS / guest agent issues
  - network-related problems

---

### Question

At this point, I am unsure how to proceed.

Is there any known explanation for this behavior when using standard NFS (without volume chain)?

Is there any recommended approach to achieve reliable snapshots in production environments without relying on a preview feature?

Any insights or ideas would be greatly appreciated.
 
Hi, just joining this conversation. We are having the same issue. I'm not yet sure whether all guests are affected; I have to test further. I'm also not sure whether it was this pronounced from the start with 8.x. But I just noticed a nearly 2-minute hiccup on our Exchange server while removing a snapshot. The same happens when creating one, without RAM.
At first I thought it was related to PBS 4.1.4, because that changed recently, but it now occurs with snapshots alone.

Stumbled upon this (but this is backup):
https://forum.proxmox.com/threads/pve-backups-causing-vm-stalls.178941/#post-849688 with the hint to https://pve.proxmox.com/pve-docs/chapter-vzdump.html#_vm_backup_fleecing

We're running PVE 9.1.6:
  • Storage: NFS 4.2
  • Backend: NetApp AFF 250
  • Network: 25Gb
  • MTU: 9000 end-to-end
  • Dedicated VLAN for NFS traffic
 
Is there any known explanation for this behavior when using standard NFS (without volume chain)?
First of all, thank you very much for sharing the detailed information and the provided test results.

This is known behavior for VMs whose disks use the qcow2 format on file-level storage (without snapshot-as-volume-chain) [0].
Taking or deleting a large snapshot can take a considerable amount of time ranging from minutes to hours, especially when using NFS. While the snapshot is being created or deleted, the VM is blocked and may appear to be offline.

The timeout for snapshot operations was increased from 10 minutes to 1 hour in qemu-server version 9.1.1:
https://git.proxmox.com/?p=qemu-server.git;a=commit;h=e5f7156b1d2a0c2a1d8705924aed1c22b7fa210f
This change does not reduce the duration of the snapshot operation or prevent the VM from being blocked; it only allows the task to wait longer for completion.
Is there any recommended approach to achieve reliable snapshots in production environments without relying on a preview feature?
You might consider another storage type that supports snapshots and is not file-level based, e.g. LVM or the recommended Ceph storage. See [1] for more details.

[0] https://git.proxmox.com/?p=qemu-server.git;a=commit;h=e5f7156b1d2a0c2a1d8705924aed1c22b7fa210f
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_storage_types
 
Hello @katamadone,

Have you found a solution? Or any clues or conclusions? Anything that might point to the cause or reason?

I'm still running tests and trying to find the root cause to fix it, but I haven't had any luck yet :(
 
Hi everyone,

I wanted to post an update.

We have successfully upgraded to Proxmox VE 9.2.3, which also updated the PVE host kernel, QEMU-related packages, guest agent components, and everything involved.

At first, we thought the problem had been resolved, but it turns out that it has not. The issue still persists.

I noticed that when I migrate a VM to a datastore that does not have the “Allow Snapshots as Volume-Chain” checkbox enabled, snapshots initially seem to work correctly without freezing the VM.

This led me to mistakenly believe that everything was working fine after the update, because the process of creating and deleting the snapshot took only a few seconds and only caused the loss of 1–2 pings, without leaving the VM frozen.

However, after keeping the VM on that datastore for several days, we ran the same test again and the problem appeared again. The VM becomes inoperable / frozen for the duration of the snapshot process. The same thing happens when the snapshot is deleted.

This also happens even when the VM has no workload.
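
For anyone who wants to repeat this create/delete test in a comparable way, the two steps can be timed with a small wrapper around `qm snapshot` and `qm delsnapshot`. This is only a sketch: `snapshot_roundtrip` is a hypothetical helper, the VM ID and snapshot name are placeholders, and it takes a snapshot without RAM state.

```shell
# Create and then delete a snapshot of one VM, printing how many
# seconds each step took. Arguments: <vmid> <snapshot-name>.
snapshot_roundtrip() {
    sr_vmid=$1
    sr_name=$2
    sr_start=$(date +%s)
    qm snapshot "$sr_vmid" "$sr_name" || return 1
    sr_mid=$(date +%s)
    qm delsnapshot "$sr_vmid" "$sr_name" || return 1
    sr_end=$(date +%s)
    echo "create=$((sr_mid - sr_start))s delete=$((sr_end - sr_mid))s"
}
```

Running `snapshot_roundtrip 100 freezetest` right after moving a disk and again a few days later would show whether the duration really grows over time on the same datastore.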

Has anyone else encountered this problem? Has anyone found a solution?

How is Proxmox typically set up in home lab and production environments?

I think our problem may come from running VMs over NFS. Maybe there are fewer issues when VMs are stored on:

  • Local disks with ZFS
  • ZFS-based storage
  • Proxmox hyperconverged infrastructure / HCI with Ceph
But of course, if you have a NetApp, HPE, IBM, Hitachi, or similar storage array with NFS datastores, and “Allow Snapshots as Volume-Chain” is still in preview, what options or solutions are available?
 
We have seen similar issues on VMs that live on NFS storage (NetApp). The NFS version and mount options do not change anything; we've tried NFS 3, 4, 4.1, and 4.2.

As @YaZoal pointed out, this is a known issue and is mentioned in the docs. Hopefully snapshots as volume-chain will mature and be usable in the future, but until that happens we just have to work around the issue. We do so by not taking snapshots of large VMs while they are online; we simply shut them down before taking a snapshot. Other VMs can tolerate the hiccup, and in that case we take online snapshots of them.

Anecdotally, I've noticed that the issue seems to be "worse" on slower networks and/or slower NFS storage, but that's to be expected, I guess.

The VMs we have that live on our Ceph cluster do not show the same issue.
 
Online internal qcow2 snapshot creation/deletion (which is still the standard) can be slow, especially with heavy-duty VMs, slow NFS links, or RAM snapshots, and the VM and its disks also need to be frozen at the end. With external snapshots / snapshots as volume chains this should not be the case.