Understanding VM backups and behavior on ZFS

phip

Hi everyone,

I've recently started facing issues with a large MySQL test database (20TB VM disk) and MySQL replication. Fuller details on the DB side of the issue can be found in another thread, but they should be mostly irrelevant to the questions here: https://dba.stackexchange.com/quest...-uninterruptible-mysql-query-from-replication

During testing, I found that the MySQL replication failure and subsequent "pseudo-freeze" of the MySQL server process correlates with backup runs of the VM disk. What I've observed:
  1. MySQL replication is normally running
  2. The Proxmox backup starts, with an external PBS as target (10G link)
  3. There is a short hiccup in the VM at the beginning of the backup due to fsfreeze/thaw, but otherwise it runs normally
  4. The dirty bitmap finds some 500-1000GB of changed data (out of 20TB total disk size and >10TB used space)
  5. The backup data is written at wildly varying speeds, from 30 to 500MBytes/s
  6. At some time during the backup, the MySQL replication will suddenly stall; not with an error, it just waits while having one CPU core loaded at 100%
  7. At some (other) time during the backup, the DB will fail to handle application requests
  8. Before, or at about the time when, the backup finishes, the DB becomes responsive to select/insert/update/delete queries again, making the application resume work; data can be read from and written to the DB just fine, and files can also be written and read normally on both the VM's root and the DB data disk
  9. BUT: Even then, the MySQL replication stays stalled
  10. More severely, the MySQL replication can't be stopped anymore; a "stop replica;" query just hangs indefinitely
  11. Due to the unstoppable replication, the MySQL server can't be shut down properly; a SIGTERM will just lead to a SHUTDOWN message in the MySQL log, but the process stays alive, causing CPU load (but no or very low disk I/O, as far as I've seen)
  12. The only way to get out of this situation is to forcefully "kill -KILL" the MySQL server process, after which it will restart, do some cleanups and work again normally, until the next backup run
To investigate further, I manually started a backup, waited for the issue to show up and then tried to see if I could get the DB up and running before the backup finished. The result? I couldn't. "kill -KILL" leads to the process showing up in ps as "defunct", but it doesn't go away. So I then stopped the backup process and sure enough, a few seconds later the process vanished and a new DB server instance was started and running. While this could be a coincidence, I doubt it.
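
For reference, these are the standard MySQL 8.x statements I used inside the guest to watch the replication state while a backup was running; nothing here is PVE-specific, and the exact field names may differ between MySQL versions:

Code:
# Inside the VM, while a backup is running (MySQL 8.x syntax)

# Replication worker status - stalls without reporting an error
mysql -e "SHOW REPLICA STATUS\G" | grep -E "Running|State|Error"

# Stuck client threads, e.g. the hanging "STOP REPLICA"
mysql -e "SHOW FULL PROCESSLIST;"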

Before facing these issues, I was under the impression that PVE backups make use of ZFS snapshots (like live migrations do), but according to Backup modes, and especially the linked description Efficient VM backup for qemu, I see that this is not the case. What caught my attention in the guide is this statement about fleecing:

"With backup fleecing, such old data is cached in a fleecing image rather than sent directly to the backup target. This can help guest IO performance and even prevent hangs in certain scenarios, at the cost of requiring more storage space."

I didn't have fleecing enabled so far, but have enabled it now for a test. The backup is still running, but so far the VM doesn't show any signs of a problem. The fleecing image size seems to cycle around 2-3GB, so if this really is a solution, I suppose I can work with that even with the fairly small "local-zfs" storage (only a few tens of GB).
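
In case it helps others: I enabled fleecing in the backup job's Advanced options in the GUI; for a one-off test run from the CLI, the equivalent vzdump option should look roughly like this (syntax as I read it from the vzdump man page on PVE 8.2+; the VM ID and storage names are from my setup):

Code:
# Manual backup of VM 302 with fleecing on the local-zfs storage
vzdump 302 --storage backup-1 --mode snapshot --fleecing enabled=1,storage=local-zfs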

That brings me to my main questions to hopefully better understand what's going on:
  • How does the backup process interact with the VM in a way that could keep a killed process from being cleaned up until the end of the backup?
  • I see that the interception of VM writes by the backup layer causes delays inside the VM, but has this been found in the past to cause trouble beyond a mere slowdown of guest processes (especially DBs)?
  • Does backup fleecing change the behavior of these VM interactions, other than providing a quicker response to writes?
  • Since the docs mention that fleecing can "even prevent hangs in certain scenarios", is there an explanation, or at least a theory, about what could cause these hangs in the first place without fleecing? Or does "hang" here just refer to the VM lagging behind during the backup due to slow write performance?
  • Are there plans to leverage ZFS snapshots for backups, as they are already used for VM live migration? Or has there been a change in this direction already since PVE 8.3?
Thank you and best regards,
Philipp
 
  • I see that the interception of VM writes by the backup layer causes delays inside the VM, but has this been found in the past to cause trouble beyond a mere slowdown of guest processes (especially DBs)?
  • Does backup fleecing change the behavior of these VM interactions, other than providing a quicker response to writes?
The main issue here arises if the backup target (storage, network, ...) is not fast enough to ACK a write: the backup halts the VM's write operation until it has received the ACK that the affected part of the VM disk has been written to the backup. From the guest's point of view, that can show up as slow IO.
With a fleecing image enabled, those writes can be handled and ACKed a lot quicker, as they happen on local storage before the data is sent to the backup target.

I hope that explains this part.

Are there plans to leverage ZFS snapshots for backups, as they are already used for VM live migration? Or has there been a change in this direction already since PVE 8.3?
No, that never was the case for VMs and we don't have any plans, as the current approach is completely agnostic to the underlying storage technology, which simplifies the implementation by a lot.
 
Thanks for the quick response, Aaron!

The main issue here arises if the backup target (storage, network, ...) is not fast enough to ACK a write: the backup halts the VM's write operation until it has received the ACK that the affected part of the VM disk has been written to the backup. From the guest's point of view, that can show up as slow IO.
With a fleecing image enabled, those writes can be handled and ACKed a lot quicker, as they happen on local storage before the data is sent to the backup target.
Absolutely clear, I see why it would become slow. But I don't see how this could lead to a permanently broken process inside the VM. I would fully understand if the DB replication just slowed down and lagged behind during the backup, but not what could completely freeze it and lead to an unkillable process. While I could blame a MySQL misconfiguration or bug for the frozen replication, I feel that something must go extremely wrong for a process to become unkillable via "kill -KILL". Plus, why does it just magically disappear as soon as the backup is complete? It feels a bit as if at least one write were blocked until the very end of the backup process, instead of the affected block(s) being written to the backup server as fast as possible and the operation then acknowledged. Could that be the case?

No, that never was the case for VMs and we don't have any plans, as the current approach is completely agnostic to the underlying storage technology, which simplifies the implementation by a lot.
Fair enough, even though I'd love to see that implemented some time in the future. While it would avoid such a situation as mine in the first place (because there would be no need to tamper with live VM writes at all), it could also gain some performance. When I check our daily ZFS snapshots of the DB disk, I see that they take up some 20-40GB per day. I've now started the backup for the third time within a single day (btw with no more DB failures since enabling fleecing!) for testing and each time it detected some 600-800GB of changed data that subsequently needed to be transferred to the backup server. Sure, the ZFS figures refer to compressed data, while the dirty-bitmap checks the uncompressed VM disk, but it still feels like a very large volume for just a bunch of appended data in a DB. What's the granularity of the dirty-bitmap?
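
(For reference, the 20-40GB per day figure comes from looking at the snapshot space usage on the host, roughly like this; the dataset name is just how the zvol is called in my pool layout:)

Code:
# Space consumed by each daily snapshot of the DB data zvol
zfs list -t snapshot -o name,used,refer -s creation local-nvme-pool/vm-302-disk-1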

Please don't get me wrong: PVE is absolutely great and amazing software, and I'm not saying that anything has been done wrong. It's just that I feel it could make great use of some ZFS features here and there, as this appears to be a quite common setup. And it already benefits from them elsewhere, like in live migration, which uses native ZFS snapshots and zfs send/recv.
 
Well... I celebrated too soon :( After three successful backup runs that were started manually (using the job's "Run now" button), the scheduler kicked in for the regular backup and BANG, the replication froze again, with the same symptom of the unkillable process. So fleecing doesn't seem to help either, or at least not always :(

Out of curiosity, I now switched over to "Suspend" mode and manually started a backup without fleecing to test. This time I really see that the VM is slowed down and MySQL replication lags behind, but so far it continues. I'll keep testing this way to see if this mode helps. What I've discovered so far is an apparent bug in the ordering of operations (see note in log output):

Code:
INFO: starting new backup job: vzdump 302 --node storage-2 --notes-template '{{guestname}}' --storage backup-1 --remove 0 --mode suspend --notification-mode auto
INFO: Starting Backup of VM 302 (qemu)
INFO: Backup started at 2026-01-07 18:23:03
INFO: status = running
INFO: backup mode: suspend
INFO: ionice priority: 7
INFO: VM Name: db-1-vm
INFO: include disk 'scsi0' 'local-nvme-pool:vm-302-disk-0' 400G
INFO: include disk 'scsi1' 'local-nvme-pool:vm-302-disk-1' 20T
INFO: suspending guest
INFO: skip unused drive 'hddpool-s2:vm-302-disk-0' (not included into backup)
INFO: skip unused drive 'hddpool-s2:vm-302-disk-0-old' (not included into backup)
INFO: skip unused drive 'local-nvme-pool:vm-302-disk-2' (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/302/2026-01-07T17:23:03Z'

### HERE: ###
INFO: skipping guest-agent 'fs-freeze', agent configured but not running?


INFO: started backup task '74ae522e-dab2-4d0c-a30d-9f9904c66a66'
INFO: resuming VM again after 8 seconds
INFO: scsi0: dirty-bitmap status: OK (204.0 MiB of 400.0 GiB dirty)
INFO: scsi1: dirty-bitmap status: OK (676.9 GiB of 20.0 TiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 677.1 GiB dirty of 20.4 TiB total
INFO:   0% (1.3 GiB of 677.1 GiB) in 3s, read: 457.3 MiB/s, write: 453.3 MiB/s
INFO:   1% (6.9 GiB of 677.1 GiB) in 17s, read: 406.9 MiB/s, write: 406.9 MiB/s
...
INFO:  98% (663.6 GiB of 677.1 GiB) in 35m 59s, read: 502.8 MiB/s, write: 502.8 MiB/s
INFO:  99% (670.6 GiB of 677.1 GiB) in 36m 13s, read: 512.0 MiB/s, write: 512.0 MiB/s
INFO: 100% (677.1 GiB of 677.1 GiB) in 36m 26s, read: 513.5 MiB/s, write: 513.2 MiB/s
INFO: Waiting for server to finish backup validation...
INFO: backup is sparse: 976.00 MiB (0%) total zero data
INFO: backup was done incrementally, reused 19.73 TiB (96%)
INFO: transferred 677.11 GiB in 2189 seconds (316.7 MiB/s)
INFO: adding notes to backup

### And by the way: What's this supposed to do? The VM has been resumed long ago. ###
INFO: resume vm


INFO: Finished Backup of VM 302 (00:36:38)
INFO: Backup finished at 2026-01-07 18:59:41
INFO: Backup job finished successfully
TASK OK

I don't know whether the filesystem freeze is implicit when suspending, but it certainly makes little sense to try to connect to the guest agent after the VM has been suspended.
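
If it helps with debugging, the guest agent and the freeze path can also be exercised from the host outside of any backup, with the standard qm guest commands (VM ID from my setup):

Code:
# Ask the guest agent for its current freeze state
qm guest cmd 302 fsfreeze-status

# Manual freeze/thaw round trip, independent of a backup
qm guest cmd 302 fsfreeze-freeze
qm guest cmd 302 fsfreeze-status
qm guest cmd 302 fsfreeze-thaw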
 
Absolutely clear, I see why it would become slow. But I don't see how this could lead to a permanently broken process inside the VM. I would fully understand if the DB replication just slowed down and lagged behind during the backup, but not what could completely freeze it and lead to an unkillable process.
I assume this is a Linux guest, right?
If you run ps -auxwf for example, you have the STAT column. If a process is in D state, it is waiting for IO to finish. See man 1 ps in the PROCESS STATE CODES section.
D uninterruptible sleep (usually IO)

If a process is in that state you cannot kill it.
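
If you want to see where exactly such a task is stuck, you can list the tasks in D state together with their kernel wait channel, and (as root) dump the kernel stack of the blocked PID; these are generic Linux tools, nothing Proxmox-specific:

Code:
# All tasks currently in uninterruptible sleep, with their wait channel
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

# Kernel stack of a specific blocked task (run as root)
cat /proc/<pid>/stack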

Plus, why does it just magically disappear as soon as the backup is complete? It feels a bit as if at least one write were blocked until the very end of the backup process, instead of the affected block(s) being written to the backup server as fast as possible and the operation then acknowledged. Could that be the case?
Hard to say without taking a very deep dive into what kind of access patterns are happening.


As mentioned, we don't have any plans in the foreseeable future to fundamentally change how Proxmox VE takes backups of VMs. The current mechanism has been like that from the very beginning. Fleecing on a fast local storage should alleviate the performance issues.
I didn't ask yet: what kind of storage is the VM located on (ZFS backed by HDDs? SSDs? If so, consumer or enterprise SSDs?)? And what storage is used for fleecing, if it is not the same one?

When I check our daily ZFS snapshots of the DB disk, I see that they take up some 20-40GB per day. I've now started the backup for the third time within a single day (btw with no more DB failures since enabling fleecing!) for testing and each time it detected some 600-800GB of changed data that subsequently needed to be transferred to the backup server. Sure, the ZFS figures refer to compressed data, while the dirty-bitmap checks the uncompressed VM disk, but it still feels like a very large volume for just a bunch of appended data in a DB. What's the granularity of the dirty-bitmap?
The dirty bitmap tracks which parts of the virtual disk have seen a write operation. When those areas are read in chunks (4 MiB for block based backup sources), the result is hashed and only sent to the backup server if the data actually changed since the last backup. And it is automatically also compressed with zstd before it is sent to the backup server. If the Proxmox Backup Server gets a chunk it already has, it will not store it twice.

I don't know whether the filesystem freeze is implicit when suspending, but it certainly makes little sense to try to connect to the guest agent after the VM has been suspended.
That seems right. I found the following bug in our bugtracker https://bugzilla.proxmox.com/show_bug.cgi?id=3858
Feel free to chime in there so we see for how many people this might be an issue. This helps us prioritize what to work on next.

One thing that could cause issues is if both MySQL VMs are in the same cluster and suspend mode is used. Then the chances are non-zero that both might be suspended at the same time!
Because a default backup job is run at the same time on all nodes. Unless you specify node specific backup jobs with a large enough offset.
So I would rather take a closer look at why snapshot mode + fleecing is still running into some issues sometimes.
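
Just to illustrate what I mean by node-specific jobs with an offset: two entries in /etc/pve/jobs.cfg could look roughly like the sketch below. Treat the exact property names as an assumption on my part; it is easier to create such jobs via Datacenter -> Backup in the GUI anyway.

Code:
# /etc/pve/jobs.cfg (rough sketch, property names may differ slightly)
vzdump: backup-node1
    schedule 01:00
    enabled 1
    mode snapshot
    node storage-1
    storage backup-1
    vmid 301

vzdump: backup-node2
    schedule 03:00
    enabled 1
    mode snapshot
    node storage-2
    storage backup-1
    vmid 302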
 
I let the backup run in Suspend mode, with fleecing enabled and it was in the 6th iteration when I manually aborted it to test further. So this mode does not seem to trigger the issue for some reason. I then switched back to Snapshot mode and it already crashed in the very first backup operation.

I assume this is a Linux guest, right?
If you run ps -auxwf for example, you have the STAT column. If a process is in D state, it is waiting for IO to finish. See man 1 ps in the PROCESS STATE CODES section.

If a process is in that state you cannot kill it.
Process flags are as expected:
Code:
# Backup running, MySQL replication stalled.

$ sudo ps auxwf | grep mysqld
mysql       7212  183 20.8 35470720 27540788 ?   Ssl  Jan07 2431:39 /usr/sbin/mysqld

# Trying to kill the process using SIGTERM, but it's in an uninterruptible sleep:

$ sudo kill 7212
$ sudo ps auxwf | grep mysqld
mysql       7212  183 20.8 35405764 27535376 ?   Dsl  Jan07 2431:43 /usr/sbin/mysqld

# Trying to kill the process using SIGKILL, which converts it to a zombie:

$ sudo kill -KILL 7212
$ sudo ps auxwf | grep mysqld
mysql       7212  183  0.0      0     0 ?        Zsl  Jan07 2431:45 [mysqld] <defunct>

# It stays in this state while the backup is still running.

$ sudo ps auxwf | grep mysqld
mysql       7212  183  0.0      0     0 ?        Zsl  Jan07 2431:45 [mysqld] <defunct>

# Now I manually abort the backup, the sleep ends, the zombie disappears and a new process is spawned:

$ sudo ps auxwf | grep mysqld
mysql       9495  117  1.9 29069716 2526512 ?    Ssl  16:29   0:07 /usr/sbin/mysqld
Hard to say without taking a very deep dive into what kind of access patterns are happening.

Indeed... Is there some reasonable way to see the map of blocks which the hypervisor currently keeps locked for writes because they haven't been written to the backup server or the fleecing image?

As mentioned, we don't have any plans in the foreseeable future to fundamentally change how Proxmox VE takes backups of VMs. The current mechanism has been like that from the very beginning. Fleecing on a fast local storage should alleviate the performance issues.
I didn't ask yet: what kind of storage is the VM located on (ZFS backed by HDDs? SSDs? If so, consumer or enterprise SSDs?)? And what storage is used for fleecing, if it is not the same one?

Yeah, fleecing works well and the guest doesn't notice any performance degradation; at least that was the case for three runs, before the original issue came up again in the fourth.
The VM disks are located on eight Samsung 990 Pro 4TB SSDs, organized as two raidz1 vdevs (24TB usable storage in total). The fleecing image is either on the same storage or on the local-zfs, which is on a ZFS mirror of two Samsung PM9A3 3.xxTB datacenter NVMes. I've tested with both. The backup server runs on a bunch of Seagate "enterprise" 3.5" HDDs, so it will always be the bottleneck.

The dirty bitmap tracks which parts of the virtual disk have seen a write operation. When those areas are read in chunks (4 MiB for block based backup sources), the result is hashed and only sent to the backup server if the data actually changed since the last backup. And it is automatically also compressed with zstd before it is sent to the backup server. If the Proxmox Backup Server gets a chunk it already has, it will not store it twice.

Ok, so the granularity is 4MiB, meaning that any changed bit within a 4MiB block will lead to the whole block being marked as dirty, right? This could at least be an explanation to why we're seeing those large changes in the backups after only a few hours of VM runtime. Assuming that the DB changes individual blocks of 16kiB which are spread over a large area of the disk, it could lead to a worst case inflation factor of 4MiB/16kiB=256. So I think the actual factor of 20-50 that we see between ZFS snapshots (volume has volblocksize=16k) and the dirty-bitmap is explainable by this, that's fine.

That seems right. I found the following bug in our bugtracker https://bugzilla.proxmox.com/show_bug.cgi?id=3858
Feel free to chime in there so we see for how many people this might be an issue. This helps us prioritize what to work on next.

Oh, thanks for the hint. To be honest, I didn't even bother to search for this bug online, as I assumed that Suspend mode probably isn't used by anyone since it's discouraged by the docs.

One thing that could cause issues is if both MySQL VMs are in the same cluster and suspend mode is used. Then the chances are non-zero that both might be suspended at the same time!
Because a default backup job is run at the same time on all nodes. Unless you specify node specific backup jobs with a large enough offset.
So I would rather take a closer look at why snapshot mode + fleecing is still running into some issues sometimes.

You mean because the suspend would cause a short downtime of the whole database in this case? We have individual jobs configured for selected sets of VMs and this DB even has its own job. Timings are shifted to avoid running multiple backups at the same time; not because of downtime, but because the backup server doesn't become faster with concurrent jobs. So I don't think that this would cause issues. However, I'd of course prefer to have a reliable snapshot+fleecing mode instead of doing a workaround with suspend mode that might not be a real fix to the underlying problem. Though I feel that this might become a fairly long story...
 
I let the backup run in Suspend mode, with fleecing enabled and it was in the 6th iteration when I manually aborted it to test further. So this mode does not seem to trigger the issue for some reason. I then switched back to Snapshot mode and it already crashed in the very first backup operation.
A difference with snapshot mode is that the QEMU agent does an fs-freeze/fs-thaw before the backup, so maybe you can try to disable this? (There is a VM config option for this in the guest agent settings.)
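
(A minimal sketch of what I mean, assuming the agent property string on recent PVE versions; check the exact option name for your release:)

Code:
# keep the guest agent enabled, but skip fs-freeze/fs-thaw around backups
qm set 302 --agent enabled=1,freeze-fs-on-backup=0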



Also, not sure whether it could help, but if you are using the guest agent, there is also an extra hook that you can add in the guest:
https://github.com/qemu/qemu/blob/m...t-agent/fsfreeze-hook.d/mysql-flush.sh.sample

(in /etc/qemu-guest-agent/....)

It forces a commit of pending transactions before taking the snapshot.
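
A very simplified sketch of such a hook, just to show the mechanism; the upstream mysql-flush.sh.sample additionally holds a global read lock open across the freeze via a FIFO, while this one only flushes:

Code:
#!/bin/sh
# Simplified fsfreeze hook sketch for MySQL (assumes client credentials
# are available, e.g. via ~/.my.cnf). Called by the guest agent with
# "freeze" before fs-freeze and "thaw" after fs-thaw.
case "$1" in
    freeze)
        mysql -e "FLUSH LOGS; FLUSH TABLES;"
        ;;
    thaw)
        # nothing to undo, no lock is held by this sketch
        ;;
esac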
 
A difference with snapshot mode is that the QEMU agent does an fs-freeze/fs-thaw before the backup, so maybe you can try to disable this? (There is a VM config option for this in the guest agent settings.)

Indeed. I considered trying with the guest agent disabled, but haven't done so far.

Also, not sure whether it could help, but if you are using the guest agent, there is also an extra hook that you can add in the guest:
https://github.com/qemu/qemu/blob/m...t-agent/fsfreeze-hook.d/mysql-flush.sh.sample

(in /etc/qemu-guest-agent/....)

It forces a commit of pending transactions before taking the snapshot.

Such a hook is already implemented, and I actually first suspected it of causing the issues; but they persisted after removing the hook.