Data corruption when backup job runs into timeouts/communication issues with PBS

broth-itk

Member
Dec 30, 2024
Dear Community,

this is a followup of https://forum.proxmox.com/threads/vm-filesystem-corruption-after-suspending-reboot.168690/
I suspected hibernation to be the cause of the data corruption, but further investigation now shows that the backup process was actually causing it.

Investigation - Timeline of events:

  1. PVE Server #1 was updated from 8.4.1 to 8.4.5, then rebooted
  2. PVE Server #2 (running PBS as a container, used to back up server #1) was updated & rebooted as well
  3. PVE Server #1 & #2 came back online with no obvious problems
  4. Backup job scheduled for 9pm was started automatically
  5. This morning many Linux VMs had corrupted filesystems, were frozen, or were unable to boot

Attached is a log from the backup job.
The first backups ran into a timeout shortly after they started.

Starting with VM 103, there were severe errors:

Code:
103: 2025-07-20 21:52:59 ERROR: VM 103 qmp command 'guest-fsfreeze-thaw' failed - got timeout
103: 2025-07-20 21:52:59 INFO: started backup task '341012e3-8635-4c9a-aa56-d73eb8a82913'
103: 2025-07-20 21:52:59 INFO: resuming VM again
103: 2025-07-20 21:53:44 ERROR: VM 103 qmp command 'cont' failed - unable to connect to VM 103 qmp socket - timeout after 449 retries
103: 2025-07-20 21:53:44 INFO: aborting backup job
103: 2025-07-20 22:03:44 ERROR: VM 103 qmp command 'backup-cancel' failed - unable to connect to VM 103 qmp socket - timeout after 5973 retries
103: 2025-07-20 22:03:44 INFO: resuming VM again
103: 2025-07-20 22:04:29 ERROR: Backup of VM 103 failed - VM 103 qmp command 'cont' failed - unable to connect to VM 103 qmp socket - timeout after 449 retries


Obviously there were communication issues with the PBS system. More on that below.

In any case, the backup process failed at a given point in time and left the VM in an inconsistent state.
While investigating the root cause, I reproduced the data corruption multiple times, each time unintentionally.

IMHO the system should be resilient to this type of error and ensure that the VMs are kept in good working order.
There were many retries, all of them unsuccessful; it looks like the system tried hard to avoid this problem.

I saw that the backup failed, but at that time I was absolutely not aware that my VMs were corrupted.


Root cause:

In short: The MTU of the network interface of PBS was set to 1412 bytes.

Yes, this setting is incorrect and it's my fault ;-)

This was configured to ensure that no packets larger than that MTU are sent towards an offsite mirror.
Before the update, I did not notice any issue with backups, nor did the logs show problems.
I think the server reboot changed something in the PBS container beyond just applying the MTU setting (and restarting the LXC).

How to reproduce the issue:

Since I unintentionally corrupted my systems several times today by "just" running a backup, the issue is easy to reproduce by simply setting the PBS container's MTU to a lower value such as 1412 (in my case).
In reality this can happen on WAN/VPN links when fragmentation is disabled (DF bit set and not ignored by the firewall) or when the MSS clamping setting is incorrect.
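
A quick way to confirm such a path MTU problem is to ping the PBS host with the DF bit set; the hostname below is just a placeholder.

Code:
# 1472-byte payload + 28 bytes ICMP/IP header = 1500; this should fail on a 1412-byte path
ping -M do -s 1472 -c 3 pbs.example.lan
# 1384 + 28 = 1412; this should still go through
ping -M do -s 1384 -c 3 pbs.example.lan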

Expected behaviour:

When there is a problem talking to PBS for whatever reason, the system must ensure that the source VMs are reverted to their original state.



Please let me know if more information is required.
I'd like to share more details about the failed commands if another log contains such information.

Best regards,
Bernhard
 

I'm missing important details here, like the storage used in these nodes and some of the VMs' configuration.

When backing up a VM, a snapshot is created at the QEMU level. When the VM wants to write data over a block that has not been backed up yet, that write is halted, the about-to-be-replaced block(s) are sent to the backup first, and only then is the new data written to the VM disk. If PBS is unreachable or the write operation can't finish (as would happen with an MTU issue), the write operation at the VM level should simply not finish at all, and thus not cause any harm to the VM's filesystem.

I remember seeing similar issues in the past, but AFAIR they got solved at the time.

Would you mind trying a fleecing device and attempting to reproduce the issue?
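
For a one-off test, something along these lines should enable fleecing for a manual backup run (the PBS storage name is a placeholder, the local storage name is taken from this thread; adjust both to your setup):

Code:
# Back up VM 103 in snapshot mode with a fleecing image on local ZFS storage,
# so guest writes go to the fleecing image instead of waiting on PBS
vzdump 103 --mode snapshot --storage <your-pbs-storage> --fleecing enabled=1,storage=local-tank01

Fleecing can also be enabled in the advanced options of the backup job itself.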
 
Obviously there were communication issues with the PBS system. More on that below.
That I do not see from the log; what I do see is the QEMU Guest Agent failing to thaw the guest file systems from the inside again.

Any logs from within the guests? Because as of now this seems to be the origin of the problems.

Root cause:

In short: The MTU of the network interface of PBS was set to 1412 bytes.
How did you determine that? Traffic between libproxmox-backup-qemu and PBS does not alter the QEMU disk state at all, so it would seem rather odd that any disruption of that traffic causes issues on the VM end, beyond naturally somewhat higher IO pressure or, at worst, hanging IO operations (which != data corruption).

How to reproduce the issue:
As in, you re-tried it and it always happens in that case and never with a "correct" MTU? FWIW, I have some non-standard MTU links due to some ISPs using odd IPv4-inside-IPv6 tunneling, and did not notice any issue here. I also tested your config explicitly and could not see any issue either.

What storage does the VM use? If it wasn't the QGA thaw that caused this, then network-attached storage that got interrupted, or something similar, might rather be the cause.
 
I remember seeing similar issues in the past, but AFAIR they got solved at the time.
Yes, this error description also rang a bell for me and reminded me of problems with vzdump back in the day. We also had them regularly and needed to reset VMs that were stuck in this frozen VM block disk state forever. Since switching to PBS, we never encountered them again.
 
Thanks for your feedback!
I'll try to answer each question individually.

storage used in these nodes and some of the VMs' configuration

Storage: a zpool with 2x enterprise NVMe disks; the VMs use zvols with the VirtIO SCSI single controller.

Which part of the VM configuration is required?
This is for ID 103:

Code:
agent: 1
bios: seabios
boot: order=scsi0;ide0
cores: 2
cpu: x86-64-v4
ide0: local:iso/virtio-win-0.1.271.iso,media=cdrom,size=709474K
machine: pc-i440fx-9.0
memory: 8192
meta: creation-qemu=9.0.2,ctime=1740488510
name: xxxxxx
net0: virtio=00:0c:29:62:2c:85,bridge=vmbr0,tag=40
numa: 0
onboot: 1
ostype: win11
scsi0: local-tank01:vm-103-disk-0,discard=on,iothread=1,size=120G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=564d6c4a-9822-650d-5948-f1103a622c85
sockets: 1
startup: order=2,up=30
vga: qxl
vmgenid: 3f18307e-10e7-45e1-b3b5-da5cf5fe261a

Any logs from within the guests? Because as of now this seems to be the origin of the problems.

Where can I get those logs? I doubt that all the failed VMs have the same issue with the guest agent.
And even if there were problems with the guest agent, they should not cause data corruption on the virtual disk (zvol).
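
For reference, these are the standard commands to collect the relevant guest-side logs on a systemd-based Linux guest (unit names and paths may differ per distribution):

Code:
# Inside an affected guest:
journalctl -u qemu-guest-agent --since "2025-07-20 21:00"   # guest agent activity around the backup window
dmesg -T | grep -iE 'ext4|i/o error|blk'                    # filesystem / block layer errors
journalctl -k -b -1                                         # kernel log of the previous boot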

How did you determine that? Traffic between libproxmox-backup-qemu and PBS does not alter the QEMU disk state at all, so it would seem rather odd that any disruption of that traffic causes issues on the VM end, beyond naturally somewhat higher IO pressure or, at worst, hanging IO operations (which != data corruption).

I originally wrote the following in German to avoid misunderstandings; here is the English version.

The pattern of a connection being established, something starting and then hanging is familiar to me from networking and points to an MTU problem. I therefore checked all settings and found an MTU of 1412 on the NIC of the PBS container. That was a setting I had once chosen for testing while troubleshooting something else (remote-sync bandwidth) and simply never reverted.
Up to the update to 8.4.5 this was not a problem. I don't think the update itself is responsible, but possibly the accompanying reboot of host and container.

In my opinion, the pure data transfer between PVE and PBS is not responsible for the data errors inside the VMs.

What stands out in the backup logs is that the backup starts but then aborts with various errors/timeouts.
Afterwards I found various ext4 errors, systemd messages about a "read only filesystem" and so on inside the VMs.
Some VMs were not usable at all anymore (frozen) or only spat out errors. A reboot usually ended at the initramfs prompt because the filesystem errors could not be repaired automatically.

I don't know enough about the internals of how the backups are actually made. With ZFS I would expect roughly the following (see the sketch after this list):

- sync buffers with the qemu guest agent, pause the VM
- zfs snapshot
- save the dirty map
- resume the VM
- back up the changed blocks
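
A rough sketch of that expected flow as shell commands, just to illustrate the idea; this is NOT how PVE/PBS actually back up VMs, and the dataset path is a placeholder:

Code:
# Hypothetical ZFS-snapshot-based flow (illustration only)
qm guest cmd 103 fsfreeze-freeze                    # flush and freeze the guest filesystems
zfs snapshot tank01/vm-103-disk-0@backup            # point-in-time snapshot of the zvol
qm guest cmd 103 fsfreeze-thaw                      # guest resumes writing immediately
# ... back up / send the snapshot while the VM keeps running, then clean up:
zfs destroy tank01/vm-103-disk-0@backup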

Be that as it may, the timeouts may have caused data that should have been cached or written to the SSD to be lost. The VM stalled because of the pending I/Os. The question is what happened afterwards:
was the pending data written, or was it lost? The latter would at least explain the observed errors.

Backend storage for the VMs is a zpool on PCIe NVMe SSDs (2x 6.4 TB); no NAS or anything similar is in use.

Please excuse the long text; I'm in a bit of a hurry and wanted to answer comprehensively.

I'll gladly take more time this evening to answer detailed follow-up questions.

In the meantime I'm preparing a clone of a VM that will be used for further tests.
 
I reproduced the issue successfully:

- The backup server has an MTU of 1412 configured both on vmbr0 and in the NIC settings of the PBS container (see the verification commands after this list)
Important: the host and the PBS container need to be rebooted for the issue to be triggered! Simply applying the settings is not enough!

- I took a clone of my PDM VM (Proxmox Datacenter Manager, installed from the Alpha ISO) to perform the test

- In a screen session of the PDM test VM I ran
Code:
while ((1)); do date >>date.txt; sync; sleep 1; done
- Another terminal window was pinging the VM's IP
- An SSH session to the PDM test VM ran "tail -f date.txt"

- The backup job was started via "Run now"
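
For completeness, the effective MTU on the host bridge and inside the PBS container can be checked like this (the container ID is a placeholder):

Code:
# On the PVE host running the PBS container
ip link show vmbr0 | grep -o 'mtu [0-9]*'
pct exec <CTID> -- ip link show eth0 | grep -o 'mtu [0-9]*'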

Observations:

- Ping kept running throughout the process with no drops
- "tail -f date.txt" stopped at a given point in time
- "dmesg" showed kernel errors/hangs (attached)
- Once the backup was finished (log attached), the shell in the SSH window returned endless input/output errors
- An aborted ext4 journal caused root to be remounted read-only
- Logging in to the VM via the console was not possible
- After resetting the VM, Linux ran fsck and corrected errors; this time the PDM VM was able to boot successfully.


I think all steps are properly documented with the logs and screenshots.

Please let me know if more information or logs are required.

I am currently repeating the test with fleecing enabled. The backup will fail again, but I expect the data to stay consistent this time.
I'll let you know once the backup job has ended.

Final thoughts:

I don't know how the backup process works in detail, but I assume that while PBS communication is not working correctly, VM writes stack up and eventually time out.
Since the backup job mode is set to "snapshot", I'd expect the system to actually use the snapshot functionality of ZFS.
Unfortunately I can't see any snapshots while the backup job is running.

Code:
root@pve:~# zfs list -t snapshot
no datasets available
root@pve:~#
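
For anyone who wants to watch for snapshots while a backup job is running, something like this works:

Code:
# Poll the snapshot list once per second during the backup window
watch -n 1 'zfs list -t snapshot -o name,creation'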

Is this intended to work this way?

I don't want to compare apples and oranges, but Veeam does exactly that for its backups: sync data to disk / flush the VM OS buffers, snapshot, back up from the snapshot, delete the snapshot. In the meantime ZFS will happily write data to disk while a consistent zvol is available for backup.
There is probably a reason for that...

Enough for today, I'm very curious about your feedback :)
 

Attachments

  • aborted_ext4_journal.png (53.6 KB)
  • vm_config.png (54.5 KB)
  • task-pve-vzdump-2025-07-22T18_34_39Z.log (1.3 KB)
  • backup_failed_state.png (274.8 KB)
  • dmesg.txt (12.6 KB)
  • backup_job_freeze.png (215.9 KB)
  • backup_host_bridge.png (93.9 KB)
  • backup_container_nic.png (37.9 KB)
  • backup_job_settings.png (75.1 KB)
  • pdm_vm_dmesg.png (91.2 KB)
Fifteen minutes until the timeout triggers is just too much for the VMs to handle.
If the backup job is cancelled prematurely, the VM might hang, but no severe errors occur.