Data corruption when a backup job runs into timeouts/communication issues with PBS

broth-itk

Dear Community,

This is a follow-up of https://forum.proxmox.com/threads/vm-filesystem-corruption-after-suspending-reboot.168690/
I was suspecting hibernation to be the cause of the data corruption, but further investigation now shows that the backup process was actually causing it.

Investigation - Timeline of events:

  1. PVE Server #1 was updated from 8.4.1 to 8.4.5, then rebooted
  2. PVE Server #2 (with PBS running as a container, used to back up server #1) was updated & rebooted as well
  3. PVE Server #1 & #2 came back online with no obvious problems
  4. Backup job scheduled for 9pm was started automatically
  5. This morning many Linux VMs had a corrupted filesystem, were frozen, or were unable to boot

Attached is a log from the backup job.
The first backups resulted in a timeout shortly after they started.

Starting with VM 103 there were severe errors:

Code:
103: 2025-07-20 21:52:59 ERROR: VM 103 qmp command 'guest-fsfreeze-thaw' failed - got timeout
103: 2025-07-20 21:52:59 INFO: started backup task '341012e3-8635-4c9a-aa56-d73eb8a82913'
103: 2025-07-20 21:52:59 INFO: resuming VM again
103: 2025-07-20 21:53:44 ERROR: VM 103 qmp command 'cont' failed - unable to connect to VM 103 qmp socket - timeout after 449 retries
103: 2025-07-20 21:53:44 INFO: aborting backup job
103: 2025-07-20 22:03:44 ERROR: VM 103 qmp command 'backup-cancel' failed - unable to connect to VM 103 qmp socket - timeout after 5973 retries
103: 2025-07-20 22:03:44 INFO: resuming VM again
103: 2025-07-20 22:04:29 ERROR: Backup of VM 103 failed - VM 103 qmp command 'cont' failed - unable to connect to VM 103 qmp socket - timeout after 449 retries


Obviously there were communication issues with the PBS system. More on that below.

In any case, the backup process failed at some point and left the VM in an inconsistent state.
During my investigation of the root cause, I unintentionally reproduced the data corruption multiple times.

IMHO the system should be resilient to this type of error and ensure that the VMs are kept in good working order.
Judging by the many (ultimately unsuccessful) retries, the system did try hard to avoid this problem.

I saw that the backup had failed, but at that time I was not aware that my VMs were corrupted.
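For anyone in the same situation, checking whether a VM was left stuck after such an aborted backup could look like this (a sketch only; VM 103 is just an example, commands run on the PVE host):

Code:
# current run state of the VM (running, paused, ...)
qm status 103 --verbose

# check whether a stale backup lock is still present
qm config 103 | grep ^lock

# if so, remove the stale lock and resume the paused VM
qm unlock 103
qm resume 103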


Root cause:

In short: The MTU of the network interface of PBS was set to 1412 bytes.

Yes, this setting is incorrect and it's my fault ;-)

This was configured to ensure that no packets larger than this MTU are sent to an offsite mirror.
Before the update, I did not notice any issue with backups, nor did the logs show problems.
I think the server reboot did something different to the PBS container than just setting the MTU (& rebooting the LXC) would have.
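For the record, this is how such an MTU mismatch can be spotted from the command line (a sketch; pbs.example.lan is a placeholder for the PBS address):

Code:
# inside the PBS container: effective MTU of its NIC
ip link show dev eth0

# from the PVE host: send a non-fragmentable, full-size packet
# (payload = MTU - 28 bytes for IP/ICMP headers; 1472 assumes MTU 1500)
ping -M do -s 1472 -c 3 pbs.example.lan

# discover the actual path MTU
tracepath pbs.example.lan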

How to reproduce the issue:

Since I unintentionally corrupted my systems several times today by "just" running a backup, the issue is easy to reproduce: set the PBS container's MTU to a lower value (1412 in my case), as in the sketch below.
In reality this can happen on WAN/VPN links when fragmentation is not possible (DF bit set and not ignored by a firewall) or when the MSS clamping setting is incorrect.
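A minimal reproduction sketch along those lines, assuming PBS runs in LXC container 200 on bridge vmbr0 (container ID, bridge, and IP settings are placeholders):

Code:
# WARNING: per this thread, this can corrupt guest filesystems -
# only run the backup against a disposable test VM

# lower the MTU of the PBS container's NIC, then restart the container
pct set 200 -net0 name=eth0,bridge=vmbr0,ip=dhcp,mtu=1412
pct reboot 200

# back up a disposable test VM to the PBS storage
vzdump <testvmid> --storage <pbs-storage-id>

# afterwards, revert to the bridge's default MTU
pct set 200 -net0 name=eth0,bridge=vmbr0,ip=dhcp
pct reboot 200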

Expected behaviour:

When there is a problem talking to PBS for whatever reason, the system must ensure that the source VMs are reverted to their original state.



Please let me know if more information is required.
I'd be happy to share more details about the failed commands if any other log contains such information.

Best regards,
Bernhard
 

I'm missing important details here, like the storage used in these nodes and some of the VMs' configuration.

When backing up a VM, a snapshot is created at QEMU level. When the VM wants to write data over a block that has not been backed up yet, that write is halted, the about-to-be-replaced block(s) are sent to the backup, and only then is the new data written to the VM disk. If PBS is unreachable or the write op can't finish (as would happen with an MTU issue), the write operation at VM level should not finish at all, thus not causing any harm to the VM's filesystem.

I remember seeing similar issues in the past, but AFAIR they got solved at the time.

Would you mind trying with a fleecing device and seeing whether you can reproduce the issue?
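If it helps, fleecing can be enabled in the backup job's advanced options or on the command line; a sketch, assuming a PBS storage named pbs01 and a local storage named local-tank01 holding the fleecing images (both names are placeholders, and the option requires a reasonably recent PVE 8.x):

Code:
# run the backup with a local fleecing image absorbing guest writes
vzdump 103 --storage pbs01 --fleecing enabled=1,storage=local-tank01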
 
Obviously there were communication issues with the PBS system. More on that below.
That I do not see from the log; what I do see is the QEMU Guest Agent failing to thaw the guest file systems from the inside again.

Any logs from within there? Because as of now this seems to be the origin of the problems.
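For example (a sketch; the unit name can differ per distro), from the PVE host and from inside an affected guest:

Code:
# on the PVE host: is the guest agent reachable, is a freeze still active?
qm agent 103 ping
qm agent 103 fsfreeze-status

# inside the guest: guest agent log and kernel messages of the current boot
journalctl -b -u qemu-guest-agent
dmesg | grep -iE 'ext4|i/o error'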

Root cause:

In short: The MTU of the network interface of PBS was set to 1412 bytes.
How did you determine that? As the traffic between libproxmox-backup-qemu and PBS does not alter the QEMU disk state at all, it would seem rather odd that any traffic impact there causes issues on the VM end, besides naturally some higher IO pressure or, at worst, hanging IO operations (which != data corruption).

How to reproduce the issue:
As in, you re-tried it and it always happens in that case and never with a "correct" MTU? FWIW, I have some non-standard MTU links due to some ISPs using odd IPv4-inside-IPv6 tunneling and did not notice any issue here. I also tested your config explicitly and could not see any issue.

What storage does the VM use? If it wasn't the QGA thaw that caused this, then network-attached storage that got interrupted, or the like, might rather be the cause.
 
I remember seeing similar issues in the past but AFAIR the got solved at the time.
Yes, this error description also rang a bell and reminded me of problems with vzdump back in the day. We also had them regularly and needed to reset VMs that were stuck in this frozen block-disk state forever. Since switching to PBS, we never encountered them again.
 
Thanks for your feedback!
I'll try to answer each question individually.

storage used in these nodes and some of the VMs' configuration

Storage: a zpool with 2x enterprise NVMe disks; the VMs use zvols with the VirtIO SCSI single controller.
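As a side note, the health of the pool underneath can be checked like this (a sketch; run on the PVE host):

Code:
# health, error counters and any pending scrub/resilver
zpool status -v

# zvols backing the VM disks
zfs list -t volume -o name,used,volsize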

Which part of the VM configuration is required?
This is for ID 103:

Code:
agent: 1
bios: seabios
boot: order=scsi0;ide0
cores: 2
cpu: x86-64-v4
ide0: local:iso/virtio-win-0.1.271.iso,media=cdrom,size=709474K
machine: pc-i440fx-9.0
memory: 8192
meta: creation-qemu=9.0.2,ctime=1740488510
name: xxxxxx
net0: virtio=00:0c:29:62:2c:85,bridge=vmbr0,tag=40
numa: 0
onboot: 1
ostype: win11
scsi0: local-tank01:vm-103-disk-0,discard=on,iothread=1,size=120G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=564d6c4a-9822-650d-5948-f1103a622c85
sockets: 1
startup: order=2,up=30
vga: qxl
vmgenid: 3f18307e-10e7-45e1-b3b5-da5cf5fe261a

Any logs from within there? Because as of now this seems to be the origin of the problems.

Where can I get those logs? I doubt that all failed VMs have the same issue with the guest agent.
And even if there were problems with the guest agent, they should not cause data corruption on the virtual disk (zvol).

How did you determine that? As the traffic between libproxmox-backup-qemu and PBS does not alter the QEMU disk state at all, it would seem rather odd that any traffic impact there causes issues on the VM end, besides naturally some higher IO pressure or, at worst, hanging IO operations (which != data corruption).

I am familiar from networking with this effect, where a connection is established, something starts, and then hangs; it points to an MTU problem. I then checked all the settings and found an MTU of 1412 on the NIC of the PBS container. This was a setting I had chosen on a trial basis while troubleshooting remote-sync bandwidth; it simply was never reverted.

Until the update to 8.4.5, this wasn't a problem. I don't think the update itself is responsible for the problem, but possibly the accompanying restart of the host and container.

In my opinion, the pure data transfer between PVE and PBS is not responsible for the data errors within the VMs.

In the backup logs it is noticeable that the backup starts, but then aborts with various errors/timeouts.
Afterwards, I found various ext4 errors, systemd messages about "read only filesystem", etc. within the VMs.
Some of the VMs were no longer usable at all (frozen) or only spat out errors. A restart usually led to the initramfs prompt because the file system errors could not be fixed automatically.

I am not familiar enough with the internals of how the backups are actually made. With ZFS, I would expect:

- sync buffers with the qemu guest agent, pause the VM
- zfs snapshot
- save the dirty map
- resume the VM
- back up the changed blocks

In any case, the timeouts may have caused data that should have been cached or written to the SSD to be lost. The VM stalled because of the pending I/Os. The question is what happened next: was the pending data written, or was it lost? The latter would explain the errors observed.

Backend storage for the VMs is a zpool on PCIe NVMe SSDs (2x 6.4 TB); no NAS or similar is in use.

Please excuse my long text; I am in a bit of a hurry and wanted to give a comprehensive answer.

I will be happy to take more time this evening to answer detailed questions.

In the meantime, I am preparing a clone of a VM that will be used for further testing.
