Super slow, timeouts, and VMs stuck while backing up after updating to PVE 9.1.1 and PBS 4.0.20

So for those who see the "interrupted by signal" error, please specify whether you manually aborted the backup because it was slow, or whether this is part of the error chain. I would like to rule out that mixed issues are being reported here, thanks!
 
So for those who see the "interrupted by signal" error, please specify whether you manually aborted the backup because it was slow, or whether this is part of the error chain. I would like to rule out that mixed issues are being reported here, thanks!

Hi,
in my last case I manually aborted the job after seeing it stuck at 99%...

PS: I've just finished testing PBS 4.0.22 with the old kernel 6.14... five backups done, blazing fast, no errors. Now I'm rebooting with the latest kernel and will run the same test...
 

Attachments

Do you have any traffic control rules defined for this PBS instance? What filesystem is used to back the datastore?
 
Do you have any traffic control rules defined for this PBS instance? What filesystem is used to back the datastore?
Hi Chris,
we finally have a culprit: kernel 6.17.2-1-pve

It's been more than an hour since I started the job, and it's stuck at 99%, as you can see in the attached log. The other thing to notice is the big difference in speed between this test and the five previous runs (which averaged 500 MiB/s).

I have no traffic control rules enabled, and my datastore is on ZFS (RAID-Z2 on this PBS and RAID-Z1 on the other one).

Best regards!
 

Attachments

Looks like I'm experiencing the same/similar issue as well. I'm running a 3-node setup with Ceph as the backend storage, plus an external PBS instance for backups.

Interestingly enough, I only see the slowdown with VMs using the raw image format. I have a VM configured as qcow2 stored within Ceph (mistake: the qcow2 VM is local) that backs up at a much greater speed. For a raw-based VM I'm seeing reads/writes of < 5 MB/s, while the VM using qcow2 gets 100+ MB/s.

It does eventually complete, taking about an hour for a 64 GB VM to get pushed up to PBS. One other thing to note: when I execute a "stop" in the popup window (after starting a backup from PVE), I get the "ERROR: interrupted by signal" printout, but it doesn't appear to actually halt the process. It continues, and the VM stays in a locked state unless I run qm unlock.

I'm happy to share logs, but given the type of system, I'd need to transcribe them, so that could take a while.
 
I have a VM configured as qcow2 stored within Ceph that backs up at a much greater speed.
What do you mean by this? qcow2 is not a supported format on RBD storages. Have you added a directory on CephFS as a directory-type storage?

EDIT: is the same network used for PBS and Ceph?
 
What do you mean by this? qcow2 is not a supported format on RBD storages. Have you added a directory on CephFS as a directory-type storage?
Oops, my mistake! Disregard that... that VM is NOT stored in Ceph. I think I'm mixing up my VMs.
 
Hi Chris,
we finally have a culprit: kernel 6.17.2-1-pve

It's been more than an hour since I started the job, and it's stuck at 99%, as you can see in the attached log. The other thing to notice is the big difference in speed between this test and the five previous runs (which averaged 500 MiB/s).

I have no traffic control rules enabled, and my datastore is on ZFS (RAID-Z2 on this PBS and RAID-Z1 on the other one).

Best regards!
Please post the output of arc_summary, free and cat /proc/pressure/io with the 6.17.2-1 kernel when the backup hangs.
 
Can anybody else relate the issue to kernel 6.17? For people saying the issue started on a specific date, did that coincide with a reboot into the 6.17 kernel?

The logs and debug information unfortunately do not show any clear cause, and there is nothing obvious in common between all systems and configurations, except the use of network-based storage: many people with Ceph, one with LVM over iSCSI.

While there is no clear indication that this will help, one thing that might be interesting to test is downgrading QEMU: apt install pve-qemu-kvm=10.0.2-4 pve-qemu-kvm-dbgsym=10.0.2-4. Note that VMs will only pick up that version on a fresh start (or if migrated to a node with that version, or when using the Reboot button in the UI).
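For copy-paste convenience, the downgrade step can be sketched like this (untested here; the heredoc only prints the commands, so nothing is executed by accident — run them as root on each PVE node):

```shell
# Dry-run sketch of the QEMU downgrade suggested above; the heredoc just
# prints the command. After installing, fresh-start, migrate, or use the
# UI Reboot button so VMs pick up the downgraded binary.
cat <<'EOF'
apt install pve-qemu-kvm=10.0.2-4 pve-qemu-kvm-dbgsym=10.0.2-4
EOF
```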
 
we finally have a culprit: kernel 6.17.2-1-pve
Thanks for the info!
I just rebooted our nodes using proxmox-boot-tool kernel pin 6.14.11-4-pve --next-boot, and two consecutive backup runs were successful. The bitmaps are quite small right now, so I'll report back tomorrow after the nightly backup.
Edit: I should have added that our nodes run both PVE and PBS, so I can't tell exactly what the 6.14.11-4-pve kernel fixes. I changed our backup schedule back to every 3 hours, so I should have some runs by tomorrow.
 
Hello,
Same scenario here for a week as well. Backup throughput completely collapses on certain VMs and then they freeze.
The kernels in use are:
PVE: 6.8.12-16-pve,
PBS: 6.17.2-1-pve.
No anomalies detected on the ZFS pools.
 
Please post the output of arc_summary, free and cat /proc/pressure/io with the 6.17.2-1 kernel when the backup hangs.
Hi Chris,
here is the requested info:

free

               total        used        free      shared  buff/cache   available
Mem:       131631860     4842180   127435112        9860      425556   126789680
Swap:        8388604           0     8388604



cat /proc/pressure/io

some avg10=0.00 avg60=0.00 avg300=0.00 total=46263599
full avg10=0.00 avg60=0.00 avg300=0.00 total=43289403
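For reference, those PSI lines can be parsed and read like this (a minimal sketch; the some/full and avg10/avg60/avg300 semantics come from the kernel's PSI documentation — averages are stall percentages over the window, total is cumulative stall time in microseconds):

```python
# Sketch: parse the /proc/pressure/io output quoted above.
def parse_psi(text):
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

sample = """\
some avg10=0.00 avg60=0.00 avg300=0.00 total=46263599
full avg10=0.00 avg60=0.00 avg300=0.00 total=43289403
"""
psi = parse_psi(sample)
# All averaging windows read 0.00 -> no task is currently stalled on I/O,
# which points away from disk pressure as the cause of the hang.
print(psi["full"]["avg10"])  # prints 0.0
```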



arc_summary is attached.

PS: Backup Job is still stuck at 99% (almost 3 hours). Can I stop it?
 

Attachments

PS: Backup Job is still stuck at 99% (almost 3 hours). Can I stop it?
Can you install gdb and the PBS debug symbols via apt install gdb proxmox-backup-server-dbgsym and run gdb --batch --ex 't a a bt' -p $(pidof proxmox-backup-proxy) > proxy.backtrace? Then post the backtrace as an attachment, thanks.
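Since the forum editor mangled the redirect, here are the same commands in a code block (the heredoc only prints them; run them as root on the PBS host — the gdb invocation dumps a backtrace of every proxmox-backup-proxy thread to a file):

```shell
# Prints the debug-capture steps from the post above; the '>' redirect
# writes the per-thread backtraces ('t a a bt') to proxy.backtrace.
cat <<'EOF'
apt install gdb proxmox-backup-server-dbgsym
gdb --batch --ex 't a a bt' -p $(pidof proxmox-backup-proxy) > proxy.backtrace
EOF
```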

Edit: Fixed '>' misinterpretation by the editor
 
Hello,
Same scenario here for a week as well. Backup throughput completely collapses on certain VMs and then they freeze.
The kernels in use are:
PVE: 6.8.12-16-pve,
PBS: 6.17.2-1-pve.
No anomalies detected on the ZFS pools.
Please also generate the backtrace as suggested in my previous post, thanks! Then check whether booting into the older kernel makes the issue disappear for you as well.
 
Hello,
Same scenario here for a week as well. Backup throughput completely collapses on certain VMs and then they freeze.
The kernels in use are:
PVE: 6.8.12-16-pve,
PBS: 6.17.2-1-pve.
No anomalies detected on the ZFS pools.
Hi,
try rebooting PBS into the previous kernel, 6.14.11-4-pve. On my two PBS instances it's working like a charm. This can be a temporary solution while the developers find a permanent one...
 
Can you install gdb and the PBS debug symbols via apt install gdb proxmox-backup-server-dbgsym and run gdb --batch --ex 't a a bt' -p $(pidof proxmox-backup-proxy) > proxy.backtrace? Then post the backtrace as an attachment, thanks.
Hi,
I hope the attached output is useful for you...
 

Attachments