Proxmox Backup Server - a lot of "verify failed"

Dear Proxmox Forum users,

we have a relatively new problem occurring in our Proxmox Backup Server setup.
For some time now, more and more of the weekly verify runs have shown failed verification for some backups. There are currently 39 failed backups, with verification still ongoing and some backups already deleted.

Information about the setup

We have a central Proxmox Backup Server (PBS), a dedicated root server in a datacenter. There are also 2 on-site PBS instances at customer locations and one on-site PBS in our own office for local VM and proxmox-backup-client backups.

2 of the 3 on-site PBS instances (mostly Proxmox VE VM backups; our local one also has some proxmox-backup-client backups) push their backups to the central PBS.

Most of our VMs and dedicated root servers run Linux and use proxmox-backup-client to back up directly to our central PBS.
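As an illustration, the client-side backups boil down to commands along these lines (the repository string, source paths and archive names below are placeholders, not our real values):

Bash:
# target repository for proxmox-backup-client (placeholder values)
export PBS_REPOSITORY='backup@pbs@central-pbs.example.com:file_backup'
# back up a few directories as separate pxar archives
proxmox-backup-client backup etc.pxar:/etc www.pxar:/var/www home.pxar:/home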

This all amounts to somewhere between 100 and 200 servers using the central PBS as a backup server, keeping 7 daily, 4 weekly, 12 monthly and 1 yearly backup per server.
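For reference, that retention corresponds to keep options roughly like these (shown here as a client-side dry run against a placeholder group; the actual prune jobs are configured on the PBS side):

Bash:
# assumes PBS_REPOSITORY is set as in the example above
proxmox-backup-client prune host/example-host \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --keep-yearly 1 \
    --dry-run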

The central PBS consists of a ZFS pool with 4 HDDs, 16 TB each.

Information about the problem

Some time ago (not sure, maybe a month, maybe two or three), a verify failed for a backup or two among the many there are. Now there are 39 failed backups, as written above. Every server keeps 24 backups; one of them (a 450 GB backup) now has 15 failed out of 24. Some servers even have all of their recent backups failing verification, so I couldn't restore them if I needed to. I already deleted some verify-failed backups of a few of our machines with 1 TB+ backups and moved 4 of them to a newly created datastore.

In the PBS WebUI the SMART column of all 4 disks shows "passed".

I do see quite a few SMART messages like Prefailure Attribute: 1 Raw_Read_Error_Rate changed from VALUE_X to VALUE_Y, but according to this Proxmox forum thread, these can be ignored on Seagate disks (which all of our central PBS disks are).
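The raw SMART data can also be read directly from the disks, e.g. (the device name is just an example):

Bash:
smartctl -a /dev/sda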

As these are all HDDs in a raidz1 setup and there are tons of backups to verify, prune and garbage-collect, I assume that garbage collection, prune and verify somehow got in each other's way. Garbage collection currently takes ~24 h for our biggest datastore, and verifying the same datastore takes ~7-8 days.

Also: one of our customers' VM backups (which has already been pushed to our central PBS) failed verification on our central PBS, but that same backup still verifies fine on the customer's on-site PBS.

My Hypothesis

Is it possible that garbage collection, prune and verify taking too long is the problem?
This weekend I will try to split the biggest datastore into smaller ones, hoping this will fix the problem.
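Creating the additional datastores itself should be simple; roughly something like this (pool, dataset and datastore names below are placeholders):

Bash:
# new dataset on the existing backup pool, then register it as a datastore
zfs create tank/file_backup_vm
proxmox-backup-manager datastore create file_backup_vm /tank/file_backup_vm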

Also: failed backups are non-recoverable, except if you have that same backup on another PBS, correct? Meaning all of the proxmox-backup-client backups from our VPS and dedicated root servers (which only reside on our central PBS) cannot be "healed" or "restored" in some way, right?

Hopefully someone here can point me in the right direction to fix this.

Best regards
pixelpoint
 
Hi,
please share the output of proxmox-backup-manager version --verbose of your main PBS instance.

we have a relatively new problem occurring in our Proxmox Backup Server setup.
For some time now, more and more of the weekly verify runs have shown failed verification for some backups. There are currently 39 failed backups, with verification still ongoing and some backups already deleted.

Did you verify any of these snapshots before, or was this the first time they were verified? I'm asking since push sync jobs will reuse known chunks from the previous backup snapshot of the same group. If the previous backup snapshot was corrupt, this corruption might get propagated along the chain of subsequent snapshots. If the previous backup snapshot failed verification, however, then all chunks will be re-uploaded for the next snapshot, breaking the chain.

Information about the setup

We have a central Proxmox Backup Server (PBS), a dedicated root server in a datacenter. There are also 2 on-site PBS instances at customer locations and one on-site PBS in our own office for local VM and proxmox-backup-client backups.

2 of the 3 on-site PBS instances (mostly Proxmox VE VM backups; our local one also has some proxmox-backup-client backups) push their backups to the central PBS.

Most of our VMs and dedicated root servers run Linux and use proxmox-backup-client to back up directly to our central PBS.

This all amounts to somewhere between 100 and 200 servers using the central PBS as a backup server, keeping 7 daily, 4 weekly, 12 monthly and 1 yearly backup per server.

The central PBS consists of a ZFS pool with 4 HDDs, 16 TB each.

Information about the problem

Some time ago (not sure, maybe a month, maybe two or three), a verify failed for a backup or two among the many there are. Now there are 39 failed backups, as written above. Every server keeps 24 backups; one of them (a 450 GB backup) now has 15 failed out of 24. Some servers even have all of their recent backups failing verification, so I couldn't restore them if I needed to. I already deleted some verify-failed backups of a few of our machines with 1 TB+ backups and moved 4 of them to a newly created datastore.

I would advise the following: for each backup group, select the currently latest snapshot, set it to protected so it cannot be pruned, and run a verify of that snapshot. That ensures that further push sync jobs to that group will re-upload all chunks instead of reusing known ones. This should at least help you break the corruption chain, and it can even heal some older snapshots if a corrupt chunk gets re-inserted into the datastore. A full re-verification would of course be best, but that might take too much time in your case.
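On the CLI this could look roughly like the following for a single group (repository, group name and timestamp are placeholders; the single-snapshot verify can also simply be started from the datastore content view in the web UI):

Bash:
# run on the PBS host itself, against the affected datastore
export PBS_REPOSITORY='root@pam@localhost:file_backup'
# list the snapshots of a group to find the latest one
proxmox-backup-client snapshot list host/example-host
# mark that latest snapshot as protected so prune cannot remove it
proxmox-backup-client snapshot protected update host/example-host/2025-04-16T21:00:06Z true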

Also, if you have the possibility to pull from the source datastores, you might want to set up a pull sync job, which also allows selecting the re-sync corrupt flag (not possible for push sync jobs). I would, however, advise verifying the snapshots on the source side as well.
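A rough sketch of such a pull sync job (remote and datastore names are placeholders; please check proxmox-backup-manager sync-job create --help for the exact option names available on your version):

Bash:
proxmox-backup-manager sync-job create resync-customer \
    --store file_backup \
    --remote customer-pbs \
    --remote-store customer-datastore \
    --resync-corrupt true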
 
Did you verify any of these snapshots before, or was this the first time they were verified?
I cannot guarantee this for the newest snapshots and backups, but most, if not all, of these had been verified before.
The central PBS runs a weekly verify job, with re-verification of backups set to 30 days.
This was fine for a long time, as the verify always completed with time to spare for prune and garbage collection.
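For reference, that verify job corresponds to settings roughly like these (it was actually set up via the web UI, so the CLI form and the schedule below are only my best guess, not copied from the config):

Bash:
proxmox-backup-manager verify-job create weekly-verify \
    --store file_backup \
    --schedule 'sat 20:00' \
    --ignore-verified true \
    --outdated-after 30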

The big datastore with the most problems (called file_backup) wasn't always this big, but the company grew and so did the volume of backups.
It is called file_backup because it stores most of the proxmox-backup-client backups.
So this holds most of our backups, and none of them can be re-synced from anywhere, as our Linux cloud servers back up directly to the central PBS file_backup datastore.

Sadly, I failed to mention that synced snapshots (mostly from Proxmox VE VMs) are only a very minor part of the problem. Out of all the synced snapshots, only 1 EVER failed to verify, and it failed recently (this month). This VM snapshot was created at the source on 2025-03-15 and had already been verified at the source PBS and at the central PBS.
That snapshot still shows "verify OK" at the source PBS.

I would advise the following: for each backup group, select the currently latest snapshot, set it to protected so it cannot be pruned, and run a verify of that snapshot.
Should I still do this for the proxmox-backup-client backups, even though they cannot be re-synced, as these backups use the backup client to store their data directly on the central PBS instead of being synced from somewhere?

Also, if you have the possibility to pull from the source datastores, you might want to set up a pull sync job, which also allows selecting the re-sync corrupt flag (not possible for push sync jobs). I would, however, advise verifying the snapshots on the source side as well.
I will do that with the failed snapshot from our customer. Luckily, the source datastore still shows this snapshot as verified. Or should I force a re-verify in this case?

Thank you very much for your help.

Best regards
pixelpoint
 
I cannot guarantee this for the newest snapshots and backups, but most, if not all, of these had been verified before.
The central PBS runs a weekly verify job, with re-verification of backups set to 30 days.
This was fine for a long time, as the verify always completed with time to spare for prune and garbage collection.

The big datastore with the most problems (called file_backup) wasn't always this big, but the company grew and so did the volume of backups.
It is called file_backup because it stores most of the proxmox-backup-client backups.
So this holds most of our backups, and none of them can be re-synced from anywhere, as our Linux cloud servers back up directly to the central PBS file_backup datastore.

Sadly, I failed to mention that synced snapshots (mostly from Proxmox VE VMs) are only a very minor part of the problem. Out of all the synced snapshots, only 1 EVER failed to verify, and it failed recently (this month). This VM snapshot was created at the source on 2025-03-15 and had already been verified at the source PBS and at the central PBS.
That snapshot still shows "verify OK" at the source PBS.


Should I still do this for the proxmox-backup-client backups, even though they cannot be re-synced, as these backups use the backup client to store their data directly on the central PBS instead of being synced from somewhere?

Yes, I would suggest doing this for all groups where the latest snapshot has not been verified recently.

I will do that with the failed snapshot from our customer. Luckily, the source datastore still shows this snapshot as verified. Or should I force a re-verify in this case?
That depends: if the verify was fairly recent, I would assume the snapshots are still fine, but it also depends on whether you have ever seen failed verifications on that end.

Thank you very much for your help.

Best regards
pixelpoint

In any case, I do urge you to upgrade to the latest version, 3.4.1-1 as of now, which includes a fix for a recently discovered edge case as well as overall performance improvements for PBS introduced with the 3.4 release. Your pruning does not seem to be that aggressive though, so it might not be the same case. See the changelog.
 
I am terribly sorry, I forgot to provide the proxmox-backup-manager version output.
Bash:
root@backup:~# proxmox-backup-manager version --verbose
proxmox-backup                     3.4.0         running kernel: 6.8.12-3-pve
proxmox-backup-server              3.4.0-1       running version: 3.3.4
proxmox-kernel-helper              8.1.1
proxmox-kernel-6.8                 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed 6.8.12-8
proxmox-kernel-6.8.12-6-pve-signed 6.8.12-6
proxmox-kernel-6.8.12-3-pve-signed 6.8.12-3
proxmox-kernel-6.5.13-6-pve-signed 6.5.13-6
pve-kernel-5.4                     6.4-4
pve-kernel-5.13.19-2-pve           5.13.19-4
pve-kernel-5.4.124-1-pve           5.4.124-1
pve-kernel-5.4.65-1-pve            5.4.65-1
ifupdown2                          3.2.0-1+pmx11
libjs-extjs                        7.0.0-5
proxmox-backup-docs                3.4.0-1
proxmox-backup-client              3.4.0-1
proxmox-mail-forward               0.3.2
proxmox-mini-journalreader         1.4.0
proxmox-offline-mirror-helper      0.6.7
proxmox-widget-toolkit             4.3.10
pve-xtermjs                        5.5.0-2
smartmontools                      7.3-pve1
zfsutils-linux                     2.2.7-pve2

So I wanted to upgrade and got the following error
Code:
Copying and configuring kernels on /dev/disk/by-uuid/3899-6BCD
        Copying kernel 5.4.124-1-pve
cp: error writing '/var/tmp/espmounts/3899-6BCD/vmlinuz-5.4.124-1-pve': No space left on device
run-parts: /etc/initramfs/post-update.d//proxmox-boot-sync exited with return code 1
run-parts: /etc/kernel/postinst.d/initramfs-tools exited with return code 1
Failed to process /etc/kernel/postinst.d at /var/lib/dpkg/info/proxmox-kernel-6.8.12-9-pve-signed.postinst line 20.
dpkg: error processing package proxmox-kernel-6.8.12-9-pve-signed (--configure):
 installed proxmox-kernel-6.8.12-9-pve-signed package post-installation script subprocess returned error exit status 2
dpkg: dependency problems prevent configuration of proxmox-kernel-6.8:
 proxmox-kernel-6.8 depends on proxmox-kernel-6.8.12-9-pve-signed | proxmox-kernel-6.8.12-9-pve; however:
  Package proxmox-kernel-6.8.12-9-pve-signed is not configured yet.
  Package proxmox-kernel-6.8.12-9-pve is not installed.
  Package proxmox-kernel-6.8.12-9-pve-signed which provides proxmox-kernel-6.8.12-9-pve is not configured yet.

dpkg: error processing package proxmox-kernel-6.8 (--configure):
 dependency problems - leaving unconfigured

It seems some of the central PBS ESP /boot partitions (on the disks of the ZFS rpool) are full.
One of the 4 partitions is 100% full, 2 are 97% full.
Is rebooting currently safe, or should this be dealt with first?

After the upgrade:
Bash:
root@backup:~# proxmox-backup-manager version --verbose
proxmox-backup                     3.4.0         running kernel: 6.8.12-3-pve
proxmox-backup-server              3.4.0-1       running version: 3.4.0
proxmox-kernel-helper              8.1.1
proxmox-kernel-6.8                 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed 6.8.12-8
proxmox-kernel-6.8.12-6-pve-signed 6.8.12-6
proxmox-kernel-6.8.12-3-pve-signed 6.8.12-3
proxmox-kernel-6.5.13-6-pve-signed 6.5.13-6
pve-kernel-5.4                     6.4-4
pve-kernel-5.13.19-2-pve           5.13.19-4
pve-kernel-5.4.124-1-pve           5.4.124-1
pve-kernel-5.4.65-1-pve            5.4.65-1
ifupdown2                          3.2.0-1+pmx11
libjs-extjs                        7.0.0-5
proxmox-backup-docs                3.4.0-1
proxmox-backup-client              3.4.0-1
proxmox-mail-forward               0.3.2
proxmox-mini-journalreader         1.4.0
proxmox-offline-mirror-helper      0.6.7
proxmox-widget-toolkit             4.3.10
pve-xtermjs                        5.5.0-2
smartmontools                      7.3-pve1
zfsutils-linux                     2.2.7-pve2

I do not see a 3.4.1-1 upgrade though.

Bash:
root@backup:~# apt update
Hit:1 http://security.debian.org bookworm-security InRelease
Hit:2 https://download.docker.com/linux/debian bullseye InRelease
Hit:3 http://ftp.at.debian.org/debian bookworm InRelease
Hit:4 http://download.proxmox.com/debian/pbs bookworm InRelease
Hit:5 http://ftp.at.debian.org/debian bookworm-updates InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.
root@backup:~# apt list --upgradable
Listing... Done

Thank you for your help.

Best regards
pixelpoint
 
I am terribly sorry, I forgot to provide the proxmox-backup-manager version output.
Bash:
root@backup:~# proxmox-backup-manager version --verbose
proxmox-backup                     3.4.0         running kernel: 6.8.12-3-pve
proxmox-backup-server              3.4.0-1       running version: 3.3.4
proxmox-kernel-helper              8.1.1
proxmox-kernel-6.8                 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed 6.8.12-8
proxmox-kernel-6.8.12-6-pve-signed 6.8.12-6
proxmox-kernel-6.8.12-3-pve-signed 6.8.12-3
proxmox-kernel-6.5.13-6-pve-signed 6.5.13-6
pve-kernel-5.4                     6.4-4
pve-kernel-5.13.19-2-pve           5.13.19-4
pve-kernel-5.4.124-1-pve           5.4.124-1
pve-kernel-5.4.65-1-pve            5.4.65-1

It seems you still have a lot of older kernels around; you can uninstall these to free some space on the EFI partitions.

ifupdown2                          3.2.0-1+pmx11
libjs-extjs                        7.0.0-5
proxmox-backup-docs                3.4.0-1
proxmox-backup-client              3.4.0-1
proxmox-mail-forward               0.3.2
proxmox-mini-journalreader         1.4.0
proxmox-offline-mirror-helper      0.6.7
proxmox-widget-toolkit             4.3.10
pve-xtermjs                        5.5.0-2
smartmontools                      7.3-pve1
zfsutils-linux                     2.2.7-pve2

So I wanted to upgrade and got the following error
Code:
Copying and configuring kernels on /dev/disk/by-uuid/3899-6BCD
        Copying kernel 5.4.124-1-pve
cp: error writing '/var/tmp/espmounts/3899-6BCD/vmlinuz-5.4.124-1-pve': No space left on device

Yes, it seems you ran out of space on the EFI partition. You can mount the filesystem for that disk and remove some older kernel images to free up space, then re-run apt full-upgrade. As stated above, you might want to get rid of the old kernels, since you are running the 6.8 kernel.
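Roughly something like this (the kernel package names are taken from your version output above; double-check what is installed before purging):

Bash:
# remove the old, unused kernel series to free space on the ESPs
apt purge pve-kernel-5.4 pve-kernel-5.4.65-1-pve pve-kernel-5.4.124-1-pve pve-kernel-5.13.19-2-pve
# re-sync kernels/bootloader to the ESPs and finish the interrupted upgrade
proxmox-boot-tool refresh
apt full-upgrade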

run-parts: /etc/initramfs/post-update.d//proxmox-boot-sync exited with return code 1
run-parts: /etc/kernel/postinst.d/initramfs-tools exited with return code 1
Failed to process /etc/kernel/postinst.d at /var/lib/dpkg/info/proxmox-kernel-6.8.12-9-pve-signed.postinst line 20.
dpkg: error processing package proxmox-kernel-6.8.12-9-pve-signed (--configure):
 installed proxmox-kernel-6.8.12-9-pve-signed package post-installation script subprocess returned error exit status 2
dpkg: dependency problems prevent configuration of proxmox-kernel-6.8:
 proxmox-kernel-6.8 depends on proxmox-kernel-6.8.12-9-pve-signed | proxmox-kernel-6.8.12-9-pve; however:
  Package proxmox-kernel-6.8.12-9-pve-signed is not configured yet.
  Package proxmox-kernel-6.8.12-9-pve is not installed.
  Package proxmox-kernel-6.8.12-9-pve-signed which provides proxmox-kernel-6.8.12-9-pve is not configured yet.

dpkg: error processing package proxmox-kernel-6.8 (--configure):
 dependency problems - leaving unconfigured

It seems some of the central PBS ESP /boot partitions (on the disks of the ZFS rpool) are full.
One of the 4 partitions is 100% full, 2 are 97% full.
Is rebooting currently safe, or should this be dealt with first?

I would not reboot until this is fixed. You should still be able to boot into the previous kernel, but fixing it right away is definitely better.

After the upgrade:
Bash:
root@backup:~# proxmox-backup-manager version --verbose
proxmox-backup                     3.4.0         running kernel: 6.8.12-3-pve
proxmox-backup-server              3.4.0-1       running version: 3.4.0
proxmox-kernel-helper              8.1.1
proxmox-kernel-6.8                 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed 6.8.12-8
proxmox-kernel-6.8.12-6-pve-signed 6.8.12-6
proxmox-kernel-6.8.12-3-pve-signed 6.8.12-3
proxmox-kernel-6.5.13-6-pve-signed 6.5.13-6
pve-kernel-5.4                     6.4-4
pve-kernel-5.13.19-2-pve           5.13.19-4
pve-kernel-5.4.124-1-pve           5.4.124-1
pve-kernel-5.4.65-1-pve            5.4.65-1
ifupdown2                          3.2.0-1+pmx11
libjs-extjs                        7.0.0-5
proxmox-backup-docs                3.4.0-1
proxmox-backup-client              3.4.0-1
proxmox-mail-forward               0.3.2
proxmox-mini-journalreader         1.4.0
proxmox-offline-mirror-helper      0.6.7
proxmox-widget-toolkit             4.3.10
pve-xtermjs                        5.5.0-2
smartmontools                      7.3-pve1
zfsutils-linux                     2.2.7-pve2

I do not see a 3.4.1-1 upgrade though.

Yeah, sorry, my bad: I failed to mention that this version is currently only available in the pbstest repository. It should be available in the pbs-no-subscription repository within the next week.

Bash:
root@backup:~# apt update
Hit:1 http://security.debian.org bookworm-security InRelease
Hit:2 https://download.docker.com/linux/debian bullseye InRelease
Hit:3 http://ftp.at.debian.org/debian bookworm InRelease
Hit:4 http://download.proxmox.com/debian/pbs bookworm InRelease
Hit:5 http://ftp.at.debian.org/debian bookworm-updates InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.
root@backup:~# apt list --upgradable
Listing... Done

Thank you for your help.

Best regards
pixelpoint
 
I would not reboot until this is fixed. You should still be able to boot into the previous kernel, but fixing it right away is definitely better.
I have deleted a few kernels with apt purge and manually removed some leftover files from the /boot partitions; now they all have 300-400 MB out of 500 MB used.

I'm a little confused by the output of proxmox-boot-tool status
Bash:
root@backup:~# proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
3099-D8E1 is configured with: grub (versions: 6.8.12-3-pve, 6.8.12-8-pve, 6.8.12-9-pve)
3899-6BCD is configured with: uefi (versions: 6.8.12-2-pve, 6.8.12-3-pve, grubx64.efi), grub (versions: 6.8.12-3-pve, 6.8.12-8-pve, 6.8.12-9-pve)
WARN: /dev/disk/by-uuid/3899-E18F does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping
389A-5488 is configured with: uefi (versions: 6.8.12-2-pve, 6.8.12-3-pve, grubx64.efi), grub (versions: 6.8.12-3-pve, 6.8.12-8-pve, 6.8.12-9-pve)
389A-C7CC is configured with: uefi (versions: 6.8.12-2-pve, 6.8.12-3-pve, grubx64.efi), grub (versions: 6.8.12-3-pve, 6.8.12-8-pve, 6.8.12-9-pve)
It seems most of the /boot partitions use UEFI, but one is still configured to use GRUB.
Also, the output reports one /boot partition as missing, even though 4 /boot partitions are present.
Is there any way I can fix these things too?

Update for the verify:
I have now marked all the latest backups as protected, and a verify is running for all of them.
This will take some time though.
Some backups have already been verified, but there are still 54 verify tasks currently running.
 
Did you maybe replace one of the disks? There is also a
WARN: /dev/disk/by-uuid/3899-E18F does not exist
which might indicate that one of the disks was swapped. What is the output of proxmox-boot-tool clean --dry-run and cat /etc/kernel/proxmox-boot-uuids?
 
Did you maybe replace one of the disks?
Yes, one of the disks was going bad and had to be swapped, and I had never worked with ZFS before.
This is the thread for the disk replacement, from October 2024.

proxmox-boot-tool clean --dry-run
Bash:
root@backup:~# proxmox-boot-tool clean --dry-run
Checking whether ESP '3099-D8E1' exists.. Found!
Checking whether ESP '3899-6BCD' exists.. Found!
Checking whether ESP '3899-E18F' exists.. Not found!
Checking whether ESP '389A-5488' exists.. Found!
Checking whether ESP '389A-C7CC' exists.. Found!
Sorting and removing duplicate ESPs..

cat /etc/kernel/proxmox-boot-uuids
Bash:
root@backup:~# cat /etc/kernel/proxmox-boot-uuids
3099-D8E1
3899-6BCD
3899-E18F
389A-5488
389A-C7CC
 
Okay, then first clean up the no-longer-available UUID from proxmox-boot-uuids by running proxmox-boot-tool clean. After that, you can re-initialize the boot partitions via proxmox-boot-tool reinit.
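In short:

Bash:
# drop the stale UUID (3899-E18F) from /etc/kernel/proxmox-boot-uuids
proxmox-boot-tool clean
# re-copy kernels and bootloader to the remaining registered ESPs
proxmox-boot-tool reinit
# check the result
proxmox-boot-tool status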

P.S.: proxmox-backup-server version 3.4.1-1 is now also available in the pbs-no-subscription repo.
 
Update for things I did over the weekend.

Proxmox EFI Partitions
After cleaning out old kernels and running proxmox-boot-tool clean + proxmox-boot-tool reinit, the /boot partitions seem to be OK.

Bash:
# sudo proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
3099-D8E1 is configured with: grub (versions: 6.8.12-3-pve, 6.8.12-8-pve, 6.8.12-9-pve)
3899-6BCD is configured with: uefi (versions: 6.8.12-2-pve, 6.8.12-3-pve, grubx64.efi), grub (versions: 6.8.12-3-pve, 6.8.12-8-pve, 6.8.12-9-pve)
389A-5488 is configured with: uefi (versions: 6.8.12-2-pve, 6.8.12-3-pve, grubx64.efi), grub (versions: 6.8.12-3-pve, 6.8.12-8-pve, 6.8.12-9-pve)
389A-C7CC is configured with: uefi (versions: 6.8.12-2-pve, 6.8.12-3-pve, grubx64.efi), grub (versions: 6.8.12-3-pve, 6.8.12-8-pve, 6.8.12-9-pve)

The only question I have here is: Is it okay to have UEFI + grub on 3 of the 4 disks but only grub on one disk?

Proxmox Backup Server Update
Bash:
# sudo proxmox-backup-manager version --verbose
proxmox-backup                     3.4.0         running kernel: 6.8.12-3-pve
proxmox-backup-server              3.4.1-1       running version: 3.4.1
proxmox-kernel-helper              8.1.1
proxmox-kernel-6.8                 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed 6.8.12-8
proxmox-kernel-6.8.12-3-pve-signed 6.8.12-3
ifupdown2                          3.2.0-1+pmx11
libjs-extjs                        7.0.0-5
proxmox-backup-docs                3.4.1-1
proxmox-backup-client              3.4.1-1
proxmox-mail-forward               0.3.2
proxmox-mini-journalreader         1.4.0
proxmox-offline-mirror-helper      0.6.7
proxmox-widget-toolkit             4.3.10
pve-xtermjs                        5.5.0-2
smartmontools                      7.3-pve1
zfsutils-linux                     2.2.7-pve2

After updating and rebooting, at some point the WebUI was not accessible anymore. It might have to do with me creating a new datastore and starting first-time backups for a lot of servers with proxmox-backup-client at the same time. proxmox-backup-manager task list and similar commands didn't work anymore either. I needed to restart proxmox-backup-proxy. This resulted in unfinished backups and quite a few backups with "UNKNOWN" job status (these have since been deleted).
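For reference, getting the API to respond again boiled down to the following (standard systemd unit name assumed):

Bash:
systemctl restart proxmox-backup-proxy
# check that the task API answers again
proxmox-backup-manager task list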

Verify

Over the weekend, I split the big datastore into 3.
It's still a lot of data, but I hope this solves the problem of verification taking too long.

After updating to the newest Proxmox Backup Server / Manager version, I had verify jobs just stopping mid-task without a clear error. This is one of the 2 verify tasks that stopped twice this weekend:
Code:
[...]
2025-04-22T00:13:51+02:00: SKIPPED: verify file_backup:host/dc-nlw/2025-04-16T21:00:06Z (recently verified)
2025-04-22T00:13:51+02:00: percentage done: 25.66% (24/94 groups, 3/24 snapshots in group #25)
2025-04-22T00:13:51+02:00: verify file_backup:host/dc-nlw/2025-04-15T21:00:08Z
2025-04-22T00:13:51+02:00:   check etc.pxar.didx
2025-04-22T00:13:51+02:00:   verified 0.98/6.63 MiB in 0.24 seconds, speed 4.10/27.73 MiB/s (0 errors)
2025-04-22T00:13:51+02:00:   check www.pxar.didx
2025-04-22T00:13:51+02:00:   verified 0.00/0.00 MiB in 0.00 seconds, speed 0.00/0.00 MiB/s (0 errors)
2025-04-22T00:13:51+02:00:   check home.pxar.didx
2025-04-22T00:19:27+02:00:   verified 3037.79/9305.80 MiB in 336.25 seconds, speed 9.03/27.68 MiB/s (0 errors)
2025-04-22T00:19:28+02:00:   check catalog.pcat1.didx
2025-04-22T00:19:29+02:00:   verified 0.37/0.59 MiB in 0.21 seconds, speed 1.76/2.78 MiB/s (0 errors)
2025-04-22T00:19:32+02:00: percentage done: 25.71% (24/94 groups, 4/24 snapshots in group #25)
2025-04-22T00:19:33+02:00: verify file_backup:host/dc-nlw/2025-04-14T21:00:04Z
2025-04-22T00:19:33+02:00:   check etc.pxar.didx
2025-04-22T00:19:34+02:00:   verified 0.98/6.63 MiB in 0.30 seconds, speed 3.30/22.29 MiB/s (0 errors)
2025-04-22T00:19:34+02:00:   check www.pxar.didx
2025-04-22T00:19:34+02:00:   verified 0.00/0.00 MiB in 0.00 seconds, speed 0.00/0.00 MiB/s (0 errors)
2025-04-22T00:19:35+02:00:   check home.pxar.didx
2025-04-22T00:20:37+02:00: TASK ERROR: verification failed - job aborted
One of them had been running for ~2 days, the other for close to 1 day, before just stopping.

I restarted the verify jobs and will update when I have more information.
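Next time one of them aborts, I'll try to pull the task log of the affected job right away (the UPID below is a placeholder):

Bash:
proxmox-backup-manager task list
proxmox-backup-manager task log 'UPID:backup:...'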

Info about split datastores
I split the big datastore file_backup into 3 datastores, but I did not move the existing backups to the new datastores.
The new datastores now collect NEW backups for a lot of different machines.
I will remove old backups for those servers from file_backup as soon as there are a few verified new backups in the new datastores.
 
Update

The server somehow seems to have crashed.
Pings to the server's DNS name (and IP address) still worked, but everything else was down: no SSH, no HTTP(S).
I didn't see anything too suspicious in journalctl -p3 --since "BEFORE_CRASH" --until "AFTER_CRASH", except for a ton of unable to find blob messages from proxmox-backup-proxy.

Restarted the server.
Restarted the verify on the new datastore.
Started a prune on the big datastore, then garbage collection; I will start another verify after that.
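The rough CLI equivalent of those steps, in case it matters (I mostly trigger them from the web UI, so treat the exact commands as assumptions):

Bash:
# prune was started per datastore from the web UI
proxmox-backup-manager garbage-collection start file_backup
proxmox-backup-manager verify file_backup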
 
The server seems to have crashed again, at around 01:00 today.
SSH, ping and the WebUI did not work.
I had to hard-reset the server, again resulting in many backups not finishing, leaving half-done backups (status UNKNOWN) in the task list and datastores.
I deleted all the half-finished backups and also deleted some other backups from the big file_backup datastore that already have successfully verified counterparts in the new datastores.

I found the following error repeatedly in dmesg:
Bash:
[Thu Apr 24 09:12:40 2025] ZFS: Loaded module v2.2.6-pve1, ZFS pool version 5000, ZFS filesystem version 5
[Thu Apr 24 09:16:43 2025] INFO: task zpool:303 blocked for more than 122 seconds.
[Thu Apr 24 09:16:43 2025]       Tainted: P           O       6.8.12-3-pve #1
[Thu Apr 24 09:16:43 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Apr 24 09:16:43 2025] task:zpool           state:D stack:0     pid:303   tgid:303   ppid:1      flags:0x00004002
[Thu Apr 24 09:16:43 2025] Call Trace:
[Thu Apr 24 09:16:43 2025]  <TASK>
[Thu Apr 24 09:16:43 2025]  __schedule+0x401/0x15e0
[Thu Apr 24 09:16:43 2025]  schedule+0x33/0x110
[Thu Apr 24 09:16:43 2025]  taskq_wait+0xb8/0x100 [spl]
[Thu Apr 24 09:16:43 2025]  ? __pfx_autoremove_wake_function+0x10/0x10
[Thu Apr 24 09:16:43 2025]  dmu_objset_find_dp+0x17a/0x250 [zfs]
[Thu Apr 24 09:16:44 2025]  ? __pfx_zil_check_log_chain+0x10/0x10 [zfs]
[Thu Apr 24 09:16:44 2025]  spa_load+0x161d/0x1a30 [zfs]
[Thu Apr 24 09:16:44 2025]  spa_load_best+0x57/0x2c0 [zfs]
[Thu Apr 24 09:16:44 2025]  ? zpool_get_load_policy+0x19e/0x1b0 [zfs]
[Thu Apr 24 09:16:44 2025]  spa_import+0x234/0x6d0 [zfs]
[Thu Apr 24 09:16:44 2025]  zfs_ioc_pool_import+0x163/0x180 [zfs]
[Thu Apr 24 09:16:44 2025]  zfsdev_ioctl_common+0x89e/0x9f0 [zfs]
[Thu Apr 24 09:16:44 2025]  ? __check_object_size+0x9d/0x300
[Thu Apr 24 09:16:44 2025]  zfsdev_ioctl+0x57/0xf0 [zfs]
[Thu Apr 24 09:16:44 2025]  __x64_sys_ioctl+0xa0/0xf0
[Thu Apr 24 09:16:44 2025]  x64_sys_call+0xa68/0x24b0
[Thu Apr 24 09:16:44 2025]  do_syscall_64+0x81/0x170
[Thu Apr 24 09:16:44 2025]  ? syscall_exit_to_user_mode+0x89/0x260
[Thu Apr 24 09:16:44 2025]  ? do_syscall_64+0x8d/0x170
[Thu Apr 24 09:16:44 2025]  ? handle_mm_fault+0xad/0x380
[Thu Apr 24 09:16:44 2025]  ? do_user_addr_fault+0x337/0x660
[Thu Apr 24 09:16:44 2025]  ? irqentry_exit_to_user_mode+0x7e/0x260
[Thu Apr 24 09:16:44 2025]  ? irqentry_exit+0x43/0x50
[Thu Apr 24 09:16:44 2025]  ? exc_page_fault+0x94/0x1b0
[Thu Apr 24 09:16:44 2025]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[Thu Apr 24 09:16:44 2025] RIP: 0033:0x79285d435cdb
[Thu Apr 24 09:16:44 2025] RSP: 002b:00007ffc2baf0db0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Thu Apr 24 09:16:44 2025] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000079285d435cdb
[Thu Apr 24 09:16:44 2025] RDX: 00007ffc2baf0e70 RSI: 0000000000005a02 RDI: 0000000000000003
[Thu Apr 24 09:16:44 2025] RBP: 00007ffc2baf4d60 R08: 000079285d50b460 R09: 000079285d50b460
[Thu Apr 24 09:16:44 2025] R10: 0000000000000000 R11: 0000000000000246 R12: 00005a593f1d62c0
[Thu Apr 24 09:16:44 2025] R13: 00007ffc2baf0e70 R14: 0000792854002530 R15: 00005a593f1fc340
[Thu Apr 24 09:16:44 2025]  </TASK>

In 8 out of 9 cases the timestamp was between 09:00 and 09:35; another one was at 10:18.

Also, I found the same in journalctl -p3 -b -1
Bash:
Apr 23 22:38:41 backup kernel: INFO: task txg_sync:566 blocked for more than 122 seconds.
Apr 23 22:38:41 backup kernel:       Tainted: P           O       6.8.12-3-pve #1
Apr 23 22:38:41 backup kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 23 22:38:41 backup kernel: INFO: task systemd-journal:792 blocked for more than 122 seconds.
Apr 23 22:38:41 backup kernel:       Tainted: P           O       6.8.12-3-pve #1
Apr 23 22:38:41 backup kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 23 22:38:41 backup kernel: INFO: task journal-offline:106592 blocked for more than 122 seconds.
Apr 23 22:38:41 backup kernel:       Tainted: P           O       6.8.12-3-pve #1
Apr 23 22:38:41 backup kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 23 22:38:41 backup kernel: INFO: task filebeat:1751 blocked for more than 122 seconds.
Apr 23 22:38:41 backup kernel:       Tainted: P           O       6.8.12-3-pve #1
Apr 23 22:38:41 backup kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 23 22:38:41 backup kernel: INFO: task tokio-runtime-w:105665 blocked for more than 122 seconds.
Apr 23 22:38:41 backup kernel:       Tainted: P           O       6.8.12-3-pve #1
Apr 23 22:38:41 backup kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 23 22:38:41 backup kernel: INFO: task tokio-runtime-w:105771 blocked for more than 122 seconds.
Apr 23 22:38:41 backup kernel:       Tainted: P           O       6.8.12-3-pve #1
Apr 23 22:38:41 backup kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 23 22:38:41 backup kernel: INFO: task tokio-runtime-w:105859 blocked for more than 122 seconds.
Apr 23 22:38:41 backup kernel:       Tainted: P           O       6.8.12-3-pve #1
Apr 23 22:38:41 backup kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 23 22:38:41 backup kernel: INFO: task tokio-runtime-w:105867 blocked for more than 122 seconds.
Apr 23 22:38:41 backup kernel:       Tainted: P           O       6.8.12-3-pve #1
Apr 23 22:38:41 backup kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 23 22:38:41 backup kernel: INFO: task tokio-runtime-w:105903 blocked for more than 122 seconds.
Apr 23 22:38:41 backup kernel:       Tainted: P           O       6.8.12-3-pve #1
Apr 23 22:38:41 backup kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 23 22:38:41 backup kernel: INFO: task tokio-runtime-w:105906 blocked for more than 122 seconds.
Apr 23 22:38:41 backup kernel:       Tainted: P           O       6.8.12-3-pve #1
Apr 23 22:38:41 backup kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Something seems to be freezing / hanging?
Can somebody shed some light on why this happens?

Best regards
pixelpoint