Dear Proxmox Forum users,
we have a relatively new problem occurring in our Proxmox Backup Server setup.
For some time now, more and more weekly verify jobs have been reporting failed verification for some backups. There are currently 39 failed backups, with verification still running, and some backups have already been deleted.
Information about the setup
We have a central Proxmox Backup Server (PBS), a dedicated root server in a datacenter. There are also two more on-site PBS at customer locations and one on-site PBS in our own office for local VM and proxmox-backup-client backups.
2 of the 3 on-site PBS (mostly Proxmox VE VM backups; our local one also has some proxmox-backup-client backups) push their backups to the central PBS.
Most of our VMs and dedicated root servers run Linux and use proxmox-backup-client to make backups directly to the central PBS.
This all amounts to somewhere between 100 and 200 servers using the central PBS as a backup server, keeping 7 daily, 4 weekly, 12 monthly and 1 yearly backups.
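For reference, that retention corresponds roughly to prune options like the following; the repository and group names are only placeholders, not our real configuration:

# placeholder repository and group; the keep counts match our retention policy
proxmox-backup-client prune host/example-server --repository central-pbs:datastore1 \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --keep-yearly 1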
The central PBS consists of a ZFS pool with 4 HDDs, 16 TB each.
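For completeness, checking whether ZFS itself reports problems on those disks would look roughly like this (the pool name is a placeholder):

# pool name is a placeholder; shows per-disk read/write/checksum error counters
# and the result of the last scrub
zpool status -v tank
# a scrub lets the raidz1 redundancy detect and repair silent on-disk corruption
zpool scrub tank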
Information about the problem
Some time ago (not sure, maybe a month, maybe two or three), a verify failed for a backup or two among the many there are. Now there are 39 failed backups, as written above. Every server keeps 24 backups; one of them (a 450 GB backup) now has 15 failed out of 24. Some servers even have all their recent backups failing verify, so I couldn't restore them if I needed to. I have already deleted some verify-failed backups of a few of our machines with 1 TB+ backups and moved 4 of them to a newly created datastore.
In the PBS web UI, the SMART column for all 4 disks shows "passed".
I do see quite a few messages like "SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from VALUE_X to VALUE_Y", but according to this Proxmox forum thread these can be ignored on Seagate disks (which all of our central PBS disks are).
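Those look like smartd log lines; reading the raw attribute table directly would be something like this (the device path is a placeholder):

# device path is a placeholder; on Seagate drives the normalized VALUE/WORST
# columns are more meaningful than the huge raw Raw_Read_Error_Rate number
smartctl -A /dev/sda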
As these are all HDDs in a raidz1 setup and there are tons of backups to verify, prune and garbage-collect, I assume that garbage collection, prune and verify somehow got in the way of each other. Garbage collection currently takes ~24 h for our biggest datastore, and verifying the same datastore takes ~7-8 days.
Also: one of our customers' VM backups (which had already been pushed to our central PBS) failed verify on the central PBS, but that same backup is still verified on the customer's on-site PBS.
My Hypothesis
Is it possible that garbage collection, prune and verify taking too long might be the problem?
This weekend I will try to split up the biggest datastore into smaller ones, hoping this will fix the problem.
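To see how the jobs are currently scheduled (and whether they overlap), something along these lines should work, assuming the proxmox-backup-manager subcommands match the current documentation; the datastore name is a placeholder:

# list verify and prune jobs together with their schedules
proxmox-backup-manager verify-job list
proxmox-backup-manager prune-job list
# show when garbage collection last ran on a given datastore
proxmox-backup-manager garbage-collection status datastore1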
Also: failed backups are non-recoverable, except if you have the same backup on another PBS, correct? Meaning, the proxmox-backup-client backups from our VPS and dedicated root servers (which only reside on our central PBS) cannot be "healed" or restored in some way, right?
Hopefully, someone here can point me in the right direction to fix this.
Best regards
pixelpoint