PBS: backup doesn't exist but no backup errors on PVE server

ikidd

I'm having a problem that might have something to do with write speed at the PBS server, but I'm just guessing.

I have a SATA drive in the PBS server that I do one backup job to. I also swap out USB drives (PBS server offline during the swap) and back up to the USB drive an hour later as well.

I don't seem to have a problem with backups written to the PBS SATA datastore, but I might lose some or all of the backups to the USB drive. As far as the PVE server doing the backup is concerned, though, all went well.

Here is a screenshot of the latest attempt (I've reinstalled PBS on different hardware in an attempt to fix this, so I only have a couple of backups logged).

Screenshot_20241002_114400.png

As you can see, some of the CTs kept the backup, but a couple did not. Not a lot would change on either of these servers, but it still seems like it's logging chunks being written in the PBS tasklog.

I thought this might have something to do with the prune job, so I rescheduled it so that it cannot coincide with the backup (the default was hourly; it now runs a couple of hours before the daily backup). All datastores use a 7/4/12/1 (daily/weekly/monthly/yearly) retention policy.
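For reference, here is a minimal sketch of how a calendar-based 7/4/12/1 policy picks the snapshots to keep. It is a simplified illustration, not PBS's actual prune code: the function name `select_kept`, the bucket keys, and the sample timestamps are assumptions made for this example. The point is that with only a handful of daily snapshots, such a policy would be expected to keep all of them.

```python
from datetime import datetime


def select_kept(timestamps, keep_daily=7, keep_weekly=4,
                keep_monthly=12, keep_yearly=1):
    """Return the timestamps a calendar-bucket retention policy would keep.

    Simplified sketch: each rule keeps the newest snapshot of each period
    (day, ISO week, month, year) until its limit is reached. Periods already
    represented by a snapshot kept by an earlier rule are not counted again.
    """
    ordered = sorted(timestamps, reverse=True)  # newest first
    kept = set()
    rules = [
        (keep_daily, lambda t: t.strftime("%Y-%m-%d")),
        (keep_weekly, lambda t: "%d-W%02d" % t.isocalendar()[:2]),
        (keep_monthly, lambda t: t.strftime("%Y-%m")),
        (keep_yearly, lambda t: t.strftime("%Y")),
    ]
    for limit, bucket_of in rules:
        covered = {bucket_of(t) for t in kept}  # periods already represented
        seen = set()                            # periods filled by this rule
        for t in ordered:
            if t in kept:
                continue
            bucket = bucket_of(t)
            if bucket in covered or bucket in seen:
                continue  # a newer snapshot already represents this period
            if len(seen) >= limit:
                break
            seen.add(bucket)
            kept.add(t)
    return kept


if __name__ == "__main__":
    # Hypothetical sample: three consecutive daily backups of one guest.
    sample = [datetime(2024, 10, d, 7, 48) for d in (1, 2, 3)]
    for ts in sorted(select_kept(sample)):
        print("keep", ts.isoformat())
```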

PBS 3.2-7, completely updated of course. PVE 8.2.2 but I can't see that that would matter to this problem.

Logs from PVE backup job and the missing CT106 job on the PBS side.

Any thoughts on what I can try to do to debug this?
 


Hi,
so just for clarification: The screenshot above shows the backup snapshots on the removable USB drive based datastore?
I have a SATA drive in the PBS server that I do one backup job to. I also swap out USB drives (PBS server offline during the swap) and back up to the USB drive an hour later as well.
Well, are you sure that you have exactly the same backup job config? Maybe you have different exclude policies? I guess you showed the task log for the backup to the SATA drive?

Please share the backup job and storage config for both jobs from your PVE host, as well as the backup task logs for both jobs: the one to the SATA-drive-backed datastore and the one to the USB-drive-backed datastore.

In general, instead of backing up the guests twice to different datastores, you could set up a sync job to pull the datastore contents from one datastore to the other. See https://pbs.proxmox.com/docs/managing-remotes.html#sync-jobs
 
Hi,
so just for clarification: The screenshot above shows the backup snapshots on the removable USB drive based datastore?
Yes
Well, are you sure that you have exactly the same backup job config? Maybe you have different exclude policies? I guess you showed the task log for the backup to the SATA drive?
That was the backup log for the USB drive job (the command line in the log shows the pvebak_toshiba2 destination). It looks like it completed all the backups without errors, yet some of them are not in the content shown in PBS. Both logs seemed perfectly normal.

I thought I had attached the CT105 log from the PBS server for that backup, but I had not. I've attached it below. It seems to be quite happily writing chunks as it receives them from the PVE node, but that day's backup doesn't show in the Content of the datastore.

I've attached pics of the three backup configs. They are identical as far as I can see; they're all exclude jobs.
Please share the backup job and storage config for both jobs from your PVE host, as well as the backup task logs for both jobs: the one to the SATA-drive-backed datastore and the one to the USB-drive-backed datastore.

In general, instead of backing up the guests twice to different datastores, you could set up a sync job to pull the datastore contents from one datastore to the other. See https://pbs.proxmox.com/docs/managing-remotes.html#sync-jobs
I looked again today and all of last night's backups are on the USB drive. I'm still missing the 2 CT backups that I mentioned yesterday, but last night's seem to have stuck around. So now I have 3 backups for all guests except CT105 and CT106, which have 2.

1728003982021.png

I strongly suspect that the default Prune job, which ran on the hour during that backup, snipped the backups that were occurring right after the turn of the hour. Yesterday morning I changed that Prune job to run only once a day, well before the backup window, and there was no problem last night.

I'm going to keep monitoring it. If you want more logs to try to track down a potential bug in how that Prune might be removing files if it runs during a backup job, I'm happy to get you whatever you need.

I have tried the sync in the past. I wasn't happy with how it worked on removable drives at that time (it might have been how it failed if a target wasn't available and then wouldn't run the other sync job?), but it's been a couple of years since I tried that, I think. Maybe I'll try it again, but this swap method has worked pretty well for me for keeping an offsite backup. This is the first issue I've seen with it.
 


That was the backup log for the USB drive job (the command line in the log shows the pvebak_toshiba2 destination).
No way for me to know that this is the USB-drive-backed datastore ;)

I'm going to keep monitoring it. If you want more logs to try to track down a potential bug in how that Prune might be removing files if it runs during a backup job, I'm happy to get you whatever you need.
Please also post the prune task log; that should tell which backup snapshots of which group got removed.

it might have been how it failed if a target wasn't available and then wouldn't run the other sync job?
Well, in that case the backup job should fail instead of the sync job, so not really a gain there. But if it does not, can you make sure the mountpoint is empty before the disk is mounted? How do you proceed when disconnecting and reconnecting the datastore?
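To rule out backups being written into the bare mountpoint directory while the disk is absent, a check along these lines could be run before the backup window. This is only a sketch: the function name and the returned labels are made up for illustration, and with ZFS the dataset's actual mountpoint would have to be passed in.

```python
import os


def mountpoint_state(path):
    """Classify a datastore mountpoint directory.

    Returns one of:
      "mounted"    - a filesystem is mounted at `path`
      "missing"    - `path` does not exist or is not a directory
      "empty"      - nothing is mounted and the directory is empty
      "stale-data" - nothing is mounted, but files were written into the
                     underlying directory (they would be hidden once the
                     disk is mounted on top of it)
    """
    if os.path.ismount(path):
        return "mounted"
    if not os.path.isdir(path):
        return "missing"
    return "stale-data" if os.listdir(path) else "empty"
```

A result of "stale-data" would suggest that something wrote to the datastore path while the removable disk was not mounted.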

Note that removable datastores are not supported yet. There are patches introducing them on the mailing list, so this will be possible in the future: https://lore.proxmox.com/pbs-devel/20240904141155.350454-1-h.laimer@proxmox.com/

Edit: Also, your screenshots do not show the retention period as configured on the backup jobs on the PVE side, but since there is nothing in the logs I guess these are kept at the default (please verify, just to rule that out as well).
 
No way for me to know that this is the USB-drive-backed datastore ;)
Yah, sorry. I thought my initial screenshot had the name of the datastore on it!
Please also post the prune task log; that should tell which backup snapshots of which group got removed.
So I was barking up the wrong tree about this being a failure in the backup. It was a failure in the prune job, which for some reason removed the previous day's backup of CT105 at the 2400 pruning (first log), then pruned the older backup of CT106 in the next hour's prune job (second log), despite the retention policies you see at the top of each log.

Yet it doesn't do it on the 108/109 backups. It quite happily keeps them in the same log that it prunes the older 106 backup. The only thing I can think of is that the 105 was done before the 2400 prune and the 106 was ongoing during it. Which should make no difference at the 0100 prune that turfed 106's older backup.

The other datastore, local sata, (configured precisely the same way) did no such thing, and still has an hourly prune active on it with no misadventures.


Datastore pvebak_toshiba2:
1728054236723.png
Well, in that case the backup job should fail instead of the sync job, so not really a gain there. But if it does not, can you make sure the mountpoint is empty before the disk is mounted? How do you proceed when disconnecting and reconnecting the datastore?
Since they're ZFS partitions, I just down the server and swap them so the zfs-mount services can do their jobs and PBS starts up gracefully with the active datastores. I could manually export and import but that's all taken care of by downing the PBS server and doing it offline. ZFS works very well for what I do later with the datasets for syncing to other storage for multiple extra copies (sysadmin backup paranoia level 9).


Note that removable datastores are not supported yet. There are patches introducing them on the mailing list, so this will be possible in the future: https://lore.proxmox.com/pbs-devel/20240904141155.350454-1-h.laimer@proxmox.com/
That was my impression about what I was trying to do with syncs, which is why I settled on this method. The backup job on the disconnected datastore seems to fail more gracefully that way and doesn't mess up later backup attempts. I might be misremembering what was wrong with the sync method but it definitely didn't work as well as just scheduling concurrent backup jobs of which one just fails. I just have Nagios confirm at least one of them completed to stay green for that day, along with the local SATA version.
Edit: Also, your screenshots do not show the retention period as configured on the backup jobs on the PVE side, but since there is nothing in the logs I guess these are kept at the default (please verify, just to rule that out as well).
I had put that in the first post in shorthand, but it's 7d, 4w, 12m, 1y.

No problems last night again.
 


Had a quick glance at the code, and I guess that the backup snapshots removed by the prune job were not completed correctly. Can you also provide the task log for the backup run the day before, which includes the backup snapshots ct/106/2024-10-01T07:50:08Z and ct/105/2024-10-01T07:48:20Z? The prune job will remove incomplete snapshots (a snapshot is considered incomplete if there is no valid manifest for it).
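A rough way to spot such snapshots before the prune job gets to them is to look for snapshot directories that have no manifest file. The sketch below rests on assumptions that should be checked against the actual datastore: that snapshots live under `<datastore>/<type>/<id>/<timestamp>/`, and that the manifest is a file named `index.json.blob`. It only tests for the file's presence, not whether the manifest is valid, and the function name is hypothetical.

```python
from pathlib import Path

# Assumption: the snapshot manifest is stored under this file name.
MANIFEST_NAME = "index.json.blob"


def snapshots_without_manifest(datastore_root):
    """List snapshot directories (relative paths) lacking a manifest file.

    Assumes the layout <root>/<type>/<id>/<timestamp>/ with the types
    ct, vm and host; namespaces are not handled in this sketch.
    """
    root = Path(datastore_root)
    missing = []
    for backup_type in ("ct", "vm", "host"):
        type_dir = root / backup_type
        if not type_dir.is_dir():
            continue
        for group in sorted(p for p in type_dir.iterdir() if p.is_dir()):
            for snap in sorted(p for p in group.iterdir() if p.is_dir()):
                if not (snap / MANIFEST_NAME).is_file():
                    missing.append(snap.relative_to(root).as_posix())
    return missing
```

Running this right after a backup job finishes, and before the next prune, would show whether the affected CT snapshots were left without a manifest.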
 
All my task logs for Sept 30, which would have been that backup, are blank now. I'm guessing some retention policy clears or moves them at 1 week. Is there somewhere to look outside the PVE web GUI that might still have them?
 
You could try to see if you still have that information in the systemd journal.
I'm guessing some retention policy clears or moves them at 1 week.
There is scheduled log rotation, but that should definitely not clear your task logs from a week ago, unless they were really huge for some reason.
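As a starting point for searching the journal, something like the following could build a `journalctl` call limited to the day in question. It is a sketch only: the helper name is hypothetical, the dates are examples, and which unit (if any) to filter on is left open because it depends on which service logged the task.

```python
import subprocess


def journal_command(since, until, unit=None):
    """Build a journalctl argument list for a time window.

    Uses only the --since, --until, --no-pager and -u options.
    `unit` is optional; no particular unit name is assumed here.
    """
    cmd = ["journalctl", "--no-pager", "--since", since, "--until", until]
    if unit:
        cmd += ["-u", unit]
    return cmd


if __name__ == "__main__":
    # Example window covering the backup run in question (dates are examples).
    cmd = journal_command("2024-09-30 00:00:00", "2024-10-01 12:00:00")
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Crude filter: show only lines that mention the two affected containers.
    for line in result.stdout.splitlines():
        if "105" in line or "106" in line:
            print(line)
```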
 
