Discrepancy between CLI verify job and GUI verify job

Taledo
Active Member · Nov 20, 2020
Hey all,

I'm seeing some weird behaviour on one of my PBS nodes. The GUI verify job, which should run every day, has been failing for a few days now. Upon further investigation, the verify job is trying to verify a non-existent (probably pruned) backup (the datastore name has been censored but isn't strictly relevant here):

Code:
2024-01-24T12:47:18+01:00: verify hello-pbs2:vm/9020/2023-11-23T22:50:15Z
2024-01-24T12:47:18+01:00:   check qemu-server.conf.blob
2024-01-24T12:47:18+01:00:   check drive-scsi0.img.fidx
2024-01-24T12:48:19+01:00:   verified 3681.25/21544.00 MiB in 60.41 seconds, speed 60.94/356.64 MiB/s (0 errors)
2024-01-24T12:48:19+01:00: TASK ERROR: verification failed - job aborted

Code:
root@hellothere-PBS2:/mnt/datastore/hellothere-pbs2/vm/9020# ls -lah | grep 2023-11-2
drwxr-xr-x  2 backup backup 4.0K Jan 23 17:36 2023-11-26T22:50:10Z
root@hellothere-PBS2:/mnt/datastore/hellothere-pbs2/vm/9020#

Weirdly, launching a verify job from the CLI works perfectly.


Code:
root@heya-PBS2:/mnt/datastore/wololo-pbs2/vm/9020# proxmox-backup-manager verify hellothere-pbs2
[...]
percentage done: 100.00% (47/47 groups)
TASK OK

Version:

proxmox-backup-server 3.1.2-1 running version: 3.1.2

The GC job runs every day at 21:00 with no issues.

Cheers
 
please post the full logs of both runs..
 
both see the same snapshots - is it possible you or somebody else simply aborted the verification task? if a new task from the UI also fails, please also check the journal during its execution for any errors..
 
I'm the only one accessing this PBS, so that shouldn't be an issue.

Any way to filter journalctl's output? It's spammed by running backups every few minutes.
 
you can use --since and --until and just look at +-5 minutes of the error
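for example, something like this (timestamps adjusted to around your error; filtering on the proxmox-backup-proxy unit, which runs the scheduled tasks, is optional but cuts out unrelated services):

Code:
journalctl -u proxmox-backup-proxy --since "2024-01-25 00:20:00" --until "2024-01-25 00:30:00"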
 
Hey,

Nothing I can spot in the journal:


Code:
root@heyhey-PBS2:~# journalctl --since "2024-01-25 00:20:00" --until "2024-01-25 00:30:00"
Jan 25 00:21:43 heyhey-PBS2 proxmox-backup-[170817]: heyhey-PBS2 proxmox-backup-proxy[170817]: write rrd data back to disk
Jan 25 00:21:43 heyhey-PBS2 proxmox-backup-[170817]: heyhey-PBS2 proxmox-backup-proxy[170817]: starting rrd data sync
Jan 25 00:21:43 heyhey-PBS2 proxmox-backup-[170817]: heyhey-PBS2 proxmox-backup-proxy[170817]: rrd journal successfully committed (25 files in 0.045 seconds)
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 34
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 62
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 79
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 70
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 27 to 30
root@heyhey-PBS2:~#

(don't mind smartd, those are old drives I need to replace, but I somehow doubt it's 66°C in front of the AC)
 
does it abort at the same snapshot again? what happens if you try to verify just that snapshot via the UI?
 
Not the same one as before but the same group.

Code:
2024-01-25T00:24:57+01:00: percentage done: 77.98% (36/47 groups, 48/74 snapshots in group #37)
2024-01-25T00:24:57+01:00: verify hello-pbs2:vm/9020/2023-11-24T22:50:23Z
2024-01-25T00:24:57+01:00:   check qemu-server.conf.blob
2024-01-25T00:24:57+01:00:   check drive-scsi0.img.fidx
2024-01-25T00:26:08+01:00:   verified 3650.90/21524.00 MiB in 71.09 seconds, speed 51.35/302.76 MiB/s (0 errors)
2024-01-25T00:26:08+01:00: TASK ERROR: verification failed - job aborted

I can't verify via the GUI because the snapshot doesn't exist anymore:

[screenshot: the GUI no longer lists the snapshot, so it can't be selected for verification]

I was thinking maybe it was a race condition between the prune job and the verify job (my prune job runs hourly, which it doesn't really need to; I'll shift it to daily at 5 am to see if that changes anything).
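A rough sketch of how I plan to shift it (assuming the schedule is managed as a prune job; the job ID below is hypothetical and should be taken from the list output):

Code:
# list configured prune jobs and their current schedules
proxmox-backup-manager prune-job list
# move the hourly job to once a day at 05:00 (job ID is an example)
proxmox-backup-manager prune-job update hourly-prune --schedule "05:00"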
 
which version are you on? a snapshot disappearing while its group is being verified should be handled, unless your version is very old..
 
Alright, looks like a layer 8 issue.

Basically, I had copied the VM backups from the root namespace into another namespace to restore said VM into a different environment. However, I had failed to chown the folder to my backup user, resulting in this:

[screenshot: the copied namespace directory is owned by root instead of the backup user]
(I'm putting the blame on post-Christmas fatigue)

Anyhow, this of course didn't play well with the verify job, as it couldn't write files correctly.
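For reference, roughly what the fix looked like (assuming the default layout where namespaces live under ns/ inside the datastore directory; the namespace name here is made up):

Code:
# hand the copied namespace back to the backup user so verify can write again
chown -R backup:backup /mnt/datastore/hellothere-pbs2/ns/restore-test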

I don't know, however, whether having a VM with the same VMID in different namespaces is a good thing. As for why the CLI verify did work, my theory is that it doesn't recurse into the namespaces unless explicitly told to?
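To check where the duplicate group actually lives (and who owns each copy), something like this should list every copy across the root namespace and one level of sub-namespaces; the depth and path pattern are assumptions based on the default on-disk layout:

Code:
find /mnt/datastore/hellothere-pbs2 -maxdepth 4 -type d -path '*/vm/9020' -exec ls -ld {} +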

It might also help in the future if the verify job log specified the namespace it's working in, for the dumb-dumbs like me.

I've launched a verify job again and will edit this once it's done.

EDIT: the verify job has finished. Wunderbar.

Thanks Fabian for your help!

Cheers all
 