Discrepancy between CLI verify job and GUI verify job

Taledo
Hey all,

I'm seeing some weird behaviour on one of my PBS nodes. The GUI verify job, which should run every day, has been failing for a few days now. Upon further investigation, the verify job is trying to verify a non-existent (probably pruned) backup (the datastore name has been censored, but isn't strictly relevant here):

Code:
2024-01-24T12:47:18+01:00: verify hello-pbs2:vm/9020/2023-11-23T22:50:15Z
2024-01-24T12:47:18+01:00:   check qemu-server.conf.blob
2024-01-24T12:47:18+01:00:   check drive-scsi0.img.fidx
2024-01-24T12:48:19+01:00:   verified 3681.25/21544.00 MiB in 60.41 seconds, speed 60.94/356.64 MiB/s (0 errors)
2024-01-24T12:48:19+01:00: TASK ERROR: verification failed - job aborted

Code:
root@hellothere-PBS2:/mnt/datastore/hellothere-pbs2/vm/9020# ls -lah | grep 2023-11-2
drwxr-xr-x  2 backup backup 4.0K Jan 23 17:36 2023-11-26T22:50:10Z
root@hellothere-PBS2:/mnt/datastore/hellothere-pbs2/vm/9020#

Weirdly, launching a verify job from the CLI works perfectly.


Code:
root@heya-PBS2:/mnt/datastore/wololo-pbs2/vm/9020# proxmox-backup-manager verify hellothere-pbs2
[...]
percentage done: 100.00% (47/47 groups)
TASK OK

Version:

proxmox-backup-server 3.1.2-1 running version: 3.1.2

The GC job runs every day at 21:00 with no issues.

Cheers
 
please post the full logs of both runs..
 
both see the same snapshots - is it possible you or somebody else simply aborted the verification task? if a new task from the UI also fails, please also check the journal during its execution for any errors..
 
I'm the only one accessing this PBS, so that shouldn't be an issue.

Any way to filter journalctl's output? It's spammed by running backups every few minutes.
 
you can use --since and --until and just look at +/- 5 minutes around the error
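
for example, something like this around your failing run (unit names assuming a default PBS install, where the daemons run as proxmox-backup-proxy and proxmox-backup):

Code:
# only PBS daemon messages, +/- 5 minutes around the 12:47 failure
journalctl -u proxmox-backup-proxy.service -u proxmox-backup.service --since "2024-01-24 12:42:00" --until "2024-01-24 12:52:00"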
 
Hey,

Nothing in the journal that I can spot:


Code:
root@heyhey-PBS2:~# journalctl --since "2024-01-25 00:20:00" --until "2024-01-25 00:30:00"
Jan 25 00:21:43 heyhey-PBS2 proxmox-backup-[170817]: heyhey-PBS2 proxmox-backup-proxy[170817]: write rrd data back to disk
Jan 25 00:21:43 heyhey-PBS2 proxmox-backup-[170817]: heyhey-PBS2 proxmox-backup-proxy[170817]: starting rrd data sync
Jan 25 00:21:43 heyhey-PBS2 proxmox-backup-[170817]: heyhey-PBS2 proxmox-backup-proxy[170817]: rrd journal successfully committed (25 files in 0.045 seconds)
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 34
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 62
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 79
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 70
Jan 25 00:22:35 heyhey-PBS2 smartd[508]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 27 to 30
root@heyhey-PBS2:~#

(Don't mind smartd; those are old drives I need to replace, but I somehow doubt it's 66°C in front of the AC.)
 
does it abort at the same snapshot again? what happens if you try to verify just that snapshot via the UI?
 
Not the same snapshot as before, but the same group.

Code:
2024-01-25T00:24:57+01:00: percentage done: 77.98% (36/47 groups, 48/74 snapshots in group #37)
2024-01-25T00:24:57+01:00: verify hello-pbs2:vm/9020/2023-11-24T22:50:23Z
2024-01-25T00:24:57+01:00:   check qemu-server.conf.blob
2024-01-25T00:24:57+01:00:   check drive-scsi0.img.fidx
2024-01-25T00:26:08+01:00:   verified 3650.90/21524.00 MiB in 71.09 seconds, speed 51.35/302.76 MiB/s (0 errors)
2024-01-25T00:26:08+01:00: TASK ERROR: verification failed - job aborted

I can't verify that snapshot via the GUI because it doesn't exist anymore:

[screenshot: the snapshot is no longer listed in the GUI]

I was thinking maybe it was a race condition between the prune job and the verify job (my prune job runs hourly, which it doesn't really need to; I'll shift it to daily at 5 AM to see if that changes anything).
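
(For reference, I believe the prune schedule can also be changed from the CLI via proxmox-backup-manager's prune-job subcommand; the job ID below is hypothetical, yours comes from the list command:)

Code:
# list configured prune jobs and their schedules to find the job ID
proxmox-backup-manager prune-job list
# hypothetical job ID; switch it from hourly to daily at 05:00
proxmox-backup-manager prune-job update prune-hourly-01 --schedule "05:00"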
 
which version are you on? a snapshot disappearing while its group is being verified should be handled, unless your version is very old..
 
Alright, looks like a layer 8 issue.

Basically, I had copied the VM's backups from the root namespace into another namespace to restore said VM into a different environment. However, I had failed to chown the copied folder to my backup user, resulting in this:

[screenshot: the copied namespace directory owned by root instead of the backup user]
(I'm putting the blame on post-Christmas fatigue.)

Anyhow, this of course didn't play well with the verify job, as it couldn't write files correctly.
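
For anyone who ends up here with the same mistake, the fix is just handing ownership back to the backup user and group PBS runs as (visible in my ls output above). A rough sketch, with the datastore and namespace names as placeholders for my censored ones; namespaces live under the datastore's ns/ directory:

Code:
# give the copied namespace back to the backup user/group
chown -R backup:backup /mnt/datastore/<datastore>/ns/<namespace>
# double-check the ownership afterwards
ls -lah /mnt/datastore/<datastore>/ns/<namespace>/vm/9020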

I don't know, however, whether having a VM with the same VMID in different namespaces is a good thing. As for why the CLI verify did work, my theory is that it doesn't recurse into the namespaces unless told to?

It might also help in the future if the verify job log specified the namespace it's in, for dumb-dumbs like me.

I've launched a verify job again and will edit this once it's done.

EDIT: the verify job has finished. Wunderbar.

Thanks Fabian for your help!

Cheers all
 