PBS PVE Fail Backup

Hi Fabian,

yes, you could say that those two VMs are especially active. One runs Zabbix, so data is continuously being processed.
The only other difference might be size: those are the two largest VMs I have (600GB and 50GB of hard disks).
Anyway, I waited 24h, ran the prune on the datastore, manually deleted the vm/[ID] folders that contained only the owner file, and tried the backup again, which this time completed successfully. Nothing has changed since yesterday and I didn't notice anything wrong with the ISP (which might have caused the issue yesterday)...
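
In case it helps someone, a rough sketch of the cleanup I did, assuming the datastore is mounted under /mnt/datastore/<store> (the path and ID are placeholders):

Code:
# list backup group directories that contain nothing but the 'owner' file
for d in /mnt/datastore/<store>/vm/*/; do
    [ "$(ls -A "$d")" = "owner" ] && echo "$d"
done
# after double-checking the output, those empty groups can be removed:
# rm -r /mnt/datastore/<store>/vm/<ID>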

"Tutto è bene ciò che finisce bene" we say in Italy... bye!
 
Hi,

I don't see any follow-up to this - but we're having precisely the same problem.

This is a brand new PBS installation - all updates are done - and the zpool layout is as follows:

Code:
root@pbs-01:~# zpool status
  pool: PBS01-Datastore
 state: ONLINE
  scan: scrub repaired 0B in 00:17:24 with 0 errors on Tue Feb 22 19:40:51 2022
config:

        NAME             STATE     READ WRITE CKSUM
        PBS01-Datastore  ONLINE       0     0     0
          mirror-0       ONLINE       0     0     0
            sdu          ONLINE       0     0     0
            sdv          ONLINE       0     0     0
            sdw          ONLINE       0     0     0
          mirror-1       ONLINE       0     0     0
            sdaa         ONLINE       0     0     0
            sdab         ONLINE       0     0     0
            sdac         ONLINE       0     0     0
          mirror-2       ONLINE       0     0     0
            sdad         ONLINE       0     0     0
            sdae         ONLINE       0     0     0
            sdaf         ONLINE       0     0     0
          mirror-3       ONLINE       0     0     0
            sdag         ONLINE       0     0     0
            sdah         ONLINE       0     0     0
            sdai         ONLINE       0     0     0
          mirror-4       ONLINE       0     0     0
            sdaj         ONLINE       0     0     0
            sdak         ONLINE       0     0     0
            sdal         ONLINE       0     0     0
          mirror-5       ONLINE       0     0     0
            sdi          ONLINE       0     0     0
            sdj          ONLINE       0     0     0
            sdk          ONLINE       0     0     0
          mirror-6       ONLINE       0     0     0
            sdl          ONLINE       0     0     0
            sdm          ONLINE       0     0     0
            sdn          ONLINE       0     0     0
          mirror-7       ONLINE       0     0     0
            sdo          ONLINE       0     0     0
            sdp          ONLINE       0     0     0
            sdq          ONLINE       0     0     0
          mirror-8       ONLINE       0     0     0
            sdr          ONLINE       0     0     0
            sds          ONLINE       0     0     0
            sdt          ONLINE       0     0     0
        special
          mirror-9       ONLINE       0     0     0
            sdd          ONLINE       0     0     0
            sde          ONLINE       0     0     0
            sdf          ONLINE       0     0     0
        logs
          mirror-10      ONLINE       0     0     0
            sdg          ONLINE       0     0     0
            sdh          ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:13 with 0 errors on Tue Feb 22 19:23:31 2022
config:

        NAME                                                     STATE     READ WRITE CKSUM
        rpool                                                    ONLINE       0     0     0
          mirror-0                                               ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNX0J400741-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNX0J400728-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNX0H413627-part3  ONLINE       0     0     0

errors: No known data errors

The 'rpool' contains the OS (a 3-way SSD mirror), and PBS01-Datastore contains the backup datastore - built from 3-way HDD mirrors plus an SSD-based mirrored special device and a mirrored SLOG/ZIL device (all in a 44-bay Supermicro JBOD with HGST 12TB drives).
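
For reference, that layout roughly corresponds to a pool built along these lines (a sketch only, reusing the device names from the status output above; any ashift/compression tuning is omitted):

Code:
zpool create PBS01-Datastore \
    mirror sdu sdv sdw \
    mirror sdaa sdab sdac \
    mirror sdad sdae sdaf \
    mirror sdag sdah sdai \
    mirror sdaj sdak sdal \
    mirror sdi sdj sdk \
    mirror sdl sdm sdn \
    mirror sdo sdp sdq \
    mirror sdr sds sdt \
    special mirror sdd sde sdf \
    log mirror sdg sdh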

The PBS 'compute' unit is a 2x Intel Gold 6150 Supermicro Ultra with 384GB RAM.

So on paper we should have no performance issues - and the errors we are getting occur purely during testing, not production usage.
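
To back that up with numbers, the built-in PBS benchmark can be run from a client against the datastore (a sketch; the repository string below is just an example for our setup):

Code:
# run the built-in TLS/chunking/compression benchmark against the datastore
proxmox-backup-client benchmark --repository root@pam@pbs-01:PBS01-Datastore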

We are seeing the following in our logs (as an example), together with failing backups:

Code:
Feb 23 17:11:41 pbs-01 proxmox-backup-proxy[5448]: starting new backup on datastore 'PBS01-Datastore': "vm/1869/2022-02-23T15:11:38Z"
Feb 23 17:11:41 pbs-01 proxmox-backup-proxy[5448]: GET /previous: 400 Bad Request: no valid previous backup
Feb 23 17:11:41 pbs-01 proxmox-backup-proxy[5448]: created new fixed index 1 ("vm/1869/2022-02-23T15:11:38Z/drive-scsi0.img.fidx")
Feb 23 17:11:41 pbs-01 proxmox-backup-proxy[5448]: created new fixed index 2 ("vm/1869/2022-02-23T15:11:38Z/drive-scsi1.img.fidx")
Feb 23 17:11:41 pbs-01 proxmox-backup-proxy[5448]: add blob "/mnt/datastore/PBS01-Datastore/vm/1869/2022-02-23T15:11:38Z/qemu-server.conf.blob" (358 bytes, comp: 358)
Feb 23 17:11:41 pbs-01 proxmox-backup-proxy[5448]: add blob "/mnt/datastore/PBS01-Datastore/vm/1869/2022-02-23T15:11:38Z/fw.conf.blob" (138 bytes, comp: 138)
Feb 23 17:14:11 pbs-01 systemd[1]: session-1.scope: Succeeded.
Feb 23 17:14:11 pbs-01 systemd[1]: session-3.scope: Succeeded.
Feb 23 17:14:21 pbs-01 systemd[1]: Stopping User Manager for UID 0...
Feb 23 17:14:21 pbs-01 systemd[6105]: Stopped target Main User Target.
Feb 23 17:14:21 pbs-01 systemd[6105]: Stopped target Basic System.
Feb 23 17:14:21 pbs-01 systemd[6105]: Stopped target Paths.
Feb 23 17:14:21 pbs-01 systemd[6105]: Stopped target Sockets.
Feb 23 17:14:21 pbs-01 systemd[6105]: Stopped target Timers.
Feb 23 17:14:21 pbs-01 systemd[6105]: dirmngr.socket: Succeeded.
Feb 23 17:14:21 pbs-01 systemd[6105]: Closed GnuPG network certificate management daemon.
Feb 23 17:14:21 pbs-01 systemd[6105]: gpg-agent-browser.socket: Succeeded.
Feb 23 17:14:21 pbs-01 systemd[6105]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Feb 23 17:14:21 pbs-01 systemd[6105]: gpg-agent-extra.socket: Succeeded.
Feb 23 17:14:21 pbs-01 systemd[6105]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Feb 23 17:14:21 pbs-01 systemd[6105]: gpg-agent-ssh.socket: Succeeded.
Feb 23 17:14:21 pbs-01 systemd[6105]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Feb 23 17:14:21 pbs-01 systemd[6105]: gpg-agent.socket: Succeeded.
Feb 23 17:14:21 pbs-01 systemd[6105]: Closed GnuPG cryptographic agent and passphrase cache.
Feb 23 17:14:21 pbs-01 systemd[6105]: Removed slice User Application Slice.
Feb 23 17:14:21 pbs-01 systemd[6105]: Reached target Shutdown.
Feb 23 17:14:21 pbs-01 systemd[6105]: systemd-exit.service: Succeeded.
Feb 23 17:14:21 pbs-01 systemd[6105]: Finished Exit the Session.
Feb 23 17:14:21 pbs-01 systemd[6105]: Reached target Exit the Session.
Feb 23 17:14:21 pbs-01 systemd[1]: user@0.service: Succeeded.
Feb 23 17:14:21 pbs-01 systemd[1]: Stopped User Manager for UID 0.
Feb 23 17:14:21 pbs-01 systemd[1]: Stopping User Runtime Directory /run/user/0...
Feb 23 17:14:21 pbs-01 systemd[1]: run-user-0.mount: Succeeded.
Feb 23 17:14:21 pbs-01 systemd[1]: user-runtime-dir@0.service: Succeeded.
Feb 23 17:14:21 pbs-01 systemd[1]: Stopped User Runtime Directory /run/user/0.
Feb 23 17:14:21 pbs-01 systemd[1]: Removed slice User Slice of UID 0.
Feb 23 17:14:21 pbs-01 systemd[1]: user-0.slice: Consumed 1min 41.746s CPU time.
Feb 23 17:14:56 pbs-01 proxmox-backup-proxy[5448]: error during snapshot file listing: 'unable to load blob '"/mnt/datastore/PBS01-Datastore/vm/1869/2022-02-23T15:11:38Z/index.json.blob"' - No such file or directory (os error 2)'
Feb 23 17:14:56 pbs-01 proxmox-backup-proxy[5448]: error during snapshot file listing: 'unable to load blob '"/mnt/datastore/PBS01-Datastore/vm/1403/2022-02-23T14:17:09Z/index.json.blob"' - No such file or directory (os error 2)'
Feb 23 17:15:01 pbs-01 CRON[1363317]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 23 17:16:00 pbs-01 proxmox-backup-proxy[5448]: error during snapshot file listing: 'unable to load blob '"/mnt/datastore/PBS01-Datastore/vm/1869/2022-02-23T15:11:38Z/index.json.blob"' - No such file or directory (os error 2)'
Feb 23 17:16:00 pbs-01 proxmox-backup-proxy[5448]: error during snapshot file listing: 'unable to load blob '"/mnt/datastore/PBS01-Datastore/vm/1403/2022-02-23T14:17:09Z/index.json.blob"' - No such file or directory (os error 2)'

So one of the backups started fine (a large 6TB VM) and progressed until about 01:00 this morning, after which things 'broke': no further syslog or kern.log entries, while both the PVE host and the PBS server indicated the backup was still ongoing. It simply stopped progressing and hung.
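
When it hangs like that, we can at least check what was actually written for the affected snapshot and what the server-side task log shows (a sketch; the UPID has to be taken from the task list output):

Code:
# see which files made it to disk for the stuck snapshot
ls -la "/mnt/datastore/PBS01-Datastore/vm/1869/2022-02-23T15:11:38Z/"

# list recent/running tasks on the PBS host and dump the log of the backup task
proxmox-backup-manager task list
proxmox-backup-manager task log <UPID>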

We're looking at whether this could be linked to the special and/or SLOG/ZIL devices (the SLOG/ZIL device in particular looks like it is not utilised and serves no purpose in a PBS setup).
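
One way to check whether the SLOG is ever touched is to watch per-vdev I/O while a backup is running (a sketch; 5-second interval):

Code:
# per-vdev I/O statistics every 5 seconds; the 'logs' mirror should stay
# essentially idle if PBS is not issuing the sync writes a SLOG serves
zpool iostat -v PBS01-Datastore 5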

Anything else we should be looking at/any ideas?

Kind regards,

Angelo.
 
Well, you definitely need more logs from closer to when the problem appeared. If the log files on disk don't contain the relevant time span, it's possible you triggered some kernel bug that prevented writing to disk - in that case a serial console or netconsole might help collect the relevant info.
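
e.g. a minimal netconsole setup could look like this (IPs, interface and the receiver MAC are placeholders, adjust to your network):

Code:
# on the PBS host: forward kernel messages over UDP to a separate log host
modprobe netconsole netconsole=6666@192.0.2.10/eno1,514@192.0.2.20/aa:bb:cc:dd:ee:ff

# on the log host: capture whatever arrives
nc -u -l 514 | tee pbs-01-kernel.log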
 
