Tuxis launches free Proxmox Backup Server BETA service

Felix. · Oct 12, 2020

@tuxis Your PBS seems to be a bit snappier now - pvestatd stays alive way longer now, but that means like 15-20 minutes at best.
Still not comfortable with it yet, but i see a notable progress!

Meanwhile i switched my own PBS instance to a virtual machine to see how that works.
I'm recycling my Hetzner Storage Box that is used for classic vzdump backups as datastore. "It works."

The Storage Box is connected via SMB / CIFS directly to the VM.
Generating the .chunks folder with its tons of subfolders took hours alone. The initial backup was quite long too, but acceptable.
Now all incremental backups take a little longer than with the dedicated pbs server with local hdds, but its actually only ~5 minutes more.

Of course thats far away from a real production setup, but to my surprise, it works.
It worked flawlessy the whole last week... my spooky little setup.

eider · Oct 12, 2020

Unable to post full log from server; would @tuxis be kind enough as to assist here as it is their server?

your keyfile contains timestamps in a wrongly serialized format that the new client does not understand.. could you post the 'created' and 'modified' entries of your keyfile (for PVE's autogenerated one, that is '/etc/pve/priv/storage/STORAGEID.enc')?

Interesting, here's the file in question:

Code:

{"kdf":null,"created":"2020-09-28T11:23:12.024189514+02:00","modified":"2020-09-28T11:23:12.024189514+02:00"}

The file was created on filesystem at the same time, and was created using proxmox-backup-client in version of 0.8.16-1 as indicated by the entry from apt log:

Code:

Start-Date: 2020-09-24  14:19:15
Commandline: apt-get -o Acquire::ForceIPv4=true dist-upgrade
Requested-By: eider (1000)
Install: proxmox-archive-keyring:amd64 (1.0, automatic)
Upgrade: pve-headers:amd64 (6.2-1, 6.2-2), pve-qemu-kvm:amd64 (5.1.0-1, 5.1.0-2), proxmox-backup-client:amd64 (0.8.15-1, 0.8.16-1), proxmox-ve:amd64 (6.2-1, 6.2-2)
End-Date: 2020-09-24  14:19:20

Start-Date: 2020-10-09  21:54:42
Commandline: apt-get -o Acquire::ForceIPv4=true dist-upgrade
Requested-By: eider (1000)
Install: pve-kernel-5.4.65-1-pve:amd64 (5.4.65-1, automatic), qrencode:amd64 (4.0.2-1, automatic), pve-headers-5.4.65-1-pve:amd64 (5.4.65-1, automatic), libqrencode4:amd64 (4.0.2-1, automatic)
Upgrade: proxmox-widget-toolkit:amd64 (2.2-12, 2.3-1), pve-kernel-5.4:amd64 (6.2-6, 6.2-7), libpve-access-control:amd64 (6.1-2, 6.1-3), libpve-storage-perl:amd64 (6.2-6, 6.2-8), pve-qemu-kvm:amd64 (5.1.0-2, 5.1.0-3), libpve-cluster-api-perl:amd64 (6.1-8, 6.2-1), libpve-cluster-perl:amd64 (6.1-8, 6.2-1), pve-docs:amd64 (6.2-5, 6.2-6), zfs-initramfs:amd64 (0.8.4-pve1, 0.8.4-pve2), pve-firewall:amd64 (4.1-2, 4.1-3), pve-container:amd64 (3.2-1, 3.2-2), zfs-zed:amd64 (0.8.4-pve1, 0.8.4-pve2), proxmox-backup-client:amd64 (0.8.16-1, 0.9.0-2), pve-cluster:amd64 (6.1-8, 6.2-1), zfsutils-linux:amd64 (0.8.4-pve1, 0.8.4-pve2), ifupdown2:amd64 (3.0.0-1+pve2, 3.0.0-1+pve3), pve-manager:amd64 (6.2-11, 6.2-12), spl:amd64 (0.8.4-pve1, 0.8.4-pve2), libzfs2linux:amd64 (0.8.4-pve1, 0.8.4-pve2), qemu-server:amd64 (6.2-14, 6.2-15), pve-headers-5.4:amd64 (6.2-6, 6.2-7), pve-kernel-helper:amd64 (6.2-6, 6.2-7), libzpool2linux:amd64 (0.8.4-pve1, 0.8.4-pve2), libnvpair1linux:amd64 (0.8.4-pve1, 0.8.4-pve2), libuutil1linux:amd64 (0.8.4-pve1, 0.8.4-pve2)
End-Date: 2020-10-09  21:56:15

and does not affect the status of existing backups in any way (the client just cannot interact with them until you fix the timestamp).

That is interesting, as the backups were working fine until the time this has happened, and if the issue was caused by rsync on Tuxis side, I'd expect the previous backup (from 09.10) to be fine as it was done after they started rsync. Could it possibly be related to client failing backup and server purging old chunks as if the client succeeded?

There is also very interesting difference in how this is handled between CTs and VMs; CTs fail before any data is sent to server, while VMs start transmitting data and only fail after that stage is complete; why?

Code:

INFO: starting new backup job: vzdump --mode snapshot --compress zstd --mailto root@i32.pl --quiet 1 --all 1 --mailnotification failure --storage pbs.tuxis.nl
INFO: skip external VMs: 102, 104, 105, 107, 112
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2020-10-10 02:15:01
INFO: status = running
INFO: VM Name: azusa
INFO: include disk 'scsi0' 'local:100/vm-100-disk-0.qcow2' 128G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/100/2020-10-10T00:15:01Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: enabling encryption
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 100 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task 'ef8d07f4-04f3-4c5a-91f5-6a486b762743'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO:   0% (284.0 MiB of 128.0 GiB) in  3s, read: 94.7 MiB/s, write: 80.0 MiB/s
[...]
INFO: 100% (128.0 GiB of 128.0 GiB) in  1h 57m 45s, read: 188.0 MiB/s, write: 2.7 MiB/s
INFO: backup is sparse: 89.97 GiB (70%) total zero data
INFO: backup was done incrementally, reused 89.97 GiB (70%)
INFO: transferred 128.00 GiB in 7065 seconds (18.6 MiB/s)
INFO: run: /usr/bin/proxmox-backup-client prune vm/100 --quiet 1 --keep-last 7 --repository DB0623@pbs@pbs.tuxis.nl:DB0623_proxmox
INFO: vm/100/2020-10-03T00:15:02Z Sat Oct  3 02:15:02 2020 remove
INFO: Finished Backup of VM 100 (01:58:05)
INFO: Backup finished at 2020-10-10 04:13:06
ERROR: Backup job failed - upload log failed: Error: parse_rfc_3339 failed - wrong length at line 1 column 60

TASK ERROR: upload log failed: Error: parse_rfc_3339 failed - wrong length at line 1 column 60

Code:

INFO: Starting Backup of VM 112 (lxc)
INFO: Backup started at 2020-10-10 02:18:15
INFO: status = running
INFO: CT Name: cardinal
INFO: including mount point rootfs ('/') in backup
INFO: mode failure - some volumes do not support snapshots
INFO: trying 'suspend' mode instead
INFO: backup mode: suspend
INFO: ionice priority: 7
INFO: CT Name: cardinal
INFO: including mount point rootfs ('/') in backup
INFO: starting first sync /proc/3452/root/ to /var/tmp/vzdumptmp20837
INFO: first sync finished - transferred 3.25G bytes in 44s
INFO: suspending guest
INFO: starting final sync /proc/3452/root/ to /var/tmp/vzdumptmp20837
INFO: final sync finished - transferred 0 bytes in 5s
INFO: resuming guest
INFO: guest is online again after 5 seconds
INFO: creating Proxmox Backup Server archive 'ct/112/2020-10-10T00:18:15Z'
INFO: run: lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- /usr/bin/proxmox-backup-client backup --crypt-mode=encrypt --keyfd=11 pct.conf:/var/tmp/vzdumptmp20837/etc/vzdump/pct.conf fw.conf:/var/tmp/vzdumptmp20837/etc/vzdump/pct.fw root.pxar:/var/tmp/vzdumptmp20837 --include-dev /var/tmp/vzdumptmp20837/. --skip-lost-and-found --backup-type ct --backup-id 112 --backup-time 1602289095 --repository DB0623@pbs@pbs.tuxis.nl:DB0623_proxmox
INFO: Starting backup: ct/112/2020-10-10T00:18:15Z
INFO: Client name: ako
INFO: Starting backup protocol: Sat Oct 10 02:19:05 2020
INFO: Error: parse_rfc_3339 failed - wrong length at line 1 column 60
ERROR: Backup of VM 112 failed - command 'lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- /usr/bin/proxmox-backup-client backup '--crypt-mode=encrypt' '--keyfd=11' pct.conf:/var/tmp/vzdumptmp20837/etc/vzdump/pct.conf fw.conf:/var/tmp/vzdumptmp20837/etc/vzdump/pct.fw root.pxar:/var/tmp/vzdumptmp20837 --include-dev /var/tmp/vzdumptmp20837/. --skip-lost-and-found --backup-type ct --backup-id 112 --backup-time 1602289095 --repository DB0623@pbs@pbs.tuxis.nl:DB0623_proxmox' failed: exit code 255
INFO: Failed at 2020-10-10 02:19:09
INFO: Backup job finished with errors

TASK ERROR: job errors

eider · Oct 12, 2020

@tuxis there is (or was?) something very bad happening with your PBS; the data is corrupted. Please note that this backup was done AFTER you indicated that migration was done.

Code:

INFO: starting new backup job: vzdump --compress zstd --mailnotification failure --node riko --storage pbs.tuxis.nl --mailto root@i32.pl --mode snapshot --all 1 --quiet 1
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2020-10-10 10:27:59
INFO: status = running
INFO: VM Name: azusa
INFO: include disk 'scsi0' 'local:100/vm-100-disk-0.qcow2' 128G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/100/2020-10-10T08:27:59Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: enabling encryption
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 100 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task '35970c13-5621-4084-aecd-75c1473ae463'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: created new
INFO:   0% (544.0 MiB of 128.0 GiB) in  3s, read: 181.3 MiB/s, write: 84.0 MiB/s
[...]
INFO: 100% (128.0 GiB of 128.0 GiB) in  2h 27m 37s, read: 300.0 MiB/s, write: 2.7 MiB/s
INFO: backup is sparse: 89.97 GiB (70%) total zero data
INFO: backup was done incrementally, reused 89.97 GiB (70%)
INFO: transferred 128.00 GiB in 8857 seconds (14.8 MiB/s)
INFO: run: /usr/bin/proxmox-backup-client prune vm/100 --quiet 1 --keep-last 7 --repository DB0623@pbs@pbs.tuxis.nl:DB0623_proxmox
INFO: Finished Backup of VM 100 (02:27:57)
INFO: Backup finished at 2020-10-10 12:55:56

Code:

2020-10-10T10:28:07+02:00: starting new backup on datastore 'DB0623_proxmox': "vm/100/2020-10-10T08:27:59Z"
2020-10-10T10:28:07+02:00: GET /previous: 400 Bad Request: no valid previous backup
2020-10-10T10:28:07+02:00: created new fixed index 1 ("vm/100/2020-10-10T08:27:59Z/drive-scsi0.img.fidx")
2020-10-10T10:28:08+02:00: add blob "/lvm_pbs/DB0623/proxmox/vm/100/2020-10-10T08:27:59Z/qemu-server.conf.blob" (550 bytes, comp: 550)
2020-10-10T12:55:55+02:00: Upload statistics for 'drive-scsi0.img.fidx'
2020-10-10T12:55:55+02:00: UUID: af67fbd72e854ed08d188770e2e90a4c
2020-10-10T12:55:55+02:00: Checksum: 5d973adb30e0ac172a8cfe16ba6edfc344241b4bf64f26e51e0b8d180fbf74a2
2020-10-10T12:55:55+02:00: Size: 137438953472
2020-10-10T12:55:55+02:00: Chunk count: 32768
2020-10-10T12:55:55+02:00: Upload size: 40835743744 (29%)
2020-10-10T12:55:55+02:00: Duplicates: 23032+8590 (96%)
2020-10-10T12:55:55+02:00: Compression: 55%
2020-10-10T12:55:55+02:00: successfully closed fixed index 1
2020-10-10T12:55:55+02:00: add blob "/lvm_pbs/DB0623/proxmox/vm/100/2020-10-10T08:27:59Z/index.json.blob" (365 bytes, comp: 365)
2020-10-10T12:55:55+02:00: successfully finished backup
2020-10-10T12:55:55+02:00: backup finished successfully
2020-10-10T12:55:55+02:00: TASK OK

I've noticed this only now as the client does not seem to report these corruptions back during backup jobs at all (which is concerning as the server can't be configured to send you notices about them automatically either).

The log from next day for the same VM confirms that it detected invalid backup, as indicated by HTTP 400

Code:

2020-10-11T02:15:11+02:00: starting new backup on datastore 'DB0623_proxmox': "vm/100/2020-10-11T00:15:01Z"
2020-10-11T02:15:11+02:00: GET /previous: 400 Bad Request: no valid previous backup
2020-10-11T02:15:11+02:00: created new fixed index 1 ("vm/100/2020-10-11T00:15:01Z/drive-scsi0.img.fidx")
2020-10-11T02:15:11+02:00: add blob "/lvm_pbs/DB0623/proxmox/vm/100/2020-10-11T00:15:01Z/qemu-server.conf.blob" (550 bytes, comp: 550)
2020-10-11T05:01:22+02:00: Upload statistics for 'drive-scsi0.img.fidx'
2020-10-11T05:01:22+02:00: UUID: bb5d21727fd34a6b835face15eecfc42
2020-10-11T05:01:22+02:00: Checksum: 22ea2ba7fc697c21e5aa4f0e0ec05dfb42e21c4e2c49abfc8c83e9d706c2ad97
2020-10-11T05:01:22+02:00: Size: 137438953472
2020-10-11T05:01:22+02:00: Chunk count: 32768
2020-10-11T05:01:22+02:00: Upload size: 40835743744 (29%)
2020-10-11T05:01:22+02:00: Duplicates: 23032+8303 (95%)
2020-10-11T05:01:22+02:00: Compression: 55%
2020-10-11T05:01:22+02:00: successfully closed fixed index 1
2020-10-11T05:01:22+02:00: add blob "/lvm_pbs/DB0623/proxmox/vm/100/2020-10-11T00:15:01Z/index.json.blob" (362 bytes, comp: 362)
2020-10-11T05:01:22+02:00: successfully finished backup
2020-10-11T05:01:22+02:00: backup finished successfully
2020-10-11T05:01:22+02:00: TASK OK

Code:

2020-10-12T15:53:53+02:00: verify DB0623_proxmox:vm/100/2020-10-10T08:27:59Z
2020-10-12T15:53:53+02:00:   check qemu-server.conf.blob
2020-10-12T15:53:53+02:00:   check drive-scsi0.img.fidx
2020-10-12T16:06:44+02:00: can't verify chunk, load failed - store 'DB0623_proxmox', unable to load chunk '76c9575320941841d8adff3336d706ae31939af3ed3188cf8d4b31484e5e499e' - No such file or directory (os error 2)
2020-10-12T16:06:44+02:00: can't verify chunk, load failed - store 'DB0623_proxmox', unable to load chunk '36991239399ed7f4718869f3ccc14c45fe05bba37d17df0869d6f4b6b4480ad1' - No such file or directory (os error 2)
2020-10-12T16:17:27+02:00:   verified 21631.91/38928.00 MiB in 1413.40 seconds, speed 15.30/27.54 MiB/s (2 errors)
2020-10-12T16:17:27+02:00: verify DB0623_proxmox:vm/100/2020-10-10T08:27:59Z/drive-scsi0.img.fidx failed: chunks could not be verified
2020-10-12T16:17:27+02:00: Failed to verify following snapshots:
2020-10-12T16:17:27+02:00:  vm/100/2020-10-10T08:27:59Z
2020-10-12T16:17:27+02:00: TASK ERROR: verification failed - please check the log for details

---

In the meantime, scheduled check has finished on another backup (different CT) and it is also marked as failed:

Code:

2020-10-12T16:34:42+02:00: verify DB0623_proxmox:ct/105/2020-10-12T00:18:38Z
2020-10-12T16:34:42+02:00:   check pct.conf.blob
2020-10-12T16:34:42+02:00:   check fw.conf.blob
2020-10-12T16:34:42+02:00:   check root.pxar.didx
2020-10-12T16:34:54+02:00: can't verify chunk, load failed - store 'DB0623_proxmox', unable to load chunk '1c6eeab20c25becd0bd31effb2b3cb3085cff2368a2feb79409ec6e8910cfaa6' - No such file or directory (os error 2)
2020-10-12T16:35:36+02:00:   verified 1180.36/2143.76 MiB in 53.11 seconds, speed 22.22/40.36 MiB/s (1 errors)
2020-10-12T16:35:36+02:00: verify DB0623_proxmox:ct/105/2020-10-12T00:18:38Z/root.pxar.didx failed: chunks could not be verified
2020-10-12T16:35:36+02:00:   check catalog.pcat1.didx
2020-10-12T16:35:36+02:00:   verified 0.27/0.67 MiB in 0.03 seconds, speed 8.26/20.34 MiB/s (0 errors)
2020-10-12T16:35:36+02:00: Failed to verify following snapshots:
2020-10-12T16:35:36+02:00:     ct/105/2020-10-12T00:18:38Z
2020-10-12T16:35:36+02:00: TASK ERROR: verification failed - please check the log for details

Now, either PBS does not perform verification on network traffic or the data is being mangled somehow on your storage. It's also interesting that all of them fail with 'no such file or directory'.

I am leaving these backups untouched at the moment, could you take a look at them @tuxis?

Code:

Formatting '/var/lib/vz/images/113/vm-113-disk-0.raw', fmt=raw size=8589934592
mke2fs 1.44.5 (15-Dec-2018)
Discarding device blocks: 4096/2097152 done
Creating filesystem with 2097152 4k blocks and 524288 inodes
Filesystem UUID: 191df624-fa79-4056-aee5-925189395950
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: 0/64 done
Writing inode tables: 0/64 done
Creating journal (16384 blocks): done
Multiple mount protection is enabled with update interval 5 seconds.
Writing superblocks and filesystem accounting information: 0/64 done

Error: error extracting archive - error at entry "metrics.2020-10-07T18-11-04Z-00000": failed to copy file contents: reading file "/lvm_pbs/DB0623/proxmox/.chunks/1c6e/1c6eeab20c25becd0bd31effb2b3cb3085cff2368a2feb79409ec6e8910cfaa6" failed: No such file or directory (os error 2)
TASK ERROR: unable to restore CT 113 - command 'lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- /usr/bin/proxmox-backup-client restore '--crypt-mode=encrypt' '--keyfd=12' ct/105/2020-10-12T00:18:38Z root.pxar /var/lib/lxc/113/rootfs --allow-existing-dirs --repository DB0623@pbs@pbs.tuxis.nl:DB0623_proxmox' failed: exit code 255

tuxis · Oct 12, 2020

Ok, so I missed some messages..

For starters, sorry about the corruption. I did not expect that. I finished the last rsync this afternoon (which explains @Felix. his downtime this afternoon). Indeed, the PBS is much snappier.

The file @eider is missing indeed is not there. @fabian which logfile do you want to see?

eider · Oct 12, 2020

Okay @tuxis, i'll remove the broken backups and see how it goes from there.

fabian · Oct 13, 2020

@eider @tuxis the backup task logs for the snapshot referencing the missing chunk, and any subsequent/concurrent verify/GC task logs would be interesting. system logs from around the backup time as well.

fabian · Oct 13, 2020

eider said:
Unable to post full log from server; would @tuxis be kind enough as to assist here as it is their server?

Interesting, here's the file in question:

Code:

{"kdf":null,"created":"2020-09-28T11:23:12.024189514+02:00","modified":"2020-09-28T11:23:12.024189514+02:00"}

The file was created on filesystem at the same time, and was created using proxmox-backup-client in version of 0.8.16-1 as indicated by the entry from apt log:

so removing the .024...14 from both timestamps should fixup the key if you haven't guessed that already

eider said:
That is interesting, as the backups were working fine until the time this has happened, and if the issue was caused by rsync on Tuxis side, I'd expect the previous backup (from 09.10) to be fine as it was done after they started rsync. Could it possibly be related to client failing backup and server purging old chunks as if the client succeeded?

no, a failing backup leaves the chunks around, they will only be collected by garbage collection once they are older than 24h and are not referenced by any existing index.

eider said:

There is also very interesting difference in how this is handled between CTs and VMs; CTs fail before any data is sent to server, while VMs start transmitting data and only fail after that stage is complete; why?

Code:

INFO: starting new backup job: vzdump --mode snapshot --compress zstd --mailto root@i32.pl --quiet 1 --all 1 --mailnotification failure --storage pbs.tuxis.nl
INFO: skip external VMs: 102, 104, 105, 107, 112
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2020-10-10 02:15:01
INFO: status = running
INFO: VM Name: azusa
INFO: include disk 'scsi0' 'local:100/vm-100-disk-0.qcow2' 128G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/100/2020-10-10T00:15:01Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: enabling encryption
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 100 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task 'ef8d07f4-04f3-4c5a-91f5-6a486b762743'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO:   0% (284.0 MiB of 128.0 GiB) in  3s, read: 94.7 MiB/s, write: 80.0 MiB/s
[...]
INFO: 100% (128.0 GiB of 128.0 GiB) in  1h 57m 45s, read: 188.0 MiB/s, write: 2.7 MiB/s
INFO: backup is sparse: 89.97 GiB (70%) total zero data
INFO: backup was done incrementally, reused 89.97 GiB (70%)
INFO: transferred 128.00 GiB in 7065 seconds (18.6 MiB/s)
INFO: run: /usr/bin/proxmox-backup-client prune vm/100 --quiet 1 --keep-last 7 --repository DB0623@pbs@pbs.tuxis.nl:DB0623_proxmox
INFO: vm/100/2020-10-03T00:15:02Z Sat Oct  3 02:15:02 2020 remove
INFO: Finished Backup of VM 100 (01:58:05)
INFO: Backup finished at 2020-10-10 04:13:06
ERROR: Backup job failed - upload log failed: Error: parse_rfc_3339 failed - wrong length at line 1 column 60

TASK ERROR: upload log failed: Error: parse_rfc_3339 failed - wrong length at line 1 column 60

Code:

INFO: Starting Backup of VM 112 (lxc)
INFO: Backup started at 2020-10-10 02:18:15
INFO: status = running
INFO: CT Name: cardinal
INFO: including mount point rootfs ('/') in backup
INFO: mode failure - some volumes do not support snapshots
INFO: trying 'suspend' mode instead
INFO: backup mode: suspend
INFO: ionice priority: 7
INFO: CT Name: cardinal
INFO: including mount point rootfs ('/') in backup
INFO: starting first sync /proc/3452/root/ to /var/tmp/vzdumptmp20837
INFO: first sync finished - transferred 3.25G bytes in 44s
INFO: suspending guest
INFO: starting final sync /proc/3452/root/ to /var/tmp/vzdumptmp20837
INFO: final sync finished - transferred 0 bytes in 5s
INFO: resuming guest
INFO: guest is online again after 5 seconds
INFO: creating Proxmox Backup Server archive 'ct/112/2020-10-10T00:18:15Z'
INFO: run: lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- /usr/bin/proxmox-backup-client backup --crypt-mode=encrypt --keyfd=11 pct.conf:/var/tmp/vzdumptmp20837/etc/vzdump/pct.conf fw.conf:/var/tmp/vzdumptmp20837/etc/vzdump/pct.fw root.pxar:/var/tmp/vzdumptmp20837 --include-dev /var/tmp/vzdumptmp20837/. --skip-lost-and-found --backup-type ct --backup-id 112 --backup-time 1602289095 --repository DB0623@pbs@pbs.tuxis.nl:DB0623_proxmox
INFO: Starting backup: ct/112/2020-10-10T00:18:15Z
INFO: Client name: ako
INFO: Starting backup protocol: Sat Oct 10 02:19:05 2020
INFO: Error: parse_rfc_3339 failed - wrong length at line 1 column 60
ERROR: Backup of VM 112 failed - command 'lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- /usr/bin/proxmox-backup-client backup '--crypt-mode=encrypt' '--keyfd=11' pct.conf:/var/tmp/vzdumptmp20837/etc/vzdump/pct.conf fw.conf:/var/tmp/vzdumptmp20837/etc/vzdump/pct.fw root.pxar:/var/tmp/vzdumptmp20837 --include-dev /var/tmp/vzdumptmp20837/. --skip-lost-and-found --backup-type ct --backup-id 112 --backup-time 1602289095 --repository DB0623@pbs@pbs.tuxis.nl:DB0623_proxmox' failed: exit code 255
INFO: Failed at 2020-10-10 02:19:09
INFO: Backup job finished with errors

TASK ERROR: job errors

because Qemu uses libproxmox-backup-qemu0 for the backup itself, and proxmox-backup-client for uploading the client log and pruning. it looks like your VM's libproxmox-backup-qemu0 was built using the old timestamp parser (the lib and proxmox-backup-client use the same code, but are not always published in sync). for containers the whole backup is done with the client binary.

tuxis · Oct 13, 2020

eider said:
And the tasks are named as 1970-01-01.

This is a known issue and is pending decisions from Proxmox whether or not they want to use _ as a delimiter. Because we use _ in the datastore, the timestamp-parser gets confused. https://bugzilla.proxmox.com/show_bug.cgi?id=2889

eider · Oct 13, 2020

fabian said:
because Qemu uses libproxmox-backup-qemu0 for the backup itself, and proxmox-backup-client for uploading the client log and pruning. it looks like your VM's libproxmox-backup-qemu0 was built using the old timestamp parser (the lib and proxmox-backup-client use the same code, but are not always published in sync). for containers the whole backup is done with the client binary.

Thanks for explanation. How about the fact that qemu backup performed prune action BEFORE it failed to upload log (due to parse_rfc_3339)?

fabian said:
so removing the .024...14 from both timestamps should fixup the key if you haven't guessed that already

Indeed I figured it out and it solved the issue.

tuxis · Oct 13, 2020

Ok. So I still do not have a clue why things with rsync went south. But I do find something that's curious about the verify-logic. So lets take chunk 16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9 (and disregard HOW this got broken, but only look at steps afterwards).

During a verify run, this chunk was marked as corrupt:

Code:

verify:6068298_proxmox:backup@pam::             2020-10-09T08:13:30+02:00: can't verify chunk, load failed - store '6068298_proxmox', unable to load chunk '16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9' - Data blob has wrong CRC checksum.
verify:6068298_proxmox:backup@pam::             2020-10-09T08:13:30+02:00: corrupted chunk renamed to "/lvm_pbs/6068298/proxmox/.chunks/16fd/16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9.0.bad"
verify:6068298_proxmox:backup@pam::             2020-10-09T10:19:49+02:00: chunk 16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9 was marked as corrupt

This is now four days ago.

After that, garbage_collection failed because the chunk is missing. Which makes sense, because we moved the file away to .bad.

Code:

garbage_collection:6068298_proxmox:backup@pam:: 2020-10-10T06:42:39+02:00: WARN: warning: unable to access chunk 16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9, required by "/lvm_pbs/6068298/proxmox/vm/103/2020-10-05T04:00:02Z/drive-scsi0.img.fidx" - update atime failed for chunk "/lvm_pbs/6068298/proxmox/.chunks/16fd/16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9" - ENOENT: No such file or directory
garbage_collection:6068298_proxmox:backup@pam:: 2020-10-10T06:52:48+02:00: WARN: warning: unable to access chunk 16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9, required by "/lvm_pbs/6068298/proxmox/vm/103/2020-09-28T15:54:38Z/drive-scsi0.img.fidx" - update atime failed for chunk "/lvm_pbs/6068298/proxmox/.chunks/16fd/16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9" - ENOENT: No such file or directory

Later that day, we run verify again:

Code:

verify:6068298_proxmox:backup@pam::             2020-10-10T08:12:41+02:00: can't verify chunk, load failed - store '6068298_proxmox', unable to load chunk '16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9' - No such file or directory (os error 2)
verify:6068298_proxmox:backup@pam::             2020-10-10T09:52:20+02:00: chunk 16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9 was marked as corrupt

And this process keeps repeating itself.

So, if we know a backup is corrupt, why do we keep verifying and garbagecollecting it? If the backup is corrupt, is there really anything we can do about it? Or should it be safe to say, 'this one is broken, we ignore it until you delete it, so we can throw it away'?

fabian · Oct 13, 2020

eider said:
Thanks for explanation. How about the fact that qemu backup performed prune action BEFORE it failed to upload log (due to parse_rfc_3339)?

for vzdump autoamtic pruning (based on storage config and / or vzdump parameters) after backup is still part of that backup task and thus log I think.

fabian · Oct 13, 2020

tuxis said:
Ok. So I still do not have a clue why things with rsync went south. But I do find something that's curious about the verify-logic. So lets take chunk 16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9 (and disregard HOW this got broken, but only look at steps afterwards).

During a verify run, this chunk was marked as corrupt:

Code:

verify:6068298_proxmox:backup@pam:: 2020-10-09T08:13:30+02:00: can't verify chunk, load failed - store '6068298_proxmox', unable to load chunk '16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9' - Data blob has wrong CRC checksum. verify:6068298_proxmox:backup@pam:: 2020-10-09T08:13:30+02:00: corrupted chunk renamed to "/lvm_pbs/6068298/proxmox/.chunks/16fd/16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9.0.bad" verify:6068298_proxmox:backup@pam:: 2020-10-09T10:19:49+02:00: chunk 16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9 was marked as corrupt

This is now four days ago.

After that, garbage_collection failed because the chunk is missing. Which makes sense, because we moved the file away to .bad.

Code:

garbage_collection:6068298_proxmox:backup@pam:: 2020-10-10T06:42:39+02:00: WARN: warning: unable to access chunk 16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9, required by "/lvm_pbs/6068298/proxmox/vm/103/2020-10-05T04:00:02Z/drive-scsi0.img.fidx" - update atime failed for chunk "/lvm_pbs/6068298/proxmox/.chunks/16fd/16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9" - ENOENT: No such file or directory garbage_collection:6068298_proxmox:backup@pam:: 2020-10-10T06:52:48+02:00: WARN: warning: unable to access chunk 16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9, required by "/lvm_pbs/6068298/proxmox/vm/103/2020-09-28T15:54:38Z/drive-scsi0.img.fidx" - update atime failed for chunk "/lvm_pbs/6068298/proxmox/.chunks/16fd/16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9" - ENOENT: No such file or directory

Later that day, we run verify again:

Code:

verify:6068298_proxmox:backup@pam:: 2020-10-10T08:12:41+02:00: can't verify chunk, load failed - store '6068298_proxmox', unable to load chunk '16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9' - No such file or directory (os error 2) verify:6068298_proxmox:backup@pam:: 2020-10-10T09:52:20+02:00: chunk 16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9 was marked as corrupt

And this process keeps repeating itself.

So, if we know a backup is corrupt, why do we keep verifying and garbagecollecting it? If the backup is corrupt, is there really anything we can do about it? Or should it be safe to say, 'this one is broken, we ignore it until you delete it, so we can throw it away'?

the idea is that a subsequent backup can re-upload (or a sync from off-site PBS that has the proper chunk can re-sync) that chunk (with the correct content), which would then get detected by the next verify, and the backup returns to a 'verified ok' state. of course you can also go ahead and 'forget' all the broken snapshots.

eider · Oct 13, 2020

fabian said:
for vzdump autoamtic pruning (based on storage config and / or vzdump parameters) after backup is still part of that backup task and thus log I think.

Yes however it would be nice if it happened after log upload so that pruning does not end removing old working backup (someone might specify 1 backup for example) before ensuring that new one is indeed fine.

I'd also love to see PBS having some way to alert user via email on failed verify on their backup and for PBS to expose backup state back to Proxmox. Right now to see if backup is OK or broken I have to log onto PBS instance or attempt to restore it on Proxmox to see if it fails or succeeds.

fabian · Oct 14, 2020

eider said:
Yes however it would be nice if it happened after log upload so that pruning does not end removing old working backup (someone might specify 1 backup for example) before ensuring that new one is indeed fine.

pruning ignores unfinished backups, so there should not be a problem there. if the verification detected that a chunk is missing or corrupt, the next backup will re-upload that chunk instead of re-using the (corrupt/non-existent) one from the server.

so there are two cases:
- backup is broken but PBS does not know it yet, client will potentially re-use broken chunks and generate broken backups (this could only be caught by doing a full verify immediately after/as final part of the backup, which is quite costly as it needs to read and hash all the referenced chunks. we might still offer this in the future as super paranoid mode

)
- backup is broken and PBS knows it, client will go into "no previous backup exists" and upload all chunks to generate a good backup

eider said:
I'd also love to see PBS having some way to alert user via email on failed verify on their backup and for PBS to expose backup state back to Proxmox. Right now to see if backup is OK or broken I have to log onto PBS instance or attempt to restore it on Proxmox to see if it fails or succeeds.

yes on both counts there!

TwiX · Oct 14, 2020

Hi,

Tuxis Pbs was unavailable yesterday ?
My cluster is up to date but I had to kill proxmox-backup-client status this morning.

eider · Oct 14, 2020

For what it's worth, I've never HAD to manually kill proxmox-backup-client or pvestatd. While it is true that if PBS is down, proxmox-backup-client will stall pvestatd awaiting for reply (and in fact I DID kill it once) it never locked itself for longer than PBS being down and never in a situation where no one is actively querying PBS status via API/Web UI or backup is not being performed. I have pvestatd configured to send metrics to Influx so I would see gaps on graphs if it did happen.

That being said, pvestatd really needs safeguards against this (ideally it should be reworked to be multi-threaded); it was fine so long it was all local, but once PBS enters the equation it really is no good anymore, and I can imagine that lot of people will opt to use external PBS server.

Verax · Oct 14, 2020

Tried to restore a backup on a failed VM

I get this?

Code:

Logical volume "vm-105-disk-0" created.
new volume ID is 'local-lvm:vm-105-disk-0'
restore proxmox backup image: /usr/bin/pbs-restore --repository 6068298@pbs@pbs.tuxis.nl:6068298_proxmox vm/103/2020-09-28T15:54:38Z drive-scsi0.img.fidx /dev/pve/vm-105-disk-0 --verbose --format raw --keyfile /etc/pve/priv/storage/Tuxis-Test.enc --skip-zero
connecting to repository '6068298@pbs@pbs.tuxis.nl:6068298_proxmox'
open block backend for target '/dev/pve/vm-105-disk-0'
starting to restore snapshot 'vm/103/2020-09-28T15:54:38Z'
download and verify backup index
restore failed: reading file "/lvm_pbs/6068298/proxmox/.chunks/16fd/16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9" failed: No such file or directory (os error 2)
  Logical volume "vm-105-disk-0" successfully removed
temporary volume 'local-lvm:vm-105-disk-0' sucessfuly removed
TASK ERROR: command '/usr/bin/pbs-restore --repository 6068298@pbs@pbs.tuxis.nl:6068298_proxmox vm/103/2020-09-28T15:54:38Z drive-scsi0.img.fidx /dev/pve/vm-105-disk-0 --verbose --format raw --keyfile /etc/pve/priv/storage/Tuxis-Test.enc --skip-zero' failed: exit code 255

Anyway to restore or are both my backups bad?

tuxis · Oct 16, 2020

Verax said:

Tried to restore a backup on a failed VM

I get this?

Code:

Logical volume "vm-105-disk-0" created.
new volume ID is 'local-lvm:vm-105-disk-0'
restore proxmox backup image: /usr/bin/pbs-restore --repository 6068298@pbs@pbs.tuxis.nl:6068298_proxmox vm/103/2020-09-28T15:54:38Z drive-scsi0.img.fidx /dev/pve/vm-105-disk-0 --verbose --format raw --keyfile /etc/pve/priv/storage/Tuxis-Test.enc --skip-zero
connecting to repository '6068298@pbs@pbs.tuxis.nl:6068298_proxmox'
open block backend for target '/dev/pve/vm-105-disk-0'
starting to restore snapshot 'vm/103/2020-09-28T15:54:38Z'
download and verify backup index
restore failed: reading file "/lvm_pbs/6068298/proxmox/.chunks/16fd/16fde23122afe9559c9f6d3109f693fa85574d8d758123010b2e83c88f863db9" failed: No such file or directory (os error 2)
  Logical volume "vm-105-disk-0" successfully removed
temporary volume 'local-lvm:vm-105-disk-0' sucessfuly removed
TASK ERROR: command '/usr/bin/pbs-restore --repository 6068298@pbs@pbs.tuxis.nl:6068298_proxmox vm/103/2020-09-28T15:54:38Z drive-scsi0.img.fidx /dev/pve/vm-105-disk-0 --verbose --format raw --keyfile /etc/pve/priv/storage/Tuxis-Test.enc --skip-zero' failed: exit code 255

Anyway to restore or are both my backups bad?

Yes. According to the PBS verify, your backups are bad. A verify has renamed the chunks that are bad, and thus you get 'No such file or directory'. It's kinda weird that you are able to start restoring from a failed backup. I'll create a bug for that.

tuxis · Oct 16, 2020

https://bugzilla.proxmox.com/show_bug.cgi?id=3078

Verax · Oct 16, 2020

Thanks for the help Tuxis, I just spent a day to rebuild the VM. Question, why would both my backups be bad? Is there a certain way to do PBS backups or was I just unlucky?

Tuxis launches free Proxmox Backup Server BETA service

Renowned Member

Well-Known Member

Well-Known Member

Famous Member

Well-Known Member

Proxmox Staff Member

Proxmox Staff Member

Famous Member

Well-Known Member

Famous Member

Proxmox Staff Member

Proxmox Staff Member

Well-Known Member

Proxmox Staff Member

Renowned Member

Well-Known Member

Member

Famous Member

Famous Member

Member

We value your privacy