Cannot restore container from Backup Server after corrupted HA failover

Due to a network outage, my cluster tried to start an HA-managed container on another node, which seems to have failed.
The CT ended up in an errored state, so I deleted the HA entry for that container.
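
For anyone else in the same spot: the HA state can also be inspected and the entry removed from the CLI, roughly like this (the service ID ct:100 is an assumption based on the CT ID used later in this thread):

Code:
# show all HA-managed resources and their current state
ha-manager status
# remove the HA resource for the container (service ID assumed to be ct:100)
ha-manager remove ct:100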

The errored state went away, but the CT remained stopped.
I tried to start the CT, but it could not start because its disk was not present on the node where the CT was now showing up.
The disk still existed on another node (local ZFS storage).
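
To confirm on which node a CT volume actually lives, something like this can be used (the storage name local-zfs is an assumption):

Code:
# run on each node: list the volumes this storage knows about
pvesm list local-zfs
# or inspect the ZFS datasets directly
zfs list | grep -- '-100-'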

I deleted the CT and its leftover disks.
Then I went to the backups of the CT on the attached Proxmox Backup Server and started the restore.
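
As a side note, the same restore can also be started from the CLI, which is sometimes handier for watching output (the target storage name is an assumption; the volume ID is taken from the log below):

Code:
# restore CT 100 from the PBS-backed storage (target storage assumed)
pct restore 100 backup.example.com:backup/ct/100/2022-09-21T23:00:01Z --storage local-zfs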

The following logs were printed, but even over the course of the whole night the restore did not progress at all.

Code:
recovering backed-up configuration from 'backup.example.com:backup/ct/100/2022-09-21T23:00:01Z'
Using encryption key from file descriptor..
Fingerprint: 34:1d:00:b2:c6:bd:c4:ed

What should I do now?
 
Is the task still running? Is there a corresponding task running on the PBS system?
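
On the PBS host itself, running and recent tasks can also be listed from the shell, roughly like this (a sketch; the exact output differs between versions):

Code:
# list recent tasks on the PBS host
proxmox-backup-manager task list
# follow the log of a specific task, using the UPID from the list above
proxmox-backup-manager task log <UPID>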
 
I cancelled the task the next morning.
As far as I can remember, there was no task running on the PBS when I started the restore.
 
Is it reproducible? If it is, is the attempt to start a restore visible on the PBS side (e.g., in the access logs)?
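
For the access-log check, tailing the PBS API log while starting the restore should show whether the request arrives at all (the log path is an assumption about a default PBS setup):

Code:
# on the PBS host: watch incoming API requests while the restore is started
tail -f /var/log/proxmox-backup/api/access.log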
 
I just restarted the restore process 5 minutes ago.

A task "Datastore data Read Objects ct/100/2022-09-21T23:00:01Z" appeared.

These are the logs so far:

Code:
2022-10-03T23:41:50+02:00: starting new backup reader datastore 'data': "/rpool/data"
2022-10-03T23:41:50+02:00: protocol upgrade done
2022-10-03T23:41:50+02:00: GET /download
2022-10-03T23:41:50+02:00: download "/rpool/data/ct/100/2022-09-21T23:00:01Z/index.json.blob"
2022-10-03T23:41:50+02:00: GET /download
2022-10-03T23:41:50+02:00: download "/rpool/data/ct/100/2022-09-21T23:00:01Z/root.pxar.didx"
2022-10-03T23:41:50+02:00: register chunks in 'root.pxar.didx' as downloadable.
2022-10-03T23:41:50+02:00: GET /chunk
2022-10-03T23:41:50+02:00: download chunk "/rpool/data/.chunks/4ed8/4ed809d2e31300facda5d3612ecbbbc077237e2cdffd48451d23b31049a28049"
2022-10-03T23:41:51+02:00: GET /chunk
2022-10-03T23:41:51+02:00: download chunk "/rpool/data/.chunks/ee60/ee60acfcd753ffb8ae9eb81b9046b0d416cd706f3491f1ee58b1e555f7dedb99"
2022-10-03T23:41:52+02:00: GET /chunk
2022-10-03T23:41:52+02:00: download chunk "/rpool/data/.chunks/e079/e07970495175b6ea6df0ad63502b0dd9d8bdb025d9c883ae32e547764b531518"
[...]
2022-10-03T23:43:27+02:00: GET /chunk
2022-10-03T23:43:27+02:00: download chunk "/rpool/data/.chunks/824f/824faef60bdd58604b831cec63ccaaf70d60cc0ce77b45caeacdf3c5adfdd98d"
2022-10-03T23:43:27+02:00: GET /chunk
2022-10-03T23:43:27+02:00: download chunk "/rpool/data/.chunks/d209/d2096c8654ccce6b83e9d4d7750addd866c050dfa8bb0c33eca759b9a1e6e697"
2022-10-03T23:43:28+02:00: GET /chunk
2022-10-03T23:43:28+02:00: download chunk "/rpool/data/.chunks/2b6a/2b6a1c1f1bc154f6b458bf8f1addbed6b40e75ac9d8dde9425d63038da28b69f"
2022-10-03T23:43:28+02:00: GET /chunk
2022-10-03T23:43:28+02:00: download chunk "/rpool/data/.chunks/fb0d/fb0d2069a01f0b1a20ab42587c08e6589a677cdab7ba7badf05f04e9914a8144"
2022-10-03T23:43:29+02:00: GET /chunk
2022-10-03T23:43:29+02:00: download chunk "/rpool/data/.chunks/1ea9/1ea975aba45ec4309afa96595e63df28acca3dc49f3eb8e8f379c2dade573c36"
2022-10-03T23:43:30+02:00: GET /chunk
2022-10-03T23:43:30+02:00: download chunk "/rpool/data/.chunks/7c46/7c46620b091599f315e5394ed40b0f818c108fd4c246a364ae46ee207bd58cd6"
2022-10-03T23:43:31+02:00: GET /chunk
2022-10-03T23:43:31+02:00: download chunk "/rpool/data/.chunks/7fa1/7fa12ef4c808280341326bd2c13e42c163a9827608d2aa490dd824c642eb4ad6"
2022-10-03T23:43:32+02:00: GET /chunk
2022-10-03T23:43:32+02:00: download chunk "/rpool/data/.chunks/7530/75308bbddce306230ee78093e1f656a95fc4dd88bf648bc679666fa6da418f18"
2022-10-03T23:43:33+02:00: GET /chunk
2022-10-03T23:43:33+02:00: download chunk "/rpool/data/.chunks/a80f/a80feb7d0bebdce18625a8c7ee2bde288643337d5a0658069f0057916770006c"
2022-10-03T23:43:33+02:00: GET /chunk
2022-10-03T23:43:33+02:00: download chunk "/rpool/data/.chunks/2495/2495eb2b90e842cdb74a1489a90cc44d064d1695b01d1315f687fcfc914ab276"
 
Four minutes later, still no new chunk has been read.

The logs of the restore in Proxmox VE so far are:


Code:
Task viewer: CT 100 - Restore

recovering backed-up configuration from '<SERVER>:backup/ct/100/2022-09-21T23:00:01Z'
Using encryption key from file descriptor..
Fingerprint: 34:1d:00:b2:c6:bd:c4:ed
/dev/rbd0
Creating filesystem with 3932160 4k blocks and 983040 inodes
Filesystem UUID: dbdbbd40-7c2d-4155-94f0-6805f6c610a2
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208
restoring '<SERVER>:backup/ct/100/2022-09-21T23:00:01Z' now..
Using encryption key from file descriptor..
Fingerprint: 34:1d:00:b2:c6:bd:c4:ed
 
That sounds like something might be amiss with your network. Is the restore process still running on the PVE side?
 
Yes, the restore is still in progress on the PVE side, and I can still browse the list of available backups on the PBS storage from that node.
 
PS:

I have now cancelled the restore on the PVE node. The logs of the restore now read:

Code:
recovering backed-up configuration from '<SERVER>:backup/ct/100/2022-09-21T23:00:01Z'
Using encryption key from file descriptor..
Fingerprint: 34:1d:00:b2:c6:bd:c4:ed
/dev/rbd0
Creating filesystem with 3932160 4k blocks and 983040 inodes
Filesystem UUID: dbdbbd40-7c2d-4155-94f0-6805f6c610a2
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208
restoring '<SERVER>:backup/ct/100/2022-09-21T23:00:01Z' now..
Using encryption key from file descriptor..
Fingerprint: 34:1d:00:b2:c6:bd:c4:ed
umount: /var/lib/lxc/100/rootfs: target is busy.
command 'umount -d /var/lib/lxc/100/rootfs/' failed: exit code 32
rbd: sysfs write failed
can't unmap rbd device /dev/rbd-pve/da46a94d-9d40-45e9-b0e4-95ed6ea51633/cephfs_data/vm-100-disk-0: rbd: sysfs write failed
rbd: sysfs write failed
can't unmap rbd device /dev/rbd-pve/da46a94d-9d40-45e9-b0e4-95ed6ea51633/cephfs_data/vm-100-disk-0: rbd: sysfs write failed
TASK ERROR: unable to restore CT 100 - volume deactivation failed: cephVMStorageV3:vm-100-disk-0 at /usr/share/perl5/PVE/Storage.pm line 1231.
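
For reference, when the rootfs mount and RBD mapping stay stuck like this after a cancelled restore, a manual cleanup along these lines can work once nothing holds the mount busy any more (the device path is copied from the error above; the fuser check is just one way to find the blocker):

Code:
# see which processes keep the mountpoint busy
fuser -vm /var/lib/lxc/100/rootfs
# once it is free, unmount and unmap the RBD device
umount /var/lib/lxc/100/rootfs
rbd unmap /dev/rbd-pve/da46a94d-9d40-45e9-b0e4-95ed6ea51633/cephfs_data/vm-100-disk-0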
 
Okay, so I see two possible causes for this:
- the network being flaky (i.e., really slow without breaking the connection entirely)
- your Ceph storage blocking

Can you check the Ceph status and the journal on the PVE side when the issue occurs? Are any other guests on the Ceph cluster affected?
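
A minimal sketch for checking both while the restore is stalled:

Code:
# overall Ceph health; full/nearfull and blocked-recovery warnings show up here
ceph -s
ceph health detail
# follow the journal on the PVE node while the restore hangs
journalctl -f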
 
Indeed, there is a Ceph problem in the cluster:

Code:
Full OSDs blocking recovery: 1 pg recovery_toofull

I will try to resolve that and then restart the restore.
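
A quick way to see which OSD or pool is actually running full:

Code:
# per-OSD utilisation; the (near)full OSD stands out here
ceph osd df
# pool-level view of used and available space
ceph df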
 
Freeing up enough storage fixed the problem. The disk of the container, as configured in the backed-up state, was indeed targeted at the Ceph storage that was full.

I feel like I would never have found the cause of the problem myself.

Wouldn't it be possible for PVE to detect, before or during the restore, that the target storage of any attached disk of the restored container or VM is full, and to display an error message pointing in the right direction, so the problem can be solved more quickly?
 
Well, most storages would just give you an error when they are full, but some (especially thin-provisioned ones) really don't like getting full, while at the same time making it hard to estimate up front whether an operation would fill them up. I assume in this case it was also just a single OSD that became too full?
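
As a workaround, checking the restore target manually before starting is straightforward (the storage ID is taken from the error message earlier in this thread):

Code:
# free space of the target storage as PVE sees it
pvesm status --storage cephVMStorageV3
# plus the Ceph view, since thin provisioning can hide how full the pool really is
ceph df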
 
Could you also post the Ceph logs, if they are still available?
 
The logs are in /var/log/ceph (on each node where an OSD or MON was running).
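
For example (the OSD ID in the journalctl call is a placeholder):

Code:
# log files written by the Ceph daemons on this node
ls -l /var/log/ceph/
# or via the journal, e.g. for a specific OSD
journalctl -u ceph-osd@0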
 
