Cannot restore container from Backup Server after corrupted HA failover

May 21, 2020
54
13
13
35
Due to a network outage, my cluster tried to start a HA managed container on another node which seems to have failed.
The CT was in errored state, so I deleted the HA entry for that container.

The errored state went away, CT was still stopped though
Tried to start the CT. The CT could not start because it's disk file was not present on the node where the CT was showing up at.
But the disk existed in another node (zfs-local drive)

Deleted the CT and the leftover disks of the CT.
Went to the backups of the CT on the attached proxmox backup server and started the restore.

The following logs were printed but even over the duration of the whole night, it did not progress a bit.

Code:
recovering backed-up configuration from 'backup.example.com:backup/ct/100/2022-09-21T23:00:01Z'
Using encryption key from file descriptor..
Fingerprint: 34:1d:00:b2:c6:bd:c4:ed

What should I do now?
 
is the task still running? is there a corresponding PBS server task running on the PBS system?
 
I cancelled the task the next morning.
There was no task on the PBS running when I started the restore as far as I can remember
 
is it reproducible? if it is, is attempting to start a restore visible on the PBS side (e.g., access logs)?
 
Just restarted the restore process once more 5 minutes ago.

A task "Datastore data Read Objects ct/100/2022-09-21T23:00:01Z" appeared.

These are the logs so far

Code:
2022-10-03T23:41:50+02:00: starting new backup reader datastore 'data': "/rpool/data"
2022-10-03T23:41:50+02:00: protocol upgrade done
2022-10-03T23:41:50+02:00: GET /download
2022-10-03T23:41:50+02:00: download "/rpool/data/ct/100/2022-09-21T23:00:01Z/index.json.blob"
2022-10-03T23:41:50+02:00: GET /download
2022-10-03T23:41:50+02:00: download "/rpool/data/ct/100/2022-09-21T23:00:01Z/root.pxar.didx"
2022-10-03T23:41:50+02:00: register chunks in 'root.pxar.didx' as downloadable.
2022-10-03T23:41:50+02:00: GET /chunk
2022-10-03T23:41:50+02:00: download chunk "/rpool/data/.chunks/4ed8/4ed809d2e31300facda5d3612ecbbbc077237e2cdffd48451d23b31049a28049"
2022-10-03T23:41:51+02:00: GET /chunk
2022-10-03T23:41:51+02:00: download chunk "/rpool/data/.chunks/ee60/ee60acfcd753ffb8ae9eb81b9046b0d416cd706f3491f1ee58b1e555f7dedb99"
2022-10-03T23:41:52+02:00: GET /chunk
2022-10-03T23:41:52+02:00: download chunk "/rpool/data/.chunks/e079/e07970495175b6ea6df0ad63502b0dd9d8bdb025d9c883ae32e547764b531518"
[...]
2022-10-03T23:43:27+02:00: GET /chunk
2022-10-03T23:43:27+02:00: download chunk "/rpool/data/.chunks/824f/824faef60bdd58604b831cec63ccaaf70d60cc0ce77b45caeacdf3c5adfdd98d"
2022-10-03T23:43:27+02:00: GET /chunk
2022-10-03T23:43:27+02:00: download chunk "/rpool/data/.chunks/d209/d2096c8654ccce6b83e9d4d7750addd866c050dfa8bb0c33eca759b9a1e6e697"
2022-10-03T23:43:28+02:00: GET /chunk
2022-10-03T23:43:28+02:00: download chunk "/rpool/data/.chunks/2b6a/2b6a1c1f1bc154f6b458bf8f1addbed6b40e75ac9d8dde9425d63038da28b69f"
2022-10-03T23:43:28+02:00: GET /chunk
2022-10-03T23:43:28+02:00: download chunk "/rpool/data/.chunks/fb0d/fb0d2069a01f0b1a20ab42587c08e6589a677cdab7ba7badf05f04e9914a8144"
2022-10-03T23:43:29+02:00: GET /chunk
2022-10-03T23:43:29+02:00: download chunk "/rpool/data/.chunks/1ea9/1ea975aba45ec4309afa96595e63df28acca3dc49f3eb8e8f379c2dade573c36"
2022-10-03T23:43:30+02:00: GET /chunk
2022-10-03T23:43:30+02:00: download chunk "/rpool/data/.chunks/7c46/7c46620b091599f315e5394ed40b0f818c108fd4c246a364ae46ee207bd58cd6"
2022-10-03T23:43:31+02:00: GET /chunk
2022-10-03T23:43:31+02:00: download chunk "/rpool/data/.chunks/7fa1/7fa12ef4c808280341326bd2c13e42c163a9827608d2aa490dd824c642eb4ad6"
2022-10-03T23:43:32+02:00: GET /chunk
2022-10-03T23:43:32+02:00: download chunk "/rpool/data/.chunks/7530/75308bbddce306230ee78093e1f656a95fc4dd88bf648bc679666fa6da418f18"
2022-10-03T23:43:33+02:00: GET /chunk
2022-10-03T23:43:33+02:00: download chunk "/rpool/data/.chunks/a80f/a80feb7d0bebdce18625a8c7ee2bde288643337d5a0658069f0057916770006c"
2022-10-03T23:43:33+02:00: GET /chunk
2022-10-03T23:43:33+02:00: download chunk "/rpool/data/.chunks/2495/2495eb2b90e842cdb74a1489a90cc44d064d1695b01d1315f687fcfc914ab276"
 
4 Minutes later, still no new chunk has been read.

The logs of the restore in Proxmox VE so far are:


Code:
Task viewer: CT 100 - Restore

recovering backed-up configuration from '<SERVER>:backup/ct/100/2022-09-21T23:00:01Z'
Using encryption key from file descriptor..
Fingerprint: 34:1d:00:b2:c6:bd:c4:ed
/dev/rbd0
Creating filesystem with 3932160 4k blocks and 983040 inodes
Filesystem UUID: dbdbbd40-7c2d-4155-94f0-6805f6c610a2
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208
restoring '<SERVER>:backup/ct/100/2022-09-21T23:00:01Z' now..
Using encryption key from file descriptor..
Fingerprint: 34:1d:00:b2:c6:bd:c4:ed
 
that sounds like something is amiss with your network possibly? is the restore process still running on the PVE side?
 
Yes, the restore is still in progress on the PVE side and I can still browse the available backups list on the PBS storage from that node
 
PS:

I cancelled the restore on the PVE node now. The logs of the restore are now

Code:
recovering backed-up configuration from '<SERVER>:backup/ct/100/2022-09-21T23:00:01Z'

Using encryption key from file descriptor..

Fingerprint: 34:1d:00:b2:c6:bd:c4:ed

/dev/rbd0

Creating filesystem with 3932160 4k blocks and 983040 inodes

Filesystem UUID: dbdbbd40-7c2d-4155-94f0-6805f6c610a2

Superblock backups stored on blocks:

    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208

restoring '<SERVER>:backup/ct/100/2022-09-21T23:00:01Z' now..

Using encryption key from file descriptor..

Fingerprint: 34:1d:00:b2:c6:bd:c4:ed

umount: /var/lib/lxc/100/rootfs: target is busy.

command 'umount -d /var/lib/lxc/100/rootfs/' failed: exit code 32

rbd: sysfs write failed

can't unmap rbd device /dev/rbd-pve/da46a94d-9d40-45e9-b0e4-95ed6ea51633/cephfs_data/vm-100-disk-0: rbd: sysfs write failed

rbd: sysfs write failed

can't unmap rbd device /dev/rbd-pve/da46a94d-9d40-45e9-b0e4-95ed6ea51633/cephfs_data/vm-100-disk-0: rbd: sysfs write failed

TASK ERROR: unable to restore CT 100 - volume deactivation failed: cephVMStorageV3:vm-100-disk-0 at /usr/share/perl5/PVE/Storage.pm line 1231.
 
okay, so I see two possible causes for this
- the network being funky (i.e., really really really slow without breaking the connection entirely)
- your ceph storage blocking

can you check the ceph status and the journal on the PVE side when the issue occurs? are any other guests on the ceph cluster affected?
 
Indeed there is a ceph problem in the cluster:

Code:
Full OSDs blocking recovery: 1 pg recovery_toofull

I will try to resolve that and then restart the restore
 
Freeing up enough storage fixed the problem. The disk of the container with the config in the backed up state was indeed targeted to the ceph storage that was full.

I feel like I would have never found the cause of the problem myself.

Wouldn't it be possible for PVE to detect when the target storage of any attached disk of the restored container or VM is full before or during the restore and display an error message pointing the right direction to solve the problem more quickly?
 
well, most storages would just give you an error when they are full, but some (especially thin provisioned) storages really don't like getting full while at the same time making it hard to estimate up front whether they'd become full by an operation.. I assume in this case it was also just a single OSD that became too full?
 
the ceph logs if they are still available
 
the logs are in /var/log/ceph (on each node where an OSD or MON was running)
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!