[SOLVED] live recovery progress

DrillSgtErnst · Jun 22, 2022

Hi,
After a fatal crash yesterday I had to recover all VMs from backup.

I tried Live Recovery and all servers are running fine so far, but I have no idea how far the restore has progressed.
The task finished quite early and the machines started.
I cannot back them up because they are locked (create), and I fear that shutting them down will destroy the machines.
Is there any way to verify that the recovery has completed?

dcsapak · Jun 22, 2022

what do the task logs say? if those are finished, the restore should be finished and the locks should be removed
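
One way to check the lock from the shell: `qm config <vmid>` prints the VM configuration, and a VM that is still locked should have a `lock:` line in it (for the case described here, presumably `lock: create`). A minimal Python sketch that looks for that line in the printed text. The helper name `find_lock` and the sample config are made up for illustration and are not taken from this thread:

```python
def find_lock(config_text):
    """Return the value of the 'lock' key in a VM config, or None if absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A section header (e.g. a snapshot) starts here; only the
            # current config above it is of interest.
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "lock":
            return value.strip()
    return None


# Hypothetical config text, shaped like the "key: value" output of `qm config`.
sample = """boot: order=virtio0
lock: create
memory: 8192
name: example-vm
"""
print(find_lock(sample))  # create
```

If this returns `None` for the real config, the VM is not locked anymore.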

DrillSgtErnst · Jun 22, 2022

All I can see is this

After that the prompt closed and the server started.

dcsapak · Jun 22, 2022

seems the task was interrupted... did you reboot the server while restoring? if not, does the journal/syslog maybe contain info about what might have happened?

DrillSgtErnst · Jun 22, 2022

Yeah the task always failed on the first tries. I had to start them two times.

Using encryption key from file descriptor..
Fingerprint: fc:30:23:0f:e8:ec:13:22
Using encryption key from file descriptor..
Fingerprint: fc:30:23:0f:e8:ec:13:22
rbd rm 'vm-205-disk-1' error: interrupted by signal

This was happening for 2 hours straight.

This happened on the second try:
Using encryption key from file descriptor..
Fingerprint: fc:30:23:0f:e8:ec:13:22
Using encryption key from file descriptor..
Fingerprint: fc:30:23:0f:e8:ec:13:22
new volume ID is 'ceph_fail1:vm-205-disk-0'
new volume ID is 'ceph_fail1:vm-205-disk-1'
restore proxmox backup image: /usr/bin/pbs-restore --repository backup@pbs@172.20.14.18:cephbackup vm/205/2022-06-21T10:30:04Z drive-efidisk0.img.fidx /dev/zvol/ceph_fail1/vm-205-disk-0 --verbose --format raw --keyfile /etc/pve/priv/storage/pvebackup.enc --skip-zero
connecting to repository 'backup@pbs@172.20.14.18:cephbackup'
open block backend for target '/dev/zvol/ceph_fail1/vm-205-disk-0'
starting to restore snapshot 'vm/205/2022-06-21T10:30:04Z'
download and verify backup index
progress 100% (read 540672 bytes, zeroes = 0% (0 bytes), duration 0 sec)
restore image complete (bytes=540672, duration=0.14s, speed=3.78MB/s)
rescan volumes...
got interrupt - ignored
rbd error: got signal 15
starting VM for live-restore
repository: 'backup@pbs@172.20.14.18:cephbackup', snapshot: 'vm/205/2022-06-21T10:30:04Z'
restoring 'drive-virtio0' to 'ceph_fail1:vm-205-disk-1'
restore-drive-virtio0: transferred 0.0 B of 400.0 GiB (0.00%) in 0s
restore-drive-virtio0: transferred 508.0 MiB of 400.0 GiB (0.12%) in 1s
restore-drive-virtio0: transferred 532.0 MiB of 400.0 GiB (0.13%) in 2s
restore-drive-virtio0: transferred 636.0 MiB of 400.0 GiB (0.16%) in 3s

The machines are running. I am just scared of data loss. If I shut them down now, will they come back?
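
The `restore-drive-...: transferred X of Y (N%)` lines in the log above are the only progress indicator shown in this thread. A rough Python sketch that pulls the last reported percentage per drive out of a saved task log. It assumes the line format stays exactly as in the excerpt and that a finished restore reports 100%; the thread does not show what the final lines of a completed task look like, so treat `restore_finished` as a guess. Both function names are hypothetical:

```python
import re

# Matches lines such as:
# restore-drive-virtio0: transferred 636.0 MiB of 400.0 GiB (0.16%) in 3s
PROGRESS = re.compile(
    r"^restore-(?P<drive>[^:\s]+): transferred .+ of .+ \((?P<pct>[\d.]+)%\)"
)


def latest_progress(log_text):
    """Map each drive name to the last percentage reported in the log."""
    progress = {}
    for line in log_text.splitlines():
        match = PROGRESS.match(line.strip())
        if match:
            progress[match.group("drive")] = float(match.group("pct"))
    return progress


def restore_finished(log_text):
    """Assumption: a restore is done once every drive has reported 100%."""
    progress = latest_progress(log_text)
    return bool(progress) and all(p >= 100.0 for p in progress.values())


log = """restoring 'drive-virtio0' to 'ceph_fail1:vm-205-disk-1'
restore-drive-virtio0: transferred 0.0 B of 400.0 GiB (0.00%) in 0s
restore-drive-virtio0: transferred 508.0 MiB of 400.0 GiB (0.12%) in 1s
restore-drive-virtio0: transferred 532.0 MiB of 400.0 GiB (0.13%) in 2s
restore-drive-virtio0: transferred 636.0 MiB of 400.0 GiB (0.16%) in 3s
"""
print(latest_progress(log))   # {'drive-virtio0': 0.16}
print(restore_finished(log))  # False
```

For the excerpt above this reports 0.16% for `drive-virtio0`, which matches the worry that the task log stopped long before the 400 GiB disk was copied.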

DrillSgtErnst · Jun 23, 2022

So the answer is: so far every one of them came back after a shutdown, and I could move the disks to another storage. Everything works fine, even though it does not look that way.

dcsapak · Jun 23, 2022

DrillSgtErnst said:
Yeah the task always failed on the first tries. I had to start them two times.

did only the task fail? or did the qemu process crash?
can you post the task log of such a failed restore?

DrillSgtErnst said:
rbd error: got signal 15

seems like something killed the rbd process? can you post the journal from that time period?

DrillSgtErnst said:
So the answer is: so far every one of them came back after a shutdown, and I could move the disks to another storage. Everything works fine, even though it does not look that way.

great, i'd still investigate why the tasks fail in the first place...
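
For the journal request above, `journalctl` can be limited to a time window with `--since` and `--until`. A small Python sketch that builds such a call. The timestamps are placeholders, because the thread does not state when the failed restores ran, and `journal_command` is a hypothetical helper name:

```python
import shlex


def journal_command(since, until):
    """Build a journalctl call limited to the given time window."""
    return ["journalctl", "--since", since, "--until", until, "--no-pager"]


# Placeholder window; replace it with the time the failed restore task ran.
cmd = journal_command("2022-06-22 08:00:00", "2022-06-22 12:00:00")
print(shlex.join(cmd))
```

The printed command can be pasted into a root shell on the node, or the list can be passed to `subprocess.run` directly.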

DrillSgtErnst · Jun 23, 2022

So tbh I think I know the reason. The machines' hard drives reside on a crashed Ceph system.
I guess this deadlocks the process in the first place, because it cannot get information about the machine from the drive.

Code:

Using encryption key from file descriptor..
Fingerprint: fc:30:23:0f:e8:ec:13:22
Using encryption key from file descriptor..
Fingerprint: fc:30:23:0f:e8:ec:13:22
rbd error: 'storage-ptvceph'-locked command timed out - aborting
rbd error: 'storage-ptvceph'-locked command timed out - aborting
