When a machine fails, recovery requires several tricks at least 50% of the time.
I am using Proxmox VE 5.4-13. A machine failed and was isolated, etc. Some machines migrated properly, but three are in a "difficult" state.
Two instances migrated to a machine that is not even in the HA cluster defined for them. One uses a raw image (not synced to that machine) and the other Ceph (not available on that machine).
One instance is on the right machine, but does not start (systemctl status pve-container@130.service):
Code:
May 19 04:42:11 p3 lxc-start[24666]: lxc-start: 130: lxccontainer.c: wait_on_daemonized_start: 856 No such file or directory - Failed to receive the container state
May 19 04:42:11 p3 lxc-start[24666]: lxc-start: 130: tools/lxc_start.c: main: 330 The container failed to start
May 19 04:42:11 p3 lxc-start[24666]: lxc-start: 130: tools/lxc_start.c: main: 333 To get more details, run the container in foreground mode
May 19 04:42:11 p3 lxc-start[24666]: lxc-start: 130: tools/lxc_start.c: main: 336 Additional information can be obtained by setting the --logfile and --logpriority options
May 19 04:42:11 p3 systemd[1]: pve-container@130.service: Control process exited, code=exited status=1
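The log itself points at the next diagnostic step: run the container in the foreground with a log file. A minimal sketch, assuming container ID 130 as above (paths and flags per standard lxc-start usage, not taken from the post):

```shell
# Run container 130 in the foreground (-F) with debug-level logging,
# writing the log to a file we can inspect afterwards.
lxc-start -n 130 -F -l DEBUG -o /tmp/lxc-130.log

# Then read the log for the actual failure reason (missing rootfs,
# unavailable storage, etc.):
less /tmp/lxc-130.log
```

On a container whose disk is on storage that is absent from the node, the debug log typically shows the mount of the rootfs failing, which confirms the disk-placement problem rather than a container-level fault.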
What I would like to do is move these machines back to the machine that rebooted, without syncing/moving the disk image (the image is not available on one machine, and possibly out of sync for machine no. 130 above).
1. Any idea how to do that?
[EDIT: I managed to move the Ceph-based machine without a hassle, and I moved the raw-image-based machine using my procedure below. I fixed the ZFS machine by starting it on the server it was located on, after disabling HA. An answer to this question would still be useful for a future failure.]
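Disabling HA for a single resource, so it can be started by hand on the node where its disk actually lives, can be done with ha-manager. A sketch, assuming the container ID 130 from above (substitute vm:ID for a VM):

```shell
# Take the container out of HA management so the CRM stops trying
# to place/start it:
ha-manager set ct:130 --state disabled

# Start it manually on the node that has the disk:
pct start 130

# Once things are back to normal, hand it back to HA:
ha-manager set ct:130 --state started
```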
2. Any idea why the machines moved to a server not in the HA cluster configuration (or how to find out why)?
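One way to investigate question 2 is to inspect the HA group configuration and the HA manager logs. Note that in Proxmox VE, an HA group that is not marked "restricted" only expresses a preference: if no group member is available, resources may be recovered onto nodes outside the group. A sketch for checking this (file paths are the standard pmxcfs locations):

```shell
# HA group definitions: look for a "restricted" flag on your group.
cat /etc/pve/ha/groups.cfg

# Which resources are assigned to which group:
cat /etc/pve/ha/resources.cfg

# Current view of the HA manager (master, service states, placement):
ha-manager status

# Recovery decisions are logged by the CRM/LRM services:
journalctl -u pve-ha-crm -u pve-ha-lrm --since yesterday
</imports>
```

If the group lacks `restricted`, that alone would explain machines landing on nodes outside the group during a failover.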
Here is a procedure I put together in the past for a machine with a raw image in a similar situation. I cannot apply it this time, because I need to fix ZFS-based machines, not a raw-image machine whose disk is simply out of sync:
Procedure to move a raw image based VM back to the original server without "syncing" the image:
VM 107 was found on p1, where a copy of its raw image was at /mnt/bigdisk/vz/images/107/vm-107-disk-0.raw. This image was smaller than the image at the same location on p5.
Migration to p1 was never configured; it is unexplained how VM 107 ended up assigned to p1 (this happened again just now).
The goal was to run VM 107 on p5 with the image that was already on p5. The problem was twofold: migration was not possible because the existing image on p5 blocked it, yet that was precisely the image I wanted to keep.
Solution:
- Rename /mnt/bigdisk/vz/images/107/vm-107-disk-0.raw on both machines (to /mnt/bigdisk/vz/images/107/vm-107-disk-0.raw.org).
- touch /mnt/bigdisk/vz/images/107/vm-107-disk-0.raw on p1, to leave a size-0 image that will not start.
- Through the Proxmox interface, request a migration to p5, which succeeds because the image on p5 is no longer blocking.
- With a size-0 image, the migration is almost immediate.
- Since the machine has not started, on p5, mv /mnt/bigdisk/vz/images/107/vm-107-disk-0.raw.org back to /mnt/bigdisk/vz/images/107/vm-107-disk-0.raw.
- Then start VM 107 on p5 successfully.
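The steps above can be sketched as shell commands. This assumes VM 107 is a qemu VM (hence qm; a container would use pct), and uses the CLI `qm migrate` in place of the GUI migration the post describes:

```shell
IMG=/mnt/bigdisk/vz/images/107/vm-107-disk-0.raw

# On BOTH p1 and p5: set the real image aside.
mv "$IMG" "$IMG.org"

# On p1 (the wrong node): leave a 0-byte placeholder, so migration
# has an image to move but the VM cannot accidentally boot from it.
touch "$IMG"

# On p1: migrate to p5; with a 0-byte image this is near-instant.
qm migrate 107 p5

# On p5: restore the desired image over the placeholder, then start.
mv "$IMG.org" "$IMG"
qm start 107
```

The key design point is that the migration only moves metadata plus an empty file, so the large (and in this case wrong) source image is never copied over the good one on p5.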