Failed Fail Over

ballybob

Member
Aug 13, 2022
I have a cluster running Proxmox VE 8.4.1 with ZFS replication.
I have manually migrated VMs many times, and I have tested failover a few times by powering off nodes.

After a power loss, I noticed my Docker VM wasn't up. I looked at Proxmox and all VMs had failed over fine except for my larger Docker VM. Even after I brought pve1 back online, it did not fail back. It was stuck on pve2 trying to migrate, but the migration kept failing with "zfs failure", saying the VM disk did not exist in rpool.

I recovered by shutting down pve2; everything came back up on pve1, and then I brought pve2 back online.
Since then, I have manually migrated VMs without issues, including 102, which was the one that had problems failing over.
 
Any idea what causes this? It seems to happen only when I need it most, not when I am just testing failover. Testing always succeeds.
 
So you unplugged the power cord as a test and it worked, yet when a real power outage hits, it does not?
Yes, I've done this quite a few times when I first set it up. Also had to do it outside of that.
Out of 5-6 real outages, I had problems twice where it said it couldn't fail over properly. Usually the VM migrates to pve2 but can't get back.
After I yanked the plug on pve2, the VM showed up on pve1. I then tried migrating it, and it migrated in both directions with no problem.