can't migrate vm unless i'm logged into that exact node

jwsl224

i have this weird issue with live migration where pve can't do it unless i log into exactly the source node when doing the migration; connecting via any other node in this 5-node cluster makes the migration fail. using 'qm migrate' also fails; it can only be done via the gui. this is so weird i don't even know where to start looking.

firstly, the issue seems to center around pve thinking the vm disks are "attached" storage. they're clearly not; they're regular zfs volumes like everything else.

when connected to the gui via any other node, here is what the migration window says:
[screenshot: migration window with the Migrate button greyed out]
the vm is of course powered on. this remains, and the 'migrate' button stays greyed out, even after i select an appropriate target storage. the target storage for the target nodes does populate correctly.
here is the error in detail when i try to start the migration via shell using qm migrate:
[screenshot: error output from qm migrate]

when i connect to the gui via the source node, i can use the gui to migrate the vm, but not qm migrate. any ideas?
 
Hi,
qm migrate needs to be issued on the source node, that is by design. And if your VM has local disks, you need to specify the --with-local-disks flag like the log tells you. To issue the migration call for a VM on another node, you can do e.g.
Code:
pvesh create /nodes/pve8a2/qemu/106/migrate --target pve8a1 --online 1 --with-local-disks 1
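Issued directly on the source node, the equivalent CLI call would look roughly like this (same example VM ID and target node as above, adjust to your setup; add --targetstorage if the target node uses a different storage name):
Code:
qm migrate 106 pve8a1 --online --with-local-disks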

As for why the UI is behaving differently, that sounds like the status of the VM is not being detected correctly. Do you see any errors when you check the Network tab in your browser's developer tools (often Ctrl+Shift+C)? Is your pvestatd service functioning properly on all nodes? Do you see any interesting messages in the system logs/journal of the relevant nodes?
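As a rough sketch of what to look at on each relevant node (the time window is just an example):
Code:
systemctl status pvestatd
journalctl -u pvestatd -u pvedaemon -u pveproxy --since "1 hour ago"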
 
qm migrate needs to be issued on the source node, that is by design.
certainly. this is what i am speaking of. what i meant is connecting to the cluster has to be done via the source node. it should technically work to connect to the cluster from any node, and then issue the command by connecting to the source node shell, right?

And if your VM has local disks, you need to specify the --with-local-disks flag like the log tells you.
the vm has no local disks. that's what is confusing me. it just has regular zfs virtual disks like all the rest of the vms. and again, the migration works fine when connecting to the cluster using the IP address of the source node. and only that way.


Do you see any errors when you check the Network tab in your browser's developer tools (often Ctrl+Shift+C)?
hmm. wow. never seen that tab before :p but no i don't see any errors. the "status" column shows a list of green "200". lots of activity though. evidently clustering is hard work :)

Is your pvestatd service functioning properly on all nodes?
yes. it is "running" on relevant nodes.

Do you see any interesting messages in the system logs/journal of the relevant nodes?
i do not see anything interesting in journalctl. is there a specific something you'd like me to filter for?
 
certainly. this is what i am speaking of. what i meant is connecting to the cluster has to be done via the source node. it should technically work to connect to the cluster from any node, and then issue the command by connecting to the source node shell, right?
Yes, if you issue the qm migrate command on the node where the VM currently is, it should work.
the vm has no local disks. that's what is confusing me. it just has regular zfs virtual disks like all the rest of the vms. and again, the migration works fine when connecting to the cluster using the IP address of the source node. and only that way.
ZFS virtual disks are local disks. ZFS storage is not shared (except in the ZFS over iSCSI case): https://pve.proxmox.com/pve-docs/chapter-pvesm.html#_storage_types

Please share the VM configuration qm config 100 and storage configuration /etc/pve/storage.cfg as well as the output of pveversion -v on both nodes.
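For reference, a ZFS storage definition in /etc/pve/storage.cfg usually looks something like this (pool and node names here are illustrative, not taken from your setup); a nodes line restricts on which nodes the storage is considered available:
Code:
zfspool: nvme4
        pool tank/vmdata
        content images,rootdir
        sparse
        nodes pve8a1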
 
this is the only node i am having this issue with. and only when not connecting to its ip address for the web gui, as mentioned.
Can you ping and ssh between the node and the other node in both directions?

What do you get when you run the following command on both the node itself and on the other node: pvesh get /nodes/pve8a2/qemu/106/migrate --output-format json-pretty (replacing the node name and VM ID with yours for which the error in the UI occurs)
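As a rough sketch of both checks (node names, addresses and VM ID are placeholders, adjust to yours):
Code:
# connectivity, run from each node towards the other one
ping -c 3 <other-node-ip>
ssh root@<other-node-ip> true

# migration precondition check, as mentioned above
pvesh get /nodes/<source-node>/qemu/<vmid>/migrate --output-format json-pretty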
 
Can you ping and ssh between the node and the other node in both directions
hmm. when i try to ping from the offending node to the other node, it works great. but trying to ssh brings us to this:
[screenshot: SSH error when connecting from the offending node]

edit: ssh'ing from the other node to the offending node works fine.
 
It means that the SSH host keys have changed for some reason. You might need to regenerate them. There are a lot of threads about similar situations in the forum.
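A quick way to see which entry is stale (a sketch; key type and names may differ on your setup) is to compare the host key the target node actually presents with what the connecting node has stored:
Code:
# on the target node: fingerprint of the host key it serves
ssh-keygen -lf /etc/ssh/ssh_host_rsa_key.pub

# on the connecting node: the stored entry for that host
ssh-keygen -F <target-node-name-or-ip> -f /etc/pve/priv/known_hosts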
 
It means that the SSH host keys have changed for some reason. You might need to regenerate them. There are a lot of threads about similar situations in the forum.
there seem to be a lot of threads about people just yolo'ing it. lol. is there an official docs entry on how to do this safely without jeopardizing the cluster?
 
there seem to be a lot of threads about people just yolo'ing it. lol. is there an official docs entry on how to do this safely without jeopardizing the cluster?
Not that I know of. What exactly you need to do depends on what precisely the issue is and that needs to be examined on the cluster itself.
 
Not that I know of. What exactly you need to do depends on what precisely the issue is and that needs to be examined on the cluster itself.
ok, i'm not well versed in cryptography. here is the error when trying to ssh from the issue host into another:

[screenshot: SSH error output]

some forum entries mention checking the entries in /etc/pve/priv/authorized_keys and /etc/pve/priv/known_hosts. are all the keys for all hosts supposed to be the same? i don't know what i'm "checking".

pve staff suggested running
Code:
pvecm updatecert -f
on one host, and that should update the keys across all hosts. this has been running so well for months now that i never had to dig into the inner workings of pve clustering. lol.
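presumably the result can be verified afterwards by simply re-trying ssh between the nodes, e.g. (node name is just an example):
Code:
ssh root@pve8a2 true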
 
What do you get when you run the following command on both the node itself and on the other node: pvesh get /nodes/pve8a2/qemu/106/migrate --output-format json-pretty (replacing the node name and VM ID with yours for which the error in the UI occurs)
ok, it seems to be something about the specific storage. migrating the vm disk to a different local storage makes the migration work as expected. fixing the ssh keys was unrelated. here is what it says when running the command you asked for:
Code:
root@pve8a1:~# pvesh get /nodes/pve8a1/qemu/188/migrate --output-format json-pretty
{
   "allowed_nodes" : [],
   "local_disks" : [
      {
         "cdrom" : 0,
         "drivename" : "scsi0",
         "is_attached" : 1,
         "is_tpmstate" : 0,
         "is_unused" : 0,
         "is_vmstate" : 0,
         "referenced_in_snapshot" : {
            "freepbxinstalled" : 1,
            "installedfresh" : 1
         },
         "replicate" : 1,
         "shared" : 0,
         "size" : 53687091200,
         "volid" : "nvme4:vm-188-disk-0"
      },
      {
         "cdrom" : 0,
         "is_attached" : 0,
         "is_tpmstate" : 0,
         "is_unused" : 0,
         "is_vmstate" : 1,
         "referenced_in_snapshot" : {
            "installedfresh" : 1
         },
         "replicate" : 1,
         "shared" : 0,
         "volid" : "nvme4:vm-188-state-installedfresh"
      },
      {
         "cdrom" : 0,
         "is_attached" : 0,
         "is_tpmstate" : 0,
         "is_unused" : 0,
         "is_vmstate" : 1,
         "referenced_in_snapshot" : {
            "freepbxinstalled" : 1
         },
         "replicate" : 1,
         "shared" : 0,
         "volid" : "nvme4:vm-188-state-freepbxinstalled"
      }
   ],
   "local_resources" : [],
   "mapped-resource-info" : {},
   "mapped-resources" : [],
   "not_allowed_nodes" : {
      "pve8a2" : {
         "unavailable_storages" : [
            "nvme4"
         ]
      },
      "pve8a3" : {
         "unavailable_storages" : [
            "nvme4"
         ]
      },
      "pve8a4" : {
         "unavailable_storages" : [
            "nvme4"
         ]
      },
      "pve8a5" : {
         "unavailable_storages" : [
            "nvme4"
         ]
      },
      "pve8a6" : {
         "unavailable_storages" : [
            "nvme4"
         ]
      }
   },
   "running" : 0
}
 
I don't see an indication that the storage itself has any issues from what you posted. Please share the details about the current situation again: do you still get the error in the UI with the VM running? Because according to the pvesh output you posted, the VM was not running.
Code:
"running" : 0
Did you ever attempt to use the --with-local-disks option for the CLI command? Because, again, a ZFS storage with the same name, which is available on multiple nodes, does count as local. Shared storages are really only the ones serving a common state to multiple nodes at the same time: https://pve.proxmox.com/pve-docs/chapter-pvesm.html#_storage_types
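To try that from the source node, the call would be something along these lines (a sketch using the VM ID and node names from your output; the target storage is a placeholder for one that actually exists on the target node):
Code:
# sketch: migrate VM 188 from the source node to pve8a2, sending its local disks along
qm migrate 188 pve8a2 --online --with-local-disks --targetstorage <storage-on-target>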