ESXi-imported VMs cannot find their config file when migrated within the Proxmox cluster

Dec 14, 2023
Hi,

I have a four-node Proxmox cluster.
All four nodes see the same eight NetApp NFS datastores, and my VMs reside on those external datastores.
I am currently in the process of migrating VMs from ESXi into that Proxmox cluster and onto those datastores.
I am aware that a VM's config file lives locally on the node, while its disk file lives on the external datastore.

Now, when migrating those imported VMs within my Proxmox cluster, I get the following error each time:
"unable to find configuration file for VM 122 on node 'proxmox06' (500)" (Proxmox06 being the source host curiously)

This error doesn't occur with VMs created via Proxmox VE itself, only with the ones I imported from an ESXi host via the Proxmox import wizard.

What is the problem? The VM works fine after the move; I am just afraid it will cause problems in the future, and the error is hindering productivity.

Right now I am still on Proxmox 8.

Thanks in advance for any input!
 
Hi @roggeb ,

The configuration file is located on a PVE-specific shared cluster filesystem (https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)).

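For reference, a quick way to check where a VM's config actually lives in pmxcfs (a minimal sketch; VM ID 122 is taken from your error message):

Bash:
# pmxcfs is mounted at /etc/pve; each node owns one config directory, and a
# VM's config file must exist under exactly one node's directory at a time.
ls -l /etc/pve/nodes/*/qemu-server/122.conf
# /etc/pve/qemu-server is a symlink to the local node's own directory:
readlink /etc/pve/qemu-server
# After a successful migration the file should exist only under the target node.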
Adding the following information will be helpful (the commands are consolidated in a sketch after this list):
- "qm config ID" output after import, before migration
- "qm list" output from the node that owns the VM
- exact method of migrating the VM (UI, CLI, API)
- if UI: list the exact steps and where the migration is invoked from; otherwise provide the CLI command / API call
- exact representation of the error (screenshot if necessary)
- "qm list" after the migration is done from the source and target host
- "qm config ID" from target host
- "pveversion"
- log snippet around the time of migration "journalctl --since xx:xx --until yy:yy"
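For convenience, the same commands as one consolidated sketch (VM ID 122 and the time window are placeholders; substitute your own):

Bash:
qm config 122                                # on the source node, before migrating
qm list                                      # on the node that owns the VM
# ... run the migration ...
qm list                                      # on both the source and the target node
qm config 122                                # on the target node
pveversion -v
journalctl --since "08:40" --until "08:50"   # window around the migration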

Cheers


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi,

Sorry for my late reply; other projects demanded my attention in the meantime.
Yes, it was also my understanding that the VM config files live in pmxcfs.

My latest insights:
  • The error appears when moving VMs that have a snapshot attached, and it keeps popping up even after the snapshot is deleted. Only a reboot of the VM (after deleting the snapshot) fixes it; then there are no more errors.
  • It makes no difference whether I include memory in the snapshot or not; the error still pops up.
  • I tried changing the Async IO options, to no avail
  • I tried "qm disk rescan"
  • When I move the VM from local storage on one node to local storage on a different node, there is no error.
This behaviour was also predicted by a NetApp representative I had a meeting with. He said it occurs when a VM has a snapshot attached on any external/shared storage. I doubt that, though; it would be a major design flaw. (How I check and clean up the snapshot state is sketched below.)

If it is a NetApp-only problem, it lies outside the scope of this forum anyway, but maybe someone has had a similar experience.
Maybe this is a caching issue somewhere?
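Here is a minimal sketch of how I check and clean up the snapshot state from the CLI (VM 109 as in the outputs below; the snapshot name "asd" matches the parent: line in the config and is otherwise an assumption):

Bash:
qm listsnapshot 109      # snapshots still referenced in the VM config
qm delsnapshot 109 asd   # delete the snapshot named "asd" (assumed name)
# check for leftover snapshot sections or a stale parent: line in the config
grep -nE '^(parent:|\[)' /etc/pve/qemu-server/109.conf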

Here is the additional information you asked for:
  1. "qm config ID" output after import, before migration:
    Bash:
    agent: enabled=1,type=virtio,fstrim_cloned_disks=1
    bios: seabios
    boot: order=scsi0
    cores: 4
    cpu: host
    ide0: none,media=cdrom
    machine: pc-i440fx-9.2+pve1
    memory: 8192
    meta: creation-qemu=9.2.0,ctime=1757600456
    name: myVM
    net0: virtio=00:50:56:80:32:33,bridge=vmbr1,tag=635
    numa: 1
    ostype: win10
    parent: asd
    scsi0: myStorage:109/vm-109-disk-0.qcow2,discard=on,iothread=1,size=100G,ssd=1
    scsihw: virtio-scsi-single
    smbios1: uuid=4......0
    sockets: 2
    vmgenid: e........9
2. "qm list" output from the node that owns the VM:
Bash:
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       104 aDifferentVM          stopped    6144             100.00 0
       109 myVM                  running    8192             100.00 708216
  3. Method of migration: the GUI. When I do it via the CLI, the error also pops up in the GUI.
  4. I right-click the VM → Migrate → choose the target node. It is only a compute migration; the VM's disk stays on the same external datastore. (The CLI equivalent is sketched below.)
  5. Here is a screenshot of the error message:
    Migration Error Message.png
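For completeness, a sketch of the CLI equivalent (the target node name is a placeholder; --online because VM 109 is running):

Bash:
# compute-only live migration: the disk stays on the shared NFS datastore,
# only the config file moves between node directories in pmxcfs
qm migrate 109 targetNode --online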
6. "qm list" from the source node (after migration):
Bash:
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       104 aDifferentVM         stopped    6144             100.00 0
qm list from the target node (after migration):
Bash:
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       103 aDifferentVM           stopped    6144             100.00 0
       106 aDifferentVM           running    2048              10.00 4079688
       108 aDifferentVM           stopped    8192             100.00 0
       109 myVM                   running    8192             100.00 2534292
7. "qm config" output from the target node after migration:
Bash:
agent: enabled=1,type=virtio,fstrim_cloned_disks=1
bios: seabios
boot: order=scsi0
cores: 4
cpu: host
ide0: none,media=cdrom
machine: pc-i440fx-9.2+pve1
memory: 8192
meta: creation-qemu=9.2.0,ctime=1757600456
name: myVM
net0: virtio=00:50:56:80:32:33,bridge=vmbr1,tag=635
numa: 1
ostype: win10
parent: asd
scsi0: myStorage:109/vm-109-disk-0.qcow2,discard=on,iothread=1,size=100G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=4........0
sockets: 2
vmgenid: e.........9

8. "pveversion":
Bash:
pve-manager/8.4.5/57892e8e686cb35b (running kernel: 6.8.12-13-pve)

9. log snippet from around the time of the migration:

Bash:
Sep 12 08:46:21 sourceNode pvedaemon[1153676]: <MyUserName> starting task UPID:sourceNode:00119D52:00E00195:68C3C1BD:qmigrate:109:MyUserName:
Sep 12 08:46:22 sourceNode pmxcfs[3430]: [status] notice: received log
Sep 12 08:46:23 sourceNode pmxcfs[3430]: [status] notice: received log
Sep 12 08:46:36 sourceNode kernel: tap109i0: left allmulticast mode
Sep 12 08:46:36 sourceNode kernel: vmbr1: port 2(tap109i0) entered disabled state
Sep 12 08:46:36 sourceNode kernel: bond1: left promiscuous mode
Sep 12 08:46:36 sourceNode kernel: bnxt_en 0000:21:00.1 ens2f1np1: left promiscuous mode
Sep 12 08:46:36 sourceNode kernel: bnxt_en 0000:a1:00.1 ens6f1np1: left promiscuous mode
Sep 12 08:46:36 sourceNode qmeventd[2978]: read: Connection reset by peer
Sep 12 08:46:36 sourceNode systemd[1]: 109.scope: Deactivated successfully.
Sep 12 08:46:36 sourceNode systemd[1]: 109.scope: Consumed 49.230s CPU time.
Sep 12 08:46:36 sourceNode pvedaemon[1153676]: <MyUserName> end task UPID:sourceNode:00119D52:00E00195:68C3C1BD:qmigrate:109:MyUserName: OK
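If it helps, the full migration task log can also be pulled by UPID (a sketch using the UPID from the journal above):

Bash:
pvenode task log UPID:sourceNode:00119D52:00E00195:68C3C1BD:qmigrate:109:MyUserName: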

Thanks again for the reply. I am happy to hear any ideas.