VM migration (storage+host) slow to get started

helgeros

New Member
Feb 2, 2024
Current setup is two servers connected to an uplink switch, plus a 10Gb DAC cable directly between them.
Cluster is enabled, and I have set migration to happen over the 10Gb DAC (confirmed working).
No shared storage; hardware RAID mirroring of two 4TB SSDs on each server.
Running latest version (pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.11-7-pve)).
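For reference, the dedicated migration network is configured cluster-wide in /etc/pve/datacenter.cfg; the relevant line looks roughly like this (the exact /24 subnet is an assumption based on the 172.16.0.1 address in the log below):
Code:
migration: secure,network=172.16.0.0/24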

When I right-click a standard VM and hit migrate, the process starts right away, but then stops and hangs for a good 5 minutes before it actually starts transferring data.
Is this normal? What is it doing in those minutes? :)

Code:
2024-02-02 08:25:08 use dedicated network address for sending migration traffic (172.16.0.1)
2024-02-02 08:25:09 starting migration of VM 102 to node 'pv01' (172.16.0.1)
2024-02-02 08:25:09 found local disk 'thin01_4tb:vm-102-disk-2' (attached)
2024-02-02 08:25:09 found local disk 'thin01_4tb:vm-102-disk-3' (attached)
2024-02-02 08:25:09 starting VM 102 on remote node 'pv01'
2024-02-02 08:25:12 volume 'thin01_4tb:vm-102-disk-2' is 'thin01_4tb:vm-102-disk-0' on the target
2024-02-02 08:25:12 volume 'thin01_4tb:vm-102-disk-3' is 'thin01_4tb:vm-102-disk-1' on the target
2024-02-02 08:25:12 start remote tunnel
2024-02-02 08:25:13 ssh tunnel ver 1
2024-02-02 08:25:13 starting storage migration
2024-02-02 08:25:13 scsi0: start migration to nbd:172.16.0.1:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
<<5-7min wait>>
drive-scsi0: transferred 571.0 MiB of 105.0 GiB (0.53%) in 5m 23s
 
Hi,
no, there should be no delay.
Can you verify that the block job started correctly during the seemingly inactive time by running the following command?
Bash:
echo '{"execute": "qmp_capabilities"}{"execute": "query-block-jobs", "arguments": {}}' | socat - /var/run/qemu-server/102.qmp | jq
Make sure to install socat and jq first (apt install socat jq). This should tell you whether the block job has been started within the QEMU process.

Also, please share the VM config (qm config 102 --current) and check the systemd journal on both nodes for errors; journalctl --since <DATETIME> --until <DATETIME> gives you a paginated view.
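For example, to cover the migration window from the task log above:
Bash:
# Journal entries around the start of the migration and the ~5 minute stall
journalctl --since "2024-02-02 08:20:00" --until "2024-02-02 08:40:00"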
 
Code:
qm config 102 --current
agent: 1,fstrim_cloned_disks=1
boot: order=scsi0;sata2
cores: 6
lock: migrate
memory: 16544
name: cla-admints
net0: virtio=3E:57:78:22:63:05,bridge=vmbr10,tag=10
numa: 0
onboot: 1
ostype: win10
sata2: none,media=cdrom
scsi0: thin01_4tb:vm-102-disk-2,cache=writeback,discard=on,format=raw,size=105G,ssd=1
scsi1: thin01_4tb:vm-102-disk-3,cache=writeback,discard=on,format=raw,size=100G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=ada78a60-6228-461a-8839-743057b956d6
sockets: 1
vmgenid: 21ff1401-bd36-4c74-add2-d8edc91dcacc

Code:
echo '{"execute": "qmp_capabilities"}{"execute": "query-block-jobs", "arguments": {}}' | socat - /var/run/qemu-server/102.qmp | jq
{
  "QMP": {
    "version": {
      "qemu": {
        "micro": 2,
        "minor": 1,
        "major": 8
      },
      "package": "pve-qemu-kvm_8.1.2-6"
    },
    "capabilities": []
  }
}
{
  "return": {}
}
{
  "return": [
    {
      "auto-finalize": true,
      "io-status": "ok",
      "device": "drive-scsi1",
      "auto-dismiss": true,
      "busy": true,
      "len": 0,
      "offset": 0,
      "status": "running",
      "paused": false,
      "speed": 0,
      "ready": false,
      "type": "mirror"
    }
  ]
}
 
Tested a bit more; it is consistently a wait period of just over 5 minutes where nothing seems to be happening.
No data traffic over any NICs, nothing showing up in journalctl -f.
What I did discover, though, is that if I shut down the VM first and then migrate it, it starts moving data right away.
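In CLI terms, the two cases I'm comparing are roughly these (VMID and target node as in the log above; I trigger the migration from the GUI, so the exact flags are my assumption):
Bash:
# Live migration with local disks (this is the one that stalls for ~5 minutes)
qm migrate 102 pv01 --online --with-local-disks

# Offline migration after shutting the VM down (this one starts copying immediately)
qm migrate 102 pv01 --with-local-disks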
 
Can you please also share your storage configuration (cat /etc/pve/storage.cfg) and, since I assume this is LVM-thin based on the storage's name, the output of lvdisplay? Does the issue persist even after restarting the VM?
"busy": true, "len": 0, "offset": 0,
This might indicate that the block job is waiting for I/O and does not yet have the size information for the disk.
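If you want to check whether len and offset ever move during that inactive phase, you could poll the same QMP query in a loop, for example (this simply repeats the command from above and filters out the reply that contains the job list; adjust the VMID as needed):
Bash:
# Print the running block jobs of VM 102 every 10 seconds.
# Only the query-block-jobs reply (the document whose "return" is an array) is kept by jq.
while true; do
    echo '{"execute": "qmp_capabilities"}{"execute": "query-block-jobs", "arguments": {}}' \
        | socat - /var/run/qemu-server/102.qmp \
        | jq 'select((.return | type) == "array") | .return'
    sleep 10
done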
 
This setup isn't really in production yet, so there is hardly any I/O going on.
I tried rebooting the VM and the same thing happens; the only way to make it migrate to the second host instantly is to shut the VM down first.
Just so it's mentioned: this is a Windows VM. I can create a Linux VM later tonight and test the behaviour there as well.
The guest agent is installed and running.

Code:
cat /etc/pve/storage.cfg
dir: local
    path /var/lib/vz
    content backup,iso,vztmpl

lvmthin: local-lvm
    thinpool data
    vgname pve
    content rootdir,images

lvmthin: thin01_4tb
    thinpool thin01_4tb
    vgname thin01_4tb
    content rootdir,images
    nodes pv01,pv02

Code:
lvdisplay
  --- Logical volume ---
  LV Name                thin01_4tb
  VG Name                thin01_4tb
  LV UUID                SHFTk2-SJqr-fMnl-3j6t-s2Fa-QaRS-5gnHtB
  LV Write Access        read/write (activated read only)
  LV Creation host, time pv01, 2024-01-26 15:44:54 +0100
  LV Pool metadata       thin01_4tb_tmeta
  LV Pool data           thin01_4tb_tdata
  LV Status              available
  # open                 0
  LV Size                <3.61 TiB
  Allocated pool data    9.35%
  Allocated metadata     0.83%
  Current LE             945503
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:8
  
  --- Logical volume ---
  LV Path                /dev/thin01_4tb/vm-100-disk-0
  LV Name                vm-100-disk-0
  VG Name                thin01_4tb
  LV UUID                iwuhm6-Vnz2-8iph-QsZQ-9z1y-rj6B-9fIC4n
  LV Write Access        read/write
  LV Creation host, time pv01, 2024-01-26 19:55:32 +0100
  LV Pool name           thin01_4tb
  LV Status              available
  # open                 1
  LV Size                80.00 GiB
  Mapped size            11.93%
  Current LE             20480
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:9
  
  --- Logical volume ---
  LV Path                /dev/thin01_4tb/vm-101-disk-0
  LV Name                vm-101-disk-0
  VG Name                thin01_4tb
  LV UUID                n8GmRJ-1d0N-AXND-PJY7-YQ5M-2XuH-wUMtki
  LV Write Access        read/write
  LV Creation host, time pv01, 2024-01-27 20:18:54 +0100
  LV Pool name           thin01_4tb
  LV Status              available
  # open                 1
  LV Size                35.00 GiB
  Mapped size            100.00%
  Current LE             8960
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:12
  
  --- Logical volume ---
  LV Path                /dev/thin01_4tb/vm-103-disk-0
  LV Name                vm-103-disk-0
  VG Name                thin01_4tb
  LV UUID                EHvc2c-dnBS-07OP-hJRh-Y4H2-Wh2A-R31GkX
  LV Write Access        read/write
  LV Creation host, time pv01, 2024-01-29 07:17:01 +0100
  LV Pool name           thin01_4tb
  LV Status              available
  # open                 1
  LV Size                500.00 GiB
  Mapped size            4.09%
  Current LE             128000
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:14
  
  --- Logical volume ---
  LV Path                /dev/thin01_4tb/vm-102-disk-0
  LV Name                vm-102-disk-0
  VG Name                thin01_4tb
  LV UUID                RxvzLH-7HWN-v6Md-bGYk-IJY4-NEpe-wktP0O
  LV Write Access        read/write
  LV Creation host, time pv01, 2024-02-02 08:25:10 +0100
  LV Pool name           thin01_4tb
  LV Status              available
  # open                 0
  LV Size                105.00 GiB
  Mapped size            100.00%
  Current LE             26880
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:11
  
  --- Logical volume ---
  LV Path                /dev/thin01_4tb/vm-102-disk-1
  LV Name                vm-102-disk-1
  VG Name                thin01_4tb
  LV UUID                57IswB-nKrs-qlWS-9YcU-k6yS-agqW-Q0uSsk
  LV Write Access        read/write
  LV Creation host, time pv01, 2024-02-02 08:25:11 +0100
  LV Pool name           thin01_4tb
  LV Status              available
  # open                 0
  LV Size                100.00 GiB
  Mapped size            0.00%
  Current LE             25600
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:13
  
  --- Logical volume ---
  LV Path                /dev/thin01_4tb/vm-102-disk-2
  LV Name                vm-102-disk-2
  VG Name                thin01_4tb
  LV UUID                LQhh2E-5cnB-Zsds-oAQs-5Ivz-AX20-58rVbJ
  LV Write Access        read/write
  LV Creation host, time pv01, 2024-02-02 11:21:22 +0100
  LV Pool name           thin01_4tb
  LV Status              available
  # open                 0
  LV Size                105.00 GiB
  Mapped size            10.61%
  Current LE             26880
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:15
  
  --- Logical volume ---
  LV Path                /dev/thin01_4tb/vm-102-disk-3
  LV Name                vm-102-disk-3
  VG Name                thin01_4tb
  LV UUID                lfPvoF-Cise-Hxvq-CW3f-ONdr-gYRf-N5Bcd8
  LV Write Access        read/write
  LV Creation host, time pv01, 2024-02-02 11:24:46 +0100
  LV Pool name           thin01_4tb
  LV Status              available
  # open                 1
  LV Size                105.00 GiB
  Mapped size            60.95%
  Current LE             26880
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:16
  
  --- Logical volume ---
  LV Path                /dev/thin01_4tb/vm-102-disk-4
  LV Name                vm-102-disk-4
  VG Name                thin01_4tb
  LV UUID                727kz6-Ve4Y-39Rs-u0mz-QA94-FbKo-BGAZUj
  LV Write Access        read/write
  LV Creation host, time pv01, 2024-02-02 11:24:46 +0100
  LV Pool name           thin01_4tb
  LV Status              available
  # open                 1
  LV Size                100.00 GiB
  Mapped size            100.00%
  Current LE             25600
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:17
  
  --- Logical volume ---
  LV Name                data
  VG Name                pve
  LV UUID                sdmvGD-udZ1-F8v6-7erf-546W-oea1-YSeQge
  LV Write Access        read/write (activated read only)
  LV Creation host, time proxmox, 2024-01-26 12:54:16 +0100
  LV Pool metadata       data_tmeta
  LV Pool data           data_tdata
  LV Status              available
  # open                 0
  LV Size                <793.80 GiB
  Allocated pool data    0.00%
  Allocated metadata     0.24%
  Current LE             203212
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:10
  
  --- Logical volume ---
  LV Path                /dev/pve/swap
  LV Name                swap
  VG Name                pve
  LV UUID                qmeVpq-51pV-YWFe-SrF2-hnGf-fNgJ-FFb7Qt
  LV Write Access        read/write
  LV Creation host, time proxmox, 2024-01-26 12:53:58 +0100
  LV Status              available
  # open                 2
  LV Size                8.00 GiB
  Current LE             2048
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:0
  
  --- Logical volume ---
  LV Path                /dev/pve/root
  LV Name                root
  VG Name                pve
  LV UUID                44N0nz-5dcu-2jFS-w1BS-Fbx6-wLL1-x0O31A
  LV Write Access        read/write
  LV Creation host, time proxmox, 2024-01-26 12:53:58 +0100
  LV Status              available
  # open                 1
  LV Size                96.00 GiB
  Current LE             24576
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:1
 
Interesting find: I tried the same procedure with a fresh Ubuntu Server VM, and it starts migrating right away, no delay.

EDIT: I have now also tested with a brand-new Windows Server 2022 VM, and it works fine on that VM as well.
So it seems to be an issue with the specific VM I had restored.
 
Hmm, I see 5 disks related to VM 102 in the output of lvdisplay, but only 2 of them are referenced by the VM's config. Is this LVM-thin storage somehow shared between the nodes, or do you have pre-existing disks with this VMID on the other node? Can you back up the VM as is and restore it to another VMID, to see whether the issue persists with the restored copy?
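A rough sketch of that test from the CLI, assuming the archive ends up on your 'local' storage (the archive name is a placeholder and 110 is just an example of an unused VMID):
Bash:
# Back up VM 102 (the 'local' storage has 'backup' content enabled in your storage.cfg)
vzdump 102 --storage local --mode snapshot --compress zstd

# Restore the resulting archive to a new, unused VMID on the LVM-thin storage
qmrestore /var/lib/vz/dump/<vzdump-archive-of-vm-102>.vma.zst 110 --storage thin01_4tb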
 
This was actually something I was about to open another thread about, because I'm not sure how I ended up with 5 thin disks for the same VM. The VM only has 2, so I assume the 3 others come from (manually) aborted migrations.
The thin storage is local only.
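If they really are orphans, I assume something like this would clean them up once I've double-checked that none of them are referenced in any VM config (volume names as listed in the lvdisplay output above):
Bash:
# List the volumes the storage layer knows about for VM 102 on this node
pvesm list thin01_4tb --vmid 102

# Delete a leftover volume, but only after confirming it is not referenced in /etc/pve/qemu-server/102.conf
pvesm free thin01_4tb:vm-102-disk-0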

I'll try a backup/restore to another VMID and report back; the backup server is currently offline.