VM migration problem

plastilin

Colleagues, I have run into a problem migrating a virtual machine from one cluster node to another when using LVM-thin. The migration task appears to run, but the transfer stays at 0%:

Code:
2023-05-16 22:12:49 starting migration of VM 137 to node 'node02' (10.8.6.2)
2023-05-16 22:12:50 found local disk 'local-data-raid:vm-137-disk-0' (in current VM config)
2023-05-16 22:12:50 starting VM 137 on remote node 'node02'
2023-05-16 22:12:52 volume 'local-data-raid:vm-137-disk-0' is 'local-data-raid:vm-137-disk-0' on the target
2023-05-16 22:12:52 start remote tunnel
2023-05-16 22:12:54 ssh tunnel ver 1
2023-05-16 22:12:54 starting storage migration
2023-05-16 22:12:54 scsi0: start migration to nbd:unix:/run/qemu-server/137_nbd.migrate:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 0.0 B of 20.0 GiB (0.00%) in 0s
drive-scsi0: transferred 0.0 B of 20.0 GiB (0.00%) in 1s
drive-scsi0: transferred 0.0 B of 20.0 GiB (0.00%) in 2s
drive-scsi0: transferred 0.0 B of 20.0 GiB (0.00%) in 3s
drive-scsi0: transferred 0.0 B of 20.0 GiB (0.00%) in 4s
drive-scsi0: transferred 0.0 B of 20.0 GiB (0.00%) in 5s
drive-scsi0: transferred 0.0 B of 20.0 GiB (0.00%) in 6s
drive-scsi0: transferred 0.0 B of 20.0 GiB (0.00%) in 7s
drive-scsi0: Cancelling block job

After cancelling the task, a virtual disk is left behind on the node that was supposed to receive the virtual machine, and it cannot be deleted because the system reports that it is in use:

Code:
root@node02:~# lvremove /dev/raid/vm-137-disk-0
  Logical volume raid/vm-137-disk-0 in use.

Code:
root@node02:~# lsof /dev/mapper/raid-vm--137--disk--0
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
kvm     19014 root   31u   BLK 253,11      0t0  495 /dev/mapper/../dm-11

Killing the leftover kvm process (kill -9 19014, using the PID from the lsof output) helps; after that the disk can be removed.

The virtual machine itself is also left locked; qm unlock clears the lock.
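Putting it together, the cleanup looks roughly like this (PID and VM ID taken from the output above; use the values from your own lsof output):

Code:
# on node02: terminate the stale kvm process that still holds the leftover volume (PID from lsof)
kill -9 19014
# on node02: the leftover thin volume can now be removed
lvremove /dev/raid/vm-137-disk-0
# on the source node: clear the stale migration lock on the VM
qm unlock 137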

How can I fix this and get the migration working again?
 
Hi,
please share the output of pveversion -v and qm config 137. Do you have discard enabled for the disk? If yes, please try again after disabling it. Anything special about your LVM-thin configuration?
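For example, the disk line shows whether discard is set, and it can be dropped by re-setting the disk options without it; a rough sketch, to be adjusted to your actual disk line:

Code:
# show the current options of the disk
qm config 137 | grep scsi0
# if the line contains ",discard=on", re-set the options without that flag, e.g.:
qm set 137 --scsi0 local-data-raid:vm-137-disk-0,format=raw,size=20G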
 
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.203-1-pve)
pve-manager: 6.4-15 (running version: 6.4-15/af7986e6)
pve-kernel-5.4: 6.4-20
pve-kernel-helper: 6.4-20
pve-kernel-5.4.203-1-pve: 5.4.203-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-5
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.14-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-2
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1

Code:
agent: 1
boot: order=scsi0;ide2;net0
cores: 1
ide2: none,media=cdrom
machine: q35
memory: 1024
name: vm
net0: virtio=62:A1:09:3A:72:E6,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
scsi0: local-data-raid:vm-137-disk-0,format=raw,size=20G
scsihw: virtio-scsi-pci
smbios1: uuid=5a5f306f-09f0-40e0-82bb-ed30307223fa
sockets: 1
vmgenid: 59629c6f-af37-4e23-917e-31923ef3d0b6
 
Not sure if it will help with your problem, but I'd suggest upgrading to a current version to not miss out on (security) fixes: https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0
Proxmox VE 6 has been end-of-life for nearly a year now.

Does migrating offline work (that uses a different mechanism to copy the disk)?
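For example, roughly like this (with the VM shut down first so that the copy happens offline):

Code:
# shut the VM down, then migrate it offline to the other node
qm shutdown 137
qm migrate 137 node02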
 
I think the problem is with the network. The bridge is built on a Linux bridge, and its behaviour looks wrong: for example, iperf3 between the nodes transfers essentially nothing.

Code:
[  5] local 10.8.6.3 port 36416 connected to 10.8.6.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   323 KBytes  2.65 Mbits/sec    7   8.74 KBytes      
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   8.74 KBytes      
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes      
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   8.74 KBytes      
[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes      
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes      
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   8.74 KBytes      
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes      
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes      
[  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes      
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   323 KBytes   265 Kbits/sec   10             sender
[  5]   0.00-10.00  sec  0.00 Bytes  0.00 bits/sec                  receiver

iperf Done.

At the same time, plain ICMP echo requests work fine:

Code:
root@node03:~# ping 10.8.6.2
PING 10.8.6.2 (10.8.6.2) 56(84) bytes of data.
64 bytes from 10.8.6.2: icmp_seq=1 ttl=64 time=0.298 ms
64 bytes from 10.8.6.2: icmp_seq=2 ttl=64 time=0.154 ms
64 bytes from 10.8.6.2: icmp_seq=3 ttl=64 time=0.163 ms
64 bytes from 10.8.6.2: icmp_seq=4 ttl=64 time=0.192 ms
64 bytes from 10.8.6.2: icmp_seq=5 ttl=64 time=0.201 ms
64 bytes from 10.8.6.2: icmp_seq=6 ttl=64 time=0.192 ms
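Note that the default ping payload is only 56 bytes, so it says nothing about jumbo frames; a check with full-size, non-fragmentable packets would look roughly like this (8972 = 9000 minus 28 bytes of IP/ICMP headers):

Code:
# send a full-size jumbo frame with the don't-fragment bit set;
# if this fails while small pings succeed, the path is dropping 9000-byte frames
ping -M do -s 8972 10.8.6.2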

Network configuration

Code:
auto lo
iface lo inet loopback

auto enp1s0f0
iface enp1s0f0 inet manual
        mtu 9000

auto enp1s0f1
iface enp1s0f1 inet manual
        mtu 9000

auto enp4s0f0
iface enp4s0f0 inet manual
        mtu 9000

auto enp4s0f1
iface enp4s0f1 inet manual
        mtu 9000

auto bond0
iface bond0 inet manual
        bond-slaves enp1s0f0 enp1s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        mtu 9000
#MGMT

auto bond1
iface bond1 inet manual
        bond-slaves enp4s0f0 enp4s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        mtu 9000
#TRUNK

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 10.8.6.2/24
        gateway 10.8.6.254
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 9000

The nodes are connected to an ISCOM2648G-4C-AC/S switch with LACP configured as follows:

Code:
interface port-channel 4
description lacp_mgmt_node2
jumboframe 9000
max-active links 2
load-sharing mode src-dst-ip
switchport access vlan 2020
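
On the host side, the state of the LACP bond can also be checked, e.g.:

Code:
# shows the bonding mode, the LACP partner details and the state of each slave interface
cat /proc/net/bonding/bond0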
 
Finally solved. The problem was the MTU on the switch: it was set to 9000, and after raising it to 12000 everything started working correctly (presumably because the switch counts the full Ethernet frame, headers and VLAN tag included, against its jumbo-frame limit, while the Linux MTU of 9000 only covers the payload). Something like that :)
 
