[SOLVED] Failure to migrate (live or offline) VM/CT from node

ctacat

Hello,

I'm having trouble migrating CTs or VMs in a PVE cluster of 3 nodes; I systematically get this kind of error:

2024-07-18 11:23:31 shutdown CT 105
2024-07-18 11:23:33 starting migration of CT 105 to node 'discovery' (192.168.222.11)
2024-07-18 11:23:33 found local volume 'local-lvm:vm-105-disk-0' (in current VM config)
2024-07-18 11:23:34 volume pve/vm-105-disk-0 already exists - importing with a different name
2024-07-18 11:23:34 Logical volume "vm-105-disk-1" created.
2024-07-18 11:23:38 35848192 bytes (36 MB, 34 MiB) copied, 3 s, 11.9 MB/s
2024-07-18 11:23:41 72876032 bytes (73 MB, 70 MiB) copied, 6 s, 12.1 MB/s

...

2024-07-18 11:29:35 4444913664 bytes (4.4 GB, 4.1 GiB) copied, 360 s, 12.3 MB/s
2024-07-18 11:29:38 command 'dd 'if=/dev/pve/vm-105-disk-0' 'bs=64k' 'status=progress'' failed: got signal 13
2024-07-18 11:29:38 ERROR: storage migration for 'local-lvm:vm-105-disk-0' to storage 'local-lvm' failed - command 'set -o pipefail && pvesm export local-lvm:vm-105-disk-0 raw+size - -with-snapshots 0 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=discovery' -o 'UserKnownHostsFile=/etc/pve/nodes/discovery/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.222.11 -- pvesm import local-lvm:vm-105-disk-0 raw+size - -with-snapshots 0 -allow-rename 1' failed: exit code 255
2024-07-18 11:29:38 aborting phase 1 - cleanup resources
2024-07-18 11:29:38 ERROR: found stale volume copy 'local-lvm:vm-105-disk-0' on node 'discovery'
2024-07-18 11:29:38 start final cleanup
2024-07-18 11:29:38 start container on source node
2024-07-18 11:29:40 ERROR: migration aborted (duration 00:06:09): storage migration for 'local-lvm:vm-105-disk-0' to storage 'local-lvm' failed - command 'set -o pipefail && pvesm export local-lvm:vm-105-disk-0 raw+size - -with-snapshots 0 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=discovery' -o 'UserKnownHostsFile=/etc/pve/nodes/discovery/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.222.11 -- pvesm import local-lvm:vm-105-disk-0 raw+size - -with-snapshots 0 -allow-rename 1' failed: exit code 255
TASK ERROR: migration aborted

The elapsed time before dd fails varies. Sometimes I also get this additional message, which I think is related to SSH:
client_loop: send disconnect: Broken pipe
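For context, signal 13 is SIGPIPE: the ssh side of the pipe died while dd was still writing. A minimal way to test whether the transport itself is at fault, independent of pvesm (just a sketch, reusing the target IP from the logs above; adjust to your node), is to stream dummy data over the same SSH path:

dd if=/dev/zero bs=64k count=100000 status=progress | ssh -o BatchMode=yes root@192.168.222.11 'cat > /dev/null'

If this also aborts with a broken pipe after a while, the problem is in the network/SSH layer rather than in the storage migration itself.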

Here is the output of pveversion (the same on all 3 nodes):

pve-manager/8.2.4/faa83925c9641325 (running kernel: 6.8.8-2-pve)

Any help would be appreciated.
 
Replying to myself:

- 1st: I disabled the TP-Link TX401 10 Gb NICs because they were causing a lot of trouble on my 3 nodes (freezes, panics, ...) and I'm using only the motherboards' integrated NICs for now (I probably need to update the NICs' firmware...).

- 2nd: I made sure the MTU is set consistently everywhere, starting at 1500 (see the quick checks below).
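For anyone in the same situation, a simple way to check both points on each node (the interface name enp5s0 is only an example, use your own) is:

ethtool -i enp5s0        # driver and firmware-version of the NIC
ip link show | grep mtu  # MTU of every interface and bridge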

Now I can migrate VMs and CTs flawlessly.

Starting from this baseline, I will now try to increase speed step by step (making the 10 Gb NICs work flawlessly, increasing the MTU to 9000, ...) ;)
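As a sanity check before switching to MTU 9000, one can verify that jumbo frames actually pass end-to-end between nodes without fragmentation (8972 = 9000 minus 28 bytes of IP + ICMP headers; again using the target node's IP from my logs):

ping -M do -s 8972 -c 4 192.168.222.11

If that fails while normal-sized pings work, some interface, bridge, or switch along the path is still at the lower MTU.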
 