[SOLVED] Fail to migrate (Live or not) VM/CT from node

ctacat

Hello,

I'm having trouble migrating CTs and VMs in a 3-node PVE cluster; I systematically get this kind of error:

2024-07-18 11:23:31 shutdown CT 105
2024-07-18 11:23:33 starting migration of CT 105 to node 'discovery' (192.168.222.11)
2024-07-18 11:23:33 found local volume 'local-lvm:vm-105-disk-0' (in current VM config)
2024-07-18 11:23:34 volume pve/vm-105-disk-0 already exists - importing with a different name
2024-07-18 11:23:34 Logical volume "vm-105-disk-1" created.
2024-07-18 11:23:38 35848192 bytes (36 MB, 34 MiB) copied, 3 s, 11.9 MB/s
2024-07-18 11:23:41 72876032 bytes (73 MB, 70 MiB) copied, 6 s, 12.1 MB/s

...

2024-07-18 11:29:35 4444913664 bytes (4.4 GB, 4.1 GiB) copied, 360 s, 12.3 MB/s
2024-07-18 11:29:38 command 'dd 'if=/dev/pve/vm-105-disk-0' 'bs=64k' 'status=progress'' failed: got signal 13
2024-07-18 11:29:38 ERROR: storage migration for 'local-lvm:vm-105-disk-0' to storage 'local-lvm' failed - command 'set -o pipefail && pvesm export local-lvm:vm-105-disk-0 raw+size - -with-snapshots 0 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=discovery' -o 'UserKnownHostsFile=/etc/pve/nodes/discovery/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.222.11 -- pvesm import local-lvm:vm-105-disk-0 raw+size - -with-snapshots 0 -allow-rename 1' failed: exit code 255
2024-07-18 11:29:38 aborting phase 1 - cleanup resources
2024-07-18 11:29:38 ERROR: found stale volume copy 'local-lvm:vm-105-disk-0' on node 'discovery'
2024-07-18 11:29:38 start final cleanup
2024-07-18 11:29:38 start container on source node
2024-07-18 11:29:40 ERROR: migration aborted (duration 00:06:09): storage migration for 'local-lvm:vm-105-disk-0' to storage 'local-lvm' failed - command 'set -o pipefail && pvesm export local-lvm:vm-105-disk-0 raw+size - -with-snapshots 0 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=discovery' -o 'UserKnownHostsFile=/etc/pve/nodes/discovery/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.222.11 -- pvesm import local-lvm:vm-105-disk-0 raw+size - -with-snapshots 0 -allow-rename 1' failed: exit code 255
TASK ERROR: migration aborted
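
Side note for anyone hitting the same error: signal 13 is SIGPIPE, meaning the receiving end of the 'pvesm export | ssh ... pvesm import' pipe on the target node went away mid-transfer, which points at the network/SSH link rather than at LVM itself. After such an abort, the partially imported volume created on the target ('vm-105-disk-1' in the log above) can be left behind. Assuming it is not referenced by any VM/CT config, it should show up in lvs on the target node and can be removed there, for example:

# on the target node ('discovery' in this log); double-check that no
# VM/CT config references the volume before removing anything
lvs pve
lvremove pve/vm-105-disk-1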

The elapsed time before dd fails varies. Sometimes I also get this additional message, which I think is SSH-related:
client_loop: send disconnect: Broken pipe
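
A quick way to check whether the link itself drops under sustained load, independently of Proxmox, is to push a stream over SSH between the two nodes (the IP is the one from the log above; the sizes are arbitrary):

# run on the source node: streams ~6 GB to the target and discards it there,
# so it only exercises the network and SSH, nothing is written to disk
dd if=/dev/zero bs=64k count=100000 status=progress | ssh root@192.168.222.11 'cat > /dev/null'

If this also stalls or dies with 'Broken pipe', the problem is below Proxmox (NIC, driver, cabling, switch, MTU).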

Here is the output of pveversion (identical on all 3 nodes):

pve-manager/8.2.4/faa83925c9641325 (running kernel: 6.8.8-2-pve)

Any help would be appreciated.
 
To answer my own question:

- 1st: I stopped using the TP-Link TX401 10 Gb NICs, because they were causing a lot of trouble on all 3 nodes (freezes, panics, ...), and I'm using only the motherboards' integrated NICs for now (I guess I need to update the NICs' firmware... a quick way to check driver and firmware versions is shown after this list).

- 2nd: I made sure the MTU is set consistently everywhere, starting at 1500 (a simple end-to-end MTU check is also shown below).
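
For reference, checks along these lines can be run on each node (the interface name enp3s0 is just an example, adjust to your hardware):

# NIC driver and firmware version
ethtool -i enp3s0

# configured MTU
ip link show enp3s0

# end-to-end test with a full-size, non-fragmented packet:
# 1472 = 1500 minus 28 bytes of IP+ICMP headers (use 8972 once MTU is 9000)
ping -M do -s 1472 -c 3 192.168.222.11

If the ping fails or gets no reply, the MTU is not consistent along the path.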

Now I can migrate VMs and CTs flawlessly.

Starting from this baseline, I will now try to increase speed step by step (getting the 10 Gb NICs to work reliably, increasing the MTU to 9000, ...) ;)
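
For the MTU 9000 step, the change will look roughly like this in /etc/network/interfaces on each node (the interface names and address are placeholders based on this cluster; the switch and every node must support jumbo frames, otherwise mismatched MTUs can cause exactly the kind of stalls and broken pipes shown above):

auto enp3s0
iface enp3s0 inet manual
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 192.168.222.10/24
        bridge-ports enp3s0
        bridge-stp off
        bridge-fd 0
        mtu 9000

Apply with 'ifreload -a' (or a reboot), then verify with the ping test above using -s 8972.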
 
