Proxmox VE 8.3: live migration problems

GPLExpert

Hi,

We're testing the new 8.3 version on our freshly upgraded test environment, and we are running into problems during VM migrations.

A physical machine is installed with an up-to-date Proxmox 8.3. Three VMs, also running Proxmox 8.3, form a Ceph cluster that provides the shared storage. Three other VMs, again running Proxmox 8.3, form a PVE cluster. This PVE cluster uses the Ceph cluster as RBD shared storage.

For the test, I created 5 Debian 12 VMs on the PVE cluster and launched a CPU and memory stress workload in each of them. I then migrated them between PVE cluster members, one at a time, and some of the migrations failed.

Here are the PVE logs of a failed migration:
Code:
2024-11-25 15:34:45 starting migration of VM 103 to node 'test-pve-03' (10.0.0.2)
2024-11-25 15:34:45 starting VM 103 on remote node 'test-pve-03'
2024-11-25 15:34:49 start remote tunnel
2024-11-25 15:34:51 ssh tunnel ver 1
2024-11-25 15:34:51 starting online/live migration on unix:/run/qemu-server/103.migrate
2024-11-25 15:34:51 set migration capabilities
2024-11-25 15:34:51 migration downtime limit: 100 ms
2024-11-25 15:34:51 migration cachesize: 512.0 MiB
2024-11-25 15:34:51 set migration parameters
2024-11-25 15:34:51 start migrate command to unix:/run/qemu-server/103.migrate
2024-11-25 15:34:52 migration active, transferred 111.6 MiB of 3.0 GiB VM-state, 154.3 MiB/s
2024-11-25 15:34:53 migration active, transferred 216.4 MiB of 3.0 GiB VM-state, 562.0 MiB/s
2024-11-25 15:34:54 migration active, transferred 342.5 MiB of 3.0 GiB VM-state, 245.3 MiB/s
2024-11-25 15:34:55 migration active, transferred 452.6 MiB of 3.0 GiB VM-state, 350.1 MiB/s
2024-11-25 15:34:56 migration active, transferred 549.6 MiB of 3.0 GiB VM-state, 477.7 MiB/s
2024-11-25 15:34:58 migration active, transferred 694.6 MiB of 3.0 GiB VM-state, 243.2 MiB/s
query migrate failed: VM 103 not running

2024-11-25 15:34:59 query migrate failed: VM 103 not running
query migrate failed: VM 103 not running

2024-11-25 15:35:00 query migrate failed: VM 103 not running
query migrate failed: VM 103 not running

2024-11-25 15:35:01 query migrate failed: VM 103 not running
query migrate failed: VM 103 not running

2024-11-25 15:35:02 query migrate failed: VM 103 not running
query migrate failed: VM 103 not running

2024-11-25 15:35:03 query migrate failed: VM 103 not running
query migrate failed: VM 103 not running

2024-11-25 15:35:04 query migrate failed: VM 103 not running
2024-11-25 15:35:04 ERROR: online migrate failure - too many query migrate failures - aborting
2024-11-25 15:35:04 aborting phase 2 - cleanup resources
2024-11-25 15:35:04 migrate_cancel
2024-11-25 15:35:04 migrate_cancel error: VM 103 not running
2024-11-25 15:35:04 ERROR: query-status error: VM 103 not running
2024-11-25 15:35:08 ERROR: migration finished with problems (duration 00:00:23)

TASK ERROR: migration problems

Here is what I found in the source PVE server's system logs during the failed migrations:
Code:
2024-11-25T11:45:50.208525+01:00 test-pve-01 kernel: [ 3128.811536] kvm[17244]: segfault at 41b8 ip 00005afcf1cdbb00 sp 00007ca9043fff38 error 4 in qemu-system-x86_64[5afcf17f8000+6a4000] likely on CPU 1 (core 1, socket 0)
2024-11-25T11:46:14.828141+01:00 test-pve-01 kernel: [ 3153.430456] kvm[17025]: segfault at 41b8 ip 0000637f34093b00 sp 00007b86894fef38 error 4 in qemu-system-x86_64[637f33bb0000+6a4000] likely on CPU 3 (core 3, socket 0)
2024-11-25T13:26:03.043440+01:00 test-pve-01 kernel: [  480.749017] kvm[2219]: segfault at 41b8 ip 000060ab8de25b00 sp 00007284055d5f38 error 4 in qemu-system-x86_64[60ab8d942000+6a4000] likely on CPU 2 (core 2, socket 0)
2024-11-25T13:26:28.676741+01:00 test-pve-01 kernel: [  506.380955] kvm[2058]: segfault at 41b8 ip 00005df40925eb00 sp 000070abe59fff38 error 4 in qemu-system-x86_64[5df408d7b000+6a4000] likely on CPU 2 (core 2, socket 0)
2024-11-25T14:15:13.045343+01:00 test-pve-01 kernel: [  313.101418] kvm[1829]: segfault at 41b8 ip 000060607b764b00 sp 00007b33f9fe1f38 error 4 in qemu-system-x86_64[60607b281000+6a4000] likely on CPU 2 (core 2, socket 0)
2024-11-25T14:39:18.392760+01:00 test-pve-01 kernel: [  676.009453] kvm[2609]: segfault at 41b8 ip 00005b71bdecbb00 sp 00007222f02ccf38 error 4 in qemu-system-x86_64[5b71bd9e8000+6a4000] likely on CPU 1 (core 1, socket 0)
2024-11-25T15:34:59.208184+01:00 test-pve-01 kernel: [ 1373.633994] kvm[7867]: segfault at 41b8 ip 00005817d7fcfb00 sp 00007979361fff38 error 4 in qemu-system-x86_64[80eb00,5817d7aec000+6a4000] likely on CPU 0 (core 0, socket 0)

Thanks for your help.

Fabien
 
Hi,
please share the VM configuration (qm config <ID>) and the output of pveversion -v from both source and target node, as well as the full system log from the source node around the time the issue happened.

To further debug the issue, please run apt install pve-qemu-kvm-dbgsym gdb systemd-coredump. The next time a crash happens afterwards, you can run coredumpctl -1 gdb and then, at the GDB prompt, thread apply all backtrace.
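That is, roughly:
Code:
# install debug symbols for QEMU, GDB and the systemd coredump handler
apt install pve-qemu-kvm-dbgsym gdb systemd-coredump

# after the next crash, open the most recent coredump in GDB
coredumpctl -1 gdb

# then, at the (gdb) prompt, collect a backtrace of all threads
thread apply all backtrace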
 
I did tests with older versions of both the kernel and the qemu-kvm packages, and my problem seems to have been introduced with version 9.0.2 of the pve-qemu-kvm package (OK with 9.0.0-6 and broken with 9.0.2-1).
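For reference, the downgrade test looked roughly like this (a sketch, assuming the 9.0.0-6 build is still available from the configured repository or the local apt cache):
Code:
# install the older pve-qemu-kvm build explicitly
apt install pve-qemu-kvm=9.0.0-6
# optionally prevent it from being upgraded again while testing
apt-mark hold pve-qemu-kvm
# note: already running VMs keep using the binary they were started with
# until they are restarted or migrated to a host with the other build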
 
Here is the information you requested:

VM config:
Code:
# qm config 103
agent: 1
boot: order=scsi0;ide2;net0
cores: 2
cpu: x86-64-v2-AES,flags=+aes
ide2: none,media=cdrom
memory: 3072
meta: creation-qemu=7.2.0,ctime=1692804811
name: test-migration4
net0: virtio=72:8D:E9:1E:80:8C,bridge=vmbr0
numa: 0
onboot: 0
ostype: l26
parent: before_test_20240228
scsi0: vmdata:vm-103-disk-0,iothread=1,size=10G
scsihw: virtio-scsi-single
smbios1: uuid=297d5ed3-b6d5-4ff3-81c2-40ef5f612c2f
sockets: 1
vmgenid: c87cd4cc-f4aa-428c-a119-62464721ecb2

Source PVE server versions:
Code:
# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.3.0 (running version: 8.3.0/c1689ccb1065a83b)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-6
proxmox-kernel-6.8: 6.8.12-4
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 18.2.4-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.0
libpve-storage-perl: 8.2.9
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.2.9-1
proxmox-backup-file-restore: 3.2.9-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.3.1
pve-cluster: 8.0.10
pve-container: 5.2.2
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-1
pve-ha-manager: 4.0.6
pve-i18n: 3.3.1
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.0
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1


Target PVE server versions:
Code:
# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.3.0 (running version: 8.3.0/c1689ccb1065a83b)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-6
proxmox-kernel-6.8: 6.8.12-4
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 18.2.4-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.0
libpve-storage-perl: 8.2.9
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.2.9-1
proxmox-backup-file-restore: 3.2.9-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.3.1
pve-cluster: 8.0.10
pve-container: 5.2.2
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-1
pve-ha-manager: 4.0.6
pve-i18n: 3.3.1
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.0
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

Full system logs between the start and the end of the migration task:
Code:
2024-11-25T15:34:45.240894+01:00 test-pve-01 pvesh[9805]: <root@pam> starting task UPID:test-pve-01:00002650:00021311:67448B05:qmigrate:103:root@pam:
2024-11-25T15:34:47.628741+01:00 test-pve-01 pmxcfs[1194]: [status] notice: received log
2024-11-25T15:34:49.131153+01:00 test-pve-01 pmxcfs[1194]: [status] notice: received log
2024-11-25T15:34:49.756651+01:00 test-pve-01 pmxcfs[1194]: [status] notice: received log
2024-11-25T15:34:49.836678+01:00 test-pve-01 pmxcfs[1194]: [status] notice: received log
2024-11-25T15:34:59.167725+01:00 test-pve-01 pmxcfs[1194]: [status] notice: received log
2024-11-25T15:34:59.208184+01:00 test-pve-01 kernel: [ 1373.633994] kvm[7867]: segfault at 41b8 ip 00005817d7fcfb00 sp 00007979361fff38 error 4 in qemu-system-x86_64[80eb00,5817d7aec000+6a4000] likely on CPU 0 (core 0, socket 0)
2024-11-25T15:34:59.208225+01:00 test-pve-01 kernel: [ 1373.634027] Code: 48 8d 0d 23 8d 35 00 ba bb 16 00 00 48 8d 35 04 66 35 00 48 8d 3d 2f 66 35 00 e8 7b cb b1 ff 66 66 2e 0f 1f 84 00 00 00 00 00 <48> 8b 87 b8 41 00 00 31 d2 48 85 c0 74 19 66 90 f6 40 18 10 74 08
2024-11-25T15:34:59.250665+01:00 test-pve-01 pvesh[9808]: VM 103 qmp command failed - VM 103 not running
2024-11-25T15:34:59.390163+01:00 test-pve-01 kernel: [ 1373.816528] vmbr0: port 3(tap103i0) entered disabled state
2024-11-25T15:34:59.391140+01:00 test-pve-01 kernel: [ 1373.817173] tap103i0 (unregistering): left allmulticast mode
2024-11-25T15:34:59.391151+01:00 test-pve-01 kernel: [ 1373.817189] vmbr0: port 3(tap103i0) entered disabled state
2024-11-25T15:34:59.423159+01:00 test-pve-01 systemd[1]: 103.scope: Deactivated successfully.
2024-11-25T15:34:59.423384+01:00 test-pve-01 systemd[1]: 103.scope: Consumed 4min 45.492s CPU time.
2024-11-25T15:34:59.732170+01:00 test-pve-01 pmxcfs[1194]: [status] notice: received log
2024-11-25T15:35:00.352961+01:00 test-pve-01 pvesh[9808]: VM 103 qmp command failed - VM 103 not running
2024-11-25T15:35:01.087613+01:00 test-pve-01 qmeventd[9858]: Starting cleanup for 103
2024-11-25T15:35:01.087828+01:00 test-pve-01 qmeventd[9858]: trying to acquire lock...
2024-11-25T15:35:01.453975+01:00 test-pve-01 pvesh[9808]: VM 103 qmp command failed - VM 103 not running
2024-11-25T15:35:02.558800+01:00 test-pve-01 pvesh[9808]: VM 103 qmp command failed - VM 103 not running
2024-11-25T15:35:03.659816+01:00 test-pve-01 pvesh[9808]: VM 103 qmp command failed - VM 103 not running
2024-11-25T15:35:04.761840+01:00 test-pve-01 pvesh[9808]: VM 103 qmp command failed - VM 103 not running
2024-11-25T15:35:04.764167+01:00 test-pve-01 pvesh[9808]: VM 103 qmp command failed - VM 103 not running
2024-11-25T15:35:04.764360+01:00 test-pve-01 pvesh[9808]: VM 103 qmp command failed - VM 103 not running
2024-11-25T15:35:06.885187+01:00 test-pve-01 pmxcfs[1194]: [status] notice: received log
2024-11-25T15:35:06.957243+01:00 test-pve-01 pmxcfs[1194]: [status] notice: received log
2024-11-25T15:35:08.467740+01:00 test-pve-01 qmeventd[9858]:  OK
2024-11-25T15:35:08.468404+01:00 test-pve-01 pvesh[9808]: migration problems
2024-11-25T15:35:08.483830+01:00 test-pve-01 pvesh[9805]: <root@pam> end task UPID:test-pve-01:00002650:00021311:67448B05:qmigrate:103:root@pam: migration problems
2024-11-25T15:35:08.506431+01:00 test-pve-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port fwln103i0
2024-11-25T15:35:08.511300+01:00 test-pve-01 ovs-vsctl: ovs|00002|db_ctl_base|ERR|no port named fwln103i0
2024-11-25T15:35:08.541631+01:00 test-pve-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port tap103i0
2024-11-25T15:35:08.547338+01:00 test-pve-01 ovs-vsctl: ovs|00002|db_ctl_base|ERR|no port named tap103i0
2024-11-25T15:35:08.644216+01:00 test-pve-01 qmeventd[9858]: Finished cleanup for 103
2024-11-25T15:35:09.150654+01:00 test-pve-01 pmxcfs[1194]: [status] notice: received log
2024-11-25T15:35:09.737504+01:00 test-pve-01 pmxcfs[1194]: [status] notice: received log

I attached the coredump result.

This is not the same VM but the config should be about the same:
Code:
# qm config 104
agent: 1
boot: order=scsi0;ide2;net0
cores: 2
cpu: x86-64-v2-AES,flags=+aes
ide2: none,media=cdrom
memory: 3072
meta: creation-qemu=7.2.0,ctime=1692804811
name: test-migration5
net0: virtio=DE:30:DE:BE:7F:DC,bridge=vmbr0
numa: 0
onboot: 0
ostype: l26
parent: before_test_20240228
scsi0: vmdata:vm-104-disk-0,iothread=1,size=10G
scsihw: virtio-scsi-single
smbios1: uuid=fc573b82-f2d7-4e65-b629-11feeba5200b
sockets: 1
vmgenid: c5910980-ea6e-4617-9dfc-655e968d0b92

Many thanks.
Kind regards.

Fabien
 


Thanks! Did you do any other operations like snapshot, backup, resize that might do something with the disk before the failure? Please also share the storage configuration for vmdata from /etc/pve/storage.cfg.

Was the krbd setting ever changed on the storage?
 
Nothing changed except the pve-qemu-kvm package version during my morning tests. These VMs have old snapshots (from February, still present). I can test without them if needed.
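(If useful, the old snapshot could be removed with something like the command below; the snapshot name here is taken from the parent line of the VM config shown earlier and may differ.)
Code:
qm delsnapshot 103 before_test_20240228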

Here's my vmdata storage config:
Code:
rbd: vmdata
        content rootdir,images
        krbd 0
        monhost 10.xx.xx.xx 10.xx.xx.xx 10.xx.xx.xx
        pool vmdata
        username admin

I don't remember changing anything in the krbd settings.

Kind regards.

Fabien
 
Hello, after upgrading Proxmox to version 8.3, live migration aborts with an error.
Setup: a 2-node cluster with Ceph and iSCSI storage.
Until now, a live migration via HA or done manually worked perfectly, and the error is independent of the storage used (Ceph & iSCSI).
Proxmox, HA, and Ceph each report a healthy status.
Apart from adding a shared iSCSI storage, there were no hardware or software changes after the upgrade.
The error message:

2024-11-28 08:54:53 [pven2] TERM environment variable not set.
2024-11-28 08:54:59 start remote tunnel
TERM environment variable not set.
2024-11-28 08:54:59 tunnel still running - terminating now with SIGTERM
2024-11-28 08:55:00 ERROR: online migrate failure - can't open tunnel - got strange reply from tunnel
2024-11-28 08:55:00 aborting phase 2 - cleanup resources
2024-11-28 08:55:00 migrate_cancel
2024-11-28 08:55:01 ERROR: migration finished with problems (duration 00:00:09)
TASK ERROR: migration problems

The solutions proposed here, and all the other solutions to this problem I have found so far, have not brought any improvement.
 
Hi,

I didn't specify this yesterday: the snapshot was taken without memory (RAM state).
I removed this snapshot and the problem still occurs.

Best regards.

Fabien
 
dtt said:
Proxmox upgrade to version 8.3 - live migration aborts with "can't open tunnel - got strange reply from tunnel" (full post quoted above).
Hello @dtt,
Have you checked that you are not hitting this issue:
https://forum.proxmox.com/threads/live-migration-fails-cant-open-tunnel.119423/
The "can't open tunnel - got strange reply from tunnel" message seems to be related to the SSH interaction
between the PVE nodes, not to qemu-kvm.
Have you tried connecting from each PVE node's CLI to every other PVE node over SSH (as root)
to validate the SSH interaction, for example with the quick check below?
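A quick sketch; <other-node> is a placeholder for the peer node's name or IP:
Code:
# BatchMode disables password prompts, so a key or known_hosts problem fails immediately
ssh -o BatchMode=yes root@<other-node> true && echo "ssh OK"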

Best regards,
Cédric
 
Unfortunately, I'm not able to reproduce the issue. The backtrace shows that there was a NULL pointer passed to a flush subroutine in QEMU, which does not expect it. But it's not clear yet where exactly that subroutine is called with the invalid value.

Before migration, please do the following (for both commands, replace 104 with the actual VMID should it be different):

1. Save the following script as query-block.pm
Code:
#!/usr/bin/perl

use strict;
use warnings;

use JSON;

use PVE::QemuServer::Monitor qw(mon_cmd);

my $vmid = shift or die "need to specify vmid\n";

my $res = eval { mon_cmd($vmid, "query-block" ) };
die $@ if $@;
print to_json($res, { pretty => 1, canonical => 1 });
and run it with perl query-block.pm 104. The output would be interesting.

2. Save the following script as gdb-script.txt
Code:
# break whenever QEMU flushes a block device
break bdrv_flush
break bdrv_co_flush
# on either breakpoint (1 or 2): print a backtrace, then continue
commands 1 2
bt
c
end
# resume the attached process
c
and run gdb --batch -x gdb-script.txt -p $(cat /var/run/qemu-server/104.pid) &> /tmp/gdb-log.txt.

Then run the migration, wait for the failure and share the log file that the second command produced.
 
Unfortunately, there seems to have been a signal that interrupted GDB early.
Could you try again with the following script?
Code:
# tell GDB not to stop or print on signals QEMU raises during normal operation
handle SIGUSR1 noprint nostop
handle SIGPIPE noprint nostop
break bdrv_flush
break bdrv_co_flush
commands 1 2
bt
c
end
c
 
Hmm, unfortunately still not showing the smoking gun. How long before the migration did you attach the debugger?

Is there a lot of IO going on inside the VMs? Or a lot of IO latency? Do you use the same network interface for migration and for Ceph?
 
I launched the migration right after attaching the debugger, using a script (roughly sketched below). I added a 1-second sleep in between, but the GDB log output looks about the same. I attached it as well.
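The wrapper script looks roughly like this (a sketch; the VMID, target node and paths are from this test, the real script may differ slightly):
Code:
#!/bin/bash
VMID=104
TARGET=test-pve-03

# attach GDB with the breakpoint script in the background, logging to a file
gdb --batch -x gdb-script.txt -p "$(cat /var/run/qemu-server/${VMID}.pid)" &> /tmp/gdb-log.txt &

# give GDB a moment to attach and set the breakpoints
sleep 1

# start the online migration
qm migrate "$VMID" "$TARGET" --online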

I didn't mention it before, but the command complains about some missing source files:
Code:
42      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
909     block/block-gen.c: No such file or directory.
2947    ../block/io.c: No such file or directory.
909     block/block-gen.c: No such file or directory.
2947    ../block/io.c: No such file or directory.
909     block/block-gen.c: No such file or directory.
2947    ../block/io.c: No such file or directory.
909     block/block-gen.c: No such file or directory.
2947    ../block/io.c: No such file or directory.

I run a "stress --cpu 2 --vm 2" command inside each VM (the VMs have 2 vCPUs and 1.5 GB RAM). A top command inside the VMs shows an I/O wait value of 0.0. I use dedicated network interfaces for Ceph.

Thanks.

Fabien
 


I haven't been able to reproduce the issue yet. Could you try attaching the debugger a few minutes before attempting the migration? If we are lucky, we will then catch the moment where the problematic flush is first queued.
 
Hi,

First, I wish a happy new year to all the team.

I was able to run some more tests that may help to understand/reproduce my problem.

My original infrastructure was nested: one bare-metal server runs Proxmox 8.3 as the hypervisor. Inside it, we installed 6 VMs with Proxmox 8.3: the first 3 form the Ceph cluster, and the other 3 form the Proxmox 8.3 hypervisor cluster, using the Ceph cluster as storage. The steps from https://pve.proxmox.com/wiki/Nested_Virtualization have been applied.

We got 3 other bare-metal machines, installed them with Proxmox 8.3 as well, connected them to the same Ceph cluster, and used them as a Proxmox 8.3 hypervisor cluster. We could run the same test without any issue.

So it seems the problem occurs when the compute hypervisors are nested (running inside another hypervisor).
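(For reference, whether nested virtualization is enabled on the outer bare-metal host can be checked as described in the wiki article above, for example:)
Code:
# on the outer hypervisor; should report "Y" on Intel or "1" on AMD
cat /sys/module/kvm_intel/parameters/nested   # Intel CPUs
cat /sys/module/kvm_amd/parameters/nested     # AMD CPUs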

Kind regards.

Fabien
 
So it seems the problem occurs when the compute hypervisors are nested (running inside another hypervisor).
My suspicion is that it is performance-related. I think the bug can trigger when flush requests to the guest disk during migration take a long time to complete. I'll try to reproduce this again later this week.
 
