Node crashes on migration of vm - host is PVE 8.0.4

jptechnical

I have a single node in a 3-node cluster that ALWAYS crashes on incoming migration. I have reinstalled the OS twice. This began after upgrading to v8.x and was not an issue on v7.x. I have tried this with an add-on USB 2.5GbE dedicated migration LAN and with the onboard 1GbE shared LAN. Hosts are Beelink SER5 Pro. I just did an `apt update && apt upgrade -y && reboot` on the erring host.

Steps to reproduce:

1. Right-click a VM, click migrate
2. Choose pve2, no other options
3. Watch the migration in the task viewer.
4. Task fails after an arbitrary number of GBs; it appears to stall as the destination host restarts.

output:
Code:
```
2023-08-23 10:30:03 use dedicated network address for sending migration traffic (10.2.2.62)
2023-08-23 10:30:03 starting migration of VM 105 to node 'pve2' (10.2.2.62)
2023-08-23 10:30:03 found local disk 'local-zfs:vm-105-disk-0' (attached)
2023-08-23 10:30:03 found local disk 'local-zfs:vm-105-disk-1' (attached)
2023-08-23 10:30:03 found generated disk 'local-zfs:vm-105-disk-2' (in current VM config)
2023-08-23 10:30:03 copying local disk images
2023-08-23 10:30:04 full send of rpool/data/vm-105-disk-0@__migration__ estimated size is 573K
2023-08-23 10:30:04 total estimated size is 573K
2023-08-23 10:30:05 successfully imported 'local-zfs:vm-105-disk-0'
2023-08-23 10:30:05 volume 'local-zfs:vm-105-disk-0' is 'local-zfs:vm-105-disk-0' on the target
2023-08-23 10:30:05 full send of rpool/data/vm-105-disk-1@__migration__ estimated size is 13.9G
2023-08-23 10:30:05 total estimated size is 13.9G
2023-08-23 10:30:06 TIME        SENT   SNAPSHOT rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:06 10:30:06   27.5M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:07 10:30:07   59.1M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:08 10:30:08   90.8M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:09 10:30:09    123M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:10 10:30:10    154M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:11 10:30:11    186M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:13 10:30:12    218M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:14 10:30:13    249M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:15 10:30:15    281M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:16 10:30:16    313M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:17 10:30:17    345M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:18 10:30:18    376M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:19 10:30:19    408M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:20 10:30:20    439M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:21 10:30:21    471M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:22 10:30:22    494M   rpool/data/vm-105-disk-1@__migration__
...message repeats every second
2023-08-23 10:30:58 10:30:58    494M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:30:58 client_loop: send disconnect: Broken pipe
2023-08-23 10:30:58 command 'zfs send -Rpv -- rpool/data/vm-105-disk-1@__migration__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2023-08-23 10:30:58 ERROR: storage migration for 'local-zfs:vm-105-disk-1' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-105-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@10.2.2.62 -- pvesm import local-zfs:vm-105-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 255
2023-08-23 10:30:58 aborting phase 1 - cleanup resources
2023-08-23 10:30:58 ERROR: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@10.2.2.62 pvesm free local-zfs:vm-105-disk-0' failed: exit code 255
2023-08-23 10:30:58 ERROR: migration aborted (duration 00:00:55): storage migration for 'local-zfs:vm-105-disk-1' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-105-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@10.2.2.62 -- pvesm import local-zfs:vm-105-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 255
TASK ERROR: migration aborted
```

If run from the terminal, the output is as follows:

Code:
```
root@pve3:~# qm migrate 105 pve2
2023-08-23 10:35:28 use dedicated network address for sending migration traffic (10.2.2.62)
2023-08-23 10:35:29 starting migration of VM 105 to node 'pve2' (10.2.2.62)
2023-08-23 10:35:29 found local disk 'local-zfs:vm-105-disk-0' (attached)
2023-08-23 10:35:29 found local disk 'local-zfs:vm-105-disk-1' (attached)
2023-08-23 10:35:29 found generated disk 'local-zfs:vm-105-disk-2' (in current VM config)
2023-08-23 10:35:29 copying local disk images
2023-08-23 10:35:30 full send of rpool/data/vm-105-disk-0@__migration__ estimated size is 573K
2023-08-23 10:35:30 total estimated size is 573K
2023-08-23 10:35:30 volume 'rpool/data/vm-105-disk-0' already exists - importing with a different name
2023-08-23 10:35:30 successfully imported 'local-zfs:vm-105-disk-1'
2023-08-23 10:35:31 volume 'local-zfs:vm-105-disk-0' is 'local-zfs:vm-105-disk-1' on the target
2023-08-23 10:35:33 full send of rpool/data/vm-105-disk-1@__migration__ estimated size is 13.9G
2023-08-23 10:35:33 total estimated size is 13.9G
2023-08-23 10:35:33 volume 'rpool/data/vm-105-disk-1' already exists - importing with a different name
2023-08-23 10:35:35 TIME        SENT   SNAPSHOT rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:35:35 10:35:35   33.2M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:35:36 10:35:36   64.8M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:35:37 10:35:37   96.5M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:35:38 10:35:38    128M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:35:39 10:35:39    160M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:35:40 10:35:40    165M   rpool/data/vm-105-disk-1@__migration__
...message repeats every second
2023-08-23 10:36:08 10:36:08    165M   rpool/data/vm-105-disk-1@__migration__
2023-08-23 10:36:08 client_loop: send disconnect: Broken pipe
2023-08-23 10:36:09 command 'zfs send -Rpv -- rpool/data/vm-105-disk-1@__migration__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2023-08-23 10:36:10 ERROR: storage migration for 'local-zfs:vm-105-disk-1' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-105-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@10.2.2.62 -- pvesm import local-zfs:vm-105-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 255
2023-08-23 10:36:10 aborting phase 1 - cleanup resources
2023-08-23 10:36:11 ERROR: migration aborted (duration 00:00:43): storage migration for 'local-zfs:vm-105-disk-1' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-105-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@10.2.2.62 -- pvesm import local-zfs:vm-105-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 255
migration aborted
```

Code:
```erring-host
root@pve2:~# pveversion -v
proxmox-ve: 8.0.2 (running kernel: 6.2.16-10-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
proxmox-kernel-6.2.16-10-pve: 6.2.16-10
proxmox-kernel-6.2: 6.2.16-10
proxmox-kernel-6.2.16-6-pve: 6.2.16-7
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.4
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.7
libpve-guest-common-perl: 5.0.4
libpve-http-server-perl: 5.0.4
libpve-network-perl: 0.8.1
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.2-1
proxmox-backup-file-restore: 3.0.2-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.3
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-4
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
```

Code:
```source-host
root@pve3:~# pveversion -v
proxmox-ve: 8.0.2 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
proxmox-kernel-6.2.16-6-pve: 6.2.16-7
proxmox-kernel-6.2: 6.2.16-7
pve-kernel-6.2.16-4-pve: 6.2.16-5
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.4
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.7
libpve-guest-common-perl: 5.0.4
libpve-http-server-perl: 5.0.4
libpve-network-perl: 0.8.1
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.2-1
proxmox-backup-file-restore: 3.0.2-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.3
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-4
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
```
 
Hi,
always use apt dist-upgrade instead of apt upgrade, because the latter won't remove a package if that is required to properly upgrade the system as a whole. Is there anything in the system logs of the crashing node? How is the ZFS configured on that node?
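For reference, a minimal sketch of the kind of checks meant here (assuming it is run on the crashing node and that persistent journaling is enabled; adjust to your setup):

Code:
```
# Journal of the previous boot, jumping to the end (requires a persistent journal)
journalctl -b -1 -e

# Pool layout and health of the ZFS setup on this node
zpool status
zfs list -o name,used,avail,mountpoint
```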
 
The ZFS setup is unremarkable. The system has two drives: the OS is installed on the SATA drive and the NVMe drive is data only. Both disks are set up as single-disk ZFS stripes. This is the same setup on all three nodes.

Code:
zpool list
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
local-nvme  1.86T  5.04G  1.85T        -         -     0%     0%  1.00x    ONLINE  -
rpool        928G  14.6G   913G        -         -     2%     1%  1.00x    ONLINE  -

The logs don't provide anything helpful... there are no logs before it crashes; it just power-cycles.

Here is the log up to and right after the crash.

Code:
Aug 24 07:15:50 pve2 sshd[7266]: Accepted publickey for root from 10.1.1.63 port 53542 ssh2: RSA SHA256:xxxxxx
Aug 24 07:15:50 pve2 sshd[7266]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 24 07:15:50 pve2 systemd-logind[2229]: New session 13 of user root.
Aug 24 07:15:50 pve2 systemd[1]: Started session-13.scope - Session 13 of User root.
Aug 24 07:15:50 pve2 sshd[7266]: pam_env(sshd:session): deprecated reading of user environment enabled
Aug 24 07:16:04 pve2 zebra[2383]: [WPPMZ-G9797] if_zebra_speed_update: tap105i0 old speed: 0 new speed: 10000
Aug 24 07:16:41 pve2 pmxcfs[2896]: [dcdb] notice: data verification successful
Aug 24 07:16:41 pve2 pmxcfs[2896]: [dcdb] notice: data verification successful
Aug 24 07:17:01 pve2 CRON[86526]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 24 07:17:01 pve2 CRON[86527]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Aug 24 07:17:01 pve2 CRON[86526]: pam_unix(cron:session): session closed for user root
Aug 24 07:17:17 pve2 chronyd[2754]: Selected source 74.6.168.73 (2.debian.pool.ntp.org)
-- Boot 6c5fba39818b48f9bf6132ba07834dc5 --
Aug 24 07:18:36 pve2 kernel: Linux version 6.2.16-10-pve (wolfgangb@sbuild) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils>
Aug 24 07:18:36 pve2 kernel: Command line: initrd=\EFI\proxmox\6.2.16-10-pve\initrd.img-6.2.16-10-pve root=ZFS=rpool/ROOT/pve-1 >
Aug 24 07:18:36 pve2 kernel: KERNEL supported cpus:
...
 
The logs don't provide anything helpful... there are no logs before it crashes; it just power-cycles.
That's unfortunate. You could try connecting from another host via ssh and running journalctl --follow. If you are lucky, you can see more of the log there (it might never make it to disk).
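A minimal sketch of that, assuming the crashing node is pve2 and is reachable as root over SSH:

Code:
```
# Follow the crashing node's journal live from another host,
# so messages show up even if they never make it to disk there
ssh root@pve2 journalctl --follow
```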
 
I can't find anything in the logs, but I have the same problem.

Here is the migration log:

Code:
Proxmox Virtual Environment 8.1.3
Virtual Machine 101801 (ip-10-1-80-1) on node 'ip-10-1-131-1' (migrate)
2023-12-14 22:35:16 starting migration of VM 101801 to node 'ip-10-1-130-1' (10.1.130.1)
2023-12-14 22:35:16 found local disk 'npool:101801/vm-101801-disk-0.raw' (attached)
2023-12-14 22:35:16 found local disk 'npool:101801/vm-101801-disk-1.raw' (attached)
2023-12-14 22:35:16 found local disk 'npool:101801/vm-101801-disk-2.raw' (attached)
2023-12-14 22:35:16 starting VM 101801 on remote node 'ip-10-1-130-1'
2023-12-14 22:35:17 volume 'npool:101801/vm-101801-disk-0.raw' is 'hpool:101801/vm-101801-disk-0.raw' on the target
2023-12-14 22:35:17 volume 'npool:101801/vm-101801-disk-1.raw' is 'hpool:101801/vm-101801-disk-1.raw' on the target
2023-12-14 22:35:17 volume 'npool:101801/vm-101801-disk-2.raw' is 'hpool:101801/vm-101801-disk-2.raw' on the target
2023-12-14 22:35:17 start remote tunnel
2023-12-14 22:35:18 ssh tunnel ver 1
2023-12-14 22:35:18 starting storage migration
2023-12-14 22:35:18 scsi1: start migration to nbd:unix:/run/qemu-server/101801_nbd.migrate:exportname=drive-scsi1
drive mirror is starting for drive-scsi1
drive-scsi1: transferred 13.6 MiB of 16.0 GiB (0.08%) in 0s
drive-scsi1: transferred 414.0 MiB of 16.0 GiB (2.53%) in 1s
drive-scsi1: transferred 1.6 GiB of 16.0 GiB (9.99%) in 2s
drive-scsi1: transferred 2.5 GiB of 16.0 GiB (15.93%) in 3s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 4s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 5s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 6s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 7s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 8s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 9s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 10s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 11s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 12s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 13s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 14s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 15s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 16s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 17s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 18s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 19s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 20s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 21s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 22s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 23s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 24s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 25s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 26s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 27s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 28s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 29s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 30s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 31s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 32s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 33s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 34s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 35s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 36s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 37s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 38s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 39s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 40s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 41s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 42s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 43s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 44s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 45s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 46s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 47s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 48s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 49s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 50s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 51s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 52s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 53s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 54s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 55s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 56s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 57s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 58s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 59s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 1s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 2s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 3s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 4s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 5s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 6s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 7s
drive-scsi1: Cancelling block job
drive-scsi1: Done.
2023-12-14 22:36:26 ERROR: online migrate failure - block job (mirror) error: interrupted by signal
2023-12-14 22:36:26 aborting phase 2 - cleanup resources
2023-12-14 22:36:26 migrate_cancel
2023-12-14 22:36:26 ERROR: unable to open file '/etc/pve/nodes/ip-10-1-131-1/qemu-server/101801.conf.tmp.1087185' - Permission denied

The strange part I am seriously confused about: it seems that when I migrate my VM onto the other node, my whole corosync network crashes:

Code:
[  152.438453] ------------[ cut here ]------------
[  152.438462] NETDEV WATCHDOG: enp6s0f1 (e1000e): transmit queue 0 timed out 5824 ms
[  152.438475] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x260/0x270
[  152.438483] Modules linked in: xxhash_generic dm_crypt ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables softdog sunrpc binfmt_misc 8021q garp mrp bonding tls nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel aesni_intel eeepc_wmi crypto_simd asus_wmi ledtrig_audio sparse_keymap cryptd platform_profile rapl video wmi_bmof k10temp ccp pcspkr mac_hid drm efi_pstore vhost_net vhost vhost_iotlb tap dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c mlx4_ib ib_uverbs simplefb ib_core mlx4_en xhci_pci r8169 xhci_pci_renesas crc32_pclmul i2c_piix4 e1000e realtek mlx4_core ahci xhci_hcd libahci wmi gpio_amdpt
[  152.438548] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O       6.5.11-7-pve #1
[  152.438552] Hardware name: System manufacturer System Product Name/PRIME B450-PLUS, BIOS 3810 11/21/2022
[  152.438556] RIP: 0010:dev_watchdog+0x260/0x270
[  152.438560] Code: ff ff 48 89 df c6 05 77 3b 78 01 01 e8 b9 80 f9 ff 44 8b 45 cc 44 89 f9 48 89 de 48 89 c2 48 c7 c7 b0 9e c3 8c e8 70 ce 33 ff <0f> 0b e9 1d ff ff ff 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[  152.438565] RSP: 0018:ffffba66c028ce40 EFLAGS: 00010246
[  152.438569] RAX: 0000000000000000 RBX: ffff93c501600000 RCX: 0000000000000000
[  152.438571] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  152.438574] RBP: ffffba66c028ce78 R08: 0000000000000000 R09: 0000000000000000
[  152.438576] R10: 0000000000000000 R11: 0000000000000000 R12: ffff93c5016004c8
[  152.438579] R13: ffff93c50160041c R14: 0000000000000000 R15: 0000000000000000
[  152.438582] FS:  0000000000000000(0000) GS:ffff93c81ea40000(0000) knlGS:0000000000000000
[  152.438585] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  152.438588] CR2: 000055a152cc1068 CR3: 000000010eefc000 CR4: 0000000000350ee0
[  152.438591] Call Trace:
[  152.438593]  <IRQ>
[  152.438596]  ? show_regs+0x6d/0x80
[  152.438601]  ? __warn+0x89/0x160
[  152.438605]  ? dev_watchdog+0x260/0x270
[  152.438608]  ? report_bug+0x17e/0x1b0
[  152.438613]  ? handle_bug+0x46/0x90
[  152.438617]  ? exc_invalid_op+0x18/0x80
[  152.438619]  ? asm_exc_invalid_op+0x1b/0x20
[  152.438625]  ? dev_watchdog+0x260/0x270
[  152.438628]  ? __pfx_dev_watchdog+0x10/0x10
[  152.438630]  call_timer_fn+0x2c/0x160
[  152.438634]  ? __pfx_dev_watchdog+0x10/0x10
[  152.438637]  __run_timers+0x259/0x310
[  152.438642]  run_timer_softirq+0x1d/0x40
[  152.438645]  __do_softirq+0xd4/0x303
[  152.438649]  __irq_exit_rcu+0x75/0xa0
[  152.438652]  irq_exit_rcu+0xe/0x20
[  152.438654]  sysvec_apic_timer_interrupt+0x92/0xd0
[  152.438658]  </IRQ>
[  152.438660]  <TASK>
[  152.438661]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
[  152.438665] RIP: 0010:cpuidle_enter_state+0xce/0x470
[  152.438669] Code: 28 10 ff e8 64 f6 ff ff 8b 53 04 49 89 c6 0f 1f 44 00 00 31 ff e8 22 25 0f ff 80 7d d7 00 0f 85 e7 01 00 00 fb 0f 1f 44 00 00 <45> 85 ff 0f 88 83 01 00 00 49 63 d7 4c 89 f1 48 8d 04 52 48 8d 04
[  152.438674] RSP: 0018:ffffba66c015fe50 EFLAGS: 00000246
[  152.438677] RAX: 0000000000000000 RBX: ffff93c5042f3000 RCX: 0000000000000000
[  152.438679] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[  152.438682] RBP: ffffba66c015fe88 R08: 0000000000000000 R09: 0000000000000000
[  152.438684] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[  152.438687] R13: ffffffff8d677c60 R14: 000000237e09c419 R15: 0000000000000002
[  152.438691]  cpuidle_enter+0x2e/0x50
[  152.438695]  call_cpuidle+0x23/0x60
[  152.438699]  do_idle+0x202/0x260
[  152.438702]  cpu_startup_entry+0x2a/0x30
[  152.438705]  start_secondary+0x119/0x140
[  152.438709]  secondary_startup_64_no_verify+0x17e/0x18b
[  152.438714]  </TASK>
[  152.438716] ---[ end trace 0000000000000000 ]---
[  152.438727] e1000e 0000:06:00.1 enp6s0f1: Reset adapter unexpectedly
[  153.263542] vmbr12: port 1(enp6s0f1) entered disabled state
[  187.619701] vmbr1001: port 1(tap4002i3) entered disabled state
[  187.712840] tap4002i3: left allmulticast mode
[  187.712865] vmbr1001: port 1(tap4002i3) entered disabled state
[  187.969352] vmbr21: port 2(tap4002i2) entered disabled state
[  187.969861] vmbr21: port 1(enp5s0) entered disabled state
[  188.036487] tap4002i2: left allmulticast mode
[  188.036510] vmbr21: port 2(tap4002i2) entered disabled state
[  188.100600] mlx4_core 0000:05:00.0 enp5s0: left promiscuous mode
[  188.100674] mlx4_core 0000:05:00.0 enp5s0: left allmulticast mode
[  188.100680] vmbr21: port 1(enp5s0) entered disabled state
[  188.197011] mlx4_en: enp5s0: Close port called
[  188.235564] mlx4_en: enp5s0: Link Down
[  188.381386] vmbr11: port 2(tap4002i0) entered disabled state
[  188.381876] vmbr11: port 1(enp6s0f0) entered disabled state
[  188.452446] tap4002i0: left allmulticast mode
[  188.452467] vmbr11: port 2(tap4002i0) entered disabled state
[  188.501696] e1000e 0000:06:00.0 enp6s0f0: left promiscuous mode
[  188.501812] e1000e 0000:06:00.0 enp6s0f0: left allmulticast mode
[  188.660835] vmbr11: port 1(enp6s0f0) entered disabled state
[  189.547603] e1000e 0000:06:00.0 enp6s0f0: NIC Link is Down
[  189.701730] vmbr12: port 2(tap4002i1) entered disabled state
[  189.772696] tap4002i1: left allmulticast mode
[  189.772715] vmbr12: port 2(tap4002i1) entered disabled state
[  189.837873] e1000e 0000:06:00.1 enp6s0f1: left promiscuous mode
[  189.837959] e1000e 0000:06:00.1 enp6s0f1: left allmulticast mode
[  189.991574] vmbr12: port 1(enp6s0f1) entered disabled state
[  190.269041] e1000e 0000:06:00.1 enp6s0f1: NIC Link is Down
[  190.973076] vmbr21: port 1(enp5s0) entered blocking state
[  190.973084] vmbr21: port 1(enp5s0) entered disabled state
[  190.973098] mlx4_core 0000:05:00.0 enp5s0: entered allmulticast mode
[  191.003182] mlx4_en: enp5s0: Steering Mode 1
[  191.017173] mlx4_en: enp5s0: Link Down
[  191.017227] 8021q: adding VLAN 0 to HW filter on device enp5s0
[  191.057352] mlx4_en: enp5s0: Link Up
[  191.058038] vmbr21: port 1(enp5s0) entered blocking state
[  191.058042] vmbr21: port 1(enp5s0) entered forwarding state
[  191.204610] vmbr11: port 1(enp6s0f0) entered blocking state
[  191.204618] vmbr11: port 1(enp6s0f0) entered disabled state
[  191.204632] e1000e 0000:06:00.0 enp6s0f0: entered allmulticast mode
[  191.358949] e1000e 0000:06:00.0 enp6s0f0: MSI interrupt test failed, using legacy interrupt.
[  191.359475] 8021q: adding VLAN 0 to HW filter on device enp6s0f0
[  191.572422] vmbr12: port 1(enp6s0f1) entered blocking state
[  191.572431] vmbr12: port 1(enp6s0f1) entered disabled state
[  191.572488] e1000e 0000:06:00.1 enp6s0f1: entered allmulticast mode
[  191.726996] e1000e 0000:06:00.1 enp6s0f1: MSI interrupt test failed, using legacy interrupt.
[  191.727470] 8021q: adding VLAN 0 to HW filter on device enp6s0f1

When dmesg shows this, we are done; the next stop is rebooting the failed node. I have never seen a network card crash because of an incoming migration. This will be a difficult case to debug, I guess. journalctl -f didn't help either.

Ironically, the node is still reachable via the "management" IP, but the corosync interfaces are dead.

Before that I had run migration and corosync over the same network. This has proven not to be the problem in my case. My management sits on the single 10 Gbit interface "vmbr21", and my corosync runs on my dual-port 1 Gbit network card, with one VLAN interface each on "vmbr11" and "vmbr12". So they are separated as well as possible.

Both my nodes run the latest Proxmox kernel (6.5.11) and release (8.1.3).
 
It appears to no longer be happening for me now. I just moved 30 or so GB of different VMs back and forth with no issues.

CPU(s) 16 x AMD Ryzen 7 5800H with Radeon Graphics (1 Socket)
Kernel Version Linux 6.5.11-7-pve (2023-12-05T09:44Z)
Boot Mode EFI
Manager Version pve-manager/8.1.3/b46aac3b42da5d15
 
Hi,
Code:
[  152.438727] e1000e 0000:06:00.1 enp6s0f1: Reset adapter unexpectedly
[  153.263542] vmbr12: port 1(enp6s0f1) entered disabled state
[  187.619701] vmbr1001: port 1(tap4002i3) entered disabled state

When dmesg shows this, we are done; the next stop is rebooting the failed node. I have never seen a network card crash because of an incoming migration. This will be a difficult case to debug, I guess. journalctl -f didn't help either.
that does sound very much like an issue with the network card/driver. Not sure if the same (hardware) issue as reported here, but you might want to give disabling segmentation offloading a try: https://forum.proxmox.com/threads/e1000-driver-hang.58284/page-7#post-368615
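Disabling segmentation offloading is usually done with ethtool; a sketch, assuming enp6s0f1 is the affected NIC (the setting does not survive a reboot unless added to the interface configuration):

Code:
```
# Turn off TCP and generic segmentation offload on the suspect NIC (interface name is an assumption)
ethtool -K enp6s0f1 tso off gso off

# Verify the resulting offload settings
ethtool -k enp6s0f1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'
```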

Code:
2023-12-14 22:35:16 starting migration of VM 101801 to node 'ip-10-1-130-1' (10.1.130.1)
So this IP belongs to the management network and has nothing to do with enp6s0f1?

Code:
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 5s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 6s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 7s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 8s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 9s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 10s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 11s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 12s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 13s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 14s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 15s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 16s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 17s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 18s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 19s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 20s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 21s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 22s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 23s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 24s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 25s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 26s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 27s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 28s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 29s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 30s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 31s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 32s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 33s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 34s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 35s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 36s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 37s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 38s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 39s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 40s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 41s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 42s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 43s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 44s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 45s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 46s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 47s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 48s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 49s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 50s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 51s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 52s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 53s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 54s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 55s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 56s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 57s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 58s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 59s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 1s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 2s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 3s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 4s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 5s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 6s
drive-scsi1: transferred 2.8 GiB of 16.0 GiB (17.24%) in 1m 7s
Is there enough free space on the target? What does the load on the target look like during migration? You could also try setting a bandwidth limit.
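A bandwidth limit can be set per migration or cluster-wide; a sketch with placeholder values, reusing the VM ID and target node from the log above:

Code:
```
# One-off: cap this migration at roughly 100 MiB/s (bwlimit is given in KiB/s)
qm migrate 101801 ip-10-1-130-1 --online --bwlimit 102400

# Or set a cluster-wide default in /etc/pve/datacenter.cfg (also editable via Datacenter -> Options):
# bwlimit: migration=102400
```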
 
Hi Fiona,

thanks for your answer, I was able to figure out some things.

First of all I had a design flaw in my networking.

One node has two onboard Gbit NICs and the other node has a dual-port Gbit PCIe card.
Each NIC is assigned to a dedicated bridge (vmbr11, vmbr12).
Each bridge has a VLAN interface for corosync (vlan124, vlan125).
With that I have "dedicated", "redundant" corosync communication; we'll just ignore the fact that it's a single network card, I know.

Then each node also has a 10 Gbit card with its own bridge (vmbr21).
And here was my flaw: the bridge holds vlan92 (10.1.130.1/24) on node0 and vlan93 (10.1.131.1/24) on node1 with the management IP, which is routed. Therefore, if I migrate my routing VM, that might explain a crash.

That's why I configured a dedicated vlan126 (192.168.254.8/30) interface on vmbr21 on both nodes, which is not routed, and configured it as the migration network in the datacenter options.
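For reference, a sketch of what that ends up looking like in /etc/pve/datacenter.cfg (CIDR taken from the paragraph above; adjust to your own network):

Code:
```
# /etc/pve/datacenter.cfg -- send migration traffic over the dedicated, non-routed VLAN
migration: secure,network=192.168.254.8/30
```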

After that a few migrations worked flawlessly, but now I have the same situation again: my node just crashes. I also applied the "workaround" from the post you mentioned.

Storage is not an issue at all here. Both nodes have plenty of space, probably way too much for a private household, lol.

I just have a curious theory: might it be that our good friend io_uring is causing problems?
 
Hi,
always use apt dist-upgrade instead of apt upgrade, because the latter won't remove a package if that is required to properly upgrade the system as a whole. Is there anything in the system logs of the crashing node? How is the ZFS configured on that node?

So this is "old" but "general" piece of advice, I just hoped anyone coming across this takes note of the man apt-get and upgrade vs dist-upgrade:

upgrade: ... Packages currently installed with new versions available are retrieved and upgraded; under no circumstances are currently installed packages removed, nor are packages that are not already installed retrieved and installed. New versions of currently installed packages that cannot be upgraded without changing the install status of another package will be left at their current version. ...

dist-upgrade: In addition to performing the function of upgrade, this option also intelligently handles changing dependencies with new versions of packages; apt-get has a "smart" conflict resolution system, and it will attempt to upgrade the most important packages at the expense of less important ones, if necessary.

There's a place for both; sometimes having a package unexpectedly removed is the last thing one wants.
 
There's a place for both; sometimes having a package unexpectedly removed is the last thing one wants.
No, with Proxmox VE you can break your system when just using apt upgrade, just search the forum. Of course you should check which packages apt will remove, it will tell you.
 
After that a few migrations worked flawlessly, but now I have the same situation again: my node just crashes. I also applied the "workaround" from the post you mentioned.
An actual crash or again the issue with the network device/driver? Did it happen again during a migration?
Storage is not an issue at all here. Both nodes have plenty of space, probably way too much for a private household, lol.

I just have a curious theory: might it be that our good friend io_uring is causing problems?
Why do you think so? It's true that io_uring caused some issues when initially introduced and it's still deactivated by default for certain storage types, but I'm not aware of any recent reports of it causing such trouble. You can of course attempt turning it off on the VM's disks and see if that helps.
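For a single disk this can be changed in the Hard Disk's Advanced options in the GUI, or on the CLI; a sketch, assuming VM 101801 with a scsi1 disk on storage hpool (volume name and IDs are taken from the log above, adjust as needed; the change only takes effect after a full stop/start of the VM):

Code:
```
# Switch the disk's async I/O mode away from io_uring (aio=native requires cache=none or directsync)
qm set 101801 --scsi1 hpool:101801/vm-101801-disk-1.raw,aio=native,cache=directsync
```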
 
No, with Proxmox VE you can break your system when just using apt upgrade, just search the forum. Of course you should check which packages apt will remove, it will tell you.
Alright. I didn't expect this behaviour, but I have also never gone through a major release upgrade. Is this something PVE-specific? I thought one could safely upgrade as long as one stays on one major version, then use the upgrade scripts (which would do dist-upgrade) for jumping from major to major.
 
Alright. I didn't expect this behaviour, but I have also never gone through a major release upgrade. Is this something PVE-specific? I thought one could safely upgrade as long as one stays on one major version, then use the upgrade scripts (which would do dist-upgrade) for jumping from major to major.
AFAIK, Debian guarantees that during minor releases, it's enough to do apt upgrade, but Proxmox VE sometimes does package changes which are incompatible with such a guarantee.

Sure you can stay on the same major version and use apt dist-upgrade while doing so. That will not automatically jump to a new major version. For that you first need to explicitly change the code name in your repository configuration.
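For example, for the 7-to-8 jump that code name change is the bullseye-to-bookworm rename in the APT sources; a rough sketch (file names are an assumption, check the official upgrade guide and the pve7to8 checker first):

Code:
```
# Point the Debian and Proxmox VE repositories at the new code name (file list is an assumption)
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list /etc/apt/sources.list.d/pve-*.list

apt update && apt dist-upgrade
```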
 
An actual crash or again the issue with the network device/driver? Did it happen again during a migration?
At least I have no indication that my network falls apart anymore. The node is just "gone", and journalctl has nothing to offer.

Why do you think so? It's true that io_uring caused some issues when initially introduced and it's still deactivated by default for certain storage types, but I'm not aware of any recent reports of it causing such trouble. You can of course attempt turning it off on the VM's disks and see if that helps.
I am conducting some tests right now. Two VMs were reconfigured from no_cache/io_uring to direct_sync/native, and I can throw them back and forth with no problems.

This is really strange, but I am not a Proxmox scientist, so I can't really prove my theory with facts.

At least this is my outcome for now. I also don't know what impact direct_sync/native has compared to no_cache/io_uring with regard to NVMe/SSD.
 
I have played a lot of ping-pong now, and the node no longer crashes if io_uring is not used / is disabled in the VM.

I am open to tips and instructions for debugging this issue.
 
I am really confused. When I use writethrough/threads, my node crashes immediately as well. Only direct_sync/native works without crashing. The logs definitely indicate that the inbound node just crashes without any notice.
 
Just to clarify for some of the Linux newbies likely to find this: your statement is true but potentially dangerous if the reader doesn't understand the mechanism.

First, `apt dist-upgrade` is deprecated and is now `apt full-upgrade`, a direct replacement; it will still work both ways, probably for the foreseeable future. The Proxmox Backup Server upgrade instructions refer to `apt full-upgrade`; kudos to the documentation wizards for staying up to date!

According to `man apt`:

```
upgrade (apt-get(8))
upgrade is used to install available upgrades of all packages currently installed on the
system from the sources configured via sources.list(5). New packages will be installed if
required to satisfy dependencies, but existing packages will never be removed. If an
upgrade for a package requires the removal of an installed package the upgrade for this
package isn't performed.

full-upgrade (apt-get(8))
full-upgrade performs the function of upgrade but will remove currently installed packages
if this is needed to upgrade the system as a whole.
```

`apt upgrade` is a *safer* upgrade specifically because it doesn't remove packages. But it could leave packages in an unstable state.
`apt dist-upgrade` is more aggressive and will do things like upgrade the kernel and kind of strong-arm any dependencies.

Long story short, there is no reason not to use `apt upgrade` if you are just looking for updates. But if you find things are held back or you want to upgrade your kernel, etc., then `apt dist-upgrade` is what you want. Stick to what [the Proxmox Upgrade Docs](https://pve.proxmox.com/pve-docs/pve-admin-guide.html#system_software_updates) tell you to do, which is `apt dist-upgrade`.


Excellent TL;DR

I stopped there because I find it a logical bug of PVE that it publishes updates within minor version changes that require dist-upgrade (full-upgrade per the new nomenclature). I do understand dist-upgrade never went on to update the sources list for me, but it used to be called that precisely because it was meant to be necessary only during distribution upgrades; otherwise, what's the point of the distinction between minor and major versions?

But this thread was not about that; PVE has been one spaghetti bowl in this respect since day one and never got fixed, and so is the advice of staff here to newbies, which is why I stopped right there.
 
I give up. I destroyed the cluster and turned off my secondary node. The command pvecm delnode even sent the inbound node to its grave, and on top of that, the inbound node still thinks it is a member of the cluster....

It makes no sense operating such a fragile setup. Normally updates with Proxmox run so smoothly, but right now I am not sure what's going on.

I also downgraded to pve-qemu-kvm=8.1.2-4, which sends the inbound node to its grave even faster than pve-qemu-kvm=8.1.2-5 with io_uring. I can also say that direct_sync/native has a performance impact, which is not nice. I would really like to know what is going on here. Something weird has been happening since I went to the latest 8.1.3.
 
I give up. I destroyed the cluster and turned off my secondary node. The command pvecm delnode even sent the inbound node to its grave, and on top of that, the inbound node still thinks it is a member of the cluster....

It makes no sense operating such a fragile setup. Normally updates with Proxmox run so smoothly, but right now I am not sure what's going on.

I also downgraded to pve-qemu-kvm=8.1.2-4, which sends the inbound node to its grave even faster than pve-qemu-kvm=8.1.2-5 with io_uring. I can also say that direct_sync/native has a performance impact, which is not nice. I would really like to know what is going on here. Something weird has been happening since I went to the latest 8.1.3.

I am sorry for not contributing anything useful to you. The thing with tracing (potential) bugs is to try to make them reproducible, with the other side following along and reproducing the issue. Without that and a good log record (e.g. with debug mode on), it's all flying in the dark (also for the future).
 
Hi,
Long story short, there is no reason not to use `apt upgrade` if you are just looking for updates.
this is just not true for Proxmox VE. It does not have the same packaging guarantees as Debian. Repeating what I already said:
No, with Proxmox VE you can break your system when just using apt upgrade, just search the forum. Of course you should check which packages apt will remove, it will tell you.
E.g.: https://www.reddit.com/r/Proxmox/comments/ujqig9/use_apt_distupgrade_or_the_gui_not_apt_upgrade/
 
