This is an odd one, but the tl;dr: a linked clone works, while a full clone crashes the host.
Diagnostic commands run (full output below):
- pveversion -v
- head -n -0 /etc/apt/sources.list /etc/apt/sources.list.d/*
- apt update && apt -d -y full-upgrade
Steps I've taken:
- Confirmed the same template exists on multiple nodes
- Tested linked clone and full clone on multiple nodes
- Removed and re-created the template on the affected node - no difference, it still crashes on full clone
- Nodes range from VE 8.3 to 8.3.2; the affected node is fully up to date
- I have two nodes with nearly identical specifications; the issue occurs on only one of them
- There is one VM on the affected node that was created as a full clone; the same error occurred and the node crashed, but that VM does work. No further full clones could be spawned.
Affected node hardware:
- AMD Ryzen 5 5600U
- 4TB TeamGroup MP44 NVMe
- 64GB DDR4 G.Skill RAM
pveversion -v
Code:
pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-5-pve)
pve-manager: 8.3.2 (running version: 8.3.2/3e76eec21c4a14a7)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-5
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1
Code:
head -n -0 /etc/apt/sources.list /etc/apt/sources.list.d/*
==> /etc/apt/sources.list <==
deb http://ftp.us.debian.org/debian bookworm main contrib
deb http://ftp.us.debian.org/debian bookworm-updates main contrib
# security updates
deb http://security.debian.org bookworm-security main contrib
deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription
==> /etc/apt/sources.list.d/ceph.list <==
# deb https://enterprise.proxmox.com/debian/ceph-quincy bookworm enterprise
==> /etc/apt/sources.list.d/ookla_speedtest-cli.list <==
# this file was generated by packagecloud.io for
# the repository at https://packagecloud.io/ookla/speedtest-cli
deb [signed-by=/etc/apt/keyrings/ookla_speedtest-cli-archive-keyring.gpg] https://packagecloud.io/ookla/speedtest-cli/debian/ bookworm main
deb-src [signed-by=/etc/apt/keyrings/ookla_speedtest-cli-archive-keyring.gpg] https://packagecloud.io/ookla/speedtest-cli/debian/ bookworm main
==> /etc/apt/sources.list.d/pve-enterprise.list <==
# deb https://enterprise.proxmox.com/debian/pve bookworm pve-enterprise
Code:
apt update && apt -d -y full-upgrade
Hit:1 http://security.debian.org bookworm-security InRelease
Hit:2 http://ftp.us.debian.org/debian bookworm InRelease
Hit:3 http://ftp.us.debian.org/debian bookworm-updates InRelease
Hit:4 http://download.proxmox.com/debian/pve bookworm InRelease
Hit:5 https://packagecloud.io/ookla/speedtest-cli/debian bookworm InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
The closest node that matches this one (also a 5600U) is running an NVMe that claims to be an MP44, but TeamGroup sent me back a different NVMe than the one I RMA'd.
I have other machines running MP44s in LVM mode and several others running MP44s in a ZFS mirror.
Steps to reproduce:
- Create a full clone of any template
- The system crashes and SSH becomes inoperable; reboot fails because the clone task won't yield to a reboot command
- After power-cycling the server, the template's task log shows Error: Unexpected Status
- Any VMs that were already present and running boot/start as normal
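For anyone trying to reproduce: the clone can also be issued from the CLI, which takes the GUI and HostBill out of the picture. A minimal sketch, assuming template VMID 9000 and local-lvm storage (both placeholders, not from this node):

```shell
# Linked clone - completes normally, even on the affected node
qm clone 9000 101 --name test-linked

# Full clone - on the affected node this is the operation that hangs the host;
# the copy is performed by the qemu-img process that later becomes unkillable
qm clone 9000 102 --name test-full --full --storage local-lvm
```

If the CLI full clone hangs the same way, the CRM layer is ruled out and the problem sits in the storage copy itself.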
Most of my Proxmox nodes are controlled by HostBill, which gives me a full CRM that lets me provision packages and lets friends spawn their own servers on free credit. This issue was found after a server went down when HostBill attempted a routine provision and the Proxmox VE host crashed.
Typically there is a qemu-img process left running that does no I/O but consumes 100% of one CPU core and cannot be stopped: pkill has no effect, and neither does kill -9.
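A process that survives kill -9 is almost always stuck in uninterruptible sleep ("D" state), blocked inside the kernel - usually on storage I/O. A sketch of how to confirm that next time it happens (the pgrep pattern assumes the stuck process is still named qemu-img):

```shell
# Find the stuck copy process and inspect its scheduler state
pid=$(pgrep -o qemu-img)
if [ -n "$pid" ]; then
    # STAT "D" = uninterruptible sleep; WCHAN = the kernel function it waits in
    ps -o pid,stat,wchan:32,cmd -p "$pid"
    # Kernel stack of the stuck task (needs root)
    cat "/proc/$pid/stack" 2>/dev/null
fi
# The kernel logs hung-task warnings for threads blocked longer than ~120s
dmesg 2>/dev/null | grep -iA 5 'hung task' | tail -n 40
```

If the process shows D state with a WCHAN down in the block layer or ZFS/LVM code paths, the hang is below QEMU, pointing at the storage stack or the drive rather than the clone logic.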
I'm willing to try any advice on how to fix this. I'm fairly sure reinstalling Proxmox would fix it, but I'd hate to hit this bug while hosting dozens of VMs on the node (and if it were my only node, as it is in a lot of homelabs). I mainly want to figure out what's going on so it can be prevented in the future. I haven't ruled out NVMe or other hardware issues, so anything that helps rule those out is also welcome.
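To help rule the NVMe in or out - including whether the RMA replacement really identifies as an MP44 - smartmontools is already installed per the package list above. A sketch, assuming the drive is /dev/nvme0 (adjust to whatever lsblk shows):

```shell
# Identity: does the replacement drive actually report an MP44 model string?
smartctl -i /dev/nvme0

# Full SMART/health data: media errors, percentage used, temperature
smartctl -a /dev/nvme0

# NVMe error information log entries
smartctl -l error /dev/nvme0

# Controller resets or I/O timeouts the kernel has logged for the drive
dmesg 2>/dev/null | grep -i nvme | tail -n 30
```

Media errors, a climbing error-log count, or nvme controller resets in dmesg would all point at the drive rather than Proxmox.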