Proxmox VE full node crash on full-clone operation.

ZizzyDizzyMC

This is an odd one, but the tl;dr is: a linked clone works, a full clone crashes the host.

Steps I've taken:
  1. Confirmed the same template exists on multiple nodes.
  2. Tested linked clone and full clone on multiple nodes (CLI sketch after this list).
  3. Removed and re-created the template on the affected node - no difference, it still crashes on full clone.
  4. The nodes range from VE 8.3 to 8.3.2 - the affected node is fully up to date.
  5. I have two nodes with nearly identical specifications; the issue only occurs on one of them.
  6. There is one VM on the affected node that was created as a full clone: the error still appeared and the host crashed, but that VM does work. No further VMs could be spawned as full clones.
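For reference, this is roughly how the clones were tested from the CLI; the VMIDs and storage name below are placeholders, not the exact commands used:

Code:
# linked clone of template 9000 (works)
qm clone 9000 101 --name test-linked
# full clone of the same template (this is what crashes the host)
qm clone 9000 102 --name test-full --full --storage local-lvm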
Affected node specs:
  • AMD Ryzen 5 5600U
  • 4TB Teamgroup MP44 NVMe
  • 64GB DDR4 G.Skill RAM
Helpful command outputs from the affected node:
pveversion -v
Code:
pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-5-pve)
pve-manager: 8.3.2 (running version: 8.3.2/3e76eec21c4a14a7)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-5
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1
head -n -0 /etc/apt/sources.list /etc/apt/sources.list.d/*
Code:
head -n -0 /etc/apt/sources.list /etc/apt/sources.list.d/*
==> /etc/apt/sources.list <==
deb http://ftp.us.debian.org/debian bookworm main contrib

deb http://ftp.us.debian.org/debian bookworm-updates main contrib

# security updates
deb http://security.debian.org bookworm-security main contrib

deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription


==> /etc/apt/sources.list.d/ceph.list <==
# deb https://enterprise.proxmox.com/debian/ceph-quincy bookworm enterprise


==> /etc/apt/sources.list.d/ookla_speedtest-cli.list <==
# this file was generated by packagecloud.io for
# the repository at https://packagecloud.io/ookla/speedtest-cli

deb [signed-by=/etc/apt/keyrings/ookla_speedtest-cli-archive-keyring.gpg] https://packagecloud.io/ookla/speedtest-cli/debian/ bookworm main
deb-src [signed-by=/etc/apt/keyrings/ookla_speedtest-cli-archive-keyring.gpg] https://packagecloud.io/ookla/speedtest-cli/debian/ bookworm main

==> /etc/apt/sources.list.d/pve-enterprise.list <==
# deb https://enterprise.proxmox.com/debian/pve bookworm pve-enterprise
apt update && apt -d -y full-upgrade
Code:
apt update && apt -d -y full-upgrade
Hit:1 http://security.debian.org bookworm-security InRelease
Hit:2 http://ftp.us.debian.org/debian bookworm InRelease
Hit:3 http://ftp.us.debian.org/debian bookworm-updates InRelease
Hit:4 http://download.proxmox.com/debian/pve bookworm InRelease
Hit:5 https://packagecloud.io/ookla/speedtest-cli/debian bookworm InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

The closest matching node (also a 5600U) is running an NVMe that claims to be an MP44, but Teamgroup sent me back a different NVMe than the one I RMA'd.
I have other machines running MP44s in LVM mode and several others running MP44s in ZFS mirrors.

Steps to reproduce:

  • Create a full clone of any template.
  • The system crashes and SSH becomes inoperable; a reboot fails because the clone task won't yield to a `reboot now` command (see the forced-reboot sketch after this list).
  • After power cycling the server, the task log for the template says Error: Unexpected Status.
  • Any VMs that were already present and running boot/start as normal.
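For anyone else who hits this: a SysRq-forced reboot is roughly equivalent to the power cycle mentioned above. A sketch, assuming SysRq isn't already enabled:

Code:
# enable the magic SysRq trigger, then force a sync and an immediate reboot
echo 1 > /proc/sys/kernel/sysrq
echo s > /proc/sysrq-trigger   # emergency sync
echo b > /proc/sysrq-trigger   # immediate reboot, skips clean shutdown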
I'm at a loss; I'm not sure what happened to this poor node. I've already tried `dpkg-reconfigure tzdata`, since the symptoms initially matched those of an old but still relevant bug: https://forum.proxmox.com/threads/command-systemctl-show-pve-cluster-failed-exit-code-1.124810/

Most of my Proxmox nodes are controlled by Hostbill, which gives me a full CRM that lets me provision packages and lets friends spawn their own servers on free credit. This issue was found after a server went down when Hostbill tried to do a routine provision and Proxmox VE crashed.

Typically there's a qemu-img process running that does no I/O but consumes 100% of one CPU core and can't be stopped: pkill does not work, and neither does kill -9.
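A cheap sanity check on that stuck qemu-img is to look at its process state and kernel stack; since it's burning CPU rather than waiting on I/O it probably isn't in D state, but the check takes seconds (replace <pid> with the actual PID):

Code:
# STAT column: 'R' = running/spinning, 'D' = uninterruptible sleep
ps -o pid,stat,wchan:32,args -C qemu-img
# kernel-side stack of the process, needs root
cat /proc/<pid>/stack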

I'm willing to try any advice on how to fix this. I'm pretty sure reinstalling Proxmox would fix it, but I'd hate to run into this bug again if I were hosting dozens of VMs on this node (and it was my only node, as in a lot of homelabs). I really just want to figure out what's going on so it can be prevented in the future. I haven't ruled out NVMe issues or other hardware problems - anything to help rule those out is also helpful.
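For a first pass at ruling out the NVMe, the SMART data is quick to check; something like the following (the device path is an assumption, verify it with lsblk first):

Code:
# overall health, media errors, percentage used, unsafe shutdowns
smartctl -a /dev/nvme0
# same data via nvme-cli, if that package is installed
nvme smart-log /dev/nvme0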
 
I've reinstalled Proxmox fresh a few times on identical hardware and cannot reproduce the issue; however, the affected box still doesn't work. I'm waiting on a new NVMe so I can reinstall on the affected box and see whether it's reproducible with the hardware, or whether the Proxmox installation itself is at fault.

Any suggestions welcome.
 
Hello,

Do you see any message in the system logs before or during the crash? You can get the last lines of the previous boot via

Code:
journalctl -e -b -1

The `-b` flag accepts a numerical parameter: -1, for example, gives you the logs for the previous boot and -2 the boot before that. `-e` sends you to the end of the logs (for the boot specified with `-b`).
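If the previous boot's log is long, it can also help to narrow it to kernel messages at warning priority or above, for example:

Code:
journalctl -b -1 -k -p warning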
 
This took me a while to get back to. I had seen 'socat' in the process list just before it would crash, so I thought the box was compromised, not knowing that socat is used for viewing consoles. Well, now that I've had time to investigate and watch in detail what was going on with a monitor plugged in locally, I have good news!

Well, good news for Proxmox. It would segfault on CPU0, CPU5, etc. on the full-clone operation. I decided to go back to basics and hit it with Memtest86+, and it failed.
I tested three kits (the original set, an identical set, and a smaller set): the original failed consistently, while the alternate kits passed.
With a passing alternate kit installed, booting all the way and performing a full clone works as expected.

A new symptom of bad RAM, I guess. It's the first kit of RAM I've had fail in a decade. A pair of G.Skill Ripjaws 3200 32GB SODIMMs is on RMA now.

For other folks who might happen upon this thread: the bad sticks DO pass in another laptop with a similar CPU (5800H vs 5600U). I talked with an engineer and we determined that the flaky nature of the RAM is likely being compensated for inside the laptop via auto-adjusted timings during RAM training. The laptop was considerably more expensive and better quality (Dell), and likely also has better trace quality/placement, less crosstalk, etc. A multitude of factors.

The issues were also recorded to dmesg in real time, which is how I spotted them.
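For reference, one way to watch the kernel log live while kicking off a full clone (either form works; the second reads from the journal):

Code:
dmesg -wH
# or
journalctl -kf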