Select VMs fail to start after latest update

jonathan.young
I've run into a weird problem recently: some of my VMs fail to start after updating Proxmox. Most start fine, but two always fail. The strange part is that if I copy them over to another Proxmox machine (which has not been updated), they start without any problem, although I do have to make a couple of config changes (sketched below).
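For context, the config changes on the second machine amount to swapping the passed-through SR-IOV VF for a VirtIO NIC, since that host has no X710. Roughly, with qm (assuming the VF sits on hostpci0 and the bridge is vmbr0; VMID 4800 as in the config further down):
Code:
# drop the passed-through VF and attach a paravirtualized NIC instead
qm set 4800 --delete hostpci0
qm set 4800 --net0 virtio,bridge=vmbr0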

The updated machine is a Ryzen 3900X with 128 GB RAM and an Intel X710 NIC; an SR-IOV VF is passed through to each VM (VF setup sketched after the package list). Everything worked perfectly before the update. Package versions for this machine are:
Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.85-1-pve)
pve-manager: 7.3-6 (running version: 7.3-6/723bb6ec)
pve-kernel-helper: 7.3-7
pve-kernel-5.15: 7.3-3
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-6
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.5.5
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20221111-1
pve-firewall: 4.2-7
pve-firmware: 3.6-4
pve-ha-manager: 3.5.1
pve-i18n: 2.8-3
pve-qemu-kvm: 7.2.0-7
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
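For reference, the VFs are created via sysfs and handed to the VMs with qm, roughly like this (interface name enp1s0f0 and PCI address 0000:01:02.0 are placeholders; check ip link and lspci on your system):
Code:
# create two VFs on the X710 port (i40e driver)
echo 2 > /sys/class/net/enp1s0f0/device/sriov_numvfs
# list the VF PCI addresses
lspci -nn | grep -i "virtual function"
# pass a VF through to a VM
qm set 4800 --hostpci0 0000:01:02.0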

The non-updated machine is an Intel i7-4770 with 32 GB RAM, using VirtIO for the VM NICs. Package versions for this machine are:
Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.85-1-pve)
pve-manager: 7.3-6 (running version: 7.3-6/723bb6ec)
pve-kernel-helper: 7.3-6
pve-kernel-5.15: 7.3-2
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-6
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.5.5
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20221111-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-3
pve-qemu-kvm: 7.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1

These are the differences:
Code:
3,5c3,5
< pve-kernel-helper: 7.3-7
< pve-kernel-5.15: 7.3-3
< pve-kernel-5.15.102-1-pve: 5.15.102-1
---
> pve-kernel-helper: 7.3-6
> pve-kernel-5.15: 7.3-2
> pve-kernel-5.13: 7.1-9
8,9c8,10
< pve-kernel-5.15.30-2-pve: 5.15.30-3
< ceph-fuse: 15.2.16-pve1
---
> pve-kernel-5.13.19-6-pve: 5.13.19-15
> pve-kernel-5.13.19-2-pve: 5.13.19-4
> ceph-fuse: 15.2.15-pve1
41c42
< pve-firmware: 3.6-4
---
> pve-firmware: 3.6-3
44c45
< pve-qemu-kvm: 7.2.0-7
---
> pve-qemu-kvm: 7.2.0-5
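(In case it's useful: a diff like the above can be generated by dumping pveversion -v on each node and comparing the files; the hostnames in the filenames are placeholders.)
Code:
# on each node
pveversion -v > /tmp/pveversion-$(hostname).txt
# copy both files onto one machine, then
diff /tmp/pveversion-ryzen.txt /tmp/pveversion-i7.txt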

Config for the non-working VM:
Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0
cores: 1
cpu: host,flags=+aes
cpuunits: 10
efidisk0: nvme-thin:vm-4800-disk-2,efitype=4m,pre-enrolled-keys=1,size=4M
machine: q35
memory: 1024
meta: creation-qemu=7.1.0,ctime=1674682172
name: nvidia-dls
numa: 1
onboot: 1
ostype: l26
scsi0: nvme-thin:vm-4800-disk-1,cache=none,discard=on,iothread=1,size=8G
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=<redacted>
sockets: 1
startup: order=16,up=10
tablet: 0
vmgenid: <redacted>

Screenshot from VM during boot (Proxmox has plenty of memory available):
[attached: Screenshot from 2023-03-15 15-06-47.jpg]


Does anyone have any ideas? Although the VMs are now running on the second machine, I really need to get them back onto my main machine. As mentioned, the main machine is running several other VMs and LXCs without problems. The journal doesn't give any info, and neither does dmesg.
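For completeness, this is roughly how I looked for clues after a failed start (nothing relevant shows up):
Code:
qm start 4800
journalctl -b --no-pager | tail -n 50
dmesg | tail -n 50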

Thanks in advance.
 
Hi,
what's strange here is that the other node also has/had the upgraded pve-edk2-firmware package, so that can't be the only reason. Could you check whether it works with pve-edk2-firmware: 3.20221111-1 and pve-qemu-kvm: 7.2.0-5 on the Ryzen node? If it doesn't, maybe the CPU vendor/model plays a role.
 
Hi Fiona,

I tried pve-edk2-firmware: 3.20221111-1 with pve-qemu-kvm: 7.2.0-5 on the Ryzen node, but the VM fails as before. If I go back to pve-edk2-firmware: 3.20220526-1 and pve-qemu-kvm: 7.2.0-7, it works again.
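For anyone wanting to try the same, downgrading is just an apt install with an explicit version, assuming the older package is still in the repo index (otherwise fetch the .deb from the repository pool and install it with dpkg -i). The hold is optional, to keep it from being upgraded again:
Code:
apt install pve-edk2-firmware=3.20220526-1
apt-mark hold pve-edk2-firmware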

So I think the problem lies with pve-edk2-firmware: 3.20221111-1.

Thanks for the help,

Jonathan
 
Yes, but likely only in combination with an AMD (Ryzen) CPU. The people in the other thread also have that (as do I).
Hopefully someone will be able to diagnose what the problem is with Ryzen and this firmware. Until then I will stick with the old firmware.

Thanks for all the help!

Kind regards,

Jonathan
 
Here is an overview of what stopped working and work-arounds for my current VMs:

VM (Windows 10 Home) with PCIe passthrough and memory hotplug (12G) and EPYC-Rome: still works fine.
VM (Proxmox) without PCIe passthrough and with memory hotplug (1.5-3G) and EPYC-Rome: still works fine.
VM (Ubuntu Server) with PCIe passthrough and without memory hotplug and only minimal memory (576M) and any CPU: out of memory, but fixed by giving it a little more memory.
VM (Linux Mint) with PCIe passthrough and memory hotplug (12G) and host (5950X) or EPYC-Rome: out of memory.
VM (Linux Mint) with PCIe passthrough and memory hotplug (12G) and kvm64 or EPYC-IBPB: works fine as a work-around.

Reverting the pve-edk2-firmware update makes everything work as before. The newer version appears to need a little more minimum memory than before (for VMs with Ubuntu-like kernels). Using EPYC-IBPB instead of host/EPYC-Rome appears to be a work-around for VMs with PCIe passthrough and memory hotplug (again for VMs with Ubuntu-like kernels); see the qm one-liner below.
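The CPU-model work-around is a one-liner per VM with qm (VMID 100 is a placeholder):
Code:
# switch the virtual CPU from host/EPYC-Rome to EPYC-IBPB
qm set 100 --cpu EPYC-IBPB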

EDIT: With the latest update of pve-edk2-firmware to 3.20230228-1 today, EPYC-Rome works again, but host still gives out of memory.
 
On my Ryzen 3900X:

All my LXCs (5 of them) work fine
2 x VM (Windows 10 Pro, 12 GB and 8 GB RAM respectively) with PCIe passthrough: still work fine
VM (Nethserver, based on CentOS, 3 GB RAM) with PCIe passthrough: can't find any drives
VM (Ubuntu, 8 GB RAM) with PCIe passthrough: works fine
VM (Ubuntu, 1 GB RAM) with PCIe passthrough: out of memory error
VM (Home Assistant, 4 GB RAM): works fine
VM (pfSense, 4 GB RAM): works fine

As above, reverting the pve-edk2-firmware update makes everything work as before. On reflection, the problem does seem to be memory-related.
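If it behaves like the minimal-memory Ubuntu Server case in the previous post, giving the 1 GB VM a bit more memory should also work around it; with qm, something like this (VMID 101 is a placeholder):
Code:
qm set 101 --memory 2048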
 
