Hi there,
I am setting up Proxmox on a new beefy system (dual EPYC 7742, 512 GB of memory, and 4 NVMe drives).
Upon setting up my first VM, I went all in and assigned 128 cores across 2 sockets with NUMA enabled, plus a good chunk of memory. I was able to boot it and install my OS as normal, and starting and stopping the VM was instantaneous at that stage.
However, as soon as I assign any device for PCIe passthrough, such as my NVMe drives, the following happens, in order:
- host memory usage ramps up to the guest's full memory allocation
- the guest hangs for about 15-20 minutes, during which I cannot open its VNC console or even SSH in (the Proxmox task log shows "Error: Failed to run vnc proxy")
- once it finally boots, the passed-through PCIe devices are listed by `lspci`, but none of the drives appear in `lsblk` (see the quick check right after this list)
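For reference, this is roughly the check I run inside the guest once it is finally up (commands only, output omitted):
Code:
# run inside the guest after it eventually boots
lspci | grep -i nvme   # the passed-through NVMe controllers are listed here
lsblk                  # ...but no corresponding nvme block devices show up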
I tried reducing the amount of memory and the cores per socket, going back to a single socket, and disabling NUMA, but got the same behavior and results. Oddly enough, when I lower the total core count to around 32 or fewer (roughly the change sketched after this list), the issues partly resolve:
- host memory usage still ramps up when the VM starts
- no hanging, the VM boots immediately after that
- the NVMe drives all show up in both `lspci` and `lsblk`
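In case it matters, this is more or less the change that makes the VM boot quickly again (illustrative commands; the VMID is 112, as in the config below):
Code:
# drop back to a single socket with 32 cores and NUMA disabled, then start the VM
qm set 112 --sockets 1 --cores 32 --numa 0
qm start 112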
I couldn't find much information online about VMs this large, so I'm not sure whether this is common behavior for one reason or another. I should also mention that none of these issues occur when I attach the same NVMe drives to the VM as regular hard disks instead of passing them through (roughly as in the snippet below); in that case the VM boots just as fast as it did before I added any PCIe passthrough devices. I would greatly appreciate any advice on how to resolve this.
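For completeness, this is approximately how I attach one of the drives as a plain disk instead of passing the whole controller through (the by-id path is a placeholder, not my actual device name):
Code:
# attach the raw NVMe block device to the VM as a regular SCSI disk (placeholder path)
qm set 112 --scsi1 /dev/disk/by-id/nvme-EXAMPLE_SERIAL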
Below are my VM config and pveversion output in case they help; please let me know what other information I can provide to help figure out what is happening.
Code:
# cat /etc/pve/qemu-server/112.conf
agent: 1
balloon: 0
boot: order=scsi0;net0
cores: 128
cpu: host
hostpci0: 0000:81:00.0,pcie=1
hostpci1: 0000:82:00.0,pcie=1
hostpci2: 0000:83:00.0,pcie=1
hostpci3: 0000:84:00.0,pcie=1
machine: q35
memory: 458752
meta: creation-qemu=7.0.0,ctime=1663244103
name: fedora
net0: virtio=62:8E:05:39:8E:8A,bridge=vmbr0,firewall=1
numa: 1
ostype: l26
scsi0: local-zfs:vm-112-disk-0,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=afb7e0d3-b63f-433c-bf77-a6c4870bcfb8
sockets: 2
vmgenid: 265cd566-241a-45bc-9cb3-2a4133db5879
Code:
# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-10
pve-kernel-5.15.53-1-pve: 5.15.53-1
ceph-fuse: 15.2.14-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1