Hi all,
I’ve been running a Proxmox server set up for heavy simulations. The idea is simple: I only run either the Windows VM or the Linux VM (never both at once; a hookscript enforces this), and I want whichever one is running to use as much CPU and RAM as possible. There’s also a TrueNAS VM running permanently to provide shared storage to both.
The issue is with the Windows VM. Whenever I start a simulation, at some point during execution the entire server becomes unreachable — no web UI, no SSH, I can’t even ping it. I’ve had to go to the server room to hard reset it multiple times.
System Overview
- Proxmox VE: 8.x (kernel 6.8.12-9)
- CPU: AMD Ryzen Threadripper 7980X (64 cores / 128 threads)
- RAM: 512 GB
- Boot disk: 1 TB Samsung 990 PRO (ZFS)
- Shared disk: 500 GB partition on the same SSD, exported via NFS
- Swap: 16 GB file-based
VM Setup
Windows VM
- 400 GB RAM (ballooning disabled)
- 56 cores, 1 socket
- CPU: host
- GPU passthrough enabled
- Main disk on local-zfs
Linux VM
- Not running at the same time as Windows
- Also intended for heavy simulations with similar resource assignment
TrueNAS VM
- 16 GB RAM
- Disk stored on rpool to avoid ZFS-on-ZFS issues
- Always running for NFS shared storage
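For context, a mutual-exclusion hookscript of the kind mentioned above could look roughly like the sketch below. It is an illustration only, not the exact script in use here: the VM IDs 101 and 102 and the `EXCLUSIVE_PEER` table are placeholders. It relies on Proxmox calling a hookscript with the VM ID and the phase as arguments, and on a non-zero exit during `pre-start` aborting the start.

```python
#!/usr/bin/env python3
import subprocess
import sys

# Placeholder VM IDs (assumption): each compute VM maps to the VM it must not overlap with.
EXCLUSIVE_PEER = {101: 102, 102: 101}


def is_running(status_output: str) -> bool:
    """Interpret the text printed by `qm status <vmid>`, e.g. 'status: running'."""
    return status_output.strip() == "status: running"


def query_status(vmid: int) -> str:
    """Ask Proxmox for the current state of a VM via the `qm` CLI."""
    result = subprocess.run(
        ["qm", "status", str(vmid)], capture_output=True, text=True, check=False
    )
    return result.stdout


def main(argv, status_fn=query_status) -> int:
    # Proxmox passes two arguments to a hookscript: the VM ID and the phase.
    if len(argv) < 3:
        return 0
    vmid, phase = int(argv[1]), argv[2]
    peer = EXCLUSIVE_PEER.get(vmid)
    if phase == "pre-start" and peer is not None and is_running(status_fn(peer)):
        print(f"VM {peer} is running; refusing to start VM {vmid}", file=sys.stderr)
        return 1  # non-zero exit in pre-start aborts the start
    return 0
```

As a real hookscript, the file would end with `sys.exit(main(sys.argv))` and be attached to both compute VMs. The `status_fn` parameter only exists so the logic can be exercised without a Proxmox host.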
What’s Happening
Everything works fine at first. Then as soon as the Windows VM starts doing serious work (a simulation), the whole host becomes unreachable. It’s not a VM crash — it’s the entire Proxmox node.
I've already:
- Disabled ballooning
- Checked for OOM kills or PCI errors in dmesg and journalctl (nothing obvious)
- Added swap
- Verified ZFS is not using too much ARC memory (I checked ARC stats)
Still, nothing helps. The only pattern is that it happens when Windows starts a heavy simulation.
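For anyone who wants to repeat the ARC check, here is a minimal sketch of one way to read the numbers. It assumes the usual OpenZFS-on-Linux kstat file at `/proc/spl/kstat/zfs/arcstats` (two header lines, then `name type data` rows); the helper names are made up for illustration.

```python
GIB = 1024 ** 3


def parse_arcstats(text: str) -> dict:
    """Parse the contents of arcstats into a {name: integer value} dict."""
    stats = {}
    # The first two lines are headers; data rows have three columns.
    for line in text.splitlines()[2:]:
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_summary_gib(stats: dict) -> dict:
    """Return current ARC size and its configured maximum, both in GiB."""
    return {"size": stats["size"] / GIB, "c_max": stats["c_max"] / GIB}


# On the Proxmox host this would be used as:
#   with open("/proc/spl/kstat/zfs/arcstats") as f:
#       print(arc_summary_gib(parse_arcstats(f.read())))
```

If `size` sits close to `c_max` and `c_max` is large relative to the RAM left over after the VMs, the ARC is competing with the guests for the remaining memory. If it is small, the ARC is probably not the culprit.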
What I’d Appreciate Help With
- Is 56 cores + 400 GB too much? Should I reserve more for the host?
- Is there a better way to configure the Windows VM for this use case?
- Could GPU passthrough be causing instability even if it works at first?
- Are there known issues with high resource assignments in Proxmox 8.x?
- Would switching from local-zfs to file-based storage help?
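To make the first question concrete, here is the rough arithmetic as a small sketch using the numbers from this post. The `arc_max_gb` parameter is a hypothetical knob for whatever ARC limit is configured on the host. As far as I understand, guest memory is pinned when a PCI device is passed through, so the Windows VM's full 400 GB should be treated as allocated from the moment it starts.

```python
TOTAL_RAM_GB = 512
VM_RAM_GB = {"windows": 400, "truenas": 16}  # VMs that run at the same time
TOTAL_THREADS = 128
WINDOWS_VCPUS = 56


def host_headroom_gb(total_gb, vm_ram_gb, arc_max_gb=0):
    """RAM left for the Proxmox host after the VMs and an optional ARC cap."""
    return total_gb - sum(vm_ram_gb.values()) - arc_max_gb


headroom = host_headroom_gb(TOTAL_RAM_GB, VM_RAM_GB)
print(f"RAM left for host + ARC: {headroom} GB ({headroom / TOTAL_RAM_GB:.1%} of total)")
print(f"Host threads not assigned to the Windows VM: {TOTAL_THREADS - WINDOWS_VCPUS}")
```

With these figures, 96 GB is left for the host, the ZFS ARC, and QEMU overhead combined, so on paper the assignment is not obviously oversubscribed.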