Proxmox host crashes during GPU-accelerated AI workload (suspected memory exhaustion)

Nov 11, 2023
Hello fellow Proxmoxoholics,

I’ve run into an issue while experimenting with an AI agent and wanted to see whether anyone else has encountered something similar and, if so, what mitigation steps you took.
I am using a relatively new gaming PC as a Proxmox host with the following specifications:

Host hardware:
  • CPU: Intel Core i7-12700KF (12th Gen, 20 threads)
  • Memory: 16 GB DDR4
  • Storage:
    • 2 × 2 TB NVMe
    • 2 × 2 TB SSD
  • Networking: 2 × 2 Gb NICs
  • GPUs:
    • NVIDIA RTX 3060
    • NVIDIA RTX 3060 Ti
For this project, I am developing a Mistral-based AI agent inside an Ubuntu VM using Python. The RTX 3060 (the non-Ti card) is passed through directly to the VM.

VM configuration:
  • OS: Ubuntu
  • RAM: 8 GB (remaining 8 GB reserved for Proxmox host)
  • GPU passthrough: RTX 3060 (see the in-guest check sketched after this list)
  • No other VMs or containers are running on this host
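For reference, below is the kind of quick check I can run inside the guest to confirm that the VM itself stays within its allotted RAM and that VRAM on the passed-through 3060 isn't climbing. This is only a rough sketch: the interval and log path are arbitrary, and it assumes nvidia-smi is installed in the guest and only one GPU is visible there.

    #!/usr/bin/env python3
    """Log guest RAM and GPU memory every few seconds (run inside the VM)."""
    import subprocess
    import time

    LOG_PATH = "/var/log/vm-mem-watch.log"  # arbitrary location
    INTERVAL_S = 10

    def mem_available_kib():
        # MemAvailable from the guest's /proc/meminfo, in KiB
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1])
        return -1

    def gpu_mem_mib():
        # Used/total VRAM as reported by the driver inside the guest
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        used, total = (int(x) for x in out.strip().split(","))
        return used, total

    with open(LOG_PATH, "a") as log:
        while True:
            used, total = gpu_mem_mib()
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
                      f"MemAvailable={mem_available_kib()} kB "
                      f"VRAM={used}/{total} MiB\n")
            log.flush()
            time.sleep(INTERVAL_S)

If the guest-side numbers stay flat while the host keeps climbing, that would point at the host side rather than the agent itself.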
Issue description:
When the AI agent is running, overall system memory usage on the host slowly but steadily increases from ~45% to ~90%. Once this threshold is reached, the Proxmox host experiences a kernel panic and shuts down entirely.
After a reboot, the host remains stable as long as the AI agent is not running. I have:
  • Checked for application-level memory leaks (see the tracemalloc sketch after this list)
  • Verified the ZFS ARC limit to make sure the ARC is not consuming excessive RAM
  • Confirmed no other workloads are contributing to memory pressure
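For concreteness, this is the kind of tracemalloc check I mean by "application-level" (a minimal sketch, not the agent code; the dummy loop stands in for the agent's real work):

    import tracemalloc

    def run_agent_iteration():
        # Placeholder for one cycle of the agent's real work
        return [0] * 1000

    tracemalloc.start(25)          # keep up to 25 frames per allocation
    baseline = tracemalloc.take_snapshot()

    results = []
    for _ in range(100):           # a batch of agent iterations
        results.append(run_agent_iteration())

    snapshot = tracemalloc.take_snapshot()
    # Show the 10 call sites whose Python-heap usage grew the most
    for stat in snapshot.compare_to(baseline, "lineno")[:10]:
        print(stat)

The obvious caveat is that tracemalloc only sees allocations made through the Python allocator, so CUDA buffers and native-library memory used by the inference backend would be invisible here. That is why I am also watching process RSS and VRAM separately.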
Given this behavior, I suspect memory exhaustion or interaction issues between GPU passthrough, the VM, and host memory management.
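To narrow that down, I am thinking of leaving a small logger like the sketch below running on the host, so the last samples before a panic survive the reboot. The VMID (100) and log path are placeholders, and the ZFS ARC read assumes /proc/spl/kstat/zfs/arcstats is present (ZFS on Linux); drop it if the relevant pools aren't ZFS.

    #!/usr/bin/env python3
    """Host-side logger: MemAvailable, ZFS ARC size, and the VM's QEMU RSS."""
    import os
    import time

    VMID = "100"                                  # placeholder: adjust to the real VM ID
    PIDFILE = f"/var/run/qemu-server/{VMID}.pid"  # written by Proxmox for each running VM
    LOG_PATH = "/root/host-mem-watch.log"         # arbitrary location
    INTERVAL_S = 15

    def meminfo_kib(field):
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])
        return -1

    def arc_size_bytes():
        # "size" row of arcstats; returns -1 if ZFS is not loaded
        try:
            with open("/proc/spl/kstat/zfs/arcstats") as f:
                for line in f:
                    parts = line.split()
                    if len(parts) == 3 and parts[0] == "size":
                        return int(parts[2])
        except OSError:
            pass
        return -1

    def qemu_rss_kib():
        # RSS of the VM's QEMU process, found via its Proxmox pidfile
        try:
            with open(PIDFILE) as f:
                pid = f.read().strip()
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        return int(line.split()[1])
        except OSError:
            pass
        return -1

    with open(LOG_PATH, "a") as log:
        while True:
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
                      f"MemAvailable={meminfo_kib('MemAvailable')} kB "
                      f"ARC={arc_size_bytes()} B "
                      f"QemuRSS={qemu_rss_kib()} kB\n")
            log.flush()
            os.fsync(log.fileno())   # make sure samples hit disk before any panic
            time.sleep(INTERVAL_S)

My reading would be: if MemAvailable collapses while the QEMU RSS stays flat at roughly the VM's 8 GB (with PCI passthrough the guest's RAM is pinned, so ballooning cannot reclaim it) and the ARC respects its cap, then something else on the host is leaking; if the QEMU RSS itself keeps growing past the configured 8 GB, the problem sits in the VM/QEMU layer.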

Question:
For those running AI workloads on Proxmox—particularly GPU-accelerated LLMs—have you experienced similar memory growth leading to host instability? If so, what course of action did you take (e.g., memory tuning, cgroup limits, swapping strategy, containerization, increasing host RAM, or architectural changes)?


Any insights or recommendations would be greatly appreciated.