Hello fellow Proxmoxoholics,
I’ve run into an issue while experimenting with an AI agent and wanted to see if anyone else has encountered something similar and, if so, what mitigation steps you took.
I am using a relatively new gaming PC as a Proxmox host with the following specifications:
Host hardware:
- CPU: Intel Core i7-12700KF (12th Gen, 20 threads)
- Memory: 16 GB DDR4
- Storage:
- 2 × 2 TB NVMe
- 2 × 2 TB SSD
- Networking: 2 × 2 Gb NICs
- GPUs:
- NVIDIA RTX 3060
- NVIDIA RTX 3060 Ti
VM configuration:
- OS: Ubuntu
- RAM: 8 GB (remaining 8 GB reserved for Proxmox host)
- GPU passthrough: RTX 3060
- No other VMs or containers are running on this host
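For completeness, the relevant parts of the VM config look roughly like this (excerpt from /etc/pve/qemu-server/100.conf; the VMID and the PCI address are placeholders, not my actual values):

```
# /etc/pve/qemu-server/100.conf (excerpt) -- VMID and PCI address are placeholders
memory: 8192                      # 8 GB guest RAM, leaving 8 GB for the host
hostpci0: 0000:01:00,pcie=1       # RTX 3060 passed through
ostype: l26                       # Linux 2.6+ kernel (Ubuntu)
```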
When the AI agent is running, overall system memory usage on the host slowly but steadily increases from ~45% to ~90%. Once this threshold is reached, the Proxmox host experiences a kernel panic and shuts down entirely.
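To make the growth attributable after a crash, I've been sampling host memory with a small script along these lines (nothing fancy; the interval and sample count are arbitrary, and I redirect the output to a file on persistent storage so it survives the panic):

```shell
#!/bin/sh
# memlog: periodically print overall memory headroom plus the top 5
# processes by resident memory, so post-crash logs show where the RAM went.
memlog() {
    interval="${1:-60}"   # seconds between samples
    count="${2:-0}"       # number of samples; 0 = run until interrupted
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        date '+%F %T'
        grep -E '^(MemAvailable|SwapFree)' /proc/meminfo
        ps -eo pid,rss,comm --sort=-rss | head -n 6   # header + top 5 by RSS
        echo '---'
        i=$((i + 1))
        if [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# One quick sample to check the output format:
memlog 1 1
```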
After a reboot, the host remains stable as long as the AI agent is not running. I have:
- Checked for application-level memory leaks
- Verified the ZFS ARC limits to confirm the ARC is not consuming excessive RAM
- Confirmed no other workloads are contributing to memory pressure
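For reference, this is how I capped the ARC on the host (the 4 GiB figure is just what I picked for a 16 GB machine, not a recommendation):

```
# Cap the ZFS ARC at 4 GiB (value is in bytes: 4 * 1024^3).
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf

# Apply immediately without a reboot:
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

# Persist across reboots (module options get baked into the initramfs):
update-initramfs -u

# Sanity check: current ARC size vs. the configured maximum, in bytes:
awk '/^(size|c_max) / {print $1, $3}' /proc/spl/kstat/zfs/arcstats
```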
Question:
For those running AI workloads on Proxmox—particularly GPU-accelerated LLMs—have you experienced similar memory growth leading to host instability? If so, what course of action did you take (e.g., memory tuning, cgroup limits, swapping strategy, containerization, increasing host RAM, or architectural changes)?
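To make the question more concrete: the cgroup-limit option I'm weighing would be something like the following, run inside the VM (run-agent.sh is a placeholder for however the agent actually gets launched, and the 6G ceiling is arbitrary):

```
# Hypothetical sketch: start the agent in a transient systemd scope with a
# hard memory ceiling, so the kernel OOM-kills the agent instead of letting
# it drag the guest (and eventually the host) down with it.
systemd-run --scope \
    -p MemoryMax=6G \
    -p MemorySwapMax=0 \
    ./run-agent.sh   # placeholder for the real launch command
```

Does capping at this level actually help, or does the pressure just move somewhere else (e.g. into the hypervisor's page cache)?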
Any insights or recommendations would be greatly appreciated.