Problem Description
I'm experiencing systematic VM crashes on multiple Proxmox hosts when VMs come under memory pressure. The crashes occur across different hosts and workloads and seem to follow the same pattern.
Environment
- Proxmox VE version: 8.4.14 and 9.1.1
- Affected VMs: multiple VMs across different hosts (GitLab CI and MariaDB, running Debian 12 and 13)
- Storage backend: NVMe RAID2 (2 TB and 512 GB)
- Disk controller: VirtIO SCSI
- Host specifications: Intel Xeon E-2388G, 64 GB RAM
Symptoms
Crash pattern:
- VM experiences high memory usage
- System starts heavy swapping
- I/O operations slow down dramatically
- SCSI timeouts appear in logs (not always)
- System becomes unresponsive and crashes; a VM restart is required
Kernel logs show (full logs omitted for brevity):
Code:
sd 1:0:0:0: [sda] tag#XXX ABORT operation started
sd 1:0:0:0: ABORT operation timed-out
sd 1:0:0:0: BUS RESET operation started
sym0: SCSI BUS reset detected
task:kswapd0 blocked for more than 120 seconds
task:khugepaged blocked for more than 120 seconds
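For context, the 120 seconds in the "blocked for more than" messages is the kernel's hung-task watchdog interval, which can be inspected (and tuned as root) inside the guest; the file only exists when the kernel was built with hung-task detection:

```shell
# The 120 s in "blocked for more than 120 seconds" is the hung-task
# watchdog interval; readable via /proc, tunable as root via sysctl.
if [ -e /proc/sys/kernel/hung_task_timeout_secs ]; then
    cat /proc/sys/kernel/hung_task_timeout_secs
fi
```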
Timeline:
- MariaDB VM : crashes every ~12 hours during heavy operations
- GitLab CI VM : crashes during concurrent job execution (a few times a week)
- Other VMs on same hosts: no issues so far
Current Workaround
On the GitLab CI VM, allocating more RAM resolved the issue. My MariaDB VM already had plenty of free RAM; setting vm.swappiness=10 on it has completely stopped the crashes, but this feels like treating symptoms rather than the root cause.
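For anyone hitting the same pattern, this is how the swappiness change can be checked, applied, and persisted inside the guest (the sysctl.d filename is arbitrary):

```shell
# Check the current value (Debian's default is 60)
cat /proc/sys/vm/swappiness

# Apply at runtime (as root):
#   sysctl -w vm.swappiness=10
#
# Persist across reboots (as root):
#   echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
```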
Questions
- Is this a known interaction between swap pressure and VirtIO SCSI?
- Why would swap activity cause SCSI controller timeouts?
- Is there a timeout configuration that could be adjusted?
- Storage backend configuration:
  - Are there specific storage settings that could prevent this?
  - Should I consider different disk controller types (SATA/IDE) for swap-heavy workloads?
- Host-level optimizations:
  - Any Proxmox-specific tuning to handle VM swap better?
  - Should host swappiness also be reduced?
- Long-term solution:
  - Is low swappiness the recommended approach, or are there better alternatives?
  - Should I simply increase VM RAM to avoid swap entirely?
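Regarding the timeout question: the per-command SCSI timeout is exposed per device in sysfs and can be raised inside the guest. A sketch, assuming the device that logged the ABORTs is sda (adjust as needed; the udev rule filename and the 180 s value are my own choices, not a recommendation):

```shell
DEV=sda  # the device that logged the ABORT/BUS RESET messages

# Current per-command timeout in seconds (kernel default: 30)
if [ -e "/sys/block/$DEV/device/timeout" ]; then
    cat "/sys/block/$DEV/device/timeout"
fi

# Raise it at runtime (as root); lost on reboot:
#   echo 180 > /sys/block/sda/device/timeout
#
# Persist via a udev rule, e.g. /etc/udev/rules.d/99-scsi-timeout.rules:
#   ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"
```

A longer timeout only papers over the stall, of course; the I/O is still slow, the guest just aborts less aggressively.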
Additional Context
- The issue is reproducible across different Proxmox hosts
- Only affects VMs under memory pressure, not regular operation
- ~15 production VMs - I would like to deploy preventive measures fleet-wide
What I've tried
- Reducing swappiness (works, but feels incomplete)
- Monitoring to identify memory-hungry processes
- Considering RAM increases for affected VMs
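The monitoring step can be as simple as a procps one-liner inside the guest to see which processes drive the VM into swap:

```shell
# Top 10 processes by resident memory (RSS, in KiB), assuming standard procps ps
ps -eo pid,rss,comm --sort=-rss | head -n 10
```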