Anyone else had this issue?
So: 32 GB of RAM in this host with 2 VMs running, one with 1 GB of RAM, the other with 16 GB.
I started noticing symptoms such as SSH timing out on the VM client and all network traffic randomly stalling. I ruled out a network problem when the console connection to the VM also stalled.
The VM itself didn't report abnormal I/O wait.
Then I noticed on the Proxmox host graphs that I/O wait kept going up every time CPU load went down (CPU load drops when the VM is stalling/idle).
Finally, about the fifth time I looked at the screen, I noticed fairly high swap usage, and it all clicked at once in my head.
I flushed the swap out and reduced swappiness to 1, but swap usage kept going up along with the stalls. I then reduced it to 0, which is supposed to only use swap in an actual OOM situation, and it still went up, albeit with a delay this time (there was 14 GB of free RAM while it was doing this). So I finally did swapoff -a, and problem solved.
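For reference, the steps above boil down to something like the following (needs root, so this sketch guards on that; the sysctl.d filename is my own choice, any `.conf` there works):

```shell
# These commands need root; guard so the sketch is a no-op otherwise.
if [ "$(id -u)" -eq 0 ]; then
    # "Flush" swap: move swapped pages back into RAM (needs enough free
    # RAM to hold them), then re-enable the swap devices.
    swapoff -a && swapon -a

    # Tell the kernel to avoid swapping except under real memory pressure.
    sysctl vm.swappiness=1

    # Persist the setting across reboots (filename is arbitrary).
    echo 'vm.swappiness = 1' > /etc/sysctl.d/99-swappiness.conf
fi
```

Note that on recent kernels even swappiness 0 does not fully guarantee "swap only on OOM", which matches what I saw.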
I know the issue basically comes down to the OS kernel developers deciding that it is preferable to maintain system cache size over staying out of the pagefile. But it seems in this case the algorithm got it horribly wrong. Just thought I would post this as info for others, and to ask if anyone has seen this before and managed to solve it without the sledgehammer disable-swap approach.
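For anyone hitting the same thing and wanting to confirm which processes are actually sitting in swap before reaching for swapoff, one way is to read the VmSwap field from /proc (a rough Linux-only sketch, nothing Proxmox-specific):

```shell
# List the top swap consumers by reading VmSwap per process (Linux only).
# 2>/dev/null skips processes that exit while we are reading.
for status in /proc/[0-9]*/status; do
    awk '/^Name:/   {name=$2}
         /^VmSwap:/ {printf "%8d kB  %s\n", $2, name}' "$status" 2>/dev/null
done | sort -rn | head -n 10
```

In my case I would expect the kvm process for the 16 GB guest to dominate that list while the stalls are happening.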