Random freezes for VM

ccorbin

New Member
May 4, 2023
Hi,

Issue Description:
For several months I have had a problem with a VM in Proxmox: a Windows Server 2019 machine running SQL Server. Initially, I found that when pinging the VM it would randomly miss a ping, or a reply would be delayed by several seconds (10-20 s, sometimes longer), depending on the workload (~150-250 users). Replacing the server's NIC and doubling its RAM did not help. I also had a Linux bond with LACP configured for redundancy, which I removed to simplify troubleshooting.

When none of these changes resolved the issue, I found that swap usage was at 100% (7.9/8 GB used). I resolved this by lowering swappiness and taking some RAM away from the VM to leave more for the host, but I was still seeing the same delays.

I have since been monitoring the syslog and the performance of the machine to see which processes may be causing the random stalls. I found that the ps command is at 100% system and CPU usage at the moment the system completely freezes (every process stops and disk reads/writes drop to 0). From what I have monitored, this does not seem to correlate with the VM's activity, but it could be related.
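
For anyone wanting to track the swap figure over time, a check along these lines could be used. This is only a minimal sketch: it assumes the swap in question is on the Linux (PVE) host and that `/proc/meminfo` exposes `SwapTotal` and `SwapFree`. The function names are made up for illustration.

```python
def parse_meminfo(text):
    """Return a dict of /proc/meminfo fields as integers (kB for sized fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_percent(fields):
    """Percentage of swap in use, or 0.0 when no swap is configured."""
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - free) / total
```

On the host, `swap_used_percent(parse_meminfo(open("/proc/meminfo").read()))` could be logged once a minute and compared against the times of the freezes.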

Troubleshooting steps taken:
I used "pidstat -u 1 200" and found that the "ps" command was related to the freeze (please see the attached image for the output of this command). The process starts every 1-2 minutes and then completely dies.
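
To pick the spikes out of a long pidstat capture instead of reading it by eye, the saved output could be filtered with something like the sketch below. It assumes the usual sysstat layout with a header row containing `UID`, `%system` and `Command`, and numbers printed with a decimal point. The function name and the 90% threshold are arbitrary choices for illustration.

```python
def high_system_samples(pidstat_text, threshold=90.0):
    """Return (timestamp, command, %system) for samples at or above threshold.

    Column positions are read from each header row, so the parser copes
    with timestamps that do or do not carry an AM/PM token.
    """
    hits = []
    uid_idx = sys_idx = cmd_idx = None
    for line in pidstat_text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("Average"):
            continue  # skip blank lines and the summary section
        if "UID" in tokens and "%system" in tokens and "Command" in tokens:
            # Header row: remember where the columns of interest are.
            uid_idx = tokens.index("UID")
            sys_idx = tokens.index("%system")
            cmd_idx = tokens.index("Command")
            continue
        if sys_idx is None or len(tokens) <= cmd_idx:
            continue  # banner line or truncated row
        try:
            system_pct = float(tokens[sys_idx])
        except ValueError:
            continue
        if system_pct >= threshold:
            timestamp = " ".join(tokens[:uid_idx])
            command = " ".join(tokens[cmd_idx:])
            hits.append((timestamp, command, system_pct))
    return hits
```

Feeding it the text of a capture (for example the output of the pidstat command above redirected to a file) returns only the samples where a process spent nearly all of its time in the kernel, which makes it easier to line the timestamps up with the missed pings.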

PIDStat.png
ping.png

Normally I would not be so concerned with this one missed ping, however, everyone who uses the application that this SQL server is servicing is reporting issues with their session hanging/freezing. I have witnessed this issue and have correlated it with this random freeze.
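
To put numbers on the missed and delayed pings, a saved ping log could be summarised roughly as follows. This is a sketch under assumptions: it expects Linux iputils-style reply lines containing `icmp_seq=` and `time=... ms` (so the ping would be run from the PVE host or another Linux box, not from Windows), and it ignores sequence-number wrap-around. The function name and the 1000 ms cut-off are illustrative.

```python
import re

# Matches e.g. "... icmp_seq=7 ttl=128 time=0.41 ms"
REPLY = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+)\s*ms")


def analyse_ping(text, slow_ms=1000.0):
    """Return (missing_seqs, slow_replies) from iputils-style ping output."""
    seen = {}
    for line in text.splitlines():
        match = REPLY.search(line)
        if match:
            seen[int(match.group(1))] = float(match.group(2))
    if not seen:
        return [], []
    first, last = min(seen), max(seen)
    # Sequence numbers with no reply between the first and last one seen.
    missing = [seq for seq in range(first, last + 1) if seq not in seen]
    # Replies that arrived, but only after slow_ms or more.
    slow = sorted((seq, ms) for seq, ms in seen.items() if ms >= slow_ms)
    return missing, slow
```

The two lists it returns (lost sequence numbers, and replies slower than the cut-off) can then be compared against the times users report their sessions hanging.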
I have also:
  • Set up auditd to track invocations of ps.
  • Searched for scripts and cron jobs that might be invoking these commands (none found).
  • Ensured no systemd timers are running these commands excessively (none found).
  • Used tools like iotop and atop to monitor disk I/O, as high I/O from the SQL server VM could be related.
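
As a complement to the checks above, the parent of each running ps could be looked up directly from `/proc`. The sketch below is only an illustration: it assumes a Linux host with the standard `/proc/<pid>/stat` layout, the helper names are invented, and because it polls it can miss a very short-lived ps, so an audit log remains the more reliable source.

```python
import os


def read_stat(proc_root, pid):
    """Return (comm, ppid) from <proc_root>/<pid>/stat, or None if unreadable."""
    try:
        with open(os.path.join(proc_root, str(pid), "stat")) as fh:
            data = fh.read()
    except OSError:
        return None  # process exited between listing and reading
    # comm is wrapped in parentheses and may itself contain spaces or ')'.
    start, end = data.find("("), data.rfind(")")
    if start == -1 or end == -1:
        return None
    rest = data[end + 1:].split()
    if len(rest) < 2:
        return None
    return data[start + 1:end], int(rest[1])  # rest[0] is state, rest[1] is ppid


def find_parents(proc_root="/proc", target="ps"):
    """List (pid, parent_pid, parent_comm) for processes named `target`."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        info = read_stat(proc_root, entry)
        if info is None or info[0] != target:
            continue
        parent = read_stat(proc_root, info[1])
        found.append((int(entry), info[1], parent[0] if parent else "?"))
    return sorted(found)
```

Calling `find_parents()` in a loop with a short sleep and logging any non-empty result would show which process is spawning ps around the time of a freeze.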

Hardware:
The server chassis is a Dell R730XD.
Storage controller: PERC H730 Mini
  • 6 Samsung SSDs (870 EVO) in RAID 10 (on the H730 Mini) are used for data *all new drives*
  • 1 Samsung SSD (870 EVO) passed through as the Proxmox OS drive *new drive as well*
I have also run the Dell diagnostic tools on the server to make sure none of the hardware is failing or reporting an issue; every component passed.
This server VM previously ran on the same hardware without issues. The only difference is that it is now on a PVE hypervisor, as that is our standard practice; before, it ran on ESXi.

I tried uploading the logs; however, the files could not be processed even when they were under 3 MB. Please let me know which logs you need and I will be happy to pull them.
 
