Hi,
Issue Description:
For several months now I have had a problem with a VM in Proxmox: a Windows Server 2019 machine running SQL Server. Initially, I found that when pinging the VM it would randomly miss a ping, or the reply would be delayed by several seconds (10-20s, sometimes longer), depending on the workload (~150-250 users).
I replaced the server's NIC and doubled its RAM, but the issue persisted. I also had a Linux bond with LACP in place for redundancy, which I removed to simplify troubleshooting. When none of this resolved the issue, I found that swap usage was at 100% (7.9/8GB used). I fixed that by lowering swappiness and giving some of the VM's RAM back to the host, but I was still seeing the same delays.
I have now been monitoring the syslog and the performance of the machine to see which processes may be causing the random stalls. I found that the ps command is using 100% system and CPU at the moment the system completely freezes (every process stops and disk reads/writes drop to 0). From what I have monitored, this does not seem to correlate with the VM's activity, but it could be related.
Troubleshooting steps taken:
I used "pidstat -u 1 200" and found that the "ps" command was related to the freeze (please see the attached image for the results of this command). The process starts every 1-2 minutes and then completely dies.
Normally I would not be so concerned about a single missed ping. However, everyone who uses the application that this SQL server backs is reporting that their sessions hang or freeze. I have witnessed this myself and have correlated it with the random freeze.
I have also:
- Set up auditd to track invocations of ps.
- Searched for scripts and cron jobs that might be invoking these commands (none found).
- Ensured no systemd timers are running these commands excessively (none found).
- Used tools like iotop and atop to monitor disk I/O, as high I/O from the SQL server VM could be related.
Hardware:
- Server chassis: Dell R730XD
- Storage controller: PERC H730 Mini
- 6 Samsung SSDs (870 EVO) in RAID 10 (on the H730 Mini) are being used for DATA (all new drives)
- 1 Samsung SSD (870 EVO) passed through as the Proxmox OS drive (new drive as well)
We previously ran this VM on the same hardware without issues. The only difference is that it now runs on a PVE hypervisor, as that is our standard practice; before, it ran on ESXi.
I tried uploading the logs, but the files could not be processed even when they were under 3MB. Please let me know which logs you need and I will be happy to pull them.