Proxmox VMs freezing up with high IO delay at exactly the same time every day for exactly 5 mins

Adamg

Feb 8, 2020
I am stumped here.

I have a VM running Windows Server 2022. The Proxmox server's storage became 100% full about a month ago during a local backup. That made Proxmox unresponsive and corrupted the VM, and I had to delete the VM and all the local backups to free up space, then restore the VM from a Proxmox Backup Server.

I mention that because this freezing with high IO delay seems to have started after that; the server ran for a year before then and this never happened.

Now, twice a day at the same times, I see a spike in IO delay that lasts exactly 5 minutes: at 12:30pm and again at 6:30pm. The VM is frozen during those 5 minutes, then everything works fine again.

I suspected something was wrong with 2 of the SSDs, so I replaced 2 of the 5 1TB SSDs in the ZFS pool, but that didn't change anything.

I'm sure there must be some job running at 12:30pm and 6:30pm, but I can't find it. I see nothing in the Proxmox syslog, and I can't find any cron job or Windows Server 2022 scheduled task that runs at those times.

At this point, I don't even know what to try to narrow down the problem.
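For anyone else hunting down a mystery twice-a-day job, this is a rough sketch of how to enumerate everything schedule-driven on the Proxmox host and watch IO in the act. These are generic commands against standard Debian locations, not output from this particular server:

```shell
# Places a timed job can hide on a stock Proxmox (Debian) host:
systemctl list-timers --all                       # systemd timers
crontab -l                                        # root's personal crontab
cat /etc/crontab                                  # system-wide crontab
ls /etc/cron.d /etc/cron.hourly /etc/cron.daily   # drop-in cron jobs

# At 12:30, watch accumulated IO per process to catch the culprit live
# (iotop may need "apt install iotop" first):
iotop -ao
```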
 
If it's not the Shadow Copy schedule mentioned above, make sure it's not a scheduled job in a SQL Server instance. Some of the jobs you can automate there are quite CPU- and disk-intensive if there isn't enough RAM allocated.
 
Awesome ideas, thanks _gabriel and PwrBank.

I checked the scheduled Volume Shadow Copy times in Windows and they were 7am and 12pm, so it isn't that, but good idea because I hadn't looked there before.

PwrBank, you were right. They don't have SQL Server installed, but there is a custom program on this Windows VM that uses a database. I don't know what the database type is, but I looked through all the confusing files in the program folder and finally found a folder named "snapshots" with about 10 files in it, 2 per day, with date modified times of 12:30pm and 6:30pm.
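As an aside, for anyone hunting for files touched at a suspicious clock time: GNU find can print each file's modification clock time regardless of date, which makes this kind of search quick. `/path/to/program` below is a placeholder for the application's install folder:

```shell
# List files whose last-modified time falls in the 12:3x or 18:3x
# clock window, on any date ("/path/to/program" is a placeholder):
find "${APPDIR:-/path/to/program}" -type f \
    -printf '%TH:%TM  %TY-%Tm-%Td  %p\n' \
  | grep -E '^(12:3|18:3)' | sort
```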

It is definitely that software making a backup at those times that is causing the freezing and the high IO delay.

Now, I need to get that company to stop that or fix their software or change the times!

Thanks so much for the ideas!!!
 
You can move the within-guest backup to a dedicated drive.
But your storage seems unable to handle sustained writes.
Don't forget that ZFS requires datacenter drives.
RAID10 / striped mirrors is the recommended layout for VMs / DBs.
 
Thanks _gabriel!

When you say "You can move the within-guest backup to a dedicated drive", are you saying add another SSD to the server and do a passthrough of that drive to the VM? Or am I misunderstanding?

Support for the program called me and actually suggested disabling the "Ballooning Device" for the RAM, because they said they have seen it cause problems with their software before. So I disabled it and rebooted the VM, but that actually made things worse: the server went offline at 6:30pm this time, came back after a few minutes, and had the same IO delay problem while the software ran its backup.

I currently have 3 WD Blue SSDs and now 2 WD Red SSDs in a ZFS RAIDZ1 (RAID5-style) pool. Do you think swapping the 3 Blues for 3 Reds would solve this problem?

Also, I have the VM's SCSI hard drive set to:
Cache: Write back
Discard: ON
IO Thread: ON
SSD emulation: OFF
Async IO: default (io_uring)

Since I don't really understand those settings, I set them following the Proxmox best practice guide for Windows Server 2022 VMs.

Do you think adjusting any of those settings could help?
 
are you saying add another SSD to the server
Yes, and it can be an HDD; not as passthrough, but as a virtual disk on a regular LVM-thin datastore. This is a "within-guest" backup.
disabling the "Ballooning Device" for the RAM
It's a problem if you set "Minimum memory" too low, because Windows can go as far as killing/crashing apps when the ballooning service (installed with the guest agent) inflates its RAM balloon once the physical host reaches 80% usage.
Always keep ballooning enabled so you can see the guest's "real" RAM requirement (guest cache is excluded).
Set "Minimum memory" the same as "Memory" to avoid problems, and do not overallocate memory.
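A minimal sketch of that advice from the host CLI, assuming a hypothetical VMID of 100 and 16 GiB of RAM (substitute your own values):

```shell
# Pin minimum memory ("balloon") to the full allocation so the balloon
# never deflates the guest below what it was given (example values):
qm set 100 --memory 16384 --balloon 16384

# Verify:
qm config 100 | grep -E '^(memory|balloon):'
```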
3 Blues for 3 Reds would solve this problem?
I don't think so. ZFS requires datacenter drives with PLP to sustain writes.
Your consumer drives will wear out quickly; be careful to monitor them.
As already said, RAID10 / striped mirrors is the recommended configuration for performance.
There are many threads about it.
Cache: Write back
Set Cache to the default (None). Write back is double caching, which slows writes down once the cache is full.
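For reference, the cache mode can be changed from the host CLI as well as the GUI. This is a sketch with a hypothetical VMID and volume name, keeping the other disk options from earlier in the thread:

```shell
# Example only: VMID 100, disk scsi0 on storage "local-zfs".
# Keep the existing options and change cache= from writeback to none:
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none,discard=on,iothread=1
```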
 
I don't think so. ZFS requires datacenter drives with PLP to sustain writes.
Your consumer drives will wear out quickly; be careful to monitor them.

I can attest to this as well: Samsung EVO drives were eaten up by ZFS on a TrueNAS boot drive in about a year.

@Adamg regardless of the file system, you will probably want enterprise-grade drives in this box because it hosts a DB. Databases do a LOT of small disk writes; a majority of the IO on our arrays comes from small databases, especially with them doing snapshots and delta backups.
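One way to keep an eye on that wear, sketched with example device names (SMART attribute names vary by SSD vendor, so the grep pattern is a guess to adapt):

```shell
# Check wear level and total bytes written on each pool member.
# /dev/sd[a-e] are example device names; adjust for your system.
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    echo "== $d"
    smartctl -A "$d" | grep -Ei 'wear|percent|total.*written|lbas_written'
done
```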
 
Ok, I set the Cache to Default None and rebooted the VM.

Support for the DB program said they don't see this problem with other customers running virtual machines on even slower drives, so something must be wrong; he suggested I run a disk read/write speed test with CrystalDiskMark.

So I downloaded CrystalDiskMark and ran all the disk speed tests. The read tests were fine, above normal even. BUT the write tests took a long time and actually caused the VM to freeze for several minutes, and eventually I stopped the write tests because everything was frozen.

I know I don't have the best grade of SSDs, but I wouldn't think a disk write speed test would make the VM freeze. And it wasn't doing this for the year this VM has been in use; it only started after I had to restore the entire VM from the Proxmox Backup Server.

The support guy suggested I update the Windows VirtIO drivers, which I can try. Any other suggestions? I'm thinking of adding a ZFS mirrored pool and moving the VM to it to see if it runs OK; then at least I would know whether one or more disks are causing the problem.
 
BUT the write tests took a long time and actually caused the VM to freeze for several minutes, and eventually I stopped the write tests because everything was frozen.
What drives exactly? ZFS on QLC flash drives will slow down to speeds below old rotating HDDs, and people refuse to believe it until they experience it themselves, and even then it takes some convincing (which is no fun for either party). If this is the case, then search this forum for QLC for examples and for suggestions on buying (second-hand) enterprise SSDs.
 
What drives exactly? ZFS on QLC flash drives will slow down to speeds below old rotating HDDs, and people refuse to believe it until they experience it themselves, and even then it takes some convincing (which is no fun for either party). If this is the case, then search this forum for QLC for examples and for suggestions on buying (second-hand) enterprise SSDs.
The ZFS pool originally had 5 x WD Blue SSDs:

WDS100T3B0A-00AXR0

I replaced 2 of them with 2 x WD Red SSDs:

WDS100T1R0A

I've googled them and I don't see anything about them having QLC flash memory, but I can't tell for sure; it's too confusing.

I guess I could try putting in 2 x Kingston DC600M 2.5-inch enterprise SATA SSDs (SEDC600M/1920G), making a new ZFS mirrored pool, migrating the VM to that pool, and seeing if it runs properly.
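A rough sketch of that test from the host, with hypothetical device paths, pool/storage names, and VMID; double-check the by-id paths before running zpool create, since it wipes the drives:

```shell
# Build a two-way mirror from the new drives (example by-id paths):
zpool create tank2 mirror \
    /dev/disk/by-id/ata-KINGSTON_SEDC600M_SERIAL_A \
    /dev/disk/by-id/ata-KINGSTON_SEDC600M_SERIAL_B

# Register it as a Proxmox storage and move the VM disk onto it:
pvesm add zfspool tank2-vm --pool tank2
qm disk move 100 scsi0 tank2-vm --delete 1
```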
 
I bet they don't use ZFS.
ZFS isn't just a RAID layer like mdadm or hardware RAID.
You may be right. The reason I always set up Proxmox using ZFS is that I can then use the replication feature. The other methods didn't allow replication, or at least that was the case 5 years ago when I was first learning Proxmox.

The only other idea I've had is to make a hardware RAID, and then when I install Proxmox, make a ZFS pool on that single (logical) drive. I've never done that, but I'm assuming the replication feature would work. I don't know if I'm trading off any speed doing it that way? Also, I couldn't just plug the drives into another machine and recover the data using ZFS, because they would be tied to the hardware RAID.
 
The only other idea I've had is to make a hardware RAID, and then when I install Proxmox, make a ZFS pool on that single (logical) drive. I've never done that, but I'm assuming the replication feature would work. I don't know if I'm trading off any speed doing it that way? Also, I couldn't just plug the drives into another machine and recover the data using ZFS, because they would be tied to the hardware RAID.

You can, but I'm going to say you shouldn't. ZFS shouldn't be used on top of hardware RAID; it wants to control all the drives itself.

If I were you, I'd create another VM for testing and try the same DiskMark test again. Maybe play around with the CPU type on the test VM to see if that changes anything. If not, try a different emulated disk type.

Past that, you could try turning sync off on the ZFS pool temporarily and see if it improves at all.
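That sync experiment might look like this, assuming a hypothetical pool name of "rpool". Note that sync=disabled risks losing the last few seconds of writes on a power cut, so it should only be a temporary test:

```shell
zfs get sync rpool          # note the current setting (default: standard)
zfs set sync=disabled rpool # run the write benchmark again...
zfs set sync=standard rpool # ...then restore the default
```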

While you don't have the best-case hardware, it still shouldn't be so bad that it locks the entire OS up.
 
Great ideas, PwrBank! Thanks! I should be able to set up a Windows 11 VM fairly fast and see if it has the same problem; that will really tell me where to focus, VM or hardware.

Thanks again!