Proxmox VMs freezing up with high IO delay at exactly the same time every day for exactly 5 mins

Adamg

Active Member
Feb 8, 2020
I am stumped here.

I have a VM running Windows Server 2022. The Proxmox server became 100% full about a month ago while doing a local backup. That caused Proxmox to become unresponsive and the VM got corrupted, so I had to delete the VM and all the local backups to free up space, and then restore the VM from a Proxmox Backup Server.

I mention that because it seems like this freezing with high IO delay started after that; the VM had been working for a year before then and this didn't happen.

Now, twice a day at the same times, 12:30pm and 6:30pm, I see a spike in IO delay that lasts for exactly 5 minutes. The VM is frozen during those 5 minutes, and then everything works fine again.

I suspected something was wrong with 2 of the SSDs, so I replaced 2 of the 5 1TB SSDs in the ZFS pool, but that didn't change anything.

I'm sure there must be some job running at 12:30pm and 6:30pm, but I can't find it. I see nothing in the Proxmox syslog, and I can't find any cron job or Windows Server 2022 scheduled task that runs at those times.

At this point, I don't even know what to try to narrow down the problem.
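
For reference, this is roughly how I have been hunting for a scheduled job on the Proxmox host; just a sketch, and the date in the journalctl line is an example to adjust:

# list root's crontab and the system-wide cron directories
crontab -l
ls /etc/cron.d /etc/cron.daily /etc/cron.hourly

# list all systemd timers that could fire at 12:30pm / 6:30pm
systemctl list-timers --all

# read the journal around one of the freeze windows
journalctl --since "2024-05-01 12:25" --until "2024-05-01 12:40"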
 
If it's not the Shadow Copy thing as listed above, make sure it's not a scheduled task in SQL Server. Some of the jobs you can automate there are quite CPU and disk intensive, especially if there's not enough RAM allocated.
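
If you want to confirm where the load is coming from while the freeze happens, you could also watch the pool from the Proxmox host; a rough sketch, assuming the pool is named rpool:

# per-device IO on the pool, refreshed every 5 seconds; run it over the 12:30pm window
zpool iostat -v rpool 5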
 
Awesome ideas, thanks _gabriel and PwrBank.

I checked the Windows Volume Shadow Copy schedule and the times were 7am and 12pm, so it isn't that, but good idea because I hadn't looked there before.

PwrBank, you were right. They don't have SQL Server installed, but there is a custom program on this Windows VM which uses a database. I don't know what the database type is, but I dug around in the program folder through all the confusing files and finally found a folder named "snapshots" with about 10 files in it, 2 per day, with date modified times of 12:30pm and 6:30pm.

It is definitely that software making a backup at those times that is causing the freezing and high IO delay.

Now, I need to get that company to stop that or fix their software or change the times!

Thanks so much for the ideas!!!
 
You can move the within-guest backup to a dedicated drive.
But your storage doesn't seem able to handle sustained writes.
Don't forget ZFS requires datacenter drives.
RAID10 / striped mirrors is the recommended layout for VM / DB workloads.
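
For example, a sketch of adding such a disk from the CLI, assuming VM ID 100 and an LVM-thin storage named local-lvm:

# attach a new 100 GiB virtual disk on the LVM-thin storage as scsi1
qm set 100 --scsi1 local-lvm:100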
 
Thanks _gabriel!

When you say "You can move the within-guest backup to a dedicated drive", are you saying add another SSD to the server and do a passthrough of that drive to the VM? Or am I misunderstanding?

Support for the program called me and actually suggested disabling the "Ballooning Device" for the RAM, because they said they have seen it cause problems with their software before. So I disabled it and rebooted the VM, but it actually made things worse: the server went offline at 6:30pm this time and came back online after a few minutes, and it had that same IO delay problem while the software was doing its backup.

I currently have 3 WD Blue SSDs and now 2 WD Red SSDs in a ZFS RAIDZ1 (the RAID5-like layout). Do you think swapping the 3 Blues for 3 Reds would solve this problem?

Also, I have the VM scsi hard drive set to:
Cache: Write back
Discard: ON
IO Thread: ON
SSD emulation: OFF
Async IO: default (io_uring)

Since I don't really understand those settings, I set them up following the Proxmox best practices guide for a Windows Server 2022 VM.

Do you think adjusting any of those settings could help?
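
For reference, this is how I can see and re-apply those settings from the host CLI; a sketch, where VM ID 100 and the volume name are placeholders for mine:

# show the current VM config, including the scsi0 disk line
qm config 100

# re-apply the disk with explicit options (any option not listed reverts to its default)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback,discard=on,iothread=1,aio=io_uring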
 
are you saying add another SSD to the server
Yes, but it can be an HDD, and not as passthrough: add it as a virtual disk on a regular LVM-thin datastore. This is a "within guest" backup.
disabling "balloning device" for the Ram
it's a problem if you define "Minimum memory" too low , as Windows can up to kill/crash apps because Ballooning service installed with Guest Agent inflates its RAM balloon when physical host reach 80% usage.
Always keep balloning to get "real" RAM guest requirement ( guest cache is excluded ).
Set "Minimum memory" same as "Memory" to avoid problem and do not overallocate Memory.
3 Blues for 3 Reds would solve this problem?
I don't think so. ZFS requires datacenter drives with PLP to sustain writes.
Your consumer drives will wear out quickly; be careful to monitor them.
As already said, RAID10/striped mirrors is the recommended configuration for performance.
There are many threads about it.
Cache: Write back
Set Cache to the default (None). Write back is double caching, which slows down writes once the cache is full.
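
A sketch of the "Minimum memory" advice above from the host CLI, assuming VM ID 100 and 8192 MiB of RAM:

# set minimum memory equal to memory so the balloon never deflates the guest
qm set 100 --memory 8192 --balloon 8192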
 
I don't think so. ZFS requires datacenter drives with PLP to sustain writes.
Your consumer drives will wear out quickly; be careful to monitor them.

I can attest to this as well; Samsung EVO drives were eaten up by ZFS on a TrueNAS boot drive in about a year.

@Adamg regardless of the file system, you will probably want to use enterprise grade drives since it hosts a DB. DBs do a LOT of small disk writes. A majority of the IO on our arrays is from small databases, especially with them doing snapshots and delta backups.
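
If you want to keep an eye on the wear from the host, something like this; attribute names vary by vendor, and /dev/sda is a placeholder:

# check SMART wear/usage indicators on each pool member
smartctl -a /dev/sda | grep -iE 'wear|percent|written'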
 
Ok, I set the Cache back to the default (None) and rebooted the VM.

Support for the DB program said they don't have this problem with other customers using virtual machines, even with slower drives, so something must be wrong. He suggested I do a disk read/write speed test with CrystalDiskMark.

So I downloaded CrystalDiskMark and ran all the disk speed tests. The read tests were fine, above normal even. BUT the write tests took a long time and actually caused the VM to freeze for several minutes, and eventually I stopped the write tests because everything was frozen.

I know I don't have the best grade of SSDs, but I wouldn't think a disk write speed test would make the VM freeze up. And it wasn't doing this for the last year that this VM has been in use; it has only started happening since I had to restore the entire VM from the Proxmox Backup Server.

The support guy suggested I update the Windows VirtIO drivers, which I can try. Any other suggestions? I'm thinking of adding a ZFS mirrored pool and moving the VM to it to see if it runs ok; then at least I would know if it is one or more disks causing the problem.
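
For reference, a host-side write test I could run to take Windows out of the picture entirely; a sketch, assuming fio is installed and /rpool/data is a pool path I can use as scratch space:

# sequential write test with an fsync at the end so the data really hits the disks
fio --name=seqwrite --filename=/rpool/data/fio.test --rw=write --bs=1M --size=8G --end_fsync=1
rm /rpool/data/fio.test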
 
BUT the write tests took a long time and actually caused the VM to freeze for several minutes, and eventually I stopped the write tests because everything was frozen.
What drives exactly? ZFS on QLC flash drives will slow down to speeds below old rotating HDDs, and people refuse to believe it until they experience it themselves, and even then it takes some convincing (which is no fun for either party). If this is the case, then search for QLC on this forum for examples and suggestions to buy (second-hand) enterprise SSDs.
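
A quick way to see exactly which models are in the pool, run on the Proxmox host:

# list block devices with their model names so you can look up the flash type
lsblk -o NAME,MODEL,SIZE,SERIAL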
 
What drives exactly? ZFS on QLC flash drives will slow down to speeds below old rotating HDDs, and people refuse to believe it until they experience it themselves, and even then it takes some convincing (which is no fun for either party). If this is the case, then search for QLC on this forum for examples and suggestions to buy (second-hand) enterprise SSDs.
The ZFS pool originally had 5 x WD Blue SSDs:

WDS100T3B0A-00AXR0

I replaced 2 of them with 2 x WD Red SSDs:

WDS100T1R0A

I've googled them and I don't see anything about them having QLC flash memory, but I can't tell for sure; it's too confusing.

I guess I could try putting in 2 x Kingston DC600M 2.5-inch enterprise SATA SSDs (SEDC600M/1920G), making a new ZFS mirrored pool, and migrating the VM to that pool to see if it runs properly.
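
Roughly what I have in mind, as a sketch; the disk IDs and the pool name tank2 are placeholders I'd adjust:

# create a mirrored pool from the two new drives, then register it with Proxmox
zpool create -o ashift=12 tank2 mirror /dev/disk/by-id/ata-KINGSTON_SEDC600M_AAAA /dev/disk/by-id/ata-KINGSTON_SEDC600M_BBBB
pvesm add zfspool tank2 --pool tank2 --content images,rootdir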
 
I bet they don't use ZFS.
ZFS isn't just a RAID layer like mdadm or hardware RAID.
You may be right. The reason I always set up Proxmox using ZFS is that I can then use the replication feature. The other methods didn't allow replication, or at least that was the case 5 years ago when I was first learning Proxmox.

The only other idea I've had is to make a hardware RAID and then, when I install Proxmox, make a ZFS pool on the single logical drive. I've never done that, but I'm assuming the replication feature would work. I don't know if I'm trading off any speed doing it that way. Also, I couldn't just plug the drives into another machine and recover the data using ZFS, because they would be tied to the hardware RAID.
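
For context, the replication I rely on is the storage replication feature, set up something like this; a sketch, where VM 100 and node pve2 are placeholders:

# replicate VM 100 to node pve2 every 15 minutes (needs ZFS on both nodes)
pvesr create-local-job 100-0 pve2 --schedule '*/15'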
 
The only other idea I've had is to make a hardware RAID and then, when I install Proxmox, make a ZFS pool on the single logical drive. I've never done that, but I'm assuming the replication feature would work. I don't know if I'm trading off any speed doing it that way. Also, I couldn't just plug the drives into another machine and recover the data using ZFS, because they would be tied to the hardware RAID.

You can, but I'm going to say you shouldn't. ZFS shouldn't be used with hardware RAID; it wants to control all the drives itself.

If I were you, I'd create another VM for testing and try the same CrystalDiskMark test again. Maybe play around with the CPU type on the test VM to see if that changes anything. If not, try a different emulated disk type.

Past that, you could try turning sync off on the ZFS pool temporarily and see if it improves at all (see the sketch at the end of this post).

While you don't have the best-case hardware, it still shouldn't be so bad that it's locking the entire OS up.
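
The sync test would look something like this; rpool/data is a guess at your dataset name:

# check the current setting, disable sync for the test, then put it back afterwards
zfs get sync rpool/data
zfs set sync=disabled rpool/data
zfs set sync=standard rpool/data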
 
Great ideas PwrBank! Thanks! I should be able to set up a Windows 11 VM fairly fast and see if it has the same problem; that will really tell me where to focus, VM or hardware.

Thanks again!
 
I moved the original Windows Server 2022 VM to another node on a completely different cluster, and it did not freeze when doing a CrystalDiskMark read/write test. So I am concluding that the Proxmox OS must be causing the problem in the original configuration.

I wish I could pinpoint what is wrong with Proxmox, but because this happened after Proxmox filled 100% of the disk space and stopped responding, I'm going to guess that something broke in the Proxmox OS when that happened, so a complete reinstall should fix it.
 
Thanks again for help @PwrBank and @_gabriel

For anyone that finds this thread: I completely reinstalled the Proxmox OS, migrated the Windows Server 2022 VM back to it, started running it again, and was still having the freezing issue. Not as bad as before, but it was still happening during a CrystalDiskMark read/write test, mostly during the write part.

I'm not convinced that this server hardware is working 100%, but all the HP BIOS diagnostics say the server is healthy (of course HP would say that about itself), so warranty support wouldn't conclude that the hardware is bad.

Anyways, I think I have it solved for the most part. I ended up simply setting the SCSI disk cache to "Write back (unsafe)", and that made the CrystalDiskMark tests run way faster, and the VM doesn't freeze.

I'm happy with this solution. My understanding is "unsafe" means potential data loss if the power is cut, because stuff might still be in RAM that hasn't been written to the disk yet. This isn't a machine where losing a few seconds of unsaved work would matter; the RAM is probably flushed to disk within a few seconds, so it wouldn't be losing the last 30 minutes of work or anything. Also, doing updates or changing important system files is probably risky in "Write back (unsafe)" mode. Oh well, I'll accept the risk for the fix it provided.
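
For anyone searching later, the CLI equivalent of that GUI setting looks like this; VM ID 100 and the volume name are placeholders for mine:

# "Write back (unsafe)" corresponds to cache=unsafe on the disk line
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=unsafe,discard=on,iothread=1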

Thanks!
 
I think I have it solved for the most part. I ended up simply setting the SCSI disk cache to "Write back (unsafe)"
As the label says, it's unsafe for the VM: not only on a power cut, but also on an OOM kill or a physical hard reset.
CrystalDiskMark by default benchmarks 1 GB of data, which stays in the cache, so the data never hits the disk; try with 32 GB and the problem should come back.

Server-grade setups disable the disks' write cache, because the disks are intended to be used with a hardware RAID controller whose own write-cache accelerator is protected by a BBU.
The disks' write cache can be re-enabled in the BIOS/UEFI.

Reminder: don't use consumer disks with ZFS in production, even more so the WD Blue SA510, which is DRAM-less and so cannot sustain writes, while ZFS amplifies written data.
 
As the label says, it's unsafe for the VM: not only on a power cut, but also on an OOM kill or a physical hard reset.
CrystalDiskMark by default benchmarks 1 GB of data, which stays in the cache, so the data never hits the disk; try with 32 GB and the problem should come back.

Server-grade setups disable the disks' write cache, because the disks are intended to be used with a hardware RAID controller whose own write-cache accelerator is protected by a BBU.
The disks' write cache can be re-enabled in the BIOS/UEFI.

Reminder: don't use consumer disks with ZFS in production, even more so the WD Blue SA510, which is DRAM-less and so cannot sustain writes, while ZFS amplifies written data.
You're right: when I set CrystalDiskMark to 32GB, it freezes the VM again.

Ok, so it sounds like you might be saying I could solve this by going into the BIOS and re-enabling the disk write cache?
 
I could solve this by going into the BIOS and re-enabling the disk write cache?
That will not solve the problem, which is your cheap disks; it will just delay it a bit more.
I doubt it would be enough to mitigate these bad disks for ZFS; perhaps it's more effective for the WD Red model, but as many posts have said:
ZFS requires enterprise flash disks + striped mirrors for the best write performance.